Unicode Playlists for Sonos from Python

I'm working to export a small subset of music from my iTunes library to an external drive, for use with a Sonos speaker (via Media Library on Sonos). All was going fine until I came across some unicode text in track, album and artist names.
I'm moving from iTunes on Mac to a folder structure on Linux (Ubuntu), and the file paths all contain the original Unicode names and these are displayed and play fine from Sonos in the Artist / Album view. The only problem is playlists, which I'm generating via a bit of Python3 code.
Sonos does not appear to support UTF-8 encoding in .m3u / .m3u8 playlists. The character ÷ was interpreted by Sonos as Ã·, which after a bit of Googling I found is the classic sign of UTF-8 bytes being read as individual one-byte / UTF-16 code points (÷ is 0xC3 0xB7 in UTF-8, whilst Ã is U+00C3 and · is U+00B7). I've tried many different ways of encoding it, and just can't get it to recognise tracks with non-standard (non-ASCII?) characters in their names.
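That mangling is easy to reproduce in Python, which is a quick way to confirm which encodings are being mixed up. A minimal sketch, separate from the playlist code itself:
# '÷' is U+00F7; its UTF-8 encoding is the two bytes 0xC3 0xB7
utf8_bytes = '÷'.encode('utf-8')
print(utf8_bytes)                    # b'\xc3\xb7'
# Reading those two bytes as individual code points (Latin-1, or single
# UTF-16 code units) produces 'Ã·' - exactly the garbling Sonos shows
print(utf8_bytes.decode('latin-1'))  # Ã·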
I then tried .wpl playlists, and thought I'd solved it. Tracks with characters such as ÷ and • in their path now work perfectly, just using those characters in their unicode / UTF-8 form in the playlist file itself.
However, just as I was starting to tidy up and finish off the code, I found some other characters that weren't being handled correctly: ö, å, á and a couple of others. I've tried using these both as their original unicode characters and as XML numeric character references, e.g. &#769; for the combining acute accent. Using this format doesn't make a difference to what works or does not work - ÷ (&#247;) and • (&#8226;) are fine, whilst ö (&#246;), å (&#229;) and á (&#225;) are not.
I've never really worked with unicode / UTF-8 before, but having read various guides and how-tos I feel like I'm getting close but am probably just missing something simple. The fact that some unicode characters work and others don't makes me think it's got to be something obvious! I'm guessing the difference is that the accents modify the previous character, rather than being characters in themselves, but I tried removing the previous letter and that didn't work!
Within Python itself I'm not doing anything particularly clever. I read in the data from iTunes' XML file using:
with open(settings['itunes_path'], 'rb') as itunes_handle:
    itunes_library = plistlib.load(itunes_handle)
For export I've tried dozens of different options, but generally something like the below (sometimes with encoding='utf-8' and various other options):
with open(dest_path, 'w') as playlist_file:
    playlist_file.write(generated_playlist)
Where generated_playlist is the result of extracting and filtering data from itunes_library, having run urllib.parse.unquote() on any iTunes XML data.
Any thoughts or tips on where to look would be very much appreciated! I'm hoping that to someone who understands Unicode better the answer will be really really obvious! Thanks!
Current version of the code available here: https://github.com/dwalker-uk/iTunesToSonos

With thanks to @lenz for the suggestions above, I do now have unicode playlists fully working with Sonos.
A couple of critical points that should save someone else a lot of time:
Only .wpl playlists seem to work. Unicode will not work with .m3u or .m3u8 playlists on Sonos.
Sonos needs any unicode text to be normalised into NFC form - I'd never heard of this before, but it essentially means that any accented character has to be represented by a single code point, not as a base character followed by a separate combining accent (see the short sketch after this list).
The .wpl playlist, which is an XML format, needs to have unicode characters encoded as XML numeric character references, i.e. é is represented in the .wpl file as &#233;.
The .wpl file also needs the XML reserved characters (& < > ' ") in their escaped form, i.e. & is &amp;.
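As a quick illustration of the NFC point above, here is a minimal sketch using a made-up track name (macOS/HFS+ file paths typically arrive in the decomposed NFD form):
import unicodedata

decomposed = 'a\u0301lbum'             # 'a' + combining acute accent (NFD), as iTunes/macOS paths often are
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 6 5
print(composed)                        # álbum, with 'á' as the single code point U+00E1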
In Python 3, converting a path from iTunes XML format into something suitable for a .wpl playlist on Sonos needs the following key steps:
import urllib.parse
import unicodedata

# Strip the iTunes library prefix so the path is relative to the media share
left = len(itunes_library['Music Folder'])
path_relative = 'Media/' + itunes_library['Tracks'][track_id]['Location'][left:]
# Undo the URL-encoding used in the iTunes XML
path_unquoted = urllib.parse.unquote(path_relative)
# Normalise to NFC so accented characters are single code points
path_norm = unicodedata.normalize('NFC', path_unquoted)
# Escape the XML reserved characters (& must be done first)
path = path_norm.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;')
playlist_wpl += '<media src="%s"/>\n' % path
with open(pl_path, 'wb') as pl_file:
    pl_file.write(playlist_wpl.encode('ascii', 'xmlcharrefreplace'))
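For context, those <media> lines sit inside a standard .wpl (Windows Media Player playlist) wrapper. The header below is an assumption based on the common WPL layout rather than anything Sonos documents, and playlist_name is a made-up variable, so treat it as a sketch:
# playlist_name is a made-up variable holding the playlist's title
playlist_wpl = ('<?wpl version="1.0"?>\n'
                '<smil>\n'
                '  <head>\n'
                '    <title>%s</title>\n'
                '  </head>\n'
                '  <body>\n'
                '    <seq>\n' % playlist_name)
# ... one playlist_wpl += '<media src="%s"/>\n' % path line per track, built as above ...
playlist_wpl += ('    </seq>\n'
                 '  </body>\n'
                 '</smil>\n')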
A full working demo for exporting from iTunes for use with Sonos (or anything else) as .wpl is available here: https://github.com/dwalker-uk/iTunesToSonos
Hope that helps someone!

Related

Encode a raw string so it can be decoded as json

I am throwing in the towel here. I'm trying to convert a string scraped from the source code of a website with scrapy (injected javascript) to json so I can easily access the data. The problem comes down to a decode error. I've tried all kinds of encoding, decoding, escaping, codecs, regular expressions and string manipulations, and nothing works. Oh, and I'm using Python 3.
I narrowed the culprit down to this string (or at least part of it):
scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
scraped_raw = r'{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
data = json.loads(scraped_raw) #<= works
print(data["propertyNotes"])
failed = json.loads(scraped) #no work
print(failed["propertyNotes"])
Unfortunately, I cannot find a way for scrapy/splash to return the string as raw. So, somehow, I need to have Python interpret the string as raw while loading the json. Please help.
Update:
What worked for that string was json.loads(str(data.encode('unicode_escape'), 'utf-8')). However, it didn't work with the larger string; doing the same thing there gives JSONDecodeError: Invalid \escape.
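That Invalid \escape error is what you would expect if the larger string contains any non-ASCII characters, because unicode_escape turns them into \xNN sequences that JSON does not recognise. A minimal reproduction, assuming the larger string contains something like an accented character:
import json

snippet = '{"title": "Café notes"}'          # contains one non-ASCII character
escaped = snippet.encode('unicode_escape')   # b'{"title": "Caf\\xe9 notes"}'
json.loads(escaped)                          # raises JSONDecodeError: Invalid \escape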
The problem exists because the string you're getting contains real control characters: the \n escapes in the page source were interpreted by Python and became literal newlines, and JSON does not allow unescaped control characters inside a string. Similar to Turn's answer, you need to re-escape those characters before parsing, which is done using
json.loads(scraped.encode('unicode_escape'))
This works by encoding the contents as Latin-1 while turning control characters (and anything non-ASCII) back into backslash escapes; the \u003C-style sequences were already interpreted as < and > when the literal was created, so they pass through unchanged.
If my understanding is correct however, you may not want this because you then lose the escaped control characters so the data might not be the same as the original.
You can see this in action by noticing that the control chars disappear after converting the encoded string back to a normal python string:
scraped.encode('unicode_escape').decode('utf-8')
If you want to keep the control characters you're going to have to attempt to escape the strings before loading them.
If you are using Python 3.6 or later I think you can get this to work with
json.loads(scraped.encode('unicode_escape'))
As per the docs, this will give you an "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default."
Which seems like exactly what you need.
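A small self-contained example of why this works - a sketch with a made-up string rather than the actual scraped data. The raw newlines that break json.loads get re-escaped, while the \u003C sequences have already become plain < and > and pass straight through:
import json

# As in the question, \n and \u003C were interpreted when the literal was created,
# so broken contains a real newline and real angle brackets
broken = '{"text": "line one\nline two \u003Cp\u003E"}'

# json.loads(broken) raises "Invalid control character" because of the raw newline;
# re-escaping first makes it valid JSON again
fixed = json.loads(broken.encode('unicode_escape'))
print(fixed['text'])   # line one
                       # line two <p>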
OK, so since I am on Windows, I had to set the console to handle special characters. I did this by typing chcp 65001 into the terminal. I also used a regular expression and chained the string manipulation functions, which is the Python way anyway.
usable_json = json.loads(re.search('start_sub_string(.*)end_sub_string', hxs.xpath("//script[contains(., 'some_string')]//text()").extract_first()).group(1))
Then everything went smoothly. I'll sort out the encoding and escaping when writing to the database down the line.

Processing delimiters with python

I'm currently trying to parse an Apache log in a format I can't handle normally (I tried using goaccess).
In Sublime the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and a superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
On sublime it shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
with the ENQs as '|' and the SOHs as ' ' when I open the file in a plain text editor (like Notepad).
I just need to parse out the IP addresses so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split("\s|\\|")
But I don't know what to do for the L.
Those 3-letter codes are ASCII control codes - these are ASCII characters which occur prior to 32 (space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they map to "|", space and a superscript L. You can refer to them as literals in several languages using \xNN notation - for example, control code ETX corresponds to \x03, SOH to \x01 and ENQ to \x05 (see the list mentioned above). You can use these to split strings or anything else.
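For example, splitting on those control codes directly - a sketch using the sample line from the question, with the control characters written as \xNN escapes; the field layout is a guess, so the IP extraction is done with a separate pattern:
import re

line = ('3286d68255beaf010000543a000012f1/Madonna_Home_1.jpg\x05x628a135b'
        '\x05Z1e5\x05AB50632\x01A50.134.214.130\x01C98.138.19.91\x01D42857'
        '\x05wwww.newprophecy.net')

# Split on ENQ (\x05), SOH (\x01) or ETX (\x03)
fields = re.split(r'[\x05\x01\x03]', line)

# Or, since only the IP addresses are needed, pull them out directly
ips = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', line)
print(ips)   # ['50.134.214.130', '98.138.19.91']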
This is the literal answer to your question, but all this aside I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters beyond the 255 limit of a single byte by encoding extended characters with multiple bytes.
There are several types of encoding, but UTF-8 is one of the most popular. If you use UTF-8 it has the property that standard ASCII characters will appear as normal (so you might never even realise that UTF-8 was being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where really the code and the character(s) before or after it should be interpreted together as a single unit.
I'm not sure that this is the reason, it's just an educated guess, but if you haven't already considered it then it's important to figure out the encoding of your file since it'll affect how you interpret the entire content of it. I suggest loading the file into an editor that understands encodings (I'm sure something as popular as Sublime does with proper configuration) and force the encoding to UTF-8 and see if that makes the content seem more sensible.

How do I better handle encoding and decoding involving unicode characters and going back and forth from ascii

I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that files contain symbols/characters that are not ascii. My initial solution was to read the files in using unicode
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, and the first step of my processing requires that I find the right table. I can find the table based on words that are in the first cell of the first row. This is where it gets tricky. I had hoped to use a regular expression to test the text_content() of the cell, but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept that defines just one of the tables was expressed). Therefore I decided to dump all of the text_contents of the particular cell out and use some algorithms in Excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_' + str(column + 1)] = string.encode('Latin-1', 'replace')
I did this based on previous answers to questions similar to mine here, where it seems the consensus was to read in the file using unicode and then encode it just before the file is written out.
So I processed the labels/words in excel - converted them all to lower case and got rid of the spaces and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled
I then am reading in the file - and the first time I did I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur; investigating, I discovered that unicode replaced certain characters with \ufffd, like in the example below
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turns up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation but that was my interpretation)
So then I tried (after thinking what do I have to lose) reading in my list of labels without using unicode. So I read it in using this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
now looking at the same label in the interpreter I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match and I get this error
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value for tableHeader is NOT in the test set. When I asked for the value of tableHeader after it broke, I received this
'fairvaluemeasurements:'
And to add insult to injury when I type the test into Idle
tableHeader in testSet
it correctly returns false
I understand that the code '\xa0' is code for a non-breaking space, and so does Python when I read it in without using unicode. I thought I had gotten rid of all the spaces in Excel, but to handle these I split them and then joined them:
labels = [''.join(label.split()) for label in labels]
I still have not gotten to a question yet. Sorry, I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behaviour here. When I read the string in originally and used unicode and UTF-8, all the characters were preserved/transportable, if you will. I encoded them to write them out and they displayed fine in Excel; I then saved them as a txt file and they looked okay. But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels I have a feeling my problem would go away but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly but some of it just requires inspection.
As an aside I will note that the source files all specify the charset='UTF-8'
Recap: when I read the source document and the list of labels in using unicode, I fail to make some matches because the labels have some characters replaced by \ufffd; and when I read the source document in using unicode and the list of labels in without any special handling, I get the warning.
I would like to understand what is going on so I can fix it but I have exhausted all the places I can think to look
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
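A short sketch of the difference (Python 2, as in the question; the label text is the one shown above):
raw = 'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'

# Decoding as cp1252 maps \xa0 to U+00A0 NO-BREAK SPACE, so nothing is lost
print repr(raw.decode('cp1252'))             # u'...(usd\xa0$)inthousands'

# Decoding as UTF-8 with errors='replace' is what produced the \ufffd earlier,
# because a lone \xa0 byte is not valid UTF-8
print repr(raw.decode('utf-8', 'replace'))   # u'...(usd\ufffd$)inthousands'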
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.

converting UTF-16 special characters to UTF-8

I'm working in django and Python and I'm having issues with saving utf-16 characters in PostgreSQL. Is there any method to convert utf-16 to utf-8 before saving?
I'm using Python 2.6; here are my code snippets:
sample_data="This is the time of year when Travel & Leisure, TripAdvisor and other travel media trot out their “Best†lists, so I thought I might share my own list of outstanding hotels I’ve had the good fortune to visit over the years."
The above data contains some Latin special characters but they are not showing correctly; I just want to show those characters in the appropriate format.
There are no such things as "utf-16 characters". You should show your data by using print repr(data), and tell us which pieces of your data you are having trouble with. Show us the essence of your data e.g. the repr() of "Leisure “Best†lists I’ve had"
What you actually have is a string of bytes containing text encoded in UTF-8. Here is its repr():
'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'
You'll notice 3 clumps of guff in what you showed. These correspond to the 3 clumps of \xhh in the repr.
Clump 1 (\xe2\x80\x9c) decodes to U+201C LEFT DOUBLE QUOTATION MARK.
Clump 2 is \xe2\x80\x9d. Note that only the first 2 "latin special characters" aka "guff" showed up in your display. That is because your terminal's encoding is cp1252 which doesn't map \x9d; it just ignored it. Unicode is U+201D RIGHT DOUBLE QUOTATION MARK.
Clump 3: becomes U+2019 RIGHT SINGLE QUOTATION MARK (being used as an apostrophe).
As you have UTF-8-encoded bytes, you should be having no trouble with PostgreSQL. If you are getting errors, show your code, the full error message and the full traceback.
If you really need to display the guff to your Windows terminal, print guff.decode('utf8').encode('cp1252') ... just be prepared for unicode characters that are not supported by cp1252.
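To make that concrete (Python 2, as in the question; the byte string is the repr shown above):
guff = 'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'

# Decode the UTF-8 bytes to unicode - this is the form to store and compare
text = guff.decode('utf8')
print repr(text)   # u'Leisure \u201cBest\u201d lists I\u2019ve had'

# Re-encode for a cp1252 Windows console; use 'replace' in case some
# characters have no cp1252 mapping
print text.encode('cp1252', 'replace')   # Leisure "Best" lists I've had (with curly quotes)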
Update in response to comment: "I dont have any issue with saving data,problem is while displaying it is showing weired characters,so what iam thinking is convert those data before saving am i right?"
Make up your mind. (1) In your question you say "I'm having issues with saving utf-16 characters in PostgreSQL". (2) Now you say "I dont have any issue with saving data,problem is while displaying it is showing weired characters"
Summary: Your sample data is encoded in UTF-8. If UTF-8 is not acceptable to PostgreSQL, decode it to Unicode. If you are having display problems, first try displaying the corresponding Unicode; if that doesn't work, try an encoding that your terminal will support (presumably one of the cp125X family).
This works for me to convert strings: sample_data.decode('mbcs').encode('utf-8')

Reading "raw" Unicode-strings in Python

I am quite new to Python so my question might be silly, but even though reading through a lot of threads I didn't find an answer to my question.
I have a mixed source document which contains html, xml, latex and other text formats, and which I am trying to get into a latex-only format.
Therefore, I have used Python to recognise the different commands with regular expressions and replace them with the appropriate latex command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode signs, such as the Greek letters. Unfortunately it is just about too much to do by hand. Therefore, I am looking for a way to do this the smart way too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
fh = open('SOURCE_DOCUMENT','r')
stuff = fh.read()
fh.close()
new_stuff = re.sub('READ','REPLACE',stuff)
fh = open('LATEX_DOCUMENT','w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is an important information or not, but I am using Python 2.6 running on windows.
I would be really glad if someone could give me a hint, at least on where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...
Many thanks in advance.
Cheers,
Britta
You talk of ``raw'' Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
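A minimal sketch of that approach for Python 2.x (the file names mirror the question, and the Greek-to-LaTeX mapping is made up for illustration):
import codecs

# Read the mixed-format source as unicode, assuming the file is UTF-8 encoded
fh = codecs.open('SOURCE_DOCUMENT', 'r', 'utf-8')
stuff = fh.read()
fh.close()

# Greek letters are now ordinary unicode characters; map them to LaTeX commands
greek_to_latex = {
    u'\u03c0': u'$\\pi$',      # GREEK SMALL LETTER PI
    u'\u03b1': u'$\\alpha$',   # GREEK SMALL LETTER ALPHA
}
for char, latex in greek_to_latex.items():
    stuff = stuff.replace(char, latex)

fh = codecs.open('LATEX_DOCUMENT', 'w', 'utf-8')
fh.write(stuff)
fh.close()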
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the "encoding" of the input document. Unicode can encode millions of characters, but files can only store 8-bit values (0-255). So the Unicode text must be encoded in some way.
If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
