I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I didn't find an answer to my question.
I have a mixed source document which contains html, xml, latex and other text formats, and which I am trying to convert into a latex-only format.
To do so, I have used Python to recognise the different commands as regular expressions and replace them with the appropriate latex command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode signs, such as the Greek letters. Unfortunately, it is just about too much to do by hand. Therefore, I am looking for a way to do this the smart way too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
import re

fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()
new_stuff = re.sub('READ', 'REPLACE', stuff)
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is important information or not, but I am using Python 2.6 on Windows.
I would be really glad if someone could give me a hint, at least where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...
Many thanks in advance.
Cheers,
Britta
You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
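For instance, here is a minimal Python 2.x sketch using codecs.open, assuming the source document is UTF-8; the two-entry Greek mapping is only an illustration and would need to be extended:
import codecs

# hypothetical mapping from Greek characters to LaTeX commands
greek_to_latex = {u'\u03c0': r'$\pi$', u'\u03b1': r'$\alpha$'}

fh = codecs.open('SOURCE_DOCUMENT', 'r', encoding='utf-8')
stuff = fh.read()  # a unicode string now, not raw bytes
fh.close()

for char, latex in greek_to_latex.items():
    stuff = stuff.replace(char, latex)

fh = codecs.open('LATEX_DOCUMENT', 'w', encoding='utf-8')
fh.write(stuff)
fh.close()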
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the "encoding" of the input document. Unicode defines over a million code points, but files can only store 8-bit values (0-255), so Unicode text must be encoded in some way.
If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no encoding attribute). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
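If you want to automate that guessing a little, here is a rough Python 2.x sketch (the filename is a placeholder, and note that latin-1 accepts any byte sequence, so a clean decode only proves the bytes are valid in that encoding, not that the text is right):
import codecs

for enc in ('utf-8', 'cp1252', 'latin-1'):
    try:
        text = codecs.open('SOURCE_DOCUMENT', 'r', encoding=enc).read()
        print 'decoded without errors as', enc
        break
    except UnicodeDecodeError:
        continue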
I am trying to troubleshoot a problem regarding foreign characters (any and all alphabets). My script (Python 2.7) receives the characters (a mix of the English alphabet and other foreign characters) as unicode JSON, which is sent to a database insert function for insertion into some tables using psycopg2. This works perfectly as a script, but not so much as a service (foreign chars get inserted as nonsense). This unicoding/encoding/decoding stuff is so confusing! I am trying to follow along with this ( https://www.pythoncentral.io/python-unicode-encode-decode-strings-python-2x/ ) in the hopes of understanding what exactly I am receiving and then sending to the database, but it seems to me that I need to know what the encoding is at various stages. How do you GET what the encoding type is? Sorry, this must be so simple, but I am not finding how to get that info, and others' questions on this matter haven't exactly been answered, in my opinion. This can't be that elusive. Please help.
Add-on info as requested...
-Yes, would love to move to 3.x, but cannot for now.
-Currently it is mostly me testing, it's not live for users yet. I am testing and developing from a Windows 2012 Server AWS machine, and the service is hosted on a similar machine.
So yes - how do you find the locale info?
Have done some testing with the frontend dev (js), and he states the JSON input is coming to me URL-encoded... when I check its type, it just says unicode. Thoughts??
Don't rely on what the default system encoding is. Instead, always set it yourself:
import sys

# read in raw bytes (whose encoding you should know)
raw = sys.stdin.read()
# decode the bytes into a unicode string
u = raw.decode('ISO-8859-1', 'replace')
# do stuff with the string
# ...
# always operate on unicode inside your program:
# make a 'unicode sandwich'.
# ...
# encode back to bytes in preparation for writing them out
out = u.encode('UTF-8')
# great, now you have bytes you can just write out
with open('myfile.txt', 'wb') as f:
    f.write(out)
Notice, I hard-coded the encoding throughout.
But what if you don't know the encoding of the input? Well, that's a problem. You need to know that. But I also understand unicode can be painful and there's this guy from the python community who tells you how to stop the pain (video).
Note, one of the BIG changes in Python 3 is better unicode support. Instead of a separate unicode type alongside the confusing py2 str type, in Python 3 the str type is just what Python 2's unicode type was, and you can specify encodings in more convenient places:
with open('myfile.txt', 'w', encoding='utf-8', errors='ignore') as f:
    # ...
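As for actually finding out what encoding Python assumes in a given context (the "locale info" asked about above), here is a small sketch of what you can inspect on Python 2.7; the values often differ between an interactive console and a service, which is a common source of exactly this kind of bug:
import sys, locale

print sys.getdefaultencoding()       # almost always 'ascii' on Python 2
print sys.stdin.encoding             # None when stdin is a pipe or a service
print locale.getpreferredencoding()  # the encoding suggested by the OS locale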
I'm working to export a small subset of music from my iTunes library to an external drive, for use with a Sonos speaker (via Media Library on Sonos). All was going fine until I came across some unicode text in track, album and artist names.
I'm moving from iTunes on Mac to a folder structure on Linux (Ubuntu), and the file paths all contain the original Unicode names and these are displayed and play fine from Sonos in the Artist / Album view. The only problem is playlists, which I'm generating via a bit of Python3 code.
Sonos does not appear to support UTF-8 encoding in .m3u / .m3u8 playlists. The character ÷ was interpreted by Sonos as Ã·, which after a bit of Googling I found was clearly a case of UTF-8 bytes being read as single-byte characters (÷ is 0xC3 0xB7 in UTF-8, whilst Ã is U+00C3 and · is U+00B7). I've tried many different ways of encoding it, and just can't get it to recognise tracks with non-standard (non-ASCII?) characters in their names.
I then tried .wpl playlists, and thought I'd solved it. Tracks with characters such as ÷ and • in their path now work perfectly, just using those characters in their unicode / UTF-8 form in the playlist file itself.
However, just as I was starting to tidy up and finish off the code, I found some other characters that weren't being handled correctly: ö, å, á and a couple of others. I've tried using these both as the original unicode characters and as their encoded XML numeric references (e.g. &#225;). Using this format doesn't make a difference to what works or doesn't: ÷ (&#247;) and • (&#8226;) are fine, whilst ö (&#246;), å (&#229;) and á (&#225;) are not.
I've never really worked with unicode / UTF-8 before, but having read various guides and how-tos I feel like I'm getting close but am probably just missing something simple. The fact that some unicode characters work and others don't makes me think it's got to be something obvious! I'm guessing the difference is that the accents modify the preceding character rather than being characters in themselves, but I tried removing the preceding letter and that didn't work!
Within Python itself I'm not doing anything particularly clever. I read in the data from iTunes' XML file using:
import plistlib

with open(settings['itunes_path'], 'rb') as itunes_handle:
    itunes_library = plistlib.load(itunes_handle)
For export I've tried dozens of different options, but generally something like the below (sometimes with encoding='utf-8' and various other options):
with open(dest_path, 'w') as playlist_file:
    playlist_file.write(generated_playlist)
Where generated_playlist is the result of extracting and filtering data from itunes_library, having run urllib.parse.unquote() on any iTunes XML data.
Any thoughts or tips on where to look would be very much appreciated! I'm hoping that to someone who understands Unicode better the answer will be really really obvious! Thanks!
Current version of the code available here: https://github.com/dwalker-uk/iTunesToSonos
With thanks to @lenz for the suggestions above, I do now have unicode playlists fully working with Sonos.
A couple of critical points that should save someone else a lot of time:
Only .wpl playlists seem to work. Unicode will not work with .m3u or .m3u8 playlists on Sonos.
Sonos needs any unicode text to be normalised into NFC form - I'd never heard of this before, but essentially it means that any accented character has to be represented by a single precomposed character, not by a plain character followed by a separate combining accent (see the short demo after this list).
The .wpl playlist, which is an XML format, needs to have unicode characters encoded as XML numeric references, i.e. é is represented in the .wpl file as &#233;.
The .wpl file also needs the XML reserved characters (& < > ' ") in their escaped form, i.e. & is &amp;.
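Here is a tiny self-contained illustration of what NFC normalisation does, using only the standard library:
import unicodedata

decomposed = 'e\u0301'                 # 'e' followed by a combining acute accent
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == '\u00e9')            # True: the single precomposed character é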
In Python 3, converting a path from iTunes XML format into something suitable for a .wpl playlist on Sonos needs the following key steps:
import unicodedata
import urllib.parse

left = len(itunes_library['Music Folder'])
path_relative = 'Media/' + itunes_library['Tracks'][track_id]['Location'][left:]
path_unquoted = urllib.parse.unquote(path_relative)
path_norm = unicodedata.normalize('NFC', path_unquoted)
path = path_norm.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;')
playlist_wpl += '<media src="%s"/>\n' % path
with open(pl_path, 'wb') as pl_file:
    pl_file.write(playlist_wpl.encode('ascii', 'xmlcharrefreplace'))
A full working demo for exporting from iTunes for use with Sonos (or anything else) as .wpl is available here: https://github.com/dwalker-uk/iTunesToSonos
Hope that helps someone!
I currently have serious problems with coding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the Web browser via a HTTPServer. I'm using jinja2 for templating.
First, I received UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
That solved the problem at that point, but at some point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile attribute, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is, that it is working on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason. He has a lot more applications and maybe they do use encodings in their *.desktop files that raise the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '').decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '').decode(encoding).split(';')),
    }
    # ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile.
Thanks,
Niklas
This is probably obvious, but anyway: wfile is an ordinary byte stream, so every unicode string must be passed through .encode() before being written to it.
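For instance, a minimal sketch, where template and apps are placeholder names for your jinja2 template and its context:
# inside your BaseHTTPRequestHandler subclass
html = template.render(apps=apps)       # jinja2 returns a unicode object
self.wfile.write(html.encode('utf-8'))  # wfile expects bytes, so encode first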
Reading the OP, it is not clear to me exactly what is afoot. However, there are some tricks that I have found helpful when debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will output all non-ascii characters as '^X', which is the only fool-proof way I have found to decide what encoding a file really has. UTF-8 non-ascii characters are multi-byte, which means cat -v will show them as sequences of more than one such entry.
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is an a-diaeresis (ä) in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up ä, mistakenly represented in latin-1. I have found that knowing this sort of gorp cuts debugging time for encoding issues considerably.
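You can reproduce that particular mangling in two lines of Python 2.x:
print u'\xe4'                                    # ä
print u'\xe4'.encode('utf-8').decode('latin-1')  # Ã¤ - UTF-8 bytes shown as latin-1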
You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
By default, when Python hits an encoding issue with unicode, it raises an error. However, this behavior can be modified, for instance when the error is expected or not important.
Say you are converting between two code pages that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence, so you may want to ignore errors.
To do so, use the errors argument of the encode method.
# give the string a non-ASCII character and target an encoding that
# cannot represent it, so the error handlers actually have an effect
mystring = u'This is a test: \u03c0'
print mystring.encode('ascii', 'ignore')            # drops the pi entirely
print mystring.encode('ascii', 'replace')           # substitutes '?'
print mystring.encode('ascii', 'xmlcharrefreplace') # substitutes '&#960;'
print mystring.encode('ascii', 'backslashreplace')  # substitutes '\\u03c0'
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?
Try using .encode(encoding) instead of .decode(encoding) in all of its occurrences in the snippet.
I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that the files contain symbols/characters that are not ASCII. My initial solution was to read the files in using unicode.
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, and the first step of my processing requires that I find the right table. I can find the table based on words that are in the first cell of the first row. This is where it gets tricky. I had hoped to use a regular expression to test the text_content() of the cell, but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept defining just one of the tables was expressed). Therefore I decided to dump the text_content() of the particular cell out for every file and use some algorithms in Excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_'+str(column+1)] = string.encode('Latin-1', 'replace')
I did this based on previous answers to questions similar to mine here, where the consensus seemed to be to read the file in using unicode and then encode it just before it is written out.
So I processed the labels/words in Excel - converted them all to lower case, got rid of the spaces, and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled.
I then read the file in - the first time I did, I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur. Investigating, I found that unicode had replaced certain characters with \ufffd, as in the example below:
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turned up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation, but that was my interpretation).
So then I tried (after thinking, what do I have to lose?) reading in my list of labels without using unicode, with this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
Now, looking at the same label in the interpreter, I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match, and I get this error:
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value of tableHeader is NOT in the test set. When I asked for the value of tableHeader after it broke, I received this:
'fairvaluemeasurements:'
And to add insult to injury, when I type the test into IDLE
tableHeader in testSet
it correctly returns False.
I understand that '\xa0' is the code for a non-breaking space, and so does Python when I read the file in without using unicode. I thought I had gotten rid of all the spaces in Excel, but to handle these I split the labels and joined them again:
labels = [''.join(label.split()) for label in labels]
I still have not gotten to a question yet; sorry, I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behavior here. When I originally read the string in using unicode and UTF-8, all the characters were preserved/transportable, if you will. I encoded them to write them out and they displayed fine in Excel; I then saved them as a txt file and they looked okay. But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels, I have a feeling my problem would go away, but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly, but some of it just requires inspection.
As an aside, I will note that the source files all specify charset='UTF-8'.
Recap: when I read the source document and the list of labels in using unicode, I fail to make some matches because the labels have some characters replaced by \ufffd; and when I read the source document in using unicode but the list of labels in without any special handling, I get the warning.
I would like to understand what is going on so I can fix it, but I have exhausted all the places I can think to look.
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
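As an aside, the UnicodeWarning from the question can be reproduced in a couple of lines (Python 2.x); it is what happens whenever a unicode string is compared with a byte string that is not pure ASCII:
u_label = u'usd\xa0$'     # unicode, containing a non-breaking space
b_label = 'usd\xa0$'      # bytes; \xa0 cannot be decoded as ascii
print u_label == b_label  # False, after emitting the UnicodeWarning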
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
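A quick sketch of the difference (Python 2.x):
raw = 'usd\xa0$'             # the byte string from the question
print raw.decode('cp1252')   # works: \xa0 is a non-breaking space in cp1252
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print e                  # \xa0 is an invalid start byte in UTF-8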
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.
I'm parsing a JSON feed in Python and it contains this character, causing it not to validate.
Is there a way to handle these symbols? Can they be converted, or is there a tidy way to remove them?
I don't even know what this symbol is called or what causes them, otherwise I would research it myself.
EDIT: Stack Overflow is stripping the character, so here:
http://files.getdropbox.com/u/194177/symbol.jpg
It's that [?] symbol in "Classic 80s"
That probably means the text you have is in some sort of encoding, and you need to figure out what encoding, and convert it to Unicode with a thetext.decode('encoding') call.
I'm not sure, but it could possibly be the [?] character, meaning that the display you have there also doesn't know how to display it. That would probably mean that the data you have is incorrect, containing a character that doesn't exist in the encoding you are supposed to use. To handle that, call decode like this: thetext.decode('encoding', 'ignore'). There are other options than ignore, like 'replace', 'xmlcharrefreplace' and more.
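For example, a sketch (substitute the feed's real encoding for 'utf-8', which is only a guess here):
clean = thetext.decode('utf-8', 'replace')    # undecodable bytes become u'\ufffd'
stripped = thetext.decode('utf-8', 'ignore')  # or drop them entirely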
JSON must be encoded in one of UTF-8, UTF-16, or UTF-32. If a JSON file contains bytes which are illegal in its current encoding, it is garbage.
If you don't know which encoding it's using, you can try parsing using my jsonlib library, which includes an encoding-detector. JSON parsed using jsonlib will be provided to the programmer as Unicode strings, so you don't have to worry about encoding at all.