decode/encode problems - python

I currently have serious problems with decoding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the Web browser via an HTTPServer. I'm using jinja2 for templating.
First, I received UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
The problem at that place was solved, but at some point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile attribute, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason: he has a lot more applications installed, and maybe some of their *.desktop files use encodings that trigger the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '')
                                   .decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '')
                                  .decode(encoding).split(';')),
    }
    # ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile anyway.
Thanks,
Niklas

This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything must be encoded with unicode.encode() before being written to it.
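For example, a minimal sketch of a handler doing that (the template and apps names here are hypothetical, not your actual code):
import BaseHTTPServer

class AppHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # 'template' stands in for a jinja2.Template; render() returns unicode
        html = template.render(apps=apps)
        body = html.encode('utf-8')  # encode the unicode object to bytes
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)       # wfile only ever sees bytes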
Reading the OP, it is not clear to me what, exactly, is afoot. However, there are some tricks that I have found helpful when debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will output all non-ascii characters as '^X', which is the only fool-proof way I have found to decide what encoding a file really has. UTF-8 non-ascii characters are multi-byte, which means they will show up as sequences of more than one '^' entry in the output of cat -v.
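If you would rather stay in Python, a rough equivalent is to read the raw bytes and look at their repr() (the path is just an example):
raw = open('/usr/share/applications/firefox.desktop', 'rb').read()
# Multi-byte runs such as '\xc3\xa4' suggest UTF-8; lone bytes such as
# '\xe4' suggest an 8-bit encoding like latin-1.
print repr(raw[:200])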
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is ä (a-diaeresis) in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up ä mistakenly interpreted as latin-1. I have found that knowing this sort of gorp cuts the time spent debugging encoding issues considerably.
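A quick interpreter session makes that last point concrete (a sketch; how the bytes render depends on your terminal):
s = u'\xe4'                    # LATIN SMALL LETTER A WITH DIAERESIS
utf8 = s.encode('utf-8')       # the two bytes '\xc3\xa4'
print repr(utf8)
# Decoding those UTF-8 bytes as latin-1 yields the classic mojibake:
print repr(utf8.decode('latin-1'))   # u'\xc3\xa4', renders as 'Ã¤'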

You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?

By default, when Python hits an encoding issue with unicode, it throws an error. However, this behavior can be modified, for instance when the error is expected or not important.
Say you are converting between two encodings that are supersets of ASCII. Both have mostly the same characters, but there is no one-to-one correspondence, so you may want to ignore the characters that cannot be converted.
To do so, use the errors argument of the encode method.
# Encoding to ascii would normally raise UnicodeEncodeError on the 'ä';
# the errors argument controls what happens instead.
mystring = u'This is a test: \xe4'
print mystring.encode('ascii', 'ignore')             # drops the character
print mystring.encode('ascii', 'replace')            # substitutes '?'
print mystring.encode('ascii', 'xmlcharrefreplace')  # inserts '&#228;'
print mystring.encode('ascii', 'backslashreplace')   # inserts '\xe4'
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that once you have decoded your input, you hand jinja2 actual unicode objects rather than bytestrings.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?

Try using .encode(encoding) instead of .decode(encoding) in all its occurrences in the snippet.

Related

Python crawler gets messy output which seems to mix multiple encodings

I got a location u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82', which actually should be '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'. How can I decode something like this?
I suggest you read python 2.7 unicode.
u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82' does not equal '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82', so I suppose there is something wrong with your crawler code.
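For reference, the bytes you say you should have decode cleanly as UTF-8, so that is almost certainly the encoding the page actually uses:
raw = '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'
# Valid UTF-8: decodes to three CJK characters (u'\u676d\u5dde\u5e02').
print raw.decode('utf-8')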
In Python 2.x you should be careful with encodings. In Python 2 we have two text types: str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and unicode, which is equivalent to the Python 3 str type; plus one byte type, bytearray, which it inherited from Python 3.
Python 2 provides a migration path from non-Unicode into Unicode by allowing coercion of byte-strings and non-byte-strings.
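A short demo of where that coercion breaks down (you can paste this into a 2.x interpreter):
# Mixing str and unicode makes Python 2 decode the str implicitly,
# using the default encoding (ascii), which fails on non-ascii bytes.
print 'abc' + u'def'            # fine: 'abc' is pure ascii
try:
    print '\xe4bc' + u'def'     # implicit ascii decode blows up
except UnicodeDecodeError as e:
    print 'coercion failed:', e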
You can check out More About Unicode in Python 2 and 3.
Also, you can add this at the start of your script; it sets the system default encoding to utf-8. It's useful for testing a program and it will fix your issue.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
As a matter of fact, I don't suggest using this in a large program; it might trigger other issues.
Encoding problems in Python 2.x are really discouraging, and if you want to avoid them you should start to think seriously about switching to Python 3.
Hope this helps.

Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?

I spent a few angry hours looking for a problem with Unicode strings that was broken down to something that Python (2.7) hides from me, and I still don't understand it. First, I tried to use u"..." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either, and it all works out automagically.

However (here I need to give credit to a friend who helped me), I did notice something weird while banging my head against the wall: sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. Case 1. in the code below works fine without any modifications to sys, while case 2. raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2. works fine.

My question is: why are the two encoding variables different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've read that, obviously, in the tens of questions about UnicodeEncodeError.
# -*- coding: utf-8 -*-
import sys

class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value

print(sys.getdefaultencoding())
print(sys.stdout.encoding)

# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)

reload(sys).setdefaultencoding("utf8")

# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
My question is why the two encoding variables are different in the first place
They serve different purposes.
sys.stdout.encoding should be the encoding that your terminal uses to interpret text otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.
sys.getdefaultencoding() is used on Python 2 for implicit conversions (when the encoding is not set explicitly), i.e., Python 2 may mix ascii-only bytestrings and Unicode strings together. For example, xml.etree.ElementTree stores text in the ascii range as bytestrings, and json.dumps() returns an ascii-only bytestring instead of Unicode in Python 2, presumably for performance: bytes were cheaper than Unicode for representing ascii characters. Implicit conversions are forbidden in Python 3.
sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do; otherwise it may hide bugs, and your data may be easily corrupted due to implicit conversions using a possibly wrong encoding for the data.
btw, there is another common encoding sys.getfilesystemencoding() that may be different from the two. sys.getfilesystemencoding() should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.
Naturally, if you read data from a file or the network, it may use a character encoding different from all of the above. E.g., if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can be different from it.
The point being: there can be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
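A minimal sketch of the sandwich in Python 2 (file names and encodings here are only examples):
import io

# Input edge: decode bytes to Unicode as soon as possible.
with io.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()          # unicode inside the program

# Middle: all text processing happens on unicode objects.
text = text.upper()

# Output edge: encode back to bytes as late as possible.
with io.open('output.txt', 'w', encoding='cp1252') as f:
    f.write(text)            # io.open encodes on write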
how do I manage to use the wrong encoding in this simple piece of code?
Your first code example is not fine: it uses non-ascii literal characters in a byte string on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings, if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it using Python 2 in any environment that does not use a utf-8-compatible encoding, such as the Windows console.
The second code example is OK, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals.
Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), or to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
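For example, when the output goes to a pipe or a file (the script name is hypothetical):
$ PYTHONIOENCODING=utf-8 python ./foo.py > output.txt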
I have noticed the same behaviour of some standard code (mailman library).
Thanks for your analysis, it helped me save some time. :-)
The problem is exactly the same. My system uses sys.getdefaultencoding() and gets ascii, which is inappropriate for handling a list of 1000 UTF-8 encoded names.
There is a mismatch between the stdin/stdout and even filesystem encoding (utf-8) on one hand and the "defaultencoding" (ascii) on the other. This thread: How to print UTF-8 encoded text to the console in Python < 3? seems to indicate that this is well known, and Changing default encoding of Python? contains some indication that a more homogeneous setup (like "utf-8 everywhere") would break other things, such as the hash implementation.
For that reason it is also not straightforward to change the default encoding (see http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so); setdefaultencoding is removed from the sys module by site.py.

Python program works in Eclipse but not when I run it directly (Unicode stuff)

I have searched and found some related problems but the way they deal with Unicode is different, so I can't apply the solutions to my problem.
I won't paste my whole code but I'm sure this isolated example code replicates the error:
(I'm also using wx for GUI so this is like inside a class)
#coding: utf-8
...
self.something = u'ЧЕТЫРЕ'
# show the Russian text in a Label on the GUI
self.ExampleLabel.SetValue(str(self.something))
In Eclipse everything works perfectly and it displays the Russian characters. However, when I run the file directly with Python, I get this error on the command line:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11:
ordinal not in range(128)
I figured this has something to do with the command line not being able to output the Unicode chars, and Eclipse doing some behind-the-scenes magic. Any help on how to make it work on its own?
When you call str() on something without specifying an encoding, the default encoding is used, which depends on the environment your program is running in. In Eclipse, that's different from the command line.
Don't rely on the default encoding, instead specify it explicitly:
self.ExampleLabel.SetValue(self.something.encode('utf-8'))
You may want to study the Python Unicode HOWTO to understand what encoding and str() do with unicode objects. The wxPython project has a page on Unicode usage as well.
Try self.something.encode('utf-8') instead.
If you use repr instead of str, it will handle the conversion for you and also cover the case where the object is not always of type string, but you may find that it gives you an extra set of quotes, or even the unicode u prefix, in your context. repr is safer than str: str assumes ascii encoding, while repr shows your codepoints the same way you would see them in code. Since wrapping the result with eval is supposed to convert it back to what it was, the repr has to be in a form Python code could take, namely ascii-safe, since most Python code is written in ascii.
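A short demonstration of the difference, assuming the stock ascii default encoding (a sketch, not the OP's code):
s = u'\u0427\u0415\u0422\u042b\u0420\u0415'   # u'ЧЕТЫРЕ'
print repr(s)        # ascii-safe escapes, always printable
try:
    str(s)           # implicit encode with the ascii codec
except UnicodeEncodeError as e:
    print 'str() failed:', e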

In Django, why do I get problems with utf-8 encoded strings?

I'm a German developer writing web applications for Germans, which means I cannot by any means rely on plain ASCII encoding. At least characters like ä, ö, ü, ß have to be supported.
Fortunately, Django treats bytestrings as utf-8 encoded by default (as described in the docs). So it should just work if I add the # -*- coding: utf-8 -*- line to the beginning of each .py file and set the editor encoding, shouldn't it? Well, it does most of the time...
But I seem to miss something when it comes to URLs. Or maybe that has not to do anything with URLs but until now I didn't notice any other encoding misbehavior. There are two cases I can remember as examples:
The URL pattern url(r'^([a-z0-9äöüß_\-]+)/$', views.view_page) doesn't recognize URLs containing ä, ö, ü, ß at all. Those characters are simply ignored.
The following code of a view function throws an Exception:
def do_redirect(request, id):
return redirect('/page/{0}'.format(id))
Where the id argument is captured from the URL, like the one in the first example. If I fix the URL pattern (by specifying it as a unicode string) and then access /ä/, I get the exception
UnicodeEncodeError at /ä/
'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
However, trying the following code for the view function:
def do_redirect(request, id):
return redirect('/page/' + id)
everything works out fine. That makes me believe the actual problem lies not within Django but derives from Python treating bytestrings as ASCII. I'm not that much into encoding, but the problem in the second example is obviously the format() method of the string object. So the first example might fail because of the way Python handles regular expressions (though I don't know if Django uses the re module or something else).
My workaround until now is just prefixing the string with u whenever such an error occurs. That's a bad solution since I might easily overlook something. I tried marking every Python string as unicode but that causes other exceptions and is quite ugly.
Does anyone know exactly, what the problem is and how to solve it in a pleasant way (i.e. a way that doesn't let your head explode when the code grows bigger)?
Thanks in advance!
EDIT: For my regular expression I found out why the u is needed: specifying a string as a raw string (r) makes it be interpreted as ASCII. Leaving the r away makes the regex work without the u, but introduces some headache with backslashes.
Prefixing your strings with u is the solution.
If it's a problem for you, then it looks like a symptom of a more general problem: you have a lot of magic constants in your code. That is bad, and you already see why. Try to avoid them; for example, you can use a named url pattern or the view name for redirecting instead of re-typing part of the URL, as shown below.
If you can't avoid them, turn them into named constants and place their assignments in one place. Then you'll see that all of them are prefixed properly, and it will be difficult to overlook one.
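For instance, a sketch of the named-pattern approach (the pattern name and files here are hypothetical, using the Django 1.4-era API):
# urls.py -- give the pattern a name instead of re-typing the URL
from django.conf.urls import patterns, url
import views

urlpatterns = patterns('',
    url(r'^page/(?P<id>[\w-]+)/$', views.view_page, name='page-detail'),
)

# views.py -- redirect by name; Django reverses the URL for you
from django.shortcuts import redirect

def do_redirect(request, id):
    return redirect('page-detail', id=id)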
In Django 1.4, one of the new features is better support for URL internationalization, including support for translating URLs.
This would go a long way in helping you out, but it doesn't mean you should ignore the other advice, as that applies to Python in general and not just Django.

Reading "raw" Unicode-strings in Python

I am quite new to Python so my question might be silly, but even though reading through a lot of threads I didn't find an answer to my question.
I have a mixed-source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to get into a LaTeX-only format.
Therefore, I have used Python to recognise the different commands as regular expressions and replace them with the adequate LaTeX command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode signs, such as the Greek letters. Unfortunately it is just about too much to do by hand. Therefore, I am looking for a way to do this the smart way too. Is there a way for Python to recognise/read them? And how do I tell Python to recognise/read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
import re

fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

new_stuff = re.sub('READ', 'REPLACE', stuff)

fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is an important information or not, but I am using Python 2.6 running on windows.
I would be really glad if someone could give me a hint, at least where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job...
Many thanks in advance.
Cheers,
Britta
You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
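Applied to your snippet, a minimal sketch with codecs.open (assuming the source really is UTF-8):
import codecs
import re

fh = codecs.open('SOURCE_DOCUMENT', 'r', encoding='utf-8')
stuff = fh.read()      # unicode, decoded from UTF-8 on read
fh.close()

new_stuff = re.sub(u'READ', u'REPLACE', stuff)

fh = codecs.open('LATEX_DOCUMENT', 'w', encoding='utf-8')
fh.write(new_stuff)    # encoded back to UTF-8 on write
fh.close()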
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the "encoding" of the input document. Unicode can encode over a million characters, but files can only store 8-bit values (0-255), so the Unicode text must be encoded in some way.
If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
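If you want to automate the trial-and-error, a rough sketch (the candidate list is just an example):
import codecs

def sniff_encoding(path, candidates=('utf-8', 'cp1252', 'latin-1')):
    # Return the first candidate that decodes the whole file without
    # error; latin-1 never fails, so keep it last as a catch-all.
    for enc in candidates:
        fh = codecs.open(path, 'r', encoding=enc)
        try:
            fh.read()
            return enc
        except UnicodeDecodeError:
            continue
        finally:
            fh.close()
    return None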
