I'm a German developer writing web applications for Germans, which means I cannot by any means rely on plain ASCII encoding. At least characters like ä, ö, ü, ß have to be supported.
Fortunately, Django treats bytestrings as UTF-8 encoded by default (as described in the docs). So it should just work if I add the # -*- coding: utf-8 -*- line to the beginning of each .py file and set the editor encoding, shouldn't it? Well, it does most of the time...
But I seem to be missing something when it comes to URLs. Or maybe it has nothing to do with URLs, but so far I haven't noticed any other encoding misbehavior. There are two cases I can remember as examples:
The URL pattern url(r'^([a-z0-9äöüß_\-]+)/$', views.view_page) doesn't recognize URLs containing ä, ö, ü, ß at all. Those characters are simply ignored.
The following code of a view function throws an Exception:
def do_redirect(request, id):
    return redirect('/page/{0}'.format(id))
Where the id argument is captured from the URL, as in the first example. If I fix the URL pattern (by specifying it as a unicode string) and then access /ä/, I get the exception
UnicodeEncodeError at /ä/
'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
However, trying the following code for the view function:
def do_redirect(request, id):
    return redirect('/page/' + id)
everything works out fine. That makes me believe the actual problem lies not within Django but derives from Python treating bytestrings as ASCII. I'm not that much into encoding, but the problem in the second example is obviously the format() method of the string object. So, in the first example it might fail because of the way Python handles regular expressions (though I don't know if Django uses the re module or something else).
My workaround until now is just prefixing the string with u whenever such an error occurs. That's a bad solution since I might easily overlook something. I tried marking every Python string as unicode but that causes other exceptions and is quite ugly.
Does anyone know exactly, what the problem is and how to solve it in a pleasant way (i.e. a way that doesn't let your head explode when the code grows bigger)?
Thanks in advance!
EDIT: For my regular expression I found out why the u is needed. Specifying the string as a raw string (r) makes it get interpreted as ASCII. Leaving the r off makes the regex work without the u, but introduces some headache with backslashes.
Prefixing your strings with u is the solution.
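For the two snippets from the question, that fix would look roughly like this (a sketch for Python 2 and the Django version used here; the import lines and the myapp module name are assumptions, not the OP's code):

# -*- coding: utf-8 -*-
from django.conf.urls.defaults import patterns, url   # django.conf.urls on 1.4+
from django.shortcuts import redirect
from myapp import views                               # hypothetical app module

# ur'' makes the pattern a unicode raw string, so ä/ö/ü/ß are single
# characters instead of UTF-8 byte pairs:
urlpatterns = patterns('',
    url(ur'^([a-z0-9äöüß_\-]+)/$', views.view_page),
)

def do_redirect(request, id):
    # a unicode format string keeps the result unicode, so no implicit
    # ASCII encoding happens inside format():
    return redirect(u'/page/{0}'.format(id))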
If it's a problem for you, then it looks like a symptom of a more general problem: you have a lot of magic constants in your code. That is bad (and you already see why). Try to avoid them; for example, you can use a named URL pattern or the view name for redirecting instead of re-typing part of the URL (see the sketch below).
If you can't avoid them, turn them into named constants and place their assignments in one place. Then you'll see that all of them are prefixed properly, and it will be difficult to overlook one.
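A sketch of the named-pattern idea, using the question's regex; the name 'page-detail' is invented, and the group is named so redirect() can pass it by keyword:

# -*- coding: utf-8 -*-
# urls.py -- name the pattern once
from django.conf.urls.defaults import patterns, url   # django.conf.urls on 1.4+
from myapp import views                               # hypothetical app module

urlpatterns = patterns('',
    url(ur'^(?P<id>[a-z0-9äöüß_\-]+)/$', views.view_page, name='page-detail'),
)

# views.py -- redirect by name instead of re-typing the path literal
from django.shortcuts import redirect

def do_redirect(request, id):
    return redirect('page-detail', id=id)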
In Django 1.4, one of the new features is better support for URL internationalization, including support for translating URLs.
This would go a long way in helping you out, but it doesn't mean you should ignore the other advice, as that is about Python in general and applies to everything, not just Django.
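From memory, the 1.4 API for translated URLs looks roughly like this (the pattern and view path are made up for illustration; LocaleMiddleware has to be enabled for the language-prefixed URLs to resolve):

# urls.py (Django 1.4) -- sketch, not the OP's project
from django.conf.urls import patterns, url
from django.conf.urls.i18n import i18n_patterns
from django.utils.translation import ugettext_lazy as _

urlpatterns = i18n_patterns('',
    # the regex is wrapped in ugettext_lazy so it can be translated
    url(_(r'^pages/(?P<id>[-\w]+)/$'), 'pages.views.view_page', name='page-detail'),
)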
Related
I am throwing in the towel here. I'm trying to convert a string scraped from the source code of a website with scrapy (injected javascript) to json so I can easily access the data. The problem comes down to a decode error. I tried all kinds of encoding, decoding, escaping, codecs, regular expressions, string manipulations and nothing works. Oh, using Python 3.
I narrowed the culprit down to this string (or at least part of it):
scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
scraped_raw = r'{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
data = json.loads(scraped_raw) #<= works
print(data["propertyNotes"])
failed = json.loads(scraped) #no work
print(failed["propertyNotes"])
Unfortunately, I cannot find a way for scrapy/splash to return the string as raw. So, somehow I need to have Python interpret the string as raw while loading the JSON. Please help.
Update:
What worked for that string was json.loads(str(data.encode('unicode_escape'), 'utf-8')). However, it didn't work with the larger string. The error I get doing this is JSONDecodeError: Invalid \escape on the larger JSON string.
The problem exists because the string you're getting has escaped control characters which, when interpreted by Python, become actual bytes (this is not necessarily bad, but we know these are control characters that JSON would not expect inside a string). Similar to Turn's answer, you need to read the string without interpreting the escaped values, which is done using
json.loads(scraped.encode('unicode_escape'))
This works by encoding the contents as Latin-1 while treating any \u003C-like escape as the literal text \u003C, unless it's some sort of control character.
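A cut-down demonstration of both the failure and the fix (assuming Python 3.6+, where json.loads accepts bytes):

import json

# '\u003C' is already '<' after normal string parsing, but '\n' has become
# a real newline byte, which JSON forbids inside a string literal.
scraped = '{"text": "\u003Cp\u003Ehi\u003C/p\u003E\n"}'

# json.loads(scraped) raises: json.decoder.JSONDecodeError: Invalid control character ...

# unicode_escape turns the newline back into the two characters '\' 'n',
# which JSON is happy to parse.
fixed = json.loads(scraped.encode('unicode_escape'))
print(fixed["text"])        # <p>hi</p> plus a trailing newline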
If my understanding is correct however, you may not want this because you then lose the escaped control characters so the data might not be the same as the original.
You can see this in action by noticing that the control chars disappear after converting the encoded string back to a normal python string:
scraped.encode('unicode_escape').decode('utf-8')
If you want to keep the control characters you're going to have to attempt to escape the strings before loading them.
If you are using Python 3.6 or later I think you can get this to work with
json.loads(scraped.encode('unicode_escape'))
As per the docs, this will give you an
Encoding suitable as the contents of a Unicode literal in
ASCII-encoded Python source code, except that quotes are not escaped.
Decodes from Latin-1 source code. Beware that Python source code
actually uses UTF-8 by default.
Which seems like exactly what you need.
OK, so since I am on Windows, I have to set the console to handle special characters. I did this by typing chcp 65001 into the terminal. I also used a regular expression and chained the string manipulation functions, which is the Python way anyway.
usable_json = json.loads(re.search('start_sub_string(.*)end_sub_string', hxs.xpath("//script[contains(., 'some_string')]//text()").extract_first()).group(1))
Then everything went smoothly. I'll sort out the encoding and escaping when writing to the database down the line.
I currently have serious problems with coding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the web browser via an HTTPServer. I'm using jinja2 for templating.
First, I received a UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
That solved the problem there, but at some point I write a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason; he has a lot more applications, and maybe some of them use encodings in their *.desktop files that raise the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '')
                                   .decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '')
                                  .decode(encoding).split(';')),
    }
# ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile.
Thanks,
Niklas
This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything written to it must first be turned into bytes with unicode.encode().
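For example (a sketch only; the handler, template and context names are assumptions, not the OP's code):

import BaseHTTPServer

class AppHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        html = template.render(apps=apps)     # unicode object from jinja2 (assumed names)
        body = html.encode('utf-8')           # wfile wants bytes
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)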
Reading the OP, it is not clear to me what exactly is afoot. However, there are some tricks I have found helpful for debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will output all non-ASCII characters as '^X', which is the only fool-proof way I have found to decide what encoding a file really has. UTF-8 non-ASCII characters are multi-byte, which means they will show up as sequences of more than one such entry in the output of cat -v.
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is an a-diaeresis (ä) in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up an a-diaeresis, mistakenly displayed as latin-1. I have found that knowing this sort of gorp cuts down the time spent debugging encoding issues considerably.
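A tiny Python 2 session illustrating the last two points:

# -*- coding: utf-8 -*-
s = u'\xe4'                         # ä as a unicode object
assert isinstance(s, unicode)       # fails loudly if a byte string sneaks in

utf8_bytes = s.encode('utf-8')      # '\xc3\xa4' -- two bytes
print utf8_bytes.decode('latin-1')  # prints the classic mojibake: Ã¤
print utf8_bytes.decode('utf-8')    # prints ä again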
You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
By default, when Python hits an encoding issue with unicode, it throws an error. However, this behavior can be modified, for instance when the error is expected or not important.
Say you are converting between two code pages that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence. Therefore, you might want to ignore errors.
To do so, use the errors argument of the encode method.
mystring = u'This is a test'
print mystring.encode('utf-8', 'ignore')
print mystring.encode('utf-8', 'replace')
print mystring.encode('utf-8', 'xmlcharrefreplace')
print mystring.encode('utf-8', 'backslashreplace')
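Note that the example above never actually exercises the handlers, since 'This is a test' is pure ASCII and UTF-8 can encode everything anyway. Here is a variant (mine, not from the answer above) where the difference becomes visible, by encoding a non-ASCII character to plain ASCII:

word = u'Stra\xdfe'                              # "Strasse" with an eszett
print word.encode('ascii', 'ignore')             # Strae
print word.encode('ascii', 'replace')            # Stra?e
print word.encode('ascii', 'xmlcharrefreplace')  # Stra&#223;e
print word.encode('ascii', 'backslashreplace')   # Stra\xdfe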
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?
Try using .encode(encoding) instead of .decode(encoding) in all its occurrences in the snippet.
I have an sqlite db that has some crazy ascii characters in it and I would like to remove them, but I have no idea how to go about doing it. I googled some stuff and found some people saying to use REGEXP with mysql, but that threw an error saying REGEXP wasn't recognized.
Here is the error I get:
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'table_name' with text ...
Thanks for the help
Well, if you really want to shoehorn a rich unicode string into a plain ascii string (and don't mind some goofs), you could use this:
import unicodedata as ud

def shoehorn_unicode_into_ascii(s):
    # This removes accents, but also other things, like ß‘’“”
    return ud.normalize('NFKD', s).encode('ascii', 'ignore')
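For instance (a quick check of my own, not part of the original answer):

>>> shoehorn_unicode_into_ascii(u'M\xfcller \u2013 Stra\xdfe')   # Müller – Straße
'Muller  Strae'

Accented letters survive as bare letters, while ß and the dash are dropped entirely.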
For a more complete solution (with somewhat fewer goofs, but requiring a third-party module unidecode), see this answer.
Really, though, the best solution is to work with unicode data throughout your code as much as possible, and drop to an encoding only when necessary.
django.utils.encoding has a great set of robust unicode encoding and decoding functions.
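For example (a sketch using the Django 1.x-era names; they were renamed in later versions):

from django.utils.encoding import force_unicode, smart_str

force_unicode('\xc3\xa4')   # u'\xe4' -- decodes UTF-8 bytes to a unicode object
smart_str(u'\xe4')          # '\xc3\xa4' -- encodes a unicode object to UTF-8 bytes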
I am trying to encode, store, and decode arguments in Python and am getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (python), I store these string arguments as something like u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iphone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being some monetary prefixes like pound, yen, etc)
3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape them for transport to the client so the client can unpercent escape them.
The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?
Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.
The exception I received cited an issue with \u20ac. I don't know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.
URL-encoding a "raw" unicode string doesn't really make sense. What you need to do is .encode("utf8") it first so you have a known byte encoding, and then .quote() that.
The output isn't very pretty, but it should be a correct URI encoding.
>>> s = u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, that nasty "â‚¬Â£Â¥â€¢" means we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python’s standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
Be careful if you are applying any further quotes or encodings not to mangle things.
i want to second pycruft's remark. web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. now URLs happen to be explicitly not defined for characters, but only for bytes (octets). as a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect an encoding to be present. however, there is a convention to prefer latin-1 and utf-8 over other encodings here. for a while, it looked like 'unicode percent escapes' would be the future, but they never caught on.
it is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (that's unicode vs. str in Python < 3.0, and, confusingly, str vs. bytes/bytearray objects in Python >= 3.0). unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
even more OT, when you want to receive third-party HTTP requests, you can not absolutely rely on URLs being sent in percent-escaped, utf-8-encoded octets: there may both be the occasional %uxxxx escape in there, and at least firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
You are out of luck with the stdlib; urllib.quote doesn't work with unicode. If you are using Django, you can use django.utils.http.urlquote, which works properly with unicode.
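A quick check of that suggestion (Django 1.x; it converts to UTF-8 first and then percent-escapes):

>>> from django.utils.http import urlquote
>>> urlquote(u'caf\xe9/\u20ac')
u'caf%C3%A9/%E2%82%AC'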
I am quite new to Python so my question might be silly, but even though reading through a lot of threads I didn't find an answer to my question.
I have a mixed source document which contains html, xml, latex and other text formats, and which I am trying to get into a latex-only format.
Therefore, I have used Python to recognise the different commands as regular expressions and replace them with the appropriate latex command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode characters, such as the Greek letters. Unfortunately it is just about too much to do by hand, so I am looking for a way to do this the smart way too. Is there a way for Python to recognise/read them? And how do I tell Python to recognise/read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
fh = open('SOURCE_DOCUMENT','r')
stuff = fh.read()
fh.close()
new_stuff = re.sub('READ','REPLACE',stuff)
fh = open('LATEX_DOCUMENT','w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is an important information or not, but I am using Python 2.6 running on windows.
I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work, or whether I am completely wrong and Python can't do this job...
Many thanks in advance.
Cheers,
Britta
You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
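Applied to the snippet from the question, that might look like this (assuming the source document is UTF-8; adjust the encoding argument if it is not):

# Python 2.6 sketch using codecs.open instead of the built-in open
import codecs
import re

fh = codecs.open('SOURCE_DOCUMENT', 'r', encoding='utf-8')
stuff = fh.read()                      # now a unicode object, not a byte string
fh.close()

new_stuff = re.sub(u'READ', u'REPLACE', stuff)

fh = codecs.open('LATEX_DOCUMENT', 'w', encoding='utf-8')
fh.write(new_stuff)                    # encoded back to UTF-8 on write
fh.close()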
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the "encoding" of the input document. Unicode can encode millions of characters, but files can only store 8-bit values (0-255). So the Unicode text must be encoded in some way.
If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.