How to properly handle non ASCII strings in python

How to properly handle non ASCII strings in python - python

I'm building an application that in the database has data with latin symbols. Users are able to enter this data.
What I've been doing so far is encode('latin2') every user input and decode('latin2') at the very end when displaying data in the template.
This is a bit annoying and I'm wondering if there is any better way of handling this.

Python's unicode type is designed to be the "natural" representation for strings. Besides the unicode type, strings are expected to be in some unspecified encoding but there's no way to "tag" them with the encoding used, and python will very insistently assume that strings are in ASCII or UTF-8 encoding. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes getting bad data in your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and impossible to debug.
I would recommend you see about converting your db data to UTF-8.
If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).
For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:
<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>
See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).

try add this to the top of your program.
import sys
reload(sys)
sys.setdefaultencoding('latin2')
We have to reload sys because:
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>

Related

How to validate the decoding of a bytearray without raising an exception?

Is there a way to try to decode a bytearray without raising an error if the encoding fails?
EDIT: The solution needn't use bytearray.decode(...). Anything library (preferably standard) that does the job would be great.
Note: I don't want to ignore errors, (which I could do using bytearray.decode(errors='ignore')). I also don't want an exception to be raised. Preferably, I would like the function to return None, for example.
my_bytearray = bytearray('', encoding='utf-8')
# ...
# Read some stream of bytes into my_bytearray.
# ...
text = my_bytearray.decode()
If my_bytearray doesn't contain valid UTF-8 text, the last line will raise an error.
Question: Is there a way to perform the validation but without raising an error?
(I realize that raising an error is considered "pythonic". Let's assume this is undesirable for some or other good reason.)
I don't want to use a try-catch block because this code gets called thousands of times and I don't want my IDE to stop every time this exception is raised (whereas I do want it to pause on other errors).

You could use the suppress context manager to suppress the exception and have slightly prettier code than with try/except/pass:
import contextlib
...
return_val = None
with contextlib.suppress(UnicodeDecodeError):
return_val = my_bytearray.decode('utf-8')

The chardet module can be used to detect the encoding of a bytearray before calling bytearray.decode(...).
The Code:
import chardet
identity = chardet.detect(my_bytearray)
The method chardet.detect(...) returns a dictionary with the following format:
{
'confidence': 0.99,
'encoding': 'ascii',
'language': ''
}
One could check analysis['encoding'] to confirm that my_bytearray is compatible with an expected set of text encoding before calling my_bytearray.decode().
One consideration of using this approach is that the encoding indicated by the analysis might indicate one of a number of equivalent encodings. In this case, for instance, the analysis indicates that the encoding is ASCII whereas it could equivalently be UTF-8.
(Credit to #simon who pointed this out on StackOverflow here.)

Writing unicode symbols to files (as opposed to unicode code)

I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open file with i.e. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string, I know those two methods are the opposites but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to text file, then with python read the file, re-write what I've just read back to the same file and save, symbols get turned to the code \u9a6c. Not sure if this is importat, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects repr method, which turned out to be where the problem lied. I didn't mention it because it just didn't occur to me it can be related to the problem somehow. Thanks again for your help!

From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead, codecs.open() handles encoding to UTF-8 correctly if you do.

Python: base64.b64decode() vs .decode?

The Code Furies have turned their baleful glares upon me, and it's fallen to me to implement "Secure Transport" as defined by The Direct Project. Whether or not we internally use DNS rather than LDAP for sharing certificates, I'm obviously going to need to set up the former to test against, and that's what's got me stuck. Apparently, an X509 cert needs some massaging to be used in a CERT record, and I'm trying to work out how that's done.
The clearest thing I've found is a script on Videntity's blog, but not being versed in python, I'm hitting a stumbling block. Specifically, this line crashes:
decoded_clean_pk = clean_pk.decode('base64', strict)
since it doesn't seem to like (or rather, to know) whatever 'strict' is supposed to represent. I'm making the semi-educated guess that the line is supposed to decode the base64 data, but I learned from the Debian OpenSSL debacle some years back that blindly diddling with crypto-related code is a Bad Thing(TM).
So I turn the illustrious python wonks on SO to ask if that line might be replaced by this one (with the appropriate import added):
decoded_clean_pk = base64.b64decode(clean_pk)
The script runs after that change, and produces correct-looking output, but I've got enough instinct to know that I can't necessarily trust my instincts here. :)

This line should've work if you would've called like this:
decoded_clean_pk = clean_pk.decode('base64', 'strict')
Notice that strict has to be a string, otherwise python interpreter would try to search for a variable named strict and if it didn't find it or otherwise has other value than: strict, ignore, and replace, it'll probably would've complain about it.
Take a look at this code:
>>>b=base64.b64encode('hello world')
>>>b.decode('base64')
'hello world'
>>>base64.b64decode(b)
'hello world'
Both decode and b64decode works the same when .decode is passed the base64 argument string.
The difference is that str.decode will take a string of bytes as arguments and will return it's Unicode representation depending on the encoding argument you pass as first parameter. In this case, you're telling it to handle a bas64 string so it will do it ok.
To answer your question, both works the same, although b64decode/encode are meant to work only with base64 encodings and str.decode can handle as many encodings as the library is aware of.
For further information take a read at both of the doc sections: decode and b64decode.
UPDATE: Actually, and this is the most important example I guess :) take a look at the source code for encodings/base64_codec.py which is that decode() uses:
def base64_decode(input,errors='strict'):
""" Decodes the object input and returns a tuple (output
object, length consumed).
input must be an object which provides the bf_getreadbuf
buffer slot. Python strings, buffer objects and memory
mapped files are examples of objects providing this slot.
errors defines the error handling to apply. It defaults to
'strict' handling which is the only currently supported
error handling for this codec.
"""
assert errors == 'strict'
output = base64.decodestring(input)
return (output, len(input))
As you may see, it actually uses base64 module to do it :)
Hope this clarify in some way your question.

Running JSON through Python's eval()?

DO NOT DO THIS.
This question is still getting upvotes, so I wanted to add a warning to it. If you're using Python 3, just use the included json package. If you're using Python 2, do everything you can to move to Python 3. If you're prevented from using Python 3 (my condolences), use the simplejson package suggested by James Thompson.
Original question follows.
Best practices aside, is there a compelling reason not to do this?
I'm writing a post-commit hook for use with a Google Code project, which provides commit data via a JSON object. GC provides an HMAC authentication token along with the request (outside the JSON data), so by validating that token I gain high confidence that the JSON data is both benign (as there's little point in distrusting Google) and valid.
My own (brief) investigations suggest that JSON happens to be completely valid Python, with the exception of the "\/" escape sequence — which GC doesn't appear to generate.
So, as I'm working with Python 2.4 (i.e. no json module), eval() is looking really tempting.
Edit: For the record, I am very much not asking if this is a good idea. I'm quite aware that it isn't, and I very much doubt I'll ever use this technique for any future projects even if I end up using it for this one. I just wanted to make sure that I know what kind of trouble I'll run into if I do. :-)

If you're comfortable with your script working fine for a while, and then randomly failing on some obscure edge case, I would go with eval.
If it's important that your code be robust, I would take the time to add simplejson. You don't need the C portion for speedups, so it really shouldn't be hard to dump a few .py files into a directory somewhere.
As an example of something that might bite you, JSON uses Unicode and simplejson returns Unicode, whereas eval returns str:
>>> simplejson.loads('{"a":1, "b":2}')
{u'a': 1, u'b': 2}
>>> eval('{"a":1, "b":2}')
{'a': 1, 'b': 2}
Edit: a better example of where eval() behaves differently:
>>> simplejson.loads('{"X": "\uabcd"}')
{u'X': u'\uabcd'}
>>> eval('{"X": "\uabcd"}')
{'X': '\\uabcd'}
>>> simplejson.loads('{"X": "\uabcd"}') == eval('{"X": "\uabcd"}')
False
Edit 2: saw yet another problem today pointed out by SilentGhost: eval doesn't handle true -> True, false -> False, null -> None correctly.
>>> simplejson.loads('[false, true, null]')
[False, True, None]
>>> eval('[false, true, null]')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "<string>", line 1, in <module>
NameError: name 'false' is not defined
>>>

The point of best practices is that in most cases, it's a bad idea to disregard them. If I were you, I'd use a parser to parse JSON into Python. Try out simplejson, it was very straightforward for parsing JSON when I last tried it and it claims to be compatible with Python 2.4.
I disagree that there's little point in distrusting Google. I wouldn't distrust them, but I'd verify the data you get from them. The reason that I'd actually use a JSON parser is right in your question:
My own (brief) investigations suggest that JSON happens to be completely valid Python, with the exception of the "/" escape sequence — which GC doesn't appear to generate.
What makes you think that Google Code will never generate an escape sequence like that?
Parsing is a solved problem if you use the right tools. If you try to take shortcuts like this, you'll eventually get bitten by incorrect assumptions, or you'll do something like trying to hack together a parser with regex's and boolean logic when a parser already exists for your language of choice.

One major difference is that a boolean in JSON is true|false, but Python uses True|False.
The most important reason not to do this can be generalized: eval should never be used to interpret external input since this allows for arbitrary code execution.

evaling JSON is a bit like trying to run XML through a C++ compiler.
eval is meant to evaluate Python code. Although there are some syntactical similarities, JSON isn't Python code. Heck, not only is it not Python code, it's not code to begin with. Therefore, even if you can get away with it for your use-case, I'd argue that it's a bad idea conceptually. Python is an apple, JSON is orange-flavored soda.

What is the default content-type/charset?

According to this answer: urllib2 read to Unicode
I have to get the content-type in order to change to Unicode. However, some websites don't have a "charset".
For example, the ['content-type'] for this page is "text/html". I can't convert it to Unicode.
encoding=urlResponse.headers['content-type'].split('charset=')[-1]
htmlSource = unicode(htmlSource, encoding)
TypeError: 'int' object is not callable
Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?

Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?
No, there isn't. You must guess.
Trivial approach: try and decode as UTF-8. If it works, great, it's probably UTF-8. If it doesn't, choose a most-likely encoding for the kinds of pages you're browsing. For English pages that's cp1252, the Windows Western European encoding. (Which is like ISO-8859-1; in fact most browsers will use cp1252 instead of iso-8859-1 even if you specify that charset, so it's worth duplicating that behaviour.)
If you need to guess other languages, it gets very hairy. There are existing modules to help you guess in these situations. See eg. chardet.

Well, I just browsed the given URL, which redirects to
http://www.engadget.com/2009/11/23/apple-hits-back-at-verizon-in-new-iphone-ads-video
then hit Ctrl + U (view source) in Firefox and it shows
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
#Konrad: what do you mean "seems as though ... uses ISO-8859-1"??
#alex: what makes you think it doesn't have a "charset"??
Look at the code you have (which we guess is the line that cause the error (please always show full traceback and error message!)):
htmlSource = unicode(htmlSource, encoding)
and the error message:
TypeError: 'int' object is not callable
That means that unicode doesn't refer to the built-in function, it refers to an int. I recall that in your other question you had something like
if unicode == 1:
I suggest that you use some other name for that variable -- say use_unicode.
More suggestions: (1) always show enough code to reproduce the error (2) always read the error message.

htmlSource=htmlSource.decode("utf8") should work for most cases, except you are crawling non-English encoding sites.
Or you could write the force decode function like this:
def forcedecode(text):
for x in ["utf8","sjis","cp1252","utf16"]:
try:return text.decode(x)
except:pass
return "Unknown Encoding"

If there's no explicit content type, it should be ISO-8859-1 as stated earlier in the answers. Unfortunately that's not always the case, which is why browser developers spent some time on getting algorithms going that try to guess the content type based on the content of your page.
Luckily for you, Mark Pilgrim did all the hard work on porting the Firefox implementation to Python, in the form of the chardet module. His introduction on how it works for one of the chapters of Dive Into Python 3 is also well worth reading.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.