I have a Python script and recently noticed that I was hitting encoding errors on certain input. I noticed that "smart quotes" were causing problems, and I'd like advice on how to overcome this. I am using Python 2, so I need to tell my script that I want to encode everything in UTF-8.
I thought doing this was enough:
mystring.encode("utf-8")
and largely it worked, until I came across smart quotes (and there are possibly many other things that will cause problems, which is why I'm posting here). For example:
mystring = "hi"
mystring.encode("utf-8")
output is
'hi'
But for this:
mystring = "’"
mystring.encode("utf-8")
output is
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-f563327dcd27> in <module>()
----> 1 mystring.encode("utf-8")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
I created a function to handle the JSON input I get (sometimes I get null/None values, and sometimes numeric values, although mostly unicode, hence the couple of if statements):
def xstr(s):
    if s is None:
        return ''
    if isinstance(s, basestring):
        return str(s.encode("utf-8"))
    else:
        return str(s)
This has worked quite well (until this smart-quotes issue).
The two questions I have are:
Why can't "smart quotes" be encoded in UTF-8, and are there other limitations of UTF-8, or am I completely misinterpreting what I am seeing?
Is the approach I have used (i.e. using my custom function) the best way to handle this? I tried using a try/except to catch the cases of smart quotes, but that didn't work.
Python cannot encode the string because it doesn't know its current encoding. You'll need to use u"’" in Python 2 to tell Python that this is a Unicode string. ("\xe2" happens to be the first byte of the UTF-8 encoding of this character, but Python doesn't know that it's in UTF-8 because you haven't told it. You could put a -*- coding: utf-8 -*- comment near the top of your file, or unambiguously represent the character as u"\u2019".)
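For example, a quick interactive check (the three bytes are the UTF-8 encoding of U+2019, the right single quotation mark):
>>> mystring = u"\u2019"        # a Unicode literal; Python now knows the code point
>>> mystring.encode("utf-8")
'\xe2\x80\x99'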
Similarly, to convert a string you read from disk, you have to decode it into Unicode so that you can then encode it as UTF-8.
print(s.decode('iso-8859-1').encode('utf-8'))
Here, of course, 'iso-8859-1' is just a guess. You have to know the actual encoding, or risk getting incorrect output.
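As for your helper function: a minimal sketch, assuming the values come from Python's json module, which already gives you unicode objects for strings (so no decode step is needed):
def xstr(s):
    # Coerce JSON values (None, numbers, unicode) to UTF-8 byte strings.
    if s is None:
        return ''
    if isinstance(s, unicode):
        return s.encode('utf-8')   # explicit encode; no implicit ascii decode
    return str(s)                  # numbers and other non-text values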
Related
I have the following code:
return render_template(
    'sample.html',
    title=('Härre').decode('utf-8'),
    year=datetime.now().year,
    message=("Härre guut").decode('utf-8')
)
The above code works fine, but I want to know whether it is possible to enable automatic decoding of special characters for a specific class, so that my code becomes:
return render_template(
    'sample.html',
    title=('Härre'),
    year=datetime.now().year,
    message=("Härre guut")
)
If so, how is it done?
In my case the special character is the letter ä, and it can be decoded with UTF-8.
I have tried to add the following as the first line of my code:
# -*- coding: utf-8 -*-
However, that doesn't help. I get the following error if I enable the line above and take out all the decode('utf-8') parts:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 12: ordinal not in range(128)
That is a very clear error: it tries to use the 'ascii' codec to decode the special characters in the code.
If you wonder why I want to enable such functionality, the answer is that in the future I'm going to have many more render_template() calls. The method is used with the Flask framework.
If you're only talking about strings which you are typing, I'd try putting a u in front of them (e.g. u'string'). The u tells Python the string is Unicode. I've used that successfully in similar circumstances in the past.
Also, according to the Flask docs (way at the bottom), to use that # -*- coding: ... bit properly, you have to set your text editor to UTF-8; otherwise it's saving in ASCII, which might be why you're still getting that error.
At the risk of being too trivial: you could make this decoding automatic by using Python 3, which uses Unicode instead of ASCII.
Otherwise, as Brandon indicated, you do have to take pains with Python 2 to appropriately decode your inputs, use Unicode within your program, prefacing string literals with u, and encode when you output data, usually as UTF-8, but whatever is appropriate for you.
Decoding and encoding at "the edges" and using Unicode in between is a lot of work, but necessary. It's a good reason to make the switch to Python 3. :-)
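One Python 2 feature that comes close to the "automatic" behavior asked about here is the unicode_literals future import, which makes every plain string literal in the file a unicode string. A minimal sketch, assuming the file is saved as UTF-8 and a view function named sample:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # all literals below are unicode

from datetime import datetime
from flask import render_template

def sample():
    return render_template(
        'sample.html',
        title='Härre',            # already unicode; no .decode() needed
        year=datetime.now().year,
        message='Härre guut',
    )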
Simply define your strings as Unicode strings using the u prefix:
return render_template(
    'sample.html',
    title=(u'Härre'),
    year=datetime.now().year,
    message=(u"Härre guut")
)
As you've baked non-ASCII into your source code, set the 'coding' bit at the top of your source code. Ensure your editor's character encoding matches the coding header.
# -*- coding: utf-8 -*-
I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (obtained using MySQLdb.cursors.DictCursor) row by row to transform all the string values to unicode (for example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
The problem is that one of the rows has a value like {'column_1':'Gabriel García Márquez'}, and it does not get transformed; it throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this, it seemed to have something to do with ASCII encoding.
The solutions I tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected, it ignores the non-ASCII characters and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... does the job, but I am afraid it will support only Western European characters (as described here)
Can anybody point me in the right direction, please?
Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
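A quick sanity check in the shell (0xed is í in latin-1, which matches the "position 12" in your error message):
>>> 'Gabriel Garc\xeda M\xe1rquez'.decode('latin1')
u'Gabriel Garc\xeda M\xe1rquez'
>>> print 'Gabriel Garc\xeda M\xe1rquez'.decode('latin1')
Gabriel García Márquez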
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
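For illustration, here is a hedged sketch of what that kind of fix looks like; the connection parameters are placeholders, while charset and use_unicode are standard MySQLdb connect options:
import MySQLdb
import MySQLdb.cursors

# Ask the driver for a UTF-8 connection and have it hand back unicode
# objects directly, so no manual .decode() pass is needed afterwards.
conn = MySQLdb.connect(
    host='localhost', user='me', passwd='secret', db='mydb',
    charset='utf8', use_unicode=True,
    cursorclass=MySQLdb.cursors.DictCursor,
)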
Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
My background is in Perl, but I'm giving Python plus BeautifulSoup a try for a new project.
In this example, I'm trying to extract and present the link targets and link text contained in a single page. Here's the source:
table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))
All those explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help -- it's likely that I'm completely misunderstanding something about how Python 2.7 handles Unicode strings.
Anyway. This works fine up until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:
Traceback (most recent call last):
File "./test2.py", line 30, in <module>
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
Presumably .format(), even applied to a Unicode string, is playing silly-buggers and trying to do a .decode() operation. And as ASCII is the default, it's using that, and of course it can't map U+2013 to an ASCII character, and thus...
The options seem to be to remove it or convert it to something else, but really what I want is to simply preserve it. Ultimately (this is just a little test case) I need to be able to present working clickable links.
The BS3 documentation suggests changing the default encoding from ASCII to UTF-8, but reading comments on similar questions, that looks to be a really bad idea, as it'll muck up dictionaries.
Short of using Python 3.2 instead (which means no Django, which we're considering for part of this project) is there some way to make this work cleanly?
First, note that your two code samples disagree on the text of the problematic line:
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
vs
line_out = unicode(table_row.format(link_text, link_target))
The first is the one from the traceback, so it's the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte-string, because you took a unicode string and encoded it. Byte strings can't be encoded, so Python 2 implicitly converts table_row from byte-string to unicode by decoding it as ascii. Hence the error message, "UnicodeDecodeError from ascii".
You need to decide what strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.
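A minimal sketch of the question's snippet under that discipline, keeping everything unicode and encoding exactly once at the output edge (this assumes BeautifulSoup 4, whose get_text() and attribute access already return unicode):
table_row = u'<tr><td>{}</td><td>{}</td></tr>'        # stays unicode
link_text = link.get_text()                           # already unicode
link_target = link['href']                            # already unicode
line_out = table_row.format(link_text, link_target)   # unicode in, unicode out
print line_out.encode('utf-8')                        # encode once, at the edge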
Here's a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop The Pain?
So when I post a name or text in mod_python in my native language, I get:
&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;
And I also get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
When I use:
hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))
How can I decode it?
It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of:
Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character[1]. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- thousands of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.
The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.
To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.
[1] Sort of.
EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.
Anyway, if you step through what you're doing in the shell you'll see
>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'
I'm using Python 2.7 here, so that's a Unicode string i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like
>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
But we could also pick a different encoding!
>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'
You'll need to decide what encoding you want to use.
What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try
>>> text.encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
you just get an error, because you can't encode those code points in ASCII.
When you do req.write, you are trying to write a list of code points down the request. But the request doesn't understand code points: it just carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII, but not if they aren't.
So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
I am a newbie in Python.
I have a unicode string in Tamil.
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
The problem is that when I use text = testString.decode("utf-8") I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>".
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decode means to convert from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', encoding will fail. That's why you get a UnicodeEncodeError instead of a UnicodeDecodeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
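For instance, with a hypothetical Tamil unicode string (the word தமிழ்):
>>> testString = u'\u0ba4\u0bae\u0bbf\u0bb4\u0bcd'   # u'தமிழ்'
>>> text = testString.encode("utf-8")                # unicode -> UTF-8 bytes
>>> print repr(text)
'\xe0\xae\xa4\xe0\xae\xae\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d'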
Add this as the first line of your code:
# -*- coding: utf-8 -*-
Later in your code:
text = unicode(testString,"UTF-8")
You need to know which character encoding testString is using; if it is not UTF-8, an error will occur when using decode('utf8').