Decoding if it's not unicode - python

I want my function to take an argument that could be a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this:
def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')
    ...
Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.
During my experiments with decoding, I have run into several weird behaviours of Python. For instance:
>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)
Or
>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported
By the way, I'm using Python 2.6.

You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.
def myfunction(text):
    try:
        return unicode(text, 'utf-8')
    except TypeError:
        # text was already a unicode object; decoding it raises TypeError
        return text

print(myfunction(u'cer\xf3n'))
# cerón
When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method.
That implicit conversion from unicode object to string object often fails, because Python 2 uses the ascii codec by default.
So, in general, never try to decode unicode objects. Or, if you must try, trap it in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.
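A minimal sketch of what happens under the hood when you call decode on a unicode object (the intermediate names are just illustrative):

s = u'cer\xf3n'
tmp = s.encode('ascii')       # implicit coercion to str; raises UnicodeEncodeError
result = tmp.decode('utf-8')  # never reached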
See this Python bug ticket for an interesting discussion of the issue,
and also Guido van Rossum's blog:
"We are adopting a slightly different
approach to codecs: while in Python 2,
codecs can accept either Unicode or
8-bits as input and produce either as
output, in Py3k, encoding is always a
translation from a Unicode (text)
string to an array of bytes, and
decoding always goes the opposite
direction. This means that we had to
drop a few codecs that don't fit in
this model, for example rot13, base64
and bz2 (those conversions are still
supported, just not through the
encode/decode API)."

I'm not aware of any good way to avoid the isinstance check in your function, but maybe someone else will be. I can point out that the two weirdnesses you cite are because you're doing something that doesn't make sense: trying to decode into Unicode something that's already decoded into Unicode.
The first should instead look like this, which decodes the UTF-8 encoding of that string into the Unicode version:
>>> 'cer\xc3\xb3n'.decode('utf-8')
u'cer\xf3n'
And your second should look like this (not using a u'' Unicode string literal):
>>> unicode('hello', 'utf-8')
u'hello'

Related

Python UnicodeEncodeError for u'\u2019' while trying to create a CSV or export

I'm trying to export some data to CSV from out of a database, and I'm struggling to understand the following UnicodeEncodeError:
>>> sample
u'I\u2019m now'
>>> type(sample)
<type 'unicode'>
>>> str(sample)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
>>> print sample
I’m now
>>> sample.encode('utf-8', 'ignore')
'I\xe2\x80\x99m now'
I'm confused. Is it unicode or not? What does the UnicodeEncodeError actually mean in this context? Why does print work just fine? If I want to be able to save this data to a CSV file, how can I handle the encoding so that it does not generate an error when I try to use csv.writer's writerow?
Thanks for your help.
It is a Python unicode object; you used type(sample) to verify that. It contains Unicode text, so you can serialize it to a file using one of the Unicode encodings.
The encoding error needs to be read carefully: it is the "ascii" codec that can't represent that string. ASCII is just the Unicode subset with codepoints below 128. Your string uses codepoint 0x2019, so it can't be encoded with ASCII.
print works because it is correctly implemented and it doesn't try to encode the string as ASCII. I think you would get similar errors if stdout was set up with e.g. Latin-1 as encoding, but it seems your system can handle a wider range of Unicode than that.
In order to write a CSV file, you could just use UTF-8 as the encoding for that file. I haven't used the CSV module though, so I'm not sure exactly how. In any case, if it doesn't work, you should provide the exact code that fails as an MCVE in a different question.
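A minimal sketch of how that could look on Python 2 (the filename and single-column layout are made up for illustration; Python 2's csv module works on byte strings, so each unicode field is encoded to UTF-8 first):

import csv

sample = u'I\u2019m now'
with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    # encode each unicode field before handing it to writerow
    writer.writerow([sample.encode('utf-8')])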
BTW: Please upgrade to Python 3! It has many improvements over the 2.x series, also concerning string/Unicode handling.

json encoding isn't considering encoding argument in python

Trying to load a JSON file using the utf-8-sig codec, with this code:

data = json.load(open("data.json", encoding="utf-8-sig"))

But it appears that the encoding argument is being ignored, throwing this error:
Traceback (most recent call last):
  File "app1.py", line 11, in <module>
    print(k, v)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 141: ordinal not in range(128)
Edit: the type of the opened file object is <class '_io.TextIOWrapper'>, and here's the full script:

import json

data = json.load(open("data.json", encoding="utf-8-sig"))
for k, v in data.items():
    print(k, v)
Edit 2: binary sample of the file using print(open("data.json", "rb").read(180)):

b'{"abandoned industrial site": ["Site that cannot be used for any purpose, being contaminated b y pollutants."], "abandoned vehicle": ["A vehicle that has been discarded in the envir'
As @tdelaney pointed out in the comments, you are looking at the wrong problem:
The error is not in open, nor in json.load, nor in the data.items() iterator, so it is not a decoding problem between utf-8-sig and a unicode string.
The problem is in print. The error is an encoding error, which means it happens between a Python unicode string and a resulting binary encoding: the unicode string is converted to ascii, but ascii cannot represent all of the original characters.
There are two solutions:
- You can allow your terminal to accept UTF-8 characters. This can be done by setting e.g. LANG=C.UTF-8 (supported on only a few systems) or LANG=en_US.UTF-8 or another locale (check which locales on your system support UTF-8).
- You can force print to emit only ASCII, as proposed by @tdelaney: print(k.encode('ascii', 'replace'), v.encode('ascii', 'replace')). You may want to change replace to backslashreplace or ignore (see https://docs.python.org/3/library/codecs.html#codec-base-classes).
There are also hybrid solutions (not really safe or portable across systems), like forcing Python to output in a particular encoding, and many hacks, but they make your code complex, so I do not recommend them (and if you use one, you should wrap it in a new function).
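For example, a sketch of the "force a particular output encoding" hybrid (this relies on sys.stdout.reconfigure(), which only exists on Python 3.7+; earlier versions would need to wrap sys.stdout by hand):

import sys

# Re-encode everything written to stdout as UTF-8, escaping anything unencodable.
sys.stdout.reconfigure(encoding='utf-8', errors='backslashreplace')

for k, v in data.items():
    print(k, v)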

UnicodeDecodeError during encode?

We're running into a problem (which is described at http://wiki.python.org/moin/UnicodeDecodeError -- read the second paragraph, '...Paradoxically...').
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
But of course, this works without any problems
>>> unicode(u'\xab')
u'\xab'
Of course, this code is just to demonstrate the conversion problem. In our actual code, we are not using string literals, so we cannot just prepend the u prefix; instead we are dealing with strings returned from os.walk(), and the file names include the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs to us is to write our own str2uni() method, something like:
def str2uni(val):
    r"""Brute-force coercion of str -> unicode."""
    try:
        return unicode(val)   # the original said 'src' here, which is undefined
    except UnicodeDecodeError:
        pass
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res
But before we do this -- wanted to see if anyone else had any insight?
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _, _, files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)
That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)
UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since I am not used to "can't get there from here" information as a developer, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.
I think you're confusing Unicode strings and Unicode encodings (like UTF-8).
os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).
Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.
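A minimal sketch of that approach (the path is illustrative, and it assumes every name on disk decodes cleanly -- see the os.walk(u".") traceback in a later answer for what happens when one doesn't):

import os

for dirpath, dirnames, filenames in os.walk(u'/path/to/folder'):
    for fname in filenames:
        print repr(fname)   # already a unicode object, nothing left to decode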
If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:
>>> "\xab".decode("cp850")
u'\xbd'
Then you can encode that into UTF-8 or any other encoding.
>>> _.encode("utf-8")
'\xc2\xbd'
If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:
"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.
for fname in files:
    filename = unicode(fname)

The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, instead of unicode(fname) you should do fname.decode('<the encoding here>').
I would suggest a specific encoding, but you don't tell us what \xab is supposed to be in your .link file. You can search Google for the encoding anyway, so the code would look like this:

for fname in files:
    filename = fname.decode('<encoding>')
UPDATE: For example, if the encoding of your filesystem's names is ISO-8859-1, then the \xab char would be "«". To read it into Python you should do:

for fname in files:
    filename = fname.decode('latin1')  # 'latin1' is a synonym for ISO-8859-1
Hope this helps!
As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):
"File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string."
Windows provides a Unicode API to access the filesystem, so this problem shouldn't occur there.
Python 2.7 (utf-8 filesystem on Linux):
>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/os.py", line 284, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 71, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Python 3.3:
>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]
Your str2uni() function (note the src/val name mix-up in the original) tries to solve the same issue as the "surrogateescape" error handler on Python 3. Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
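A small Python 3 illustration of what surrogateescape does with an undecodable byte (using the filename from the question):

raw = b'3\xab Floppy (A).link'
name = raw.decode('utf-8', 'surrogateescape')          # '3\udcab Floppy (A).link'
assert name.encode('utf-8', 'surrogateescape') == raw  # round-trips losslessly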
'\xab'
Is a byte, number 171.
u'\xab'
Is a character, U+00AB Left-pointing double angle quotation mark («).
u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.
To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
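For example, in a Python 2 session:

>>> u'\xab'.encode('utf-8')     # characters -> bytes
'\xc2\xab'
>>> '\xc2\xab'.decode('utf-8')  # bytes -> characters
u'\xab'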
>>> unicode('\xab')
UnicodeDecodeError
unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.
>>> unicode(u'\xab')
u'\xab'
Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.
res = u''
for ch in val:
    res += unichr(ord(ch))
return res
The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:
return unicode(val, 'iso-8859-1')
(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
One really horrible hack that occurs is to write our own str2uni() method
This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.
filename = unicode(fname)
So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:
filename = unicode(fname, sys.getfilesystemencoding())
Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:
for _, _, files in os.walk(u'/path/to/folder'):  # note the u'' string
    for fname in files:
        filename = fname  # nothing more to do!
PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.

Confusion about Python encoding

I retrieved data encoded in Big5 from a database, and I want to send the data in an HTML email. The code is like this:

html += """<tr><td>"""
html += unicode(rs[0], 'big5')  # rs[0] is data encoded in big5

I run the script, but this error is raised: UnicodeDecodeError: 'ascii' codec can't decode byte...... However, when I tried the code at the interactive Python command line, no errors were raised. Could you give me a clue?
If html is not already a unicode object but a normal string, it is converted to unicode when it is concatenated with the converted version of rs[0]. If html now contains special characters you can get a unicode error.
So the other contents of html also need to be correctly decoded to unicode. If the special characters come from string literals, you could use unicode literals (like u"abcä") instead.
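For instance, once every piece is a unicode object, the concatenation is safe (reusing the big5 byte string from the demonstration further below):

>>> u'<tr><td>' + unicode('\xc3\x60', 'big5')
u'<tr><td>\u56a5'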
Your call to unicode() is working correctly. It is the concatenation, which is adding a unicode object to a byte string, that is causing trouble. If you change the first line to u'''<tr><td>''', (or u'<tr><td>') it should work fine.
Edit: This means your error lies in the data that is already in html by the time Python reaches this snippet:
>>> '\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 0: ordinal not in range(128)
>>> u'\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
u'\x9f<tr><td>\u56a5'
>>>

Does python's print function handle unicode differently now than when Dive Into Python was written?

I'm trying to work my way through some frustrating encoding issues by going back to basics. In Dive Into Python example 9.14 (here) we have this:
>>> s = u'La Pe\xf1a'
>>> print s
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print s.encode('latin-1')
La Peña
But on my machine, this happens:
>>> sys.getdefaultencoding()
'ascii'
>>> s = u'La Pe\xf1a'
>>> print s
La Peña
I don't understand why these are different. Anybody?
The default encoding for print doesn't depend on sys.getdefaultencoding(), but on sys.stdout.encoding. If you launch Python with e.g. LANG=C, or redirect a Python script's output to a file, the encoding for stdout will be ANSI_X3.4-1968 (i.e. ASCII). On the other hand, if sys.stdout is a terminal, it will use the terminal's encoding.
To explain what sys.getdefaultencoding() does: it's used when implicitly converting strings from/to unicode. In this example, str(u'La Pe\xf1a') with the default ASCII encoding would fail, but with a modified default encoding it would encode the string to Latin-1. However, setting the default encoding is a horrible idea; you should always use an explicit encoding when you want to go from unicode to str.
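A quick way to inspect both values (what they show depends on your terminal and locale; on Python 2, sys.stdout.encoding is None when output is redirected):

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> sys.getdefaultencoding()
'ascii'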
