Python - decoded unicode string does not stay decoded

Python - decoded unicode string does not stay decoded - python

It may be too late at night for me to be still doing programming (so apologies if this is a very silly thing to ask), but I have spotted a weird behaviour with string decoding in Python:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> name = bs.decode("utf-8", "replace")
>>> print(name)
I n t e l ( R )
>>> list_of_dict = []
>>> list_of_dict.append({'name': name})
>>> list_of_dict
[{'name': 'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'}]
How can the list contain unicode characters if it has already been decoded?

Decoding bytes by definition produces "Unicode" (text really, where Unicode is how you can store arbitrary text, so Python uses it internally for all text), so when you say "How can the list contain unicode characters if it has already been decoded?" it betrays a fundamental misunderstanding of what Unicode is. If you have a str in Python 3, it's text, and that text is composed of a series of Unicode code points (with unspecified internal encoding; in fact, modern Python stores in ASCII, latin-1, UCS-2 or UCS-4, depending on highest ordinal value, as well as sometimes caching a UTF-8 representation, or a native wchar representation for use with legacy extension modules).
You're seeing the repr of the nul character (Unicode ordinal 0) and thinking it didn't decode properly, and you're likely right (there's nothing illegal about nul characters, they're just not common in plain text); your input data is almost certainly encoded in UTF-16-LE, not UTF-8. Use the correct codec, and the text comes out correctly:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> bs.decode('utf-16-le') # No need to replace things, this is legit UTF-16-LE
'Intel(R)'
>>> list_of_dict = [{'name': _}]
>>> list_of_dict
[{'name': 'Intel(R)'}]
Point is, while producing nul characters is legal, unless it's a binary file, odds are it won't have any, and if you're getting them, you probably picked the wrong codec.
The discrepancy between printing the str and displaying is as part of a list/dict is because list/dict stringify with the repr of their contents (what you'd type to reproduce the object programmatically in many cases), so the string is rendered with the \x00 escapes. printing the str directly doesn't involve the repr, so the nul characters get rendered as spaces (since there is no printable character for nul, so your terminal chose to render it as spaces).

So what I think is happening is that the null terminated characters \x00 are not properly decoded and remain in the string after decoding. However, since these are null characters they do not mess up when you print the string which interprets them as nothing or spaces (in my case I tested your code on arch linux on python2 and python3 and they were completely ommited)
Now the thing is that you got a \x00 character for each of your string characters when you decode with utf-8 so what this means is that your bytestream consists actually out of 16bit characters and not 8bit. Therefore, if you try to decode using utf-16 your code will work like a charm :)
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> t = bs.decode("utf-16", "replace")
>>> print(t)
Intel(R)
>>> t
'Intel(R)'

Related

Encoding unicode with 'utf-8' shows byte-strings only for non-ascii

I'm running python2.7.10
Trying to wrap my head around why the following behavior is seen. (Sure there is a reasonable explanation)
So I define two unicode characters, with only the first one in the ascii-set, and the second one outside of it.
>>> a=u'\u0041'
>>> b=u'\u1234'
>>> print a
A
>>> print b
ሴ
Now I encode it to see what the corresponding bytes would be. But only the latter gives me the results I am expecting to see (bytes)
>>> a.encode('utf-8')
'A'
>>> b.encode('utf-8')
'\xe1\x88\xb4'
Perhaps the issue is in my expectation, and if so, one of you can explain where the flaw lies.
- My a,b are unicodes (hex values of the ordinals inside)
- When I print these, the interpreter prints the actual character corresponding to each unicode byte.
- When I encode, I assumed that it would be converted into a byte-string using the encoding scheme I provide (in this case utf-8). I expected to see a bytestring for a.encode, just like I did for b.encode.
What am I missing?

There is no flaw. You encoded to UTF-8, which uses the same bytes as the ASCII standard for the first 127 codepoints of the Unicode standard, and uses multiple bytes (between 2 and 4) for everything else.
You then echoed that value in your terminal, which uses the repr() function to build a debugging representation. That representation produces a valid Python expression for strings, one that is ASCII safe. Any bytes in that value that is not printable as an ASCII character, is shown as an escape sequence. Thus UTF-8 bytes are shown as \xhh hex escapes.
Most importantly, because A is a printable ASCII character, it is shown as is; any code editor or terminal will accept ASCII, and for most English text showing the actual text is so much more useful.
Note that you used print for the unicode values stored in a and b, which means Python encodes those values to your terminal codec, coordinating with your terminal to produce the right output. You did not echo the values in the interpreter. Had you done so, you'd also seen debug output:
>>> a = u'\u0041'
>>> b = u'\u1234'
>>> a
u'A'
>>> b
u'\u1234'
In Python 3, the functionality of the repr() function (or rather, the object.__repr__ hook) has been updated to produce a unicode string with most printable codepoints left un-escaped. Use the new ascii() function to get the above behaviour.

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:
>>>print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>>print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7

There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to use the ignore_errors flag to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.

These are actually different encodings:
>>>print ["ё"]
['\xd1\x91']
>>>print [u"ё"]
[u'\u0451']
What you're seeing is the __repr__'s for the elements in the lists. Not the __str__ versions of the unicode objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to coerce the two-byte strings into double-byte width unicode:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.

To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё

How to convert a unicode character representation from string to unicode in python?

Ok I've found a lot of threads about how to convert a string from something like "/xe3" to "ã" but how the hell am I supposed to do it the other way around?
My concrete problem: I am using an API and everything works great except I provide some strings which then result in a json object. The result is sorted after the names (strings) I provided however they are returned as their unicode representation and as json APIs always work in pure strings. So all I need is a way to get from "ã" to "/xe3" but it can't for the love of god get it to work.
Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A or an unicode error that ascii can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)
All I want is the plain encoded string!
(yea no not at all past me. All you want is the unicode representation of a character as string)
PS: All in python if that wasn't obvious from the title already.
Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.
The issue was an API which provided unicode representations of characters as string as a response. All I wanted to do was checking if they are the same however I had major issues getting python to interpret the string as unicode especially since those characters were just some inside of a longer text partially with backslashes.
This did help but I just stumbled across this horribly written question and just couldn't leave it like that.

"\xe3" in python is a string literal that represents a single byte with value 227:
>>> print len("\xe3")
1
>>> print ord("\xe3")
227
This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):
>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163
This byte sequence is the UTF-8 encoding of the character "ã".
So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:
>>> "ã".decode("utf-8")
u'\xe3'
Now, you can take that unicode string and encode it however you like (e.g. into latin-1):
>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'

Please read http://www.joelonsoftware.com/articles/Unicode.html . You should realize tehre is no such a thing as "a plain encoded string". There is "an encoded string in a given text encoding". So you are really in need to understand the better the concepts of Unicode.
Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.

Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point for the character ã. Not to be mistaken for the actual bytes that UTF-8 uses to reference that code point:
http://hexutf8.com/?q=U+e3
I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.
UTF-16 maps a different byte sequence, 00 e3 to that exact same code point. (Note how much simpler, but less space efficient the UTF-16 encoding is...)

Python Latin Characters and Unicode

I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard latin characters, not the decoded version of them.

You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to string, the list contents are shown as representations; the result of the repr() function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n for example. Your UTF-8 bytes are represented by \xhh escape sequences.
If you were using Unicode objects, the representation would use \xhh escapes still, but for Unicode codepoints in the Latin-1 range (outside ASCII) only (the rest are shown with \uhhhh and \Uhhhhhhhh escapes depending on their codepoint); when printing Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length reflects that the é codepoint is encoded to two bytes as well. It was my terminal that presented Python with the \xc3\xa9 bytes when pasting the é character into the Python session, by the way, as it is configured to use UTF-8, and Python has detected this and decoded the bytes when I defined a u'..' Unicode object literal.
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

When you print a list, you get the repr of the items it contains, which for strings is different from their contents:
>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'
The output of repr is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, write your own loop. E.g.
>>> print '[', ', '.join(a), ']'
[ foo, bär ]

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally for example you would do `print(u'something'). But I need to use a variable and the quotations in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END

Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest attempting to open the input and/or output file with encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section) I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then I believe the most likely situation would be that your input file is encoded in a different Unicode format with BOM, as Python's default UTF-8 codec interprets BOMs as regular characters so the input would not have an issue but output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.

In python 2.x there are two types of string: byte string and unicode string. First one contains bytes and last one - unicode code points. It is easy to determine, what type of string it is - unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
'abc' chars are the same, because the are in ASCII range. \u0430 is a unicode code point, it is out of ASCII range. "Code point" is python internal representation of unicode points, they can't be saved to file. It is needed to encode them to bytes first. Here how encoded unicode string looks like (as it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember, what encoding we used when writing to file. Because to be able to read the data, we need to decode the content. Here what data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode it is needed to provide coding name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decode, we've got back our unicode string with unicode code points.
>>> print content.decode('utf8')
abc абв

xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM or a byte order mark and basically it's a callback to the early days of unicode when people couldn't agree which way they wanted their unicode to go. Now all unicode documents are prefaced with either an \ufeff or an \uffef depending on which order they decide to arrange their bytes in.
If you hit an error on those characters in the first location you can be sure the issue is that you are not trying to decode it as utf-8, and the file is probably still fine.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.