Given a string
u = 'abc'
which is the right syntax to encode it into UTF-8:
u.encode('utf-8')
or
u.encode('utf8')
And how do I know whether I have already encoded it in UTF-8?
First of all, you need to make a distinction depending on whether you're talking about Python 2 or Python 3, because unicode handling is one of the biggest differences between the two versions.
Python 2
unicode type contains text characters
str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
s.decode(encoding) takes a sequence of bytes and builds a text string out of it, once given the encoding used by the bytes. It goes from str to unicode; for example "Citt\xe0".decode("iso8859-1") will give you the text "Città" (Italian for city), and the same will happen for "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, in which case it means "use the default encoding".
u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing what decode does. It goes from unicode to str. As above, the encoding can be omitted.
Part of the confusion when handling unicode with Python is that the language tries to be a bit too smart and does things automatically.
For example, you can call encode on a str object too, and the meaning is "first decode these bytes using the default encoding, then encode the resulting text using the specified encoding (or the default encoding if none is given)".
Similarly, you can call decode on a unicode object, meaning "first encode this text using the default encoding, then decode the resulting bytes using the specified encoding (or the default)".
For example if I write
u"Citt\u00e0".decode("utf-8")
Python gives this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in
position 3: ordinal not in range(128)
NOTE: the error is about encoding failing, even though I asked for decoding. The reason is that I asked to decode text (which makes no sense, as that is already "decoded"... it's text) and Python decided to first encode it using the "ascii" encoding, and that failed. IMO it would have been much better to simply not have decode on unicode objects and not have encode on str objects: the error message would have been clearer.
Adding to the confusion, in Python 2 str is used for unencoded bytes, but it's also used everywhere for text; for example, string literals are str objects.
Python 3
To solve some of the issues Python 3 made a few key changes
str is for text and contains unicode characters, string literals are unicode text
unicode type doesn't exist any more
bytes type is used for 8-bit bytes sequences that may represent text in some unspecified encoding
For example in Python 3
'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8') → b'Citt\xc3\xa0'
Also, you cannot call decode on text strings, and you cannot call encode on byte sequences.
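A minimal sketch of the Python 3 round trip (the names are just illustrative):

text = 'Città'                      # a str of unicode characters
data = text.encode('utf-8')         # str -> bytes: b'Citt\xc3\xa0'
back = data.decode('utf-8')         # bytes -> str: 'Città'
assert back == text

# In Python 3, str has no .decode() and bytes has no .encode(),
# so the implicit conversions of Python 2 are gone:
# text.decode('utf-8')  -> AttributeError
# data.encode('utf-8')  -> AttributeError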
Failures
Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of unicode. For example iso8859-1 cannot handle Chinese. These errors can be processed in a few ways like raising an exception (default), or replacing characters that cannot be encoded with something else.
The utf-8 encoding, however, is able to encode any unicode character, and thus encoding to utf-8 never fails. So it doesn't make sense to ask how to know whether encoding text into utf-8 was done correctly: it always succeeds (for utf-8).
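For example, a small sketch using the standard codec error handlers:

try:
    '北京'.encode('iso8859-1')                # Chinese is outside iso8859-1
except UnicodeEncodeError as exc:
    print(exc)                                # 'latin-1' codec can't encode characters ...

print('北京'.encode('iso8859-1', errors='replace'))   # b'??'
print('Città 北京'.encode('utf-8'))                    # utf-8 never fails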
Also decoding may fail, because the sequence of bytes may make no sense in the specified encoding. For example the sequence of bytes 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8 because the byte 0xE0 cannot appear without a proper prefix.
There are encodings, like iso8859-1, where decoding cannot fail, because every byte 0..255 has a meaning as a character. Most "local encodings" are of this type: they map all 256 possible 8-bit values to some character, but cover only a tiny fraction of the unicode characters.
Decoding using iso8859-1 will never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were actually written in another encoding.
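Again as a small sketch:

data = b'Citt\xe0'                  # 0x43 0x69 0x74 0x74 0xE0

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)                      # 'utf-8' codec can't decode byte 0xe0 ...

print(data.decode('iso8859-1'))     # 'Città' -- never raises
print(data.decode('iso8859-5'))     # nonsense text ('Cittр'), but still no error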
First solution:
isinstance(u, unicode)
Second solution:
try:
    u.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(u)
except UnicodeError:
    print "string is not UTF-8"
Related
I know that using this code can remove the b prefix
>>> b'Hello'
b'Hello'
>>> b'Hello'.decode() # decodes bytes type
'Hello'
But if I use a unicode escape (with unicode_escape codec because the utf-8 codec struggles), it works fine... [Python 3.10, Windows 10, AMD64]
>>> b'Hello\xeb'.decode() # utf-8 codec does not work that well
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
b'Hello\xeb'.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 5: unexpected end of data
>>> b'Hello\xeb'.decode('unicode_escape')
'Helloë'
...but for example if I use a .exe file it does not work (and the b prefix is still there??)
>>> # something that reads the file, e.g. with open('autoclicker.exe', 'rb') ...
b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00\x00\x00\x00\x00 ... \xcd!This program cannot be run in DOS mode.\r\r\n$\x00\x00\x00 ...
>>> b'hi\xbd'.decode('unicode_escape') # using some (unicode) escape methods work separately
'hi½'
The b'' isn't a "string prefix"; rather, it indicates that you are dealing with a sequence of bytes. Bytes can represent anything, including text, which is just a series of characters in some encoding, like UTF-8, ASCII, etc.
That's what .decode() does, it takes the sequence of bytes and interprets it as if it were a string of characters in that encoding and returns a string of those characters. Conversely, you could then encode the resulting string of characters into some other encoding by calling .encode() on the string and you'd get the sequence of bytes that represents that string in that encoding.
However, you can't just take any sequence of bytes and 'decode' it with any encoding - the bytes will have a certain encoding if they represent some string, but the example you give (an executable) doesn't represent a string of characters at all, and thus won't successfully decode into a string if you just call .decode() on it.
If you're lucky, the decoding works on the parts of the executable that are strings in that encoding, but even that's not guaranteed to work, as the strings will be surrounded by bytes that don't represent that encoding.
If you want to extract strings from an executable, you need to correctly identify what parts of the executable represent strings, extract those sequences of bytes and decode them with the correct encoding. How to do that will depend on the operating system the executable is for, whether it's 32-bit or 64-bit, etc.
Note: many programmers new to Python, or to coding in general, get confused by the fact that Python (for the sake of convenience) shows a bytes object in a form very similar to a string (it looks just like a string with a b before it). This is even more confusing if the bytes happen to be in an encoding that is UTF-8 or similar, because the contents of the bytes object will then even be readable. But that doesn't mean the bytes object actually is a string.
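As a rough illustration of the "extract strings" point, here is a naive sketch in the spirit of the Unix strings tool; it just looks for runs of printable ASCII bytes and makes no attempt to parse the executable format, so treat it as an approximation:

import re

def naive_strings(data, min_len=4):
    """Yield runs of printable ASCII bytes, decoded -- a crude 'strings'."""
    for match in re.finditer(rb'[\x20-\x7e]{%d,}' % min_len, data):
        yield match.group().decode('ascii')

with open('autoclicker.exe', 'rb') as f:        # file name from the question
    for s in naive_strings(f.read()):
        print(s)        # e.g. 'This program cannot be run in DOS mode.'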
So I have a message which is read from a file of unknown encoding. I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors and have gone through many Q&As on StackOverflow, and I think I have a decent understanding of how Unicode and encoding work. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin-1 or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
I assumed that because I'm using errors="replace" on the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.
However, it seems that I was too hopeful, and for some files I still get a UnicodeDecodeError saying "ascii codec cannot decode" for some character. Why doesn't errors="replace" just take care of this?
(also as a bonus question, what does ascii have to do with any of this?.. I'm specifying UTF-8)
You should not get a UnicodeDecodeError with errors='replace'. Also str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been ‘decoded’ from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
(Python 3 no longer does this as it is terribly confusing.)
Check the type of message and assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.
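A small Python 2 sketch of that check (the to_text name is just for illustration; message is the variable from the question):

def to_text(message):
    # Python 2: only decode if we actually have bytes (str);
    # unicode objects are already text and must not be decoded again.
    if isinstance(message, unicode):
        return message
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        return message.decode('utf-8', errors='replace')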
decode with errors='replace' implements the 'replace' error handling (for text encodings only): it substitutes '?' for encoding errors (to be encoded by the codec), and '\ufffd' (the Unicode replacement character) for decoding errors.
Here "text encodings" means a codec which encodes Unicode strings to bytes.
Maybe your data is malformed - you should try the 'ignore' error handler, where malformed data is ignored and encoding or decoding continues without further notice:
message.decode(encoding='utf-8', errors="ignore")
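For comparison, the two handlers on the same malformed input (a quick sketch):

data = b'Citt\xe0'                               # not valid UTF-8
print(data.decode('utf-8', errors='replace'))    # Citt� ('Citt\ufffd')
print(data.decode('utf-8', errors='ignore'))     # Citt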
I am trying to load a text file, which contains some German letters, with
content=open("file.txt","r").read()
which results in this error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
if I modify the file to contain only ASCII characters everything works as expected.
Apparently, using
content=open("file.txt","rb").read()
or
content=open("file.txt","r",encoding="utf-8").read()
both do the job.
Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?
In Python 3, using 'r' mode without specifying an encoding uses the platform's default encoding, which in this case is ASCII. Using 'rb' mode reads the file as bytes and makes no attempt to interpret it as a string of characters.
ASCII is limited to characters in the range [0,128). If you try to decode a byte that is outside that range, you get that error.
When you read the data in as bytes, you're "widening" the acceptable range of values to [0,256). So the byte 0xc3 (the first byte of the UTF-8 encoding of à) is now read in without error. But despite it seeming to work, it's still not "correct".
If your file is indeed UTF-8 encoded, then it may well contain a multibyte character, that is, a character whose byte representation spans multiple bytes.
It is in this case where the difference between reading a file as a byte string and properly decoding it will be quite apparent.
A character like this: č
Will be read in as two bytes, but properly decoded, will be one character:
data = bytes('č', encoding='utf-8')   # b'\xc4\x8d' -- the two UTF-8 bytes of 'č'
print(len(data))                      # 2 bytes
print(len(data.decode('utf-8')))      # 1 character
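Putting it together for the file from the question (a sketch, assuming file.txt is UTF-8 encoded):

raw = open("file.txt", "rb").read()            # bytes, e.g. b'Stra\xc3\x9fe'
text_from_bytes = raw.decode("utf-8")          # str, e.g. 'Straße'

text = open("file.txt", "r", encoding="utf-8").read()   # decoded for you

# Both give the same text (apart from newline translation in text mode).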
I have a UTF8 String piped from Java to python.
The end result is
'\xe0\xb8\x9a\xe0\xb8\x99'
Hence for example
a = '\xe0\xb8\x9a\xe0\xb8\x99'
a.decode('utf-8')
gives me the result
u'\u0e1a\u0e19'
However, what I am curious about is: since the bytes are piped in as UTF-8, why is the result
'\xe0\xb8\x9a\xe0\xb8\x99'
instead of u'\u0e1a\u0e19'?
If I were to encode u'\u0e1a\u0e19', I would get back '\xe0\xb8\x9a\xe0\xb8\x99'.
So what is the inherent difference between these two, and how do I actually know when to use decode and encode?
"UTF8 String" is insufficient to describe what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a unicode string.
Python 2's unicode type and Python 3's str type represent a string of unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A U+0E19, and in human terms it will be rendered as บน.
As for explaining the whole encode and decode business, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as utf-8 encoded input in order to get back the unicode code points it represents (which is u'\u0e1a\u0e19'). Calling encode on that string of unicode code points turns it back into a sequence of bytes (which in Python 2 will be of str type and in Python 3 will be the bytes type), giving you back the series of bytes that is '\xe0\xb8\x9a\xe0\xb8\x99'.
Of course, you can encode those unicode code points into other encodings, such as UTF-16, which on little-endian platforms results in the bytes '\xff\xfe\x1a\x0e\x19\x0e', or encode them into a non-Unicode encoding. As this looks like Thai, we can use the iso8859-11 encoding, which produces the bytes '\xba\xb9' - but this is not cross-platform, as it will only show up as Thai on systems configured for that particular encoding. This is one of the reasons why Unicode was invented: those bytes '\xba\xb9' could be decoded using the iso8859-1 encoding, in which case they render as º¹, or using iso8859-11, as บน.
In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF8 encoding of the unicode code points for u'\u0e1a\u0e19' in Python syntax. Raw bytes (coming through the wire, read from a file) are generally not in the form of unicode code points and they must be decoded into unicode code points. Unicode code points are not an encoding and when sent across the wire (or written to a file) must be encoded into some kind of byte representation for the unicode code points, which in many cases is utf-8 as it has the greatest portability.
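The whole round trip from your example, as a sketch in Python 2 syntax (matching the question):

raw = '\xe0\xb8\x9a\xe0\xb8\x99'        # bytes from Java (a Python 2 str)
text = raw.decode('utf-8')              # u'\u0e1a\u0e19', i.e. u'บน'
assert text.encode('utf-8') == raw      # encode reverses decode

text.encode('utf-16')                   # '\xff\xfe\x1a\x0e\x19\x0e' on little-endian (BOM + code units)
text.encode('iso8859-11')               # '\xba\xb9' -- Thai-specific, not portable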
Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
'\xe0\xb8\x9a\xe0\xb8\x99' is simply a series of bytes. You have chosen to interpret that as UTF-8, and when you do, you can decode it into a series of unicode characters, U+e1a and U+e19.
The sequence U+0E1A, U+0E19 can be represented as u'\u0e1a\u0e19', but in some sense that representation is as arbitrary as '\xe0\xb8\x9a\xe0\xb8\x99'. It is "natural", which is why Python prints them that way, but it's inefficient, which is why there are various other encoding schemes, including UTF-8.
In fact, it's slightly misleading for me to say "'\xe0\xb8\x9a\xe0\xb8\x99' is a series of bytes." It is the default representation of a series of bytes, two hundred twenty-four, followed by one hundred eighty-four, and so on.
Python has a notion of a series of bytes, and it has a separate notion of series of unicode characters. encode and decode represent one way of mapping between those two notions.
Does that help?
Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is outputted correctly in CDATA nodes, but that one hardcoded string keeps failing... I don't see why an ascii codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
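A minimal Python 2 reproduction of the kind of mix-up described above (illustrative only; in your case the failing mix happens somewhere inside toprettyxml):

# -*- coding: utf-8 -*-
# Python 2: mixing unicode with a non-ASCII bytestring triggers an implicit
# decode of the bytestring using the ASCII codec, which is what blows up.
name = u"Par Catégorie"                 # unicode object
raw = "Cat\xc3\xa9gorie"                # bytestring holding UTF-8 bytes

print name + raw.decode('utf-8')        # fine: decode explicitly first

try:
    name + raw                          # implicit ascii decode of raw
except UnicodeDecodeError as exc:
    print exc                           # 'ascii' codec can't decode byte 0xc3 ...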
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.