UTF-8 encoding and decoding in Python

I have a UTF-8 string piped from Java to Python.
The end result is
'\xe0\xb8\x9a\xe0\xb8\x99'
Hence, for example,
a = '\xe0\xb8\x9a\xe0\xb8\x99'
a.decode('utf-8')
gives me the result
u'\u0e1a\u0e19'
However, what I am curious about is this: since the bytes are piped in as UTF-8, why is the value
'\xe0\xb8\x9a\xe0\xb8\x99'
instead of u'\u0e1a\u0e19'?
If I were to encode u'\u0e1a\u0e19', I would get back '\xe0\xb8\x9a\xe0\xb8\x99'.
So what is the inherent difference between these two, and how do I actually understand when to use decode and encode?

The term "UTF8 string" is insufficient to describe what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a unicode string.
Python 2's unicode type and Python 3's str type represent a string of unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A and U+0E19, and in human terms it will be rendered as บน.
As for explaining the encode and decode calls, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as UTF-8-encoded input in order to get back the unicode code points it represents (which are u'\u0e1a\u0e19'). Calling encode on that string of unicode code points turns it back into a series of bytes (which in Python 2 will be of type str and in Python 3 will actually be of type bytes), namely '\xe0\xb8\x9a\xe0\xb8\x99'.
Of course, you can encode those unicode code points into other encodings, such as UTF-16, which on little-endian platforms will result in the bytes '\xff\xfe\x1a\x0e\x19\x0e', or you can encode those code points into a non-unicode encoding. As this looks like Thai, we can use the iso8859-11 encoding, which produces the bytes '\xba\xb9' -- but this is not cross-platform, as it will only be shown as Thai on systems configured for that particular encoding. This is one of the reasons Unicode was invented: the bytes '\xba\xb9' could be decoded using the iso8859-1 encoding, which would render as º¹, or using iso8859-11, which would render as บน.
In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF-8 encoding of the unicode code points u'\u0e1a\u0e19', in Python syntax. Raw bytes (coming through the wire, or read from a file) are generally not in the form of unicode code points, and they must be decoded into unicode code points. Unicode code points are not an encoding, and when sent across the wire (or written to a file) they must be encoded into some byte representation of the code points, which in many cases is UTF-8, as it has the greatest portability.
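To see the whole round trip in one place, here is a small Python 2 sketch using the exact bytes above (the commented outputs assume a little-endian machine):
# Python 2 sketch: decode raw bytes into code points, then re-encode them.
raw = '\xe0\xb8\x9a\xe0\xb8\x99'       # bytes received from Java
text = raw.decode('utf-8')             # u'\u0e1a\u0e19' -- unicode code points
assert text.encode('utf-8') == raw     # encoding reverses decoding
print repr(text.encode('utf-16'))      # '\xff\xfe\x1a\x0e\x19\x0e' (BOM + code units)
print repr(text.encode('iso8859-11'))  # '\xba\xb9' -- Thai-only bytes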
Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

'\xe0\xb8\x9a\xe0\xb8\x99' is simply a series of bytes. You have chosen to interpret that as UTF-8, and when you do, you can decode it into a series of unicode characters, U+0E1A and U+0E19.
The sequence U+0E1A, U+0E19 can be represented as u'\u0e1a\u0e19', but in some sense that representation is as arbitrary as '\xe0\xb8\x9a\xe0\xb8\x99'. It is "natural", which is why Python prints them that way, but it's inefficient, which is why there are various other encoding schemes, including UTF-8.
In fact, it's slightly misleading for me to say "'\xe0\xb8\x9a\xe0\xb8\x99' is a series of bytes." It is the default representation of a series of bytes: two hundred twenty-four, followed by one hundred eighty-four, and so on.
Python has a notion of a series of bytes, and it has a separate notion of series of unicode characters. encode and decode represent one way of mapping between those two notions.
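For instance, a minimal Python 2 snippet showing the two notions side by side:
# Python 2: the same data viewed as numbers and as text
raw = '\xe0\xb8\x9a\xe0\xb8\x99'
print [ord(b) for b in raw]      # [224, 184, 154, 224, 184, 153]
print repr(raw.decode('utf-8'))  # u'\u0e1a\u0e19'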
Does that help?

Related

How do I check whether I have encoded in UTF-8 successfully

Given a string
u = 'abc'
which syntax is the right one to encode it into UTF-8?
u.encode('utf-8')
or
u.encode('utf8')
And how do I know that I have already encoded it in UTF-8?
First of all, you need to make a distinction between Python 2 and Python 3, because unicode handling is one of the biggest differences between the two versions.
Python 2
unicode type contains text characters
str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
s.decode(encoding) takes a sequence of bytes and builds a text string out of it, given the encoding used by the bytes. It goes from str to unicode; for example "Citt\xe0".decode("iso8859-1") will give you the text "Città" (Italian for city), and the same will happen for "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, in which case the meaning is "use the default encoding".
u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing the processing of decode. It goes from unicode to str. As above, the encoding can be omitted.
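A short Python 2 sketch of both directions, reusing the Città example:
# Python 2: decode goes bytes -> text, encode goes text -> bytes
s = "Citt\xc3\xa0"                       # str: UTF-8 bytes
u = s.decode("utf-8")                    # unicode: u'Citt\xe0'
assert u == "Citt\xe0".decode("iso8859-1")
assert u.encode("utf-8") == s            # encode reverses decode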
Part of the confusion when handling unicode with Python is that the language tries to be a bit too smart and does things automatically.
For example, you can call encode on a str object too, and the meaning is "first decode these bytes using the default encoding, then encode the resulting text using the specified encoding (or the default encoding if none is given)".
Similarly, you can call decode on a unicode object, meaning "first encode this text using the default encoding, then decode the resulting bytes using the specified encoding (or the default encoding if none is given)".
For example if I write
u"Citt\u00e0".decode("utf-8")
Python gives this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 3: ordinal not in range(128)
NOTE: the error is about encoding failing, while I asked for decoding. The reason is that I asked to decode text (which is nonsense, because that is already "decoded"... it's text), and Python decided to first encode it using the "ascii" encoding, and that failed. IMO it would have been much better to simply not have decode on unicode objects and not have encode on str objects: the error messages would have been clearer.
Adding to the confusion, in Python 2 str is used for unencoded bytes, but it's also used everywhere for text; for example, string literals are str objects.
Python 3
To solve some of these issues, Python 3 made a few key changes:
str is for text and contains unicode characters; string literals are unicode text
the unicode type doesn't exist any more
bytes type is used for 8-bit bytes sequences that may represent text in some unspecified encoding
For example in Python 3
'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8') → b'Citt\xc3\xa0'
Also, you cannot call decode on text strings, and you cannot call encode on byte sequences.
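A minimal Python 3 sketch of the new split:
# Python 3: str is text, bytes is raw data
text = 'Città'
assert text.encode('iso8859-1') == b'Citt\xe0'
assert text.encode('utf-8') == b'Citt\xc3\xa0'
assert b'Citt\xc3\xa0'.decode('utf-8') == text
# text.decode(...) and text.encode on bytes simply don't exist in Python 3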
Failures
Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of unicode. For example, iso8859-1 cannot handle Chinese. These errors can be processed in a few ways, such as raising an exception (the default) or replacing characters that cannot be encoded with something else.
The encoding utf-8, however, is able to encode any unicode character, and thus encoding to utf-8 never fails. Thus it doesn't make sense to ask how to know if encoding text into utf-8 was done correctly: it always succeeds.
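A Python 3 sketch of an encoding failure and the standard error handlers (the Chinese sample text is illustrative):
# Python 3: encoding failures and error handlers
try:
    '轮渡'.encode('iso8859-1')     # Chinese characters don't fit in Latin-1
except UnicodeEncodeError as e:
    print(e)
print('轮渡'.encode('iso8859-1', errors='replace'))  # b'??'
print('轮渡'.encode('utf-8'))      # never fails: b'\xe8\xbd\xae\xe6\xb8\xa1'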
Decoding may also fail, because the sequence of bytes may make no sense in the specified encoding. For example, the sequence of bytes 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8, because the lead byte 0xE0 cannot appear without the continuation bytes that must follow it.
There are encodings like iso8859-1, however, where decoding cannot fail, because every byte 0..255 has a meaning as a character. Most "local encodings" are of this type: they map all 256 possible 8-bit values to some character, but cover only a tiny fraction of the unicode characters.
Decoding using iso8859-1 will never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were produced using another encoding.
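And the matching sketch for the decoding side:
# Python 3: decoding failures (and an encoding that never fails to decode)
try:
    b'Citt\xe0'.decode('utf-8')    # 0xE0 is missing its continuation bytes
except UnicodeDecodeError as e:
    print(e)
print(b'Citt\xe0'.decode('iso8859-1'))  # always succeeds: 'Città'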
First solution:
isinstance(u, unicode)
Second solution:
try:
    u.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(u)
except UnicodeError:
    print "string is not UTF-8"

Write a unicode character to a file in a binary way

I have the following code to write an ASCII "@" character to a file in a binary fashion:
fin = open('a.bin', 'wb')
fin.write('\x40')
fin.close()
It turns out that a "@" character has been written to "a.bin", which has a length of 1 byte.
However, when I tried to write a unicode character instead:
fin = open('a.bin', 'wb')
fin.write(u'\x40')
fin.close()
It turned out that "a.bin" is still 1 byte long. I thought it should be 2 bytes long, since a unicode character takes 2 bytes. There may be some trivial thing that I overlooked.
You are confusing Unicode with encodings. An encoding is a standard that represents text within the confines of individual values in the range 0-255 (bytes), while Unicode is a standard that describes codepoints representing textual glyphs. The two are related, but not the same thing.
The Unicode standard includes several encodings. UTF-16 is one such encoding that uses 2 bytes per codepoint, but it is not the only encoding included in the standard. UTF-8 is another such encoding, and it uses a variable number of bytes per codepoint.
Your file, however, is written using ASCII, the default codec used by Python 2 when you do not specify an explicit encoding. If you expected to see 2 bytes per codepoint, encode to UTF-16 explicitly:
fin.write(u'\x40'.encode('utf-16-le'))
This writes UTF-16 in little-endian byte order; there is also a utf-16-be codec. Normally, for multi-byte encodings like UTF-16 or UTF-32, you'd also include a BOM, or Byte Order Mark; it is included automatically when you write UTF-16 without picking an endianness:
fin.write(u'\x40'.encode('utf-16'))
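A quick way to convince yourself is to compare the byte counts each codec produces; a Python 2 sketch:
# Python 2: byte lengths of u'@' under different encodings
u = u'\x40'                        # u'@'
print len(u.encode('ascii'))       # 1
print len(u.encode('utf-16-le'))   # 2
print len(u.encode('utf-16'))      # 4 (2-byte BOM + 2-byte code unit)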
I strongly urge you to study up on Unicode, codecs and Python before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Character numbers from U+0000 to U+007F (US-ASCII repertoire) correspond to octets 00 to 7F (7 bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string.
UTF-8, a transformation format of ISO 10646
Martijn is right in his elaborate answer: learn more about Unicode first. But a shorter answer than reading those large educational documents is this:
When writing a Python unicode value (u'\x40' in your case) to a stream (an open file in your case), this abstract unicode value must be converted into a concrete stream of bytes. For this, encodings are used.
You can do this explicitly (by using u'\x40'.encode('foo')) or implicitly, in which case some default encoding is used. In your case that is either "ascii" or "utf8", both of which represent a unicode @ as a single byte with value 0x40.
What you seem to want is an encoding in which the unicode @ is represented as a two-byte value; that would be the encoding utf-16, for instance.

Python String Comparison--Problems With Special/Unicode Characters

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's an ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database gives me Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could later be encoded into 8-bit ASCII or into 32-bit Chinese characters, but until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
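Putting that advice into a small helper might look like this (a sketch: the name to_unicode and the 'latin1' default are assumptions, and str1/str2 stand for the two values being compared; swap in whatever encoding your data really uses):
# Python 2 sketch: normalize both values to unicode before comparing
def to_unicode(s, encoding='latin1'):
    # hypothetical helper; 'latin1' is only a guess at the input encoding
    if isinstance(s, unicode):
        return s
    return s.decode(encoding)

if to_unicode(str1) == to_unicode(str2):
    print "same"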
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
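For example, in a Python 2 session (the sample values are only illustrative):
# Python 2: inspecting what kind of string you actually have
s = 'caf\xc3\xa9'       # a str holding UTF-8 bytes
u = u'caf\xe9'          # a unicode string
print type(s), repr(s)  # <type 'str'> 'caf\xc3\xa9'
print type(u), repr(u)  # <type 'unicode'> u'caf\xe9'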

Python: what does "...".encode("utf8") fix?

I wanted to URL-encode a Python string and got exceptions with Hebrew strings.
I couldn't fix it and started doing some guess-oriented programming.
Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).
My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
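In a Python 2 session, the suggestion above looks like this:
# Python 2: unicode object vs. its UTF-8 byte string
my_u_str = u"\u2119ython"            # the first character is U+2119 (ℙ)
print type(my_u_str)                 # <type 'unicode'>
print type(my_u_str.encode('utf8'))  # <type 'str'>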
Your original string was a unicode object containing raw Unicode code points; after encoding it as UTF-8, it is a normal byte string that contains UTF-8 encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
What does .encode("utf8") do?
It depends on which version of Python you're using:
In Python 3.x, it converts a str object (a sequence of Unicode code points) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.
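A Python 2 sketch of the difference (the Hebrew sample string is illustrative):
# Python 2: urllib.quote wants bytes, not non-ASCII unicode
import urllib
u = u'\u05e9\u05dc\u05d5\u05dd'        # a Hebrew word, four code points
print urllib.quote(u.encode('utf-8'))  # '%D7%A9%D7%9C%D7%95%D7%9D'
# urllib.quote(u) raises an exception: the characters are not ASCII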
"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.
It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply one way of encoding Unicode. Python can work with many other encodings (e.g. mystr.encode("utf32") or even mystr.encode("ascii")).
The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xe4hre' (note it is a unicode string), though. As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug, or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
u'f\xe4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xe4 is the character ä. It is not really important that ä would also be encoded as byte 0xe4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example, 轮渡 would be represented as u'\u8f6e\u6e21', which is simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
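A short Python 2 check of that distinction:
# Python 2: code points vs. the bytes of their UTF-8 encoding
u = u'\u8f6e\u6e21'      # 轮渡 -- two code points
b = u.encode('utf-8')    # six bytes
print len(u), len(b)     # 2 6
assert b == '\xe8\xbd\xae\xe6\xb8\xa1'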
Not exactly: after having been decoded, the unicode string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xe4hre' is actually proper -- the \xe4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you use the encode method, passing the desired encoding's name. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode, which takes a byte string and an encoding name and yields a unicode character string.
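In code, a minimal Python 2 sketch tying the two together (using the string from the question):
# Python 2: encode (unicode -> bytes) and decode (bytes -> unicode) are inverses
u = u'f\xe4hre'                # the unicode string from the form
b = u.encode('utf-8')          # transport encoding: 'f\xc3\xa4hre'
assert b.decode('utf-8') == u  # decoding recovers the code points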
