Unicode character that doesn't work alone, Python - python

OK, so I have another Python Unicode problem. In IDLE on Windows 7, the following code:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
prints two Chinese characters, 不无, the correct ones. However, if I replace the first line with
uni = u"\u65E0"
i.e. only the second character, it prints æ— instead. Although if I replace it with only the first character
u"\u4E0D"
it gives the correct output 不
Is this a bug, or what am I doing wrong?
COMPLETE CODE:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u4E0D"
binary = uni.encode("utf-8")
print binary
OUTPUT:
不无
æ— 
不

The unicode string u"\u4E0D\u65E0" consists of the two text characters 不 and 无.
When a unicode string is encoded, it is converted into a sequence of bytes (not "binary", despite the variable name). Depending on what encoding is used, there may not be a one-to-one mapping of text characters to bytes. The "utf8" encoding, for instance, can use from one to four bytes to represent a single character:
>>> u'\u65E0'.encode('utf8')
'\xe6\x97\xa0'
Now, before a sequence of bytes can be printed, Python (or IDLE) has to try to decode it. But since it has no way to know which encoding was used, it is forced to guess. For some reason, it appears that IDLE may have wrongly guessed "cp1252" for one of the examples:
>>> text = u'\u65E0'.encode('utf8').decode('cp1252')
>>> text
u'\xe6\u2014\xa0'
>>> print text
æ— 
Note that there are three characters in text - the last one is a non-breaking space.
EDIT
Strictly speaking, IDLE wrongly guesses "cp1252" for all three examples. The second one only "succeeds" because each byte coincidently maps to a valid text character ("cp1252" is an 8-bit, single-byte encoding). The other two examples contain the byte \x8d, which is not defined in "cp1252". For these cases, IDLE (eventually) falls back to "utf8", which gives the correct output.
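To see the mis-decoding for yourself outside of IDLE, you can force it by hand. A minimal Python 2 sketch (variable names are illustrative), assuming a console that understands UTF-8:
# -*- coding: utf-8 -*-
uni = u"\u65E0"
raw = uni.encode("utf-8")     # the three bytes '\xe6\x97\xa0'
# Mis-decoding those UTF-8 bytes as cp1252 reproduces the garbage above:
print raw.decode("cp1252")    # æ (0xe6), em dash (0x97), non-breaking space (0xa0)
# Printing the unicode object directly lets Python encode it for the console
# itself, sidestepping the guessing entirely:
print uni                     # 无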

Related

Python - decoded unicode string does not stay decoded

It may be too late at night for me to still be programming (so apologies if this is a very silly thing to ask), but I have spotted some weird behaviour with string decoding in Python:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> name = bs.decode("utf-8", "replace")
>>> print(name)
I n t e l ( R )
>>> list_of_dict = []
>>> list_of_dict.append({'name': name})
>>> list_of_dict
[{'name': 'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'}]
How can the list contain unicode characters if it has already been decoded?
Decoding bytes by definition produces Unicode (text, really; Unicode is how you can store arbitrary text, so Python uses it internally for all text), so when you say "How can the list contain unicode characters if it has already been decoded?" it betrays a fundamental misunderstanding of what Unicode is. If you have a str in Python 3, it's text, and that text is composed of a series of Unicode code points (with unspecified internal encoding; in fact, modern Python stores it as ASCII, latin-1, UCS-2 or UCS-4, depending on the highest ordinal value, sometimes also caching a UTF-8 or native wchar representation for use with legacy extension modules).
You're seeing the repr of the nul character (Unicode ordinal 0) and thinking it didn't decode properly, and you're likely right (there's nothing illegal about nul characters, they're just not common in plain text); your input data is almost certainly encoded in UTF-16-LE, not UTF-8. Use the correct codec, and the text comes out correctly:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> bs.decode('utf-16-le') # No need to replace things, this is legit UTF-16-LE
'Intel(R)'
>>> list_of_dict = [{'name': _}]
>>> list_of_dict
[{'name': 'Intel(R)'}]
Point is, while producing nul characters is legal, unless it's a binary file, odds are it won't have any, and if you're getting them, you probably picked the wrong codec.
The discrepancy between printing the str and displaying it as part of a list/dict is because list/dict stringify with the repr of their contents (roughly, what you'd type to reproduce the object programmatically), so the string is rendered with the \x00 escapes. Printing the str directly doesn't involve the repr, and since there is no printable character for nul, your terminal chose to render the nul characters as spaces.
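A quick way to see the repr/print split side by side (Python 3; the variable name is illustrative, and how your terminal renders the nuls may differ):
>>> s = 'I\x00n\x00t\x00e\x00l\x00'
>>> print(s)          # the terminal decides how to render the nul characters
I n t e l
>>> print(repr(s))    # repr makes the escapes visible
'I\x00n\x00t\x00e\x00l\x00'
>>> [s]               # containers stringify via the repr of their contents
['I\x00n\x00t\x00e\x00l\x00']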
So what I think is happening is that the null characters (\x00) are not decoded away and remain in the string after decoding. Since they are null characters, they do not get in the way when you print the string, which renders them as nothing or as spaces (when I tested your code on Arch Linux with Python 2 and Python 3, they were omitted completely).
Now, the thing is that you got a \x00 byte for each of your string's characters when you decoded with utf-8, which means your byte stream actually consists of 16-bit characters, not 8-bit ones. Therefore, if you decode using utf-16 instead, your code will work like a charm :)
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> t = bs.decode("utf-16", "replace")
>>> print(t)
Intel(R)
>>> t
'Intel(R)'

Write Unicode character with integer value to text file in Python 2

In Python 2, I want to write the Unicode character whose integer value is k to a text file.
How should I do that?
(For instance, with ASCII, if I want to write the character with value 65, it should appear in the text file as 'A'.)
Afterwards, how should I read the file back to integer value?
The last question: how many Unicode characters are there in total? (As far as I know, there is more than one Unicode alphabet, such as UTF-8, UTF-16, etc.)
Thanks a lot
You can't write Unicode code points to text files. They must be encoded. UTF-8, UTF-16 and UTF-32 are encodings that support the full range of Unicode code points. unichr() is the function to turn an integer into a Unicode codepoint. Note that Python 2 will default to an encoding that depends on your operating system if you don't specify one, but it won't be able to write all Unicode characters unless that default is one of the UTF encodings.
Create a Unicode character:
k = 65
u = unichr(k)
Write it to a file encoded in UTF-8:
import io
with io.open('output.txt', 'w', encoding='utf8') as f:
    f.write(u)
ord() will convert a character back to an integer.
Example (make sure to open with the same encoding as written):
import io
with io.open('output.txt', encoding='utf8') as f:
    u = f.read()
k = ord(u)
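Putting the write and read halves together (a sketch; the file name and sample value are arbitrary, and note that on "narrow" Python 2 builds unichr() rejects values above 0xFFFF):
import io
k = 0x4E0D                        # an arbitrary non-ASCII example (the character 不)
with io.open('output.txt', 'w', encoding='utf8') as f:
    f.write(unichr(k))
with io.open('output.txt', encoding='utf8') as f:
    assert ord(f.read()) == k     # the integer survives the round trip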
Unicode code points range from U+0000 to U+10FFFF. Not all code points are defined, but there are 1,114,112 possible values in that range.

How to write a unicode object into a file in Python?

I'm trying to write a "string" to a file and I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object can I write it to the file directly or I need to transform it to a string?
How can I find out what encoding is used? How can I know if it is utf-8 or ascii or something else?
ADDED
I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
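Since txt.encode('utf-8') succeeded, txt was evidently already a unicode object. For illustration, here is the failure and both working alternatives in one Python 2 sketch (file names are arbitrary):
# -*- coding: utf-8 -*-
txt = u'\xcd'                      # the character from the traceback
f = open('out.txt', 'w')
try:
    print >>f, txt                 # plain file: the implicit ascii encode fails
except UnicodeEncodeError as e:
    print e
print >>f, txt.encode('utf-8')     # encode it yourself: works
f.close()

import io
with io.open('out2.txt', 'w', encoding='utf-8') as f:
    f.write(txt)                   # or let the file object encode for you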
I think you need to use the codecs library:
import codecs
file = codecs.open("test.txt", "w", "utf-8")
file.write(u'\xcd')
file.close()
Works fine.
The Story of Encoding/Decoding:
In the past, only a small set of characters was available in computers: ASCII defines 128 code points, covering upper-case and lower-case letters + numbers + some control and special characters. So one byte was enough to assign a unique number to each character. Assigning numbers to letters for storing in memory is called encoding. The single-byte encoding that is used in Python 2 by default is named ASCII.
With the growth of computers around the world, we needed more letters and characters, so one byte was not enough and different encoding schemes appeared. Unicode is the famous character set that covers them all, with encodings such as UTF-8 to turn its code points into bytes. The character that you are trying to store in your file is such a character: it needs two bytes in UTF-8, so you must explicitly tell Python not to use the default encoding, i.e. ASCII, which cannot represent it at all.
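You can see both the default and an explicit choice at the prompt (the traceback is abbreviated):
>>> u'\xcd'.encode()         # no codec given, so Python 2 falls back to ascii
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 0: ordinal not in range(128)
>>> u'\xcd'.encode('utf-8')  # two bytes, as described above
'\xc3\x8d'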

Encoding unicode with 'utf-8' shows byte-strings only for non-ascii

I'm running Python 2.7.10.
Trying to wrap my head around why the following behavior is seen (I'm sure there is a reasonable explanation).
So I define two unicode characters, with only the first one in the ASCII set, and the second one outside of it.
>>> a=u'\u0041'
>>> b=u'\u1234'
>>> print a
A
>>> print b
ሴ
Now I encode it to see what the corresponding bytes would be. But only the latter gives me the results I am expecting to see (bytes)
>>> a.encode('utf-8')
'A'
>>> b.encode('utf-8')
'\xe1\x88\xb4'
Perhaps the issue is in my expectation, and if so, one of you can explain where the flaw lies.
- My a, b are unicode strings (with the hex values of the ordinals inside)
- When I print these, the interpreter prints the actual character corresponding to each code point.
- When I encode, I assumed that it would be converted into a byte string using the encoding scheme I provide (in this case utf-8). I expected to see a byte string for a.encode, just like I did for b.encode.
What am I missing?
There is no flaw. You encoded to UTF-8, which uses the same bytes as the ASCII standard for the first 128 codepoints of the Unicode standard, and uses multiple bytes (between 2 and 4) for everything else.
You then echoed that value in your terminal, which uses the repr() function to build a debugging representation. That representation produces a valid Python expression for strings, one that is ASCII safe: any byte in the value that is not printable as an ASCII character is shown as an escape sequence. Thus UTF-8 bytes are shown as \xhh hex escapes.
Most importantly, because A is a printable ASCII character, it is shown as is; any code editor or terminal will accept ASCII, and for most English text showing the actual text is so much more useful.
Note that you used print for the unicode values stored in a and b, which means Python encodes those values to your terminal codec, coordinating with your terminal to produce the right output. You did not echo the values in the interpreter. Had you done so, you'd also have seen the debug output:
>>> a = u'\u0041'
>>> b = u'\u1234'
>>> a
u'A'
>>> b
u'\u1234'
In Python 3, the functionality of the repr() function (or rather, the object.__repr__ hook) has been updated to produce a unicode string with most printable codepoints left un-escaped. Use the new ascii() function to get the above behaviour.
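For example (Python 3):
>>> s = '\u1234'
>>> repr(s)       # printable non-ASCII is now left un-escaped
"'ሴ'"
>>> ascii(s)      # the old escaping behaviour
"'\\u1234'"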

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason, apart from being able to specify Unicode code points in unicode strings using the escape character \?
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points, which may be bigger than a single byte. Text can be encoded in a specific encoding to represent it as raw bytes (e.g. utf-8, latin-1, ...).
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # the utf-8 terminal cannot understand the latin1 byte
�
Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
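By contrast, the same kind of edit on the unicode string operates on whole code points, so the result is always valid text (same terminal assumptions as above):
>>> print u'àèìòù'.replace(u'è', u'')
àìòù
>>> u'àèìòù'.replace(u'è', u'')
u'\xe0\xec\xf2\xf9'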
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in the Unicode code charts.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say "Unicode code point 0x1F602". Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
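The arithmetic is easy to check at the prompt:
>>> 0x10FFFF + 1        # code points U+0000 through U+10FFFF
1114112
>>> 1114112 - 2048      # minus the surrogate range U+D800..U+DFFF
1112064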
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
UTF-8
Maps 1,112,064 characters (all assignable Unicode code points) to bit patterns of length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 packs code points into bit strings follows a small set of prefix rules, sketched below.
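A minimal sketch of those rules (the helper name utf8_bytes is made up for illustration; the bit-twiddling works the same in Python 2 and 3):
def utf8_bytes(cp):
    # Pack a single Unicode code point into UTF-8 bytes by hand.
    if cp < 0x80:         # 1 byte:  0xxxxxxx
        ints = [cp]
    elif cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        ints = [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elif cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        ints = [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    else:                 # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        ints = [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return bytes(bytearray(ints))

print(repr(utf8_bytes(0x61)))     # a  -> 0x61
print(repr(utf8_bytes(0xE1)))     # á  -> 0xC3 0xA1
print(repr(utf8_bytes(0x2260)))   # ≠  -> 0xE2 0x89 0xA0
print(repr(utf8_bytes(0x1F602)))  # 😂 -> 0xF0 0x9F 0x98 0x82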
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a str value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing the codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
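All three escape styles at the prompt (Python 2):
>>> u'\xe1'      # U+00E1 fits in two hex digits, so a \x.. escape is used
u'\xe1'
>>> u'\u1234'    # beyond U+00FF, so the \u.... escape is used
u'\u1234'
>>> '\xe1'       # a byte string: same-looking escape, entirely different meaning
'\xe1'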
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define a as unicode, the chars a and á each count as one character. Otherwise á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to watch the encoding when you work with other environments. For example, if you use md5, you get different values for a and ua.
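To make that concrete (Python 2 in a UTF-8 source file; hashing operates on bytes, so the digest depends on which encoding you pick):
# -*- coding: utf-8 -*-
import hashlib

a = 'á'       # two UTF-8 bytes
ua = u'á'     # one code point

print len(a), len(ua)                               # 2 1
print hashlib.md5(a).hexdigest()                    # hash of '\xc3\xa1'
print hashlib.md5(ua.encode('latin1')).hexdigest()  # hash of '\xe1': a different digest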
