len(unicode string) - python

>>> c='中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(c)
6
>>> cu=u'中文'
>>> cu
u'\u4e2d\u6587'
>>> len(cu)
2
>>> s='𤭢'
>>> s
'\xf0\xa4\xad\xa2'
>>> len(s)
4
>>> su=u'𤭢'
>>> su
u'\U00024b62'
>>> len(su)
2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
First, I want to make some concepts clear myself.
I've learned that a unicode string like cu=u'中文' is actually encoded in UTF-16 by the Python shell by default. Right? So when we see '\u*', is that actually UTF-16 encoding? And is '\u4e2d\u6587' a unicode string or a byte string? But cu has to be stored in memory, so
0100 1110 0010 1101 0110 0101 1000 0111
(\u4e2d\u6587 converted to binary) is the form in which cu is stored, if it is a byte string? Am I right?
But it can't be byte string. Otherwise len(cu) can't be 2, it should be 4!!
So it has to be unicode string. BUT!!! I've also learned that
python attempts to implicitly encode the Unicode string with whatever
scheme is currently set in sys.stdout.encoding, in this instance it's
"UTF-8".
>>> cu.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
So! how could len(cu) == 2??? Is that because there are two '\u' in it?
But then len(su) == 2 doesn't make sense!
Am I missing something?
I'm using python 2.7.12

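The Python unicode type holds Unicode codepoints, and is not meant to be an encoding. How Python does this internally is an implementation detail and not something you need to be concerned with most of the time. Conceptually they are not UTF-16 code units; UTF-16 is just another codec you can use to encode Unicode text, like UTF-8 is. The one place the implementation detail leaks through is len(su) == 2: on a narrow Python 2 build, a codepoint beyond U+FFFF such as U+24B62 is stored as a surrogate pair, so len() counts two units. On a wide build (and in Python 3.3 and later) the same string has length 1.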
The most important thing here is that a standard Python str object holds bytes, which may or may not hold text encoded to a certain codec (your sample uses UTF-8 but that's not a given), and unicode holds Unicode codepoints. In an interactive interpreter session, it is the codec of your terminal that determines what bytes are received by Python (which then uses sys.stdin.encoding to decode these as needed when you create a u'...' unicode object).
Only when writing to sys.stdout (say, when using print) does the sys.stdout.encoding value come to play, where Python will automatically encode your Unicode strings again. Only then will your 2 Unicode codepoints be encoded to UTF-8 again and written to your terminal, which knows how to interpret those.
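A minimal sketch of the difference between counting codepoints and counting encoded units. It is written for Python 3 (an assumption; the session above is Python 2.7), where str plays the role of Python 2's unicode type. The UTF-16 count at the end is the number a narrow Python 2 build reports for len(su).

```python
# Python 3 sketch: str here corresponds to Python 2's unicode type.
cu = '\u4e2d\u6587'        # 中文, two codepoints
su = '\U00024b62'          # 𤭢, one codepoint outside the BMP

# Encoding turns codepoints into bytes; len() then counts bytes.
utf8_bytes = cu.encode('utf-8')
print(len(cu), len(utf8_bytes))

# Count UTF-16 code units: two bytes per unit in UTF-16-LE (no BOM).
utf16_units = len(su.encode('utf-16-le')) // 2
print(len(su), utf16_units)
```

The length of the text object and the length of its encoded form are separate questions, and only the second one depends on a codec.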
You probably want to read up about Python and Unicode, I recommend:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

Related

How encode strings using python

I have lists with element
[u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
[u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
I try to convert it using
val[0].encode('utf-8')
And got after it
мÑÑ Ð¾ÑиÑиалÑнÑй ÑайÑ
мÑÑ Ð¾ÑиÑиалÑнÑй ÑайÑ
What I do wrong?
You have a Mojibake; text decoded using the wrong codec.
Your text looks like it was decoded as Latin-1 or Windows codepage 1252, while it should have been decoded as UTF-8 instead.
Either reverse the encoding manually, or use the excellent ftfy package to do it for you:
>>> import ftfy
>>> data = [u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
>>> ftfy.ftfy(data[0])
u'\u043c\u0442\u0441 \u043e\u0444\u0438\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u0439 \u0441\u0430\u0439\u0442'
>>> print ftfy.ftfy(data[0])
мтс официальный сайт
Manually, you'd re-encode as Latin-1:
>>> data[0].encode('latin1')
'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82'
>>> data[0].encode('latin1').decode('utf8')
u'\u043c\u0442\u0441 \u043e\u0444\u0438\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u0439 \u0441\u0430\u0439\u0442'
>>> print data[0].encode('latin1').decode('utf8')
мтс официальный сайт
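The manual repair can be wrapped in a small function. This is only a sketch in Python 3 (an assumption; the session above is Python 2), and repair_mojibake is a hypothetical helper name, not part of ftfy or the standard library. It handles only the one case shown here: UTF-8 bytes that were decoded as Latin-1.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (hypothetical helper)."""
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage; return the input unchanged.
        return text

original = '\u043c\u0442\u0441'                       # мтс
broken = original.encode('utf-8').decode('latin-1')   # reproduce the Mojibake
print(repair_mojibake(broken) == original)
```

Text that was never damaged, or that was damaged in some other way, is returned as-is rather than raising.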
Note that you have a list with one unicode object in it. You may want to study up on Python and Unicode; I recommend the following documents:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
These will help you understand when to encode and when to decode, and what codec to use.

Python UnicodeDecodeError when writing German letters

I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.
Here's my code:
f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
And the error I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Why ascii if I say utf-8? I would really appreciate any help.
Try:
value = u"Bitte überprüfen"
in order to declare value as a unicode string and
# -*- coding: utf-8 -*-
at the start of your file in order to declare that your python file is saved with utf-8 encoding.
For the sake of never being hurt by unicode errors ever again, switch to python3:
% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
...
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";
though if you're really tied to python2 and have no choice:
% python2
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
...
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your python2 file to tell python to read the source file as UTF-8, not as ASCII!
Your error in:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
is that you're encoding the whole formatted string into utf-8, whereas you should encode the value string into utf-8 before doing the format. Formatting first and encoding afterwards fails:
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
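This fails because the format string is a byte string: when str.format receives a unicode argument, Python 2 implicitly encodes that argument with the default ASCII codec, and ü is outside ASCII. The error is raised inside format(), before the trailing .encode('utf-8') is ever reached. So either encode the value to UTF-8 yourself before formatting, or make the format string a unicode literal as well and encode the finished result.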
As Karl says in his answer, Python2 is totally messy/buggy when using unicode strings, defeating the Explicit is better than implicit zen of python. And for more weird behaviour, the following works just fine in python2:
>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";
Still not convinced to switch to python3 ? :-)
Update:
This is the way to read a unicode string from one file and write it to another:
% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
...     data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
Update:
as a general rule:
when you get a DecodeError you shall use the .decode('utf-8') on the string that contains unicode data and
when you get an EncodeError, you shall use the .encode('utf-8') on the string that contains unicode data
Update: if you cannot update to python3, you can at least make your python2 behave almost like python3, using the following __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
HTH
Like already suggested your error results from this line:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
it should be:
f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
A note on unicode and encodings
If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.
To avoid making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
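The UTF-8 length rules above can be checked against the codec itself. This is a sketch in Python 3 (an assumption, since the thread is about Python 2), and utf8_length is a hypothetical helper written for illustration.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

samples = ('a', '\u00fc', '\u4e2d', '\U0001f602')     # a, ü, 中, 😂
for ch in samples:
    # Compare the rule with what the codec actually produces.
    print(hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode('utf-8')))
```

The third rule in the list covers two cases: three bytes up to U+FFFF and four bytes above that.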
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a unicode string. A str holds bytes in some encoding like ascii or utf-8; a unicode object holds code points and only gets an encoding when you encode it to a str.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
      /\
     /  \
   str  unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
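A short sketch of that rule applied to the code from the question: keep the value as text and let the file object do the encoding. It is written for Python 3, where string literals are already Unicode (an assumption; the question uses Python 2), and the scratch path is hypothetical. io.open with an encoding argument also exists in Python 2.7.

```python
import io
import os
import tempfile

value = 'Bitte \u00fcberpr\u00fcfen'                  # text, not bytes
line = '"{}" = "{}";\n'.format('no_internet', value)

# Hypothetical scratch file; the path used in the question is not known.
path = os.path.join(tempfile.mkdtemp(), 'strings.txt')

# The file object encodes on write, so no .encode() call is needed here.
with io.open(path, 'a', encoding='utf-8') as f:
    f.write(line)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
print(raw)
```

Calling .encode() before writing to a file opened this way would hand bytes to an object that expects text, which is the mistake in the original snippet.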
Why ascii if I say utf-8?
Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (This is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is represented with some byte with value >= 128, so it's not valid ASCII.
The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever your text editor saves the .py file with!).
Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)
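The implicit ASCII decode described in this answer can be reproduced explicitly. The sketch below uses Python 3 (an assumption), where the step has to be spelled out, and a bytes literal stands in for the Python 2 str from the question.

```python
value = b'Bitte \xc3\xbcberpr\xc3\xbcfen'     # UTF-8 bytes, like a Python 2 str

try:
    # Python 2 performs this decode silently before .encode() on a str.
    value.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The exception type and the 'ascii' codec name match the traceback in the question; only the byte position differs, because the question formats a longer string before encoding it.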

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).
I tried a method like this:
def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode("utf-8")
    return uni
It works fine for characters that are not specific to the latin-1 set, but if I try the following example:
print latin1_to_unicode('å')
It returns Ã¥ instead of å. Same goes for other letters like æ and ø.
Can anyone please explain why this is happening?
Thanks
I have the # -*- coding: utf8 -*- declaration in my script, if it matters any to the problem
Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake.
Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).
My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.
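The same effect can be reproduced step by step. This is a sketch in Python 3 (an assumption; the session above is Python 2), starting from the two UTF-8 bytes that the source file contains for å.

```python
raw = '\u00e5'.encode('utf-8')        # å as the UTF-8 bytes b'\xc3\xa5'

wrong = raw.decode('latin-1')         # each byte becomes its own codepoint
right = raw.decode('utf-8')           # both bytes together form one codepoint

print(len(wrong), len(right))
```

Here wrong is the two-character string Ã¥ from the question, and right is the single character å.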
You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1...).
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
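A sketch of the same experiment in Python 3 (an assumption; the session above is Python 2), where the byte string is an explicit bytes object. It removes one byte from valid UTF-8 and then checks whether the result still decodes.

```python
data = '\u00e0\u00e8\u00ec\u00f2\u00f9'.encode('utf-8')    # àèìòù
damaged = data.replace(b'\xa8', b'')   # drop the second byte of è

try:
    damaged.decode('utf-8')
    is_valid = True
except UnicodeDecodeError:
    is_valid = False

# errors='replace' substitutes U+FFFD where decoding failed.
lossy = damaged.decode('utf-8', errors='replace')
print(is_valid, lossy)
```

The replacement character in lossy is what the terminal showed in the session above.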
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say, Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
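In Python the code point of a character is available through ord(), and chr() goes the other way. The sketch below assumes Python 3 (in Python 2 the reverse function for Unicode is unichr()), and code_point_label is a hypothetical helper name.

```python
def code_point_label(ch):
    """Format one character's code point as U+XXXX (at least four hex digits)."""
    return 'U+{:04X}'.format(ord(ch))

labels = [code_point_label(ch)
          for ch in ('A', '\u00e1', '\u0414', '\u2192', '\U0001f602')]
print(labels)

# chr() maps the number back to the character.
print(chr(0x414) == '\u0414')
```

No encoding is involved at any point here; the numbers are properties of the characters themselves.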
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
You can see all the mappings in this table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
You can see all the mappings in this table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is very well described here.
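The bit patterns in these examples can be reproduced with a few lines. This is a sketch in Python 3 (an assumption), and bit_pattern is a hypothetical helper. Python stores ASCII output in whole bytes, so the 7-bit ASCII pattern appears with a leading zero.

```python
def bit_pattern(ch, encoding):
    """Return the encoded bytes of ch as space-separated 8-bit groups."""
    return ' '.join('{:08b}'.format(b) for b in ch.encode(encoding))

print(bit_pattern('a', 'ascii'))
print(bit_pattern('\u00e1', 'latin-1'))      # á
print(bit_pattern('\u00e1', 'utf-8'))
print(bit_pattern('\u2260', 'utf-8'))        # ≠
print(bit_pattern('\U0001f602', 'utf-8'))    # 😂
```

The same character gives a different pattern under each encoding, while its code point stays the same.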
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
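The escape rules can be observed directly. The sketch below assumes Python 3, where ascii() gives the escaped representation that Python 2's repr() shows for unicode objects (without the u prefix).

```python
texts = ('\u00e1', '\u0414', '\U0001f602')     # á, Д, 😂

# ascii() escapes every non-ASCII codepoint, choosing \x, \u or \U by size.
escaped = [ascii(t) for t in texts]
print(escaped)

# The UTF-8 bytes of á use the same \x notation, but for byte values.
encoded_repr = repr('\u00e1'.encode('utf-8'))
print(encoded_repr)
```

A \x escape inside a text string is a codepoint, while inside a byte string it is a byte, which is why the two representations are easy to confuse.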
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define the string as unicode, the chars a and á each count as one character. Otherwise á counts as two. Try len(a) and len(ua). In addition to that, you may need to know the encoding when you work with other environments. For example, if you compute an md5 hash, you get different values for a and for ua encoded with a different codec.
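The md5 point is easy to check. This is a sketch in Python 3 (an assumption), where hashing requires bytes, so the choice of encoding has to be made explicitly.

```python
import hashlib

ua = '\u00e1'                                   # á as text

# The same character hashes differently depending on the bytes it becomes.
digest_utf8 = hashlib.md5(ua.encode('utf-8')).hexdigest()
digest_latin1 = hashlib.md5(ua.encode('latin-1')).hexdigest()

print(digest_utf8 != digest_latin1)
```

Two systems that exchange hashes of text therefore need to agree on the encoding first.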

noob queries on unicode and str methods in Python

I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string e.g word=u'foo', python uses the encoding of the terminal and decodes foo in e.g UTF-8, and assigning word the hex representation in unicode?
So, in general, is the process of printing out characters in a file, always decoding the byte stream according to the encoding to unicode representation, before displaying the mapped characters out?
In my terminal, why does 'é'.lower() or str('é') display as hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a py file, the declaration of the encoding of the source file will determine how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print out unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think "repr" will try to print everything in an ascii representation. For "a", it's already ascii, so it's output directly. For str "é", it's a UTF-8 encoded byte stream, so Python escapes each byte and prints it as '\xc3\xa9'
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ about how to specify encoding of Python source file. For Python interpreter there's PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
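"Decode In: Encode Out" can be sketched as a function that takes bytes, works on text, and returns bytes. The example assumes Python 3 and UTF-8 input, and shout is a hypothetical helper used only for illustration.

```python
def shout(raw, encoding='utf-8'):
    """Hypothetical helper: bytes in, bytes out, text in between."""
    text = raw.decode(encoding)        # decode at the input boundary
    text = text.upper()                # work on code points, not bytes
    return text.encode(encoding)       # encode at the output boundary

raw = b'\xc3\xa9t\xc3\xa9'             # the UTF-8 bytes of "été"
print(shout(raw))

# Working on the bytes directly only changes the ASCII letter.
print(raw.upper())
```

The second print shows why the decode step matters: the byte-level method has no idea that the two-byte sequences are letters.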
