I'm trying to solve an issue when importing a CSV file. I save a string variable that contains Latin-1 characters, and when I echo it, the characters are shown as escaped bytes instead. Is there anything I can do to keep the original characters? I simply want to keep each character as it is, nothing else.
Here's the issue (as seen from Django's manage.py shell):
>>> variable = "{'job_title': 'préventeur'}"
>>> variable
"{'job_title': 'pr\xc3\xa9venteur'}"
Why does Django or Python automatically change the string? Do I have to change the character set or something?
Anything will help. Thanks!
Your terminal is sending encoded characters; it is using UTF-8, and thus Python receives two bytes when you type é.
Decode from UTF-8 in that case:
>>> print 'pr\xc3\xa9venteur'.decode('utf8')
préventeur
You really want to read up on Python and Unicode though:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
"{'job_title': 'pr\xc3\xa9venteur'}"
The characters have been encoded into UTF-8 for you, which is pretty nice, because you don't want to stick with Latin-1 if you value your sanity. Convert to Unicode for best results:
>>> '\xc3\xa9'.decode('UTF-8')
u'é'
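Applied to the variable from the question, a minimal Python 2 sketch (assuming the terminal sends UTF-8, as above) might look like this:
>>> variable = "{'job_title': 'pr\xc3\xa9venteur'}"  # the UTF-8 bytes the terminal sent
>>> unicode_variable = variable.decode('utf-8')      # decode bytes -> unicode object
>>> print unicode_variable                           # print re-encodes for the terminal
{'job_title': 'préventeur'}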
Have you tried using the print statement instead?
>>> variable = "{'job_title': 'préventeur'}"
>>> variable
"{'job_title': 'pr\x82venteur'}"
>>> repr(variable)
'"{\'job_title\': \'pr\\x82venteur\'}"'
>>> print variable
{'job_title': 'préventeur'}
What I already know:
b'\xce\xb8'.decode('UTF-8') gives 'θ', because the decode() function is designed for exactly this job: decoding the bytes.
What I want to know is: does the Python 3 shell mode have some default config that controls the following behavior?
>>> sys.getdefaultencoding()
'utf-8'
>>> b'\xce\xb8'.decode()
'θ'
>>> b'\xce\xb8'
b'\xce\xb8'
>>> b'\x41'
b'A'
>>> print(b'\xce\xb6')
b'\xce\xb6'
>>> print(b'\xce\xb6'.decode('utf8'))
ζ
It seems like shell mode uses ASCII as the default encoding rather than UTF-8.
The question is: is this true? If yes, where is the config that controls this located?
This has nothing to do with the encoding. Python is just showing you, in the shell, a more literal representation of the value you just gave it. Try this instead:
a = b'\xce\xb8'
print(a)
result:
θ
So 'a' is indeed encoded as UTF-8, just as you expected. You're just misinterpreting what Python is echoing back to the console.
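To make the difference between what the shell echoes (the repr) and what print writes explicit, a small Python 2 sketch (assuming a UTF-8 terminal):
>>> a = '\xce\xb8'   # two bytes: the UTF-8 encoding of the Greek letter theta
>>> a                # the shell echoes repr(a), escaping the non-ASCII bytes
'\xce\xb8'
>>> print a          # print writes the raw bytes; the terminal decodes and shows the glyph
θ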
BTW, I also think you're not doing what you think you are with the 'b' prefix. It appears you're using Python 2.x. In that version of Python, the 'b' prefix is ignored. I know that because it doesn't show up in the echoed result. See here:
Python 2.x:
>>> b'\xce\xb8'
'\xce\xb8'
Python 3.x:
>>> b'\xce\xb8'
b'\xce\xb8'
So in Python 2.x, you'll get the same result with and without the 'b'. In Python 3.x, the behavior differs from Python 2.x either way. I haven't done much with Python 3.x, but I believe this is because the way strings are represented changed in 3.x.
PS: If you really just care how Python is echoing strings back to you, I don't know that there's a way to change that. I wonder, however, why that matters to you.
Python 3 represents bytes as the equivalent ASCII character if the value of the byte is within the ASCII range, otherwise it displays the escaped hex value.
From the docs for the byte type:
Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.
This is a deliberate design decision (from the same doc)
to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data
The interpreter doesn't display characters for bytes outside the ASCII range because it cannot know whether the bytes are encoded as UTF-8, some other encoding, or even if they represent text data at all.
As user Steve points out in their answer, this behaviour is not related to encoding. It is not configurable; if you want to see the characters corresponding to a UTF-8 encoded bytestring, decode to str.
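A quick Python 3 illustration of why the interpreter cannot guess: the same bytes decode to entirely different text under different codecs (the codec choice below is just an example).
>>> raw = b'\xce\xb8'
>>> raw.decode('utf-8')    # interpreted as UTF-8: the Greek letter theta
'θ'
>>> raw.decode('latin-1')  # the same bytes interpreted as Latin-1: two unrelated characters
'Î¸'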
I downloaded Spanish text from NLTK in Python using
spanish_sents=nltk.corpus.floresta.sents()
when printing the sentences in the terminal, the corresponding Spanish characters are not rendered. For example, printing spanish_sents[1] produces characters like u'\xe9', and if I encode it using utf-8 as in
print [x.encode("utf-8") for x in spanish_sents[1]]
it produces '\xc3\xa9', and encoding in latin3 as in
print [x.encode("latin3") for x in spanish_sents[1]]
it produces '\xe9'
How can I configure my terminal to print the glyphs for these code points? Thanks!
Just an initial remark: Latin3 (ISO-8859-3) is indeed labelled South European, but it was designed to cover Turkish, Maltese and Esperanto. Spanish is more commonly encoded in Latin1 (ISO-8859-1, West European) or Latin9 (ISO-8859-15).
I can confirm that the letter é has the Unicode code point U+00E9 and is represented as '\xe9' in both Latin1 and Latin3. It is encoded as '\xc3\xa9' in UTF-8, so all your conversions are correct.
But the real question How can I configure my terminal... ? is hard to answer without knowing what the terminal is...
if it is a true teletype or old vt100 without accented characters: you cannot (but I do not think you use that...)
if you use a Windows console, declare the codepage 1252 (very near to Latin1): chcp 1252 and use Latin1 encoding (or even better 'cp1252')
if you use xterm (or any derivative) on Linux or any other Unix or Unix-like system, declare a UTF-8 charset with export LANG=en_US.UTF8 (choose your own language if you do not like American English; the interesting part here is .UTF8) and use UTF-8 encoding; alternatively, declare an ISO-8859-1 charset (export LANG=en_US.ISO-8859-1) and use Latin1 encoding
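Once the terminal is set up, Python can tell you which codec it will use for output. A small Python 2 sketch (reusing spanish_sents from the question; the fallback to UTF-8 is just a guard for when output is piped):
import sys

encoding = sys.stdout.encoding or 'utf-8'   # e.g. 'UTF-8' after export LANG=en_US.UTF8
print encoding
print ' '.join(word.encode(encoding) for word in spanish_sents[1])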
What you are looking at is the representation of the strings; printing a list shows the repr() of its elements, which is intended for debugging.
For printing lists, use .join:
print ', '.join(spanish_sents[1])
My guess is that there are a few things going on. First, you're iterating through a str (is spanish_sents[1] one entire entry? What happens when you print that?). Second, you're not getting full characters because you're iterating through a str (a unicode character takes more "space" than an ASCII character, so addressing a single index will look weird). Third, you are trying to encode when you probably mean to decode.
Try this:
print spanish_sents[1].decode('utf-8')
I just ran the following in my terminal to help give context:
>>> a = '®†\¨ˆø' # Storing non-ASCII characters in a str is ill-advised;
# I do this as an example because it's what I think your question is
# really asking
>>> a # a now looks like a bunch of gibberish if I just output
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> print a # Well, this looks normal.
®†\¨ˆø
>>> print repr(a) # Just demonstrating how the above works
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> a[0] # We're only looking at one character, which is represented by all this stuff.
'\xc2'
>>> print a[0] # But because it's not a complete unicode character, the terminal balks
?
>>> print a.decode('utf-8') # Look familiar?
®†\¨ˆø
>>> print a.decode('utf-8')[0] # Our first character!
®
When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python's Unicode handling broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing how to do the conversions that people (especially the hundreds of people who ask similar questions here) actually use in real-world practice.
Unicode is no magic; why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because U+00FC is the Unicode codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
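For comparison, a short Python 3 session showing the behaviour described above (assuming a UTF-8 capable terminal):
>>> s = 'Fortuna Düsseldorf'
>>> s               # repr() keeps printable non-ASCII glyphs in Python 3
'Fortuna Düsseldorf'
>>> ascii(s)        # ascii() reproduces the old Python 2 repr() style escapes
"'Fortuna D\\xfcsseldorf'"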
I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal to decode foo from e.g. UTF-8, and assigns word the corresponding Unicode code points?
So, in general, is the process of printing characters from a file always one of decoding the byte stream, according to its encoding, into a Unicode representation before displaying the mapped characters?
In my terminal, why does 'é'.lower() or str('é') display as the hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declared encoding of the source file will determine how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print out unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think repr() tries to print everything in an ASCII representation. For "a", it's already ASCII, so it's output directly. The str "é" is a UTF-8 encoded byte stream, so Python escapes each byte and prints it as '\xc3\xa9'.
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
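For reference, a sketch of the same experiment in Python 3, where string literals are Unicode text and bytes are a separate type (assuming a UTF-8 terminal):
>>> 'é'                     # a str literal is already Unicode text
'é'
>>> 'é'.encode('utf-8')     # encoding produces a bytes object
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')
'é'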
See http://www.python.org/dev/peps/pep-0263/ for how to specify the encoding of a Python source file. For the Python interpreter, there's the PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
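A short Python 2 illustration of that difference (assuming a UTF-8 terminal, so a byte-string literal arrives as UTF-8 bytes):
>>> 'é'      # byte string: holds the two UTF-8 bytes the terminal sent
'\xc3\xa9'
>>> u'é'     # unicode string: holds the single code point U+00E9
u'\xe9'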
The following unicode and string can exist on their own if defined explicitly:
>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'
If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?
EDIT:
I did the following:
>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'
which fixes my issue. Can someone explain to me what exactly is happening?
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.
But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'
Then decode it correctly:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'
Now it is in the correct format.
However instead of doing this, if possible you should try to work out why the data has been incorrectly encoded in the first place, and fix that problem there.
You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""
In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:
Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.
Initial state: you have a unicode object that you have named u1. It contains e-acute:
>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'
You encode u1 as UTF-8 and name the result s:
>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'
You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.
>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>
Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
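Continuing the session above, a sketch of that reversal; it only works because latin1 maps bytes 1:1 onto the first 256 code points:
>>> s2 = u2.encode('latin1')   # recover the original UTF-8 bytes
>>> s2 == s
True
>>> u3 = s2.decode('utf8')     # now decode with the correct codec
>>> u3 == u1
True
>>> ucd.name(u3)
'LATIN SMALL LETTER E WITH ACUTE'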
If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding. So:
>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'
Now it is a byte string that can be decoded correctly with utf8:
>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André
In one step:
>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
value_uni.encode('utf8') or whatever encoding you need.
See http://docs.python.org/library/stdtypes.html#str.encode
The OP is not converting to ASCII or UTF-8. That's why the suggested encode methods won't work. Try this:
v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)),v))
The chr(ord(x)) business gets the numeric value of each Unicode character (which had better fit in one byte for your application), and the ''.join call joins the resulting one-character strings back into an ordinary byte string. No doubt there is a more elegant way.
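One arguably more elegant equivalent, since the first 256 Unicode code points map 1:1 onto Latin-1 bytes (a sketch; both expressions produce the same byte string here):
>>> v = u'Andr\xc3\xa9'
>>> ''.join(map(lambda x: chr(ord(x)), v)) == v.encode('latin-1')
True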
Simplified explanation: the str type can only hold characters in the 0-255 range. If you want to store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode into a format suitable for str, for example UTF-8.
To do this, call the encode method on your unicode object and pass the desired encoding as an argument, for example this_is_str = value_uni.encode('utf-8').
You can read longer and more in-depth (and language agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Another excellent article (this time Python-specific): Unicode HOWTO
It seems like
str(value_uni)
should work... at least, it did when I tried it.
EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try
value_uni.encode('latin1')
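A sketch of why str() is not portable: it uses the interpreter's default codec, which on most systems is ASCII and cannot represent the two high bytes:
>>> import sys
>>> sys.getdefaultencoding()   # the codec str() falls back to
'ascii'
>>> str(u'Andr\xc3\xa9')       # fails when the default codec is ASCII
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)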