Quite a silly question, I know. Of course normally LANG=C indicates an ASCII terminal
which cannot display Unicode characters. But I nevertheless want to print out the UTF-8 bytes. I use Python 2 (2.6.5, actually):
print '\xc3\xa4', u'\xe4'
This prints 'ä ä' on a Unicode terminal, but the second string causes an error when executed with LANG=C. I don't want Python to be smart but simply convert u'\xe4' to UTF-8 so it's just '\xc3\xa4' in memory.
I tried all combinations of decode(), encode() and unicode() that I can imagine but it seems I missed the right combination.
What I actually want is reading Unicode characters through vi's system() function, like
:echo system('python foo.py')
To encode a unicode string to UTF-8, call .encode('utf-8') on it:
>>> u'\xe4'.encode('utf-8')
'\xc3\xa4'
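For the LANG=C case in the question, a minimal sketch: encode explicitly before printing, so the raw UTF-8 bytes go out and Python never consults the locale:
s = u'\xe4'
print s.encode('utf-8')  # writes the bytes '\xc3\xa4' regardless of LANG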
Related
I am trying to convert the following to unicode:
'Schutzt\xc3\xbcren'.encode("utf-8")
but cannot, getting the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
I would like to get
'Schutztüren'
as a result.
Your string is already in utf-8. You need to decode it to Unicode in order to use it inside Python:
print 'Schutzt\xc3\xbcren'.decode("utf-8")
But you have a bigger problem: you are clearly using Python 2. Switch to Python 3 immediately; there is no reason to drive yourself crazy trying to understand the Python 2 approach to handling character encodings, banging your head against your desk several times a day. (Note that although you were calling the encode() method, you got a UnicodeDecodeError.)
A simple explanation:
In Python, unicode and utf-8 are different things: a str in Python 2 might be bytes in the "utf-8" encoding, while unicode objects have no encoding at all.
If you try to use a str for something that requires unicode (e.g., to encode() it), or vice versa, Python 2 will try to implicitly convert it first. Except it doesn't know the encoding of your strings, so it guesses (ascii, in your case). Oops.
Python2 has a lot of implicit conversions.
But really the reason is simple: You are not using Python 3.
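Here is that failure mode in a transcript: the encode() call on a str first triggers an implicit decode with the ascii codec, which is why a UnicodeDecodeError comes out of an encode:
>>> 'Schutzt\xc3\xbcren'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)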
Edit: Since Python 3 is not an option, here is some practical advice:
Unicode sandwich: Convert all text to Unicode as soon as it's read in, work with unicode strings and encode back to a utf8 str only to write it out again.
Pandas should still support the encoding argument to to_csv(), even on Python 2. Use it to write your files in utf8.
For reading a file directly, use codecs.open() instead of plain open() to read files. It accepts the encoding= argument and will give you unicode strings.
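A minimal sketch of that unicode sandwich in Python 2, assuming hypothetical file names input.txt and output.txt:
import codecs

# Decode In: read bytes, get unicode
with codecs.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # a unicode object

# ... work with unicode strings here ...

# Encode Out: unicode is encoded back to UTF-8 bytes on write
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)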
You need to decode the UTF-8 encoded string to unicode instead:
'Schutzt\xc3\xbcren'.decode("utf-8")
In Python 3 you'd need to decode the bytes that are your encoded string:
b'Schutzt\xc3\xbcren'.decode("utf-8")
In Python 2 the b prefix is not necessary (there the distinction between bytes and strings is less strict).
When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post, the rest just shows further (ranting) thoughts about it
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing I found that works is this (and it surely cannot be intended, right?):
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
Why do the creators of Python think it is better to show an error instead of simply producing u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the value.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python script or interactive session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because that is the escape for the codepoint U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
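For comparison, a short Python 3 session sketching that behaviour:
>>> 'Fortuna Düsseldorf'
'Fortuna Düsseldorf'
>>> print(ascii('Fortuna Düsseldorf'))
'Fortuna D\xfcsseldorf'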
I have a Python script that writes some strings with UTF-8 encoding. In my script I am using mainly the str() function to cast to string. It looks like that:
mystring = "this is unicode string:" + japanesevalues[1]
#japanesevalues is a list of unicode values, I am sure it is unicode
print mystring
I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output utf8 chars.
If I execute this:
#python myscript.py
this is unicode string: カラダーズ ソフィー
But if I do that:
#python myscript.py > output
I got the typical error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)
Why is that?
The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have marked your question with "Python-3.x" but your print syntax is Python 2, so I suspect you are actually using Python 2. And then your sys.getdefaultencoding() is generally 'ascii', and in your case it's definitely so. And of course, you cannot encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal is something else, though, but you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
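A minimal Python 2 sketch of that advice; the Japanese text from the question is written with \u escapes so the script needs no source-encoding declaration:
import sys

mystring = u"this is unicode string: \u30ab\u30e9\u30c0\u30fc\u30ba \u30bd\u30d5\u30a3\u30fc"
# sys.stdout.encoding is set when printing to a terminal, but is None
# when output is redirected; fall back to UTF-8 in that case.
encoding = sys.stdout.encoding or 'utf-8'
print mystring.encode(encoding)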
If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.
I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal to decode foo from e.g. UTF-8, assigning word the hex representation in unicode?
So, in general, is printing out characters from a file always a process of decoding the byte stream, according to its encoding, into a Unicode representation before displaying the mapped characters?
In my terminal, why does 'é'.lower() or str('é') display in hex as '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declaration of the encoding of the source file will determine how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think "repr" tries to print everything in an ASCII representation. "a" is already ASCII, so it is output directly. The str "é" is a UTF-8 encoded byte stream, so Python escapes each byte and prints it as '\xc3\xa9'.
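A minimal sketch of point 2 above (decode first, then encode and print), using the byte sequence from the question:
data = '\xc3\xa9'            # UTF-8 bytes for 'é', e.g. read from a file
text = data.decode('utf-8')  # -> u'\xe9', a unicode object
print text.encode('utf-8')   # encode explicitly before printing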
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ for how to specify the encoding of a Python source file. For the Python interpreter there's the PYTHONIOENCODING environment variable.
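For example, a PEP 263 declaration on the first line tells Python 2 how to decode the bytes of the source file (without it, non-ASCII bytes in literals are rejected):
# -*- coding: utf-8 -*-
word = u'é'                 # decoded from the file's UTF-8 bytes
print word.encode('utf-8')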
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
I have a fairly large python 2.6 application with lots of print statements sprinkled about. I'm using unicode strings throughout, and it usually works great. However, if I redirect the output of the application (like "myapp.py >output.txt"), then I occasionally get errors such as this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
I guess the same issue comes up if someone has set their LOCALE to ASCII. Now, I understand perfectly well the reason for this error. There are characters in my Unicode strings that are not possible to encode in ASCII. Fair enough. But I'd like my python program to make a best effort to try to print something understandable, maybe skipping the suspicious characters or replacing them with their Unicode ids.
This problem must be common... What is the best practice for handling this problem? I'd prefer a solution that allows me to keep using plain old "print", but I can modify all occurrences if necessary.
PS: I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
If you're dumping to an ASCII terminal, encode manually using unicode.encode, and specify that errors should be ignored.
u = u'\xa0'
u.encode('ascii') # This fails
u.encode('ascii', 'ignore') # This replaces failed encoding attempts with empty string
If you want to store unicode files, try this:
u = u'\xa0'
print >>open('out', 'w'), u # This fails
print >>open('out', 'w'), u.encode('utf-8') # This is ok
Either wrap all your print statements in a method that performs the unicode -> utf-8 conversion, or as a last resort change the Python default encoding from ascii to utf-8 inside your site.py. In general it is a bad idea to print unicode strings unfiltered to sys.stdout, since Python will trigger an implicit conversion of unicode strings to the configured default encoding, which is ascii.
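A minimal sketch of the sys.stdout wrapper approach from the PrintFails page mentioned above, assuming UTF-8 is an acceptable fallback:
import codecs
import sys

# Python 2 leaves sys.stdout.encoding as None when output is redirected;
# wrap stdout so unicode strings get encoded to UTF-8 on the way out.
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'\xa1Hola!'  # no UnicodeEncodeError, even with myapp.py > output.txt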