Unicode conversion issue using Python in Emacs

I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.
The script looks like this (I'm using Python 2.7.1 BTW):
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")
that is, [in general] take a JSON segment containing unicode characters, dump it to its unicode-escaped version, then decode that back to its unicode representation. When run on the command line, the dumps part of this returns:
'{"Foo": "\\u30b6"}'
which when printed looks like:
'{"Foo": "\u30b6"}'
the decode part of this looks like:
u'{"Foo": "\u30b6"}'
which when printed looks like:
{"Foo": "ザ"}
i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.
In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:
File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)`
I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed 23.1.1). Is there some auto-magic part of print invoking the correct codec/locale that happens at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"
produces the same exception, while
"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "
indicates that the encoding is 'None'.
If I attempt to coerce the conversion using:
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"
the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:
Fooa?¶
Any ideas?
UPDATE: thanks to unutbu -- because the locale identification falls down, the command needs to be explicitly decorated with the utf8 encode step (see the answer for working directly with a unicode string). In my case, I am getting what is needed from the dumps/decode sequence, so I add the additional required decoration to achieve the desired result:
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")
Note that this is the "raw" Python without the necessary escaping required by Emacs.
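For completeness, here is a sketch of the same one-liner with the backslash escaping an elisp string needs, following the pattern of the stub tests above (an untested transcription, adjust as needed):
"python -c 'import json; t = {\"Foo\":\"ザ\"}; print json.dumps(t).decode(\"unicode_escape\").encode(\"utf8\",\"replace\")'"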
As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.

The Python wiki page, "PrintFails" says
When Python does not detect the desired character set of the output,
it sets sys.stdout.encoding to None, and print will invoke the "ascii"
codec.
It appears that when python is run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". Trying to print unicode then tacitly causes python to encode the unicode as ascii, which is the reason for the error.
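To make that concrete, here is a minimal Python 2 sketch of the defensive pattern; the fallback to utf8 is an assumption, pick whatever your consumer expects:
import sys
s = u"Foo\u30b6"
# Under Emacs, stdout is a pipe, sys.stdout.encoding is None, and a bare
# `print s` implicitly runs s.encode("ascii"), raising UnicodeEncodeError.
# Encoding explicitly removes the guesswork:
print s.encode(sys.stdout.encoding or "utf8", "replace")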
Replacing u\"Fooザ\" with u\"Foo\\u30b6\" seems to work:
(defun mytest ()
  (interactive)
  (shell-command-on-region
   (point) (point)
   "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'"
   nil t))
C-x C-e M-x mytest
yields
Fooザ

Related

Python program is running in IDLE but not in command line

Before someone says this is a duplicate question, I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.
I am trying to run a very short script in Python
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read().strip()
dhtml = str(html, "utf-8").strip()
soup = BeautifulSoup(dhtml.strip(), "html.parser")
print(soup.prettify())
But I keep getting an error when I run this program with python.exe: UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'. I have tried a lot of methods to get around this, but I managed to isolate the problem to converting bytes to strings. When I run this program in IDLE, I get the HTML as expected. What is it that IDLE is automatically doing? Can I use IDLE's interpretation program instead of python.exe? Thanks!
EDIT:
My problem is caused by print(soup.prettify()) but type(soup.prettify()) returns str?
RESOLVED:
I finally made a decision to use encode() and decode() because of the trouble that has been caused. If someone knows how to actually resolve a question, please do; also, thank you for all your answers
UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'
The console character encoding can't represent '\u025c' i.e., "ɜ" Unicode character (U+025C LATIN SMALL LETTER REVERSED OPEN E).
What is it that IDLE is automatically doing?
IDLE displays Unicode directly (only BMP characters) if the corresponding font supports given Unicode characters.
Can I use IDLE's interpretation program instead of python.exe
Yes, run:
T:\> py -midlelib -r your_script.py
Note: you could write arbitrary Unicode characters to the Windows console if Unicode API is used:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.
Not really. You have PrintFails like everyone else.
The Windows console can't print Unicode. (This isn't strictly true, but going into exactly why, when and how you can get Unicode out of the console is a painful exercise and not usually worth it.) Trying to print a character that isn't in the console's limited encoding can't work, so Python gives you an error.
print them out (which I need an easier solution to, because I cannot do .encode("utf-8") for a lot of elements)
You could run the command set PYTHONIOENCODING=utf-8 before running the script to tell Python to use an encoding which can include any character (so no errors), but any non-ASCII output will still come out garbled, as its encoding won't match the console's actual code page.
(Or indeed just use IDLE.)
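For comparison, a minimal Python 3 sketch that achieves the same thing from inside the script instead of via the environment variable (errors="replace" is just one possible policy):
import io
import sys
# Re-wrap stdout so every code point can be encoded, mirroring PYTHONIOENCODING=utf-8.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
print("\u025c")  # no exception now, though a cp850/cp437 console may still show mojibake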

Understanding Python Unicode and Linux terminal

I have a Python script that writes some strings with UTF-8 encoding. In my script I am mainly using the str() function to cast to string. It looks like this:
mystring="this is unicode string:"+japanesevalues[1]
#japanesevalues is a list of unicode values, I am sure it is unicode
print mystring
I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output utf8 chars.
If I execute this:
#python myscript.py
this is unicode string: カラダーズ ソフィー
But if I do that:
#python myscript.py > output
I got the typical error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)
Why is that?
The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have marked your question with "python-3.x" but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case sys.getdefaultencoding() is generally 'ascii', and in your case it's definitely so. And of course, you cannot encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. It will not work if your terminal is something else, though; you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
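A minimal sketch of that fallback, reusing mystring from the question (defaulting to UTF-8 for the redirected case is an assumption):
import sys
# 'UTF-8' when attached to the terminal, None when redirected to a file or pipe.
enc = sys.stdout.encoding or 'utf-8'
print mystring.encode(enc, 'replace')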
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.
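A quick way to inspect both values (Python 2; the output will vary with your environment):
import sys, locale
print sys.stdout.encoding            # e.g. 'UTF-8' in a terminal, None in a pipe
print locale.getpreferredencoding()  # derived from LANG/LC_ALL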

Python 2.7: How can I pass in arguments such as 'café' from the shell and not get 'cafÚ'?

I've a programm that gets an argument from the shell. This argument will be the query used in a search operation.
If I pass in English words (i.e. no accents, etc.), it works fine. Nevertheless, if I pass in, say, 'café', I get 'cafÚ' (print sys.argv[1] results in cafÚ instead of café).
I thought I could solve the problem by converting it into a Unicode object, but I was wrong.
Q = unicode(sys.argv[1], encoding=sys.stdin.encoding)
I still get 'cafÚ'!! I'm going crazy...
I bet you're on Windows, right?
>>> a = "café"
>>> a
'caf\x82'
>>> print a
café
>>> a.decode("cp850") # DOS codepage 850 --> Unicode
u'caf\xe9'
>>> a.decode("cp850").encode("cp1252") # DOS 850 --> Unicode --> Windows 1252
'caf\xe9' # identical to Unicode codepoint
>>> print a.decode("cp850").encode("cp1252") # Display a cp1252 string in cp850
cafÚ
Use encoding="cp1252" instead, then it should work.
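Applied to the line from the question, that would be (a sketch, untested on your machine):
Q = unicode(sys.argv[1], encoding="cp1252")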
Explanation: (with some guesswork)
Windows cmd consoles use cp850 as their default codepage. This is evident from the second line in my session above: 0x82 is é in cp850.
It appears that Python programs started under Windows use cp1252 as their standard encoding, shown by the last line of the session above: é is 0xe9 in cp1252 (like in Unicode).
This is also evident when you write this string to a file (which by default uses cp1252):
If I do f.write(a), I get caf‚ as the contents of my file, because ‚ (U+201A) is 0x82 in cp1252.
If I do f.write(a.decode("cp850").encode("cp1252")), I get café.
Moral: Find out the correct encodings in your environment, convert everything to Unicode as soon as possible, work with it, then convert back to the encoding you need. If you're outputting to an interactive window, use cp850; if you're outputting to a file, use cp1252.
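Putting that moral into a sketch for the argv case here (assuming cp1252 on the way in and cp850 on the way out, per the guesswork above):
import sys
q = sys.argv[1].decode("cp1252")    # Windows hands arguments over in the ANSI codepage
print q.encode("cp850", "replace")  # the interactive console displays the OEM codepage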
Or switch to Python 3 which makes all of this much easier.

Python program works in Eclipse but not when I run it directly (Unicode stuff)

I have searched and found some related problems but the way they deal with Unicode is different, so I can't apply the solutions to my problem.
I won't paste my whole code but I'm sure this isolated example code replicates the error:
(I'm also using wx for GUI so this is like inside a class)
#coding: utf-8
...
something = u'ЧЕТЫРЕ'
# show the Russian text in a Label on the GUI
self.ExampleLabel.SetValue(str(self.something))
In Eclipse everything works perfectly and it displays the Russian characters. However, when I run the file directly with Python, I get this error on the command line:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11:
ordinal not in range(128)
I figured this has something to do with the command line not being able to output the Unicode chars, while Eclipse does behind-the-scenes magic. Any help on how to make it work on its own?
When you call str() on something without specifying an encoding, the default encoding is used, which depends on the environment your program is running in. In Eclipse, that's different from the command line.
Don't rely on the default encoding, instead specify it explicitly:
self.ExampleLabel.SetValue(self.something.encode('utf-8'))
You may want to study the Python Unicode HOWTO to understand what encoding and str() do with unicode objects. The wxPython project has a page on Unicode usage as well.
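A quick interactive illustration of that implicit encode (Python 2):
>>> s = u'\u0427\u0415\u0422\u042b\u0420\u0415'  # u'ЧЕТЫРЕ'
>>> str(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>> s.encode('utf-8')
'\xd0\xa7\xd0\x95\xd0\xa2\xd0\xab\xd0\xa0\xd0\x95'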
Try self.something.encode('utf-8') instead.
If you use repr instead of str, it will handle the conversion for you and also cover the case where the object is not always of type string, but you may find that it gives you an extra set of quotes or even the unicode u prefix in your context. repr is safer than str: str assumes ascii encoding, while repr shows your codepoints the same way you would see them in code, since wrapping the result with eval is supposed to convert it back to what it was. The repr therefore has to be in a form python source could take, namely ascii-safe, since most python code is written in ascii.

Trying to print unicode characters to the console using string.format()

below is the snippet in question:
print '{:─^10}'.format('') # Print '─' character 10 times
I'm using this to create nice console borders and such. The problem is, running this in my .py file with a # coding: utf-8 declaration gives me: ValueError: Invalid conversion specification
If I run this same script in the python shell, it spits out the escaped characters: '\xc4\xc4\xc4\x...'
I don't know how (in my script) to get this to print out the '─' character. It can print the '─' character just fine if I use print '─' because of the UTF-8 encoding, but for some reason it won't allow it in the string.format() function.
Any suggestions? I'm sure this is probably easy to fix, but i'm VERY new to python programming.
Thanks in advance.
Assuming you're using Python 2, you need to use unicode (u'') strings:
print u'{:─^10}'.format(u'')
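As a usage note, the fill character in a unicode format string can be any code point, e.g.:
print u'{:─^10}'.format(u'ok')  # ────ok────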
