Converting from ascii to utf-8 with Python - python

I have an XMPP bot written in Python. One of its plugins is able to execute OS commands and send the output to the user. As far as I know, the output should be unicode-like to send it over the XMPP protocol. So I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output, 'utf-8', 'ignore')
bot.send(xmpp.Message(mess.getFrom(), output))
But when Russian symbols appear in output they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in python console I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1:
ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(

sys.getdefaultencoding() returns python's default encoding - which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
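A minimal sketch of how a detected encoding could be applied to command output. It is written for Python 3 rather than the Python 2.5 used in the question. The helper name decode_command_output is hypothetical, and the 'replace' error handler is an assumption, chosen so that undecodable bytes show up visibly instead of raising.

```python
import locale


def decode_command_output(raw, encoding=None):
    """Decode bytes captured from a child process into text (hypothetical helper)."""
    if encoding is None:
        # False: query the locale without changing it as a side effect.
        encoding = locale.getpreferredencoding(False)
    # 'replace' swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, 'replace')


# Bytes as a cp866 console would emit them for a Russian word.
raw = u'Привет'.encode('cp866')
text = decode_command_output(raw, 'cp866')
print(text == u'Привет')
```

Passing the encoding explicitly, as in the last lines, matters when the console code page differs from the locale's preferred encoding.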

ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
Edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you decode with a code page different from the one that was used to encode the Russian symbols, you will get the wrong result.
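To illustrate the point about mismatched code pages, here is a small sketch (Python 3, sample word chosen for illustration) that encodes text with cp866 and then decodes the same bytes with cp1251. No error is raised; the result is simply different characters.

```python
text = u'Привет'
raw = text.encode('cp866')      # what a cp866 console would emit

right = raw.decode('cp866')     # same code page as the encoder
wrong = raw.decode('cp1251')    # a different Russian code page

print(right == text)            # True
print(wrong == text)            # False: same bytes, different characters
```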

You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
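The parsing step could be sketched as follows. The function name parse_chcp_output is hypothetical, and using a regular expression instead of split()[-1] is a substitution made here on the assumption that some localized messages may end with punctuation.

```python
import re


def parse_chcp_output(result):
    """Return a Python codec name such as 'cp850' from chcp's output."""
    # Take the last run of digits so the localized label does not matter.
    numbers = re.findall(r'\d+', result)
    if not numbers:
        raise ValueError('no code page number in %r' % result)
    return 'cp' + numbers[-1]


print(parse_chcp_output('Active code page: 850\n'))  # cp850
```

On Windows the argument would come from os.popen('chcp').read(), as described above. Whether Python has a codec for the resulting name still needs checking before it is used.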

In Python 'cp855', 'cp866', 'cp1251', 'iso8859_5', 'koi8_r' are differing Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command lists the code page used by console commands. That won't necessarily be the same code page as Windows applications. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.
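As a rough illustration of why the choice matters, this Python 3 sketch encodes one Cyrillic letter with each of the codecs named above and prints the resulting byte. The sample letter is an arbitrary choice.

```python
CODECS = ['cp855', 'cp866', 'cp1251', 'iso8859_5', 'koi8_r']
letter = u'\u044f'  # CYRILLIC SMALL LETTER YA

# Map each codec name to the single byte it uses for the letter.
encoded = dict((name, letter.encode(name)) for name in CODECS)
for name in CODECS:
    print('%-10s %r' % (name, encoded[name]))

# The codecs do not all agree on the byte value.
print(len(set(encoded.values())) > 1)
```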

Related

Python 3.5: Exporting Chinese Characters

I have tried several times to export Chinese text from list variables to a csv or txt file and ran into problems.
Specifically, I have already set the encoding to utf-8 or utf-16 when reading the data and writing it to the file. However, I noticed that this does not work when my Windows 7 base language is English, even if I change the language setting to Chinese. When I run the Python programs under Windows 7 with Chinese as the base language, I can export and display Chinese perfectly.
I am wondering why that happens, and whether there is any solution that lets me show Chinese characters in the exported file when running the Python programs under English-based Windows?
I just found that you need to do 2 things to achieve this:
Change the Windows display language to Chinese.
Use encoding UTF-16 in the writing process.
Here's US Windows 10, running a Python IDE called PythonWin. There are no problems with Chinese.
Here's the same program running in a Windows console. Note the US codepage default for the console is cp437. cp65001 is UTF-8. Switching to an encoding that supports Chinese text is the key. The text below was cut-and-pasted directly from the console. While the characters display correctly pasted to Stack Overflow, the console font didn't support Chinese and actually displayed .
C:\>chcp
Active code page: 437
C:\>x
Traceback (most recent call last):
File "C:\\x.py", line 5, in <module>
print(f.read())
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
C:\>chcp 65001
Active code page: 65001
C:\>type test.txt
我是美国人。
C:\>x
我是美国人。
Notepad displays the output file correctly as well:
Either use an IDE that supports UTF-8, or write your output to a file and read it with a tool like Notepad.
Ways to get the Windows console to actually output Chinese are the win-unicode-console package, and changing the Language and Region settings, Administrative tab, system locale to Chinese. For the latter, Windows will remain English, but the Windows console will use a Chinese code page instead of an English one.
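For the "write your output to a file" route, a minimal Python 3 sketch might look like this. The file name and sample rows are made up, and 'utf-8-sig' (UTF-8 with a byte order mark) is an assumption chosen because it helps some Windows tools recognise the file as UTF-8.

```python
import io
import os
import tempfile

rows = [u'我是美国人。', u'你好']

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# The encoding is stated explicitly, so the console code page is not involved.
with io.open(path, 'w', encoding='utf-8-sig') as f:
    for row in rows:
        f.write(row + u'\n')

# Reading back with the same codec strips the byte order mark again.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    read_back = f.read().splitlines()

print(read_back == rows)  # True
```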

UnicodeEncodeError - works in Spyder but not when executed from terminal

I'm using BeautifulSoup to Parse some html, with Spyder as my editor (both brilliant tools by the way!). The code runs fine in Spyder, but when I try to execute the .py file from terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running OPENSUSE on a linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.
That is because you must encode the result before outputting it (i.e. writing it to the file):
file1.write(html.encode('utf-8'))
You see, every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object and given this error, I'm pretty sure you're using Python 2.7 because its sys.getdefaultencoding() is ascii.
Hope this helps!
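The fix can be sketched without BeautifulSoup. In this Python 3 sketch the html string is a stand-in for the soup.prettify() result, and the temporary path is made up for the illustration.

```python
import os
import tempfile

html = u'<p>\xa9 2010</p>'  # stand-in for soup.prettify() output

# The ASCII codec cannot represent the copyright sign.
try:
    html.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii fails: %s' % exc.reason)

# Encoding explicitly produces bytes that a file opened in 'wb' mode accepts.
data = html.encode('utf-8')
path = os.path.join(tempfile.mkdtemp(), 'index.html')
with open(path, 'wb') as f:
    f.write(data)
```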

handling unicode strings in Windows

For the first time, I was trying out one of my Python scripts, which deals with unicode characters, on Windows (Vista) and found that it's not working. The script works perfectly okay on Linux and OS X but no joy on Windows. Here is the little script that I tried:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys, codecs
reload(sys)
sys.setdefaultencoding('utf-8')
print "\nDefault encoding\t: %s" % sys.getdefaultencoding()
print "sys.stdout.encoding\t: %s\n" % sys.stdout.encoding
## Unicode strings
ln1 = u"?0>9<8~7|65\"4:3}2{1+_)(*&^%$£#!/`\\][=-"
ln2 = u"mnbvc xzasdfghjkl;'poiuyàtrewq€é#¢."
refStr = u"%s%s" % (ln2,ln1)
print "refSTR: ", refStr
for x in refStr:
    print "%s => %s" % (x, ord(u"%s" % x))
When I run the script from Windows CLI, I get this error:
C:\Users\san\Scripts>python uniCode.py
Default encoding : utf-8
sys.stdout.encoding : cp850
refSTR; Traceback (most recent call last):
File "uniCode.py", line 18, in <module>
print "refSTR; ", refStr
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position
30: character maps to <undefined>
I came across this Python-wiki and tried a few things from there but that didn't work. Does anyone know what I'm still missing? Any help greatly appreciated. Cheers!!
The Windows console has a Unicode API, but it is not utf-8. Python is trying to encode Unicode characters to your console's 8-bit code page cp850, which fails for characters such as € that cp850 cannot represent. There's supposedly a code page (chcp 65001) in the Windows console that supports utf-8, but it's severely broken. Read issue 1602 and look at sys_write_stdout.patch and unicode2.py, which use Unicode wide character functions such as WriteConsoleOutputW and WriteConsoleW. Unfortunately it's a low priority issue.
FYI, you can also use IDLE, or another GUI console (based on pythonw.exe), to run a script that outputs Unicode characters. For example:
C:\pythonXX\Lib\idlelib\idle.pyw -r script.py
But it's not a general solution if you need to write CLI console tools.
setdefaultencoding and getdefaultencoding refer to the encoding the Python interpreter uses for implicit conversions, whereas sys.stdout.encoding is the encoding used by your terminal. You can verify this by writing the output to a file instead of printing it to the terminal.
The way to 'fix' this program would be to set the terminal encoding to something that you want (utf-8) or write to the file and open the output in the editor which supports those particular characters.
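If neither option is available, one workaround is to encode for the console yourself and accept substitutions for characters the code page lacks. This is a Python 3 sketch. The helper name encode_for_console is hypothetical, and falling back to 'ascii' when no encoding is reported is an assumption.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a console, replacing unsupported characters with '?'."""
    if encoding is None:
        # sys.stdout.encoding can be missing or None, e.g. when redirected.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace')


# cp850 has a byte for e-acute but none for the euro sign.
print(encode_for_console(u'\xe9\u20ac', 'cp850'))  # b'\x82?'
```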

python: unicode in Windows terminal, encoding used?

I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question 2: I tried the latin-1 encoding on the off chance that it would turn the string into a unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal has used to encode my string?
Question 3: how can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.
I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows
Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
The windows terminal uses legacy code pages for DOS. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use windows code pages. Python's IDLE will show the windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.
Avoid Windows Terminal
I'm not going out on a limb by saying the 'terminal' (more appropriately the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with PowerShell, I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
./myscript.py > output.txt
Then using Notepad++ you can then see the UTF-8 version of your output.
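The effect of PYTHONIOENCODING can be checked from Python itself. This Python 3 sketch starts a child interpreter with the variable set and captures its raw output bytes. The one-line child program is made up for the illustration.

```python
import os
import subprocess
import sys

# Copy the current environment and force UTF-8 for the child's stdio.
env = dict(os.environ, PYTHONIOENCODING='utf-8')

# The child prints a single non-ASCII character (e with diaeresis).
child_code = "print(u'\\xeb')"
out = subprocess.check_output([sys.executable, '-c', child_code], env=env)

# The captured bytes are the UTF-8 encoding of that character.
print(repr(out.strip()))  # b'\xc3\xab'
```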
Install win-unicode-console
win-unicode-console can fix your problems. You should try it out
pip install win-unicode-console
If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
Runs it per script or you can follow their directions to add win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize.
In case others get this page when searching
Easiest way is to set the codepage in the terminal first
CHCP 65001
then run your program.
This works well for me.
For PowerShell, start it with
powershell.exe -NoExit /c "chcp.com 65001"
It's from python: unicode in Windows terminal, encoding used?
Read through this python HOWTO about unicode after you read this section from the tutorial
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question, they are different because only when using u'' are you creating a unicode string.
2nd question:
sys.getdefaultencoding()
returns the default encoding
But to quote from link:
Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings StrIsNotAString to unicode strings.
You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all, it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab and in Win-1252 it's actually the same as the unicode code-point - ie hex eb. So, it's a bit of a mystery.
It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
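One way to narrow down the candidates is to decode the byte with each suspected codec and compare code points. This is a Python 3 sketch, and the list of candidate codecs is an assumption based on the encodings mentioned in this thread.

```python
raw = b'\x89'  # the byte the terminal produced for the typed character

# Decode the same byte under each candidate codec and record the code point.
results = {}
for name in ['cp437', 'cp850', 'latin-1', 'cp1252']:
    results[name] = ord(raw.decode(name))
    print('%-8s U+%04X' % (name, results[name]))
```

Both cp437 and cp850 map the byte to U+00EB (ë), which matches the unicode literal. latin-1 gives the control character U+0089 and cp1252 gives U+2030 (‰), so the byte alone cannot distinguish cp437 from cp850.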
Actually, a unicode object has no 'encoding'. You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/ .
You are on a Russian version of Windows, right? Your terminal uses cp1251.
As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when you get such errors?
If so, try to open it with
import codecs
f = codecs.open('filename.txt','r','utf-8')

python unichr problem

I've got some problem with unichr() on my server. Please see below:
On my server (Ubuntu 9.04):
>>> print unichr(255)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)
On my desktop (Ubuntu 9.10):
>>> print unichr(255)
ÿ
I'm fairly new to python so I don't know how to solve this. Anyone care to help? Thanks.
When you use the print statement, you are writing to the sys.stdout output stream. If the encoding of sys.stdout is ASCII or unset, a Unicode string can only be printed if all of its characters can be converted to ASCII, i.e. if str(message) would succeed.
You'll need to encode to your OS's terminal encoding when printing to be able to do this.
The locale module can sometimes detect the encoding of the output console:
import locale
print unichr(0xff).encode(locale.getdefaultlocale()[1], 'replace')
but it's usually better to just specify the encoding yourself, as python often gets it wrong:
print unichr(0xff).encode('latin-1', 'replace')
I think UTF-8 or latin-1 is what many modern Linux distros use.
If you know the encoding of your console, the lines below will encode Unicode strings automatically when you use "print":
import sys
import codecs
sys.stdout = codecs.getwriter(ENCODING)(sys.stdout)
If the encoding is ascii or something similar, you may need to change the console encoding of your OS to be able to display that character.
See also: http://wiki.python.org/moin/PrintFails
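The codecs.getwriter wrapper can be tried without touching the real sys.stdout. In this Python 3 sketch an in-memory byte buffer stands in for a byte-oriented output stream, and 'utf-8' is an assumed value for ENCODING.

```python
import codecs
import io

buf = io.BytesIO()  # stands in for a byte-oriented stdout

# getwriter returns a StreamWriter class; calling it wraps the byte stream.
writer = codecs.getwriter('utf-8')(buf)

# Text written to the wrapper is encoded before it reaches the buffer.
writer.write(u'\xff\n')
print(repr(buf.getvalue()))  # b'\xc3\xbf\n'
```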
The terminal settings on your server are different, probably set to 7-bit US ASCII.
It's not really unichr() related. The problem is the locale setting in your server environment: it's probably set to something like en_US, which is not Unicode-aware.
Consider using an explicit encoding when printing unicode strings where OS settings are not uniform.
unicode.encode([encoding[, errors]])
Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
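The error handlers named in that description can be compared side by side. This sketch uses Python 3 syntax, where encode returns bytes, and deliberately targets ASCII so that the character cannot be encoded.

```python
ch = u'\xff'  # LATIN SMALL LETTER Y WITH DIAERESIS, not representable in ASCII

# Encode the same character with each non-strict error handler.
results = {}
for handler in ['ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace']:
    results[handler] = ch.encode('ascii', handler)
    print('%-18s %r' % (handler, results[handler]))
```

'ignore' drops the character, 'replace' substitutes '?', 'xmlcharrefreplace' emits the reference &#255;, and 'backslashreplace' emits the escape \xff as literal text.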
For example,
>>> print unichr(0xff).encode('iso8859-1')
ÿ
>>>
