Good day
Can somebody explain what is going on behind the Django manage.py shell console?
The problem is as follows. I'm developing a Django app that uses urllib to fetch some HTML pages and parse information out of them. That information is in Russian, so it should be unicode (in this case it is an address string). My script then feeds it to a third-party module, which fails because it is not a valid unicode string (I'm trying to geodecode a point from the address).
I tried to print the string (the parsed address in this case) to the console with the print address command, but it fails:
File "<console>", line 1, in <module>
... some useless stacktrace ...
print address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Now comes the interesting part.
I have 2 computers: a workstation with Ubuntu and Python 2.7.2, and a Debian Lenny VPS with Python 2.7.2. I start the parser the same way on both machines: by running python manage.py shell and calling my function from it.
At first I got the same error on both installations, but then I noticed that my Python default encoding is set to 'ascii' (import sys; sys.getdefaultencoding()). When I put
import sys; reload(sys).setdefaultencoding('utf-8')
into settings.py, the problem goes away on Ubuntu. I now get a proper printout there, e.g. г. Челябинск, ул. Кирова, д. 27, КТК Набережный, but this does not work on Debian.
If I delete this print address line, I then get unreadable geolocation errors, but again only on Debian. Ubuntu works just fine:
Failed to geodecode address [г. ЧелÑбинÑк, Ñл. 1-ой ÐÑÑилеÑки, 17/1, ÑÑнок ÐÑÑак, 1-з]
No amount of unicode(address).encode('utf-8') magic can help this.
So I just don't get it. What are the differences between the machines that cause me so much trouble?
If you run the following python script, you'll see what's happening:
# -*- coding: utf-8 -*-
a = r"Челябинск"
print "Encode from UTF-8 to UTF-8:",unicode(a,'utf-8').encode('utf-8')
print "Encode from ISO8859-1 to UTF-8:",unicode(a,'iso8859-1').encode('utf-8')
The output is:
Encode from UTF-8 to UTF-8: Челябинск
Encode from ISO8859-1 to UTF-8: ЧелÑбинÑк
In essence you're taking a string encoded (already) as UTF-8 and re-encoding it (a second time, as if it were ISO8859-1) into UTF-8.
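For anyone on Python 3, here is a minimal sketch of the same double-encoding effect, plus one way to undo it. It assumes the damage came from exactly one extra ISO8859-1 round trip; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: text is str, encoded data is bytes.
original = "Челябинск"

# Correct, single encoding step.
utf8_bytes = original.encode("utf-8")

# The mistake: treat the UTF-8 bytes as ISO8859-1 text, then encode again.
double_encoded = utf8_bytes.decode("iso8859-1").encode("utf-8")

# Undo it by reversing the two steps in the opposite order.
repaired = double_encoded.decode("utf-8").encode("iso8859-1").decode("utf-8")

# Print only ASCII-safe facts so this runs on any console.
print(len(utf8_bytes), len(double_encoded))
print(repaired == original)
```

The repair only works if no bytes were dropped or replaced along the way, so treat it as a diagnostic rather than a fix.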
It's worth checking what the default encoding of the machine is in each case.
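A small sketch of what that check could look like, so the output from both machines can be compared side by side. The helper name describe_encodings is made up for this example; the calls it uses are standard library.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings that commonly differ between machines."""
    return {
        "default": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without an encoding attribute
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


for key, value in sorted(describe_encodings().items()):
    print("%-10s %r" % (key, value))
```

Run it from the same shell (and the same manage.py shell session, if possible) on each machine, since the values depend on the environment the interpreter was started in.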
If anyone can add to this answer then please do.
I'm building a little Django 1.1 app (though I believe this issue to be specific to Python) where I've come to use commands to control the flow of getting and categorizing data. I also wish to print a sort of summary using a third command. I am using macOS 10.12.3.
My problem comes from getting text data in and printing it to the console or a document using
> or >>
in the console.
I'm running these scripts using an alias of Python 3.6.1
I'm using the Tweepy api, but that should hopefully not be relevant.
These snippets should illustrate the problem I'm hoping to solve:
print(type(data))
print(type(data.text))
try:
    print(data.text)
except UnicodeEncodeError:
    print("no printing today :(")
print(type(data.text.encode('UTF-8')))
print(data.text.encode('UTF-8'))
this outputs:
<class 'tweepy.models.Status'>
<class 'str'>
no printing today :(
<class 'bytes'>
b'kontroll p\xc3\xa5 ... v\xc3\xa5pen.'
The ugly things there should both be the character 'å'.
This is the error that would be thrown:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 223: ordinal not in range(128)
It says 'ascii' codec, but doing (in my Python 3.6.1 script):
print(sys.getdefaultencoding())
outputs:
utf-8
Running
print(sys.getdefaultencoding())
again in Python 2.7.10 outputs:
ascii
So the thrown error matches what 2.7.10 outputs. I am not discounting the possibility that I could be wrong about what a default encoder does.
I have also tried
export LOCALE="no_NB.UTF-8"
in an attempt to see if this could be caused by my system (unless I'm misunderstanding what this does). I did not write this to any file, thinking it would persist through the current session.
Is the wrong encoder being used somehow? Could it be my terminal encoding? How can I write my special characters to the terminal and file? Are strings really this hard to get right?
Any help is greatly appreciated!!
Setting
export LC_ALL=no_NO.UTF-8
export LANG=no_NO.UTF-8
in my .bash_profile now allows me to see the characters I want in my terminal and it is also successfully echoed to a file.
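To see why the locale matters, here is a rough Python 3 sketch that reproduces the error without touching the real terminal. It wraps an in-memory byte buffer in a text stream, once with the 'ascii' encoding (what stdout ends up with when no UTF-8 locale is set) and once with 'utf-8'. The sample text and the write_as helper are made up for the example.

```python
import io

# Hypothetical sample text containing the character from the error message.
text = "kontroll p\u00e5 v\u00e5pen"


def write_as(encoding):
    """Write the text through a stream that uses the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


try:
    write_as("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # Same error class as in the question: 'ascii' codec can't encode '\xe5'
    ascii_failed = True

utf8_bytes = write_as("utf-8")

print(ascii_failed)
print(utf8_bytes)
```

When writing to a file from Python itself, passing encoding="utf-8" to open() sidesteps the locale entirely.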
I'm using Eclipse+PyDev to write code and often face unicode issues when moving this code to production. The reason is shown in this little example
a = u'фыва '\
'фыва'
If Eclipse sees this, it creates a unicode string as if nothing were wrong, but if you type the same command directly into the Python shell (Python 2.7.3) you'll get this:
SyntaxError: (unicode error) 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
because the correct code is:
a = u'фыва '\
u'фыва'
But because of Eclipse+PyDev's "tolerance" I always get in trouble :( How can I force PyDev to "follow the rules"?
This happens because the encoding for the console is utf-8.
There's currently no way to set that globally in the UI, although you can change it by editing: \plugins\org.python.pydev_2.7.6\pysrc\pydev_sitecustomize\sitecustomize.py
And just remove the call to: (line 108) sys.setdefaultencoding(encoding)
This issue should be fixed in PyDev 3.4.0 (not released yet). Fabio (PyDev maintainer) says: "from now on PyDev will only set the PYTHONIOENCODING and will no longer change the default encoding". And PYTHONIOENCODING is supported since Python 2.6.
Here is the commit on GitHub.
Try adding # -*- coding: utf-8 -*- as the first line of your source files. It should make Python behave.
This solved the problem for me in my source code without having to modify the pydev sitecustomize.py file:
import sys
reload(sys).setdefaultencoding("utf-8")
You could use "ascii" or whatever other encoding you wanted to use.
In my case, when I ran the program on the command line, PyDev was using "utf-8", whereas the console was incorrectly setting "ascii".
This may not be what you are asking, but in my case I picked up these UTF-8 characters by accident when copying code from various sources. To find which character was causing trouble, I did the following in Eclipse Mars:
Edit->set encoding
other->US ASCII
Then I tried to save my file and got a modal window saying "Save problems". It had a "Select First Character" button.
It showed me the offending character, and I just deleted it and typed the ASCII one.
I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question 2: I tried the latin-1 encoding on the off chance it would turn the string into a unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal used to encode my string?
Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.
I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows
Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
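The same round trip in Python 3 terms, as a small sketch: bytes are decoded into str, and str is encoded into bytes. Only the byte value and the code page names come from the session above; the variable names are illustrative.

```python
# The byte the terminal handed to Python in the question.
raw = b"\x89"

# Decoding with the terminal's code page gives the intended character.
text = raw.decode("cp437")
print(ascii(text))  # ASCII-safe view of the decoded string

# One character, three different byte representations.
as_cp437 = text.encode("cp437")
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

print(as_cp437, as_utf8, as_latin1)
```

The latin-1 bytes happen to match the code point number, which is why u'\xeb' and a latin-1 encoded 'ë' look alike even though one is text and the other is bytes.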
The windows terminal uses legacy code pages for DOS. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use windows code pages. Python's IDLE will show the windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.
Avoid Windows Terminal
I'm not going out on a limb by saying that the 'terminal' (more appropriately, the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with PowerShell, I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
python myscript.py > output.txt
Then using Notepad++ you can then see the UTF-8 version of your output.
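If you want to confirm the effect of PYTHONIOENCODING without opening an editor, something like this Python 3 sketch should do. It starts a child interpreter with the variable set and inspects the raw bytes the child wrote. The child program here is a stand-in for your own script.

```python
import os
import subprocess
import sys

# Stand-in for myscript.py: prints a single non-ASCII character.
child_code = "print(u'\\xeb')"

# Copy the current environment and force the child's stdio encoding.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# check_output returns the child's stdout as raw bytes.
raw = subprocess.check_output([sys.executable, "-c", child_code], env=env)

# If the variable took effect, the bytes are valid UTF-8.
decoded = raw.decode("utf-8").strip()

print(raw)
print(decoded == "\u00eb")
```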
Install win-unicode-console
win-unicode-console can fix your problems. You should try it out
pip install win-unicode-console
If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
That runs it per script. Alternatively, you can follow their directions to add win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize.
In case others get this page when searching
Easiest way is to set the codepage in the terminal first
CHCP 65001
then run your program.
working well for me.
For PowerShell, start it with
powershell.exe -NoExit /c "chcp.com 65001"
It's from python: unicode in Windows terminal, encoding used?
Read through this python HOWTO about unicode after you read this section from the tutorial
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question, they are different because only when using u'' are you creating a unicode string.
2nd question:
sys.getdefaultencoding()
returns the default encoding
But to quote from link:
Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings (see StrIsNotAString) to unicode strings.
You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all, it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab, and in Win-1252 it's actually the same as the unicode code point, i.e. hex eb. So, it's a bit of a mystery.
It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
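If you have a byte and know which character it was supposed to be, you can narrow down the code page by brute force. A rough Python 3 sketch; the candidate list and the helper name encodings_matching are my own choice, not anything the terminal reports.

```python
def encodings_matching(raw, expected, candidates):
    """Return the candidate encodings that decode raw to the expected text."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            # This encoding cannot decode the bytes at all.
            pass
    return matches


# A few encodings one might plausibly meet on Windows.
candidates = ["utf-8", "latin-1", "cp1252", "cp437", "cp850", "cp1251"]

# The byte from the question and the character that was typed.
matches = encodings_matching(b"\x89", "\u00eb", candidates)
print(matches)
```

More than one code page can match a single byte, so a longer sample of text narrows it down further.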
Actually, a unicode object has no 'encoding'. You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/ .
You are on a Russian version of Windows, right? Your terminal uses cp1251.
As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when you get such errors?
If so, try to open it with
import codecs
f = codecs.open('filename.txt','r','utf-8')
I have an XMPP bot written in Python. One of its plugins can execute OS commands and send the output to the user. As far as I know, the output should be unicode to send it over the XMPP protocol, so I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output,'utf-8','ignore')
bot.send(xmpp.Message(mess.getFrom(),output))
But when Russian symbols appear in output they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in python console I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1:
ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
sys.getdefaultencoding() returns python's default encoding - which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
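Putting that together with the command output from the question, a minimal Python 3 sketch might look like this. The helper name run_and_decode is made up, and the locale's preferred encoding is only a first guess: console programs on Windows may write in the OEM code page instead.

```python
import locale
import subprocess
import sys


def run_and_decode(command):
    """Run a command and decode its output with the locale's preferred encoding."""
    raw = subprocess.check_output(command)
    encoding = locale.getpreferredencoding(False)
    # errors="replace" keeps going if the guess is wrong for some bytes
    return raw.decode(encoding, errors="replace")


# A harmless stand-in for the OS command the bot would execute.
output = run_and_decode([sys.executable, "-c", "print('ok')"])
print(output.strip())
```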
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
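A sketch of that parsing step, written for Python 3. The function name codepage_from_chcp is hypothetical. Stripping a trailing dot is a defensive assumption on my part, in case a localized message ends with one.

```python
import codecs


def codepage_from_chcp(chcp_output):
    """Turn the text printed by chcp into a Python codec name such as 'cp850'."""
    # The number is the last whitespace-separated token, whatever the label says.
    number = chcp_output.split()[-1].rstrip(".")
    name = "cp" + number
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


print(codepage_from_chcp("Active code page: 850\n"))
```

On a real system you would feed it the result of running chcp, for example os.popen('chcp').read() as shown above.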
In Python 'cp855', 'cp866', 'cp1251', 'iso8859_5', 'koi8_r' are differing Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command lists the code page used by console commands. That won't necessarily be the same code page as Windows applications. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.
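To illustrate why the choice matters, here is a short Python 3 sketch that encodes a Russian word with cp866 and then decodes the same bytes with each of the code pages listed above. The sample word is arbitrary.

```python
# -*- coding: utf-8 -*-
text = "Привет"

# Pretend this is what a console command produced on a cp866 console.
raw = text.encode("cp866")

results = {}
for name in ["cp855", "cp866", "cp1251", "iso8859_5", "koi8_r"]:
    # errors="replace" avoids exceptions for bytes a code page leaves undefined
    results[name] = raw.decode(name, errors="replace")

# Only the code page that was used for encoding gives the text back.
correct = [name for name, decoded in results.items() if decoded == text]
print(correct)
```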
I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
Just writing the variable name sends repr(name) to the standard output, and repr() escapes all non-ASCII characters.
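The difference between the escaped representation and the actual text can be seen in this small sketch. It is written for Python 3, where ascii() plays the role that repr() plays for unicode strings in Python 2. The name is taken from the question's output.

```python
name = "Jacob \u00c5berg"

# Escaped, ASCII-only representation (what the interactive prompt showed).
escaped = ascii(name)

# The bytes a UTF-8 terminal expects to receive.
utf8_bytes = name.encode("utf-8")

# The same text cannot be represented in ASCII at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(escaped)
print(utf8_bytes)
print(ascii_ok)
```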
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using a legacy, limited, or misconfigured console. If you're just trying to play with unicode at the interactive prompt, move to a modern unicode-aware console. Most modern Python distributions come with IDLE, where you'll be able to print all unicode characters.
Convert it to a unicode string through:
print unicode(name)