python... encoding issue when using linux > [duplicate]

python... encoding issue when using linux > [duplicate] - python

This question already has answers here:
Understanding Python Unicode and Linux terminal
(2 answers)
Closed 9 years ago.
simple test program of an encoding issue:
#!/bin/env python
# -*- coding: utf-8 -*-
print u"Råbjerg" # >>> unicodedata.name(u"å") = 'LATIN SMALL LETTER A WITH RING ABOVE'
here is what i get when i use it from a debian command box, i do not understand why using redirect here broke the thing, as i can see it correctly when using without.
can someone help to understand what i have missed? and what should the right way to print this characters so that they are ok everywhere?
$ python testu.py
Råbjerg
$ python testu.py > A
Traceback (most recent call last):
File "testu.py", line 3, in <module>
print u"Råbjerg"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 1: ordinal not in range(128)
using debian Debian GNU/Linux 6.0.7 (squeeze) configured with:
$ locale
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
EDIT: from other similar questions seen later from the pointing done below
#!/bin/env python1
# -*- coding: utf-8 -*-
import sys, locale
s = u"Råbjerg" # >>> unicodedata.name(u"å") = 'LATIN SMALL LETTER A WITH RING ABOVE'
if sys.stdout.encoding is None: # if it is a pipe, seems python2 return None
s = s.encode(locale.getpreferredencoding())
print s

When redirecting the output, sys.stdout is not connected to a terminal and Python cannot determine the output encoding. When not directing the output, Python can detect that sys.stdout is a TTY and will use the codec configured for that TTY when printing unicode.
Set the PYTHONIOENCODING environment variable to tell Python what encoding to use in such cases, or encode explicitly.

Use: print u"Råbjerg".encode('utf-8')
Similar question was asked today : Understanding Python Unicode and Linux terminal

I'll suggest you to output it already encoded:
print u"Råbjerg".encode('utf-8')
This will write the correct bytes of the string in utf-8 and you'll be able to see in almost every editor/console which support utf-8

Related

Proper use of unicode characters in python3 - Force utf-8 encoding

I'm going crazy here. The internet and this SO question tell me that in python 3.x, the default encoding is UTF-8. In addition to that, my system's default encoding is UTF-8. In addition to that, I have # -*- coding: utf-8 -*- at the top of my python 3.5 file.
Still, python is using ascii:
# -*- coding: utf-8 -*-
mystring = "Ⓐ"
print(mystring)
Greets me with:
SyntaxError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
I've also tried this: print(mystring.encode("utf-8")) and .decode("utf-8") - Same thing.
What am I missing here? How do I force python to stop using ascii encoding?
Edit: I know that it seems weird to complain about position 7 with a one character string, but this is my actual MCVE and the exact output I'm getting. The above is using python shell, the below is in a script. Both use python 3.5.2.
Edit: Since I figured it might be relevant: The string I'm getting comes from an external application and is not hardcoded, so I need a way to get that utf-8 string and save it into a file. The above is just a minimalized and generalized example. Here is my real-life code:
# the variables being a string that might contain unicode characters
mystring = "username: " + fromuser + " | printname: " + fromname
with open("myfile.txt", "a") as myfile:
myfile.write(mystring + "\n")

In Python3 all strings are unicode, so the problem you're having is likely due to your locale settings not being correct. The Python3 interpreter looks to use the locale environment variables and if it cannot find them it emulates basic ASCII
From locale.py:
except ImportError:
# Locale emulation
CHAR_MAX = 127
LC_ALL = 6
LC_COLLATE = 3
LC_CTYPE = 0
LC_MESSAGES = 5
LC_MONETARY = 4
LC_NUMERIC = 1
LC_TIME = 2
Error = ValueError
Double check the locale on your shell from which you are executing. Here are a few work arounds you can try to see if they get you working before you go through the task of getting your env setup correctly.
1) Validate UTF-8 locale or language files are installed (see link above)
2) Try adding this to the top of your script
#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
or
#!/usr/bin/env LANG=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
Or export shell variables before executing the Python interpreter
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
python3
>>> print('カタカナ')
Sorry I cannot be more specific, as these settings are platform and OS specific. You can forcefully attempt to set the locale in Python directly using the locale module, but I don't recommend that, and it won't help if they are not installed.
Hope that helps.

What's new in Python 3.0 says:
All text is Unicode; however encoded Unicode is represented as binary
data
If you want to try outputting utf-8, here's an example:
b'\x41'.decode("utf-8", "strict")
If you'd like to use unicode in a string literal, use the unicode escape and its coded representation. For your example:
print("\u24B6")

Django shell encoding error (Debian only, Ubuntu fine)

Good day
Can somebody explain what is going on behind the Django manage.py shell console?
The problem is following. I'm developing a Django app, which is using an urllib to parse some html pages to get some info from them. And that info is in russian language, so it should be unicode (this is address string in this case). Next, my script feeds this to some other third-party module which falls, because it is not valid unicode string (I'm trying to geodecode point from address).
I tried to print the string (parsed address in this case) to console with print address command but it fails:
File "<console>", line 1, in <module>
... some useless stacktrace ...
print address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Now comes the interesting part.
I have 2 computers: workstation with Ubuntu and Python 2.7.2 and Debian Lenny VPS with Python 2.7.2. I start parser the same way on both machines: by executing python manage.py shell and calling my function from it.
First I got the same error on both installations, but then I noticed that my python encoding is set to 'ascii' (import sys; sys.getdefaultencoding()). And when I put
import sys; reload(sys).setdefaultencoding('utf-8')
into settings.py the problem solves for Ubuntu. Now I get proper print on it, e.g. г. Челябинск, ул. Кирова, д. 27, КТК Набережный, but this is not working for Debian.
If i delete this print address string than, I get non-readable geolocation errors, but again - only on Debian. Ubuntu is working just fine:
Failed to geodecode address [Ð³. Ð§ÐµÐ»ÑÐ±Ð¸Ð½ÑÐº, ÑÐ». 1-Ð¾Ð¹ ÐÑÑÐ¸Ð»ÐµÑÐºÐ¸, 17/1, ÑÑÐ½Ð¾Ðº ÐÑÑÐ°Ðº, 1-Ð·]
No amount of unicode(address).encode('utf-8') magic can help this.
So I just can't get it. What's the differences between machines that cause me so much trouble?

If you run the following python script, you'll see what's happening:
# -*- coding: utf-8 -*-
a = r"Челябинск"
print "Encode from UTF-8 to UTF-8:",unicode(a,'utf-8').encode('utf-8')
print "Encode from ISO8859-1 to UTF-8:",unicode(a,'iso8859-1').encode('utf-8')
The output is:
Encode from ISO8859-1 to UTF-8: Челябинск
Encode from ISO8859-1 to UTF-8: Ð§ÐµÐ»ÑÐ±Ð¸Ð½ÑÐº
In essence you're taking a string encoded (already) as UTF-8 and re-encoding it (a second time, as if it were ISO8859-1) into UTF-8.
It's worth checking what the default encoding of the machine is in each case.
If anyone can add to this answer then please do.

handling unicode strings in Windows

For the first time, I was trying out one of my Python scripts, which deals with unicode characters, on Windows (Vista) and found that it's not working. The script works perfectly okay on Linux and OS X but no joy on Windows. Here is the little script that I tried:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys, codecs
reload(sys)
sys.setdefaultencoding('utf-8')
print "\nDefault encoding\t: %s" % sys.getdefaultencoding()
print "sys.stdout.encoding\t: %s\n" % sys.stdout.encoding
## Unicode strings
ln1 = u"?0>9<8~7|65\"4:3}2{1+_)(*&^%$£#!/`\\][=-"
ln2 = u"mnbvc xzasdfghjkl;'poiuyàtrewq€é#¢."
refStr = u"%s%s" % (ln2,ln1)
print "refSTR: ", refStr
for x in refStr:
print "%s => %s" % (x, ord(u"%s" % x))
When I run the script from Windows CLI, I get this error:
C:\Users\san\Scripts>python uniCode.py
Default encoding : utf-8
sys.stdout.encoding : cp850
refSTR; Traceback (most recent call last):
File "uniCode.py", line 18, in <module>
print "refSTR; ", refStr
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position
30: character maps to <undefined>
I came across this Python-wiki and tried a few things from there but that didn't work. Does anyone know what I'm still missing? Any help greatly appreciated. Cheers!!

The Windows console has a Unicode API, but not utf-8. Python is trying to encode Unicode characters to your console's 8-bit code page cp850, which obviously won't work. There's supposedly a code page (chcp 65001) in the Windows console that supports utf-8, but it's severely broken. Read issue 1602 and look at sys_write_stdout.patch and unicode2.py, which use Unicode wide character functions such as WriteConsoleOutputW and WriteConsoleW. Unfortunately it's a low priority issue.
FYI, you can also use IDLE, or another GUI console (based on pythonw.exe), to run a script that outputs Unicode characters. For example:
C:\pythonXX\Lib\idlelib\idle.pyw -r script.py
But it's not a general solution if you need to write CLI console tools.

The setdefaultencoding and getdefaultencoding denote the encoding followed by the python interpreter and while you use sys.stdout.encoding, it denotes the encoding used by your terminal. You can verify this, if you were to write it in the file vs print in the terminal.
The way to 'fix' this program would be to set the terminal encoding to something that you want (utf-8) or write to the file and open the output in the editor which supports those particular characters.

python: unicode in Windows terminal, encoding used?

I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question2: I tried using the latin-1 encoding on good luck to turn the string into an unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal has used to encode my string?
Question 3: how can I make the terminal print ë as ë instead of '\x89' or u'xeb'? Hmm, stupid me. print(s) does the job.
I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows

Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
The windows terminal uses legacy code pages for DOS. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use windows code pages. Python's IDLE will show the windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.

Avoid Windows Terminal
I'm not going out on a limb by saying the 'terminal' more appropriately the 'DOS prompt' that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with Powershell, I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
./myscript.py > output.txt
Then using Notepad++ you can then see the UTF-8 version of your output.
Install win-unicode-console
win-unicode-console can fix your problems. You should try it out
pip install win-unicode-console
If you are interested in a through discussion on the issue of python and command-line output check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
Runs it per script or you can follow their directions to add win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize.

In case others get this page when searching
Easiest way is to set the codepage in the terminal first
CHCP 65001
then run your program.
working well for me.
For power shell start it with
powershell.exe -NoExit /c "chcp.com 65001"
Its from python: unicode in Windows terminal, encoding used?

Read through this python HOWTO about unicode after you read this section from the tutorial
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question, they are different because only when using u''are you creating a unicode string.
2nd question:
sys.getdefaultencoding()
returns the default encoding
But to quote from link:
Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings StrIsNotAString to unicode strings.

You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all, it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab and in Win-1252 it's actually the same as the unicode code-point - ie hex eb. So, it's a bit of a mystery.

It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'

Actually, unicode object has no
'encoding'. You should read up on
Unicode in python to avoid constant
confusion. This presentation looks
adequate -
http://farmdev.com/talks/unicode/ .
You are on russian version of
windows, right? You terminal uses
cp1251.

As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when get such errors?
If so, try to open it with
import codecs
f = codecs.open('filename.txt','r','utf-8')

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
name
when I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)

# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.

If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.

Just writing the variable name sends repr(name) to the standard output and repr() encodes all unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy,
limited or misconfigured console. If
you're just trying to play with
unicode at interactive prompt move to
a modern unicode-aware console. Most
modern Python distributions come with
IDLE where you'll be able to print all
unicode characters.

Convert it to a unicode string through:
print unicode(name)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python... encoding issue when using linux > [duplicate] - python

Use: print u"Råbjerg".encode('utf-8') Similar question was asked today : Understanding Python Unicode and Linux terminal

I'll suggest you to output it already encoded: print u"Råbjerg".encode('utf-8') This will write the correct bytes of the string in utf-8 and you'll be able to see in almost every editor/console which support utf-8

Related

Proper use of unicode characters in python3 - Force utf-8 encoding

Django shell encoding error (Debian only, Ubuntu fine)

handling unicode strings in Windows

python: unicode in Windows terminal, encoding used?

Unicode problems in PyObjC

Categories

Resources