UnicodeDecodeError with an unicode variable

UnicodeDecodeError with an unicode variable - python

I'm using an unicode variable and replacing some characters, but when I try to process a certain value it raises an error of UnicodeDecodeError, when I have set at the beginning of python's file the coding.
I tried this coding: iso-8859-15, cp1251,and I took a look to this but doesn't when the variable's value contains this character: `
At the terminal this works:
a='Don\xb4t dream it\xb4s over'
a = a.replace("\xb4","'")
print a
output: Don't dream it's over
Why does it work in the terminal but not in my python's file?.

The code is working for me. Here's what I have done:
Copy the following code to a file, and name it as test.py
a='Don\xb4t dream it\xb4s over'
a = a.replace("\xb4","'")
print a
Run test.py python ./test.py, and here's the output
Don't dream it's over
My python version is Python 2.7.3

You need to decode from the proper code page into Unicode. Then if you need it in another code page (such as UTF-8) you can re-encode it. WHen you use print Python will try to encode it to the code page of your terminal automatically.
>>> a = a.decode('iso-8859-1')
>>> print a
Don´t dream it´s over
Edit: Trying to decipher the actual question is difficult. Perhaps you're trying to read the text from a file and that's what's not working? Again it's important to know the encoding of the file. Many modern files use UTF-8 encoding.
a = f.readline()
a = a.decode('utf-8')
print a

Related

How to read excel Unicode characters using Python

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".
My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.
When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)
Being more concrete I perform the following:
Read the Excel:
xlApp = Dispatch("Excel.Application")
excel = xlApp.Workbooks.Open(excel_location)
in an internal loop I extract the value of the cell:
cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)
and finally, if I try to convert cell_value to str, crashes:
print str(cell_value)
If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.
What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.
Any idea how can I progress?

This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.
The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).
Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.
Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1 markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).
That's when the ascii encoder says: Sorry, can't help you there.
That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:
s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii
If your terminal supports UTF-8, then you need to tell Python about it:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.
Related:
Python 2 Unicode howto
Pragmatic Unicode, or, How do I stop the pain?
Setting the correct encoding when piping stdout in Python

.Cells(row,col) returns a Range object. You probably want the text from the cell:
cell = xl.ActiveSheet.Cells(1,2).Text
or
cell = xl.ActiveSheet.Range('B1').Text
The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding), for example:
bytes = cell.encode('utf8')
The below example uses the following spreadsheet:
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value
Output (Note, Chinese will only print if console/IDE supports the characters):
<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人

What is described here is a hack, you should not use as a long term
solution. Looking at the comments it could crush the terminal.
Finally I found a solution helped by the suggestion that #Huan-YuTseng provided, probably the solutions offered by other might work in other context but not in this one.
So, what happened is that I migrated from Eclipse Juno version (as Pydev stopped working due to Java upgrade needed that I can't accomplish in this computer) to LiClipse direct package (I did not upgraded a downloaded Eclipse version).
By default, in my LiClipse version (1.4.0.201502042042) the Console output is not by default utf-8. So I needed to change the output from either LiClipse or using my code. Fourtunately, there was another question related to a similar problem that helped me. You can see more details here, but essentially what you need to do is to include at the begginning of your code the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
And everything works. In the answers from #AarongDigulla the solution is there, but is actually the very last solution.
However, I need to say that LiClipse is giving me an error on sys.setdefaultencoding statement, that during execution is not creating any issue... no idea what's happening. That stopped me testing this solution before. Maybe there is something wrong in LiClipse (is alowing me to execute code with errors!)

Use 'utf-8 BOM' which in python used as utf_8_sig for Unicode character & also to avoid irrelevant results in Excel sheet.

UnicodeEncodeError when parsing XML using cElementTree within Applescript

Apologies if this is a duplicate or something really obvious, but please bear with me as I'm new to Python. I'm trying to use cElementTree (Python 2.7.5) to parse an XML file within Applescript. The XML file contains some fields with non-ASCII text encoded as entities, such as <foo>café</foo>.
Running the following basic code in Terminal outputs pairs of tags and tag contents as expected:
import xml.etree.cElementTree as etree
parser = etree.XMLParser(encoding="utf-8")
tree = etree.parse("myfile.xml", parser=parser)
root = tree.getroot()
for child in root:
print child.tag, child.text
But when I run that same code from within Applescript using do shell script, I get the dreaded UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128).
I found that if I change my print line to
print [child.tag, child.text]
then I do get a string containing XML tag/value pairs wrapped in [''], but any non-ASCII characters then get passed onto Applescript as the literal Unicode character string (so I end up with u'caf\\xe9').
I tried a couple of things, including a.) reading the .xml file into a string and using .fromstring instead of .parse, b.) trying to convert the .xml file to str before importing it into cElementTree, c.) just sticking .encode wherever I could to see if I could avoid the ASCII codec, but no solution yet. I'm stuck using Applescript as a container, unfortunately. Thanks in advance for advice!

You need to encode at least child.text into something that Applescript can handle. If you want the character entity references back, this will do it:
print child.tag.encode('ascii', 'xmlcharrefreplace'), child.text.encode('ascii', 'xmlcharrefreplace')
Or if it can handle something like utf-8:
print child.tag.encode('utf-8'), child.text.encode('utf-8')

Not AppleScript's fault - it's Python being "helpful" by guessing for you what output encoding to use. (Unfortunately, it guesses differently depending whether or not a terminal is attached.)
Simplest solution (Python 2.6+) is to set the PYTHONIOENCODING environment variable before invoking python:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python '/path/to/script.py'"
or:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python << EOF
# -*- coding: utf-8 -*-
# your Python code goes here...
print u'A Møøse once bit my sister ...'
EOF"

UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function [duplicate]

This question already has answers here:
Python, Unicode, and the Windows console
(15 answers)
Closed 5 years ago.
I am writing a Python (Python 3.3) program to send some data to a webpage using POST method. Mostly for debugging process I am getting the page result and displaying it on the screen using print() function.
The code is like this:
conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'));
the HTTPResponse .read() method returns a bytes element encoding the page (which is a well formated UTF-8 document) It seemed okay until I stopped using IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash) which the print function translates well in the Windows GUI (I presume Code Page 1252) but does not in the Windows Console (Code Page 850). Given the strict default behavior I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>
I could fix it using this quite ugly code:
print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))
Now it replace the offending character "—" with a ?. Not the ideal case (a hyphen should be a better replacement) but good enough for my purpose.
There are several things I do not like from my solution.
The code is ugly with all that decoding, encoding, and decoding.
It solves the problem for just this case. If I port the program for a system using some other encoding (latin-1, cp437, back to cp1252, etc.) it should recognize the target encoding. It does not. (for instance, when using again the IDLE GUI, the emdash is also lost, which didn't happen before)
It would be nicer if the emdash translated to a hyphen instead of a interrogation bang.
The problem is not the emdash (I can think of several ways to solve that particularly problem) but I need to write robust code. I am feeding the page with data from a database and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00c1 (which is possible in my database) could translate into CP-850 (DOS/Windows Console encodign for Western European Languages) but not into CP-437 (encoding for US English, which is default in many Windows instalations).
So, the question:
Is there a nicer solution that makes my code agnostic from the output interface encoding?

I see three solutions to this:
Change the output encoding, so it will always output UTF-8. See e.g. Setting the correct encoding when piping stdout in Python, but I could not get these example to work.
Following example code makes the output aware of your target charset.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This example properly replaces any non-printable character in my name with a question mark.
If you create a custom print function, e.g. called myprint, using that mechanisms to encode output properly you can simply replace print with myprint whereever necessary without making the whole code look ugly.
Reset the output encoding globally at the begin of the software:
The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary what to do to change output encoding. Especially the section "StreamWriter Wrapper around Stdout" is interesting. Essentially it says to change the I/O encoding function like this:
In Python 2:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')
In Python 3:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
If used in CGI outputting HTML you can replace 'strict' by 'xmlcharrefreplace' to get HTML encoded tags for non-printable characters.
Feel free to modify the approaches, setting different encodings, .... Note that it still wont work to output non-specified data. So any data, input, texts must be correctly convertable into unicode:
# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker" # works
print "Stöcker".decode("utf-8") # works
print "Stöcker" # fails

Based on Dirk Stöcker's answer, here's a neat wrapper function for Python 3's print function. Use it just like you would use print.
As an added bonus, compared to the other answers, this won't print your text as a bytearray ('b"content"'), but as normal strings ('content'), because of the last decode step.
def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
enc = file.encoding
if enc == 'UTF-8':
print(*objects, sep=sep, end=end, file=file)
else:
f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
print(*map(f, objects), sep=sep, end=end, file=file)
uprint('foo')
uprint(u'Antonín Dvořák')
uprint('foo', 'bar', u'Antonín Dvořák')

For debugging purposes, you could use print(repr(data)).
To display text, always print Unicode. Don't hardcode the character encoding of your environment such as Cp850 inside your script. To decode the HTTP response, see A good way to get the charset/encoding of an HTTP response in Python.
To print Unicode to Windows console, you could use win-unicode-console package.

I dug deeper into this and found the best solutions are here.
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
In my case I solved "UnicodeEncodeError: 'charmap' codec can't encode character "
original code:
print("Process lines, file_name command_line %s\n"% command_line))
New code:
print("Process lines, file_name command_line %s\n"% command_line.encode('utf-8'))

If you are using Windows command line to print the data, you should use
chcp 65001
This worked for me!

If you use Python 3.6 (possibly 3.5 or later), it doesn't give that error to me anymore. I had a similar issue, because I was using v3.4, but it went away after I uninstalled and reinstalled.

Why all those unicode commands works CORRECT in Python? They all print my character correctly, no matter what i do

Probably I completely don't understand it, so can you take a look at code examples and tell my what should I do, to be sure it will work?
I tried it in Eclipse with Pydev. I use python 2.6.6 (becuase of some library that not support python 2.7).
First, without using codecs module
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever i do, it not ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Every each of those prints works correctly and I don't get any funny characters.
Also i tried this: i delete all files made in the previous test and change only those lines:
file1 = open("samoloty1.txt", "w")
to those:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone make some examples what works, and what not?
Should this be separate question?
I am downloading web pages, through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and enough? What should I do next with it to store it in utf-8 file? (I ask it because whatever I did before, it just works...)
** UPDATE **
Probably everything works ok because PyDev (or just Eclipse) has terminal encoded in UTF-8. So for tests i used cmd from Windows 7 and i get some errors. Now everything was crashing as expected. :D Here i am showing what i changed to get it working again (and all of those changes are reasonable for me and they agree with what i learn in answers and in docs in Python documentations).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And like someone said, when i use codecs, i not need to use encode() everytime before writing to file. :)
Question is, which answer should be marked as correct?

You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method method of a file object (sys.stdout) the object is implicitly decoded with its encoding attribute.
Thouse who work in Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?

All those write exercises in the code snippet actually boil down to two situations:
when you write string to the file
when you try to write unicode string to the file
Lets call string as s and unicode string as u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to site's python), but the following expectedly breaks here:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
File "ex.py", line 5, in <module>
ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
It means, that unicode string should be changed to string before writing to file. And your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic string -*- coding: utf-8 -*- is the one which enables you to write unicode string literals in a WYSIWYG way: u"ą✈✈", it doesn't help you to determine your encoding in any other situation.
Thus, do not give .write() method in Python2.6 any unicode string. The good practice is to work with unicode strings in your code but convert from/to concrete encoding at the input/output borders.
The codecs example is good, as well as urllib.

What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes in to your application (e.g., open(), urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode to either replace or ignore bad characters
try harder to find out what the real encoding is
fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., html, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity.
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).

Because you are correctly using the magic "coding comment," everything works as supposed.

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
name
when I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)

# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.

If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.

Just writing the variable name sends repr(name) to the standard output and repr() encodes all unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy,
limited or misconfigured console. If
you're just trying to play with
unicode at interactive prompt move to
a modern unicode-aware console. Most
modern Python distributions come with
IDLE where you'll be able to print all
unicode characters.

Convert it to a unicode string through:
print unicode(name)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.