UnicodeEncodeError when parsing XML using cElementTree within Applescript

Apologies if this is a duplicate or something really obvious, but please bear with me as I'm new to Python. I'm trying to use cElementTree (Python 2.7.5) to parse an XML file within Applescript. The XML file contains some fields with non-ASCII text encoded as entities, such as <foo>caf&#233;</foo>.
Running the following basic code in Terminal outputs pairs of tags and tag contents as expected:
import xml.etree.cElementTree as etree
parser = etree.XMLParser(encoding="utf-8")
tree = etree.parse("myfile.xml", parser=parser)
root = tree.getroot()
for child in root:
    print child.tag, child.text
But when I run that same code from within Applescript using do shell script, I get the dreaded UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128).
I found that if I change my print line to
print [child.tag, child.text]
then I do get a string containing XML tag/value pairs wrapped in [''], but any non-ASCII characters then get passed onto Applescript as the literal Unicode character string (so I end up with u'caf\\xe9').
I tried a couple of things, including a.) reading the .xml file into a string and using .fromstring instead of .parse, b.) trying to convert the .xml file to str before importing it into cElementTree, c.) just sticking .encode wherever I could to see if I could avoid the ASCII codec, but no solution yet. I'm stuck using Applescript as a container, unfortunately. Thanks in advance for advice!

You need to encode at least child.text into something that Applescript can handle. If you want the character entity references back, this will do it:
print child.tag.encode('ascii', 'xmlcharrefreplace'), child.text.encode('ascii', 'xmlcharrefreplace')
Or if it can handle something like utf-8:
print child.tag.encode('utf-8'), child.text.encode('utf-8')
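
As a rough illustration of the two options, the sketch below applies both encodings to a sample value. It is written so that it also runs on Python 3 (the question uses Python 2.7, where the same encode calls apply to unicode objects). The variable `text` is a hypothetical stand-in for `child.text`.

```python
# Hypothetical stand-in for child.text after the parser has resolved the entity.
text = u"caf\xe9"

# Option 1: pure-ASCII bytes, with each non-ASCII character written as a
# numeric character reference.
as_char_refs = text.encode("ascii", "xmlcharrefreplace")

# Option 2: UTF-8 bytes, for consumers that can decode UTF-8.
as_utf8 = text.encode("utf-8")

print(as_char_refs)
print(as_utf8)
```

Which option fits depends on what the receiving side (here AppleScript) is expected to do with the bytes.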

Not AppleScript's fault - it's Python being "helpful" by guessing for you what output encoding to use. (Unfortunately, it guesses differently depending whether or not a terminal is attached.)
Simplest solution (Python 2.6+) is to set the PYTHONIOENCODING environment variable before invoking python:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python '/path/to/script.py'"
or:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python << EOF
# -*- coding: utf-8 -*-
# your Python code goes here...
print u'A Møøse once bit my sister ...'
EOF"
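
To see the effect of the variable without involving AppleScript, a small sketch like the following may help. It assumes Python 3 and starts a child interpreter with PYTHONIOENCODING set, then reports which encoding the child chose for its standard output. The helper name `child_stdout_encoding` is made up for this example.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return the
    canonical name of the encoding it selected for sys.stdout."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalise spellings such as "UTF-8" or "utf8" to the codec's canonical name.
    return codecs.lookup(output.decode("ascii").strip()).name


print(child_stdout_encoding("UTF-8"))
print(child_stdout_encoding("ascii"))
```

The child's output is a pipe rather than a terminal, which is the same situation as a script launched by do shell script.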

Related

UnicodeEncodeError: 'ascii' codec can't encode character in print function

My company uses a database, and I am writing a script that interacts with it. There is already a script that submits a query to the database and returns the results.
I am working in a Unix environment. My script calls that script to fetch some data from the database and redirects the query result to a file. When I then try to read this file, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9741: ordinal not in range(128)
I believe Python cannot read the file because of its encoding: the file is not ASCII, which is why the error occurs. I checked the file's encoding and tried reading the file with that encoding.
The code that I am using is-
# Imports needed by this excerpt
import codecs
import csv
import os
from collections import Counter

import chardet

os.system("Query.pl \"select title from bug where (ste='KGF-A' AND ( status = 'Not_Approved')) \">patchlet.txt")
encoding_dict3={}
encoding_dict3=chardet.detect(open("patchlet.txt", "rb").read())
print(encoding_dict3)
# Open the patchlet.txt file for storing the last part of titles for latest ACF in a list
with codecs.open("patchlet.txt",encoding='{}'.format(encoding_dict3['encoding'])) as csvFile:
    readCSV = csv.reader(csvFile,delimiter=":")
    for row in readCSV:
        if len(row)!=0:
            if len(row) > 1:
                j=len(row)-1
                patchlets_in_latest.append(row[j])
            elif len(row) ==1:
                patchlets_in_latest.append(row[0])
patchlets_in_latest_list=[]
# calling the strip_list_noempty function for removing newline and whitespace characters
patchlets_in_latest_list=strip_list_noempty(patchlets_in_latest)
# converting list of titles to a set to remove any duplicate entry if present
patchlets_in_latest_set= set(patchlets_in_latest_list)
# Finding duplicate entries in list
duplicates_in_latest=[k for k,v in Counter(patchlets_in_latest_list).items() if v>1]
# Printing imp info for logs
print("list of titles of patchlets in latest list are : ")
for i in patchlets_in_latest_list:
    print(str(i))  # <-- the UnicodeEncodeError is raised here
print("No of patchlets in latest list are : {}".format(str(len(patchlets_in_latest_list))))
Here Query.pl is the Perl script that fetches the query result from the database. The encoding detected for "patchlet.txt" (the file used for storing the result from HSD) is:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
Even when I supply that same encoding when reading the file, I still get the error.
Please help me resolve this error.
EDIT:
I am using python3.6
EDIT2:
The error occurs while outputting the result, and one line in the file contains an unknown character. The line looks like:
Some failure because of which vtrace cannot be used along with some trace.
I am using gvim, and in gvim the "vtrace" looks like "~Vvtrace". I then checked the database manually, and the character is "–", which according to my keyboard is neither a hyphen nor an underscore. Characters like this are causing the problem.
Also I am working on linux environment.
EDIT 3:
I have added more code that can help in tracing the error. Also I have highlighted a "print" statement (print(str(i))) where I am getting the error.

Problem
Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.
Specifically, this code:
for i in patchlets_in_latest_list:
print(str(i))
Results in this exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'
This behaviour was common in Python2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.
In Python3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str encodes it to sys.stdout.encoding, which defaults to the value returned by locale.getpreferredencoding. On Linux this is generally determined by your user's locale settings, such as the LANG environment variable.
Solution
If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.
Be 100% certain that the code is being executed by a Python3 interpreter: print sys.version_info from within the program.
Try setting the PYTHONIOENCODING environment variable when running your script: PYTHONIOENCODING=UTF-8 python3 myscript.py
Check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
If your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale, which could be using ASCII encoding, unless a locale is explicitly set. Likewise, if the script is run under a different user, check their locale settings.
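
The checks above can also be gathered from inside the program. The following is a minimal sketch, assuming Python 3. The function name `describe_encoding_environment` is invented for this example and is not part of any library.

```python
import locale
import sys


def describe_encoding_environment():
    """Collect the values that decide how print() encodes text."""
    return {
        "python_version": tuple(sys.version_info[:3]),
        # stdout may have been replaced by an object without an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in describe_encoding_environment().items():
    print("{}: {}".format(key, value))
```

Running this once from the terminal and once from the failing context (for example the cron job) should show whether the two environments differ.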
Workaround
If changing the environment is not feasible, you can work around the problem in Python by encoding to ASCII with an error handler, then decoding back to str.
There are four useful error handlers in your particular situation. Their effects are demonstrated with this code:
>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
...
Handler: ignore: Hello  World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello &#8211; World
Handler: namereplace: Hello \N{EN DASH} World
The ignore and replace handlers lose information: you can't tell which character has been dropped or replaced with a question mark.
The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.
If you decided to use the replace handler, you would change your code like this:
for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)
wherever you are printing data that might contain non-ASCII characters.
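
Another option, not part of the answer above and shown only as a sketch, is to wrap the underlying binary stream in a text wrapper that always encodes as UTF-8, so that individual print calls need no changes. The example assumes Python 3 and uses io.BytesIO as a stand-in for sys.stdout.buffer so the bytes can be inspected. The helper name `utf8_text_stream` is made up.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# io.BytesIO stands in for sys.stdout.buffer so the result can be inspected.
captured = io.BytesIO()
out = utf8_text_stream(captured)
print("Hello \u2013 World", file=out)
out.flush()
print(captured.getvalue())
```

This only helps if whatever consumes the output expects UTF-8. If the consumer really needs ASCII, the error-handler approach above is the safer choice.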

Decoding HTML Entities to Unicode

Since yesterday I've been having trouble with this. I need to save some text into a ".txt" file; the problem is that there are HTML entities in the text I'm trying to save.
So I imported HTMLParser in my code:
import HTMLParser
h = HTMLParser.HTMLParser()
print h.unescape(text)  # right?
This works when you print the result, but I'm trying to return it to a function of mine which actually saves the text to the file. When I try to save the file, the system says:
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)
I've been reading about this but I cannot conclude anything, I tried BeautifulSoup, I tried functions from famous pythonists and none worked. Can you help me with this? I need to save the text in the file as unicode and by unicode I understand it will save characters like: á, right?

"Save Unicode character to a file" is a different question from "Decoding HTML Entities to Unicode". Your code (h.unescape(text)) already decodes the html text correctly.
The exception is due to print unicode_text e.g.:
print u"\N{EURO SIGN}"
should produce a similar error.
If you're saving to a file by redirecting the output of the python script e.g.:
$ python -m your_module >output.txt #XXX raises an error for non-ascii data
then define PYTHONIOENCODING=utf-8 envvar (to save using utf-8 encoding):
$ PYTHONIOENCODING=utf-8 python -m your_module >output.txt
If you want to save to a file directly in your Python code, use io module:
import io
with io.open(filename, 'w', encoding='utf-8') as file:
    file.write(h.unescape(text))
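
For readers on Python 3, a comparable sketch is shown below. It assumes html.unescape as the replacement for HTMLParser.unescape, and it writes to a temporary file whose name is arbitrary. The sample input `text` is invented for the example.

```python
import html
import io
import os
import tempfile

# Hypothetical input containing HTML entities.
text = "&laquo;caf&eacute;&raquo;"
decoded = html.unescape(text)

# Write the decoded text as UTF-8, then read it back to confirm the round trip.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(decoded)
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

print(round_tripped == decoded)
```

Because the encoding is stated explicitly when the file is opened, the result does not depend on the locale or on PYTHONIOENCODING.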

UnicodeDecodeError with an unicode variable

I'm using a unicode variable and replacing some characters, but when I try to process a certain value it raises a UnicodeDecodeError, even though I have set the coding at the beginning of the Python file.
I tried these codings: iso-8859-15 and cp1251, and I took a look at this, but it doesn't work when the variable's value contains this character: ´
At the terminal this works:
a='Don\xb4t dream it\xb4s over'
a = a.replace("\xb4","'")
print a
output: Don't dream it's over
Why does it work in the terminal but not in my Python file?

The code is working for me. Here's what I have done:
Copy the following code to a file, and name it as test.py
a='Don\xb4t dream it\xb4s over'
a = a.replace("\xb4","'")
print a
Run test.py with python ./test.py; here's the output:
Don't dream it's over
My python version is Python 2.7.3

You need to decode from the proper code page into Unicode. Then if you need it in another encoding (such as UTF-8) you can re-encode it. When you use print, Python will try to encode it to the code page of your terminal automatically.
>>> a = a.decode('iso-8859-1')
>>> print a
Don´t dream it´s over
Edit: Trying to decipher the actual question is difficult. Perhaps you're trying to read the text from a file and that's what's not working? Again it's important to know the encoding of the file. Many modern files use UTF-8 encoding.
a = f.readline()
a = a.decode('utf-8')
print a
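
A small sketch of the decode-then-replace approach, written so that it also runs on Python 3, might look like this. The byte string is taken from the question, and the choice of iso-8859-1 is the assumption made in the answer above.

```python
# Raw bytes as they might arrive from a Latin-1 encoded source.
raw = b"Don\xb4t dream it\xb4s over"

# Decode first, then do the replacement on text rather than on bytes.
text = raw.decode("iso-8859-1")
fixed = text.replace(u"\xb4", u"'")
print(fixed)

# Decoding the same bytes with the wrong codec fails, because 0xB4 on its
# own is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

The failure in the second half is the same kind of UnicodeDecodeError the question describes, which is why knowing the real encoding of the input matters.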

Why all those unicode commands works CORRECT in Python? They all print my character correctly, no matter what i do

I probably don't understand this at all, so can you take a look at the code examples and tell me what I should do to be sure it will work?
I tried it in Eclipse with PyDev. I use Python 2.6.6 (because of a library that does not support Python 2.7).
First, without using the codecs module:
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever i do, it not ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Every one of those prints works correctly and I don't get any funny characters.
I also tried this: I deleted all files made in the previous test and changed only these lines:
file1 = open("samoloty1.txt", "w")
to these:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone give some examples of what works and what doesn't?
Should this be a separate question?
I am downloading web pages, through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and sufficient? What should I do next to store it in a utf-8 file? (I ask because whatever I did before, it just worked...)
** UPDATE **
Everything probably worked because PyDev (or just Eclipse) has its terminal encoded in UTF-8. So for testing I used cmd on Windows 7 and got some errors. Now everything was crashing as expected. :D Here I show what I changed to get it working again (all of these changes seem reasonable to me and agree with what I learned from the answers and from the Python documentation).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And like someone said, when I use codecs, I don't need to call encode() every time before writing to a file. :)
The question is, which answer should be marked as correct?

You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method of a file object such as sys.stdout, the object is implicitly encoded using the file's encoding attribute.
Those who work on Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?

All those write exercises in the code snippet actually boil down to two situations:
when you write a byte string to the file
when you try to write a unicode string to the file
Let's call the byte string s and the unicode string u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to your site's Python configuration), but the following breaks here, as expected:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
  File "ex.py", line 5, in <module>
    ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This means that a unicode string should be encoded to a byte string before it is written to a file. Your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic string -*- coding: utf-8 -*- is what enables you to write unicode string literals in a WYSIWYG way: u"ą✈✈". It doesn't help you to determine your encoding in any other situation.
Thus, do not pass a unicode string to the .write() method of a plain file object in Python 2.6. The good practice is to work with unicode strings in your code but convert from/to a concrete encoding at the input/output borders.
The codecs example is good, as is the urllib one.

What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes in to your application (e.g., open(), urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode to either replace or ignore bad characters
try harder to find out what the real encoding is
fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., html, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity.
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).
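
The principles above can be sketched as a tiny decode-process-encode pipeline. This is only an illustration, written for Python 3 (where unicode strings are str and binary data is bytes). The function names `read_text` and `write_bytes` are invented for the example.

```python
def read_text(raw_bytes, encoding):
    """Input border: turn incoming bytes into a unicode string."""
    return raw_bytes.decode(encoding)


def write_bytes(text, encoding, errors="strict"):
    """Output border: turn a unicode string into bytes for the destination."""
    return text.encode(encoding, errors)


# The characters "ą✈✈" as UTF-8 bytes, e.g. read from a file or a socket.
incoming = b"\xc4\x85\xe2\x9c\x88\xe2\x9c\x88"
text = read_text(incoming, "utf-8")

# Inside the application only the unicode string is used.
character_count = len(text)

# A target that can represent every character round-trips cleanly.
as_utf8 = write_bytes(text, "utf-8")

# A target that cannot needs an explicit policy, here numeric character references.
as_ascii = write_bytes(text, "ascii", "xmlcharrefreplace")

print(character_count, as_utf8 == incoming, as_ascii)
```

With the default errors="strict", the ASCII call would raise UnicodeEncodeError instead, which is the failure described in the list above.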

Because you are correctly using the magic "coding comment," everything works as expected.

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)

# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.

If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
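
The difference between the escaped form and the printable form can be reproduced without PyObjC. The sketch below assumes Python 3, where repr no longer escapes non-ASCII characters, so the built-in ascii() is used to show the escaped form that Python 2's repr produced. The sample name is taken from the question's output.

```python
# Sample value taken from the output shown in the question.
name = u"Jacob \xc5berg"

# Escaped, ASCII-only representation (what the interactive prompt echoed
# in Python 2).
escaped = ascii(name)

# Bytes suitable for a UTF-8 terminal or file.
encoded = name.encode("utf-8")

print(escaped)
print(encoded)
```

The string itself is the same in both cases. Only the representation chosen for display differs.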

Just writing the variable name sends repr(name) to the standard output, and repr() escapes all non-ASCII characters in unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy, limited or misconfigured console. If you're just trying to play with unicode at interactive prompt move to a modern unicode-aware console. Most modern Python distributions come with IDLE where you'll be able to print all unicode characters.

Convert it to a unicode string through:
print unicode(name)
