In my Python program, I use untangle to parse an XML file:
from untangle import parse
parse(xml)
The XML is encoded in UTF-8 and contains non-ASCII characters, and this is causing trouble in my program. When the xml string is passed to untangle, it tries to be smart and automatically checks whether it's a file name first. So it calls
os.path.exists(xml)
And it looks like the os module tries to convert it back to ASCII, which causes the following exception:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 169-172: ordinal not in range(128)
At the top of the file, I'm using this trick that supposedly works around the problem:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
Unfortunately, it didn't work.
I don't know what else can go wrong. Please help.
It’s a bit odd that untangle doesn’t offer direct functions for this.
The simplest solution would be to copy the relevant implementation of untangle.parse to parse files:
import untangle
from StringIO import StringIO

def parse_text(text):
    parser = untangle.make_parser()
    sax_handler = untangle.Handler()
    parser.setContentHandler(sax_handler)
    parser.parse(StringIO(text))  # parse from a string instead of a file name
    return sax_handler.root
Does decoding help in your case, as below? Reloading sys and setting utf-8 as the default encoding is not a good habit.
from untangle import parse
if isinstance(xml, str):
    xml = xml.decode("utf-8")
parse(xml)
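To see what the decode step does, here is a minimal, self-contained illustration (the byte string below is just an example value, not from the question):

```python
raw = b'caf\xc3\xa9'        # UTF-8 encoded bytes (a Python 2 str)
text = raw.decode('utf-8')  # a unicode object: u'caf\xe9'
```

Once the value is a unicode object, untangle's file-name check no longer trips over the implicit ASCII conversion.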
Apologies in advance if this post is not well written, as I'm extremely new to Python. I'm having a pretty simple/stupid problem with Python 3 and BeautifulSoup. I'm attempting to parse a CSV file without knowing what encoding each line will have, as each line contains raw data from several sources. Before I can even parse the file, I'm using BeautifulSoup in an attempt to clean it up (I'm not sure if this is a good idea):
from bs4 import BeautifulSoup

def main():
    try:
        soup = BeautifulSoup(open('files/sdk_breakout_1027.csv'))
    except Exception as e:
        print(str(e))
When I run this, however, I encounter the following error:
'ascii' codec can't decode byte 0xed in position 287: ordinal not in range(128)
My traceback points to this line in the CSV as the source of the problem:
500i(í£ : Android OS : 4.0.4
What is a better way to go about this? I just want to convert all rows in this CSV to a uniform encoding so I can parse it later.
Thanks for your help.
Guessing the encoding of random data will never be perfect, but if you know something about your data source, you may be able to do better.
Alternatively, you can open as UTF-8 and either ignore or replace errors:
import csv

with open("filename", encoding="utf8", errors="replace") as f:
    for row in csv.reader(f):
        print(", ".join(row))
You can't parse a CSV file with BeautifulSoup, only HTML or XML.
If you want to use the charset guessing from BeautifulSoup on its own, you can. See the Unicode, Dammit section of the docs. If you have a complete list of all of the encodings that might have been used, but just don't know which one in that list was actually used, pass that list to Dammit.
There's a different charset-guessing library known as chardet that you also might want to try. (Note that Dammit will use chardet if you have it installed, so you might not need to try it separately.)
But both of these just make educated guesses; the documentation explains all the myriad ways they can fail.
Also, if each line is encoded differently (which is an even bigger mess than usual), you will have to Dammit or chardet each line as if it were a separate file. With much less text to work with, the guessing is going to be much less accurate, but there's nothing you can do about that if each line really is potentially in a different encoding.
Putting it all together, it would look something like this:
import csv
from bs4 import UnicodeDammit

encodings = ['utf-8', 'latin-1', 'cp1252', 'shift-jis']

def dammitize(f):
    for line in f:
        yield UnicodeDammit(line, encodings).unicode_markup

with open('foo.csv', 'rb') as f:
    for row in csv.reader(dammitize(f)):
        do_something_with(row)
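If you'd rather avoid a third-party dependency, the try-each-candidate idea can be sketched with the stdlib alone. This is a rough approximation of the approach, not what UnicodeDammit actually does internally; latin-1 accepts any byte sequence, so putting it last makes it a never-fail fallback:

```python
def guess_decode(raw, encodings=('utf-8', 'latin-1')):
    # try each candidate encoding in order; the first one that
    # decodes cleanly wins
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue

# valid UTF-8 decodes on the first try; a lone latin-1 byte falls through
utf8_line = guess_decode(b'500i caf\xc3\xa9')
latin_line = guess_decode(b'500i caf\xe9')
```

Both calls return the same text here, but in general a wrong guess decodes "successfully" to the wrong characters, which is exactly why per-line guessing is unreliable.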
Apologies if this is a duplicate or something really obvious, but please bear with me as I'm new to Python. I'm trying to use cElementTree (Python 2.7.5) to parse an XML file from within AppleScript. The XML file contains some fields with non-ASCII text encoded as entities, such as <foo>caf&#233;</foo>.
Running the following basic code in Terminal outputs pairs of tags and tag contents as expected:
import xml.etree.cElementTree as etree

parser = etree.XMLParser(encoding="utf-8")
tree = etree.parse("myfile.xml", parser=parser)
root = tree.getroot()
for child in root:
    print child.tag, child.text
But when I run that same code from within AppleScript using do shell script, I get the dreaded UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128).
I found that if I change my print line to
print [child.tag, child.text]
then I do get a string containing XML tag/value pairs wrapped in [''], but any non-ASCII characters are then passed on to AppleScript as the literal Unicode escape string (so I end up with u'caf\\xe9').
I tried a couple of things, including a.) reading the .xml file into a string and using .fromstring instead of .parse, b.) trying to convert the .xml file to str before importing it into cElementTree, c.) just sticking .encode wherever I could to see if I could avoid the ASCII codec, but no solution yet. I'm stuck using AppleScript as a container, unfortunately. Thanks in advance for advice!
You need to encode at least child.text into something that Applescript can handle. If you want the character entity references back, this will do it:
print child.tag.encode('ascii', 'xmlcharrefreplace'), child.text.encode('ascii', 'xmlcharrefreplace')
Or if it can handle something like utf-8:
print child.tag.encode('utf-8'), child.text.encode('utf-8')
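To see the difference between the two error handlers, here is a small standalone demonstration (the word is just an example):

```python
text = u'caf\xe9'                                     # u'café'
as_ascii = text.encode('ascii', 'xmlcharrefreplace')  # b'caf&#233;' - entity restored
as_utf8 = text.encode('utf-8')                        # b'caf\xc3\xa9' - raw UTF-8 bytes
```

The first form is safe to hand to anything that only understands ASCII; the second keeps the character itself and relies on the consumer decoding UTF-8.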
Not AppleScript's fault - it's Python being "helpful" by guessing for you what output encoding to use. (Unfortunately, it guesses differently depending on whether a terminal is attached.)
Simplest solution (Python 2.6+) is to set the PYTHONIOENCODING environment variable before invoking python:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python '/path/to/script.py'"
or:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python << EOF
# -*- coding: utf-8 -*-
# your Python code goes here...
print u'A Møøse once bit my sister ...'
EOF"
New to Python and lxml, so please bear with me. I'm now stuck with what appears to be a Unicode issue. I tried .encode and Beautiful Soup's UnicodeDammit with no luck. I've searched the forum and the web, but my lack of Python skill kept me from applying the suggested solutions to my particular code. I appreciate any help, thanks.
Code:
import requests
import lxml.html

sourceUrl = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty.htm"
sourceHtml = requests.get(sourceUrl)
htmlTree = lxml.html.fromstring(sourceHtml.text)
for stockCodes in htmlTree.xpath('''/html/body/printfriendly/table/tr/td/table/tr/td/table/tr/table/tr/td'''):
    string = stockCodes.text
    print string
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
When I run your code like this: python lx.py, I don't get the error. But when I redirect stdout to a file (python lx.py > output.txt), it occurs. So try this:
# -*- coding: utf-8 -*-
import requests
import lxml.html
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
This allows you to switch from the default ASCII to UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
Note that in requests, the content attribute returns the raw bytes, while the text attribute returns unicode decoded using the encoding requests detects. You could also try sourceHtml.text.encode('utf-8') or sourceHtml.text.encode('ascii'), but I'm fairly certain the latter will raise that same exception.
I have been reading left, right, and centre about Unicode and Python. I think I understand what encoding/decoding is, yet as soon as I try to use a standard library method to manipulate a file name, I get the infamous:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19:
ordinal not in range(128)
In this case \xe9 stands for 'é', and it doesn't matter whether I call it from os.path.join() or shutil.copy(); it throws the same error. From what I understand, it has to do with Python's default encoding. I tried to change it with:
# -*- coding: utf-8 -*-
Nothing changes. If I type:
sys.setdefaultencoding('utf-8')
it tells me:
ImportError: cannot import name setdefaultencoding
What I really don't understand is why it works when I type it in the terminal, '\xe9' and all. Could someone please explain to me why this is happening/how to get around it?
Thank you
Filenames on *nix cannot be manipulated as unicode. The filename must be encoded to match the filesystem's charset and then used.
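A minimal sketch of that encode-then-use pattern, assuming a UTF-8 filesystem (the name and directory below are just example values):

```python
import os

name = u'caf\xe9'               # unicode filename from somewhere
encoded = name.encode('utf-8')  # bytes matching the filesystem's charset
path = os.path.join(b'/tmp', encoded)
```

The path manipulation then happens entirely on byte strings, so nothing triggers the implicit ASCII conversion.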
You should manually decode the filename with the correct encoding (latin-1?) before calling os.path.join.
btw: # -*- coding: utf-8 -*- refers to the string literals in your .py file
effbot has some good info on this
You should not touch the default encoding. It is best practice, and highly recommended, to keep it as 'ascii' and convert your data properly to utf-8 on the output side.
I'm trying to parse a bunch of XML files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
Try decoding it:
>>> print u'abcdé'.encode('utf-8')
abcdé
>>> print u'abcdé'.encode('utf-8').decode('utf-8')
abcdé
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
Minidom doesn't directly support parsing Unicode strings; historically this has had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
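A small self-contained illustration of the byte-string route (the document below is an example, not from the question):

```python
from xml.dom import minidom

# encode the document to bytes first; the parser decodes it itself,
# using the XML declaration or the UTF-8 default
data = u'<foo>caf\xe9</foo>'.encode('utf-8')
doc = minidom.parseString(data)
value = doc.documentElement.firstChild.data  # element text comes back as unicode
```

Note that the round trip is asymmetric: the parser wants bytes in, but hands the parsed text back as unicode.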
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed the code to:
import codecs

with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This wraps the file handle in an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
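The same wrapper is available through the friendlier public API codecs.getwriter, which is equivalent to indexing the codecs.lookup(...) tuple. A standalone sketch using an in-memory buffer in place of the temp file:

```python
import codecs
import io

buf = io.BytesIO()                       # stands in for the real file handle
writer = codecs.getwriter('utf-8')(buf)  # same StreamWriter, via the public API
writer.write(u'caf\xe9')
# buf now holds the UTF-8 bytes for the text
```

Anything written through writer is transparently encoded to UTF-8 before it reaches the underlying byte stream.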
I encounter this error a few times, and my hacky way of dealing with it is just to do this:
def getCleanString(word):
    clean = ""  # don't shadow the built-in str()
    for character in word:
        try:
            clean = clean + str(character)
        except UnicodeEncodeError:
            pass  # this happens if character is non-ASCII unicode
    return clean
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
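For what it's worth, the character-by-character loop above has a one-line equivalent using the 'ignore' error handler, which drops anything that can't be encoded (example input shown):

```python
# drops the non-ASCII character, keeps the rest
cleaned = u'500i caf\xe9'.encode('ascii', 'ignore')
```

Like the loop, this silently discards data, so it's a last resort rather than a real fix.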