Parsing an XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) - Python

I send a GET request to the CareerBuilder API:
import requests
url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
And I get back an XML document that looks like this. However, I have trouble parsing it.
Using either lxml
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

You are passing in the decoded Unicode value. Use the raw response data from r.raw instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
which will read the data from the response directly; do note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.
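To make the bytes-versus-text distinction concrete, here is a small standalone illustration (not part of the CareerBuilder response) of what lxml accepts and rejects:
from lxml import etree

data = b'<?xml version="1.0" encoding="UTF-8"?><root>caf\xc3\xa9</root>'

root = etree.fromstring(data)   # bytes in: the parser honours the declared encoding
print(repr(root.text))          # decoded text comes back out

# Passing already-decoded text that still carries an encoding declaration fails:
# etree.fromstring(data.decode('utf-8'))
# -> ValueError: Unicode strings with encoding declaration are not supported.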

Correction!
See below how I got it all wrong. Basically, when we use the .text attribute, the result is a decoded Unicode string, and passing it to lxml raises the following exception:
ValueError: Unicode strings with encoding declaration are not
supported. Please use bytes input or XML fragments without
declaration.
Which basically means that @martijn-pieters was right: we must use the raw response bytes as returned by .content
Incorrect answer (but might be interesting to someone)
For whoever is interested: I believe this error probably occurs because of an incorrect guess made by requests, as explained in the Response.text documentation:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on
HTTP headers, following RFC 2616 to the letter. If you can take
advantage of non-HTTP knowledge to make a better guess at the
encoding, you should set r.encoding appropriately before accessing
this property.
So, following this, one could also make sure requests' r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8' before accessing it.
This approach adds another validation that the received response is indeed in the correct encoding prior to parsing it with lxml.
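A minimal sketch of what that would look like (url and payload as in the question above); note that lxml still wants bytes, so even with the encoding pinned you would re-encode before parsing:
import requests
from lxml import etree

r = requests.get(url, params=payload)          # url/payload as in the question
r.encoding = 'UTF-8'                           # override requests' header-based guess
text = r.text                                  # now decoded as UTF-8
root = etree.fromstring(text.encode('utf-8'))  # lxml still needs bytes input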

I understand the question has already been answered, but I faced a similar issue on Python 3 (it worked fine on Python 2). My resolution was str_xml = str_xml.encode() and then xml = etree.fromstring(str_xml), followed by the parsing and extraction of tags and attributes.
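A minimal sketch of that Python 3 fix, assuming str_xml holds the response text (e.g. r.text from the question above):
from lxml import etree

str_xml = r.text                           # decoded text, as in the question
xml = etree.fromstring(str_xml.encode())   # encode back to bytes for lxml
for element in xml.iter():
    print(element.tag, element.attrib)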

Related

UnicodeDecodeError error with every urllib.request

When I use urllib in Python3 to get the HTML code of a web page, I use this code:
from urllib.request import Request, urlopen

def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read().decode('utf-8')
    print(html)
    return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page; however, the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
import urllib2

def getHTML(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
    html = response.read()
    return html
You don't need to decode('utf-8'). The following should return the fetched HTML:
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read()
    return html
Found your error: the fetching and decoding were done just fine, everything was evaluated all right. But read the traceback carefully:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement; as you can see, the traceback points at print(html).
This is a somewhat common exception: it just tells you that, with your current system encoding, some of the text cannot be printed to the console. One simple solution is to call print(html.encode('ascii', 'ignore')) to ignore all the unprintable characters. You can still do everything else with html; you just can't print it.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
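If you do want to print, here is a small helper along the lines of that wiki page (a sketch only, assuming Python 3 and that html is an already-decoded str): it substitutes characters the console cannot represent instead of crashing.
import sys

def safe_print(text):
    # Encode with the console's encoding, replacing unprintable characters.
    encoding = sys.stdout.encoding or 'ascii'
    print(text.encode(encoding, errors='replace').decode(encoding))

safe_print(html)   # html as returned by getHTML(), decoded to str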
By the way, the re module can search byte strings. Copy this exactly as-is and it will work:
import re
print(re.findall(b'hello', b'hello world'))

Requests library crashing on Python 2 and Python 3 with

I am trying to parse arbitrary webpages with the requests and BeautifulSoup libraries with this code:
try:
    response = requests.get(url)
except Exception as error:
    return False

if response.encoding == None:
    soup = bs4.BeautifulSoup(response.text)  # This is line 809
else:
    soup = bs4.BeautifulSoup(response.text, from_encoding=response.encoding)
On most webpages this works fine. However, on some arbitrary pages (<1%) I get this crash:
Traceback (most recent call last):
File "/home/dotancohen/code/parser.py", line 155, in has_css
soup = bs4.BeautifulSoup(response.text)
File "/usr/lib/python3/dist-packages/requests/models.py", line 809, in text
content = str(self.content, encoding, errors='replace')
TypeError: str() argument 2 must be str, not None
For reference, this is the relevant method of the requests library:
@property
def text(self):
    """Content of the response, in unicode.

    if Response.encoding is None and chardet module is available, encoding
    will be guessed.
    """
    # Try charset from content-type
    content = None
    encoding = self.encoding

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        if chardet is not None:
            encoding = chardet.detect(self.content)['encoding']

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')  # This is line 809
    except LookupError:
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
As can be seen, I am not passing in an encoding when this error is thrown. How am I using the library incorrectly, and what can I do to prevent this error? This is on Python 3.2.3, but I can also get the same results with Python 2.
This means that the server did not send an encoding for the content in the headers, and the chardet library was also not able to determine an encoding for the contents. You in fact deliberately test for the lack of encoding; why try to get decoded text if no encoding is available?
You can try to leave the decoding up to the BeautifulSoup parser:
if response.encoding is None:
    soup = bs4.BeautifulSoup(response.content)
and there is no need to pass in the encoding to BeautifulSoup, since if .text does not fail, you are using Unicode and BeautifulSoup will ignore the encoding parameter anyway:
else:
    soup = bs4.BeautifulSoup(response.text)
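Putting the two branches together, a minimal sketch of the suggested handling might look like this (the function name is just for illustration):
import bs4
import requests

def get_soup(url):
    try:
        response = requests.get(url)
    except Exception:
        return None
    if response.encoding is None:
        # No header charset and no chardet guess: hand BeautifulSoup the raw
        # bytes and let its own detection (UnicodeDammit) figure it out.
        return bs4.BeautifulSoup(response.content)
    # .text is already Unicode here, so no from_encoding argument is needed.
    return bs4.BeautifulSoup(response.text)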

Url open encoding

I have the following code for urllib and BeautifulSoup:
getSite = urllib.urlopen(pageName)  # open current site
getSitesoup = BeautifulSoup(getSite.read())  # read the site content
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'):  # extract all <link> tags
    defLinks.append(value.get('href'))
The result of it:
/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
And when i try to read the site i get:
�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z
The page is in UTF-8, but the server is sending it to you in a compressed format:
>>> print getSite.headers['content-encoding']
gzip
You'll need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, but writing the data to a file and using gzip.open() to read from it worked fine--I'm not sure why.
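For reference, here is a sketch of doing the decompression in memory instead: zlib accepts the gzip wrapper if you pass wbits=16 + zlib.MAX_WBITS, which is also why a plain zlib.decompress() call chokes on this data (getSite as in the question):
import zlib
from bs4 import BeautifulSoup

raw = getSite.read()                              # still gzip-compressed bytes
html = zlib.decompress(raw, 16 + zlib.MAX_WBITS)  # 16+MAX_WBITS accepts the gzip header
soup = BeautifulSoup(html)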
BeautifulSoup works with Unicode internally; it'll try and decode non-unicode responses from UTF-8 by default.
It looks like the site you are trying to load is using a different encoding; for example, it could be UTF-16 instead:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽
It could be mac_cyrillic too:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ#пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ
But I have way too little information about what kind of site you are trying to load, and I cannot read the output of either encoding. :-)
You'll need to decode the result of getSite() before passing it to BeautifulSoup:
getSite = urllib.urlopen(pageName).read().decode('utf-16')
Generally, the website will return what encoding was used in the headers, in the form of a Content-Type header (probably text/html; charset=utf-16 or similar).
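If you want to use that header yourself, a rough sketch with the Python 2 urllib objects from the question (pageName assumed to be defined) could look like this:
import urllib

getSite = urllib.urlopen(pageName)
content_type = getSite.info().get('Content-Type') or ''   # e.g. "text/html; charset=utf-16"
charset = 'utf-8'                                          # fallback if no charset is given
if 'charset=' in content_type:
    charset = content_type.split('charset=')[-1].strip()
html = getSite.read().decode(charset)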
I ran into the same problem, and as Leonard mentioned, it was due to a compressed format.
This link solved it for me which says to add ('Accept-Encoding', 'gzip,deflate') in the request header. For example:
opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)
usock.close()
return data
Where the decode() function is defined by:
import gzip
import zlib
import StringIO

def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    return page

Parsing XML/HTML encoded GChats

I'm attempting to learn XML in order to parse GChats downloaded from GMail via IMAP. To do so I am using lxml. Each line of the chat messages is formatted like so:
<cli:message to="email@gmail.com" iconset="square" from="email@gmail.com" int:cid="insertid" int:sequence-no="1" int:time-stamp="1236608405935" xmlns:int="google:internal" xmlns:cli="jabber:client">
    <cli:body>Nikko</cli:body>
    <met:google-mail-signature xmlns:met="google:metadata">0c7ef6e618e9876b</met:google-mail-signature>
    <x stamp="20090309T14:20:05" xmlns="jabber:x:delay"/>
    <time ms="1236608405975" xmlns="google:timestamp"/>
</cli:message>
When I try to build the XML tree like so:
root = etree.Element("cli:message")
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2568, in lxml.etree.Element (src/lxml/lxml.etree.c:52878)
File "apihelpers.pxi", line 126, in lxml.etree._makeElement (src/lxml/lxml.etree.c:11497)
File "apihelpers.pxi", line 1542, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:23956)
ValueError: Invalid tag name u'cli:message'
When I try to escape it like so:
root = etree.Element("cli\:message")
I get the exact same error.
The header of the chats also gives this information, which seems relevant:
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 7bit
Does anyone know what's going on here?
So this didn't get any response, but in case anyone was wondering, BeautifulSoup worked fantastically for this. All I had to do was this:
soup = BeautifulSoup(repr(msg_data))
print(soup.get_text())
And I got (fairly) clear text.
So the reason you got an invalid tag is that lxml does not keep the "cli" prefix when it parses XML; internally the tag is written with the namespace URI instead, like:
{url_where_cli_is_defined}message
If you refer to Automatic XSD validation you will see what I did to simplify managing large numbers of schemas, etc.
Similarly, to avoid this very problem I just replaced the namespace prefix using str.replace(), changing "cli:" to "{url}". Having placed all the namespaces in one dictionary made this process quick.
I imagine soup does this process for you automatically.
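For completeness, here is a small lxml sketch of how those prefixed tags are addressed once parsed (the prefixes resolve to their namespace URIs, e.g. cli maps to jabber:client); this is illustrative only, not the original poster's code:
from lxml import etree

msg = b'''<cli:message to="email@gmail.com" from="email@gmail.com"
      xmlns:int="google:internal" xmlns:cli="jabber:client">
    <cli:body>Nikko</cli:body>
</cli:message>'''

root = etree.fromstring(msg)
print(root.tag)                                    # {jabber:client}message
print(root.findtext('{jabber:client}body'))        # Nikko
print(root.findtext('cli:body', namespaces={'cli': 'jabber:client'}))

# Creating an element in a prefixed namespace uses the same {uri}localname form:
body = etree.Element('{jabber:client}body', nsmap={'cli': 'jabber:client'})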

urllib2, Google App Engine, and unicode question

Hey guys, I'm just learning Google App Engine, so I'm running into a bunch of problems...
My current predicament is this. I have a database,
class Website(db.Model):
    web_address = db.StringProperty()
    company_name = db.StringProperty()
    content = db.TextProperty()
    div_section = db.StringProperty()
    local_links = db.StringProperty()
    absolute_links = db.BooleanProperty()
    date_updated = db.DateTimeProperty()
and the problem I'm having is with the content property.
I'm using db.TextProperty() because I need to store the contents of a webpage, which exceed 500 bytes.
The problem I'm running into is that urllib2.readlines() formats the data as unicode. When it is put into a TextProperty() it is converted to ASCII; some of the characters are >128 and it throws a UnicodeDecodeError.
Is there a simple way to bypass this? For the most part, I don't care about those characters...
My error is:
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 511, in __call__
handler.get(*groups)
File "/base/data/home/apps/game-job-finder/1.346504560470727679/main.py", line 61, in get
x.content = website_data_joined
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 542, in __set__
value = self.validate(value)
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2407, in validate
value = self.data_type(value)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore_types.py", line 1006, in __new__
return super(Text, cls).__new__(cls, arg, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2124: ordinal not in range(128)
It would appear that the lines returned from readlines are not unicode strings, but rather byte strings (i.e. instances of str containing potentially non-ASCII characters). These bytes are the raw data received in the HTTP response body, and will represent different strings depending on the encoding used. They need to be "decoded" before they can be treated as text (bytes != characters).
If the encoding is UTF-8, this code should work properly:
f = urllib2.urlopen('http://www.google.com')
website = Website()
website.content = db.Text(f.read(), encoding='utf-8-sig')  # 'sig' deals with BOM if present
Note that the actual encoding varies from website to website (sometimes even from page to page). The encoding used should be included in the Content-Type header in the HTTP response (see this question for how to get it), but if it's not, it may be included in a meta tag in the head of the HTML (in which case extracting it properly is much trickier):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Note that there are sites that do not specify an encoding, or specify the wrong encoding.
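If you do need to fall back to the meta tag, a rough, best-effort sketch (regex-based, so treat it as a heuristic only, and a hypothetical helper name) might be:
import re

def charset_from_meta(html_bytes, default='utf-8'):
    # Look for charset=... near the top of the document (meta http-equiv or meta charset).
    match = re.search(br'charset=["\']?([\w-]+)', html_bytes[:2048], re.IGNORECASE)
    return match.group(1).decode('ascii') if match else default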
If you really don't care about any characters but ASCII, you can ignore them and be done with it:
f = urllib2.urlopen('http://www.google.com')
website = Website()
content = unicode(f.read(), errors='ignore')  # Ignore characters that cause errors
website.content = db.Text(content)  # No need to specify an encoding since content is already a unicode string
