Parsing XML/HTML encoded GChats - python

I'm attempting to learn XML in order to parse GChats downloaded from Gmail via IMAP. To do so, I am using lxml. Each chat message is formatted like so:
<cli:message to="email#gmail.com" iconset="square" from="email#gmail.com" int:cid="insertid" int:sequence-no="1" int:time-stamp="1236608405935" xmlns:int="google:internal" xmlns:cli="jabber:client">
<cli:body>Nikko</cli:body>
<met:google-mail-signature xmlns:met="google:metadata">0c7ef6e618e9876b</met:google-mail-signature>
<x stamp="20090309T14:20:05" xmlns="jabber:x:delay"/>
<time ms="1236608405975" xmlns="google:timestamp"/>
</cli:message>
When I try to build the XML tree like so:
root = etree.Element("cli:message")
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2568, in lxml.etree.Element (src/lxml/lxml.etree.c:52878)
File "apihelpers.pxi", line 126, in lxml.etree._makeElement (src/lxml/lxml.etree.c:11497)
File "apihelpers.pxi", line 1542, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:23956)
ValueError: Invalid tag name u'cli:message'
When I try to escape it like so:
root = etree.Element("cli\:message")
I get the exact same error.
The header of the chats also gives this information, which seems relevant:
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 7bit
Does anyone know what's going on here?

So this didn't get any response, but in case anyone was wondering, BeautifulSoup worked fantastically for this. All I had to do was this:
soup = BeautifulSoup(repr(msg_data))
print(soup.get_text())
And I got (fairly) clear text.
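In case it helps, the same idea with an explicit parser argument avoids bs4's "no parser was explicitly specified" warning; a minimal sketch, where the sample payload stands in for the real msg_data fetched over IMAP:
from bs4 import BeautifulSoup

# Stand-in for the raw chat payload; the repr() call above isn't needed
# when the data is already a string.
msg_data = '<cli:body xmlns:cli="jabber:client">Nikko</cli:body>'
soup = BeautifulSoup(msg_data, 'html.parser')
print(soup.get_text())  # Nikko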

So the reason you got an invalid tag name is that lxml doesn't accept namespace prefixes like "cli" directly; internally it expands the prefix to Clark notation, so the tag actually looks like:
{url_where_cli_is_defined}message
If you refer to Automatic XSD validation you will see what I did to simplify managing large numbers of schemas.
Similarly, to avoid this very problem I just replaced the prefix using str.replace(), changing "cli:" to "{url}". Having placed all the namespaces in one dictionary made this process quick.
I imagine BeautifulSoup does this process for you automatically.
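For illustration, a minimal sketch of building the same element with lxml directly, using the Clark-notation name and an nsmap; the namespace URI comes from the xmlns:cli declaration in the sample message above:
from lxml import etree

# "cli" maps to "jabber:client" in the chat logs, so the Clark-notation
# tag name is "{jabber:client}message" rather than "cli:message".
CLI = "jabber:client"
root = etree.Element("{%s}message" % CLI, nsmap={"cli": CLI})
body = etree.SubElement(root, "{%s}body" % CLI)
body.text = "Nikko"
print(etree.tostring(root, pretty_print=True).decode())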

Related

How to fix bs4 select error: 'TypeError: __init__() keywords must be strings'

I'm writing a script that sends a POST request and gets an XML document in return. I need to parse that XML to know whether the POST request was accepted or not.
I'm using bs4 to parse it and it worked fine until about a week ago when I started to get an error I didn't get before:
TypeError: __init__() keywords must be strings
I'm using bs4's select function in other parts of the same file without getting this error, and I can't find anything about it online.
At first I thought it was a version issue, but I tried both Python 3.7 and 3.6 and got the same error.
This is the code used to produce the error:
res = requests.post(url, data=body, headers=headers)
logging.debug('Res HTTP status is {}'.format(res.status_code))
try:
    res.raise_for_status()
    resSoup = BeautifulSoup(res.text, 'xml')
    # get the result code from the ResultCode tag
    resCode = resSoup.select_one('ResultCode').text
Full error message:
Traceback (most recent call last):
File "EbarInt.py", line 292, in <module>
resCode = resSoup.select_one('ResultCode').text
File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1345, in select_one
value = self.select(selector, namespaces, 1, **kwargs)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1377, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 108, in select
return compile(select, namespaces, flags).select(tag, limit)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 50, in compile
namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings
When I check the type of res.text, I get <class 'str'> as expected.
When I log res.text I get:
<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><soap:Header><wsa:Action>Trackem.Web.Services/CreateOrUpdateTaskResponse</wsa:Action><wsa:MessageID>urn:uuid:3ecae312-d416-40a5-a6a3-9607ebf28d7a</wsa:MessageID><wsa:RelatesTo>urn:uuid:6ab7e354-6499-4e37-9d6e-61219bac11f6</wsa:RelatesTo><wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To><wsse:Security><wsu:Timestamp wsu:Id="Timestamp-6b84a16f-327b-42db-987f-7f1ea52ef802"><wsu:Created>2019-01-06T10:33:08Z</wsu:Created><wsu:Expires>2019-01-06T10:38:08Z</wsu:Expires></wsu:Timestamp></wsse:Security></soap:Header><soap:Body><CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services"><CreateOrUpdateTaskResult><ResultCode>OK</ResultCode><ResultCodeAsInt>0</ResultCodeAsInt><TaskNumber>18000146</TaskNumber></CreateOrUpdateTaskResult></CreateOrUpdateTaskResponse></soap:Body></soap:Envelope>
Update: BeautifulSoup 4.7.1 has been released, fixing the default-namespace issue. See the release notes. You probably would want to upgrade just for the performance fixes.
Original answer:
You must have upgraded to BeautifulSoup 4.7, which replaced the simple and limited internal CSS parser with the soupsieve project, a far more complete CSS implementation.
It is that project that has an issue with the default namespace attached to one of the elements in your response:
<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">
The XML parser used to build the BeautifulSoup object tree correctly communicates that as the None -> 'Trackem.Web.Services' mapping in the namespace dictionary, but the soupsieve code required that all namespaces have a prefix name (xmlns:prefix) with the default namespace marked with an empty string, not None, leading to this bug. I've reported this as issue #68 to the soupsieve project.
You don't need to use select_one at all here; you are not using any CSS syntax beyond an element name. Use soup.find() instead:
resCode = resSoup.find('ResultCode').text
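For reference, a minimal sketch reproducing the situation: the default namespace from the response ends up in the parse tree, but find() matches on the bare tag name and never touches the soupsieve selector machinery:
from bs4 import BeautifulSoup

xml = ('<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">'
       '<ResultCode>OK</ResultCode></CreateOrUpdateTaskResponse>')
soup = BeautifulSoup(xml, 'xml')
print(soup.find('ResultCode').text)  # OK, no CSS selector involved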

Using BeautifulSoup on very large HTML file - memory error?

I'm learning Python by working on a project - a Facebook message analyzer. I downloaded my data, which includes a messages.htm file of all my messages. I'm trying to write a program to parse this file and output data (# of messages, most common words, etc.)
However, my messages.htm file is 270MB. When creating a BeautifulSoup object in the shell for testing, any other file (all < 1MB) works just fine. But I can't create a bs object of messages.htm. Here's the error:
>>> mf = open('messages.htm', encoding="utf8")
>>> ms = bs4.BeautifulSoup(mf)
Traceback (most recent call last):
File "<pyshell#73>", line 1, in <module>
ms = bs4.BeautifulSoup(mf)
File "C:\Program Files (x86)\Python\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
File "C:\Program Files (x86)\Python\lib\codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
So I can't even begin working with this file. This is my first time tackling something like this and I'm only just learning Python so any suggestions would be much appreciated!
As you're using this as a learning exercise, I won't give too much code. You may be better off with ElementTree's iterparse to allow you to process as you parse. BeautifulSoup doesn't have this functionality as far as I am aware.
To get you started:
import xml.etree.cElementTree as ET

with open('messages.htm') as source:
    # get an iterable parse stream
    context = ET.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = next(context)
    for event, elem in context:
        # do something with elem, then
        # get rid of the elements after processing
        root.clear()
If you're set on using BeautifulSoup, you could look into splitting the source HTML into manageable chunks, but you'd need to be careful to keep the thread-message structure and ensure you keep valid HTML.
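If you do want to stay with BeautifulSoup, a SoupStrainer is one partial mitigation: only matching tags become tree nodes, which shrinks the in-memory tree considerably. A sketch, assuming the messages live in div elements with class "message" (the tag and class here are guesses about the file's structure):
from bs4 import BeautifulSoup, SoupStrainer

# The file is still read and decoded in full, but only the matching tags
# are built into the (memory-hungry) parse tree.
only_messages = SoupStrainer('div', attrs={'class': 'message'})
with open('messages.htm', encoding='utf8') as mf:
    soup = BeautifulSoup(mf, 'html.parser', parse_only=only_messages)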

Python Urllib/Requests XML iterparse error

I am currently trying to fetch an XML document from Wikipedia and parse it. My general setup is the following:
import requests
import xml.etree.cElementTree as etree
payload = {'pages': 'Apple', 'action': 'submit', 'offset' : '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export', params=payload, stream=True)
xmlIterator = etree.iterparse(r.raw, events=("start","end"))
When I run my parsing code, I get the following error:
for event, element in self.xmlIterator:
File "<string>", line 107, in next
ParseError: no element found: line 249375, column 2
I have tried the same approach with urllib and got the same error. It also seems to happen only for this specific XML; others work fine.
But the strange thing is this: if I store the response in a file and then pass the file to the XML parser, it works fine. E.g.:
open("test.xml","w").write(r.text.encode('utf-8'))
xmlIterator = etree.iterparse("test.xml", events=("start","end"))
Again, the same behavior for urllib.
Does anyone have an idea of what the problem could be?
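A sketch of the same workaround without the temporary file: download the whole body first and hand iterparse the un-decoded bytes (this matches the bytes-input advice in the CareerBuilder question below):
import io
import requests
import xml.etree.cElementTree as etree

payload = {'pages': 'Apple', 'action': 'submit', 'offset': '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export',
                  params=payload)
# r.content is the complete, un-decoded body; BytesIO gives iterparse the
# file-like object it expects, just like the temporary file did.
xmlIterator = etree.iterparse(io.BytesIO(r.content), events=("start", "end"))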

Parse HTML table in file to csv with BeautifulSoup

Hi, I'm a Python noob and an even bigger BeautifulSoup and HTML noob. I have a downloaded file that has an HTML table in it. In all the examples of BeautifulSoup parsing I have seen, they use urllib to access the table URL, then read the response and pass it to BeautifulSoup to parse. My question is: for a locally stored file, do I have to load the entire file into memory? So instead of doing, say:
contenturl = "http://www.bank.gov.ua/control/en/curmetal/detail/currency?period=daily"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
Do I instead do:
soup = BeautifulSoup(open('/home/dir/filename').read())
That doesn't really seem to work, as I get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
TypeError: 'module' object is not callable
My apologies if it's something really silly I'm doing, but help is appreciated.
Update: the issue is resolved; I needed to import the BeautifulSoup class from the module. Thank you!
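For anyone hitting the same TypeError: it usually means the module was imported and then called as if it were the class. A minimal sketch of the corrected import (the bs3-style line matches the urllib2-era code above; the bs4 form is the modern equivalent):
# 'module' object is not callable happens with "import BeautifulSoup"
# followed by BeautifulSoup(...); import the class instead:
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, as used with urllib2
# from bs4 import BeautifulSoup           # BeautifulSoup 4 equivalent

soup = BeautifulSoup(open('/home/dir/filename').read())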

parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

I send a GET request to the CareerBuilder API:
import requests
url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
And get back an XML response. However, I have trouble parsing it.
Using either lxml
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
You are passing in the decoded Unicode value. Use the raw response data, r.raw, instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
which will read the data from the response directly; do note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.
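A minimal illustration of that rule: the very same document parses when given as bytes, but is rejected as a str carrying an encoding declaration:
from lxml import etree

doc = '<?xml version="1.0" encoding="UTF-8"?><root/>'
etree.fromstring(doc.encode('utf-8'))  # fine: the parser decodes the bytes itself
try:
    etree.fromstring(doc)              # str with an encoding declaration
except ValueError as exc:
    print(exc)  # Unicode strings with encoding declaration are not supported ...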
Correction!
See below how I got it all wrong. Basically, when we use the .text attribute, the result is a decoded Unicode string, and using it raises the following exception in lxml:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Which basically means that @martijn-pieters was right: we must use the raw bytes as returned by .content.
Incorrect answer (but might be interesting to someone)
For whoever is interested: I believe the reason this error occurs is an incorrect guess made by requests when decoding the response, as explained in the Response.text documentation:
Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using chardet. The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
So, following this, one could also make sure requests' r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'. This approach adds another check that the received response is indeed in the correct encoding prior to parsing it with lxml.
I understand the question has already got its answer, but I faced a similar issue on Python 3, while the same code worked fine on Python 2. My resolution was to encode the string back to bytes, xml = etree.fromstring(str_xml.encode()), and then proceed with the parsing and extraction of tags and attributes.
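A compact sketch of that Python 3 workaround, reusing the request from the question above; re-encoding the decoded text gives lxml the bytes input it insists on:
import requests
from lxml import etree

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY', 'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
# .text is a decoded str; .encode() turns it back into bytes so the
# encoding declaration in the document no longer conflicts.
root = etree.fromstring(r.text.encode('utf-8'))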
