Python Urllib/Requests XML iterparse error

I am currently trying to fetch an XML export from Wikipedia and parse it with ElementTree's iterparse. My general setup is the following:
import requests
import xml.etree.cElementTree as etree
payload = {'pages': 'Apple', 'action': 'submit', 'offset' : '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export', params=payload, stream=True)
xmlIterator = etree.iterparse(r.raw, events=("start","end"))
When I run the parsing loop, I get the following error:
for event, element in self.xmlIterator:
File "<string>", line 107, in next
ParseError: no element found: line 249375, column 2
I have tried the same approach with urllib and received the same error. It also seems to happen only for this specific XML; others work fine.
But here is the strange thing: if I store the response to a file and then pass the file to the XML parser, it works fine, e.g.:
open("test.xml","w").write(r.text.encode('utf-8'))
xmlIterator = etree.iterparse("test.xml", events=("start","end"))
Again, the same behavior for urllib.
Does anyone have an idea of what the problem could be?
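One plausible explanation, offered as an assumption since it is not confirmed in the thread: r.raw exposes the undecoded byte stream, so if the server compresses the response (e.g. with gzip), iterparse sees compressed bytes, while r.text is already decoded, which would explain why the saved file parses fine. A minimal sketch of that workaround:
import requests
import xml.etree.cElementTree as etree

payload = {'pages': 'Apple', 'action': 'submit', 'offset': '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export',
                  params=payload, stream=True)
r.raw.decode_content = True  # have urllib3 decode gzip/deflate before iterparse reads the stream
xmlIterator = etree.iterparse(r.raw, events=("start", "end"))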

Related

parsing invalid xml using xmltodict

I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
  <ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
  <endAddress>66.125.37.127</endAddress>
  <handle>NET-66-125-37-120-1</handle>
  <name>SBC066125037120020307</name>
  <netBlocks>
    <netBlock>
      <cidrLenth>29</cidrLenth>
      <endAddress>066.125.037.127</endAddress>
      <type>S</type>
      <startAddress>066.125.037.120</startAddress>
    </netBlock>
  </netBlocks>
  <pocLinks/>
  <orgHandle>C00285134</orgHandle>
  <parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
  <registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
  <startAddress>66.125.37.120</startAddress>
  <updateDate>2002-03-08T07:56:59-05:00</updateDate>
  <version>4</version>
</net>
Since a large number of records like this are pulled in by an API, the last few <net> objects in the file can sometimes be partially downloaded, e.g. a tag missing its closing tag.
This is what I wrote to parse the XML:
xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read()  # Read data
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like this:
no element found: line 30574438, column 37
I want to be able to parse up to the last valid <net> element. How can that be done?
You may need to fix your XML beforehand; xmltodict has no ability to do that for you.
You can leverage lxml, as described in Python xml - handle unclosed token, to fix your XML:
from lxml import etree

def fixme(x):
    # recover=True tells lxml to repair what it can, e.g. by closing unclosed tags
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")

fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
    OrderedDict([('net', [
        OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
        OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
    ])
]))
])
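Putting the two steps together for the file from the question (a sketch; the path and namespace mapping are taken from the snippets above):
import xmltodict

raw = open('/Users/dgoswami/Downloads/net.xml', 'r').read()
data = xmltodict.parse(fixme(raw),  # repair the truncated tail before parsing
                       process_namespaces=True,
                       namespaces={'http://www.arin.net/bulkwhois/core/v1': None})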

How to fix bs4 select error: 'TypeError: __init__() keywords must be strings'

I'm writing a script that sends a POST request and gets XML in return. I need to parse that XML to know whether the request was accepted or not.
I'm using bs4 to parse it, and it worked fine until about a week ago, when I started getting an error I didn't get before:
TypeError: __init__() keywords must be strings
I'm using bs4's select function in other parts of the same file without getting this error, and I can't find anything about it online.
At first I thought it was a version issue, but I tried both python3.7 and 3.6 and got the same error.
This is the code used to produce the error:
res = requests.post(url, data=body, headers=headers)
logging.debug('Res HTTP status is {}'.format(res.status_code))
try:
    res.raise_for_status()
    resSoup = BeautifulSoup(res.text, 'xml')
    # get the resultcode from the resultcode tag
    resCode = resSoup.select_one('ResultCode').text
Full error message:
Traceback (most recent call last):
  File "EbarInt.py", line 292, in <module>
    resCode = resSoup.select_one('ResultCode').text
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1345, in select_one
    value = self.select(selector, namespaces, 1, **kwargs)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1377, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 50, in compile
    namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings
When I check the type of res.text I get <class 'str'>, as expected.
When I log res.text I get:
<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><soap:Header><wsa:Action>Trackem.Web.Services/CreateOrUpdateTaskResponse</wsa:Action><wsa:MessageID>urn:uuid:3ecae312-d416-40a5-a6a3-9607ebf28d7a</wsa:MessageID><wsa:RelatesTo>urn:uuid:6ab7e354-6499-4e37-9d6e-61219bac11f6</wsa:RelatesTo><wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To><wsse:Security><wsu:Timestamp wsu:Id="Timestamp-6b84a16f-327b-42db-987f-7f1ea52ef802"><wsu:Created>2019-01-06T10:33:08Z</wsu:Created><wsu:Expires>2019-01-06T10:38:08Z</wsu:Expires></wsu:Timestamp></wsse:Security></soap:Header><soap:Body><CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services"><CreateOrUpdateTaskResult><ResultCode>OK</ResultCode><ResultCodeAsInt>0</ResultCodeAsInt><TaskNumber>18000146</TaskNumber></CreateOrUpdateTaskResult></CreateOrUpdateTaskResponse></soap:Body></soap:Envelope>
Update: BeautifulSoup 4.7.1 has been released, fixing the default-namespace issue. See the release notes. You probably would want to upgrade just for the performance fixes.
Original answer:
You must have upgraded to BeautifulSoup 4.7, which replaced the simple and limited internal CSS parser with the soupsieve project, a far more complete CSS implementation.
It is that project that has an issue with the default namespace attached to one of the elements in your response:
<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">
The XML parser used to build the BeautifulSoup object tree correctly communicates that as the None -> 'Trackem.Web.Services' mapping in the namespace dictionary. The soupsieve code, however, required that all namespaces have a prefix name (xmlns:prefix), with the default namespace marked by an empty string rather than None, leading to this bug. I've reported this as issue #68 to the soupsieve project.
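The failing step at the bottom of the traceback is ordinary Python behaviour, easy to reproduce in isolation (the namespace URI below is just the one from your response):
def f(**kwargs):
    pass

f(**{None: 'Trackem.Web.Services'})  # TypeError: f() keywords must be strings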
You don't need to use select_one at all here, since you are not using any CSS syntax beyond an element name. Use soup.find() instead:
resCode = resSoup.find('ResultCode').text
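A minimal sketch of the fixed flow, reusing res from the snippet above:
from bs4 import BeautifulSoup

resSoup = BeautifulSoup(res.text, 'xml')   # res is the requests.post response above
resCode = resSoup.find('ResultCode').text  # 'OK' for the logged response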

xml parsing using ElementTree

I have written a small function which uses ElementTree to parse an XML file, but it is throwing the following error: "xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0". Please find the code below:
tree = ElementTree.parse(urllib2.urlopen('http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'))
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")
There are multiple problems with the site you are using:
The site somehow doesn't honour the type=xml you are sending as a GET argument; instead you need to send an Accept header telling the site that you accept XML, or else it returns JSON data.
The site does not accept the content type text/xml, so you need to send application/xml.
Your parse call is correct; it is wrongly claimed in another answer that it should take data. parse takes a file name or file-like object.
So here is the working code:
import urllib2
from xml.etree import ElementTree
url = 'http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'
request = urllib2.Request(url, headers={"Accept" : "application/xml"})
u = urllib2.urlopen(request)
tree = ElementTree.parse(u)
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")
print hotel_list
output:
[<Element 'HotelList' at 0x248cd90>]
Note that I am creating a Request object and passing the Accept header.
By the way, if the site can return JSON, why do you need to parse XML at all? Parsing JSON is simpler, and you get a ready-made Python object.
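For completeness, a sketch of the JSON route, assuming (as described above) that the site falls back to JSON when no XML Accept header is sent; url is the same as in the working code:
import json
import urllib2

u = urllib2.urlopen(url)  # no Accept header, so the site returns JSON
data = json.load(u)       # a ready-made Python object, no XML parsing needed
print data.keys()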

xml.parsers.expat.ExpatError on parsing XML

I have been trying to retrieve information through HTTP queries; for example,
http://www.opencellid.org/cell/get?key=xxxxxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283
returns a response in XML format, like:
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
I have tried using the requests and urllib modules to open the URL, and then parsing with xml.etree.ElementTree.
Code snippet:
url = 'http://www.opencellid.org/cell/get?key=xxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283 '
rss = parse(requests.get(url = url)).getroot()
pprint(rss)
I however get the following error:
xml.parsers.expat.ExpatError: junk after document element: line 5, column 0
Just printing the response yields the HTTP success code. Some help please!
You forgot to call content on the response object; that's how you get the actual XML. Note that parse expects a file name or file-like object rather than a byte string, so wrap the bytes:
import io

content = requests.get(url=url).content
rss = parse(io.BytesIO(content)).getroot()  # parse wants a file-like object, not bytes
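Alternatively, ElementTree's fromstring takes the XML bytes directly and returns the root element (a sketch, using the sample response above):
import requests
import xml.etree.ElementTree as ET

rss = ET.fromstring(requests.get(url).content)  # fromstring returns the root element
print(rss.attrib)                               # {'stat': 'ok'} for the sample response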
The first thing I'd advise would be to save a text file with only the content of the XML:
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
Just make sure there are no trailing characters at the end. Then check whether the parsing works.
If it does, then you know it's a communication problem, and you then have to figure out how to 'clean up' what you are receiving.
Good luck!

Parsing XML response of bit.ly

I was trying out the bit.ly API for shortening URLs and got it to work. It returns an XML document to my script. I wanted to extract the <url> tag but can't seem to parse it properly.
askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
the_page = response.read()
So the_page contains the xml document. I tried:
from xml.dom.minidom import parse
doc = parse(the_page)
This causes an error. What am I doing wrong?
You don't provide an error message, so I can't be sure this is the only error. But xml.dom.minidom.parse does not take a string. From the docstring for parse:
Parse a file into a DOM by filename or file object.
You should try:
response = urllib2.urlopen(askfor)
doc = parse(response)
since response will behave like a file object. Or you could use the parseString method in minidom instead (and then pass the_page as the argument).
EDIT: to extract the URL, you'll need to do:
url_nodes = doc.getElementsByTagName('url')
url = url_nodes[0]
print url.childNodes[0].data
The result of getElementsByTagName is a list of all matching nodes (just one in this case). url is an Element, as you noticed; it contains a child Text node, which holds the data you need.
from xml.dom.minidom import parseString
doc = parseString(the_page)
See the documentation for xml.dom.minidom.
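Putting the two answers together (a sketch in the question's Python 2 style; the_page is the string read earlier):
from xml.dom.minidom import parseString

doc = parseString(the_page)                    # parseString accepts a string, unlike parse
url_node = doc.getElementsByTagName('url')[0]  # the first (and only) matching Element
print url_node.childNodes[0].data              # the Text child holds the data you need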
