xml parsing using ElementTree - python

I have written a small function, which uses ElementTree to parse xml file,but it is throwing the following error "xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0". please find the code below
tree = ElementTree.parse(urllib2.urlopen('http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'))
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")

There are multiple problems with the site you are using:
Site you are using somehow doesn't honour type=xml you are sending as GET arg, instead you will need to send accept header, telling site that you accept XML else it returns JSON data
Site is not accepting content-type text/xml so you need to send application/xml
Your parse call is correct, it is wrongly mentioned in other answer that it should take data, instead parse takes file name or file type object
So here is the working code
import urllib2
from xml.etree import ElementTree
url = 'http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'
request = urllib2.Request(url, headers={"Accept" : "application/xml"})
u = urllib2.urlopen(request)
tree = ElementTree.parse(u)
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")
print hotel_list
output:
[<Element 'HotelList' at 0x248cd90>]
Note I am creating a Request object and passing Accept header
btw if site is returning JSON why you need to parse XML, parsing JSON is simpler and you will get a ready made python object.

Related

Can't extract JSON from an http request

I'm having problems getting data from an HTTP response. The format unfortunately comes back with '\n' attached to all the key/value pairs. JSON says it must be a str and not "bytes".
I have tried a number of fixes so my list of includes might look weird/redundant. Any suggestions would be appreciated.
#!/usr/bin/env python3
import urllib.request
from urllib.request import urlopen
import json
import requests
url = "http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL"
response = urlopen(url)
content = response.read()
print(content)
data = json.loads(content)
info = data[0]
print(info)
#got this far - planning to extract "id:" "22144"
When it comes to making requests in Python, I personally like to use the requests library. I find it easier to use.
import json
import requests
r = requests.get('http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL')
json_obj = json.loads(r.text[4:])
print(json_obj[0].get('id'))
The above solution prints: 22144
The response data had a couple unnecessary characters at the head, which is why I am only loading the relevant (json) portion of the response: r.text[4:]. This is the reason why you couldn't load it as json initially.
Bytes object has method decode() which converts bytes to string. Checking the response in the browser, seems there are some extra characters at the beginning of the string that needs to be removed (a line feed character, followed by two slashes: '\n//'). To skip the first three characters from the string returned by the decode() method we add [3:] after the method call.
data = json.loads(content.decode()[3:])
print(data[0]['id'])
The output is exactly what you expect:
22144
JSON says it must be a str and not "bytes".
Your content is "bytes", and you can do this as below.
data = json.loads(content.decode())

Python Urllib/Requests XML iterparse error

I am currently trying to fetch a XML from Wikipedia and parse it with XML. My general setup is the following:
import requests
import xml.etree.cElementTree as etree
payload = {'pages': 'Apple', 'action': 'submit', 'offset' : '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export', params=payload, stream=True)
xmlIterator = etree.iterparse(r.raw, events=("start","end"))
When I do my parsing syntax, I get the following error:
for event, element in self.xmlIterator:
File "<string>", line 107, in next
ParseError: no element found: line 249375, column 2
I have tried the same approach with urllib receiving in the same error. It also just seems to happen for this specific XML, others work fine.
But the strange thing is as follows: if I store the response to a file and then pass the file to the XML parser it works fine. E.g.,:
open("test.xml","w").write(r.text.encode('utf-8'))
xmlIterator = etree.iterparse("test.xml", events=("start","end"))
Again, the same behavior for urllib.
Does anyone have an idea of what the problem could be?

xml.parsers.expat.ExpatError on parsing XML

I have been trying to retrieve information through HTTP queries, as an example
http://www.opencellid.org/cell/get?key=xxxxxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283
returns me a response in XML format, like
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
I have tried using the response and urllib modules to open the URL, and then parse using elementtree.ElementTree.
Code snippet:
url = 'http://www.opencellid.org/cell/get?key=xxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283 '
rss = parse(requests.get(url = url)).getroot()
pprint(rss)
I however get the following error:
xml.parsers.expat.ExpatError: junk after document element: line 5, column 0
Just printing the response yields the HTML success code. Some help please!
You forgot to call content on the response object. That's how you get the actual xml.
content = requests.get(url = url).content
rss = parse(content).getroot()
First thing I'd advise would be to save a text file only with the content of the xml:
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
just make sure there are no trailing characters at the end. Then check it the parsing works.
If it does, then you know its a communication problem, and then have to figure how to 'clean' up what you are receiving.
Good luck!

How do I fix a "JSONDecodeError: No JSON object could be decoded: line 1 column 0 (char 0)"?

I'm trying to get Twitter API search results for a given hashtag using Python, but I'm having trouble with this "No JSON object could be decoded" error. I had to add the extra % towards the end of the URL to prevent a string formatting error. Could this JSON error be related to the extra %, or is it caused by something else? Any suggestions would be much appreciated.
A snippet:
import simplejson
import urllib2
def search_twitter(quoted_search_term):
url = "http://search.twitter.com/search.json?callback=twitterSearch&q=%%23%s" % quoted_search_term
f = urllib2.urlopen(url)
json = simplejson.load(f)
return json
There were a couple problems with your initial code. First you never read in the content from twitter, just opened the url. Second in the url you set a callback (twitterSearch). What a call back does is wrap the returned json in a function call so in this case it would have been twitterSearch(). This is useful if you want a special function to handle the returned results.
import simplejson
import urllib2
def search_twitter(quoted_search_term):
url = "http://search.twitter.com/search.json?&q=%%23%s" % quoted_search_term
f = urllib2.urlopen(url)
content = f.read()
json = simplejson.loads(content)
return json

Parsing XML response of bit.ly

I was trying out the bit.ly api for shorterning and got it to work. It returns to my script an xml document. I wanted to extract out the tag but cant seem to parse it properly.
askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
the_page = response.read()
So the_page contains the xml document. I tried:
from xml.dom.minidom import parse
doc = parse(the_page)
this causes an error. what am I doing wrong?
You don't provide an error message so I can't be sure this is the only error. But, xml.minidom.parse does not take a string. From the docstring for parse:
Parse a file into a DOM by filename or file object.
You should try:
response = urllib2.urlopen(askfor)
doc = parse(response)
since response will behave like a file object. Or you could use the parseString method in minidom instead (and then pass the_page as the argument).
EDIT: to extract the URL, you'll need to do:
url_nodes = doc.getElementsByTagName('url')
url = url_nodes[0]
print url.childNodes[0].data
The result of getElementsByTagName is a list of all nodes matching (just one in this case). url is an Element as you noticed, which contains a child Text node, which contains the data you need.
from xml.dom.minidom import parseString
doc = parseString(the_page)
See the documentation for xml.dom.minidom.

Categories

Resources