I was trying out the bit.ly API for shortening and got it to work. It returns an XML document to my script. I wanted to extract the <url> tag but can't seem to parse it properly.
askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
the_page = response.read()
So the_page contains the xml document. I tried:
from xml.dom.minidom import parse
doc = parse(the_page)
This causes an error. What am I doing wrong?
You don't provide an error message, so I can't be sure this is the only problem. But xml.dom.minidom.parse does not take a string. From the docstring for parse:
Parse a file into a DOM by filename or file object.
You should try:
response = urllib2.urlopen(askfor)
doc = parse(response)
since response will behave like a file object. Or you could use the parseString method in minidom instead (and then pass the_page as the argument).
EDIT: to extract the URL, you'll need to do:
url_nodes = doc.getElementsByTagName('url')
url = url_nodes[0]
print url.childNodes[0].data
The result of getElementsByTagName is a list of all nodes matching (just one in this case). url is an Element as you noticed, which contains a child Text node, which contains the data you need.
from xml.dom.minidom import parseString
doc = parseString(the_page)
See the documentation for xml.dom.minidom.
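Putting it together, a minimal end-to-end sketch (assuming, per the question, that the bit.ly response wraps the short link in a <url> element):
import urllib2
from xml.dom.minidom import parseString

# full_url is the bit.ly API request URL built earlier in the question
the_page = urllib2.urlopen(urllib2.Request(full_url)).read()

doc = parseString(the_page)                  # parseString takes a string, parse takes a file
url_nodes = doc.getElementsByTagName('url')  # list of matching <url> elements
print url_nodes[0].childNodes[0].data        # the child Text node holds the value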
I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
    <endAddress>66.125.37.127</endAddress>
    <handle>NET-66-125-37-120-1</handle>
    <name>SBC066125037120020307</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>066.125.037.127</endAddress>
            <type>S</type>
            <startAddress>066.125.037.120</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00285134</orgHandle>
    <parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
    <registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
    <startAddress>66.125.37.120</startAddress>
    <updateDate>2002-03-08T07:56:59-05:00</updateDate>
    <version>4</version>
</net>
Since a large number of records like this are being pulled in by an API, sometimes some <net> objects at the end of the file are only partially downloaded, e.g. a tag missing its closing tag.
This is what I wrote to parse the XML:
xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read()  # Read data
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to fix your xml beforehand - xmltodict has no ability to do that for you.
You can leverage lxml as described in Python xml - handle unclosed token to fix your xml:
from lxml import etree

def fixme(x):
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
    OrderedDict([('net', [
        OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
        OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
    ])]))
])
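Once the recovered XML parses, the records can go straight into pandas (the question's goal); a hedged sketch, assuming the 'start' wrapper and two-record list from the toy example above:
import pandas as pd

# xmltodict puts repeated <net> elements into a list of per-record OrderedDicts
nets = xmltodict.parse(fixed)['start']['net']
df = pd.DataFrame(nets)
print df[['endAddress', 'handle']]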
I am currently trying to fetch an XML document from Wikipedia and parse it with ElementTree. My general setup is the following:
import requests
import xml.etree.cElementTree as etree
payload = {'pages': 'Apple', 'action': 'submit', 'offset' : '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export', params=payload, stream=True)
xmlIterator = etree.iterparse(r.raw, events=("start","end"))
When I run my parsing loop, I get the following error:
for event, element in self.xmlIterator:
File "<string>", line 107, in next
ParseError: no element found: line 249375, column 2
I have tried the same approach with urllib, and got the same error. It also seems to happen only for this specific XML; others work fine.
But the strange thing is this: if I store the response to a file and then pass the file to the XML parser, it works fine. E.g.:
open("test.xml","w").write(r.text.encode('utf-8'))
xmlIterator = etree.iterparse("test.xml", events=("start","end"))
Again, the same behavior for urllib.
Does anyone have an idea of what the problem could be?
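No answer is recorded here, but one plausible culprit (an assumption, not something confirmed in this thread): with stream=True, r.raw is the undecoded urllib3 stream, so if the server gzip-compresses the response the parser sees compressed bytes, whereas r.text in the file workaround is already decoded. requests lets you ask for decoding on the raw stream:
import requests
import xml.etree.cElementTree as etree

payload = {'pages': 'Apple', 'action': 'submit', 'offset': '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export',
                  params=payload, stream=True)
r.raw.decode_content = True  # decode gzip/deflate before iterparse reads the bytes
xmlIterator = etree.iterparse(r.raw, events=("start", "end"))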
I need to get the value of FLVPath from this link: http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite
import lxml.html
import requests

sub_r = requests.get("http://www.testpage.co/v2/videoConfigXmlCode.php?pg=video_%s_no_0_extsite" % list[6])
sub_root = lxml.html.fromstring(sub_r.content)

for sub_data in sub_root.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print sub_data.text
But no data is returned.
You're using lxml.html to parse the document, which causes lxml to lowercase all element and attribute names (since case doesn't matter in HTML), which means you'll have to use:
sub_root.xpath('//player_settings[@name="FLVPath"]/@value')
Or, as you're parsing an XML file anyway, you could use lxml.etree.
You could try
print sub_data.attrib['Value']
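Note that .attrib is only available when the XPath selects the elements themselves rather than the attribute values; a sketch of that variant using lxml.etree (which, unlike lxml.html, preserves case):
import lxml.etree

doc = lxml.etree.fromstring(sub_r.content)  # sub_r is the response from the question
for sub_data in doc.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]'):
    print sub_data.attrib['Value']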
url = "http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite"
response = requests.get(url)

# Use `lxml.etree` rather than `lxml.html`,
# and unicode `response.text` instead of `response.content`
doc = lxml.etree.fromstring(response.text)

for path in doc.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print path
I have written a small function which uses ElementTree to parse an XML file, but it is throwing the following error: "xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0". Please find the code below:
tree = ElementTree.parse(urllib2.urlopen('http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'))
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")
There are multiple problems with the site you are using:
The site somehow doesn't honour the type=xml you are sending as a GET argument; instead you need to send an Accept header telling the site that you accept XML, else it returns JSON data.
The site does not accept the content type text/xml, so you need to send application/xml.
Your parse call is correct; it is wrongly mentioned in the other answer that it should take data, when in fact parse takes a filename or file-like object.
So here is the working code:
import urllib2
from xml.etree import ElementTree
url = 'http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&apiKey=czztdaxrhfbusyp685ut6g6v&cid=8123&locale=en_US&city=Dallas%20&stateProvinceCode=TX&countryCode=US&minorRev=12'
request = urllib2.Request(url, headers={"Accept" : "application/xml"})
u = urllib2.urlopen(request)
tree = ElementTree.parse(u)
rootElem = tree.getroot()
hotel_list = rootElem.findall("HotelList")
print hotel_list
output:
[<Element 'HotelList' at 0x248cd90>]
Note that I am creating a Request object and passing an Accept header.
By the way, if the site can return JSON, why parse XML at all? Parsing JSON is simpler and you get a ready-made Python object.
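For example, a sketch of the JSON route, assuming the API honours Accept: application/json the same way it honours application/xml:
import json
import urllib2

# Same url as above; that the API returns JSON for this header is an assumption
request = urllib2.Request(url, headers={"Accept": "application/json"})
data = json.load(urllib2.urlopen(request))  # ready-made Python dict/list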
I have been trying to retrieve information through HTTP queries. As an example,
http://www.opencellid.org/cell/get?key=xxxxxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283
returns me a response in XML format, like
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
I have tried using the requests and urllib modules to open the URL, and then parsing with ElementTree.
Code snippet:
url = 'http://www.opencellid.org/cell/get?key=xxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283'
rss = parse(requests.get(url = url)).getroot()
pprint(rss)
I however get the following error:
xml.parsers.expat.ExpatError: junk after document element: line 5, column 0
Just printing the response yields the HTTP success code. Some help please!
You forgot to call content on the response object. That's how you get the actual xml.
import io
content = requests.get(url).content
rss = parse(io.BytesIO(content)).getroot()  # parse() wants a file object, so wrap the bytes
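From there, the values can be read straight off the tree; a sketch assuming the <rsp>/<cell> structure shown in the question:
for cell in rss.findall('cell'):
    # ElementTree returns attribute values as strings
    print cell.get('lat'), cell.get('lon'), cell.get('range')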
The first thing I'd advise is to save a text file with only the content of the XML:
<rsp stat="ok">
<cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
Just make sure there are no trailing characters at the end. Then check if the parsing works.
If it does, then you know it's a communication problem, and you then have to figure out how to 'clean up' what you are receiving.
Good luck!