Using Minidom to parse XML but it just crashes the applet - Python

Having some issues with Minidom for parsing an XML file on a remote server.
This is the XML I am trying to parse:
<mod n="1">
  <body>
    Random Body information will be here
  </body>
  <b>1997-01-27</b>
  <d>1460321480</d>
  <l>United Kingdom</l>
  <s>M</s>
  <t>About Denisstoff</t>
</mod>
I'm trying to return the <d> values with Minidom. This is the code I am trying to use to find the value:
import urllib.request as urlreq  # assumed import, based on the urlreq alias below
from xml.dom import minidom

expired = True
f = urlreq.urlopen("http://st.chatango.com/profileimg/"+args[:1]+"/"+args[1:2]+"/"+args+"/mod1.xml")  # args is defined elsewhere in the app
data = f.read().decode("utf-8")
dom = minidom.parseString(data)
itemlist = dom.getElementsByTagName('d')
print(itemlist)
It confirms the elements are there, but I followed an approach to reading the data that I found here (below) and it just crashed my Python app. This is the code I tried to fix it with:
for s in itemlist:
    if s.hasAttribute('d'):
        print(s.attributes['d'].value)
This is the crash:
AttributeError: 'NodeList' object has no attribute 'value'
I also tried ElementTree but that didn't return any data at all. I have tested the URL and it's correct for the data I want, but I just can't get it to read the data in the tags. Any and all help is appreciated.

If you want to print the values from this xml you should use this:
for s in itemlist:
    if hasattr(s.childNodes[0], "data"):
        print(s.childNodes[0].data)
I hope it helps :D
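For reference, here is a minimal self-contained sketch of that approach, with the question's <mod> document inlined as a string so it runs on its own:

from xml.dom import minidom

data = """<mod n="1">
  <body>Random Body information will be here</body>
  <b>1997-01-27</b>
  <d>1460321480</d>
  <l>United Kingdom</l>
  <s>M</s>
  <t>About Denisstoff</t>
</mod>"""

dom = minidom.parseString(data)
for s in dom.getElementsByTagName('d'):
    # The element's text lives in a child Text node, not in an attribute,
    # so read .data from the first child rather than from s.attributes
    print(s.childNodes[0].data)  # -> 1460321480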

Related

Parsing invalid xml using xmltodict

I am reading an xml file and converting it to a df using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
  <ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
  <endAddress>66.125.37.127</endAddress>
  <handle>NET-66-125-37-120-1</handle>
  <name>SBC066125037120020307</name>
  <netBlocks>
    <netBlock>
      <cidrLenth>29</cidrLenth>
      <endAddress>066.125.037.127</endAddress>
      <type>S</type>
      <startAddress>066.125.037.120</startAddress>
    </netBlock>
  </netBlocks>
  <pocLinks/>
  <orgHandle>C00285134</orgHandle>
  <parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
  <registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
  <startAddress>66.125.37.120</startAddress>
  <updateDate>2002-03-08T07:56:59-05:00</updateDate>
  <version>4</version>
</net>
Since a large number of records like this are being pulled in by an API, some <net> objects at the end of the file can be partially downloaded, e.g. a tag missing its closing tag.
This is what I wrote to parse the xml:
xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read()  # Read data
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to fix your xml beforehand - xmltodict has no ability to do that for you.
You can leverage lxml as described in Python xml - handle unclosed token to fix your xml:
from lxml import etree

def fixme(x):
    # recover=True tells lxml to repair what it can (unclosed tags, truncated input)
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
OrderedDict([('net', [
OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
])
]))
])
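From there the recovered records can be walked like ordinary dicts. A small usage sketch, assuming the fixed string from above (note that xmltodict gives you a list when an element repeats):

import xmltodict

parsed = xmltodict.parse(fixed)
for net in parsed['start']['net']:  # a list, because <net> appears twice
    print(net['handle'], net['endAddress'])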

Maximum recursion depth exceeded when building a whoosh index

I am trying to index some documents using Whoosh. However, when I try to add the documents to the Whoosh index, Python eventually gives back the following error:
RecursionError: maximum recursion depth exceeded while calling a Python object
I have tried playing with the limitmb setting of the index writer, as well as changing how often the index is committed to the hard drive. This seemed to change the number of documents that were indexed successfully, but the indexing still stops with the RecursionError after a short while.
My code is the following:
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from bs4 import BeautifulSoup
import os

schema = Schema(title=TEXT(stored=True), docID=ID(stored=True), content=TEXT(stored=True))
ix = create_in("index", schema)
writer = ix.writer(limitmb=1024, procs=4, multisegment=True)

for root, dirs, files in os.walk('aquaint'):
    for file in files:
        with open(os.path.join(root, file), "r") as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            for doc in soup.find_all('doc'):
                try:
                    t = doc.find('headline').string
                except:
                    t = "No title available"
                try:
                    d = doc.find('docno').string
                except:
                    d = "No docID available"
                try:
                    c = doc.find('text').string
                except:
                    c = "No content available"
                writer.add_document(title=t, docID=d, content=c)
writer.commit()
The files I am loading are from the TREC robust track (https://trec.nist.gov/data/t14_robust.html) and have the following format (due to licensing I can't share an entire file):
<DOC>
  <DOCNO> APW1XXXXXXXXX </DOCNO>
  <DOCTYPE> NEWS STORY </DOCTYPE>
  <DATE_TIME> 1998-01-06 00:17:00 </DATE_TIME>
  <HEADER>
    XXXX
  </HEADER>
  <BODY>
    <SLUG> BC-Sports-Motorcycling-Grand Prix-Doohan </SLUG>
    <HEADLINE>
      Doohan calls for upgrade to 1000cc bikes
    </HEADLINE>
    <TEXT>
      News article text here
    </TEXT>
    (PROFILE
    (WS SL:BC-Sports-Motorcycling-Grand Prix-Doohan; CT:s;
    (REG:EURO;)
    (REG:BRIT;)
    (REG:SCAN;)
    (REG:MEST;)
    (REG:AFRI;)
    (REG:INDI;)
    (REG:ENGL;)
    (REG:ASIA;)
    (LANG:ENGLISH;))
    )
  </BODY>
  <TRAILER>
    AP-NY-06-01-98 0017EDT
  </TRAILER>
</DOC>
Each file I load contains several of these documents, each beginning and ending with the <DOC> tags.
I don't understand what is causing this error; could someone help me out?
Your help is greatly appreciated!
I found what the problem was: I wrongly assumed that BeautifulSoup would return a plain string when calling doc.find('headline').string. Replacing this with str(doc.find('headline').string) seems to have fixed the issue for me, and Whoosh is now indexing correctly.
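For anyone hitting the same error, a minimal sketch of the corrected inner loop (same names as the question's code). My assumption about the cause: .string returns a NavigableString that keeps a reference back into the whole parse tree, which seems to be what blows the recursion limit when Whoosh serialises documents for its worker processes.

for doc in soup.find_all('doc'):
    headline = doc.find('headline')
    # str(...) turns BeautifulSoup's NavigableString into a plain string
    t = str(headline.string) if headline is not None and headline.string else "No title available"
    docno = doc.find('docno')
    d = str(docno.string) if docno is not None and docno.string else "No docID available"
    text = doc.find('text')
    c = str(text.string) if text is not None and text.string else "No content available"
    writer.add_document(title=t, docID=d, content=c)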

No results when parsing CityGML in Python

I am working with CityGML data right now and am trying to parse it in Python.
To do so I use ElementTree, which works fine with any other XML files. But whenever I try to parse the CityGML file I don't get any results.
As one example, I want to get a list of all child tags named "creationDate" in the CityGML file. Here is the code:
import xml.etree.ElementTree as ET

tree = ET.parse('Gasometer.xml')
root = tree.getroot()

def child_list(child):
    list_child = list(tree.iter(child))
    return list_child

date = child_list('creationDate')
print(date)
I only get an empty list [].
Here is the very first part of the CityGML file (you can find the "creationDate" tag at the end):
<?xml version="1.0" encoding="UTF-8"?>
<CityModel>
  <cityObjectMember>
    <bldg:Building gml:id="UUID_899cac3f-e0b6-41e6-ae30-a91ce51d6d95">
      <gml:description>Wohnblock in geschlossener Bauweise</gml:description>
      <gml:boundedBy>
        <gml:Envelope srsName="urn:ogc:def:crs,crs:EPSG::3068,crs:EPSG::5783" srsDimension="3">
          <gml:lowerCorner>21549.6537889055 17204.3479916992 38.939998626709</gml:lowerCorner>
          <gml:upperCorner>21570.6420902953 17225.660050148 60.6840192923434</gml:upperCorner>
        </gml:Envelope>
      </gml:boundedBy>
      <creationDate>2014-03-28</creationDate>
This happens not only when I try to get lists of child tags; I can't print any attributes or tag names either. It looks like the way I parse the file is wrong. I hope somebody can help me out and tell me what I should do. Thanks!
Since this is an old post I'll just leave this here in case someone else might need it.
To parse CityGML try the following code; it should give a general idea of how to fetch the information.
import xml.etree.ElementTree as ET

def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    # Tags must be qualified with the full namespace URI, in {uri}tag form
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        print("ENV tag", envelope.tag)
        print("ENV attrib", envelope.attrib)
        print("ENV text", envelope.text)
        lCorner = envelope.find('{http://www.opengis.net/gml}lowerCorner').text
        uCorner = envelope.find('{http://www.opengis.net/gml}upperCorner').text
        print("lC", lCorner)
        print("uC", uCorner)

if __name__ == "__main__":
    loadfile()
To get the srsName, try the following:
import xml.etree.ElementTree as ET

def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        key = envelope.attrib
        srsName = key.get('srsName')
        print("SRS Name: ", srsName)

if __name__ == "__main__":
    loadfile()
I hope this helps you or anyone else who might try parsing CityGML with ElementTree.
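If you would rather not hard-code namespace URIs, ElementTree on Python 3.8+ also accepts a {*} wildcard in find()/findall() paths. A short sketch for the creationDate case, assuming (as the answer above does) that the real file declares namespaces the truncated snippet doesn't show:

import xml.etree.ElementTree as ET

tree = ET.parse('Gasometer.xml')
root = tree.getroot()

# '{*}creationDate' matches a creationDate element in any namespace (Python 3.8+)
dates = [el.text for el in root.findall('.//{*}creationDate')]
print(dates)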

xml.parsers.expat.ExpatError on parsing XML

I have been trying to retrieve information through HTTP queries; as an example,
http://www.opencellid.org/cell/get?key=xxxxxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283
returns a response in XML format, like:
<rsp stat="ok">
  <cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
I have tried using the requests and urllib modules to open the URL, and then parsing with xml.etree.ElementTree.
Code snippet:
url = 'http://www.opencellid.org/cell/get?key=xxxxxxxxxx&mnc=1&mcc=228&lac=101&cellid=7283 '
rss = parse(requests.get(url = url)).getroot()
pprint(rss)
However, I get the following error:
xml.parsers.expat.ExpatError: junk after document element: line 5, column 0
Just printing the response yields the HTTP success code. Some help please!
You forgot to call content on the response object; that's how you get the actual xml. Note also that parse expects a filename or file object, so use fromstring for an in-memory string:
content = requests.get(url=url).content
rss = fromstring(content)  # fromstring parses a string and returns the root element directly, no getroot() needed
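Putting the pieces together, a minimal end-to-end sketch (the key value is a placeholder; letting requests encode the query parameters also avoids stray characters sneaking into the URL):

import requests
import xml.etree.ElementTree as ET

params = {'key': 'xxxxxxxxxx', 'mnc': 1, 'mcc': 228, 'lac': 101, 'cellid': 7283}
resp = requests.get('http://www.opencellid.org/cell/get', params=params)

root = ET.fromstring(resp.content)  # bytes in, root Element out
print(root.get('stat'))             # 'ok' on success
for cell in root.iter('cell'):
    print(cell.get('lat'), cell.get('lon'), cell.get('range'))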
The first thing I'd advise is to save a text file with only the content of the xml:
<rsp stat="ok">
  <cell nbSamples="1" mnc="1" lac="101" lat="46.52079" lon="6.56676" cellId="7283" mcc="228" range="6000"/>
</rsp>
Just make sure there are no trailing characters at the end, then check that the parsing works.
If it does, then you know it's a communication problem, and you then have to figure out how to 'clean up' what you are receiving.
Good luck!

Parsing XML response of bit.ly

I was trying out the bit.ly api for shortening and got it to work. It returns an xml document to my script. I wanted to extract the <url> tag but can't seem to parse it properly.
askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
the_page = response.read()
So the_page contains the xml document. I tried:
from xml.dom.minidom import parse
doc = parse(the_page)
This causes an error. What am I doing wrong?
You don't provide an error message so I can't be sure this is the only error. But xml.dom.minidom.parse does not take a string. From the docstring for parse:
Parse a file into a DOM by filename or file object.
You should try:
response = urllib2.urlopen(askfor)
doc = parse(response)
since response will behave like a file object. Or you could use the parseString method in minidom instead (and then pass the_page as the argument).
EDIT: to extract the URL, you'll need to do:
url_nodes = doc.getElementsByTagName('url')
url = url_nodes[0]
print url.childNodes[0].data
The result of getElementsByTagName is a list of all nodes matching (just one in this case). url is an Element as you noticed, which contains a child Text node, which contains the data you need.
from xml.dom.minidom import parseString
doc = parseString(the_page)
See the documentation for xml.dom.minidom.
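For completeness, a minimal end-to-end sketch in the question's Python 2 / urllib2 style (full_url is the bit.ly request URL built earlier in the question):

import urllib2
from xml.dom.minidom import parseString

askfor = urllib2.Request(full_url)          # full_url built earlier in the question
the_page = urllib2.urlopen(askfor).read()

doc = parseString(the_page)                 # parseString takes a string; parse takes a file
url_nodes = doc.getElementsByTagName('url')
print url_nodes[0].childNodes[0].data       # text of the first <url> element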
