Parse large python xml using xmltree - python

I have a python script that parses huge xml files ( largest one is 446 MB)
try:
parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse(os.path.join(srcDir, fileName), parser)
root = tree.getroot()
except Exception, e:
print "Error parsing file "+str(fileName) + " Reason "+str(e.message)
for child in root:
if "PersonName" in child.tag:
personName = child.text
This is what my xml looks like :
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want to do is get the PersonName tags contents thats all. Other tags I don't care about.
Sadly My files are huge and I keep getting this error when I use the code above :
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content .Seems that the large files parsing causes the error.
I have looked at iterparse() but it seems to complex for what I want to achieve as it provides parsing of the whole DOM while I just want that one tag that is under the root. Also , does not give me a good sample to get the correct value by tag name ?
Should I use a regex parse or grep /awk way to do this ? Or any tweak to my code will let me get the Person name in these huge files ?
UPDATE:
Tried this sample and it seems to be printing the whole world from the xml except my tag ?
Does iterparse read from bottom to top of file ? In that case it will take a long time to get to the top i.e my PersonName Tag ? I tried changing the line below to read end to start events=("end", "start") and it does the same thing !!!
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
if event == 'start':
path.append(elem.tag)
elif event == 'end':
# process the tag
print elem.text // prints whole world
if elem.tag == 'PersonName':
print elem.text
path.pop()

Iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> stuck on as a line at the end.
Think of the source = as boilerplace, if you will, that parses the xml file and returns chunks of it element-by-element, indicating whether the chunk is the 'start' of an element or the 'end' and supplying information about the element.
In this case we need consider only the 'start' events. We watch for the 'PersonName' tags and pick up their texts. Having found the one and only such item in the xml file we abandon the processing.
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
... if an_event=='start' and an_element.tag.endswith('PersonName'):
... an_element.text
... break
...
'Miracle Smith'
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
... if an_event=='start' and an_element.tag.endswith('PersonName'):
... an_element.text
... break
...
'Miracle Smith'

Related

Parse XML with Python resolving an external ENTITY reference

In my S1000D xml, it specifies a DOCTYPE with a reference to a public URL that contains references to a number of other files that contain all the valid character entities. I've used xml.etree.ElementTree and lxml to try to parse it and get a parse error with both indicating:
undefined entity −: line 82, column 652
Even though − is a valid entity according to the ENTITY Reference specfied.
The xml top is as follow:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
If you go out and get http://www.s1000d.org/S1000D_4-1/ent/ISOEntities, it will include 20 other ent files with one called iso-tech.ent which contains the line:
<!ENTITY minus "−"> <!-- MINUS SIGN -->
in line 82 of the xml file near column 652 is the following:
....Refer to 70−41....
How can I run a python script to parse this file without get the undefined entity?
Sorry I don't want to specify parser.entity['minus'] = chr(2212) for example. I did that for a quick fix but there are many character entity references.
I would like the parser to check Entity reference that is specified in the xml.
I'm surprised but I've gone around the sun and back and haven't found how to do this (or maybe I have but couldn't follow it).
if I update my xml file and add
<!ENTITY minus "−">
It won't fail, so It's not the xml.
It fails on the parse. Here's code I use for ElementTree
fl = os.path.join(pth, fn)
try:
root = ET.parse(fl)
except ParseError as p:
print("ParseError : ", p)
Here's the code I use for lxml
fl = os.path.join(pth, fn)
try:
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
root = etree.parse(fl, parser=parser)
except etree.XMLSyntaxError as pe:
print("lxml XMLSyntaxError: ", pe)
I would like the parser to load the ENTITY reference so that it knows that − and all the other character entities specified in all the files are valid entity characters.
Thank you so much for your advice and help.
I'm going to answer for lxml. No reason to consider ElementTree if you can use lxml.
I think the piece you're missing is no_network=False in the XMLParser; it's True by default.
Example...
XML Input (test.xml)
<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
<test>Here's a test of minus: −</test>
</doc>
Python
from lxml import etree
parser = etree.XMLParser(load_dtd=True,
no_network=False)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: −</test>
</doc>
If you wanted the entity reference retained, add resolve_entities=False to the XMLParser.
Also, instead of going out to an external location to resolve the parameter entity, consider setting up an XML Catalog. This will let you resolve public and/or system identifiers to local versions.
Example using same XML input above...
XML Catalog ("catalog.xml" in the directory "catalog test" (space used in directory name for testing))
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- The path in #uri is relative to this file (catalog.xml). -->
<uri name="http://www.s1000d.org/S1000D_4-1/ent/ISOEntities" uri="./ents/ISOEntities_stackoverflow.ent"/>
</catalog>
Entity File ("ISOEntities_stackoverflow.ent" in the directory "catalog test/ents". Changed the value to "BAM!" for testing)
<!ENTITY minus "BAM!">
Python (Changed no_network to True for additional evidence that the local version of http://www.s1000d.org/S1000D_4-1/ent/ISOEntities is being used.)
import os
from urllib.request import pathname2url
from lxml import etree
# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
try:
xcf_env = os.environ['XML_CATALOG_FILES']
except KeyError:
# Path to catalog must be a url.
catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog test/catalog.xml'))}"
# Temporarily set the environment variable.
os.environ['XML_CATALOG_FILES'] = catalog_path
parser = etree.XMLParser(load_dtd=True,
no_network=True)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: BAM!</test>
</doc>

How to parse out xml from noisy file using python

I have a file which contains a bunch of logging information including xml. I'd like to parse out the xml portion into a string object so I can then run some xpaths on it to ensure to existence of certain information on the 'data' element.
File to parse:
Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
Python:
import re
with open('./output.out', 'r') as outFile:
data = outFile.read().replace('\n','')
regex = re.escape("<.*?>.*?<\/Root>");
p = re.compile(regex)
m = p.match(data)
if m:
print(m.group())
else:
print('No match')
Output:
No match
What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.
Thou shalt never use regular expressions for parsing XML/HTML. There is BeautifulSoup for this daunting task.
import bs4
soup = bs4.BeautifulSoup(open("output.out").read(), "lxml")
roots = soup.findAll('root')
#[<root xmlns="http://schemas.com/service"
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1"
# version="2016.1.2700-SNAPSHOT"></data></root>]
roots[0] is an XML document. You can do anything you want with it.

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from second line (retaining the remaining text), I am able to get accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
print sensor_val.find('accelx').text
print sensor_val.find('accely').text
I asked related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (namespace declared without prefix) :
xmlns="http://schemas.datacontract.org/Storage"
Note that descendants elements without prefix inherit default namespace from ancestor, implicitly. Now, to reference element in namespace, you need to map a prefix to the namespace URI, and use that prefix in your XPath :
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
print sensor_val.find('d:accelx', ns).text
print sensor_val.find('d:accely', ns).text

Error while parsing xml file in python

This is the xml file I am trying to parse. This file does not have a root tag.
<data txt="some0" txt1 = "some1" txt2 = "some2" >
<data2>
< bank = "SBI" bank2 = "SBI2" >
<data2>
<data3>
<branch = "bang1" branch = bang"2" >
<data3>
<data>
My script contains below lines. The below can be used to get the specific data after parsing it.
data = re.findall("<data txt=.*?</data>", re.DOTALL)
tree = ElementTree.fromstringlist(data)
I am unabale to parse this file because its not having root tag. please help me how to parse if the file is having no tag ??
As pointed out in a comment already, you can just parse the whole thing. If the missing root element is the problem, you can grab the contents of the file as a string and then add an arbitrary root tag at the beginning and the end.
stringdata = "<myroot>%s</myroot>" % stringdata
and then parse the string.
EDIT:
In response to comment.
If you have one string, you'll want fromstring, but you'll almost certainly get the same error. Something else is going on. Try this ...
from xml.etree import ElementTree
stringdata = "<myroot>%s</myroot>" % stringdata
tree = ElementTree.fromstring(stringdata)
Then get what you need from tree.

Python lxml: Items without .text attribute returned when querying for nodes()

I am trying to parse out certain tags from an XML document and it is retiring an AttributeError: '_ElementStringResult' object has no attribute 'text' error.
Here is the xml document:
<?xml version='1.0' encoding='ASCII'?>
<Root>
<Data>
<FormType>Log</FormType>
<Submitted>2012-03-19 07:34:07</Submitted>
<ID>1234</ID>
<LAST>SJTK4</LAST>
<Latitude>36.7027777778</Latitude>
<Longitude>-108.046111111</Longitude>
<Speed>0.0</Speed>
</Data>
</Root>
Here is the code I am using
from lxml import etree
from StringIO import StringIO
import MySQLdb
import glob
import os
import shutil
import logging
import sys
localPath = "C:\data"
xmlFiles = glob.glob1(localPath,"*.xml")
for file in xmlFiles:
a = os.path.join(localPath,file)
element = etree.parse(a)
Data = element.xpath('//Root/Data/node()')
parsedData = [{field.tag: field.text for field in Data} for action in Data]
print parsedData #AttributeError: '_ElementStringResult' object has no attribute 'text'
'//Root/Data/node()' will return a list of all the child elements which include text elements as strings which will not have a text attribute. If you put a print right after the Data = ... you will see something like ['\n ', <Element FormType at 0x10675fdc0>, '\n ', ....
I would do a filter first such as:
Data = [f for f in elem.xpath('//Root/Data/node()') if hasattr(f, 'text')]
Then I think the following line could be rewritten as:
parsedData = {field.tag: field.text for field in Data}
which will give the element tag and text dictionary which I believe is what you want.
Instead of querying for //Root/Data/node(), query for /Root/Data/* if you want only elements (as opposed to text nodes) to be returned. (Also, using only a single leading / rather than // allows the engine to do a cheaper search, rather than needing to look through the whole subtree for an additional Root.
Also -- are you sure you really want to loop through the entire list of subelements of Data inside your inner loop, rather than looping over only the subelements of a single Data element selected by your outer loop? I think your logic is broken, though it would only be visible if you had a file with more than one Data element under Root.

Categories

Resources