Showing progress of Python's XML parser when loading a huge file - python

I'm using Python's built-in XML parser to load a 1.5 GB XML file, and it takes all day.
from xml.dom import minidom
xmldoc = minidom.parse('events.xml')
I need to know how to get inside that and measure its progress so I can show a progress bar.
Any ideas?
minidom has another method called parseString() that returns a DOM tree, assuming the string you pass it is valid XML. If I were to split up the file myself into chunks and pass them to parseString() one at a time, could I merge all the DOM trees back together at the end?

Your use case requires a SAX parser instead of DOM. DOM loads everything into memory; SAX parses incrementally instead, and you write handlers for the events you need,
so it can be efficient, and you would also be able to write a progress indicator.
I also recommend trying the expat parser sometime; it is very useful:
http://docs.python.org/library/pyexpat.html
For progress using SAX:
since SAX reads the file incrementally, you can wrap the file object you pass in with your own and keep track of how much has been read.
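For example, here is a minimal sketch of such a wrapper (my own illustration, not part of the original answer; the ProgressFile class, the EventHandler stub and the 'events.xml' name are all made up):
import os
import xml.sax

class ProgressFile(object):
    """Wraps a file so that each read() reports how far into the file we are."""
    def __init__(self, path):
        self._f = open(path, 'rb')
        self._size = os.path.getsize(path)

    def read(self, size=-1):
        data = self._f.read(size)
        percent = 100.0 * self._f.tell() / self._size
        print('progress: %.1f%%' % percent)  # or update a real progress bar here
        return data

    def close(self):
        self._f.close()

class EventHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        pass  # react to the elements you actually care about

xml.sax.parse(ProgressFile('events.xml'), EventHandler())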
Edit:
I also don't like the idea of splitting the file yourself and joining the DOM trees at the end; if you go that route, you might as well write your own XML parser. I recommend using a SAX parser instead.
I also wonder what your purpose is in reading a 1.5 GB file into a DOM tree?
It looks like SAX would be better here.

Have you considered other means of parsing the XML? Building a tree from such big XML files will always be slow and memory-intensive. If you don't need the whole tree in memory, stream-based parsing will be much faster. It can be a little daunting if you're used to tree-based XML manipulation, but it will pay off in the form of a huge speed increase (minutes instead of hours).
http://docs.python.org/library/xml.sax.html
http://docs.python.org/library/xml.sax.html

I have something very similar for PyGTK, not PyQt, using the pulldom API. It gets called a little bit at a time using GTK idle events (so the GUI doesn't lock up) and Python generators (to save the parsing state).
import os
import stat
import xml.dom.pulldom

def idle_handler(fn):
    fh = open(fn)  # file handle
    doc = xml.dom.pulldom.parse(fh)
    fsize = os.stat(fn)[stat.ST_SIZE]
    position = 0
    for event, node in doc:
        if position != fh.tell():
            position = fh.tell()
            # update status: position * 100 / fsize
        if event == ....
        yield True  # idle handler stays until False is returned
    yield False

def main():
    add_idle_handler(idle_handler, filename)

Merging the tree at the end would be pretty easy. You could just create a new DOM, and basically append the individual trees to it one by one. This would give you pretty finely tuned control over the progress of the parsing too. You could even parallelize it if you wanted by spawning different processes to parse each section. You just have to make sure you split it intelligently (not splitting in the middle of a tag, etc.).
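A rough sketch of what that merge might look like with minidom (my own illustration; the chunks list, the <events> root and the <wrapper> element are assumptions, and each chunk must itself be well-formed):
from xml.dom import minidom

merged = minidom.parseString('<events/>')   # new, empty DOM that collects everything
root = merged.documentElement
for chunk in chunks:                        # chunks: XML strings split between top-level elements
    part = minidom.parseString('<wrapper>%s</wrapper>' % chunk)
    for node in list(part.documentElement.childNodes):
        root.appendChild(merged.importNode(node, True))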


Processing large xml files. Only root tree children attributes are relevant

I'm new to XML and Python, and I hope that I phrased my problem right:
I have xml files with a size of one gigabyte.
The files look like this:
<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
Stuff I dont't care about
</step>
<step ID="1" step="NameOfStep2" result="PASS">
Stuff I dont't care about
</step>
</test>
For fast analysis I want to get the name and the result of the steps, which are the children of the root element. The "stuff I don't care about" consists of lots of nested elements.
I have already tried the following:
import xml.etree.ElementTree as ET

tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Here I get a memory error because the files are too big.
Then I tried:
try:
    for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
        if elem.tag == "step" and event == "start":
            stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
        elem.clear()
This works but is really slow. I guess it iterates through all elements and this takes a very long time.
Then I found a solution looking like this:
try:
    tree = ET.iterparse(pathToSteps, events=("start", "end"))
    _, root = next(tree)
    print('ROOT:', root.tag)
except:
    print("ERROR: Unable to open and parse file !!!")

for child in root:
    print(child.attrib)
But this prints only the attributes of the first step.
Is there a way to speed up the working solution?
Since I'm pretty new to this stuff I would appreciate a complete example or a reference where I can figure it out by myself with an example.
I think you're on the right track with iterparse().
Maybe try specifying the step element name in the tag argument and only processing "start" events...
from lxml import etree

for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
    print(elem.attrib)
    elem.clear()
EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.
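If switching isn't an option, a rough ElementTree-only variant of the same idea (my sketch; it assumes, as in your sample, that the step elements are direct children of the root) is to grab the root from the first event and detach each finished step so memory stays bounded:
import xml.etree.ElementTree as ET

context = ET.iterparse("input.xml", events=("start", "end"))
event, root = next(context)          # the first event is the start of the root element
for event, elem in context:
    if event == "end" and elem.tag == "step":
        print(elem.attrib["step"], elem.attrib["result"])
        elem.clear()                 # drop the step's children
        root.remove(elem)            # and detach the step itself from the root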
Without knowing the specifics of your setup, it might be hard to guess what the 'fastest possible' is and how much of the delay is due to parsing the file. The first thing I would do is, of course, time the run so you have an initial benchmark. Then I would write a simple Python program that does nothing but read the file from disk (no XML parsing). If the time difference is not significant, then the XML parsing isn't the issue and the reading of the file from disk is the problem. Of course, an XML document carries no indication of where the next tag ends, so skipping the I/O associated with those portions isn't possible (you still need to do a linear read of the file). Other than potentially using a different (non-interpreted) programming language, there may not be many things you can do.
If you do get a significant slowdown from the actual XML parsing, you could try pre-processing the file into a different one. Since the format of your files is very static, you could read the file and copy lines to a different file (using a regex) until you hit a <step ...> tag, then throw the data away until you reach the closing </step> or </test> tag. That will result in a valid, but hopefully much smaller, XML file. The key here is to do the 'parsing' yourself instead of having the underlying parser try to understand the whole document format, which could be much faster since your format is simple. You could then run your original program on this output, which will not 'see' any of the extraneous tags. Of course, this breaks if you actually have nested <step> tags, but in that case you likely need to parse the file with a real XML parser to understand where the first-level <step> elements start and stop.
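As a sketch of that pre-processing idea (mine, not part of the original answer), assuming every tag sits on its own line as in the sample above, something like this would keep only the <test> and <step> tags and drop everything inside the steps:
import re

keep = re.compile(r'^\s*</?(test|step)\b')   # tags worth carrying over
with open('input.xml') as src, open('slim.xml', 'w') as dst:
    for line in src:
        if keep.match(line):
            dst.write(line)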

How to in-place parse and edit huge xml file [duplicate]

This question already has an answer here:
How to use xml sax parser to read and write a large xml?
(1 answer)
Closed 3 years ago.
I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in the element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked into every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place.
Simple version that works, but has to load the whole xml into memory (and isn't in-place)
values_to_mask = ['val1', 'GMX-103', 'etc-555']  # imported list of vals to mask

with open(dataset_name, encoding='utf8') as f:
    tree = ET.parse(f)
    root = tree.getroot()

for old in values_to_mask:
    new = mu.generateNew(old, randomnumber)  # utility to generate new amt
    for elem in root.iter():
        try:
            elem.text = elem.text.replace(old, new)
        except AttributeError:
            pass

tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
    context = etree.iterparse(f)
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber)
        mu.fast_iter(context, mu.replace_if_exists, old, new, f)

def replace_if_exists(elem, old, new, xf):
    try:
        if old in elem.text:
            elem.text = elem.text.replace(old, new)
            xf.write(elem)
    except AttributeError:
        pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically how the XML data looks (hierarchical objects with subclasses)
It looks generally like this:
<Master_Data_Object>
    <Package>
        <PackageNr>1000</PackageNr>
        <Quantity>900</Quantity>
        <ID>FAKE_CONFIDENTIALGYO421</ID>
        <Item_subclass>
            <ItemType>C</ItemType>
            <MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
            <Package>
                <Other_Types>
Since I don't have your dataset, I would suggest that you:
1) use readlines() in a loop to read a substantial amount of data at a time, and
2) use a regular expression to identify the confidential information (if possible) and then replace it.
Let me know if it works.
You can pretty much use a SAX parser for big XML files.
Here is your answer:
Editing big xml files using sax parser
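In the spirit of that linked answer, here is a minimal sketch of a SAX pass that streams one file into another while masking values (my own illustration; the file names and the masks dict are assumptions, and note that a value split across two characters() callbacks would be missed):
import xml.sax
from xml.sax.saxutils import XMLGenerator

class MaskingHandler(xml.sax.ContentHandler):
    def __init__(self, out, replacements):
        xml.sax.ContentHandler.__init__(self)
        self._gen = XMLGenerator(out, encoding='utf-8')
        self._repl = replacements            # dict: confidential value -> masked value

    def _mask(self, text):
        for old, new in self._repl.items():
            text = text.replace(old, new)
        return text

    def startDocument(self):
        self._gen.startDocument()

    def startElement(self, name, attrs):
        self._gen.startElement(name, {k: self._mask(v) for k, v in attrs.items()})

    def endElement(self, name):
        self._gen.endElement(name)

    def characters(self, content):
        self._gen.characters(self._mask(content))

masks = {'GYT-1064': 'MASKED-0001'}          # built from values_to_mask / generateNew
with open('masked.xml', 'w', encoding='utf-8') as out:
    xml.sax.parse('dataset.xml', MaskingHandler(out, masks))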

Python pickle to xml

How can I convert a pickle object to an XML document?
For example, I have a pickle like this:
cpyplusplus_test
Coordinate
p0
(I23
I-11
tp1
Rp2
.
I want to get something like:
<Coordinate>
<x>23</x>
<y>-11</y>
</Coordinate>
The Coordinate class has x and y attributes, of course. I can supply an XML schema for the conversion.
I tried the gnosis.xml module. It can objectify XML documents into Python objects, but it cannot serialize objects to XML documents like the one above.
Any suggestion?
Thanks.
gnosis.xml does support pickling to XML:
import gnosis.xml.pickle
xml_str = gnosis.xml.pickle.dumps(obj)
To deserialize the XML, use loads:
o2 = gnosis.xml.pickle.loads(xml_str)
Of course, this will not directly convert existing pickles to XML: you first have to deserialize them into live objects and then dump those to XML.
Having said that, I must warn you that gnosis.xml is quite slow, somewhat fragile, and most likely unmaintained (the last release was over six years ago). It is also very bloated, containing a huge number of subpackages with lots of features that you not only won't need, but that are untested and buggy. We tried to use it for our development and, after a lot of effort wasted on trying to debug and improve it, ended up writing a simple XML pickler of about 500 lines of code, and never looked back.
First you need to unpickle the data with pickle.load or pickle.loads, then generate the XML snippet. If you have the pickle in a tmpStr variable, simply do this:
import pickle

c = pickle.loads(tmpStr)
print '<Coordinate>\n<x>%d</x>\n<y>%d</y>\n</Coordinate>' % (c.x, c.y)
Writing to file is left as an exercise to the reader.

How to wrap a proper generator function around a SAX Parser

I've got a 35.5 MB .XLSM file. When the actual usable content is expanded, it swamps DOM parsers like ElementTree, exhausting memory after a long, long running time.
When using a SAX parser, however, the ContentHandler seems to be constrained to accumulating rows in a temporary file, which is a little irritating because the parser and the main application could have a simple co-routine relationship where each row parsed by SAX could be yielded to the application.
It doesn't look like the following is possible.
def gen_rows_from_xlsx(someFile):
    myHandler = HandlerForXLSX()
    p = xml.sax.makeParser()
    p.setContentHandler(myHandler, some_kind_of_buffer)
    for row in some_kind_of_buffer.rows():
        p.parse()  # Just enough to get to the ContentHandler's "buffer.put()"
        yield row
Periodically, the HandlerForXLSX would invoke some_kind_of_buffer.put( row ) to put a row into the buffer. This single row should be yielded through some_kind_of_buffer.rows().
A simple coroutine relationship between a SAX parser and gen_rows_from_xlsx() would be ideal.
Have I overlooked some generator-function magic that will allow me to package SAX as a coroutine of some kind?
Is the only alternative to create a SAX parsing thread and use a Queue to get the rows built by the parser?
Or is it simpler to bite the bullet and create a temporary file in the SAX parser and then yield those objects through the generator?
Related: Lazy SAX XML parser with stop/resume.
"""I've got 35.5Mb .XLSM file. When the actual usable content is expanded, it swamps DOM parsers like element tree exhausting memory after a long, long running time."""
I don't understand this. Things you should be using:
import xml.etree.cElementTree as ET
ET.iterparse(sourcefile) # sourcefile being a cStringIO.StringIO instance holding your worksheet XML document
element.clear() # leave only scorched earth behind you
This article shows how to use iterparse and clear.
Example: loading an XLSX (100 MB, most of which is two worksheets, each with about 16K rows and about 200 cols) into the xlrd object model:
Elapsed time: about 4 minutes (beat-up old laptop with a 2 GHz single-core CPU, running Windows XP and Python 2.7). Incremental memory usage maxes out at about 300 MB, most of which is the output, not the element tree.
Seems like you could use the IncrementalParser interface for this? Something like:
import collections
import xml.sax

def gen_rows_from_xlsx(someFile):
    buf = collections.deque()
    myHandler = HandlerForXLSX(buf)
    p = xml.sax.make_parser()
    p.setContentHandler(myHandler)
    with open(someFile) as f:
        while True:
            d = f.read(BLOCKSIZE)
            if not d:
                break
            p.feed(d)
            while buf:
                yield buf.popleft()
        p.close()
To do this with parse, you would have to yield across multiple stack frames, something which Python simply does not support.
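For completeness, a hypothetical sketch of what the HandlerForXLSX assumed above might look like; the 'row' and 'c' element names are placeholders for whatever the worksheet XML actually uses:
import xml.sax

class HandlerForXLSX(xml.sax.ContentHandler):
    def __init__(self, buf):
        xml.sax.ContentHandler.__init__(self)
        self._buf = buf              # shared deque that the generator drains
        self._row = None
        self._text = []

    def startElement(self, name, attrs):
        if name == 'row':
            self._row = []
        elif name == 'c':            # a cell within the current row
            self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == 'c' and self._row is not None:
            self._row.append(''.join(self._text))
        elif name == 'row' and self._row is not None:
            self._buf.append(self._row)   # hand the finished row to the generator
            self._row = None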

python libxml2 reader and XML_PARSE_RECOVER

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM API (libxml2.readDoc) works, and it recovers from entity problems.
However, using the option with the reader API (which is essential due to the size of the documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):
Sample code (with small example):
import cStringIO
import libxml2

DOC = "<a>some broken & xml</a>"
reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)
ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()
Any ideas how to recover correctly?
I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. To parse this tree and ignore the & is nice and clean in lxml:
from cStringIO import StringIO
from lxml import etree
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())
The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
Edit:
If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
for data in StringIO(DOC).read():
    reader.feed(data)
tree = reader.close()
print etree.tostring(tree)
Isn't the XML broken in some consistent way? Isn't there some pattern you could follow to repair your XML before parsing?
For example - if the error is caused only by unescaped ampersands and you don't use CDATA or processing instructions, it can be repaired with a regexp.
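For instance, a sketch of such a repair, assuming bare ampersands really are the only problem: escape any '&' that isn't already the start of an entity reference before handing the document to the parser.
import re

def escape_bare_ampersands(text):
    # leave &amp; / &#123; / &#x1F; alone, escape every other ampersand
    return re.sub(r'&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)', '&amp;', text)

print(escape_bare_ampersands('<a>some broken & xml</a>'))
# -> <a>some broken &amp; xml</a>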
EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it can be useful in your case. (BeautifulSoup itself offers only the tree representation, not the events.)
Consider using xml.sax. When I'm presented with really malformed XML that can have a plethora of different problems, I try dividing the problem into small pieces.
You mentioned that you have a very large XML file; it probably has many records that you process serially, and each record (e.g. <item>...</item>) presumably has a start and an end tag. These will be your recovery points.
In xml.sax you provide the reader, the handler, and the input source. At worst, a single record will be unrecoverable with this technique. It's a little more setup, but incrementally parsing a malformed feed a record at a time, logging the bad records, is probably the best you can do.
In the logs, make sure to give yourself enough information to rebuild the original record so you can add additional recovery code for all the cases that you'll no doubt have to handle (e.g. create a badrecords_<today's date>.xml so you can reprocess them manually).
Good luck.
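A rough sketch of that record-at-a-time approach (mine, with an assumed <item> record element and assumed file names): feed each record to xml.sax on its own and log the ones that fail.
import xml.sax

class ItemHandler(xml.sax.ContentHandler):
    def characters(self, content):
        pass                         # process the record's content here

def parse_records(path, badlog):
    record = None
    with open(path, 'rb') as src, open(badlog, 'wb') as bad:
        for line in src:
            if b'<item' in line:                     # a new record starts
                record = []
            if record is not None:
                record.append(line)
            if record is not None and b'</item>' in line:
                data = b''.join(record)
                record = None
                try:
                    xml.sax.parseString(data, ItemHandler())
                except xml.sax.SAXParseException:
                    bad.write(data)                  # keep enough to rebuild it later

parse_records('feed.xml', 'badrecords.xml')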
Or you could use BeautifulSoup. It does a nice job of recovering from broken markup.
