Python - Large XML to JSON to File / RAM and Swap Overload - python

I'm working on a Pythonic way of parsing OpenStreetMap province/state dumps, which as far as I know just comes down to knowing how to deal with very large XML files (right?).
I'm using lxml's etree.iterparse to parse the dump for the Province of Quebec (quebec-latest.osm.bz2). I'd like to pull every entry that has highway information, convert it to JSON, write it to a file, and flush, but it doesn't seem to be working.
I'm running an i7-4770 with 16 GB of RAM, a 128 GB SSD, and OS X 10.9. When I launch the code below, my RAM fills up completely within a couple of seconds, and my swap within 30 seconds. After that my system either asks me to close applications to make room or eventually freezes.
Here's my code. You'll most likely notice a lot of bad/garbage code in there, but I got to the point where I was plugging in whatever I could find in the hope that it would work. Any help is greatly appreciated. Thanks!
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
import xmltodict, json, sys, os, gc

hwTypes = ['motorway', 'trunk', 'primary', 'secondary', 'tertiary', 'pedestrian', 'unclassified', 'service']

# Enable garbage collection
gc.enable()

def processXML(tagType):
    f = open('quebecHighways.json', 'w')
    f.write('[')
    print 'Processing '
    for event, element in etree.iterparse('quebec-latest.osm', tag=tagType):
        data = etree.tostring(element)
        data = xmltodict.parse(data)
        keys = data[tagType].keys()
        if 'tag' in keys:
            if isinstance(data[tagType]['tag'], dict):
                if data[tagType]['tag']['#k'] == 'highway':
                    if data[tagType]['tag']['#v'] in hwTypes:
                        f.write(json.dumps(data)+',')
                        f.flush()  # Flush Python
                        os.fsync(f.fileno())  # Flush system
                        gc.collect()  # Garbage collect
            else:
                for y in data[tagType]['tag']:
                    if y['#k'] == 'highway':
                        if y['#v'] in hwTypes:
                            f.write(json.dumps(data)+',')
                            f.flush()
                            os.fsync(f.fileno())
                            gc.collect()
                            break
        # Supposedly this is supposed to help clean my RAM.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    f.write(']')
    f.close()
    return 0

processXML('way')

xmltodict builds the entire generated dictionary in memory, so it is not a good fit when the data is large. Using iterparse on its own would be more efficient.
Another option is the streaming mode that xmltodict offers. More info at http://omz-software.com/pythonista/docs/ios/xmltodict.html.

I would say that you are making your life more complicated than necessary. You are serializing each whole subtree back to a string and handing it to xmltodict, which then has to parse that subtree all over again, for every element.
If I were you, I would drop xmltodict, sit down, read a tutorial or two, and use something standard: either xml.sax (it is really not that difficult if you don't need to jump ahead much; I'm using it right now to convert a Bible) or iterparse on its own. It is really not that complicated.
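As a rough illustration of the "iterparse on its own" suggestion, here is a minimal Python 3 sketch. It uses the standard library's xml.etree.ElementTree rather than lxml (a deliberate swap so the example has no dependencies), reads attributes directly instead of round-tripping through xmltodict, and clears the root after every finished top-level element, not only after ways. That last point is an assumption about what matters here: OSM dumps conventionally list all nodes before the ways, so elements that are never cleared could pile up before the first way is reached. The output field names (id, nodes, tags) and the sample data are invented for the example.

```python
import io
import json
import xml.etree.ElementTree as ET

HW_TYPES = {"motorway", "trunk", "primary", "secondary",
            "tertiary", "pedestrian", "unclassified", "service"}


def iter_highways(source, tag_type="way"):
    """Yield one plain dict per matching element, freeing parsed XML as we go."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # a nested <nd>/<tag>, or the end of the root itself
        if elem.tag == tag_type:
            tags = {t.get("k"): t.get("v") for t in elem.iter("tag")}
            if tags.get("highway") in HW_TYPES:
                yield {
                    "id": elem.get("id"),
                    "nodes": [nd.get("ref") for nd in elem.iter("nd")],
                    "tags": tags,
                }
        root.clear()  # drop every finished top-level element, matched or not


def write_highways(source, out):
    """Write the matches as a JSON array, one record at a time."""
    out.write("[")
    for i, record in enumerate(iter_highways(source)):
        if i:
            out.write(",")
        out.write(json.dumps(record))
    out.write("]")


# Tiny invented sample standing in for the real dump.
sample = b"""<osm>
<node id="1" lat="0" lon="0"/>
<way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/><tag k="name" v="Main"/></way>
<way id="11"><nd ref="3"/><tag k="highway" v="footway"/></way>
<way id="12"><tag k="building" v="yes"/></way>
</osm>"""
out = io.StringIO()
write_highways(io.BytesIO(sample), out)
result = json.loads(out.getvalue())
```

For the real file, `source` could be the object returned by `bz2.open('quebec-latest.osm.bz2', 'rb')`, so the dump never has to be decompressed to disk first.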

Related

How come Python 3 deserializes json files so fast?

Recently, I had to json.load() a file: a 13,045 KB file from a network location, over a slow corporate VPN. I did it in an interactive shell and decided to go grab a coffee in the meantime, because this would surely take ages to load. I didn't even manage to stand up before my code was done reading the JSON and translating its half a million lines into a beautiful dictionary.
How is Python able to do this so efficiently? What optimizations are used? Why does it take Sublime a few minutes to show this JSON as text?
>>> f_path = r"Z:\some\network\location\data.json"
>>> import os
>>> os.stat(f_path).st_size
13357297
>>> with measurement_tools.SimpleBenchmark():
...     with open(f_path) as f:
...         t_dict = json.load(f)
...
Took 0.203125 seconds.
Edit: SimpleBenchmark() is just my own context manager measuring tool doing basically t1-t0.
Edit 2: since there were questions about why I consider this fast, I compared it to serializing the same dictionary to JSON in the same location, which took about 10 times as long (well over 2 seconds).
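The helper itself is not shown in the question, so the following is only a guess at what a "t1-t0" context manager like SimpleBenchmark might look like, written for Python 3. The class body, the `elapsed` attribute, and the in-memory payload are all assumptions made for illustration; the sketch times json.dumps against json.loads on the same data instead of touching a network drive.

```python
import json
import time


class SimpleBenchmark:
    """Hypothetical stand-in for the asker's helper: prints t1 - t0 on exit."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        print("Took {:.6f} seconds.".format(self.elapsed))
        return False  # never swallow exceptions from the timed block


# Invented payload; the real comparison used a 13 MB file.
payload = {"row{}".format(i): list(range(10)) for i in range(1000)}

with SimpleBenchmark() as dump_timer:
    text = json.dumps(payload)

with SimpleBenchmark() as load_timer:
    restored = json.loads(text)
```

The two `elapsed` values can then be compared directly; actual numbers depend on the machine and the data, so none are quoted here.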

Saving and loading simple data in Python convenient way

I'm currently working on a simple Python 3.4.3 and Tkinter game.
I struggle with saving/reading data now, because I'm a beginner at coding.
Right now I use .txt files to store my data, but I find this extremely counter-intuitive, because saving or reading more than one line of data requires additional code to handle the newlines.
Skipping a line would be terrible too.
I've googled it, but I either find .txt save/file options or way too complex ones for saving large-scale data.
I only need to save some strings right now and be able to access them (if possible) by key like in a dictionary key:value .
Do you know of any file format/method to help me accomplish that?
Also: If possible, should work on Win/iOS/Linux.
It sounds like json would be best for this. It comes as part of the Python standard library in Python 2.6+.
import json

data = {'username':'John', 'health':98, 'weapon':'warhammer'}

# serialize the data to user-data.txt
with open('user-data.txt', 'w') as fobj:
    json.dump(data, fobj)

# read the data back in
with open('user-data.txt', 'r') as fobj:
    data = json.load(fobj)

print(data)
# outputs (on Python 2; Python 3 prints the same dict without the u prefixes):
# {u'username': u'John', u'weapon': u'warhammer', u'health': 98}
A popular alternative is yaml, which is actually a superset of json and produces slightly more human readable results.
You might want to try Redis.
http://redis.io/
I'm not totally sure it'll meet all your needs, but it would probably be better than a flat file.

How to wrap a proper generator function around a SAX Parser

I've got a 35.5 MB .XLSM file. When the actual usable content is expanded, it swamps DOM parsers like ElementTree, exhausting memory after a long, long running time.
When using a SAX parser, however, the ContentHandler seems to be constrained to accumulate rows in a temporary file. That is a little irritating, because the parser and the main application could have a simple coroutine relationship in which each row parsed by SAX is yielded to the application.
It doesn't look like the following is possible.
def gen_rows_from_xlsx( someFile ):
    myHandler= HandlerForXLSX()
    p= xml.sax.make_parser()
    p.setContentHandler( myHandler, some_kind_of_buffer )
    for row in some_kind_of_buffer.rows():
        p.parse() # Just enough to get to the ContentHandler's "buffer.put()"
        yield row
Periodically, the HandlerForXLSX would invoke some_kind_of_buffer.put( row ) to put a row into the buffer. This single row should be yielded through some_kind_of_buffer.rows().
A simple coroutine relationship between a SAX parser and gen_rows_from_xlsx() would be ideal.
Have I overlooked some generator-function magic that will allow me to package SAX as a coroutine of some kind?
Is the only alternative to create a SAX parsing thread and use a Queue to get the rows built by the parser?
Or is it simpler to bite the bullet and create a temporary file in the SAX parser and then yield those objects through the generator?
Related: Lazy SAX XML parser with stop/resume.
"""I've got 35.5Mb .XLSM file. When the actual usable content is expanded, it swamps DOM parsers like element tree exhausting memory after a long, long running time."""
I don't understand this. Things you should be using:
import xml.etree.cElementTree as ET
ET.iterparse(sourcefile) # sourcefile being a cStringIO.StringIO instance holding your worksheet XML document
element.clear() # leave only scorched earth behind you
This article shows how to use iterparse and clear.
Example: Loading an XLSX (100Mb, most of which is two worksheets each with about 16K rows and about 200 cols) into the xlrd object model:
Elapsed time about 4 minutes [beat-up old laptop [2 GHz single-core] running Windows XP and Python 2.7]. Incremental memory usage maxes out at about 300Mb of memory, most of which is the output, not the element tree.
Seems like you could use the IncrementalParser interface for this? Something like:
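import collections
import xml.sax

BLOCKSIZE = 64 * 1024  # bytes per read; any reasonable chunk size works

def gen_rows_from_xlsx(someFile):
    buf = collections.deque()
    myHandler = HandlerForXLSX(buf)  # the handler appends each finished row to buf
    p = xml.sax.make_parser()
    p.setContentHandler(myHandler)
    with open(someFile, 'rb') as f:
        while True:
            d = f.read(BLOCKSIZE)
            if not d: break
            p.feed(d)
            while buf: yield buf.popleft()
    p.close()
    while buf: yield buf.popleft()  # rows completed by the final flush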
To do this with parse, you would have to yield across multiple stack frames, something which Python simply does not support.
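The snippet above leaves HandlerForXLSX undefined, so here is a self-contained Python 3 sketch of the same feed-and-drain idea with a handler filled in. The handler, the simplified `<row>`/`<c>` element names, and the sample document are all assumptions made for illustration; a real worksheet has more structure (shared strings, cell types) that this ignores.

```python
import collections
import io
import xml.sax

BLOCKSIZE = 4096  # assumed chunk size; tune as needed


class RowHandler(xml.sax.ContentHandler):
    """Hypothetical handler: queues one list of cell texts per <row>."""

    def __init__(self, buf):
        super().__init__()
        self._buf = buf
        self._row = None   # cells of the row being built, or None
        self._cell = None  # text fragments of the cell being built, or None

    def startElement(self, name, attrs):
        if name == "row":
            self._row = []
        elif name == "c" and self._row is not None:
            self._cell = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect fragments.
        if self._cell is not None:
            self._cell.append(content)

    def endElement(self, name):
        if name == "c" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif name == "row" and self._row is not None:
            self._buf.append(self._row)
            self._row = None


def gen_rows(stream, blocksize=BLOCKSIZE):
    """Feed the parser chunk by chunk and yield rows as soon as they complete."""
    buf = collections.deque()
    parser = xml.sax.make_parser()
    parser.setContentHandler(RowHandler(buf))
    while True:
        chunk = stream.read(blocksize)
        if not chunk:
            break
        parser.feed(chunk)
        while buf:
            yield buf.popleft()
    parser.close()
    while buf:  # anything completed by the final flush
        yield buf.popleft()


sample = b"<sheet><row><c>1</c><c>a</c></row><row><c>2</c><c>b</c></row></sheet>"
rows = list(gen_rows(io.BytesIO(sample), blocksize=16))
```

The tiny blocksize in the usage line is only there to show that rows spanning several chunks still come out whole.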

python libxml2 reader and XML_PARSE_RECOVER

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.
However using the option with the reader API (which is essential due to the size of documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):
Sample code (with small example):
import cStringIO
import libxml2
DOC = "<a>some broken & xml</a>"
reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)
ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()
Any ideas how to recover correctly?
I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. Parsing this tree while ignoring the stray & is nice and clean in lxml:
from cStringIO import StringIO
from lxml import etree
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())
The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
Edit:
If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
for data in StringIO(DOC).read():
    reader.feed(data)
tree = reader.close()
print etree.tostring(tree)
Isn't the XML broken in some consistent way? Isn't there some pattern you could follow to repair your XML before parsing?
For example, if the error is caused only by unescaped ampersands and you don't use CDATA or processing instructions, it can be repaired with a regexp.
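A minimal Python 3 sketch of that regexp idea follows. The pattern and function name are my own, not from the answer: it escapes any ampersand that does not already begin a named, decimal, or hexadecimal reference. It assumes, as stated above, that the input contains no CDATA sections or processing instructions, and it will not help with references to undefined named entities.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand NOT followed by something that looks like "name;", "#123;" or "#x1F;".
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace stray '&' with '&amp;' and leave existing references alone."""
    return BARE_AMP.sub("&amp;", text)


broken = "<a>some broken & xml, AT&T, already &amp; fine &#38; &#x26;</a>"
repaired = escape_bare_ampersands(broken)
root = ET.fromstring(repaired)  # now parses with a strict parser
```

For documents too large to hold in one string, the same substitution could be applied per line or per chunk before feeding the parser, with care taken not to split a reference across chunk boundaries.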
EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it may be useful in your case. (BeautifulSoup itself offers only the tree representation, not the events.)
Consider using xml.sax. When I'm presented with really malformed XML that can have a plethora of different problems, I try dividing the problem into small pieces.
You mentioned that you have a very large XML file; it probably has many records that you process serially. Each record (e.g. <item>...</item>) presumably has a start and an end tag, and these will be your recovery points.
In xml.sax you provide the reader, the handler, and the input sources. At worst, a single record will be unrecoverable with this technique. It's a little more setup, but incrementally parsing a malformed feed one record at a time and logging the bad records is probably the best you can do.
In the logs, make sure to give yourself enough information to rebuild the original record, so you can add recovery code for all the cases that you'll no doubt have to handle (e.g. create a badrecords_<today's date>.xml so you can reprocess manually).
Good luck.
Or you could use BeautifulSoup. It does a nice job of recovering broken markup.

Showing progress of python's XML parser when loading a huge file

I'm using Python's built-in XML parser to load a 1.5 GB XML file, and it takes all day.
from xml.dom import minidom
xmldoc = minidom.parse('events.xml')
I need to know how to get inside that and measure its progress so I can show a progress bar.
Any ideas?
minidom has another method called parseString() that returns a DOM tree, assuming the string you pass it is valid XML. If I were to split the file into chunks myself and pass them to parseString one at a time, could I merge all the DOM trees back together at the end?
Your use case calls for a SAX parser instead of DOM. DOM loads everything into memory, whereas SAX parses the file incrementally and calls handlers that you write for the events you care about.
That should be efficient, and it would also let you write a progress indicator.
I also recommend trying the expat parser sometime; it is very useful:
http://docs.python.org/library/pyexpat.html
For progress using SAX:
Since SAX reads the file incrementally, you can wrap the file object you pass in with your own and keep track of how much has been read.
Edit:
I also don't like the idea of splitting the file yourself and joining the DOM trees at the end; at that point you might as well write your own XML parser. I recommend using a SAX parser instead.
I also wonder what your purpose is in reading a 1.5 GB file into a DOM tree.
It looks like SAX would be better here.
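The file-wrapping idea for progress can be sketched roughly as follows in Python 3. The wrapper class, the counting handler, the `<event>` element name, and the sample data are all hypothetical; the only thing relied on from the library is that xml.sax.parse accepts a file-like object and calls its read() and close() methods.

```python
import io
import xml.sax


class ProgressReader:
    """File-like wrapper that reports how much the parser has consumed."""

    def __init__(self, fileobj, total_size, on_progress):
        self._fileobj = fileobj
        self._total = total_size
        self._on_progress = on_progress  # called with a percentage (float)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        self.bytes_read += len(chunk)
        if chunk and self._total:
            self._on_progress(100.0 * self.bytes_read / self._total)
        return chunk

    def close(self):
        self._fileobj.close()


class EventCounter(xml.sax.ContentHandler):
    """Hypothetical handler: just counts <event> elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "event":
            self.count += 1


def parse_with_progress(fileobj, total_size, on_progress):
    handler = EventCounter()
    xml.sax.parse(ProgressReader(fileobj, total_size, on_progress), handler)
    return handler.count


# Invented sample standing in for events.xml.
sample = b"<events>" + b"<event id='1'/>" * 10000 + b"</events>"
percentages = []
n_events = parse_with_progress(io.BytesIO(sample), len(sample), percentages.append)
```

With a real file, `total_size` would come from `os.path.getsize(path)` and `on_progress` would update the progress bar instead of appending to a list.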
Have you considered other means of parsing the XML? Building a tree from such a big XML file will always be slow and memory-intensive. If you don't need the whole tree in memory, stream-based parsing will be much faster. It can be a little daunting if you're used to tree-based XML manipulation, but it will pay off in the form of a huge speed increase (minutes instead of hours).
http://docs.python.org/library/xml.sax.html
I have something very similar for PyGTK, not PyQt, using the pulldom api. It gets called a little bit at a time using Gtk idle events (so the GUI doesn't lock up) and Python generators (to save the parsing state).
def idle_handler (fn):
    fh = open (fn) # file handle
    doc = xml.dom.pulldom.parse (fh)
    fsize = os.stat (fn)[stat.ST_SIZE]
    position = 0
    for event, node in doc:
        if position != fh.tell ():
            position = fh.tell ()
            # update status: position * 100 / fsize
        if event == ....
        yield True # idle handler stays until False is returned
    yield False

def main ():
    add_idle_handler (idle_handler, filename)
Merging the tree at the end would be pretty easy. You could just create a new DOM, and basically append the individual trees to it one by one. This would give you pretty finely tuned control over the progress of the parsing too. You could even parallelize it if you wanted by spawning different processes to parse each section. You just have to make sure you split it intelligently (not splitting in the middle of a tag, etc.).
