Parsing a large (~40GB) XML text file in python

Parsing a large (~40GB) XML text file in python - python

I've got an XML file I want to parse with python. What is best way to do this? Taking into memory the entire document would be disastrous, I need to somehow read it a single node at a time.
Existing XML solutions I know of:
element tree
minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also I can't open it in a text editor - any good tips in generao for working with giant text files?

First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
for event, elem in ET.iterparse(xmlfile, events=('end')):
...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.

The best solution will depend in part on what you are trying to do, and how free your system resources are. Converting it to a postgresql or similar database might not be a bad first goal; on the other hand, if you just need to pull data out once, it's probably not needed. When I have to parse large XML files, especially when the goal is to process the data for graphs or the like, I usually convert the xml to S-expressions, and then use an S-expression interpreter (implemented in python) to analyse the tags in order and build the tabulated data. Since it can read the file in a line at a time, the length of the file doesn't matter, so long as the resulting tabulated data all fits in memory.

Related

Retrieve only a portion of an XML feed

I'm using Scrapy XMLFeedSpider to parse a big XML feed(60MB) from a website, and i was just wondering if there is a way to retrieve only a portion of it instead of all 60MB because right now the RAM consumed is pretty high, maybe something to put in the link like:
"http://site/feed.xml?limit=10", i've searched if there is something similar to this but i haven't found anything.
Another option would be limit the items parsed by scrapy, but i don't know how to do that.Right now once the XMLFeedSpider parsed the whole document the bot will analyze only the first ten items, but i supposes that the whole feed will still be in the memory.
Have you any idea on how to improve the bot's performance , diminishing the RAM and CPU consumption? Thanks

When you are processing large xml documents and you don't want to load the whole thing in memory as DOM parsers do. You need to switch to a SAX parser.
SAX parsers have some benefits over DOM-style parsers. A SAX parser
only needs to report each parsing event as it happens, and normally
discards almost all of that information once reported (it does,
however, keep some things, for example a list of all elements that
have not been closed yet, in order to catch later errors such as
end-tags in the wrong order). Thus, the minimum memory required for a
SAX parser is proportional to the maximum depth of the XML file (i.e.,
of the XML tree) and the maximum data involved in a single XML event
(such as the name and attributes of a single start-tag, or the content
of a processing instruction, etc.).
For a 60 MB XML document, this is likely to be very low compared to the requirments for creating a DOM. Most DOM based systems actually use at a much lower level to build up the tree.
In order to create make use of sax, subclass xml.sax.saxutils.XMLGenerator and overrider endElement, startElement and characters. Then call xml.sax.parse with it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.

You should set the iterator mode of your XMLFeedSpider to iternodes (see here):
It’s recommended to use the iternodes iterator for performance reasons
After doing so, you should be able to iterate over your feed and stop at any point.

How to make a python instanced object reusable?

a couple of my python programs aim to
format into a hash table (hence, I'm a dict() addict ;-) ) some informations in a "source" text file, and
to use that table to modify a "target" file. My concern is that the "source" files I usually process can be very large (several GB) so it makes more than 10sec to parse, and I need to run that program a bunch of times. To conclude, I feel like it's a waste to reload the same large file each time I need to modify a new "target".
My thought is, if it would be possible to write once the dict() made from the "source" file in a way that python would be able to read/process much faster (I think about a format close to the one used in RAM by python), it would be great.
Is there a possibility to achieve that?
Thank you.

Yea, you can marshal the dict, or you can use pickle. For the difference between the two, especially as regards to speed, see this question.

pickle is the usual solution to such things, but if you see any value in being able to edit the saved data, and if the dictionary uses only simple types such as strings and numbers (nested dictionaries or lists are also OK), you can simply write the repr() of the dictionary to a text file, then parse it back into a Python dictionary using eval() (or, better yet, ast.literal_eval()).

Is there any way to know how much memory an ElementTree DOM consumes?

Suppose you do the following:
dom = ElementTree()
dom.parse(some_file_path)
I'd like to log the rough amount of memory that this dom is now using in my process. I don't need something precise, something rough would do.
I don't think I can derive it from the size of the source XML file. I have a 500 kilobyte file that seems to add about 5MB to the memory usage of my python process after it's loaded as in the example above.
I looked over the ElementTree API and didn't see any API to provide this information. Anyone know of a way to know how much memory the ElementTree instance is using after parsing/loading an XML file?

Essentially you want to find memory consumption of a particular python object. That's what it is. Here its an ElementTree object but it can be anything.
To cut the chase short as far as I know There's no easy way to find out the memory size of a python object. One of the problems you may find is that Python objects - like lists and dicts - may have references to other python objects (in this case, what would your size be? The size containing the size of each object or not?). There are some pointers overhead and internal structures related to object types and garbage collection. Finally, some python objects have non-obvious behaviors. For instance, lists reserve space for more objects than they have, most of the time; dicts are even more complicated since they can operate in different ways (they have a different implementation for small number of keys and sometimes they over allocate entries).
There is a big chunk of code out there to try to best approximate the size of a python object in memory. There's also some simpler approximations. But they will always be approximations.
You may also want to check some old description about PyObject (the internal C struct that represents virtually all python objects).
also, PySizer, "a memory profiler for Python," found at http://pysizer.8325.org/. However the page seems to indicate that the project hasn't been updated for a while, and refers to...
You could give Heapy a try, "support[ing] debugging and optimization regarding memory related issues in Python programs," found at http://guppy-pe.sourceforge.net/#Heapy.
objgraph looks interesting: http://mg.pov.lt/objgraph/

Keeping in-memory data in sync with a file for long running Python script

I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?

Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.

I agree that you don't need a fully blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts, serialisation/deserialisation, and the atomic writing.
For the first section, json, or pickle are probably suitable formats for you. JSON has the advantage of being human readable. It doesn't seem as though this the primary problem you are facing though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
import os, platform
backup_filename = "output.back.json"
filename = "output.json"
serialised_str = json.dumps(...)
with open(backup_filename, 'wb') as f:
f.write(serialised_str)
if platform.system() == 'Windows':
os.unlink(filename)
os.rename(backup_filename, filename)
While os.rename is will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, meaning that you have only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when you are checking for the existence of filename.
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.

Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON, or XML as you hinted at - use the built in json module to serialize the objects then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
"4": 5,
"6": 7
}

Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a python object into a line very easily.
Then you just set up a Thread that saves your object to a file in defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case that a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is using "Log-based recovery". It is essentially writing a record to a log file saying that "write transaction started" and then writing "write transaction comitted" when you're done. If a "started" has no corresponding "commit", then you rollback.
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered on the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to record to a new file every time, so that if one corrupts you have a history. You can even add a checksum to each file to automatically determine if it's broken.

You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.

Python and Memory Consumption

I am searching for a way to be able to handle overloading the RAM and CPU using a high memory program... I would like to process a LARGE amount of data contained in files. I then read the files and process the data therein. The problem is there are many nested for loops and a root XML file is being created from all the data processed.
The program easily consumes a couple gigs of RAM after half hour or so of run-time.
Is there something I can do to not let RAM get so big and/or work around it.. ?

Do you really need to keep the whole data from the XML file on memory at once?
Most (all?) XML libraries out there allow you to do iterative parsing, meaning that you keep in memory just a few nodes of the XML file, not the whole file. That is unless you are making a string containing the XML file yourself without any library, but that is a bit insane. If that is the case, use a library ASAP.
The specific code samples presented here might not apply to your project, but consider a few principles—borne out by testing and the lxml documentation—when faced with XML data measured in gigabytes or more:
Use an iterative parsing strategy to incrementally process large documents.
If searching the entire document in random order is required, move to an indexed XML database.
Be extremely conservative in the data that you select. If you are only interested in particular nodes, use methods that select by those names. If you require predicate syntax, try one of the XPath classes and methods available.
Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara might be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required.
Take the time to do even simple benchmarking. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient.
If you need to do complex operations on the data, why don't you just put it on a relational database and operate on the data from there? That will have better performance.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.