I have a huge XML file, and I'm a bit at a loss on how to handle it. It's 60 GB, and I need to read it.
I was wondering if there is a way to use the multiprocessing module to read the file in Python?
Does anyone have any samples of doing this that they could point me to?
Thank you
For a file of that size, I suggest you use a streaming XML parser. In Python, this would be the iterparse method from cElementTree or lxml.etree:
http://effbot.org/zone/element-iterparse.htm
Save memory parsing very large XML files
You could use this code, which is a bit newer than the effbot.org one; it might save you more memory:
Using Python Iterparse For Large XML Files
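As a rough sketch of the iterparse approach (assuming lxml is available; 'huge.xml' and the 'record' tag name are placeholders for your actual file and element):

    import lxml.etree as etree

    def fast_iter(path, tag):
        # iterparse yields (event, element) pairs as the file is read,
        # so the whole 60 GB document is never held in memory at once.
        context = etree.iterparse(path, events=('end',), tag=tag)
        for event, elem in context:
            yield elem
            # Clear the element and drop references held by its ancestors,
            # otherwise memory still grows as the tree is built.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    for record in fast_iter('huge.xml', 'record'):
        pass  # process each record element here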
Multiprocessing / Multithreading
If I remember correctly, you cannot easily use multiprocessing to speed things up when loading/parsing the XML. If it were an easy option, everyone would probably already be doing it by default.
Python in general uses a global interpreter lock (GIL), which causes Python to run within one process bound to one core of your CPU. When threads are used, they run in the context of the main Python process, which is still bound to only one core. Using threads in Python can even lead to a performance decrease due to context switching. Running multiple Python processes on multiple cores does bring the expected additional performance, but those processes do not share memory, so you need inter-process communication (IPC) to have them work together (you can use a multiprocessing pool, where the workers sync when the work is done, but that is mostly useful for smallish tasks that are finite). Sharing memory is required here, I would assume, since every task would be working on the same big XML.
lxml, however, has some ways to work around the GIL, but they only improve performance under certain conditions.
Threading in LXML
For introducing threading in lxml there is a part in the FAQ that talks about this: http://lxml.de/FAQ.html#id1
Can I use threads to concurrently access the lxml API?
Short answer: yes, if you use lxml 2.2 and later.
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.
Does my program run faster if I use threads?
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
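As a minimal sketch of the kind of workload the FAQ describes, here several files are parsed in threads so lxml can release the GIL during each parse (the file names and the element count are placeholders for real work):

    from concurrent.futures import ThreadPoolExecutor
    import lxml.etree as etree

    def count_elements(path):
        # lxml releases the GIL while parsing from disk, so several
        # parses can genuinely overlap on a multi-core machine.
        tree = etree.parse(path)
        return path, len(tree.getroot())

    paths = ['part1.xml', 'part2.xml', 'part3.xml']  # placeholder file names
    with ThreadPoolExecutor(max_workers=3) as pool:
        for path, count in pool.map(count_elements, paths):
            print(path, count)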
Additional tips on optimizing performance for parsing large XML
https://www.ibm.com/developerworks/library/x-hiperfparse/
Related
I have a little Python application, developed using wxPython 4.0.3, that performs a fairly simple ETL-type task:
- Takes user input to select a parent directory containing several sub-directories full of CSV files (time series data).
- Transforms the data in each of those files to comply with the formatting required by a third-party recipient and writes them out to new files.
- Zips each of the new files into a file named after the original file's parent directory.

At the user's initiative, the application then FTPs the zip files to the third party to be loaded into a data store.
The application works well enough, but the time required to process several thousand CSV files is pretty extreme, and the work is mostly I/O-bound from what I can tell.
Is asyncio a reasonable option to pursue, or are there other recommendations that anyone can make? I originally wrote this as a CLI and saw significant performance gains by using PyPy, but I was reluctant to combine PyPy with wxPython when I developed the UI for others.
Thanks for your guidance.
If you saw a significant speedup by using PyPy instead of CPython, that implies that your code probably isn't I/O-bound. Which means that making the I/O asynchronous isn't going to help very much. Plus, it'll be extra work, as well, because you'll have to restructure all of your CPU-heavy tasks into small pieces that can await repeatedly so they don't block the other tasks.
So, you probably want to use multiple processes here.
The simplest solution is to use a concurrent.futures.ProcessPoolExecutor: just toss tasks at the executor, and it'll run them on the child processes and return you a Future.
Unlike using asyncio, you won't have to change those tasks at all. They can read a file just by looping over the csv module, process it all in one big chunk, and even use the synchronous ftplib module, without needing to worry about anyone blocking anyone else. Only your top-level code needs to change.
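A rough sketch of what that top-level code could look like; process_file, the directory name, and the transform step are placeholders for whatever your existing per-file logic does:

    import csv
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def process_file(path):
        # Your existing synchronous code: read, transform, write, zip.
        with open(path, newline='') as f:
            rows = list(csv.reader(f))
        # ... transform rows and write/zip the result here ...
        return path

    if __name__ == '__main__':
        files = list(Path('parent_dir').rglob('*.csv'))  # placeholder directory
        with ProcessPoolExecutor() as executor:
            for done in executor.map(process_file, files):
                print('finished', done)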
However, you may want to consider splitting the code into a wx GUI that you run in CPython, and a multiprocessing engine that you run via subprocess in PyPy, which then spins off the ProcessPoolExecutor in PyPy as well. This would take a bit more work, but it means you'll get the CPU benefits of using PyPy, the well-tested-with-wx benefits of using CPython, and the parallelism of multiprocessing.
Another option to consider is pulling in a library like NumPy or Pandas that can do the slow parts (whether that's reading and processing the CSV, or doing some kind of elementwise computation on thousands of rows, or whatever) more quickly (and possibly even releasing the GIL, meaning you don't need multiprocessing).
If your code really is I/O-bound code, and primarily bound on the FTP requests, asyncio would help. But it would require rewriting a lot of code. You'd need to find or write an asyncio-driven FTP client library. And, if the file reading takes any significant part of your time, converting that to async is even more work.
There's also the problem of integrating the wx event loop with the asyncio event loop. You might be able to get away with running the asyncio loop in a second thread, but then you need to come up with some way of communicating between the wx event loop in the main thread and the asyncio loop in the background thread. Alternatively, you might be able to drive one loop from the other (or there might even be third-party libraries that do that for you). But this might be a lot easier to do with (or have better third-party libraries to help with) something like twisted instead of asyncio.
But, unless you need massive concurrency (which you probably don't, unless you've got hundreds of different FTP servers to talk to), threads should work just as well, with a lot fewer changes to your code. Just use a concurrent.futures.ThreadPoolExecutor, which is nearly identical to using a ProcessPoolExecutor as explained above.
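If the work really is I/O-bound, the change from the process-pool sketch above is essentially one line (process_file and files are the same hypothetical names as before):

    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    files = list(Path('parent_dir').rglob('*.csv'))      # placeholder directory
    with ThreadPoolExecutor(max_workers=8) as executor:
        for done in executor.map(process_file, files):    # same process_file as above
            print('finished', done)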
Yes, you will probably benefit from using an asynchronous library. Since most of your time is spent waiting for I/O, a well-written asynchronous program will use that time to do something else, without the overhead of extra threads/processes. It will scale really well.
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
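For reference, a minimal sketch of that RawArray approach, using a pool initializer so the buffer is handed to each worker once at start-up; the file name, block size, and per-worker computation are placeholders:

    import multiprocessing as mp
    from multiprocessing.sharedctypes import RawArray

    BLOCK_SIZE = 16 * 1024 * 1024  # placeholder block size

    _shared = None  # set in each worker by the pool initializer

    def init_worker(shared):
        global _shared
        _shared = shared

    def work(n):
        # Every worker reads from the same shared buffer; no per-process copy.
        return sum(memoryview(_shared)[:n])  # stand-in for the real computation

    if __name__ == '__main__':
        shared = RawArray('B', BLOCK_SIZE)      # unlocked shared byte buffer
        with open('data.bin', 'rb') as f:       # placeholder file name
            n = f.readinto(memoryview(shared))  # the copy described above
        with mp.Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
            results = pool.map(work, [n] * 4)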
With queues, there would be a lot of redundant data copying relative to shared memory; I chose shared memory because it involves one copy versus 'n' copies of the data.
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but passing the object to the subprocesses would require pickling it, and you can't pickle mmap objects. So much for that.
Shared memory via a Cython module may or may not be possible, but it's almost certainly prohibitive in effort, and it raises the question of whether a language better suited to the task should be used instead.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
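A rough sketch of that queue-based variant with long-lived workers, in line with Edit 2 below; the block size, file name, and per-block computation are placeholders:

    import multiprocessing as mp

    BLOCK_SIZE = 16 * 1024 * 1024  # placeholder block size

    def worker(tasks):
        while True:
            block = tasks.get()
            if block is None:          # sentinel tells the worker to exit
                tasks.task_done()
                break
            _ = sum(block)             # stand-in for the expensive computation
            tasks.task_done()

    if __name__ == '__main__':
        tasks = mp.JoinableQueue()
        procs = [mp.Process(target=worker, args=(tasks,)) for _ in range(4)]
        for p in procs:
            p.start()
        with open('data.bin', 'rb') as f:  # placeholder file name
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                for _ in procs:            # n copies of the block (a queue per worker
                    tasks.put(block)       # would guarantee each process gets exactly one)
                tasks.join()               # wait until all workers are done with it
        for _ in procs:
            tasks.put(None)
        tasks.join()
        for p in procs:
            p.join()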
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
The difference between Java and Python threads is:
Java is designed to lock on the resources
Python is designed to lock the interpreter itself (the GIL)
So Python's implementation performs well enough on a machine with a single-core processor. That was fine 10-20 years ago, but with today's increased computing capacity, the same piece of code performs very badly on a multiprocessor machine.
Is there any hack to disable the GIL and use resource locking in Python (like the Java implementation)?
P.S. My application is currently running on Python 2.7.12. It is compute-intensive, with little I/O or network blocking. Assume that I can't use multiprocessing for my use case.
I think the most straightforward way for you, which will also give you a nice performance increase, is to use Cython.
Cython is a Python superset that compiles Python-like code to C code (which makes use of the CPython API), and from there to an executable. It allows one to optionally type variables, which can then use native C types instead of Python objects, and it also gives one direct control over the GIL.
It supports a with nogil: statement, in which the with block runs with the GIL released; if there are other threads running (you use the normal Python threading library), they won't be blocked while code is running in the marked with block.
Just keep in mind that the GIL is there for a reason: it is thanks to it that global complex objects like lists and dictionaries work without the danger of getting into an inconsistent state between threads. But if your "nogil" blocks restrict themselves to local data structures, you should have no problems.
Check the Cython project - and here is a specific example of turning off the GIL:
https://lbolla.info/blog/2013/12/23/python-threads-cython-gil
What are some good guidelines to follow when deciding to use threads or multiprocessing when speaking in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in any of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
For code clarity, one of the biggest things is to get to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and I think enable a lot cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
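A small sketch of the producer/consumer pattern those examples cover, using Python 3's queue module (the work itself is a placeholder; the same structure applies to multiprocessing's Queue):

    import queue
    import threading

    def consumer(q):
        while True:
            item = q.get()
            if item is None:          # sentinel: stop this worker
                q.task_done()
                break
            print('processed', item)  # placeholder for the real work
            q.task_done()

    q = queue.Queue()
    workers = [threading.Thread(target=consumer, args=(q,)) for _ in range(4)]
    for t in workers:
        t.start()
    for item in range(20):
        q.put(item)
    q.join()                          # block until every item is processed
    for _ in workers:
        q.put(None)
    for t in workers:
        t.join()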
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back and it is definitely worth watching. More info here, including a followup talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is not great, or even worse than single core. But if your threads are doing a bunch of i/o (file read/write, DB access, socket read/write, etc), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a separate Python interpreter (and GIL) for each process.
I have a CPU-intensive code which uses a heavy dictionary as data (around 250M of data). I have a multicore processor and want to utilize it so that I can run more than one task at a time. The dictionary is mostly read-only and may be updated once a day.
How can I write this in Python without duplicating the dictionary?
I understand that Python threads are limited by the GIL and will not offer true concurrency. Can I use the multiprocessing module without the data being serialized between processes?
I come from the Java world, and my requirement would be something like Java threads, which can share data, run on multiple processors, and offer synchronization primitives.
You can share read-only data among processes simply with a fork (on Unix; no easy way on Windows), but that won't catch the "once a day change" (you'd need to put an explicit way in place for each process to update its own copy). Native Python structures like dict are just not designed to live at arbitrary addresses in shared memory (you'd have to code a dict variant supporting that in C) so they offer no solace.
You could use Jython (or IronPython) to get a Python implementation with exactly the same multi-threading abilities as Java (or, respectively, C#), including multiple-processor usage by multiple simultaneous threads.
Use shelve for the dictionary. Since writes are infrequent there shouldn't be an issue with sharing it.
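A minimal sketch of what that could look like; the shelf file name and keys are placeholders:

    import shelve

    # Writer (run once a day): persist the dictionary to disk.
    db = shelve.open('data.shelf')        # placeholder file name
    db['some_key'] = {'value': 42}
    db.close()

    # Each reader process opens the shelf read-only and looks up what it needs.
    db = shelve.open('data.shelf', flag='r')
    item = db['some_key']
    db.close()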
Take a look at this in the stdlib:
http://docs.python.org/library/multiprocessing.html
There are a bunch of wonderful features that will allow you to share data structures between processes very easily.
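For example, a Manager gives you a proxy dict that several processes can read; note that every lookup goes through the manager process, so it is slower than a local dict (this is just a sketch, and the dictionary contents are placeholders):

    import multiprocessing as mp

    def lookup(shared, key):
        # The proxy forwards each access to the manager process.
        return shared.get(key)

    if __name__ == '__main__':
        with mp.Manager() as manager:
            shared = manager.dict({'answer': 42})   # placeholder contents
            with mp.Pool(4) as pool:
                print(pool.starmap(lookup, [(shared, 'answer')] * 4))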