I have CPU-intensive code which uses a large dictionary as its data (around 250 MB of data). I have a multicore processor and want to utilize it so that I can run more than one task at a time. The dictionary is mostly read-only and may be updated once a day.
How can I write this in Python without duplicating the dictionary?
I understand that Python threads are subject to the GIL and will not offer true parallelism. Can I use the multiprocessing module without the data being serialized between processes?
I come from the Java world, and my requirement would be something like Java threads, which can share data, run on multiple processors, and offer synchronization primitives.
You can share read-only data among processes simply with a fork (on Unix; no easy way on Windows), but that won't catch the "once a day change" (you'd need to put an explicit way in place for each process to update its own copy). Native Python structures like dict are just not designed to live at arbitrary addresses in shared memory (you'd have to code a dict variant supporting that in C) so they offer no solace.
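A minimal sketch of that fork-based approach (Unix only; the dict contents and worker logic below are just placeholders):

    import multiprocessing as mp

    # Build the large dict once in the parent, *before* any workers exist.
    # With the "fork" start method the children inherit it copy-on-write,
    # so the data is not duplicated as long as nothing writes to it.
    BIG_DICT = {str(i): i for i in range(1_000_000)}  # stand-in for the real data

    def worker(key):
        # Reads the inherited dict directly; nothing is pickled except `key`.
        return BIG_DICT.get(key)

    if __name__ == "__main__":
        mp.set_start_method("fork")          # not available on Windows
        with mp.Pool(processes=4) as pool:
            print(pool.map(worker, ["1", "42", "123456"]))

Note that CPython's reference counting touches object headers, so some pages will still get copied over time even for purely read-only access.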
You could use Jython (or IronPython) to get a Python implementation with exactly the same multi-threading abilities as Java (or, respectively, C#), including multiple-processor usage by multiple simultaneous threads.
Use shelve for the dictionary. Since writes are infrequent there shouldn't be an issue with sharing it.
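Something like this, as a rough sketch (the filename and contents are made up):

    import shelve

    # Rebuild once a day; only the writer needs write access.
    with shelve.open("big_data.db") as db:
        db["some_key"] = {"value": 42}       # stand-in for the real entries

    # Each worker process opens the same shelf read-only; only the entries
    # it actually looks up get loaded into that process's memory.
    with shelve.open("big_data.db", flag="r") as db:
        print(db["some_key"])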
Take a look at this in the stdlib:
http://docs.python.org/library/multiprocessing.html
There are a bunch of wonderful features that will allow you to share data structures between processes very easily.
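For example, a Manager can host the dict in one server process and hand out proxies to the workers. A sketch with toy data (note every lookup goes through IPC, so it is convenient but not fast):

    from multiprocessing import Manager, Pool

    def worker(args):
        shared, key = args
        return shared.get(key)    # proxied lookup against the manager process

    if __name__ == "__main__":
        with Manager() as manager:
            shared = manager.dict({"a": 1, "b": 2})   # stand-in for the real data
            with Pool(processes=2) as pool:
                print(pool.map(worker, [(shared, "a"), (shared, "b")]))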
Related
I have a read-only, very large DataFrame and I want to do some calculations on it, so I do a multiprocessing.map and set the DataFrame as a global. However, does this imply that for each process, the program will copy the DataFrame separately (so it will be faster than a shared one)?
If I understand it correctly you won't get any benefit by trying to use multiprocessing.map on a Pandas DataFrame because the DataFrame is built over NumPy ndarray structures and NumPy already releases the GIL, scales to SMP hardware, uses vectorized machine instructions where they're available and so on.
As you say, you might be incurring massive RAM consumption and data copying or shared-memory locking overhead on the DataFrame structures for no benefit. Performance considerations for the combination of NumPy and Python's multiprocessing module are discussed in this SO question: Multiprocessing.Pool makes Numpy matrix multiplication slower.
The fact that you're treating this DataFrame as read-only is interesting because it suggests that you could write code around os.fork() which, due to the OS CoW (copy-on-write) semantics through the fork() system call, should be an inexpensive way to share the data with child processes, allowing each to then analyze the data in various ways. (Any code which writes to the data would trigger allocation of fresh pages and copying, of course).
The multiprocessing module is using the fork() system call under the hood (at least on Unix, Linux and similar systems). If you create and fully populate this large data structure (DataFrame) before you invoke any of the multiprocessing functions or instantiate any of its objects which create subprocesses, then you might be able to access the copies of the DataFrame which each process implicitly inherited. I don't have the time to concoct a bit of test code right now; but that might work.
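Roughly, the pattern I have in mind would be something like this (an untested sketch; fork/Unix only, and the DataFrame contents and analysis function are placeholders):

    import multiprocessing as mp
    import pandas as pd

    # Create the large, read-only DataFrame at module level, before the Pool,
    # so that forked workers inherit it copy-on-write instead of pickling it.
    df = pd.DataFrame({"x": range(1_000_000)})   # stand-in for the real data

    def analyze(column):
        # Reads the inherited global; only `column` and the result are pickled.
        return df[column].sum()

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            print(pool.map(analyze, ["x", "x", "x", "x"]))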
As for consolidating your results back to some parent or delegate process ... you could do that through any IPC (inter-process communications) mechanism. If you were able to share the data implicitly by initializing it before calling any multiprocessing forking methods, then you might be able to simply instantiate a multiprocessing.Queue and feed results through it. Failing that, personally I'd consider just setting up an instance of Redis either on the same system or on any other system on that LAN segment. Redis is very efficient and extremely easy to configure and maintain, with APIs and Python modules (including automatic/transparent support for hiredis for high-performance deserialization of Redis results).
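A bare-bones sketch of the multiprocessing.Queue route (the per-chunk work here is a placeholder):

    import multiprocessing as mp

    def worker(results, chunk):
        results.put((chunk, sum(chunk)))   # stand-in for the real analysis

    if __name__ == "__main__":
        results = mp.Queue()
        chunks = [[1, 2, 3], [4, 5, 6]]
        procs = [mp.Process(target=worker, args=(results, c)) for c in chunks]
        for p in procs:
            p.start()
        collected = [results.get() for _ in procs]   # drain before joining
        for p in procs:
            p.join()
        print(collected)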
Redis might also make it easier to distribute your application across multiple nodes if your needs take you in that direction. Of course by then you might also be looking at using PySpark which can offer many features which map fairly well from Pandas DataFrames to Apache Spark RDD Sets (or Spark SQL "DataFrames"). Here's an article on that from a couple years ago: Databricks: From Pandas to Apache Spark's DataFrames.
In general the whole point of Apache Spark is to distribute data computations across separate nodes; this is inherently more scalable than distributing them across cores within a single machine. (Then the concerns boil down to I/O to the nodes so each can get its chunks of the data set loaded. That's a problem which is well suited to HDFS.)
I hope that helps.
Every sub-process will have its own resources, so yes, this does imply copying. More precisely, every sub-process will copy a part of the original DataFrame, determined by your implementation.
But will it be faster than a shared one? I'm not sure. Unless your DataFrame implements a read/write lock, reading a shared copy and reading separate copies cost the same. But why would a DataFrame need to lock read operations? It doesn't make sense.
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
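For context, the shared-memory variant looked roughly like this (block size, data source, and per-block work are placeholders); readinto() at least reads straight into the shared block rather than via an intermediate buffer:

    import multiprocessing as mp

    BLOCK_SIZE = 4 * 1024 * 1024                 # hypothetical block size

    def worker(shared, n):
        view = memoryview(shared)[:n]            # zero-copy view of the block
        print(sum(view))                         # stand-in for the real work

    if __name__ == "__main__":
        shared = mp.RawArray("B", BLOCK_SIZE)    # unlocked shared memory
        with open(__file__, "rb") as f:          # stand-in for the data source
            n = f.readinto(memoryview(shared))   # read directly into shared memory
        p = mp.Process(target=worker, args=(shared, n))
        p.start()
        p.join()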
With queues, there would be a lot of redundant data copying going on relative to shared memory. I chose shared memory because it involves one copy versus 'n' copies of the data.
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but to pass the object to the subprocesses would require pickling it - but you can't pickle mmap objects. So much for that.
Shared memory via a Cython module may or may not be possible, but it's almost certainly prohibitive in effort - and raises the question of whether a language better suited to the task should be used instead.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
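For reference, the queue-based, long-running-worker pattern I settled on looks roughly like this (the per-block work is a placeholder):

    import multiprocessing as mp

    def worker(jobs, results):
        # Started once and reused for every block, instead of re-spawning.
        while True:
            block = jobs.get()
            if block is None:              # sentinel: shut down
                jobs.task_done()
                break
            results.put(len(block))        # stand-in for the real processing
            jobs.task_done()

    if __name__ == "__main__":
        jobs, results = mp.JoinableQueue(), mp.Queue()
        workers = [mp.Process(target=worker, args=(jobs, results)) for _ in range(2)]
        for w in workers:
            w.start()
        blocks = [b"abc", b"defgh"]        # stand-ins for megabyte-sized blocks
        for block in blocks:
            jobs.put(block)
        jobs.join()                        # all blocks processed
        print([results.get() for _ in blocks])
        for _ in workers:
            jobs.put(None)
        for w in workers:
            w.join()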
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
I have a huge XML file, and I'm a tad bit at a loss on how to handle it. It's 60 GBs, and I need to read it.
I was wondering if there is a way to use the multiprocessing module to read the file?
Does anyone have any samples of doing this that they could point me to?
Thank you
For a file of that size, I suggest you use a streaming XML parser. In Python, this would be the iterparse method from cElementTree or lxml.etree:
http://effbot.org/zone/element-iterparse.htm
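A minimal sketch of that pattern (the filename and tag are placeholders for whatever your file actually contains):

    import xml.etree.ElementTree as ET   # lxml.etree exposes the same iterparse API

    # Stream the file element by element instead of loading 60 GB into memory.
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "record":         # hypothetical element of interest
            print(elem.attrib)           # stand-in for your real handling
        elem.clear()                     # release the element once handled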
Save memory parsing very large XML files
You could use this code, which is a bit newer than the effbot.org one; it might save you more memory:
Using Python Iterparse For Large XML Files
Multiprocessing / Multithreading
If I remember correctly, you cannot easily use multiprocessing to speed up the process of loading/parsing the XML. If this were an easy option, everyone would probably already do it by default.
Python in general uses a global interpreter lock (GIL), which keeps a Python process effectively running on one core of your CPU. When threads are used, they run in the context of the main Python process, which is still bound to only one core; using threads in Python can even lead to a performance decrease due to context switching. Running multiple Python processes on multiple cores brings the expected additional performance, but those processes do not share memory, so you need inter-process communication (IPC) to have them work together (you can use multiprocessing with a pool, which syncs when the work is done, but that is mostly useful for tasks that are finite and not too small). Sharing memory would be required here, I assume, since every task is working on the same big XML.
LXML however has some way to work around the GIL but it only improves performance under certain conditions.
Threading in LXML
For introducing threading in lxml there is a part in the FAQ that talks about this: http://lxml.de/FAQ.html#id1
Can I use threads to concurrently access the lxml API?
Short answer: yes, if you use lxml 2.2 and later.
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.
Does my program run faster if I use threads?
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
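As a small illustration, the per-thread-parser rule from the FAQ looks something like this in practice (assuming you had several smaller files, or chunks split out of the big one, to parse side by side; the filenames and the per-file work are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from lxml import etree

    FILES = ["part1.xml", "part2.xml"]       # hypothetical pre-split chunks

    def parse(path):
        parser = etree.XMLParser()           # one parser per task, as the FAQ advises
        tree = etree.parse(path, parser)     # lxml releases the GIL while parsing
        return len(tree.getroot())           # stand-in for the real work

    with ThreadPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(parse, FILES)))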
Additional tips on optimizing performance for parsing large XML
https://www.ibm.com/developerworks/library/x-hiperfparse/
What are some good guidelines to follow when deciding to use threads or multiprocessing when speaking in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in either of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and I think enable a lot cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
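A tiny producer/consumer sketch of that pattern (the "work" is just a print; the same shape carries over to multiprocessing's Queue):

    import queue
    import threading

    def consumer(q):
        while True:
            item = q.get()
            if item is None:          # sentinel: shut the worker down
                break
            print("processed", item)  # stand-in for the real work
            q.task_done()

    q = queue.Queue()
    t = threading.Thread(target=consumer, args=(q,))
    t.start()
    for item in range(3):
        q.put(item)
    q.join()                          # wait until every queued item is handled
    q.put(None)
    t.join()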
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back and it is definitely worth watching. More info here, including a follow-up talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is not great, or even worse than single core. But if your threads are doing a bunch of i/o (file read/write, DB access, socket read/write, etc), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a separate Python interpreter (and GIL) per process.
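For a CPU-bound task that would look something like this (the work function is a toy example):

    from multiprocessing import Pool

    def cpu_bound(n):
        # Pure-Python number crunching that threads could not parallelize under the GIL
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool() as pool:                         # one worker process per core by default
            print(pool.map(cpu_bound, [10**6] * 4))  # each call runs in its own interpreter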
It seems that Python standard library lacks various useful concurrency-related concepts such as atomic counter, executor and others that can be found in e.g. java.util.concurrent. Are there any external libraries that would provide easier building blocks for concurrent Python applications?
Kamaelia, as already mentioned, is aimed at making concurrency easier to work with in python.
Its original use case was network systems (which are naturally concurrent), and it was developed with the viewpoint "How can we make these systems easier to develop and maintain?"
Since then life has moved on and it is being used in a much wider variety of problem domains, from desktop systems (like whiteboarding applications, database modelling, tools for teaching children to read and write) through to back-end systems for websites (like stuff for transcoding & converting user-contributed images and video for web playback in a variety of scenarios, and SMS / text messaging applications).
The core concept is essentially the same idea as Unix pipelines - except that instead of processes you can have Python generators, threads, or processes, which are termed components. These communicate over inboxes and outboxes - as many as you like of each, rather than just stdin/stdout/stderr. Also, rather than requiring serialised file interfaces, you pass fully fledged Python objects between components. And rather than being limited to pipelines, you can have arbitrary shapes - called graphlines.
You can find a full tutorial (video, slides, downloadable PDF booklet) here:
http://www.kamaelia.org/PragmaticConcurrency
Or the 5 minute version here (O'Reilly ignite talk):
http://yeoldeclue.com/cgi-bin/blog/blog.cgi?rm=viewpost&nodeid=1235690128
The focus of the library is pragmatic development, system safety and ease of maintenance, though some effort has gone recently into adding some syntactic sugar. Like anything, the developers (me and others :-) welcome feedback on improving it.
You can also find more information here:
- http://www.slideshare.net/kamaelian
Primarily, Kamaelia's core (Axon) was written to make my day job easier, and to wrap up best practice (message passing, software transactional memory) in a reusable fashion. I hope it makes your life easier too :-)
Although it may not be immediately obvious, itertools.count is indeed an atomic counter (the only operation on an instance x thereof, spelled next(x), is equivalent to an "atomic ++x" if C had such a concept;-). Edit: at least, this surely holds in CPython; I thought it was part of the Python standard definition but apparently IronPython and Jython disagree (not ensuring thread-safety of count.next in their current implementations) so I may well be wrong!
That is, suppose you currently have a data structure such as:
counters = dict.fromkeys(words_of_interest, 0)
...
if w in counters: counters[w] += 1
and your problem is that the latter increment is not atomic, so if two threads are at the same time dealing with the same word of interest the two increments might interfere (only one would "take", so the counter would be incremented only by one, not by two). Then:
counters = dict((w, itertools.count()) for w in words_of_interest)
...
if w in counters: next(counters[w])
will perform the same operations, but in an atomic way.
(There is unfortunately no obvious, documented way to "extract the current value of the counter", though in fact str(x) does return a string such as 'count(3)' from which the current value can be parsed out again;-).
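Putting it together, a quick sketch of the thread-safe tally plus the (admittedly hacky) value extraction:

    import itertools
    import threading

    words_of_interest = ["spam", "eggs"]
    counters = dict((w, itertools.count()) for w in words_of_interest)

    def tally(text):
        for w in text.split():
            if w in counters:
                next(counters[w])      # the "atomic ++" (in CPython)

    threads = [threading.Thread(target=tally, args=("spam eggs spam",))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Recover current values by parsing repr(), e.g. "count(8)" -> 8.
    print({w: int(repr(c)[6:-1]) for w, c in counters.items()})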
Concurrency in Python (at least CPython) and Java are wildly different, at least in part because of the Global Interpreter Lock (GIL). In general, concurrency in Python is achieved not with threads, but processes. See multiprocessing for the "standard" concurrency module.
Also, check out "A Curious Course on Coroutines and Concurrency" for some concurrency techniques that were pretty new to me coming from Java. David Beazley (the author) is a Smart Guy™ when it comes to Python in general, and concurrency in particular.
Kamaelia provides tools for abstracting concurrency to threads or processes, etc.
P-workers creates a "Job-Worker" abstraction over the Python multiprocessing library. It simplifies concurrency with multiprocessing by starting "Workers" that have specific skills/attributes (defined functions) and providing a queue from which they receive "Jobs". It's somewhat analogous to a thread pool, only with processes instead of threads, so it's better suited to a high number of CPU instructions. You can also use it to spawn multiple instances of a single application, or even spawn "Workers" that have multiple threads.