I have a very large read-only dataframe and I want to do some calculations on it, so I do a multiprocessing.map and set the dataframe as a global. However, does this imply that for each process, the program will copy the dataframe separately (so it will be faster than a shared one)?
If I understand it correctly, you won't get any benefit by trying to use multiprocessing.map on a Pandas DataFrame, because the DataFrame is built over NumPy ndarray structures and NumPy already releases the GIL, scales to SMP hardware, uses vectorized machine instructions where they're available, and so on.
As you say, you might be incurring massive RAM consumption and data copying or shared memory locking overhead on the DataFrame structures for no benefit. Performance considerations for the combination of NumPy and Python's multiprocessing module are discussed in this SO question: Multiprocessing.Pool makes Numpy matrix multiplication slower.
The fact that you're treating this DataFrame as read-only is interesting because it suggests that you could write code around os.fork() which, due to the OS's CoW (copy-on-write) semantics through the fork() system call, should be an inexpensive way to share the data with child processes, allowing each to then analyze the data in various ways. (Any code which writes to the data would trigger allocation of fresh pages and copying, of course.)
The multiprocessing module is using the fork() system call under the hood (at least on Unix, Linux and similar systems). If you create and fully populate this large data structure (DataFrame) before you invoke any of the multiprocessing functions or instantiate any of its objects which create subprocesses, then you might be able to access the copies of the DataFrame which each process implicitly inherited. I don't have the time to concoct a bit of test code right now; but that might work.
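As a rough, untested sketch of that idea (assuming a Unix-like OS where multiprocessing forks its workers; the DataFrame contents and the summarize() function are placeholders, not code from the question): build the DataFrame at module level, then create the Pool, so each worker inherits the data copy-on-write:

    import multiprocessing as mp
    import pandas as pd

    # Fully populate the DataFrame *before* any Pool/Process is created,
    # so forked workers inherit it via copy-on-write.
    df = pd.DataFrame({"group": [1, 2, 1, 2], "value": [10.0, 20.0, 30.0, 40.0]})

    def summarize(group_id):
        # Read-only access: the inherited pages are never written, so never copied.
        return group_id, df.loc[df["group"] == group_id, "value"].sum()

    if __name__ == "__main__":
        with mp.Pool(processes=2) as pool:
            print(dict(pool.map(summarize, [1, 2])))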
As for consolidating your results back to some parent or delegate process ... you could do that through any IPC (inter-process communications) mechanism. If you were able to share the data implicitly by initializing it before calling any multiprocessing forking methods, then you might be able to simply instantiate a multiprocessing.Queue and feed results through it. Failing that, personally I'd consider just setting up an instance of Redis, either on the same system or on any other system on that LAN segment. Redis is very efficient and extremely easy to configure and maintain, with APIs and Python modules (including automatic/transparent support for hiredis for high-performance deserialization of Redis results).
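Before reaching for Redis, a minimal sketch of the Queue route might look like this (worker_task() and its fake result are placeholders; the point is only that each child pushes its result onto a shared multiprocessing.Queue which the parent drains):

    import multiprocessing as mp

    def worker_task(task_id, results):
        # Stand-in for the real per-process analysis of the shared data.
        results.put((task_id, task_id * task_id))

    if __name__ == "__main__":
        results = mp.Queue()
        workers = [mp.Process(target=worker_task, args=(i, results)) for i in range(4)]
        for w in workers:
            w.start()
        collected = [results.get() for _ in workers]   # drain before joining
        for w in workers:
            w.join()
        print(dict(collected))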
Redis might also make it easier to distribute your application across multiple nodes if your needs take you in that direction. Of course by then you might also be looking at using PySpark which can offer many features which map fairly well from Pandas DataFrames to Apache Spark RDD Sets (or Spark SQL "DataFrames"). Here's an article on that from a couple years ago: Databricks: From Pandas to Apache Spark's DataFrames.
In general the whole point of Apache Spark is to distribute data computations across separate nodes; this is inherently more scalable than distributing them across cores within a single machine. (Then the concerns boil down to I/O to the nodes so each can get its chunks of the data set loaded. That's a problem which is well suited to HDFS.)
I hope that helps.
Every sub-process will have its own resources, so yes, this is what happens. More precisely, every sub-process will copy the part of the original dataframe it uses, determined by your implementation.
But will it be faster than a shared one? I'm not sure. Unless your dataframe implements a read/write lock, reading a shared one and reading separate copies are the same. But why would a dataframe need to lock read operations? It doesn't make sense.
Related
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
With queues, there would be a lot of redundant data copying relative to shared memory; I chose shared memory because it involves one copy versus 'n' copies of the data.
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but passing the object to the subprocesses would require pickling it, and mmap objects can't be pickled. So much for that.
Shared memory via a Cython module may or may not be possible, but it's almost certainly prohibitive - and it raises the question of whether a more appropriate language should be used for the task.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance (a sketch of that one-copy pattern appears below).
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
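For reference, here is a minimal, untested sketch of the one-copy Array/RawArray pattern mentioned above: the block is copied into a RawArray once and handed to a Pool through its initializer, so the workers read it without further per-task copies (process_chunk() and the block contents are placeholders, not my actual code):

    import ctypes
    import multiprocessing as mp
    from multiprocessing.sharedctypes import RawArray

    shared_block = None  # set in each worker by the Pool initializer

    def init_worker(block):
        global shared_block
        shared_block = block

    def process_chunk(offset):
        view = memoryview(shared_block)              # zero-copy view of the shared bytes
        return offset, sum(view[offset:offset + 4])  # stand-in for real work

    if __name__ == "__main__":
        data = bytes(range(256)) * 4096              # stand-in for one block read from the file
        block = RawArray(ctypes.c_ubyte, len(data))
        ctypes.memmove(block, data, len(data))       # the single copy into shared memory
        with mp.Pool(2, initializer=init_worker, initargs=(block,)) as pool:
            print(pool.map(process_chunk, [0, 1024]))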
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
I have a Python process serving as a WSGI-Apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only Python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of them. Ideally I would be able to attach to those data so that they could be actual Python 2.7 data objects rather than being parsed out of something like pickle, DBM, or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?
This post by @modelnine on StackOverflow provides a really great, comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lessen the impact of this. I ran into a similar problem trying to share extremely large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using the sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but that's probably fixable). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure Python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!
It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/
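A minimal sketch of that file-backed mmap idea, assuming one process writes the serialized data to a file and every other process maps the same file read-only so the OS shares the pages between them (the path and payload below are placeholders):

    import mmap

    PATH = "/tmp/shared_blob.bin"  # placeholder location

    def writer():
        with open(PATH, "wb") as f:
            f.write(b"x" * (1024 * 1024))  # stand-in for the real serialized data

    def reader():
        with open(PATH, "rb") as f:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            return mapped[:16]  # reads go straight to the shared, mapped pages

    if __name__ == "__main__":
        writer()
        print(reader())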
Since it's read-only data, you won't need to share any updates between processes (there won't be any updates), so I propose you just keep a local copy of it in each process.
If memory constraints are an issue, you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that, you'll have to rely on an external process and some serialising to get this done; I'd have a look at Redis or Memcached if I were you.
One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many, many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data actually themselves point into the shared memory, i.e. the Python objects themselves would have to be in shared memory. As far as I can tell from the official docs, this isn't really supported. However, you could always try "handcrafting" Python objects in shared memory and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case it won't, since the data is long-lived and read-only.
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve performance in-process, I can generate sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hitting the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort of long-running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems I'd need to persistently store the database in memory. SQLite allows DBs to use memory as storage, but those DBs are destroyed upon program exit and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, and share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory-mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
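A minimal sketch of that pickle-file approach (build_cache() and the cache path are placeholders for your real structures): build the cache on a miss, dump it, and just reload it on later executions:

    import os
    import pickle

    CACHE_PATH = "cache.pkl"  # placeholder location

    def build_cache():
        # Stand-in for the expensive sorted/structured lists, maps and trees.
        return {"indexed": sorted(range(1000))}

    def load_cache():
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
        cache = build_cache()
        with open(CACHE_PATH, "wb") as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
        return cache

    if __name__ == "__main__":
        print(len(load_cache()["indexed"]))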
We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
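A minimal sketch of swapping marshal in for pickle, restricted to builtin types (the data shape is illustrative only, and remember the on-disk format isn't guaranteed stable across Python versions):

    import marshal

    data = {"ids": list(range(1000)), "names": ["row-%d" % i for i in range(1000)]}

    with open("data.marshal", "wb") as f:
        marshal.dump(data, f)          # builtin types only: dicts, lists, strings, numbers...

    with open("data.marshal", "rb") as f:
        restored = marshal.load(f)

    print(len(restored["ids"]))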
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
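For the raw-numeric case, a minimal sketch of the memory-mapping idea with NumPy (the file name and array contents are placeholders): save the array once, then map it on startup instead of allocating and copying it:

    import numpy as np

    # Done once, offline:
    np.save("big_array.npy", np.arange(1000000, dtype=np.float64))

    # On each startup, map the file instead of rebuilding objects in memory:
    arr = np.load("big_array.npy", mmap_mode="r")
    print(arr[:5].sum())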
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...
Are you load()ing the pickled data directly from the file? What about loading the file into memory first and then doing the load?
I would start by trying cStringIO(); alternatively, you may try writing your own version of StringIO that uses buffer() to slice the memory, which would reduce the copy() operations needed (cStringIO may still be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; Windows is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does lots of small reads, or you are calling load() several times to read the data, this would help.
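A minimal Python 2 sketch of that suggestion (matching the question's Python 2.6 setup; "data.pkl" is a placeholder): read the whole file in one go, then unpickle from an in-memory buffer so load() never issues small reads against the file:

    import cPickle
    import cStringIO

    with open("data.pkl", "rb") as f:
        raw = f.read()                      # one big sequential read from disk

    objects = cPickle.load(cStringIO.StringIO(raw))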
I haven't used cPickle (or Python) but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say, load after startup on a different thread. Actually it's usually better to avoid unnecessary loading/initialization at any time, for obvious reasons; Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server startup, then maybe you can try to implement a manual, custom deserialization method - in other words, implement something yourself if you have intimate knowledge of the data you will deal with, which can help you 'squeeze' better performance out of it than the general tool for dealing with it.
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful - if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting, and possibly impossible, but it seems to have cut the time needed for initialization in my test to about half.
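For illustration, a minimal sketch of the __slots__ idea, where Record stands in for whichever class gets created millions of times during unpickling:

    class Record(object):
        __slots__ = ("key", "value")  # no per-instance __dict__ is allocated

        def __init__(self, key, value):
            self.key = key
            self.value = value

    records = [Record(i, i * 2) for i in range(5)]
    print(records[0].key, records[-1].value)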
I have CPU-intensive code which uses a heavy dictionary as data (around 250M of data). I have a multicore processor and want to utilize it so that I can run more than one task at a time. The dictionary is mostly read-only and may be updated once a day.
How can I write this in Python without duplicating the dictionary?
I understand that Python threads are constrained by the GIL and will not offer true concurrency for CPU-bound work. Can I use the multiprocessing module without the data being serialized between processes?
I come from the Java world, and my requirement would be something like Java threads, which can share data, run on multiple processors, and offer synchronization primitives.
You can share read-only data among processes simply with a fork (on Unix; no easy way on Windows), but that won't catch the "once a day change" (you'd need to put an explicit way in place for each process to update its own copy). Native Python structures like dict are just not designed to live at arbitrary addresses in shared memory (you'd have to code a dict variant supporting that in C) so they offer no solace.
You could use Jython (or IronPython) to get a Python implementation with exactly the same multi-threading abilities as Java (or, respectively, C#), including multiple-processor usage by multiple simultaneous threads.
Use shelve for the dictionary. Since writes are infrequent there shouldn't be an issue with sharing it.
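A minimal sketch of the shelve approach (the shelf name, key and value are placeholders): the once-a-day job writes the shelf, and each worker process opens it read-only:

    import shelve

    # Once-a-day writer:
    db = shelve.open("shared_dict")
    db["feature:42"] = {"weight": 0.7}
    db.close()

    # Each worker process opens the same shelf read-only:
    db = shelve.open("shared_dict", flag="r")
    print(db["feature:42"])
    db.close()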
Take a look at this in the stdlib:
http://docs.python.org/library/multiprocessing.html
There are a bunch of wonderful features that will allow you to share data structures between processes very easily.