Parallelizing without pickle

Parallelizing without pickle - python

Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, human-readableness. I am generally wary of using pickle on data in my python programs. As a general rule, I much prefer to pass my data around with json or other serialization formats, specified by myself, manually.
The situation I'm interested in is: I've gathered some data in my python program and I want to run an embarrassingly parallel task on it a bunch of times in parallel.
As far as I know, the nicest parallelization library for doing this in python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a json array to a subprocess of GNU parallel. But of course, this feels like a hack, and loses all the fancy goodness of Dask.
Is it possible to specify a different default serialization format for my data, but continue to parallelize in python, preferably dask, without resorting to pickle or gnu parallel?

The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and scheduler. As can be seen, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers, so they can execute them, but data is generally sent via fairly efficient msgpack serialisation. There would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.
-edit-
You could of course include fuction names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.

Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:
def foo():
os.system('rm -rf /')
return {'lol': foo}
But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.

Related

Python multiprocessing.Pool(): am I limited in what I can return?

I am using Python's multi-processing pool. I have been told, although not experienced this myself so I cannot post the code, that one cannot just "return" anything from within the multiprocessing.Pool()-worker back to the multiprocessing.Pool()'s main process. Words like "pickling" and "lock" were being thrown around but I am not sure.
Is this correct, and if so, what are these limitations?
In my case, I have a function which generates a mutable class object and then returns it after it has done some work with it. I'd like to have 8 processes run this function, generate their own classes, and return each of them after they're done. Full code is NOT written yet, so I cannot post it.
Any issues I may run into?
My code is: res = pool.map(foo, list_of_parameters)

Q : "Is this correct, and if so, what are these limitations?"
It depends. It is correct, but the SER/DES processing is the problem here, as a pair of disjoint processes tries to "send" something ( there: a task specification with parameters and back: ... Yessss, the so long waited for result* )
Initial versions of the Python standard library of modules piece, responsible for doing this, the pickle-module, was not able to SER-ialise some more complex types of objects, Class-instances being one such example.
There are newer and newer versions evolving, sure, yet this SER/DES step is one of the SPoFs that may avoid a smooth code-execution for some such cases.
Next are the cases, that finish by throwing a Memory Error as they request as much memory allocations, that the O/S simply rejects any new request for such an allocation, and the whole process attempt to produce and send pickle.dumps( ... ) un-resolvably crashes.
Do we have any remedies available?
Well, maybe yes, maybe no - Mike McKearn's dill may help in some cases to better handle complex objects in SER/DES-processing.
May try to use import dill as pickle; pickle.dumps(...) and test your hot-candidates for Class()-instances to get SER/DES-ed, if they get a chance to pass through. If not, no way using this low-hanging fruit first trick.
Next, a less easy way would be to avoid your dependence on hardwired multiprocessing.Pool()-instantiations and their (above)-limited SER/comms/DES-methods, and design your processing strategy as a distributed-computing system, based on a communicating agents paradigm.
That way you benefit from a right-sized, just-enough designed communication interchange between intelligent-enough agents, that know (as you've designed them to know it) what to tell one to the others, even without sending any mastodon-sized BLOB(s), that accidentally crash the processing in any of the SPoF(s) you cannot both prevent and salvage ex-post.
There seem no better ways forward I know about or can foresee in 2020-Q4 for doing this safe and smart.

How to store easily python usable read-only data structures in shared memory

I have a python process serving as a WSGI-apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of those data. Best would be to be able to attach to those data so they could be actual python 2.7 data objects rather than parsing them out of something like pickle or DBM or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?

This post by #modelnine on StackOverflow provides a really great comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lesson the impact of this. I ran into a similar problem trying to share extremely-large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using a combination of a sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but, it's fixable probably). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure-python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!

It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/

Since it's read-only data you won't need to share any updates between processes (since there won't be any updates) I propose you just keep a local copy of it in each process.
If memory constraints is an issue you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that you'll have to rely on an external process and some serialising to get this done, I'd have a look at Redis or Memcached if I were you.

One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data, actually themselves point to the shared memory. I.e, the python objects themselves would have to be in shared memory. As far as I can read the official docs, this isn't really supported. However, you could always try "handcrafting" python objects in shared memory, and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case, it won't, since it's long-lived and read-only.

Persistent in-memory Python object for nginx/uwsgi server

I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000ps is a reasonable assumption, could be more), this means 96KBps of transfer from Redis or Riak in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
Have gone through some of the documentation for nginx, uwsgi etc but haven't fully understood the ramifications per my use-case yet. Hope to make some progress on that going forward now
If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above, in it. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.

What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to struct in there or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something customize. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
If your structure is rigid enough, you can use key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.

"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
you are mistaken.
the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.
that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and view it every so often after actions that update the store.

You said nothing about writing this data back, is it static? In this case, the solution is every simple, and I have no clue what is up with all the "it's not feasible" responses.
Uwsgi workers are always-running applications. So data absolutely gets persisted between requests. All you need to do is store stuff in a global variable, that is it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master, and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (same thing will happen if python decides to compact its memory during a gc run, so it's not super predictable).

I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?

How to deserialize 1GB of objects into Python faster than cPickle?

We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?

Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...

Are you load()ing the pickled data directly from the file? What about to try to load the file into the memory and then do the load?
I would start with trying the cStringIO(); alternatively you may try to write your own version of StringIO that would use buffer() to slice the memory which would reduce the needed copy() operations (cStringIO still may be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations especially on Windows platform; the Windows system is somehow very unoptimized for doing lots of small reads while UNIXes cope quite well; if load() does lot of small reads or you are calling load() several times to read the data, this would help.

I haven't used cPickle (or Python) but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say load after start up on a different thread, actually its usually better to avoid unnecessary loading/initialization at anytime for obvious reasons. Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server start up then maybe you can try to implement a manual custom deserialization method, in other words implement something yourself if you have intimate knowledge of the data you will deal with which can help you 'squeeze' better performance then the general tool for dealing with it.

Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.

Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?

I'll add another answer that might be helpful - if you can, can you try to define _slots_ on the class that is most commonly created? This may be a little limiting and impossible, however it seems to have cut the time needed for initialization on my test to about a half.

Keeping in-memory data in sync with a file for long running Python script

I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?

Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.

I agree that you don't need a fully blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts, serialisation/deserialisation, and the atomic writing.
For the first section, json, or pickle are probably suitable formats for you. JSON has the advantage of being human readable. It doesn't seem as though this the primary problem you are facing though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
import os, platform
backup_filename = "output.back.json"
filename = "output.json"
serialised_str = json.dumps(...)
with open(backup_filename, 'wb') as f:
f.write(serialised_str)
if platform.system() == 'Windows':
os.unlink(filename)
os.rename(backup_filename, filename)
While os.rename is will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, meaning that you have only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when you are checking for the existence of filename.
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.

Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON, or XML as you hinted at - use the built in json module to serialize the objects then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
"4": 5,
"6": 7
}

Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a python object into a line very easily.
Then you just set up a Thread that saves your object to a file in defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case that a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is using "Log-based recovery". It is essentially writing a record to a log file saying that "write transaction started" and then writing "write transaction comitted" when you're done. If a "started" has no corresponding "commit", then you rollback.
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered on the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to record to a new file every time, so that if one corrupts you have a history. You can even add a checksum to each file to automatically determine if it's broken.

You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.