How to deserialize 1GB of objects into Python faster than cPickle?

How to deserialize 1GB of objects into Python faster than cPickle? - python

We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?

Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...

Are you load()ing the pickled data directly from the file? What about to try to load the file into the memory and then do the load?
I would start with trying the cStringIO(); alternatively you may try to write your own version of StringIO that would use buffer() to slice the memory which would reduce the needed copy() operations (cStringIO still may be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations especially on Windows platform; the Windows system is somehow very unoptimized for doing lots of small reads while UNIXes cope quite well; if load() does lot of small reads or you are calling load() several times to read the data, this would help.

I haven't used cPickle (or Python) but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say load after start up on a different thread, actually its usually better to avoid unnecessary loading/initialization at anytime for obvious reasons. Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server start up then maybe you can try to implement a manual custom deserialization method, in other words implement something yourself if you have intimate knowledge of the data you will deal with which can help you 'squeeze' better performance then the general tool for dealing with it.

Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.

Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?

I'll add another answer that might be helpful - if you can, can you try to define _slots_ on the class that is most commonly created? This may be a little limiting and impossible, however it seems to have cut the time needed for initialization on my test to about a half.

Related

Which one is faster in production? file on disk vs file in memory(StringIO,BytesIO)

I am converting a dictionary to pandas object with to_csv. I have both way of doing this
1 - by writing file in disk(with open statement)
2 - by writing in memory (StringIO,BytesIO)
I have used it in both way creating file in disk and using StringIO to convert to pandas object. I tried to read comparisons between these three, but bit confused which one is faster so i can use it in production to process tons of data.

Writing and reading from memory is fast. But keep in mind that you have tons of data. So storing all that in-memory might take up all your memory and might make the system slow or might throw errors due to Out of Memory. So, analyze and understand which all data to be put in memory and which all to be written to files.

In general - writing to RAM (memory) will be faster.
However, you might want to use Iterators (saving memory using iterators) if you have too much data, because your machine might will run out-of-memory, or just will write a lot to your SWAP file (in short - that's an "extension" of your RAM in your hard drive, you can read about it here), which will hurt your performance, a lot.
For benchmarking, if your code is pretty simple - I would recommenced using timeit, but there are even better resources for that, such as this one, from scipy

How to efficiently fan out large chunks of data into multiple concurrent sub-processes in Python?

[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
With queues, there will be a lot of redundant data copying going on there relative to shared memory. I chose shared memory because it involved one copy versus 'n' copies of the data).
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!

In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but to pass the object to the subprocesses would require pickling it - but you can't pickle mmap objects. So much for that.
Shared memory via a cython module might be impossible or it might not but it's almost certainly prohibitive - and begs the question of using a more appropriate language to the task.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!

How to store easily python usable read-only data structures in shared memory

I have a python process serving as a WSGI-apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of those data. Best would be to be able to attach to those data so they could be actual python 2.7 data objects rather than parsing them out of something like pickle or DBM or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?

This post by #modelnine on StackOverflow provides a really great comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lesson the impact of this. I ran into a similar problem trying to share extremely-large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using a combination of a sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but, it's fixable probably). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure-python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!

It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/

Since it's read-only data you won't need to share any updates between processes (since there won't be any updates) I propose you just keep a local copy of it in each process.
If memory constraints is an issue you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that you'll have to rely on an external process and some serialising to get this done, I'd have a look at Redis or Memcached if I were you.

One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data, actually themselves point to the shared memory. I.e, the python objects themselves would have to be in shared memory. As far as I can read the official docs, this isn't really supported. However, you could always try "handcrafting" python objects in shared memory, and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case, it won't, since it's long-lived and read-only.

Creating an in-memory cache that persists between executions

I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns) To improve performance in-process I can generated sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!

I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!

Fastest way to save and load a large dictionary in Python

I have a relatively large dictionary. How do I know the size? well when I save it using cPickle the size of the file will grow approx. 400Mb. cPickle is supposed to be much faster than pickle but loading and saving this file just takes a lot of time. I have a Dual Core laptop 2.6 Ghz with 4GB RAM on a Linux machine. Does anyone have any suggestions for a faster saving and loading of dictionaries in python? thanks

Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.

I know it's an old question but just as an update for those who still looking for an answer to this question:
The protocol argument has been updated in python 3 and now there are even faster and more efficient options (i.e. protocol=3 and protocol=4) which might not work under python 2.
You can read about it more in the reference.
In order to always use the best protocol supported by the python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
import pickle
# ...
with open('data.pickle', 'wb') as f:
# Pickle the 'data' dictionary using the highest protocol available.
pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

Sqlite
It might be worthwhile to store the data in a Sqlite database. Although there will be some development overhead when refactoring your program to work with Sqlite, it also becomes much easier and performant to query the database.
You also get transactions, atomicity, serialization, compression, etc. for free.
Depending on what version of Python you're using, you might already have sqlite built-in.

I have tried this for many projects and concluded that shelve is faster than pickle in saving data. Both perform the same at loading data.
Shelve is in fact a dirty solution.
That is because you have to be very careful with it. If you do not close a shelve file after opening it, or due to any reason some interruption happens in your code when you're in the middle of opening and closing it, the shelve file has high chance of getting corrupted (resulting in frustrating KeyErrors); which is really annoying given that we who are using them are interested in them because of storing our LARGE dict files which clearly also took a long time to be constructed
And that is why shelve is a dirty solution... It's still faster though. So!

You may test to compress your dictionnary (with some restrictions see : this post) it will be efficient if the disk access is the bottleneck.

That is a lot of data...
What kind of contents has your dictionary? If it is only primitive or fixed datatypes, maybe a real database or a custom file-format is the better option?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.