I have a relatively large dictionary. How do I know its size? When I save it using cPickle, the resulting file is roughly 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file still takes a long time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have suggestions for faster saving and loading of dictionaries in Python? Thanks.
Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
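For illustration, here is a minimal sketch of saving and loading with protocol=2; the filename and dictionary are placeholders, not taken from the question.

import cPickle

data = {'example_key': 'example_value'}  # placeholder dictionary

# Save using the binary protocol 2 rather than the slow, text-based protocol 0.
with open('data.pkl', 'wb') as f:
    cPickle.dump(data, f, protocol=2)

# Load it back.
with open('data.pkl', 'rb') as f:
    data = cPickle.load(f)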
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It behaves like a dict but keeps its contents on disk rather than in memory. shelve is based on cPickle, so be sure to pass a protocol other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.
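As a rough sketch of the shelve approach (the filename and keys are made up for illustration):

import shelve

# Open (or create) a disk-backed dict; protocol=2 avoids the slow default.
db = shelve.open('mydata.db', protocol=2)
try:
    db['some_key'] = {'nested': 'values', 'work': 'fine'}
    print(db['some_key'])
finally:
    db.close()  # always close so the underlying dbm file is flushed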
I know it's an old question, but as an update for those still looking for an answer:
The protocol argument was updated in Python 3, and there are now even faster and more efficient options (protocol=3 and protocol=4), which are not available under Python 2.
You can read more about it in the reference.
To always use the best protocol supported by the Python version you're running, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
import pickle
# ...
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
SQLite
It might be worthwhile to store the data in an SQLite database. Although there is some development overhead in refactoring your program to work with SQLite, it also becomes much easier and more performant to query the data.
You also get transactions, atomicity, serialization, compression, etc. for free.
Depending on what version of Python you're using, you might already have sqlite built-in.
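For example, a minimal sketch of using the built-in sqlite3 module as a simple key-value store; the table layout and JSON encoding are just one possible choice, not something from the question:

import json
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# Store each dictionary entry as a row; values are JSON-encoded here.
with conn:
    conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                 ('some_key', json.dumps({'a': 1, 'b': 2})))

# Load a single entry without reading the whole dataset into memory.
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('some_key',)).fetchone()
print(json.loads(row[0]))
conn.close()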
I have tried this in many projects and concluded that shelve is faster than pickle at saving data. Both perform about the same when loading data.
Shelve is in fact a dirty solution.
That is because you have to be very careful with it. If you do not close a shelve file after opening it, or if something interrupts your code between opening and closing it, there is a high chance the file will end up corrupted (resulting in frustrating KeyErrors). That is really annoying, given that the whole reason we use shelve is to store large dict files that clearly also took a long time to construct.
And that is why shelve is a dirty solution... It's still faster, though. So!
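One way to reduce the corruption risk is to guarantee the shelf is always closed, for example with contextlib.closing; the filename here is a placeholder:

import shelve
from contextlib import closing

# closing() calls close() even if an exception interrupts the work in between.
with closing(shelve.open('bigdata.db', protocol=2)) as db:
    db['results'] = {'lots': 'of', 'data': 'here'}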
You may try compressing your dictionary (with some restrictions, see this post); it will be effective if disk access is the bottleneck.
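A rough sketch of combining cPickle with gzip, assuming disk I/O is the bottleneck; the compression level and filename are arbitrary:

import cPickle
import gzip

data = {'example': 'dictionary'}  # placeholder

# Write the pickle through a gzip stream; level 1 trades ratio for speed.
with gzip.open('data.pkl.gz', 'wb', compresslevel=1) as f:
    cPickle.dump(data, f, protocol=2)

with gzip.open('data.pkl.gz', 'rb') as f:
    data = cPickle.load(f)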
That is a lot of data...
What kind of contents does your dictionary hold? If it is only primitive or fixed datatypes, maybe a real database or a custom file format is the better option?
Related
I am converting a dictionary to a pandas object and writing it out with to_csv. I have two ways of doing this:
1 - by writing the file to disk (with an open statement)
2 - by writing to memory (StringIO, BytesIO)
I have used both ways, creating a file on disk and using StringIO, to convert to a pandas object. I tried to read comparisons between these approaches, but I'm a bit confused about which one is faster, so that I can use it in production to process tons of data.
Writing to and reading from memory is fast. But keep in mind that you have tons of data, so storing all of it in memory might use up all your RAM and make the system slow, or raise errors due to running out of memory. So analyze and decide which data should be kept in memory and which should be written to files.
In general - writing to RAM (memory) will be faster.
However, you might want to use iterators (saving memory using iterators) if you have too much data, because your machine might run out of memory, or will just write a lot to your swap file (in short, an "extension" of your RAM on your hard drive; you can read about it here), which will hurt your performance a lot.
For benchmarking, if your code is pretty simple I would recommend using timeit, but there are even better resources for that, such as this one from scipy.
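A rough Python 3 sketch of what such a timeit comparison might look like (the DataFrame and repetition count are made up):

import timeit

setup = """
import pandas as pd
from io import StringIO
df = pd.DataFrame({'a': range(100000), 'b': range(100000)})
"""

# Time writing the same frame to disk versus to an in-memory buffer.
disk = timeit.timeit("df.to_csv('out.csv')", setup=setup, number=10)
memory = timeit.timeit("df.to_csv(StringIO())", setup=setup, number=10)
print('disk: %.3fs  memory: %.3fs' % (disk, memory))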
I'm working on a project where I crawl and re-organize a huge amount of data into a result text file. Previously I used a dictionary to store temporary data, but as the data volume increased the process slowed down because of memory usage, and the dictionary became useless.
Since process speed is not so important in my case, I'm trying to replace the dictionary with a file, but I'm not sure how I can easily move the file pointer to the appropriate position and read the required data. With a dictionary I can easily refer to any piece of data; I would like to achieve the same with a file.
I'm thinking of using mmap and writing my own functions to move the file pointer where I want. Does Python have a built-in or third-party module for such operations?
Any other theoretical approach is welcome.
I think you are now trying to reinvent a key-value database.
Maybe the easiest thing would be to check if sqlite3 module would offer you what you need. Using a readymade database is easier than rolling your own!
Of course, sqlite3 is not a key-value DB (on the surface), so if you need something even simpler, have a look at LMDB and its Python bindings: http://lmdb.readthedocs.org/en/release/
It is as lightweight and fast as it gets. It is probably close to the fastest way to achieve what you want.
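As a hedged sketch of the LMDB route (using the third-party lmdb package; the path and map_size are placeholders):

import lmdb  # third-party: pip install lmdb

# map_size is the maximum size the database may grow to; size it for your data.
env = lmdb.open('mydata.lmdb', map_size=10 * 1024 ** 3)

with env.begin(write=True) as txn:
    txn.put(b'some_key', b'some_value')

with env.begin() as txn:
    print(txn.get(b'some_key'))

env.close()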
It should be noted that there is no such thing as an optimal key-value database. There are several aspects to consider. At least:
Do you read a lot or write a lot?
What are the key and value sizes?
Do you need transactions / crash safety?
Do you have duplicate keys (one key, several values)?
Do you want to have sorted keys?
Do you want to read the data out in the same order it is inserted?
What is your database size (MB, GB, TB, PB)?
Are you constrained on IO or CPU?
For example, LMDB I suggested above is very good in read-intensive tasks, not so much in write-intensive tasks. It offers transactions, keeps keys in sorted order and is crash-proof (limited by the underlying file system). However, if you need to write your database often, LMDB may not be the best choice.
On the other hand, SQLite is not the perfect choice for this task, theoretically speaking. In practice, it is built into the standard Python distribution and is thus easy to use. It may provide adequate performance, and may thus be the best choice.
There are numerous high-quality databases out there. By not mentioning them I do not try to give the impression that the DBs mentioned in this answer are the only good alternatives. Most database managers have a very good reason for their existence. While there are some that are a bit outdated, most have their own sweet spots in the application area.
The field is constantly changing. There are both completely new databases available and old database systems are updated. This should be kept in mind when reading old benchmarks. Also, the type of HW used has its impact; a computer with a SSD disk, a cloud computing instance, and a traditional computer with a HDD behave quite differently performance-wise.
On Python 2.7, on Windows and Linux, what is the most efficient and quick way to sequentially write 5 GB of data to a local disk (fixed or removable)? This data will not be read soon and does not need to be cached.
It seems the normal ways of writing use the OS disk cache (because the system assumes it may re-read this data soon). This evicts useful data from the cache, making the system slower.
Right now I am using f.write() with 65535 bytes of data at a time.
The real reason your OS uses the disk cache isn't that it assumes the data will be re-read -- it's that it wants to speed up the writes. You want to use the OS's write cache as aggressively as you possibly can.
That being said, the "standard" way to do high-performance, high-volume I/O in any language (and probably the most aggressive way to use the OS's read/write caches) is to use memory-mapped I/O. The mmap module (https://docs.python.org/2/library/mmap.html) will provide that, and depending on how you generate your data in the first place, you might even be able to gain more performance by dumping it to the buffer earlier.
Note that with a dataset as big as yours, it'll only work on a 64-bit machine (Python's mmap on 32-bit is limited to 4GiB buffers).
If you want more specific advice, you'll have to give us more info on how you generate your data.
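As a rough sketch of the memory-mapped approach, assuming a 64-bit Python; the file name, total size, and data chunk are placeholders:

import mmap

size = 5 * 1024 ** 3           # total file size to write (placeholder)
chunk = b'\x00' * (1 << 20)    # 1 MiB of example data

with open('output.bin', 'wb') as f:
    f.truncate(size)           # pre-allocate the file on disk

with open('output.bin', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), size)
    offset = 0
    while offset < size:
        # Copy each chunk straight into the mapped region.
        mm[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    mm.flush()
    mm.close()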
This answer is relevant for Windows code; I have no idea about the Linux equivalent, though I imagine the advice is similar.
If you want to write the fastest code possible, then write using the Win32 API and make sure you read the relevant section on CreateFile. Specifically, make sure you do not make the classic mistake of using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags to open a file; for more explanation, see Raymond Chen's classic blog post.
If you insist on writing at some multiple of the sector or cluster size, then don't be beholden to the magic number 65535 (why that number? It's not a real multiple of anything relevant). Instead, use GetDiskFreeSpace to figure out the appropriate sector size, though even this is no real guarantee (some data may be kept with the NTFS file information).
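If it helps, a sketch of querying that geometry from Python via ctypes and GetDiskFreeSpaceW (Windows only; the drive letter is a placeholder):

import ctypes

sectors_per_cluster = ctypes.c_ulong()
bytes_per_sector = ctypes.c_ulong()
free_clusters = ctypes.c_ulong()
total_clusters = ctypes.c_ulong()

# GetDiskFreeSpaceW fills in the sector/cluster geometry for the given root path.
ctypes.windll.kernel32.GetDiskFreeSpaceW(
    ctypes.c_wchar_p(u'C:\\'),
    ctypes.byref(sectors_per_cluster),
    ctypes.byref(bytes_per_sector),
    ctypes.byref(free_clusters),
    ctypes.byref(total_clusters))

print('bytes per sector:', bytes_per_sector.value)
print('cluster size:', bytes_per_sector.value * sectors_per_cluster.value)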
My program runs a simulation which requires huge objects to store its data. The size of the blob is larger than 2-3 GB. Even though I should have enough memory in my MBP, Python (Python 2.7.3 on Mac OS X, from ports) cannot seem to use it all, and the system freezes completely.
To save the state of the simulation I use pickle, but it also doesn't work for very large objects; it seems as if pickle duplicates the objects in memory before dumping them...
QUESTION: is there a standard library which can handle huge Python data structures (dict, set, list) without keeping them in memory all the time? Alternatively, is there a way to force Python to run in virtual memory? (I'm not very familiar with numpy; would it help me in this situation?)
Thanks in advance!
If you are using the 64-bit version of Python and still run into problems with pickle or other built-in modules, you can store the Python objects in an object-oriented database instead. We work with large objects (~10 GB) here every day and use ZODB for that. It's not the fastest, but it gets the job done.
I also hear that dobbin might be a good alternative.
We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
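A minimal sketch of marshal for comparison; the filename is a placeholder and the on-disk format is tied to the CPython version:

import marshal

data = {'key': [1, 2, 3], 'other': 'value'}  # built-in types only

with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)

with open('data.marshal', 'rb') as f:
    data = marshal.load(f)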
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
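A very rough, untested sketch of the fork-after-load idea (Unix only; the pickle filename and the child's work are placeholders):

import os
import cPickle

# Load the expensive data once in a long-lived parent process.
with open('data.pkl', 'rb') as f:
    data = cPickle.load(f)

pid = os.fork()
if pid == 0:
    # Child: sees `data` immediately via copy-on-write pages, no unpickling.
    print(len(data))
    os._exit(0)
else:
    # Parent: wait for the child; fork again whenever a new worker is needed.
    os.waitpid(pid, 0)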
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...
Are you load()ing the pickled data directly from the file? What about trying to read the file into memory first and then doing the load?
I would start by trying cStringIO(); alternatively, you may try to write your own version of StringIO that uses buffer() to slice the memory, which would reduce the needed copy() operations (cStringIO may still be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; the Windows system is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does lots of small reads, or you are calling load() several times to read the data, this will help.
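As a quick sketch of reading the whole file into memory and unpickling from there (the filename is a placeholder):

import cPickle
from cStringIO import StringIO

# One big sequential read from disk instead of many small reads by the unpickler.
with open('data.pkl', 'rb') as f:
    buf = StringIO(f.read())

data = cPickle.load(buf)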
I haven't used cPickle (or Python), but in cases like this I think the best strategy is to:
Avoid unnecessary loading of the objects until they are really needed - say, load them after startup on a different thread. In general it's better to avoid unnecessary loading/initialization at any time, for obvious reasons; Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server startup, then maybe you can try to implement a manual, custom deserialization method - in other words, implement something yourself if you have intimate knowledge of the data you will deal with, which can help you 'squeeze' better performance than a general-purpose tool.
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful - if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting, and may be impossible in your case, but it seems to have cut the initialization time in my test to about half.
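A minimal sketch of what that looks like; the class and its attributes are made up for illustration:

class Record(object):
    # __slots__ replaces the per-instance __dict__ with fixed attribute slots,
    # which saves memory and speeds up creating millions of instances.
    __slots__ = ('key', 'value', 'timestamp')

    def __init__(self, key, value, timestamp):
        self.key = key
        self.value = value
        self.timestamp = timestamp

records = [Record(i, i * 2, 0.0) for i in range(1000000)]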