I have a couple of numpy matrices (3-dimensional to be exact) which are stored in tuples
(a1,b1,c1)
(a2,b2,c2)
...
(an,bn,cn)
I would like to serialize each tuple into a file that can be read back into Python on another machine (Linux => Windows, both are x86-64). What would be a pythonic way to accomplish this?
numpy.savez or numpy.savez_compressed is the way to go. I've heard, but never experienced issues with certain types of arrays not pickling well.
I'm recalling this post (doesn't seem to have been much of an issue) as well as something about numpy.void not pickling. Likely not an issue, but there it is.
Pickle will probably work well
I also saw this: http://thsant.blogspot.com/2007/11/saving-numpy-arrays-which-is-fastest.html
Use shelve, pickle, cPickle, or shove. Each of these will let you store most kinds of python objects in a file; shove and shelve focus on dictionary-like objects that map keys to values, and shove will let you use a variety of database-like backends. If you find yourself exceeding the performance limitations on these libraries, consider going the database route, e.g. through SQLAlchemy.
I've used each of these libraries, and they work reasonably well within their own niche. I'd start with pickle or shelve, which are standard library.
I generally use cPickle, although I haven't done a formal comparison with other methods. Additionally, I always write the file as binary and use highest protocol setting:
f = open('fname.pkl','wb')
cPickle.dump(array_tuple,f,-1)
f.close()
Related
Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, human-readableness. I am generally wary of using pickle on data in my python programs. As a general rule, I much prefer to pass my data around with json or other serialization formats, specified by myself, manually.
The situation I'm interested in is: I've gathered some data in my python program and I want to run an embarrassingly parallel task on it a bunch of times in parallel.
As far as I know, the nicest parallelization library for doing this in python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a json array to a subprocess of GNU parallel. But of course, this feels like a hack, and loses all the fancy goodness of Dask.
Is it possible to specify a different default serialization format for my data, but continue to parallelize in python, preferably dask, without resorting to pickle or gnu parallel?
The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and scheduler. As can be seen, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers, so they can execute them, but data is generally sent via fairly efficient msgpack serialisation. There would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.
-edit-
You could of course include fuction names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.
Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:
def foo():
os.system('rm -rf /')
return {'lol': foo}
But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.
I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.
I have considered:
a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
a SQLite database (additional dependency not available by default)
custom disk-based balanced tree structure, such as B-tree or similar
Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?
After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.
(This is for a rewrite of shaback if you're curious.)
Another option is to use shelve, i know it's the same as pickle (under the hood) but i think it's a good option (that i didn't see in your list of options :-)) or maybe if you don't mind using a third party lib you can take a look at shove (it's like a shelve++).
I think this is what databases like sqlite are made for. Is there a reason you can't use it?
You could use a DBM style database. I'm doing a similar thing with dbm, just storing all the keys with a value of '1'. Since it's BSD, the dbhash module should work. (it's deprecated, so no Python 3; and not a great idea for long-term use because of that). Otherwise, use the modules gdbm (dbm.gdbm in Python 3) and ndbm(dbm.dbm in Python 3). There's also the module dumbdbm(dbm.dumbdbm in Python 3) which is pure python and always works, but a bit slower. Also, if you are going to have multiple simultaneous reads and writes, definitely do not use the dumbdbm module.
The various dbm modules all work just like a python dictionary, except the keys and the values need to be strings. You can use the "in" keyword just like you would for a set, or a dict.
Dbm and setting the second value as an arbitrary value of 1 as Brian Minton suggested is a convenient solution. cPickle is good too
However, You should also consider using json. Check google but AFAIK, it seems that the json parser is faster than Pickle/cPickle. (e.g., http://kovshenin.com/2010/pickle-vs-json-which-is-faster/)
I have a relatively large dictionary. How do I know the size? well when I save it using cPickle the size of the file will grow approx. 400Mb. cPickle is supposed to be much faster than pickle but loading and saving this file just takes a lot of time. I have a Dual Core laptop 2.6 Ghz with 4GB RAM on a Linux machine. Does anyone have any suggestions for a faster saving and loading of dictionaries in python? thanks
Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.
I know it's an old question but just as an update for those who still looking for an answer to this question:
The protocol argument has been updated in python 3 and now there are even faster and more efficient options (i.e. protocol=3 and protocol=4) which might not work under python 2.
You can read about it more in the reference.
In order to always use the best protocol supported by the python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
import pickle
# ...
with open('data.pickle', 'wb') as f:
# Pickle the 'data' dictionary using the highest protocol available.
pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
Sqlite
It might be worthwhile to store the data in a Sqlite database. Although there will be some development overhead when refactoring your program to work with Sqlite, it also becomes much easier and performant to query the database.
You also get transactions, atomicity, serialization, compression, etc. for free.
Depending on what version of Python you're using, you might already have sqlite built-in.
I have tried this for many projects and concluded that shelve is faster than pickle in saving data. Both perform the same at loading data.
Shelve is in fact a dirty solution.
That is because you have to be very careful with it. If you do not close a shelve file after opening it, or due to any reason some interruption happens in your code when you're in the middle of opening and closing it, the shelve file has high chance of getting corrupted (resulting in frustrating KeyErrors); which is really annoying given that we who are using them are interested in them because of storing our LARGE dict files which clearly also took a long time to be constructed
And that is why shelve is a dirty solution... It's still faster though. So!
You may test to compress your dictionnary (with some restrictions see : this post) it will be efficient if the disk access is the bottleneck.
That is a lot of data...
What kind of contents has your dictionary? If it is only primitive or fixed datatypes, maybe a real database or a custom file-format is the better option?
I'm currently working on a project where I need to transfer objects from ruby to python and back again, obviously serialization is the way to go. I've looked at things like yaml but decided to write my own as I didn't want to deal with the dependencies of the libraries and such when it came time to distribute. I've wrote up how this serialization format works here.
my question is that as this format is intended to work cross language between ruby and python,
how should I serialize ruby's symbols? I'm not aware of a object that works the same way in python. should a dump containing a symbol fail? should I just serialize it as a string? what would be best?
Doesn't that depend on what your project needs? If symbols are important, you'll need some way to deal with them.
I'm not a Ruby programmer, but from what I've just read, I think converting them to strings is probably easiest. The standard Python interpreter will reuse memory for identical short strings, which seems to be a key reason suggested for using symbols.
EDIT: If it needs to work for other programmers, passing values back and forth shouldn't change them. So you either have to handle symbols properly, or throw an error straight away. It should be simple enough in Python:
class Symbol(str):
pass
# In serialising code:
if isinstance(x, Symbol):
serialise_as_symbol(x)
Any reason you're not using a standard data interchange format like JSON or XML? They seem to be acceptable to countless applications, services, and programmers.
If symbols are a stumbling block then you have three choices, don't allow them, convert them to strings on the fly or figure out a way to make them universal and/or innocuous in other languages.
Looking for advice on the best technique for saving complex Python data structures across program sessions.
Here's a list of techniques I've come up with so far:
pickle/cpickle
json
jsonpickle
xml
database (like SQLite)
Pickle is the easiest and fastest technique, but my understanding is that there is no guarantee that pickle output will work across various versions of Python 2.x/3.x or across 32 and 64 bit implementations of Python.
Json only works for simple data structures. Jsonpickle seems to correct this AND seems to be written to work across different versions of Python.
Serializing to XML or to a database is possible, but represents extra effort since we would have to do the serialization ourselves manually.
Thank you,
Malcolm
You have a misconception about pickles: they are guaranteed to work across Python versions. You simply have to choose a protocol version that is supported by all the Python versions you care about.
The technique you left out is marshal, which is not guaranteed to work across Python versions (and btw, is how .pyc files are written).
You left out the marshal and shelve modules.
Also this python docs page covers persistence
Have you looked at PySyck or pyYAML?
What are your criteria for "best" ?
pickle can do most Python structures, deeply nested ones too
sqlite dbs can be easily queried (if you know sql :)
speed / memory ? trust no benchmarks that you haven't faked yourself.
(Fine print:
cPickle.dump(protocol=-1) compresses, in one case 15M pickle / 60M sqlite, but can break.
Strings that occur many times, e.g. country names, may take more memory than you expect;
see the builtin intern().
)