I have a large object in my Python 3 code which, when I try to pickle it with the pickle module, throws the following error:
TypeError: cannot serialize '_io.BufferedReader' object
However, dill.dump() and dill.load() are able to save and restore the object seamlessly.
What causes the trouble for the pickle module?
Now that dill saves and reconstructs the object without any error, is there any way to verify if the pickling and unpickling with dill went well?
How's it possible that pickle fails, but dill succeeds?
I'm the dill author.
1) Easiest thing to do is look at this file: https://github.com/uqfoundation/dill/blob/master/dill/_objects.py, it lists what pickle can serialize, and what dill can serialize.
2) you can try dill.copy, dill.check, and dill.pickles to test different levels of pickling and unpickling (see the sketch after this list). dill also includes utilities for detecting and diagnosing serialization issues in dill.detect and dill.pointers.
3) dill is built on pickle, and augments it by registering new serialization functions.
4) dill includes serialization variants which let the user choose among different strategies for serializing object dependencies (in dill.settings) -- including source code extraction and object reconstruction with dill.source (an extension of the stdlib inspect module).
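For item 2, here is a minimal sketch of those diagnostic helpers; it assumes a reasonably recent dill release, and the exact output varies by version:

import dill
import dill.detect

squared = lambda x: x**2                 # plain pickle can't serialize a lambda

print(dill.pickles(squared))             # True if dill can round-trip the object
copy_of_squared = dill.copy(squared)     # pickle then unpickle in one step
dill.check(squared)                      # round-trip the pickle in a child process

# dill.detect helps locate the offending member inside a composite object
print(dill.detect.badobjects(squared, depth=1))
print(dill.detect.baditems(globals()))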
Related
Can anybody help me understand how and why to choose between, e.g., pickle and dill?
My use case is the following.
I would like to dump an object which is an instance of a class derived by multiple inheritance from some external library classes. Moreover, one attribute of the class is a dictionary that has a function as a value.
Unfortunately, that function is defined within the scope of the class.
class MyClass:            # in the real code, derived from external library classes
    def f(self):
        def that_function():
            # do something
            pass
        # back within f() scope
        self.mydata = {'foo': that_function}
Any comment regarding robustness to external dependencies?
Or any other library I could consider for serialization?
I'm the dill author. You should use pickle if all the objects you want to pickle can be pickled by pickle.dump. If one or more of the objects are unpicklable with pickle, then use dill. See the pickle docs for what can be pickled with pickle. dill can pickle most python objects, with some exceptions.
If you want to consider alternatives to dill, there's cloudpickle, which has similar functionality to dill (and is very similar to dill when using dill.settings['recurse'] = True).
There are other serialization libraries, like json, but they can serialize fewer kinds of objects than pickle can, so you wouldn't choose them to serialize a user-built class.
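To make the trade-off concrete, here is a minimal sketch (MyClass mirrors the hypothetical class from the question) showing pickle failing on a locally defined function while dill round-trips it:

import pickle
import dill

class MyClass:
    def f(self):
        def that_function():
            return 42
        self.mydata = {'foo': that_function}

obj = MyClass()
obj.f()

try:
    pickle.dumps(obj)                      # local functions are not picklable
except (pickle.PicklingError, AttributeError, TypeError) as e:
    print('pickle failed:', e)

restored = dill.loads(dill.dumps(obj))     # dill serializes the nested function
print(restored.mydata['foo']())            # -> 42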
I need to serialize and transfer Python objects between interpreters, and pickle is an ideal choice. However, pickle lets the user of my lib serialize global functions, classes (and their instances), and modules, and these may not be present on the receiving end. For example, this runs without any errors:
import pickle

def user_func():
    return 42

pickle.dumps({'function': user_func})
So: How can I convince pickle to reject anything that is not a built-in type (or with a similar restriction)?
JSON and other universal formats are not really a good solution, as pickle allows Python-native data types such as set() and data structures with circular/shared references.
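One possible approach (just a sketch, assuming Python 3.8+, where pickle.Pickler.reducer_override exists; StrictPickler and strict_dumps are illustrative names) is a Pickler subclass that vetoes functions, classes, and modules before they are written:

import io
import pickle
import types

class StrictPickler(pickle.Pickler):
    def reducer_override(self, obj):
        # consulted for each object being pickled; reject anything that
        # pickle would serialize "by reference" (functions, classes, modules)
        if isinstance(obj, (types.FunctionType, type, types.ModuleType)):
            raise pickle.PicklingError('refusing to pickle %r' % (obj,))
        return NotImplemented            # otherwise use normal pickling

def strict_dumps(obj):
    buf = io.BytesIO()
    StrictPickler(buf).dump(obj)
    return buf.getvalue()

def user_func():
    return 42

strict_dumps({'n': 42, 's': {1, 2, 3}})   # fine: built-in types only
strict_dumps({'function': user_func})     # raises PicklingError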
Recently, I have been asked to make "our C++ lib work in the cloud".
Basically, the lib is compute-intensive (calculating prices), so it would make sense.
I have constructed a SWIG interface to make a Python version, with the idea of using MapReduce with MRJob.
I wanted to serialize the objects to a file and, using a mapper, deserialize them and calculate the price.
For example:
from mrjob.job import MRJob
import dill

class MRTest(MRJob):
    def mapper(self, key, value):
        obj = dill.loads(value)
        yield (key, obj.price())
But now I reach a dead end, since it seems that dill cannot handle SWIG extension types:
PicklingError: Can't pickle <class 'SwigPyObject'>: it's not found as builtins.SwigPyObject
Is there a way to make this work properly?
I'm the dill author. That's correct, dill can't pickle C++ objects. When you see "it's not found as builtins.SomeObject", that almost invariably means that you are trying to pickle an object that is not written in Python but uses Python to bind to C/C++ (i.e. an extension type). You have no hope of directly pickling such objects with a Python serializer.
However, since you are interested in pickling a subclass of an extension type, you can actually do it. All you will need to do is to give your object the appropriate state you want to save as an instance attribute or attributes, and provide a __reduce__ method to tell dill (or pickle) how to save the state of your object. This method is how python deals with serializing extension types. See:
https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-extension-types
There are probably better examples, but here's at least one example:
https://stackoverflow.com/a/19874769/4646678
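A minimal sketch of the idea (RawFeed and Feed are hypothetical stand-ins, not SWIG classes; the base class just holds something plain pickle refuses to serialize, here an open file handle):

import pickle

class RawFeed:
    def __init__(self, path):
        self.path = path
        self.handle = open(path, 'rb')   # unpicklable: '_io.BufferedReader'

class Feed(RawFeed):
    def __reduce__(self):
        # (callable, args): unpickling calls Feed(self.path), which rebuilds
        # the unpicklable part from plain Python state instead of serializing it
        return (self.__class__, (self.path,))

feed = Feed(__file__)                    # any existing file will do
restored = pickle.loads(pickle.dumps(feed))
print(restored.handle.read(10))

The same __reduce__ method works unchanged with dill.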
I have noticed that loading a dictionary of 5000 objects with pickle takes a long time (minutes) -- but loading a JSON file of 5000 entities takes a short time (seconds). I know that in general objects come with some overhead, and that in OOP the overhead associated with keeping track of such objects is part of the cost of the ease of using them. But why does loading a pickled object take SO long? What is happening under the hood? What are the costs associated with serializing an object as opposed to merely writing its data to a file? Does pickling restore the object to the same locations in memory or something? (Maybe moving other objects out of the way.) If serialized objects load more slowly (at least with pickle), then what is the benefit?
Assuming that you are using the Python 2.7 standard pickle and json modules...
Python 2.7 uses a pure-Python implementation of the pickle module by default, although a faster C implementation is available. http://docs.python.org/2/library/pickle.html
Conversely, Python 2.7 uses an optimized C implementation of the json module by default: http://docs.python.org/dev/whatsnew/2.7.html
So you're basically comparing a pure-Python deserializer to an optimized C deserializer. Not a fair comparison, even if the serialization formats were identical.
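On Python 2.7 you can opt into the C implementation explicitly; a common idiom:

try:
    import cPickle as pickle    # optimized C implementation (Python 2)
except ImportError:
    import pickle               # pure-Python fallback; on Python 3 the C
                                # accelerator is used automatically when available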
There are speed comparisons out there for the serialization of particular objects, comparing JSON and pickle and cPickle. The speed for each object will be different in each format. JSON is often comparable to or faster than pickle, and you often hear not to use pickle because it's insecure. The reason for the security concerns, and some of the speed lag, is that pickle doesn't actually serialize very much data -- instead it serializes some data plus a bunch of instructions, where the instructions are used to assemble the python objects. If you've ever inspected a pickle with the pickletools module, you'll see the type of instructions that pickle uses for each object. cPickle is, like json, not pure-python, and leverages optimized C, so it's often faster.
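A quick way to see that instruction stream for yourself:

import pickle
import pickletools

# prints the opcodes (the "instructions") contained in the pickle bytes
pickletools.dis(pickle.dumps({'a': 1, 'b': [2, 3]}))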
Pickling should, in general, take up less space than storing the object itself; however, some instruction sets can be quite large. JSON tends to be smaller… and is human-readable… however, since json stores everything as human-readable strings, it can't serialize as many different kinds of objects as pickle and cPickle can. So the trade-off is json for "security" (or inflexibility, depending on your perspective) and human-readability versus pickle with a broader range of objects it can serialize.
Another good reason for choosing pickle (over json) is that you can easily extend pickle, meaning that you can register a new method to serialize an object that pickle doesn't know how to pickle. Python gives you several ways to do that… __getstate__ and __setstate__, as well as the copy_reg module (copyreg in Python 3). Using these methods, you'll find that people have extended pickle to serialize most Python objects, for example dill.
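As a small illustration of those hooks (Logger is a hypothetical class; it drops an unpicklable file handle on the way out and reopens it on the way in):

import pickle

class Logger:
    def __init__(self, path):
        self.path = path
        self.fh = open(path, 'a')        # file handles can't be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['fh']                  # leave the handle out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.fh = open(self.path, 'a')   # recreate the handle on unpickling

log = pickle.loads(pickle.dumps(Logger('out.log')))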
Pickling doesn't restore the objects to the same memory location. However, it does reconstitute the object to the same state (generally) as when it was pickled. If you want to see some reasons why people pickle, take a look here:
Python serialization - Why pickle?
http://nbviewer.ipython.org/gist/minrk/5241793
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
I'm attempting to broadcast a module to other python processes with MPI. Of course, a module itself isn't pickleable, but the __dict__ is. Currently, I'm pickling the __dict__ and making a new module in the receiving process. This worked perfectly with some simple, custom modules. However, when I try to do this with NumPy, there's one thing that I can't pickle easily: the ufunc.
I've read this thread that suggests pickling the __name__ and __module__ of the ufunc, but it seems they rely on having numpy fully built and present before they rebuild it. I need to avoid using the import statement altogether in the receiving process, so I'm curious whether the getattr(numpy, name) approach mentioned would work with a module that doesn't have ufuncs included yet.
Also, I don't see a __module__ attribute on the ufunc in the NumPy documentation:
http://docs.scipy.org/doc/numpy/reference/ufuncs.html
Any help or suggestions, please?
EDIT: Sorry, forgot to include the thread mentioned above. http://mail.scipy.org/pipermail/numpy-discussion/2007-January/025778.html
Pickling a function in Python only serializes its name and the module it comes from. It does not transport code over the wire, so when unpickling you need to have the same libraries available as when pickling. On unpickling, Python simply imports the module in question, and grabs the items via getattr. (This is not limited to Numpy, but applies to pickling in general.)
Ufuncs don't pickle cleanly, which is a wart. Your main option, then, is to pickle just the __name__ (and maybe the __class__) of the ufunc and reconstruct it later on manually. (They are not actually Python functions, and do not have a __module__ attribute.)
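A minimal sketch of that by-name reconstruction, registered with copyreg (copy_reg on Python 2) so ordinary pickling of ufuncs works; it assumes the receiving side can look names up on a numpy module object:

import copyreg
import pickle
import numpy

def _reconstruct_ufunc(name):
    return getattr(numpy, name)          # look the ufunc up by name on unpickling

def _reduce_ufunc(ufunc):
    return (_reconstruct_ufunc, (ufunc.__name__,))

copyreg.pickle(numpy.ufunc, _reduce_ufunc)

restored = pickle.loads(pickle.dumps(numpy.add))
assert restored is numpy.add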