Can anybody help me understand how and why to choose between, e.g., pickle and dill?
My use case is the following.
I would like to dump an object which is an instance of a class derived by multiple inheritance from some external library classes. Moreover, one attribute of the class is a dictionary which has a function as a value.
Unfortunately, that function is defined within the scope of the class:
class MyClass:
    def f(self):
        def that_function():
            # do something
            pass
        # back within f() scope
        self.mydata = {'foo': that_function}
Any comments regarding robustness to external dependencies?
Or any other library I could consider for serialization?
I'm the dill author. You should use pickle if all the objects you want to pickle can be pickled by pickle.dump. If one or more of the objects are unpicklable with pickle, then use dill. See the pickle docs for what can be pickled with pickle. dill can pickle most Python objects, with some exceptions.
If you want to consider alternatives to dill, there's cloudpickle, which has a similar functionality to dill (and is very similar to dill when using dill.settings['recurse'] = True).
There are other serialization libraries, like json, but they actually serialize fewer objects than pickle does, so you wouldn't choose them to serialize a user-built class.
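A quick way to see the difference (a minimal sketch; the lambda just stands in for any object pickle refuses):

import pickle
import dill

f = lambda x: x + 1   # plain pickle refuses lambdas

try:
    pickle.dumps(f)
except (pickle.PicklingError, AttributeError) as e:
    print("pickle failed:", e)

g = dill.loads(dill.dumps(f))  # dill serializes the function itself
print(g(41))                   # -> 42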
Related
I have a large object in my Python 3 code which, when I try to pickle it with the pickle module, throws the following error:
TypeError: cannot serialize '_io.BufferedReader' object
However, dill.dump() and dill.load() are able to save and restore the object seamlessly.
What causes the trouble for the pickle module?
Now that dill saves and reconstructs the object without any error, is there a way to verify whether the pickling and unpickling with dill went well?
How is it possible that pickle fails but dill succeeds?
I'm the dill author.
1) Easiest thing to do is look at this file: https://github.com/uqfoundation/dill/blob/master/dill/_objects.py, it lists what pickle can serialize, and what dill can serialize.
2) You can try dill.copy, dill.check, and dill.pickles to check different levels of pickling and unpickling. dill also includes more utilities for detecting and diagnosing serialization issues in dill.detect and dill.pointers (a quick sketch follows after this list).
3) dill is built on pickle, and augments it by registering new serialization functions.
4) dill includes serialization variants which enable the user to choose from different object-dependency serialization strategies (in dill.settings), including source code extraction and object reconstitution with dill.source (an extension of the stdlib inspect module).
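For example, a minimal sketch of the diagnostics mentioned in 2) (the dictionary is just a stand-in object):

import dill
from dill import detect

obj = {'f': lambda x: x}

print(dill.pickles(obj))     # True if a dill round-trip succeeds
clone = dill.copy(obj)       # dump-then-load in one step; raises on failure
print(detect.baditems(obj))  # items inside obj that fail to pickle ([] here)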
I need to serialize and transfer Python objects between interpreters, and pickle is an ideal choice. However, pickle allows the user of my lib to serialize global functions, classes (and their instances), and modules, which may not be present on the receiving end; e.g. this runs without any errors:
import pickle

def user_func():
    return 42

pickle.dumps({'function': user_func})
So: How can I convince pickle to reject anything that is not a built-in type (or with a similar restriction)?
JSON and other universal formats are not really a good solution, as pickle supports Python-native data types such as set() and data structures with circular/shared references.
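One possible approach (a sketch, not a canonical answer; it assumes Python 3.8+ for Pickler.reducer_override, and the RestrictedPickler name is made up here). Rejecting a class also rejects its instances, since pickling an instance saves its class:

import io
import pickle
import types

class RestrictedPickler(pickle.Pickler):
    # reducer_override is consulted for every object before the normal
    # pickling machinery runs (Python 3.8+).
    def reducer_override(self, obj):
        if isinstance(obj, (types.FunctionType, types.ModuleType)):
            raise pickle.PicklingError("refusing to pickle %r" % (obj,))
        if isinstance(obj, type) and obj.__module__ != 'builtins':
            raise pickle.PicklingError("refusing to pickle class %r" % (obj,))
        return NotImplemented  # anything else: use the normal machinery

def restricted_dumps(obj):
    buf = io.BytesIO()
    RestrictedPickler(buf).dump(obj)
    return buf.getvalue()

restricted_dumps({'data': {1, 2, 3}})        # fine: built-in types only
# restricted_dumps({'function': user_func})  # raises PicklingError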
Recently, I have been asked to make "our C++ lib work in the cloud".
Basically, the lib is compute-intensive (it calculates prices), so it would make sense.
I have built a SWIG interface to produce a Python version, with the idea of using MapReduce via MRJob.
I wanted to serialize the objects to a file and then, in a mapper, deserialize them and calculate the price.
For example:
import dill
from mrjob.job import MRJob

class MRTest(MRJob):
    def mapper(self, key, value):
        obj = dill.loads(value)
        yield (key, obj.price())
But now I've reached a dead end, since it seems that dill cannot handle SWIG extension types:
PicklingError: Can't pickle <class 'SwigPyObject'>: it's not found as builtins.SwigPyObject
Is there a way to make this work properly?
I'm the dill author. That's correct: dill can't pickle C++ objects. When you see "it's not found as builtins.some_object…", that almost invariably means you are trying to pickle an object that is not written in Python but uses Python to bind to C/C++ (i.e. an extension type). You have no hope of directly pickling such objects with a Python serializer.
However, since you are interested in pickling a subclass of an extension type, you can actually do it. All you need to do is give your object the state you want to save as an instance attribute (or attributes), and provide a __reduce__ method that tells dill (or pickle) how to save and restore that state. This mechanism is how Python deals with serializing extension types. See:
https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-extension-types
There are probably better examples, but here's at least one example:
https://stackoverflow.com/a/19874769/4646678
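A minimal sketch of the idea (ExtBase and the other names here are made up; ExtBase stands in for the SWIG-generated base class):

import pickle

class ExtBase(object):
    # Stand-in for a SWIG extension type holding an unpicklable C++ handle.
    def __init__(self, ticker, notional):
        self._handle = None  # imagine a C++ resource here

class PricedInstrument(ExtBase):
    def __init__(self, ticker, notional):
        ExtBase.__init__(self, ticker, notional)
        self.ticker = ticker      # plain-Python state worth saving
        self.notional = notional

    def __reduce__(self):
        # On load, dill/pickle calls PricedInstrument(ticker, notional),
        # which rebuilds the C++ side in __init__.
        return (self.__class__, (self.ticker, self.notional))

obj = pickle.loads(pickle.dumps(PricedInstrument("ACME", 1e6)))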
Using Python 2.7,
I am passing many large objects across processes using a manager derived from multiprocessing.managers.BaseManager, and I would like to use cPickle as the serializer to save time. How can this be done? I see that the BaseManager initializer takes a serializer argument, but the only options appear to be pickle and xmlrpclib.
It seems like you can't, strictly speaking, do what you're asking.
In fact, there's a fork of multiprocessing called pathos, written by the author of dill (an alternative to pickle), motivated in part by exactly this limited ability to control the serializer.
I would personally suggest you use ipython.parallel, as it seems more actively maintained.
See more details on this in the piece Parallelism and Serialization.
I'm the author of dill and pathos. multiprocessing should use cPickle by default, so you shouldn't have to do anything.
If your object doesn't serialize, you have two options: move to a fork of multiprocessing (or some other parallel backend), or add methods to your class (i.e. __reduce__ methods) that register how to serialize the object.
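For the first option, a minimal sketch with pathos, whose pools serialize with dill (so even lambdas can be shipped to workers):

from pathos.multiprocessing import ProcessingPool

pool = ProcessingPool(nodes=4)
print(pool.map(lambda x: x * x, range(10)))  # dill handles the lambda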
I have a Python module called model with basically the following content:
class Database:
    class Publication(object):
        pass

    class Article(Publication):
        pass

    class Book(Publication):
        pass

class AnotherDatabase:
    class Seminar(object):
        pass

...
I define the objects in the database as classes nested under a main class in order to organize them more distinctly. The objects are parsed from a large XML file, which takes time, so I would like to pickle the parsed objects to make them loadable in less time.
I get the error:
pickle.PicklingError: Can't pickle <class 'project.model.Article'>: it's not found as project.model.Article
The class is now project.model.Article, not project.model.Database.Article as defined. Can I fix this error and keep the classes nested like above? Is it a bad idea to organize classes by nesting them?
When an inner class is created, there is no way for the interpreter to know which class it was defined inside of; that information is not recorded. This is why pickle does not know where to look for the class Article. (Python 3.3 later added __qualname__ to record exactly this, but it does not help on Python 2.)
Because of this, there are numerous issues when using inner classes, not just with pickling. For example, if there are classes at module scope with the same name, it introduces a lot of ambiguity, as there is no easy way to tell the two types apart (e.g. via repr or when debugging).
As a result it is generally best to avoid nested classes in Python unless you have a very good reason for doing so.
It's certainly a lot simpler to keep your classes unnested. As an alternative, you can use packages to group the classes together.
In any case, there is an alternative serializer named cerealizer which I think could handle the nested classes. You would need to register the classes with it before deserialization. I've used it before when pickle wouldn't suffice (also for problems related to classes) and it works well!
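A rough sketch of how that could look (assuming cerealizer's register/dumps/loads API; whether it copes with the nesting is worth verifying for your case):

import cerealizer

class Database:
    class Publication(object):
        pass

    class Article(Publication):
        pass

# Each class must be registered before dumping or loading.
cerealizer.register(Database.Article)

data = cerealizer.dumps(Database.Article())
restored = cerealizer.loads(data)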