Limit Python pickle to simple types

I need to serialize and transfer Python objects between interpreters, and pickle is an ideal choice. However, pickle lets users of my library serialize global functions, classes (and their instances), and modules, which may not be present on the receiving end. For example, this runs without any errors:
import pickle

def user_func():
    return 42

pickle.dumps({'function': user_func})
So: How can I convince pickle to reject anything that is not a built-in type (or with a similar restriction)?
JSON and other universal formats are not really a good solution, as pickle supports Python-native data types such as set() and data structures with circular/shared references.


which python library to serialize a poorly designed class

Can anybody help me understand how and why to choose between e.g. pickle and dill?
My use case is the following.
I would like to dump an object which is an instance of a class derived by multiple inheritance from some external library classes. Moreover, one attribute of the class is a dictionary which has a function as a value.
Unfortunately, that function is defined within the scope of a method of the class:
class MyClass:
    def f(self):
        def that_function():
            # do something
            ...
        # back within f() scope
        self.mydata = {'foo': that_function}
Any comments regarding robustness to external dependencies?
Or any other library I could consider for serialization?
I'm the dill author. You should use pickle if all the objects you want to pickle can be pickled by pickle.dump. If one or more of the objects are unpicklable with pickle, then use dill. See the pickle docs for what can be pickled with pickle. dill can pickle most Python objects, with some exceptions.
If you want to consider alternatives to dill, there's cloudpickle, which has similar functionality to dill (and behaves very much like dill when using dill.settings['recurse'] = True).
There are other serialization libraries, like json, but they serialize fewer kinds of objects than pickle does, so you wouldn't choose them to serialize a user-built class.
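A quick way to apply that rule of thumb is a round-trip check. `picklable` here is a made-up helper name, a minimal sketch using only the standard library:

```python
import pickle

def picklable(obj):
    """Return True if `obj` survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

print(picklable({1, 2, 3}))        # True: built-in types are fine
print(picklable(lambda x: x + 1))  # False: lambdas need dill or cloudpickle
```

If the check fails for any object in your graph (as it will for a function defined inside a method), that's the signal to reach for dill.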

Is there any difference between Pickling and Serialization?

I have come across these two terms so often while reading about Python objects. However, there is some confusion between pickling and serialization, since in one place I read:
The pickle module implements an algorithm for turning an arbitrary
Python object into a series of bytes. This process is also called
"serializing" the object.
If serializing and pickling are the same process, why use different terms for them?
You are misreading the article. Pickling and serialisation are not synonymous, nor does the text claim them to be.
Paraphrasing slightly, the text says this:
This module implements an algorithm for turning an object into a series of bytes. This process is also called serializing the object.
I removed the module name, pickle, deliberately. The module implements a process, an algorithm, and that process is commonly known as serialisation.
There are other implementations of that process. You could use JSON or XML to serialise data to text. There is also the marshal module. Other languages have other serialization formats; the R language has one, so does Java. Etc.
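For instance, the same object can be run through two different implementations of the process:

```python
import json
import pickle

data = {'answer': 42, 'tags': ['a', 'b']}

as_json = json.dumps(data)      # one serialization: human-readable text
as_pickle = pickle.dumps(data)  # another: pickle's binary format

# Both round-trip back to an equal object.
print(json.loads(as_json) == pickle.loads(as_pickle) == data)  # True
```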
See the Wikipedia article on the subject:
In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.
Python picked the name pickle because it modelled the process on how this was handled in Modula-3, where it was also called pickling. See Pickles: Why are they called that?
In Python, pickle refers to a module that provides (a specific) serialization of Python objects.
Serialization itself is a more general term; Python objects can also be serialized into JSON, for example.
https://en.wikipedia.org/wiki/Serialization

Why can't I pickle this python class using pickle? [duplicate]

Python docs mention this word a lot and I want to know what it means.
It simply means it can be serialized by the pickle module. For a basic explanation of this, see What can be pickled and unpickled?. Pickling Class Instances provides more details, and shows how classes can customize the process.
Things that are usually not picklable are, for example, sockets, file handles, database connections, and so on. Everything that's built up (recursively) from basic Python types (dicts, lists, primitives, objects, object references, even circular ones) can be pickled by default.
You can implement custom pickling code that will, for example, store the configuration of a database connection and restore it afterwards, but you will need special, custom logic for this.
All of this makes pickling a lot more powerful than XML, JSON and YAML (but definitely not as readable).
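The custom logic mentioned above usually lives in __getstate__/__setstate__. A sketch, where `Store` and its in-memory sqlite path are made-up illustrations: the live connection is dropped when pickling and re-opened from the stored configuration when unpickling.

```python
import pickle
import sqlite3

class Store:
    """Wraps an unpicklable sqlite3 connection; pickles its config instead."""
    def __init__(self, path):
        self.path = path
        self.conn = sqlite3.connect(path)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['conn']               # drop the live connection
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conn = sqlite3.connect(self.path)  # re-open on unpickle

clone = pickle.loads(pickle.dumps(Store(':memory:')))
print(clone.path)  # ':memory:'
```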
These are all great answers, but for anyone who's new to programming and still confused here's the simple answer:
Pickling an object makes it so you can store it as it currently is, long term (often to hard disk). A bit like saving in a video game.
So anything that's actively changing (like a live connection to a database) can't be stored directly (though you could probably figure out a way to store the information needed to create a new connection, and that you could pickle).
Bonus definition: Serializing is packaging it in a form that can be handed off to another program. Unserializing it is unpacking something you got sent so that you can use it
Pickling is the process in which Python objects are converted into a simple binary representation that can be written to a file for storage. This is done to persist Python objects and is also called serialization. You can infer from this what de-serialization or unpickling means.
So when we say an object is picklable, it means that the object can be serialized using Python's pickle module.

Why does loading a pickle object take so much longer than loading a file?

I have noticed that loading a dictionary of 5000 objects with pickle takes a long time (minutes) -- but loading a JSON file of 5000 entities takes a short time (seconds). I know that in general objects come with some overhead -- and that in OOP the overhead associated with keeping track of such objects is part of the cost for the ease of using them. But why does loading a pickled object take SO long? What is happening under the hood? What are the costs associated with serializing an object as opposed to merely writing its data to a file? Does pickling restore the object to the same locations in memory or something? (Maybe moving other objects out of the way.) If serialization loads slower (at least with pickle), then what is the benefit?
Assuming that you are using the Python 2.7 standard pickle and json modules...
Python 2.7 uses a pure-Python implementation of the pickle module by default, although a faster C implementation is available. http://docs.python.org/2/library/pickle.html
Conversely, Python 2.7 uses an optimized C implementation of the json module by default: http://docs.python.org/dev/whatsnew/2.7.html
So you're basically comparing a pure-Python deserializer to an optimized C deserializer. Not a fair comparison, even if the serialization formats were identical.
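On Python 3, where both modules use C accelerators by default, the gap narrows considerably. A rough micro-benchmark sketch (absolute numbers will vary by machine and payload):

```python
import json
import pickle
import timeit

data = {str(i): list(range(10)) for i in range(1000)}
p = pickle.dumps(data)
j = json.dumps(data)

# Time 100 deserializations of the same payload with each module.
print('pickle:', timeit.timeit(lambda: pickle.loads(p), number=100))
print('json:  ', timeit.timeit(lambda: json.loads(j), number=100))
```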
There are speed comparisons out there for the serialization of particular objects, comparing JSON with pickle and cPickle. The speed for each object will be different in each format. JSON is usually comparatively faster than pickle, and you often hear not to use pickle because it's insecure. The reason for the security concerns, and some of the speed lag, is that pickle doesn't actually serialize very much data -- instead it serializes some data and a bunch of instructions, where the instructions are used to assemble the Python objects. If you've ever looked at the pickletools module, you'll see the type of instructions that pickle uses for each object. cPickle is, like json, not pure-Python, and leverages optimized C, so it's often faster.
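You can see those instructions for yourself with the standard pickletools module, which disassembles a pickle into its opcodes:

```python
import pickle
import pickletools

# A pickle is a program for a small stack machine, not a passive data dump:
pickletools.dis(pickle.dumps({'nums': [1, 2]}))
```

The output lists opcodes such as EMPTY_DICT and EMPTY_LIST followed by the instructions that fill them in.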
Pickling should, in general, take up less space than storing the object itself; however, some instruction sets can be quite large. JSON tends to be smaller… and is human-readable… however, since json stores everything as human-readable strings… it can't serialize as many different kinds of objects as pickle and cPickle can. So the trade-off is json for "security" (or inflexibility, depending on your perspective) and human-readability versus pickle with a broader range of objects it can serialize.
Another good reason for choosing pickle (over json) is that you can easily extend pickle, meaning that you can register a new method to serialize an object that pickle doesn't know how to pickle. Python gives you several ways to do that… __getstate__ and __setstate__, as well as the copyreg module (copy_reg in Python 2). Using these methods, people have extended pickle to serialize most Python objects, for example dill.
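As a sketch of the copyreg route (`Point` and `reduce_point` are made-up names; in practice you'd use this for a third-party type you can't modify):

```python
import copyreg
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def reduce_point(p):
    # Tell pickle how to rebuild a Point: call Point(x, y) on unpickle.
    return (Point, (p.x, p.y))

copyreg.pickle(Point, reduce_point)

restored = pickle.loads(pickle.dumps(Point(1, 2)))
print(restored.x, restored.y)  # 1 2
```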
Pickling doesn't restore the objects to the same memory location. However, it does reconstitute the object to the same state (generally) as when it was pickled. If you want to see some reasons why people pickle, take a look here:
Python serialization - Why pickle?
http://nbviewer.ipython.org/gist/minrk/5241793
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

What available Python modules are there to save-and-load data?

There are many scattered posts out on StackOverflow, regarding Python modules used to save and load data.
I myself am familiar with json and pickle, and I have heard of pytables too. There are probably more out there. Also, each module seems to fit a certain purpose and has its own limits (e.g. loading a large list or dictionary with pickle takes ages, if it works at all). Hence it would be nice to have a proper overview of the possibilities.
Could you then help providing a comprehensive list of modules used to save and load data, describing for each module:
what the general purpose of the module is,
its limits,
why you would choose this module over others?
marshal:
Pros:
Can read and write Python values in a binary format, so it is much faster than pickle's default (character-based) protocol 0.
Cons:
Not all Python object types are supported. Some unsupported types, such as subclasses of builtins, will appear to marshal and unmarshal correctly.
Is not intended to be secure against erroneous or maliciously constructed data.
The Python maintainers reserve the right to modify the marshal format in backward incompatible ways should the need arise
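A minimal sketch of marshal on core types; note that it raises on anything it doesn't support rather than degrading silently:

```python
import marshal

data = {'nums': [1, 2, 3], 'name': 'demo'}
blob = marshal.dumps(data)          # compact binary; core types only
print(marshal.loads(blob) == data)  # True

# Unsupported types fail loudly:
try:
    marshal.dumps(object())
except ValueError as e:
    print(e)
```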
shelve:
Pros:
Values in a shelf can be essentially arbitrary Python objects
Cons:
Does not support concurrent read/write access to shelved objects
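Basic shelve usage looks like a persistent dict whose values are pickled on write (the temp-file path here is just for illustration):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo_shelf')

with shelve.open(path) as db:       # keys are strings, values are pickled
    db['config'] = {'retries': 3, 'hosts': ['a', 'b']}

with shelve.open(path) as db:       # data persists across opens
    print(db['config']['retries'])  # 3
```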
ZODB (suggested by @Duncan):
Pros:
transparent persistence
full transaction support
pluggable storage
scalable architecture
Cons:
not part of the standard library.
unable (easily) to reload data unless the original Python object model used for persisting is available (consider version difficulties and data portability)
There is an overview of the standard library's data persistence modules.
