I have been using pickle and was very happy, then I saw this article: Don't Pickle Your Data
Reading further it seems like:
Pickle is slow
Pickle is unsafe
Pickle isn’t human readable
Pickle isn’t language-agnostic
I’ve switched to saving my data as JSON, but I wanted to know about best practice:
Given all these issues, when would you ever use pickle? What specific situations call for using it?
Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:
Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here, you just want to dump the program's state as-is and load it later.
Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.
In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.
You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: Pickle is indeed slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. It adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
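To make that trade-off concrete, here is a small sketch (the Point class and the to_dict/from_dict helpers are made up for illustration): pickle round-trips an arbitrary object in one call, while JSON needs mapping code written, and maintained, in both directions.
import json
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)

# pickle: the whole object round-trips with no extra code
restored = pickle.loads(pickle.dumps(p))

# JSON: you write the mapping in both directions yourself
def to_dict(pt):
    return {'x': pt.x, 'y': pt.y}

def from_dict(d):
    return Point(d['x'], d['y'])

restored_from_json = from_dict(json.loads(json.dumps(to_dict(p))))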
Pickle has the advantage of convenience -- it can serialize arbitrary object graphs with no extra work, and works on a pretty broad range of Python types. With that said, it would be unusual for me to use Pickle in new code. JSON is just a lot cleaner to work with.
I usually use neither pickle nor JSON, but MessagePack: it is both safe and fast, and produces compact serialized data.
An additional advantage is the possibility of exchanging data with software written in other languages (which, of course, is also true of JSON).
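As a quick illustration, a MessagePack round trip looks like this (a sketch assuming the msgpack package is installed; with msgpack >= 1.0, string keys come back as str by default):
import msgpack

data = {'name': 'example', 'values': [1, 2, 3], 'ok': True}
packed = msgpack.packb(data)        # compact bytes
restored = msgpack.unpackb(packed)  # back to Python objects
print(restored == data)             # True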
I have tried several methods and found that calling cPickle with the protocol argument of the dumps method set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.
import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np
num_tests = 10
obj = np.random.normal(0.5, 1, [240, 320, 3])
command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle: %f seconds" % result)
command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle: %f seconds" % result)
command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest: %f seconds" % result)
command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json: %f seconds" % result)
command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack: %f seconds" % result)
Output:
pickle : 0.847938 seconds
cPickle : 0.810384 seconds
cPickle highest: 0.004283 seconds
json : 1.769215 seconds
msgpack : 0.270886 seconds
So, I prefer cPickle with the highest dumping protocol in situations that require real-time performance, such as video streaming from a camera to a server.
You can find some answers on JSON vs. pickle security: JSON can only serialize unicode, int, float, NoneType, bool, list and dict. You can't use it if you want to serialize more advanced objects such as class instances. Note that for those kinds of objects, there is no hope of being language-agnostic.
Also, using cPickle instead of pickle partially addresses the speed issue.
I have a large list of various types of objects I would like Z3 to synthesize in my Python project. Since the constraints associated with each object to be synthesized are independent, this process can be completely parallelized. That is, instead of synthesizing one value at a time, if I have a machine with 4 cores, I can synthesize 4 values at the same time. To do this, we must use Python's multiprocessing package instead of threading (due to the GIL and the fact that the workload is CPU-bound).
For simplicity, say I have a simple str synthesizer that synthesizes a new str that is lexicographically less than a given input value, something like this:
def lt_constraint(value):
    solver = Solver()
    # do a number of processing steps on 'value', which is an input string
    # ... define char and _chars in code here
    template = Concat(Re(StringVal(value[:offset])), char, Star(_chars))
    solver.add(InRe(String("var"), template))
    if solver.check() == sat:
        value = solver.model()[self.var]
    return convert_to_str(value)
Now if I have a number of values, I want to run the function above in parallel:
from pathos.multiprocessing import ProcessingPool as Pool
with Pool(processes=4) as pool:
value_list = ['This', 'is', 'an', 'example']
synthesized_strs = pool.map(lt_constraint, value_list)
I use pathos, hoping that it will handle the pickling issue, but I still received this error:
TypeError: cannot pickle 're.Match' object
which I believe is because Z3 uses methods in re and they need to be pickled when pickling lt_constraint(), but dill cannot pickle those.
Is there any other way to parallelize Z3 for my case (other than implementing pickling myself for re or what not)?
Thanks!
Stack Overflow works best when you include your whole code, so people can experiment with it. Having said that, I had good luck with the following:
from z3 import *
import concurrent.futures
def getVal(value):
    solver = Solver()
    var = Int('var')
    solver.add(var > value)
    if solver.check() == sat:
        return solver.model()[var].as_long()
    else:
        return 'CANT SOLVE'
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(getVal, i) for i in [1, 2, 3]]
    results = [f.result() for f in futures]
    print(results)
This prints:
$ python3.9 a.py
[2, 3, 4]
Without the details of how you constructed your lt_constraint, it's hard to tell whether this will work for your case. But it seems the concurrent.futures library works well with z3, at least so long as simple constraints are used. Give this a try and see if it handles your case as well. If not, please post the full code as a minimal reproducible example; see https://stackoverflow.com/help/minimal-reproducible-example
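If the workload turns out to be genuinely CPU-bound and the GIL becomes the bottleneck, the same pattern can be tried with a process pool instead. This is only a sketch: it reuses getVal from the snippet above and relies on it being a module-level function, so only the plain int arguments and results need to be pickled, never any z3 objects.
import concurrent.futures

if __name__ == '__main__':
    # each worker process builds its own Solver inside getVal
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(getVal, [1, 2, 3]))
    print(results)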
According to this answer, the call len(s) has a complexity of O(1).
Then why is calling it on a downloaded 27 kB file so much slower than on a 1 kB file?
27kb
>>> timeit.timeit('x = len(r.text)', 'from requests import get; r = get("https://cdn.discordapp.com/attachments/280190011918254081/293010649754370048/Journal.170203183244.01.log")', number = 20)
5.78126864130499
1kb
>>> timeit.timeit('x = len(r.text)', 'from requests import get; r = get("https://cdn.discordapp.com/attachments/280190011918254081/293016636288663562/Journal.170109120508.01.log")', number = 20)
0.00036539355403419904
The problem is that this example ran on my dev machine, which is a normal work PC. The machine the code should run on is a Raspberry Pi, which is orders of magnitude slower.
Try assigning r.text to a local variable during your setup phase. It's a lazy property, not a plain attribute, and you're timing the work of constructing the value, which decodes from the internally cached bytes to str, not just the len call.
Hat tip to Martijn Pieters for the precise references!
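A sketch of what that looks like, with the URL copied from the question; r.text is decoded once in the setup string, so the timed statement measures only len():
import timeit

setup = (
    'from requests import get; '
    'r = get("https://cdn.discordapp.com/attachments/280190011918254081/'
    '293010649754370048/Journal.170203183244.01.log"); '
    't = r.text'  # the expensive lazy decode happens here, outside the timing
)
print(timeit.timeit('x = len(t)', setup=setup, number=20))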
I am used to coding in C/C++, and when I see the following array operation, I feel some CPU is being wasted:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'
Therefore I wonder:
Will this line be executed (interpreted) as the creation of a temporary array (memory allocation), followed by concatenation of the first three cells (again memory allocation)?
Or is the python interpreter smart enough?
(I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other python interpreters/compilers...)
Is there a trick to write a replacement line more CPU efficient and still understandable/elegant?
(You can provide specific Python2 and/or Python3 tricks and tips)
I have no idea about the CPU usage for this purpose, but isn't that why we use high-level languages in some way?
Another solution would be using regular expressions; using a compiled pattern should allow background optimisations:
import re
version = '1.2.3.4.5-RC4'
pat = re.compile(r'^(\d+\.\d+\.\d+)')
res = pat.match(version)
if res:
    print res.group(1)
Edit: As suggested by @jonrsharpe, I also ran the timeit benchmark. Here are my results:
def extract_vers(str):
    res = pat.match(str)
    if res:
        return res.group(1)
    else:
        return False
>>> timeit.timeit("api1(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.9013631343841553
>>> timeit.timeit("api2(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.3482811450958252
>>> timeit.timeit("extract_vers(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.174590826034546
Edit: Anyway, some libraries exist in Python, such as distutils.version, to do the job.
You should have a look at that answer.
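For example, distutils.version.LooseVersion splits a version string into comparable components (a sketch; note that distutils.version is deprecated in recent Python releases in favour of the third-party packaging module):
from distutils.version import LooseVersion

v = LooseVersion('1.2.3.4.5-RC4')
print(v.version[:3])                                   # [1, 2, 3]
print(LooseVersion('1.2.3') < LooseVersion('1.10.0'))  # True: numeric-aware comparison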
To answer your first question: no, this will not be optimised out by the interpreter. Python will create a list from the string, then create a second list for the slice, then put the list items back together into a new string.
To cover the second, you can optimise this slightly by limiting the split with the optional maxsplit argument:
>>> v = '1.2.3.4.5-RC4'
>>> v.split(".", 3)
['1', '2', '3', '4.5-RC4']
Once the third '.' is found, Python stops searching through the string. You can also neaten the code slightly by removing the default 0 argument from the slice:
api = '.'.join(version.split('.', 3)[:3])
Note, however, that any difference in performance is negligible:
>>> import timeit
>>> def test1(version):
...     return '.'.join(version.split('.')[0:3])
>>> def test2(version):
...     return '.'.join(version.split('.', 3)[:3])
>>> timeit.timeit("test1(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0458565345561743
>>> timeit.timeit("test2(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0842980287537776
The benefit of maxsplit becomes clearer with longer strings containing more irrelevant '.'s:
>>> timeit.timeit("s.split('.')", setup="s='1.'*100")
3.460900054011617
>>> timeit.timeit("s.split('.', 3)", setup="s='1.'*100")
0.5287887450379003
I am used to coding in C/C++, and when I see the following array operation, I feel some CPU is being wasted:
A feeling that CPU is being wasted is absolutely normal for C/C++ programmers facing Python code. Your code:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join(version.split('.')[0:3]) # extract '1.2.3'
is absolutely fine in Python; there is no simplification possible. Only if you have to do it thousands of times should you consider using a library function or writing your own.
On Linux systems, root privileges can be granted more selectively than by setting the setuid bit, using file capabilities. See capabilities(7) for details. These are attributes of files and can be read using the getcap program. How can these attributes be retrieved in Python?
Even though running the getcap program via e.g. subprocess to answer such a question is possible, it is not desirable when retrieving very many capabilities.
It should be possible to devise a solution using ctypes. Are there alternatives to this approach or even libraries facilitating this task?
Python 3.3 comes with os.getxattr. If that's not available, one way would be using ctypes, at least to get the raw data, or maybe using pyxattr.
For pyxattr:
>>> import xattr
>>> xattr.listxattr("/bin/ping")
(u'security.capability',)
>>> xattr.getxattr("/bin/ping", "security.capability")
'\x00\x00\x00\x02\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
For Python 3.3's version, it's essentially the same, just importing os instead of xattr. ctypes is a bit more involved, though.
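A sketch of the stdlib route (Python 3.3+, Linux only), which returns the same raw bytes as the pyxattr call above:
import os

print(os.listxattr('/bin/ping'))                        # e.g. ['security.capability']
print(os.getxattr('/bin/ping', 'security.capability'))  # the raw binary blob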
Now, we're getting the raw result, meaning that those two are most useful only for retrieving textual attributes. But we can use the same approach as getcap, through libcap itself:
import ctypes
libcap = ctypes.cdll.LoadLibrary("libcap.so")
cap_t = libcap.cap_get_file('/bin/ping')
libcap.cap_to_text.restype = ctypes.c_char_p
libcap.cap_to_text(cap_t, None)
which gives me:
'= cap_net_raw+p'
probably more useful for you.
PS: note that cap_to_text returns a malloced string. It's your job to deallocate it using cap_free
Hint about the "binary gibberish":
>>> import struct
>>> caps = '\x00\x00\x00\x02\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> struct.unpack("<IIIII", caps)
(33554432, 8192, 0, 0, 0)
In that 8192, the only active bit is the 13th. If you go to linux/capability.h, you'll see that CAP_NET_RAW is defined at 13.
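For instance, checking that bit by hand (the constant is copied from linux/capability.h):
CAP_NET_RAW = 13
permitted_low = 8192  # second word from the struct.unpack() result above
print(bool(permitted_low & (1 << CAP_NET_RAW)))  # True: cap_net_raw is set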
Now, if you want to write a module with all those constants, you can decode the info. But I'd say it's much more laborious than just using ctypes + libcap.
I tried the code from Ricardo Cárdenes's answer, but it did not work properly for me, because some details of the ctypes invocation were incorrect. This issue caused a truncated path string to be passed to getxattr(...) inside libcap, which thus returned the capabilities list for the wrong item (the / directory, or whatever the first path character named, rather than the actual path).
It is very important to remember and account for the difference between str and bytes in Python 3.X. This code works properly on Python 3.5/3.6:
#!/usr/bin/env python3
import ctypes
import os
import sys
# load shared library
libcap = ctypes.cdll.LoadLibrary('libcap.so')
class libcap_auto_c_char_p(ctypes.c_char_p):
    def __del__(self):
        libcap.cap_free(self)
# cap_t cap_get_file(const char *path_p)
libcap.cap_get_file.argtypes = [ctypes.c_char_p]
libcap.cap_get_file.restype = ctypes.c_void_p
# char* cap_to_text(cap_t caps, ssize_t *length_p)
libcap.cap_to_text.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
libcap.cap_to_text.restype = libcap_auto_c_char_p
def cap_get_file(path):
    cap_t = libcap.cap_get_file(path.encode('utf-8'))
    if cap_t is None:
        return ''
    else:
        return libcap.cap_to_text(cap_t, None).value.decode('utf-8')
print(cap_get_file('/usr/bin/traceroute6.iputils'))
print(cap_get_file('/usr/bin/systemd-detect-virt'))
print(cap_get_file('/usr/bin/mtr'))
print(cap_get_file('/usr/bin/tar'))
print(cap_get_file('/usr/bin/bogus'))
The output will look like this (anything nonexistent, or with no capabilities set, just returns ''):
= cap_net_raw+ep
= cap_dac_override,cap_sys_ptrace+ep
= cap_net_raw+ep
I'm seeking advice about methods of implementing object persistence in Python. To be more precise, I wish to be able to link a Python object to a file in such a way that any Python process that opens a representation of that file shares the same information: any process can change its object and the changes will propagate to the other processes, and even if all processes "storing" the object are closed, the file will remain and can be re-opened by another process.
I found three main candidates for this in my distribution of Python - anydbm, pickle, and shelve (dbm appeared to be perfect, but it is Unix-only, and I am on Windows). However, they all have flaws:
anydbm can only handle a dictionary of string values (I'm seeking to store a list of dictionaries, all of which have string keys and string values, though ideally I would seek a module with no type restrictions)
shelve requires that a file be re-opened before changes propagate - for instance, if two processes A and B load the same file (containing a shelved empty list), and A adds an item to the list and calls sync(), B will still see the list as being empty until it reloads the file.
pickle (the module I am currently using for my test implementation) has the same "reload requirement" as shelve, and also does not overwrite previous data - if process A dumps fifteen empty strings onto a file, and then the string 'hello', process B will have to load the file sixteen times in order to get the 'hello' string. I am currently dealing with this problem by preceding any write operation with repeated reads until end of file ("wiping the slate clean before writing on it"), and by making every read operation repeated until end of file, but I feel there must be a better way.
My ideal module would behave as follows (with "A>>>" representing code executed by process A, and "B>>>" code executed by process B):
A>>> import imaginary_perfect_module as mod
B>>> import imaginary_perfect_module as mod
A>>> d = mod.load('a_file')
B>>> d = mod.load('a_file')
A>>> d
{}
B>>> d
{}
A>>> d[1] = 'this string is one'
A>>> d['ones'] = 1 #anydbm would sulk here
A>>> d['ones'] = 11
A>>> d['a dict'] = {'this dictionary' : 'is arbitrary', 42 : 'the answer'}
B>>> d['ones'] #shelve would raise a KeyError here, unless A had called d.sync() and B had reloaded d
11 #pickle (with different syntax) would have returned 1 here, and then 11 on next call
(etc. for B)
I could achieve this behaviour by creating my own module that uses pickle, and editing the dump and load behaviour so that they use the repeated reads I mentioned above - but I find it hard to believe that this problem has never occurred to, and been fixed by, more talented programmers before. Moreover, these repeated reads seem inefficient to me (though I must admit that my knowledge of operation complexity is limited, and it's possible that these repeated reads are going on "behind the scenes" in otherwise apparently smoother modules like shelve). Therefore, I conclude that I must be missing some code module that would solve the problem for me. I'd be grateful if anyone could point me in the right direction, or give advice about implementation.
Use the ZODB (the Zope Object Database) instead. Backed with ZEO it fulfills your requirements:
Transparent persistence for Python objects
ZODB uses pickles underneath so anything that is pickle-able can be stored in a ZODB object store.
Full ACID-compatible transaction support (including savepoints)
This means changes from one process propagate to all the other processes when they are good and ready, and each process has a consistent view on the data throughout a transaction.
ZODB has been around for over a decade now, so you are right in surmising this problem has already been solved before. :-)
The ZODB lets you plug in storages; the most common format is the FileStorage, which stores everything in one Data.fs file, with an optional blob storage for large objects.
Some ZODB storages are wrappers around others to add functionality; DemoStorage, for example, keeps changes in memory to facilitate unit testing and demonstration setups (restart and you have a clean slate again). BeforeStorage gives you a window in time, only returning data from transactions before a given point in time. The latter has been instrumental in recovering lost data for me.
ZEO is one such plugin; it introduces a client-server architecture. Using ZEO lets you access a given storage from multiple processes at a time; you won't need this layer if all you need is multi-threaded access from one process.
The same could be achieved with RelStorage, which stores ZODB data in a relational database such as PostgreSQL, MySQL or Oracle.
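A minimal single-process session might look like this (a sketch assuming the ZODB package is installed; the file name is arbitrary):
import ZODB, ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage('a_file.fs')
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()            # a persistent, dict-like root object

root['ones'] = 11                   # anything picklable can be stored
root['a dict'] = {'this dictionary': 'is arbitrary', 42: 'the answer'}
transaction.commit()                # make the changes durable and visible

connection.close()
db.close()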
For beginners, you can port your shelve databases to ZODB databases like this:
#!/usr/bin/env python
import shelve
import ZODB, ZODB.FileStorage
import transaction
from optparse import OptionParser
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

parser = OptionParser()
parser.add_option("-i", "--input", dest="in_file", default=False, help="original shelve database filename")
parser.add_option("-o", "--output", dest="out_file", default=False, help="new ZODB database filename")
parser.set_defaults()
options, args = parser.parse_args()

if not options.in_file or not options.out_file:
    print "Need input and output database filenames"
    exit(1)

db = shelve.open(options.in_file, writeback=True)
zstorage = ZODB.FileStorage.FileStorage(options.out_file)
zdb = ZODB.DB(zstorage)
zconnection = zdb.open()
newdb = zconnection.root()

for key, value in db.iteritems():
    print "Copying key: " + str(key)
    newdb[key] = value

transaction.commit()
I suggest using TinyDB; it's much better and simpler to use.
https://tinydb.readthedocs.io/en/stable/
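A minimal usage sketch (assuming the tinydb package is installed; it stores documents in a JSON file, so the same JSON caveats about arbitrary objects apply):
from tinydb import TinyDB, Query

db = TinyDB('db.json')                 # the backing JSON file
db.insert({'ones': 11, 'answer': 42})  # store a document
Item = Query()
print(db.search(Item.ones == 11))      # [{'ones': 11, 'answer': 42}]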