Alternative to Python Pickle less sensitive to refactoring - python

This is probably a relatively common concern but I haven't found any real answer so far.
Let's use an example to illustrate why Pickle is not satisfying here: I have a file which contains a class that takes a long time (tens of hours) to instantiate. Once instantiated, I then save it to a pickle file:
import pickle
import time
from foo.bar import baz
class MyClass:
    def __init__(self):
        self.message = baz.very_long_computation()
obj = MyClass()
with open('./obj.pickle', 'wb') as f:
    # pickle.dump needs the open file as its second argument
    pickle.dump(obj, f)
From a different file in the same repository, I want to load the instantiated object and use it, without having to wait:
import pickle
with open('./obj.pickle', 'rb') as f:
    obj = pickle.load(f)
print(obj.message)
What can happen sometimes is that I change, e.g., the baz module's location in the folder structure (without necessarily modifying its content); for instance, I move it from foo.bar.baz to qux.bar.baz.
Now, if I try to load obj.pickle again, Python will complain with a ModuleNotFoundError: No module named 'foo'.
It seems that whenever I make the slightest refactoring, I have to completely re-instantiate MyClass, involving this long waiting time. In practice this is terribly annoying, hence my two questions:
is there any alternative to Pickle that wouldn't force re-instantiating?
is there a way to avoid this problem while still using Pickle?
Note: I understand that in the above example, I would simply have to save self.message into a text file and call it a day. In reality, it would be complex to persist the object's content in such a manner.
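One way to keep using pickle after a move like this, following the Unpickler.find_class approach from the related threads below, is to translate old module paths at load time. This is only a minimal sketch: the MOVED_MODULES table and the load_renamed helper are hypothetical names, not part of pickle, and the approach only helps for classes and functions that the pickle stores by reference to the moved module.

```python
import pickle

# Hypothetical table of moved modules: old dotted path -> new dotted path.
MOVED_MODULES = {"foo.bar.baz": "qux.bar.baz"}


class RenamingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Look the module up under its new path if it has been moved.
        return super().find_class(MOVED_MODULES.get(module, module), name)


def load_renamed(path):
    """Load a pickle file, redirecting references to moved modules."""
    with open(path, "rb") as f:
        return RenamingUnpickler(f).load()
```

With this in place, the loading script would call load_renamed('./obj.pickle') instead of pickle.load, and each refactoring only needs a new entry in the table.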

Related

Change default multiprocessing unpickler class

I have a multiprocessing program on Device A which uses a queue and a SyncManager to make this accessible over the network. The queue stores a custom class from a module on the device which gets automatically pickled by the multiprocessing package as module.class.
On another device reading the queue via a SyncManager, I have the same module as part of a package instead of top-level as it was on Device A. This means I get a ModuleNotFoundError when I attempt to read an item from the queue as the unpickler doesn't know the module is now package.module.
I've seen this workaround, which uses a new class based on pickle.Unpickler and seems the least hacky and most extensible: https://stackoverflow.com/a/53327348/5683049
However, I don't know how to specify the multiprocessing unpickler class to use.
I see this can be done for the reducer class so I assume there is a way to also set the unpickler?
I have never seen a way to do this. You may have to hack around this. Let the multiprocessor system think you're passing byte strings or byte arrays, and have your user code perform the pickling and unpickling.
A hack? Yes. But not much worse than what you already have to do.
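The "pass bytes and pickle in user code" idea could look roughly like the sketch below. The helper names put_pickled and get_unpickled are made up for illustration, and a plain queue.Queue stands in for the manager-backed queue, so this shows the shape of the hack rather than a tested multiprocessing setup.

```python
import io
import pickle
import queue


def put_pickled(q, obj):
    """Serialize obj ourselves so the queue only ever sees plain bytes."""
    q.put(pickle.dumps(obj))


def get_unpickled(q, unpickler_cls=pickle.Unpickler):
    """Read bytes from the queue and decode them with the chosen Unpickler."""
    payload = q.get()
    return unpickler_cls(io.BytesIO(payload)).load()


# queue.Queue is only a stand-in for the real shared queue (an assumption).
q = queue.Queue()
put_pickled(q, {"job": 1})
print(get_unpickled(q))
```

Because the reading side chooses unpickler_cls, it can pass a pickle.Unpickler subclass that overrides find_class to rename modules.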
Using a mixture of:
How to change the serialization method used by the multiprocessing module?
https://stackoverflow.com/a/53327348/5683049
I was able to get this working using code similar to the following:
import io
import multiprocessing
import pickle
from multiprocessing.reduction import ForkingPickler, AbstractReducer
class RenameUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect the old top-level module to its new package location.
        renamed_module = module
        if module == "old_module_name":
            renamed_module = "new_package.module_name"
        return super(RenameUnpickler, self).find_class(renamed_module, name)
class MyForkingPickler(ForkingPickler):
    # Method signature from pickle._loads; loads is called on the class,
    # so it is a staticmethod whose first argument is the pickled data.
    @staticmethod
    def loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
              buffers=None):
        if isinstance(s, str):
            raise TypeError("Can't load pickle from unicode string")
        file = io.BytesIO(s)
        return RenameUnpickler(file, fix_imports=fix_imports, buffers=buffers,
                               encoding=encoding, errors=errors).load()
class MyPickleReducer(AbstractReducer):
    ForkingPickler = MyForkingPickler
    register = MyForkingPickler.register
# Install the reducer only after the classes above have been defined.
multiprocessing.context._default_context.reducer = MyPickleReducer()
This could be useful if you want to further override how the unpickling is performed, but in my original case it is probably just easier to redirect the module using:
import sys
from new_package import module_name
sys.modules['old_module_name'] = module_name

How to save my class to load in another environment

I conceptually want to do something easy: save a python object that I can access from another (different) program later.
But the problem is that it has a wrapper around it (the f(X) below) that is not being referenced in new environments.
After spending a whole 12 hours, I feel even more confused than when I started. I think "pickle"-ing or "dill" etc. is what I am supposed to do, but I keep running up against the pickling problem, and reading online is getting me nowhere. (By the way, I tried shap.save, but it has the same kind of problems and uses pickle anyway.)
import shap, pickle
model = ...  # (some tensorflow function)
def f(X):
    ...
    return model.predict(...).flatten()
explainer = shap.KernelExplainer(f, X.iloc[:50, :])
with open(f"/tmp/{file}.pkl", 'wb') as fil:
    # explainer.save(fil)
    pickle.dump(explainer, fil)
This does not work because it "cannot find attribute 'f'". These look like the most promising articles I could find, but I could not implement them for my scenario.
http://gael-varoquaux.info/programming/decoration-in-python-done-right-decorating-and-pickling.html
Unable to load files using pickle and multiple modules
https://github.com/slundberg/shap/issues/295
Python: Can't pickle type X, attribute lookup failed
*** please provide suggestions on terminology in the comments for me to improve how I ask question because I do not know how to word my prompt.
Usually, in any programming language, you don't save the object itself; instead, you save the attributes of the object into a file. Then you write a function that reads the file and fills the values into your object.
You can make your object into an installable package and import the package as an object into your program.
import pickle
class A(object):
    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2
    def dump_obj(self, fn):
        # fn is a file path; pickle.dump needs an open binary file object
        with open(fn, 'wb') as file:
            pickle.dump(self, file, pickle.HIGHEST_PROTOCOL)
Then, in the file where you want to load your object and feed the data into it:
import pickle
from file_with_object_A import A
def load_A(fn):
    with open(fn, 'rb') as file:
        new_a = pickle.load(file)
    # or alternatively, you can take each attribute from the loaded data and feed it to a new object
    var1, var2 = new_a.var1, new_a.var2
    my_new_A = A(var1, var2)
    return my_new_A
Remember to import the file that defines the function f(X) into your new code, either as a package or by keeping the file that contains f(X) in the same directory as the current working file.
You can check out this answer: Saving an Object (Data persistence)
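To make the "save the attributes, not the object" advice concrete, here is a small sketch that writes the attributes of the class A from above to JSON and rebuilds the object from them. The save and load method names are my own choice, and it assumes the attributes are JSON-friendly values; it will not cover something like a SHAP explainer as-is.

```python
import json


class A:
    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2

    def save(self, path):
        """Write only the attributes, not the object itself."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"var1": self.var1, "var2": self.var2}, fh)

    @classmethod
    def load(cls, path):
        """Read the attributes back and feed them to a fresh object."""
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        return cls(data["var1"], data["var2"])
```

Since the file holds plain data, the loading program only needs a compatible A class; it does not need the original module layout.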

More on python ImportError No module named

Following the suggestion here, my package (or the directory containing my modules) is located at C:/Python34/Lib/site-packages. The directory contains an __init__.py and sys.path contains a path to the directory as shown.
Still I am getting the following error:
Traceback (most recent call last):
  File "C:/Python34/Lib/site-packages/toolkit/window.py", line 6, in <module>
    from catalogmaker import Catalog
  File "C:\Python34\Lib\site-packages\toolkit\catalogmaker.py", line 1, in <module>
    from patronmaker import Patron
  File "C:\Python34\Lib\site-packages\toolkit\patronmaker.py", line 4, in <module>
    class Patron:
  File "C:\Python34\Lib\site-packages\toolkit\patronmaker.py", line 11, in Patron
    patrons = pickle.load(f)
ImportError: No module named 'Patron'
I have a class in patronmaker.py named 'Patron' but no module named Patron so I am not sure what the last statement in the error message means. I very much appreciate your thoughts on what I am missing.
Python Version 3.4.1 on a Windows 32 bits machine.
You are saving all patron instances (i.e. self) to the Patron class attribute Patron.patrons. Then you are trying to pickle a class attribute from within the class. This can choke pickle, however I believe dill should be able to handle it. Is it really necessary to save all the class instances to a list in Patrons? It's a bit of an odd thing to do…
pickle serializes classes by reference, and doesn't play well with __main__ for many objects. In dill, you don't have to serialize classes by reference, and it handles issues with __main__ much better. Get dill here: https://github.com/uqfoundation
Edit:
I tried your code (with one minor change) and it worked.
dude@hilbert>$ python patronmaker.py
Then start python…
>>> import dill
>>> f = open('patrons.pkl', 'rb')
>>> p = dill.load(f)
>>> p
[Julius Caeser, Kunte Kinta, Norton Henrich, Mother Teresa]
The only change I made was to uncomment the lines at the end of patronmaker.py so that it saved some patrons…. and I also replaced import pickle with import dill as pickle everywhere.
So, even by downloading and running your code, I can't produce an error with dill. I'm using the latest dill from github.
Additional Edit:
Your traceback above is from an ImportError. Did you install your module? If you didn't use setup.py to install it, or if you don't have your module on your PYTHONPATH, then you won't find your module regardless of how you are serializing things.
Even more edits:
Looking at your code, you should be using the singleton pattern for patrons… it should not be inside the class Patron. The block of code at the class level to load the patrons into Patron.patrons is sure to cause problems… and probably bound to be the source of some form of errors. I also see that you are pickling the attribute Patrons.patrons (not even the class itself) from inside the Patrons class -- this is madness -- don't do it. Also notice that when you are trying to obtain the patrons, you use Patron.patrons… this is calling the class object and not an instance. Move patrons outside of the class, and use the singleton directly as a list of patrons. Also you should typically be using the patrons instance, so if you wanted to have each patron know who all the other patrons are, p = Patron('Joe', 'Blow'), then p.patrons to get all patrons… but you'd need to write a Patrons.load method that reads the singleton list of patrons… you could also use a property to make the load give you something that looks like an attribute.
If you build a singleton of patrons (as a list)… or a "registry" of patrons (as a dict) if you like, then just check if a patrons pickle file exists… to load to the registry… and don't do it from inside the Patrons class… things should go much better. Your code currently is trying to load a class instance on a class definition while it builds that class object. That's bad...
Also, don't expect people to go downloading your code and debugging it for you, when you don't present a minimal test case or sufficient info for how the traceback was created.
You may have hit on a valid pickling error in dill for some dark corner case, but I can't tell b/c I can't reproduce your error. However, I can tell that you need some refactoring.
And just to be explicit:
Move your patrons initializing mess from Patrons into a new file patrons.py
import os
import dill as pickle
#Initialize patrons with saved pickle data
if os.path.isfile('patrons.pkl'):
    with open("patrons.pkl", 'rb') as f:
        patrons = pickle.load(f)
else:
    patrons = []
Then in patronmaker.py, and everywhere else you need the singleton…
import dill as pickle
import os.path
import patrons as the
class Patron:
    def __init__(self, lname, fname):
        self.lname = lname.title()
        self.fname = fname.title()
        self.terrCheckedOutHistory = {}
        #Add any created Patron to patrons list
        the.patrons.append(self)
        #Preserve this person via pickle
        with open('patrons.pkl', 'wb') as f:
            pickle.dump(the.patrons, f)
And you should be fine unless your code is hitting one of the cases where attributes on modules can't be serialized because they were added dynamically (see https://github.com/uqfoundation/dill/pull/47), which should definitely make pickle fail, and in some cases dill too… probably with an AttributeError on the module. I just can't reproduce this… and I'm done.
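The earlier remark about using a property so that p.patrons looks like an attribute could be sketched like this. It keeps everything in one file for brevity and leaves out the pickling step; the _patrons name is an assumption standing in for the module-level list in patrons.py.

```python
_patrons = []  # module-level registry, kept outside the class


class Patron:
    def __init__(self, lname, fname):
        self.lname = lname.title()
        self.fname = fname.title()
        # Register every new patron in the shared list.
        _patrons.append(self)

    @property
    def patrons(self):
        """Read-only view of the shared registry, exposed like an attribute."""
        return tuple(_patrons)


p = Patron("blow", "joe")
print(p in p.patrons)
```

Loading from and saving to patrons.pkl would then happen in module-level code or helper functions, not inside the class body.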

Unpickling a function into a different context in Python

I have written a Python interface to a process-centric job distribution system we're developing/using internally at my workplace. While reasonably skilled programmers, the primary people using this interface are research scientists, not software developers, so ease-of-use and keeping the interface out of the way to the greatest degree possible is paramount.
My library unrolls a sequence of inputs into a sequence of pickle files on a shared file server, then spawns jobs that load those inputs, perform the computation, pickle the results, and exit; the client script then picks back up and produces a generator that loads and yields the results (or rethrows any exception the calculation function did.)
This is only useful since the calculation function itself is one of the serialized inputs. cPickle is quite content to pickle function references, but requires the pickled function to be reimportable in the same context. This is problematic. I've already solved the problem of finding the module to reimport it, but the vast majority of the time, it is a top-level function that is pickled and, thus, does not have a module path. The only strategy I've found to be able to unpickle such a function on the computation nodes is this nauseating little approach towards simulating the original environment in which the function was pickled before unpickling it:
...
# At this point, we've identified the source of the target function.
# A string by its name lives in "modname".
# In the real code, there is significant try/except work here.
targetModule = __import__(modname)
globalRef = globals()
for thingie in dir(targetModule):
    if thingie not in globalRef:
        globalRef[thingie] = targetModule.__dict__[thingie]
# sys.argv[2]: the path to the pickle file common to all jobs, which contains
# any data in common to all invocations of the target function, then the
# target function itself
commonFile = open(sys.argv[2], "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
# the actual function unpack I'm having trouble with:
doIt = commonUnpickle.load()
The final line is the most important one here- it's where my module is picking up the function it should actually be running. This code, as written, works as desired, but directly manipulating the symbol tables like this is unsettling.
How can I do this, or something very much like this that does not force the research scientists to separate their calculation scripts into a proper class structure (they use Python like the most excellent graphing calculator ever and I would like to continue to let them do so) the way Pickle desperately wants, without the unpleasant, unsafe, and just plain scary __dict__-and-globals() manipulation I'm using above? I fervently believe there has to be a better way, but exec "from {0} import *".format("modname") didn't do it, several attempts to inject the pickle load into the targetModule reference didn't do it, and eval("commonUnpickle.load()", targetModule.__dict__, locals()) didn't do it. All of these fail with Unpickle's AttributeError over being unable to find the function in <module>.
What is a better way?
Pickling functions can be rather annoying if trying to move them into a different context. If the function does not reference anything from the module that it is in and references (if anything) modules that are guaranteed to be imported, you might check some code from a Rudimentary Database Engine found on the Python Cookbook.
In order to support views, the academic module grabs the code from the callable when pickling the query. When it comes time to unpickle the view, a LambdaType instance is created with the code object and a reference to a namespace containing all imported modules. The solution has limitations but worked well enough for the exercise.
Example for Views
import sys
import types
class _View:
    def __init__(self, database, query, *name_changes):
        "Initializes _View instance with details of saved query."
        self.__database = database
        self.__query = query
        self.__name_changes = name_changes
    def __getstate__(self):
        "Returns everything needed to pickle _View instance."
        return self.__database, self.__query.__code__, self.__name_changes
    def __setstate__(self, state):
        "Sets the state of the _View instance when unpickled."
        database, query, name_changes = state
        self.__database = database
        # Rebuild the callable from its code object, with sys.modules as its globals.
        self.__query = types.LambdaType(query, sys.modules)
        self.__name_changes = name_changes
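For reference, the same "ship the code object and rebuild the function" idea can be tried outside the recipe with marshal from the standard library. This is my own sketch, not the recipe's code: dump_function and load_function are made-up names, closures and default arguments are not carried over, and marshal data is only meant to be read by the same Python version that wrote it.

```python
import builtins
import marshal
import math
import types


def dump_function(func):
    """Serialize only the function's code object (no closure, defaults, or globals)."""
    return marshal.dumps(func.__code__)


def load_function(payload, namespace=None):
    """Rebuild a function from marshalled code, using the given names as its globals."""
    globs = {"__builtins__": builtins}
    globs.update(namespace or {})
    return types.FunctionType(marshal.loads(payload), globs)


# The rebuilt function looks up "math" in the namespace we hand it.
payload = dump_function(lambda x: 2 * math.sqrt(x))
double_root = load_function(payload, {"math": math})
print(double_root(9.0))  # 6.0
```

The namespace argument plays the role that sys.modules plays in __setstate__ above: whatever global names the function uses have to be supplied by the loading side.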
Sometimes it appears necessary to modify the registered modules available in the system. If, for example, you need to reference the first module (__main__), you may need to create a new module object and load your available namespace into it. The same recipe used the following technique.
Example for Modules
def test_northwind():
    "Loads and runs some test on the sample Northwind database."
    import os, sys, types
    # Patch the module namespace to recognize this file.
    name = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    # types.ModuleType replaces imp.new_module (imp was removed in Python 3.12).
    module = types.ModuleType(name)
    vars(module).update(globals())
    sys.modules[name] = module
Your question was long, and I was too caffeinated to make it through your very long question… However, I think you are looking to do something that there's a pretty good existing solution for already. There's a fork of the parallel python (i.e. pp) library that takes functions and objects and serializes them, sends them to different servers, and then unpickles and executes them. The fork lives inside the pathos package, but you can download it independently here:
http://danse.cacr.caltech.edu/packages/dev_danse_us
The "other context" in that case is another server… and the objects are transported by converting the objects to source code and then back to objects.
If you are looking to use pickling, much in the way you are doing already, there's an extension to mpi4py that serializes arguments and functions, and returns pickled return values… The package is called pyina, and is commonly used to ship code and objects to cluster nodes in coordination with a cluster scheduler.
Both pathos and pyina provide map abstractions (and pipe), and try to hide all of the details of parallel computing behind the abstractions, so scientists don't need to learn anything except how to program normal serial python. They just use one of the map or pipe functions, and get parallel or distributed computing.
Oh, I almost forgot. The dill serializer includes dump_session and load_session functions that allow the user to easily serialize their entire interpreter session and send it to another computer (or just save it for later use). That's pretty handy for changing contexts, in a different sense.
Get dill, pathos, and pyina here: https://github.com/uqfoundation
For a module to be recognized as loaded I think it must be in sys.modules, not just have its contents imported into your global/local namespace. Try to exec everything, then get the result out of an artificial environment.
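def load_target(modname, fn):
    env = {"fn": fn}
    code = """\
import %s # maybe more
import cPickle
commonFile = open(fn, "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
doIt = commonUnpickle.load()
""" % modname
    # Python 2 syntax, matching cPickle; in Python 3 this would be exec(code, env)
    exec code in env
    return env["doIt"]
doIt = load_target(modname, sys.argv[2])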
While functions are advertised as first-class objects in Python, this is one case where it can be seen that they are really second-class objects. It is the reference to the callable, not the object itself, that is pickled. (You cannot directly pickle a lambda expression.)
There is an alternate usage of __import__ that you might prefer:
import sys
from functools import partial
def importer(modulename, symbols=None):
    u"importer('foo') returns module foo; importer('foo', ['bar']) returns {'bar': object}"
    if modulename in sys.modules:
        module = sys.modules[modulename]
    else:
        module = __import__(modulename, fromlist=['*'])
    if symbols is None:
        return module
    return dict(zip(symbols, map(partial(getattr, module), symbols)))
So these would all be basically equivalent:
from mymodule.mysubmodule import myfunction
myfunction = importer('mymodule.mysubmodule').myfunction
globals()['myfunction'] = importer('mymodule.mysubmodule', ['myfunction'])['myfunction']
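On Python 3 the same lookup is usually written with importlib.import_module, which returns the named submodule directly, so the fromlist trick is not needed. A small sketch, where resolve is just an illustrative helper name:

```python
import importlib


def resolve(module_name, attr_name):
    """Import module_name if needed and return the named attribute from it."""
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Roughly equivalent to: from os.path import join
join = resolve("os.path", "join")
print(join("a", "b"))
```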

Python: Using `copyreg` to define reducers for types that already have reducers

(Keep in mind I'm working in Python 3, so a solution needs to work in Python 3.)
I would like to use the copyreg module to teach Python how to pickle functions. When I tried to do it, the _Pickler object would still try to pickle functions using the save_global function. (Which doesn't work for unbound methods, and that's the motivation for doing this.)
It seems like _Pickler first tries to look in its own dispatch for the type of the object that you want to pickle before looking in copyreg.dispatch_table. I'm not sure if this is intentional.
Is there any way for me to tell Python to pickle functions with the reducer that I provide?
The following hack seems to work in Python 3.1...:
import copyreg
def functionpickler(f):
    print('pickling', f.__name__)
    return f.__name__
ft = type(functionpickler)
copyreg.pickle(ft, functionpickler)
import pickle
pickle.Pickler = pickle._Pickler
del pickle.Pickler.dispatch[ft]
s = pickle.dumps(functionpickler)
print('Result is', s)
Out of this, the two hackish lines are:
pickle.Pickler = pickle._Pickler
del pickle.Pickler.dispatch[ft]
You need to remove the dispatch entry for functions' type because otherwise it preempts the copyreg registration; and I don't think you can do that on the C-coded Pickler so you need to set it to the Python-coded one.
It would be a bit less of a hack to subclass _Pickler with a class of your own which makes its own dispatch (copying the parent's and removing the entry for the function type), and then use your subclass specifically (and its dump method) rather than pickle.dump; however, it would also be a bit less convenient than this monkeypatching of pickle itself.
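The subclassing variant might look like the sketch below. It still leans on the private pure-Python pickle._Pickler class and its dispatch dict, so treat it as an illustration that may break between versions; function_reducer, FunctionPickler, and dumps_with_reducer are names chosen for this example. On Python 3.8 and later, the public Pickler.reducer_override hook is probably the cleaner route for customizing how functions are pickled.

```python
import copyreg
import io
import pickle
import types

pickled_names = []  # records which functions went through our reducer


def function_reducer(f):
    pickled_names.append(f.__name__)
    # Returning a string asks pickle to store the object as a global of that name.
    return f.__name__


class FunctionPickler(pickle._Pickler):
    # Private copy of the type dispatch without the built-in function handler,
    # so the lookup falls through to the dispatch_table below.
    dispatch = pickle._Pickler.dispatch.copy()
    del dispatch[types.FunctionType]
    # Per-class reducer table; the global copyreg table is left untouched.
    dispatch_table = copyreg.dispatch_table.copy()
    dispatch_table[types.FunctionType] = function_reducer


def dumps_with_reducer(obj):
    """Pickle obj to bytes using FunctionPickler instead of pickle.dumps."""
    buf = io.BytesIO()
    FunctionPickler(buf).dump(obj)
    return buf.getvalue()
```

The result is an ordinary pickle, so pickle.loads can read it; only the dumping side needs the subclass.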
