My original issue is that I am trying to do the following:
def submit_decoder_process(decoder, input_line):
    decoder.process_line(input_line)
    return decoder

self.pool = Pool(processes=num_of_processes)
self.pool.apply_async(submit_decoder_process, [decoder, input_line]).get()
decoder is a bit involved to describe here, but the important thing is that it is an object initialized with a PyParsing expression on which setParseAction() has been called. This fails the pickling that multiprocessing relies on, which in turn breaks the code above.
Now, here is the pickle/PyParsing problem that I have isolated and simplified.
The following code yields an error message due to pickle failure.
import pickle
from pyparsing import *

def my_pa_func():
    pass

pickle.dumps(Word(nums).setParseAction(my_pa_func))
Error message:
pickle.PicklingError: Can't pickle <function wrapper at 0x00000000026534A8>: it's not found as pyparsing.wrapper
Now, if you remove the call to .setParseAction(my_pa_func), it works with no problems:
pickle.dumps(Word(nums))
How can I get around it? Multiprocessing uses pickle, so I can't avoid it, I guess. The pathos package, which supposedly uses dill, is not mature enough; at least, I am having problems installing it on my 64-bit Windows machine. I am really scratching my head here.
OK, here is the solution inspired by rocksportrocker: Python multiprocessing pickling error
The idea is to dill the object that can't be pickled while passing it back and forth between processes and then "undill" it after it has been passed:
from multiprocessing import Pool
import dill
def submit_decoder_process(decoder_dill, input_line):
    decoder = dill.loads(decoder_dill)  # undill after it was passed to a pool process
    decoder.process_line(input_line)
    return dill.dumps(decoder)  # dill before passing back to parent process
self.pool = Pool(processes=num_of_processes)
# Dill before sending to a pool process
decoder_processed = dill.loads(self.pool.apply_async(submit_decoder_process, [dill.dumps(decoder), input_line]).get())
https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
multiprocessing.Pool uses the pickle protocol to serialize the function and module names (in your example, setParseAction and pyparsing), which are delivered through a Pipe to the child process.
The child process, once it receives them, imports the module and tries to call the function. The problem is that what you're passing is not a function but a method. To resolve it, the pickle protocol would have to be clever enough to build the Word object with its constructor arguments and then call the setParseAction method on it. As handling these cases is too complicated, the pickle protocol prevents you from serializing anything that is not a top-level function.
To solve your issue, either instruct the pickle module on how to serialize the setParseAction method (https://docs.python.org/2/library/pickle.html#pickle-protocol) or refactor your code so that whatever is passed to Pool.apply_async is serializable.
What if you pass the Word object to the child process and let the child call setParseAction() on it?
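A rough sketch of that idea (the worker function and the toy parse action below are illustrative, not from the original code): the bare pyparsing expression pickles fine, and the parse action is attached only inside the worker.

from multiprocessing import Pool
from pyparsing import Word, nums

def my_pa_func(tokens):
    # toy parse action: convert the matched digits to int
    return [int(t) for t in tokens]

def worker(expr, line):
    expr.setParseAction(my_pa_func)        # attached after unpickling, in the child
    return expr.parseString(line).asList()

if __name__ == "__main__":
    pool = Pool(processes=2)
    expr = Word(nums)                      # no parse action yet, so it pickles
    print(pool.apply_async(worker, [expr, "123"]).get())   # [123]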
I'd suggest pathos.multiprocessing, as you mention. Of course, I'm the pathos author, so I guess that's not a surprise. It appears that there might be a distutils bug that you are running into, as referenced here: https://github.com/uqfoundation/pathos/issues/49.
Your solution using dill is a good workaround. You also might be able to forgo installing the entire pathos package and just install the pathos fork of the multiprocessing package (which uses dill instead of pickle). You can find it here: http://dev.danse.us/packages or here: https://github.com/uqfoundation/pathos/tree/master/external.
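If that fork works for you, here is a minimal sketch of what the original code could shrink to, assuming the fork installs under the name multiprocess (as the dill-backed fork on PyPI does today) and using a stand-in Decoder class since the real one isn't shown:

from multiprocess import Pool   # dill-backed drop-in for multiprocessing

class Decoder(object):
    """Stand-in for the question's decoder object."""
    def __init__(self):
        self.lines = []
    def process_line(self, line):
        self.lines.append(line)

def submit_decoder_process(decoder, input_line):
    decoder.process_line(input_line)
    return decoder

if __name__ == "__main__":
    pool = Pool(processes=2)
    result = pool.apply_async(submit_decoder_process, [Decoder(), "1 2 3"]).get()
    print(result.lines)   # no manual dill.dumps/dill.loads round-trip needed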
Related
Is there any way to test a pickle file to see if it loads a function or class during unpickling?
This gives a good summary of how to stop loading of selected functions:
https://docs.python.org/3/library/pickle.html#restricting-globals
I assume it could be used to check if there is function loading at all, by simply blocking all function loading and getting an error message.
But is there a way to write a function that will simply say: there is only text data in this pickled object and no function loading?
I can't say I know which builtins are safe!
Basically no, there is truly no way. There is a lot written about this. You can only use pickle if you trust the source, and you get the pickle directly from the source.
Any safety measures you perform are not sufficient to protect against malicious attempts whatsoever.
https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9
https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html
etcetera.
I use it sometimes, but mostly after I have had a phone call with a colleague and we have shared a pickled file. More often, I use it for myself in my local environment to store data. Still, this is not the preferred way, but it's fast.
So, when in doubt: do not use pickle.
Thanks for the answers!
Straight pickle is too prone to security issues.
Picklemagic claims to fix the security issues; it looks like it does, but I can't quite confirm that: http://github.com/CensoredUsername/picklemagic
But https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9 suggests that there is no safe wrapper (picklemagic has been around for 8 years, and the article dates from 2021; so was picklemagic simply not considered?)
The only surefire way protect against Pickle Bombs is not to use pickle directly. Unfortunately unlike other unsafe standard library packages there are no safe wrappers or drop-in alternatives available for pickle, like defusedxml for xml or tarsafe for tarfile. Further there’s not a great way to inspect a pickle prior to unpickling or to block unsafe function calls invoked by REDUCE.
The 3.10 docs do offer a wrapper for blocking unauthorized execution of a function. https://docs.python.org/3.10/tutorial/controlflow.html
It does not say which builtins are safe. If os is removed, are the others safe? Still, if it is clear what is supposed to be in the pickled object, it may be easy enough to restrict execution.
import builtins
import io
import pickle

safe_builtins = {
    'range',
    'complex',
    'set',
    'frozenset',
    'slice',
}

class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

def restricted_loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()
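A quick, illustrative check of the recipe above: plain data round-trips, while a pickle that references a callable such as os.system is rejected.

print(restricted_loads(pickle.dumps({"numbers": [1, 2, 3], "text": "safe"})))

class Evil(object):
    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))

try:
    restricted_loads(pickle.dumps(Evil()))
except pickle.UnpicklingError as exc:
    print(exc)   # e.g. global 'posix.system' is forbidden ('nt.system' on Windows)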
I have a multiprocessing program on Device A which uses a queue and a SyncManager to make this accessible over the network. The queue stores a custom class from a module on the device which gets automatically pickled by the multiprocessing package as module.class.
On another device reading the queue via a SyncManager, I have the same module as part of a package instead of top-level as it was on Device A. This means I get a ModuleNotFoundError when I attempt to read an item from the queue as the unpickler doesn't know the module is now package.module.
I've seen this work-around which uses a new class based on pickler.Unpicker and seems the least hacky and extensible: https://stackoverflow.com/a/53327348/5683049
However, I don't know how to specify the multiprocessing unpickler class to use.
I see this can be done for the reducer class so I assume there is a way to also set the unpickler?
I have never seen a way to do this. You may have to hack around this. Let the multiprocessor system think you're passing byte strings or byte arrays, and have your user code perform the pickling and unpickling.
A hack? Yes. But not much worse than what you already have to do.
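A rough sketch of that hack (the module names below are placeholders): serialize and deserialize in user code so the queue only ever carries bytes, which the default multiprocessing pickler passes through untouched.

import io
import pickle

class RenameUnpickler(pickle.Unpickler):
    # remap the old top-level module to its new package location (illustrative names)
    def find_class(self, module, name):
        if module == "old_module_name":
            module = "new_package.old_module_name"
        return super(RenameUnpickler, self).find_class(module, name)

def put_item(queue, obj):
    queue.put(pickle.dumps(obj))          # bytes go in

def get_item(queue):
    data = queue.get()                    # bytes come out
    return RenameUnpickler(io.BytesIO(data)).load()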
Using a mixture of:
How to change the serialization method used by the multiprocessing module?
https://stackoverflow.com/a/53327348/5683049
I was able to get this working using code similar to the following:
import io
import pickle
import multiprocessing
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class RenameUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        renamed_module = module
        if module == "old_module_name":
            renamed_module = "new_package.module_name"
        return super(RenameUnpickler, self).find_class(renamed_module, name)

class MyForkingPickler(ForkingPickler):
    # Signature mirrors pickle._loads
    @staticmethod
    def loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
              buffers=None):
        if isinstance(s, str):
            raise TypeError("Can't load pickle from unicode string")
        file = io.BytesIO(s)
        return RenameUnpickler(file, fix_imports=fix_imports, buffers=buffers,
                               encoding=encoding, errors=errors).load()

class MyPickleReducer(AbstractReducer):
    ForkingPickler = MyForkingPickler
    register = MyForkingPickler.register

# Install the custom reducer so pools use the renaming unpickler
multiprocessing.context._default_context.reducer = MyPickleReducer()
This could be useful if you want to further override how the unpickling is performed, but in my original case it is probably just easier to redirect the module using:
import sys
from new_package import module_name

sys.modules['old_module_name'] = module_name
I'm working on developing a little IRC client in Python (2.7). I had hoped to use multiprocessing to read from all the servers I'm currently connected to, but I'm running into an issue:
import socket
import multiprocessing as mp
import types
import copy_reg
import pickle

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()
Whenever it hits the line pool.map(self.method, self.servers.keys()) I get the error
TypeError: expected string or Unicode object, NoneType found
From what I've read this is what happens when I try to pickle something that isn't picklable. To resolve this I first made the _pickle_method and _unpickle_method as described here. Then I realized that I was (originally) trying to pass pool.map() a list of sockets (very not picklable) so I changed it to the list of hostnames, as strings can be pickled. I still get this error, however.
I then tried calling pickle.dumps() directly on self.method, self.servers.keys(), and self.servers.keys()[0]. As expected it worked fine for the latter two, but from the first I get
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.
Some more research lead me to this question, which seems to indicate that the issue is with the use of sockets (and gnibbler's answer to that question would seem to confirm it).
Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read, pathos.multiprocessing might be what I need, but I'd really like to stick to the standard library if at all possible.
I'm also not set on using multiprocessing - if multithreading would work better and avoid this issue then I'm more than open to those solutions.
Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.
In more detail:
Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.
You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket; it's just a wrapper around a type defined in a C extension module, which has its own problems with pickling. You might be able to work around that as well…
But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to re-construct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.
One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.
A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.
You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.
If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install off PyPI. It includes a ThreadPoolExecutor which is a slightly higher-level abstraction around a pool which may be simpler to use. But Pool should also work fine for you here, and you've already written the code.
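For completeness, a minimal sketch of the thread-based route (handle_host is a placeholder for your per-server work): the only change relative to your code is where Pool is imported from, and since threads share memory the sockets never need to be pickled.

from multiprocessing.dummy import Pool   # thread-backed pool, same API as multiprocessing.Pool

def handle_host(hostname):
    # placeholder for per-server work; in the real code this would use
    # self.servers[hostname], which stays in the one shared process
    return "handled " + hostname

pool = Pool()
results = pool.map(handle_host, ["example.com", "example.org"])
pool.close()
pool.join()
print(results)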
If you do want to try jumping out of the standard library, then the following code for pathos.multiprocessing (as you mention) should not throw pickling errors, as the dill serializer knows how to serialize sockets and file handles.
>>> import socket
>>> import pathos.multiprocessing as mp
>>> import types
>>> import dill as pickle
>>>
>>> class a(object):
...     def __init__(self):
...         sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
...         sock1.connect((socket.gethostbyname("example.com"), 6667))
...         self.servers = {}
...         self.servers["example.com"] = sock1
...     def method(self, hostname):
...         self.servers[hostname].send("JOIN DAN\r\n")
...         print "1"
...     def oth_method(self):
...         pool = mp.ProcessingPool()
...         pool.map(self.method, self.servers.keys())
...         pool.close()
...         pool.join()
...
>>> b = a()
>>> b.oth_method()
One issue, however, is that multiprocessing requires serialization, and in many cases a socket will serialize in such a way that the deserialized socket is closed. The reason is primarily that the file descriptor isn't copied as expected; it's copied by reference. With dill you can customize the serialization of file handles so that the content does get transferred rather than a reference… however, this doesn't translate well to a socket (at least at the moment).
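For ordinary file handles, a small sketch of that knob (the settings key and constant below are dill's, but treat the exact mode semantics as an assumption to verify against your dill version):

import dill

with open("demo.txt", "w") as f:
    f.write("hello")

fh = open("demo.txt", "r")

dill.settings['fmode'] = dill.FILE_FMODE   # store the file's contents in the pickle,
                                           # not just a reference to its path/descriptor
payload = dill.dumps(fh)
fh.close()

fh2 = dill.loads(payload)
print(fh2.read())                          # the content traveled with the pickle
fh2.close()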
I'm the dill and pathos author, and I'd have to agree with @abarnert that you probably don't want to do this with multiprocessing (at least not storing a map of servers and sockets). If you want to use multiprocessing's threading interface, and you find you run into any serialization concerns, pathos.multiprocessing does have mp.ThreadingPool() instead of mp.ProcessingPool(), so that you can access a wrapper around multiprocessing.dummy.Pool, but still get the additional features that pathos provides (such as multi-argument pools for blocking or asynchronous pipes and maps, etc).
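A tiny sketch of that swap (the lambda is just a stand-in for real per-host work):

>>> import pathos.multiprocessing as mp
>>> pool = mp.ThreadingPool()
>>> # threads share memory, so unpicklable members such as sockets never get serialized;
>>> # pathos maps also accept multiple argument sequences
>>> pool.map(lambda host, port: "%s:%d" % (host, port),
...          ["example.com", "example.org"], [6667, 6697])
['example.com:6667', 'example.org:6697']
>>> pool.close(); pool.join()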
I have written a Python interface to a process-centric job distribution system we're developing/using internally at my workplace. While reasonably skilled programmers, the primary people using this interface are research scientists, not software developers, so ease-of-use and keeping the interface out of the way to the greatest degree possible is paramount.
My library unrolls a sequence of inputs into a sequence of pickle files on a shared file server, then spawns jobs that load those inputs, perform the computation, pickle the results, and exit; the client script then picks back up and produces a generator that loads and yields the results (or re-raises any exception the calculation function threw).
This is only useful because the calculation function itself is one of the serialized inputs. cPickle is quite content to pickle function references, but requires the pickled function to be reimportable in the same context. This is problematic. I've already solved the problem of finding the module in order to reimport it, but the vast majority of the time it is a top-level function that is pickled and, thus, does not have a module path. The only strategy I've found for unpickling such a function on the computation nodes is this nauseating little approach to simulating the original environment in which the function was pickled before unpickling it:
...
# At this point, we've identified the source of the target function.
# A string by its name lives in "modname".
# In the real code, there is significant try/except work here.
targetModule = __import__(modname)
globalRef = globals()
for thingie in dir(targetModule):
    if thingie not in globalRef:
        globalRef[thingie] = targetModule.__dict__[thingie]
# sys.argv[2]: the path to the pickle file common to all jobs, which contains
# any data in common to all invocations of the target function, then the
# target function itself
commonFile = open(sys.argv[2], "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
# the actual function unpack I'm having trouble with:
doIt = commonUnpickle.load()
The final line is the most important one here- it's where my module is picking up the function it should actually be running. This code, as written, works as desired, but directly manipulating the symbol tables like this is unsettling.
How can I do this, or something very much like this that does not force the research scientists to separate their calculation scripts into a proper class structure (they use Python like the most excellent graphing calculator ever and I would like to continue to let them do so) the way Pickle desperately wants, without the unpleasant, unsafe, and just plain scary __dict__-and-globals() manipulation I'm using above? I fervently believe there has to be a better way, but exec "from {0} import *".format("modname") didn't do it, several attempts to inject the pickle load into the targetModule reference didn't do it, and eval("commonUnpickle.load()", targetModule.__dict__, locals()) didn't do it. All of these fail with Unpickle's AttributeError over being unable to find the function in <module>.
What is a better way?
Pickling functions can be rather annoying if trying to move them into a different context. If the function does not reference anything from the module that it is in and references (if anything) modules that are guaranteed to be imported, you might check some code from a Rudimentary Database Engine found on the Python Cookbook.
In order to support views, the academic module grabs the code from the callable when pickling the query. When it comes time to unpickle the view, a LambdaType instance is created with the code object and a reference to a namespace containing all imported modules. The solution has limitations but worked well enough for the exercise.
Example for Views
class _View:

    def __init__(self, database, query, *name_changes):
        "Initializes _View instance with details of saved query."
        self.__database = database
        self.__query = query
        self.__name_changes = name_changes

    def __getstate__(self):
        "Returns everything needed to pickle _View instance."
        return self.__database, self.__query.__code__, self.__name_changes

    def __setstate__(self, state):
        "Sets the state of the _View instance when unpickled."
        database, query, name_changes = state
        self.__database = database
        self.__query = types.LambdaType(query, sys.modules)
        self.__name_changes = name_changes
Sometimes it appears necessary to make modifications to the registered modules available in the system. If, for example, you need to refer to the first module (__main__), you may need to create a new module object with your current namespace loaded into it. The same recipe used the following technique.
Example for Modules
def test_northwind():
    "Loads and runs some test on the sample Northwind database."
    import os, imp
    # Patch the module namespace to recognize this file.
    name = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    module = imp.new_module(name)
    vars(module).update(globals())
    sys.modules[name] = module
Your question was long, and I was too caffeinated to make it through all of it… However, I think you are looking to do something that there's a pretty good existing solution for already. There's a fork of the parallel python (i.e. pp) library that takes functions and objects and serializes them, sends them to different servers, and then unpickles and executes them. The fork lives inside the pathos package, but you can download it independently here:
http://danse.cacr.caltech.edu/packages/dev_danse_us
The "other context" in that case is another server… and the objects are transported by converting the objects to source code and then back to objects.
If you are looking to use pickling, much in the way you are doing already, there's an extension to mpi4py that serializes arguments and functions, and returns pickled return values… The package is called pyina, and is commonly used to ship code and objects to cluster nodes in coordination with a cluster scheduler.
Both pathos and pyina provide map abstractions (and pipe), and try to hide all of the details of parallel computing behind the abstractions, so scientists don't need to learn anything except how to program normal serial python. They just use one of the map or pipe functions, and get parallel or distributed computing.
Oh, I almost forgot. The dill serializer includes dump_session and load_session functions that allow the user to easily serialize their entire interpreter session and send it to another computer (or just save it for later use). That's pretty handy for changing contexts, in a different sense.
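A tiny illustration of those two calls (the file name is arbitrary; this assumes dill is installed on both ends):

import dill

x = 42
def double(n):
    return n * 2

dill.dump_session("session.pkl")   # serialize the whole __main__ namespace

# ... later, possibly in a fresh interpreter on another machine ...
dill.load_session("session.pkl")
print(double(x))                   # 84: both the data and the function came back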
Get dill, pathos, and pyina here: https://github.com/uqfoundation
For a module to be recognized as loaded, I think it must be in sys.modules, not just have its content imported into your global/local namespace. Try to exec everything, then get the result out of an artificial environment.
env = {"fn": sys.argv[2]}
code = """\
import %s # maybe more
import cPickle
commonFile = open(fn, "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
doIt = commonUnpickle.load()
"""
exec code % modname in env  # fill in the module name identified earlier
return env["doIt"]
While functions are advertised as first-class objects in Python, this is one case where it can be seen that they are really second-class objects. It is the reference to the callable, not the object itself, that is pickled. (You cannot directly pickle a lambda expression.)
There is an alternate usage of __import__ that you might prefer:
import sys
from functools import partial

def importer(modulename, symbols=None):
    u"importer('foo') returns module foo; importer('foo', ['bar']) returns {'bar': object}"
    if modulename in sys.modules: module = sys.modules[modulename]
    else: module = __import__(modulename, fromlist=['*'])
    if symbols == None: return module
    else: return dict(zip(symbols, map(partial(getattr, module), symbols)))
So these would all be basically equivalent:
from mymodule.mysubmodule import myfunction
myfunction = importer('mymodule.mysubmodule').myfunction
globals()['myfunction'] = importer('mymodule.mysubmodule', ['myfunction'])['myfunction']
(Keep in mind I'm working in Python 3, so a solution needs to work in Python 3.)
I would like to use the copyreg module to teach Python how to pickle functions. When I tried to do it, the _Pickler object would still try to pickle functions using the save_global function. (Which doesn't work for unbound methods, and that's the motivation for doing this.)
It seems like _Pickler first tries to look in its own dispatch for the type of the object that you want to pickle before looking in copyreg.dispatch_table. I'm not sure if this is intentional.
Is there any way for me to tell Python to pickle functions with the reducer that I provide?
The following hack seems to work in Python 3.1...:
import copyreg

def functionpickler(f):
    print('pickling', f.__name__)
    return f.__name__

ft = type(functionpickler)
copyreg.pickle(ft, functionpickler)

import pickle
pickle.Pickler = pickle._Pickler
del pickle.Pickler.dispatch[ft]

s = pickle.dumps(functionpickler)
print('Result is', s)
Out of this, the two hackish lines are:
pickle.Pickler = pickle._Pickler
del pickle.Pickler.dispatch[ft]
You need to remove the dispatch entry for functions' type because otherwise it preempts the copyreg registration; and I don't think you can do that on the C-coded Pickler so you need to set it to the Python-coded one.
It would be a bit less of a hack to subclass _Pickler with a class of your own which makes its own dispatch (copying the parent's and removing the entry for the function type), and then use your subclass specifically (and its dump method) rather than pickle.dump; however it would also be a bit less convenient than this monkeypatching of pickle itself.
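For what it's worth, a sketch of that less hacky variant (names are mine; the point is just a private dispatch copy on a subclass, so pickle.Pickler itself is left alone):

import copyreg
import io
import pickle
import types

def functionpickler(f):
    print('pickling', f.__name__)
    return f.__name__   # a string reduce value means "pickle by reference to this global"

copyreg.pickle(types.FunctionType, functionpickler)

class FunctionFriendlyPickler(pickle._Pickler):
    # copy the parent's dispatch and drop the function entry, so the lookup
    # falls through to the copyreg registration
    dispatch = pickle._Pickler.dispatch.copy()
    del dispatch[types.FunctionType]

def my_dumps(obj, protocol=None):
    buf = io.BytesIO()
    FunctionFriendlyPickler(buf, protocol).dump(obj)
    return buf.getvalue()

print('Result is', my_dumps(functionpickler))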