I'm working on developing a little irc client in python (ver 2.7). I had hoped to use multiprocessing to read from all servers I'm currently connected to, but I'm running into an issue
import socket
import multiprocessing as mp
import types
import copy_reg
import pickle

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()
Whenever it hits the line pool.map(self.method, self.servers.keys()) I get the error
TypeError: expected string or Unicode object, NoneType found
From what I've read this is what happens when I try to pickle something that isn't picklable. To resolve this I first made the _pickle_method and _unpickle_method as described here. Then I realized that I was (originally) trying to pass pool.map() a list of sockets (very not picklable) so I changed it to the list of hostnames, as strings can be pickled. I still get this error, however.
I then tried calling pickle.dumps() directly on self.method, self.servers.keys(), and self.servers.keys()[0]. As expected it worked fine for the latter two, but from the first I get
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.
Some more research led me to this question, which seems to indicate that the issue is with the use of sockets (and gnibbler's answer to that question would seem to confirm it).
Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read pathos.multiprocessing might be what I need, but I'd really like to stick to the standard library if at all possible.
I'm also not set on using multiprocessing - if multithreading would work better and avoid this issue then I'm more than open to those solutions.
Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.
In more detail:
Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.
You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket; it's just a wrapper around a type defined in a C extension module, which has its own problems with pickling. You might be able to work around that as well…
But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to re-construct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.
One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.
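For what it's worth, a minimal sketch of that inheritance approach might look like the following, on a POSIX system where the pool forks its workers; the module-level SERVERS dict and join_channel function here are hypothetical stand-ins for the class in the question:

import socket
import multiprocessing as mp

SERVERS = {}  # filled in before the pool is created, so forked workers inherit it

def join_channel(hostname):
    # The inherited socket is usable directly; only the hostname string is pickled.
    SERVERS[hostname].send("JOIN DAN\r\n")

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((socket.gethostbyname("example.com"), 6667))
    SERVERS["example.com"] = sock

    pool = mp.Pool()  # children are forked after SERVERS is populated
    pool.map(join_channel, SERVERS.keys())
    pool.close()
    pool.join()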
A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.
You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.
If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install off PyPI. It includes a ThreadPoolExecutor, a slightly higher-level abstraction around a pool that may be simpler to use. But Pool should also work fine for you here, and you've already written the code.
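For concreteness, here is a sketch of the question's code with the thread pool swapped in; the only real change is where Pool comes from, since multiprocessing.dummy.Pool has the same API but uses threads, so self and its sockets are shared rather than pickled:

import socket
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {"example.com": sock1}

    def method(self, hostname):
        # Runs in a worker thread; nothing here gets pickled.
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = Pool()
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()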
If you do want to try jumping out of the standard library, then the following code for pathos.multiprocessing (as you mention) should not throw pickling errors, as the dill serializer knows how to serialize sockets and file handles.
>>> import socket
>>> import pathos.multiprocessing as mp
>>> import types
>>> import dill as pickle
>>>
>>> class a(object):
...     def __init__(self):
...         sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
...         sock1.connect((socket.gethostbyname("example.com"), 6667))
...         self.servers = {}
...         self.servers["example.com"] = sock1
...     def method(self, hostname):
...         self.servers[hostname].send("JOIN DAN\r\n")
...         print "1"
...     def oth_method(self):
...         pool = mp.ProcessingPool()
...         pool.map(self.method, self.servers.keys())
...         pool.close()
...         pool.join()
...
>>> b = a()
>>> b.oth_method()
One issue, however, is that multiprocessing requires serialization, and in many cases a socket serializes in such a way that the deserialized socket is closed. The reason is primarily that the file descriptor isn't copied as you might expect; it's copied by reference. With dill you can customize the serialization of file handles, so that the content gets transferred rather than a reference… however, this doesn't translate well to a socket (at least at the moment).
I'm the dill and pathos author, and I'd have to agree with @abarnert that you probably don't want to do this with multiprocessing (at least not by storing a map of servers and sockets). If you want a threading interface with the multiprocessing API, and you run into any serialization concerns, pathos.multiprocessing has mp.ThreadingPool() in addition to mp.ProcessingPool(), giving you a wrapper around multiprocessing.dummy.Pool while keeping the additional features that pathos provides (such as multi-argument pools for blocking or asynchronous pipes and maps, etc.).
Is there any way to test a pickle file to see if it loads a function or class during unpickling?
This gives a good summary of how to stop loading of selected functions:
https://docs.python.org/3/library/pickle.html#restricting-globals
I assume it could be used to check if there is function loading at all, by simply blocking all function loading and getting an error message.
But is there a way to write a function that will simply say: there is only text data in this pickled object and no function loading?
I can't say I know which builtins are safe!
Basically no, there is truly no way. There is a lot written about this. You can only use pickle if you trust the source, and you get the pickle directly from the source.
Any safety measures you perform are not sufficient to protect against malicious attempts whatsoever.
https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9
https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html
etcetera.
I use it sometimes, but then mostly after I have had a phone call with a colleague and we have shared a pickled file directly. More often, I use it on my own local environment to store data. Still, this is not the preferred way, but it's fast.
So, when in doubt. Do not use pickle.
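To make the point concrete, here is a minimal (deliberately harmless) sketch of why inspection can't save you: the attacker controls __reduce__, so arbitrary code runs during the load itself, before you ever get to look at the "data":

import os
import pickle

class Exploit(object):
    def __reduce__(self):
        # Whatever this returns gets called at unpickling time.
        return (os.system, ("echo this ran during pickle.loads",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # executes the shell command; no "is it only text data?" check can run first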
Thanks for the answers!
Straight pickle is too prone to security issues.
Picklemagic claims to fix the security issues; it looks like it does, but I can't quite confirm that: http://github.com/CensoredUsername/picklemagic
But https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9 suggests that there is no safe wrapper (picklemagic has been around for 8 years, and the article dates from 2021; so was picklemagic not considered?)
The only surefire way to protect against Pickle Bombs is not to use pickle directly. Unfortunately unlike other unsafe standard library packages there are no safe wrappers or drop-in alternatives available for pickle, like defusedxml for xml or tarsafe for tarfile. Further there’s not a great way to inspect a pickle prior to unpickling or to block unsafe function calls invoked by REDUCE.
The 3.10 docs do offer a wrapper for blocking unauthorized execution of a function. https://docs.python.org/3.10/tutorial/controlflow.html
It does not say which builtins are safe. If os is removed, are the others safe? Still, if it is clear what is supposed to be in the pickled object, it may be easy enough to restrict execution.
import builtins
import io
import pickle

safe_builtins = {
    'range',
    'complex',
    'set',
    'frozenset',
    'slice',
}

class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

def restricted_loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()
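As a quick sanity check of the restricted unpickler above (a sketch): plain data round-trips fine, while anything that needs a global outside safe_builtins is rejected before it can run.

import pickle

print(restricted_loads(pickle.dumps({"text": "hello", "count": 3})))
# {'text': 'hello', 'count': 3}

try:
    restricted_loads(pickle.dumps(tuple))  # a builtin type that is not in safe_builtins
except pickle.UnpicklingError as exc:
    print(exc)  # global 'builtins.tuple' is forbidden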
I have a multiprocessing program on Device A which uses a queue and a SyncManager to make this accessible over the network. The queue stores a custom class from a module on the device which gets automatically pickled by the multiprocessing package as module.class.
On another device reading the queue via a SyncManager, I have the same module as part of a package instead of top-level as it was on Device A. This means I get a ModuleNotFoundError when I attempt to read an item from the queue as the unpickler doesn't know the module is now package.module.
I've seen this work-around which uses a new class based on pickle.Unpickler and seems the least hacky and extensible: https://stackoverflow.com/a/53327348/5683049
However, I don't know how to specify the multiprocessing unpickler class to use.
I see this can be done for the reducer class so I assume there is a way to also set the unpickler?
I have never seen a way to do this. You may have to hack around it: let the multiprocessing machinery think you're passing byte strings or byte arrays, and have your own code perform the pickling and unpickling.
A hack? Yes. But not much worse than what you already have to do.
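A rough sketch of that hack, assuming a RenameUnpickler along the lines of the answer linked above (old_module_name and new_package.module_name are placeholders): the managed queue only ever carries bytes, and each side does its own (un)pickling.

import io
import pickle

class RenameUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "old_module_name":          # top-level name used on Device A
            module = "new_package.module_name"   # where the module lives on Device B
        return super().find_class(module, name)

def put_item(queue, obj):
    # Device A: hand the manager plain bytes, which always pickle cleanly.
    queue.put(pickle.dumps(obj))

def get_item(queue):
    # Device B: unpickle the bytes yourself, with the renaming unpickler.
    return RenameUnpickler(io.BytesIO(queue.get())).load()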
Using a mixture of:
How to change the serialization method used by the multiprocessing module?
https://stackoverflow.com/a/53327348/5683049
I was able to get this working using code similar to the following:
import io
import pickle
import multiprocessing
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class RenameUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        renamed_module = module
        if module == "old_module_name":
            renamed_module = "new_package.module_name"
        return super(RenameUnpickler, self).find_class(renamed_module, name)

class MyForkingPickler(ForkingPickler):
    # Signature mirrors pickle._loads
    @staticmethod
    def loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
              buffers=None):
        if isinstance(s, str):
            raise TypeError("Can't load pickle from unicode string")
        file = io.BytesIO(s)
        return RenameUnpickler(file, fix_imports=fix_imports, buffers=buffers,
                               encoding=encoding, errors=errors).load()

class MyPickleReducer(AbstractReducer):
    ForkingPickler = MyForkingPickler
    register = MyForkingPickler.register

# Install the custom reducer (and therefore the custom unpickler) as the default.
multiprocessing.context._default_context.reducer = MyPickleReducer()
This could be useful if you want to further override how the unpickling is performed, but in my original case it is probably just easier to redirect the module using:
import sys
from new_package import module_name

sys.modules['old_module_name'] = module_name
My original issue is that I am trying to do the following:
def submit_decoder_process(decoder, input_line):
    decoder.process_line(input_line)
    return decoder

self.pool = Pool(processes=num_of_processes)
self.pool.apply_async(submit_decoder_process, [decoder, input_line]).get()
decoder is a bit involved to describe here, but the important thing is that it is an object initialized with a PyParsing expression on which setParseAction() has been called. This breaks the pickling that multiprocessing relies on, which in turn breaks the code above.
Now, here is the pickle/PyParsing problem that I have isolated and simplified.
The following code yields an error message due to pickle failure.
import pickle
from pyparsing import *

def my_pa_func():
    pass

pickle.dumps(Word(nums).setParseAction(my_pa_func))
Error message:
pickle.PicklingError: Can't pickle <function wrapper at 0x00000000026534A8>: it's not found as pyparsing.wrapper
Now, if you remove the call to .setParseAction(my_pa_func), it will work with no problems:
pickle.dumps(Word(nums))
How can I get around it? Multiprocessing uses pickle, so I can't avoid it, I guess. The pathos package, which supposedly uses dill, does not seem mature enough; at least, I am having problems installing it on 64-bit Windows. I am really scratching my head here.
OK, here is the solution inspired by rocksportrocker: Python multiprocessing pickling error
The idea is to dill the object that can't be pickled while passing it back and forth between processes and then "undill" it after it has been passed:
from multiprocessing import Pool
import dill

def submit_decoder_process(decoder_dill, input_line):
    decoder = dill.loads(decoder_dill)  # undill after it was passed to a pool process
    decoder.process_line(input_line)
    return dill.dumps(decoder)          # dill before passing back to the parent process

self.pool = Pool(processes=num_of_processes)
# Dill before sending to a pool process
decoder_processed = dill.loads(
    self.pool.apply_async(submit_decoder_process,
                          [dill.dumps(decoder), input_line]).get())
https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
multiprocessing.Pool uses the pickle protocol to serialize the function and module names (in your example, setParseAction and pyparsing), which are delivered through a pipe to the child process.
The child process, once it receives them, imports the module and tries to call the function. The problem is that what you're passing is not a function but a method. To resolve it, the pickle protocol would have to be clever enough to build the Word object with the user-supplied parameters and then call the setParseAction method on it. As handling these cases is too complicated, the pickle protocol prevents you from serializing anything but top-level functions.
To solve your issue, either instruct the pickle module on how to serialize the setParseAction method (https://docs.python.org/2/library/pickle.html#pickle-protocol) or refactor your code so that what's passed to Pool.apply_async is serializable.
What if you pass the Word object to the child process and let the child call setParseAction() on it?
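Something along these lines, as a sketch (my_pa_func and the sample inputs are placeholders): the bare Word expression pickles fine, and each worker attaches the parse action after it arrives.

from multiprocessing import Pool
from pyparsing import Word, nums

def my_pa_func(tokens):
    return tokens

def worker(args):
    expr, line = args
    expr.setParseAction(my_pa_func)       # attach the action on the child's copy
    return expr.parseString(line).asList()

if __name__ == "__main__":
    expr = Word(nums)                     # no parse action yet, so it pickles
    pool = Pool()
    print(pool.map(worker, [(expr, "123"), (expr, "456")]))
    pool.close()
    pool.join()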
I'd suggest pathos.multiprocessing, as you mention. Of course, I'm the pathos author, so I guess that's not a surprise. It appears that there might be a distutils bug that you are running into, as referenced here: https://github.com/uqfoundation/pathos/issues/49.
Your solution using dill is a good workaround. You might also be able to forgo installing the entire pathos package, and just install the pathos fork of the multiprocessing package (which uses dill instead of pickle). You can find it here: http://dev.danse.us/packages or here: https://github.com/uqfoundation/pathos/tree/master/external.
With the following code, it seems that the queue instance passed to the worker isn't initialized:
from multiprocessing import Process
from multiprocessing.queues import Queue

class MyQueue(Queue):

    def __init__(self, name):
        Queue.__init__(self)
        self.name = name

def worker(queue):
    print queue.name

if __name__ == "__main__":
    queue = MyQueue("My Queue")
    p = Process(target=worker, args=(queue,))
    p.start()
    p.join()
This throws:
... line 14, in worker
    print queue.name
AttributeError: 'MyQueue' object has no attribute 'name'
I can't just re-initialize the queue, because I'd lose the original value of queue.name; passing the queue's name as a separate argument to the worker should work, but it's not a clean solution.
So, how can I inherit from multiprocessing.queues.Queue without getting this error?
On POSIX, Queue objects are shared to the child processes by simple inheritance.*
On Windows, that isn't possible, so it has to pickle the Queue, send it over a pipe to the child, and unpickle it.
(This may not be obvious, because if you actually try to pickle a Queue, you get an exception, RuntimeError: MyQueue objects should only be shared between processes through inheritance. If you look through the source, you'll see that this is really a lie—it only raises this exception if you try to pickle a Queue when multiprocess is not in the middle of spawning a child process.)
Of course generic pickling and unpickling wouldn't do any good, because you'd end up with two identical queues, not the same queue in two processes. So, multiprocessing extends things a bit, by adding a register_after_fork mechanism for objects to use when unpickling.** If you look at the source for Queue, you can see how it works.
But you don't really need to know how it works to hook it; you can hook it the same way as any other class's pickling. For example, this should work:***
def __getstate__(self):
    return self.name, super(MyQueue, self).__getstate__()

def __setstate__(self, state):
    self.name, state = state
    super(MyQueue, self).__setstate__(state)
For more details, the pickle documentation explains the different ways you can influence how your class is pickled.
(If it doesn't work, and I haven't made a stupid mistake… then you do have to know at least a little about how it works to hook it… but most likely just to figure out whether to do your extra work before or after the _after_fork(), which would just require swapping the last two lines…)
* I'm not sure it's actually guaranteed to use simple fork inheritance on POSIX platforms. That happens to be true on 2.7 and 3.3. But there's a fork of multiprocessing that uses the Windows-style pickle-everything on all platforms for consistency, and another one that uses a hybrid on OS X to allow using CoreFoundation in single-threaded mode, or something like that, and it's clearly doable that way.
** Actually, I think Queue is only using register_after_fork for convenience, and could be rewritten without it… but it's depending on the magic that Pipe does in its _after_fork on Windows, or Lock and BoundedSemaphore on POSIX.
*** This is only correct because I happen to know, from reading the source, that Queue is a new-style class, doesn't override __reduce__ or __reduce_ex__, and never returns a falsey value from __getstate__. If you didn't know that, you'd have to write more code.
I have written a Python interface to a process-centric job distribution system we're developing/using internally at my workplace. While reasonably skilled programmers, the primary people using this interface are research scientists, not software developers, so ease-of-use and keeping the interface out of the way to the greatest degree possible is paramount.
My library unrolls a sequence of inputs into a sequence of pickle files on a shared file server, then spawns jobs that load those inputs, perform the computation, pickle the results, and exit; the client script then picks back up and produces a generator that loads and yields the results (or rethrows any exception the calculation function did.)
This is only useful since the calculation function itself is one of the serialized inputs. cPickle is quite content to pickle function references, but requires the pickled function to be reimportable in the same context. This is problematic. I've already solved the problem of finding the module to reimport it, but the vast majority of the time, it is a top-level function that is pickled and, thus, does not have a module path. The only strategy I've found to be able to unpickle such a function on the computation nodes is this nauseating little approach towards simulating the original environment in which the function was pickled before unpickling it:
...
# At this point, we've identified the source of the target function.
# A string by its name lives in "modname".
# In the real code, there is significant try/except work here.
targetModule = __import__(modname)
globalRef = globals()
for thingie in dir(targetModule):
    if thingie not in globalRef:
        globalRef[thingie] = targetModule.__dict__[thingie]

# sys.argv[2]: the path to the pickle file common to all jobs, which contains
# any data in common to all invocations of the target function, then the
# target function itself
commonFile = open(sys.argv[2], "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()

# the actual function unpack I'm having trouble with:
doIt = commonUnpickle.load()
The final line is the most important one here- it's where my module is picking up the function it should actually be running. This code, as written, works as desired, but directly manipulating the symbol tables like this is unsettling.
How can I do this, or something very much like this that does not force the research scientists to separate their calculation scripts into a proper class structure (they use Python like the most excellent graphing calculator ever and I would like to continue to let them do so) the way Pickle desperately wants, without the unpleasant, unsafe, and just plain scary __dict__-and-globals() manipulation I'm using above? I fervently believe there has to be a better way, but exec "from {0} import *".format("modname") didn't do it, several attempts to inject the pickle load into the targetModule reference didn't do it, and eval("commonUnpickle.load()", targetModule.__dict__, locals()) didn't do it. All of these fail with Unpickle's AttributeError over being unable to find the function in <module>.
What is a better way?
Pickling functions can be rather annoying if trying to move them into a different context. If the function does not reference anything from the module that it is in and references (if anything) modules that are guaranteed to be imported, you might check some code from a Rudimentary Database Engine found on the Python Cookbook.
In order to support views, the academic module grabs the code from the callable when pickling the query. When it comes time to unpickle the view, a LambdaType instance is created with the code object and a reference to a namespace containing all imported modules. The solution has limitations but worked well enough for the exercise.
Example for Views
class _View:

    def __init__(self, database, query, *name_changes):
        "Initializes _View instance with details of saved query."
        self.__database = database
        self.__query = query
        self.__name_changes = name_changes

    def __getstate__(self):
        "Returns everything needed to pickle _View instance."
        return self.__database, self.__query.__code__, self.__name_changes

    def __setstate__(self, state):
        "Sets the state of the _View instance when unpickled."
        database, query, name_changes = state
        self.__database = database
        self.__query = types.LambdaType(query, sys.modules)
        self.__name_changes = name_changes
Sometimes it appears necessary to make modifications to the registered modules available in the system. If, for example, you need to make reference to the first module (__main__), you may need to create a new module object with your available namespace loaded into it. The same recipe used the following technique.
Example for Modules
def test_northwind():
    "Loads and runs some test on the sample Northwind database."
    import os, imp
    # Patch the module namespace to recognize this file.
    name = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    module = imp.new_module(name)
    vars(module).update(globals())
    sys.modules[name] = module
Your question was long, and I was too caffeinated to make it all the way through… However, I think you are looking to do something that there's a pretty good existing solution for already. There's a fork of the parallel python (i.e. pp) library that takes functions and objects and serializes them, sends them to different servers, and then unpickles and executes them. The fork lives inside the pathos package, but you can download it independently here:
http://danse.cacr.caltech.edu/packages/dev_danse_us
The "other context" in that case is another server… and the objects are transported by converting the objects to source code and then back to objects.
If you are looking to use pickling, much in the way you are doing already, there's an extension to mpi4py that serializes arguments and functions, and returns pickled return values… The package is called pyina, and is commonly used to ship code and objects to cluster nodes in coordination with a cluster scheduler.
Both pathos and pyina provide map abstractions (and pipe), and try to hide all of the details of parallel computing behind the abstractions, so scientists don't need to learn anything except how to program normal serial python. They just use one of the map or pipe functions, and get parallel or distributed computing.
Oh, I almost forgot. The dill serializer includes dump_session and load_session functions that allow the user to easily serialize their entire interpreter session and send it to another computer (or just save it for later use). That's pretty handy for changing contexts, in a different sense.
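For example, the session round trip looks roughly like this (a sketch, assuming compatible Python and dill versions on both ends):

import dill

x = 42
def double(n):
    return 2 * n

dill.dump_session('session.pkl')    # serialize everything defined in __main__

# ... later, possibly in a fresh interpreter on another machine:
import dill
dill.load_session('session.pkl')    # x and double exist again
print(double(x))                    # 84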
Get dill, pathos, and pyina here: https://github.com/uqfoundation
For a module to be recognized as loaded, I think it must be in sys.modules, not just have its contents imported into your global/local namespace. Try to exec everything, then get the result out of an artificial environment.
env = {"fn": sys.argv[2]}
code = """\
import %s  # maybe more
import cPickle
commonFile = open(fn, "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
doIt = commonUnpickle.load()
""" % modname
exec code in env
return env["doIt"]
While functions are advertised as first-class objects in Python, this is one case where it can be seen that they are really second-class objects. It is the reference to the callable, not the object itself, that is pickled. (You cannot directly pickle a lambda expression.)
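For instance, even a lambda bound to a module-level name won't pickle, because pickle stores only the function's module and name, and a lambda's name is just '<lambda>':

import pickle

add_one = lambda x: x + 1
try:
    pickle.dumps(add_one)
except pickle.PicklingError as exc:
    print(exc)  # Can't pickle <function <lambda> ...>: it's not found as __main__.<lambda>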
There is an alternate usage of __import__ that you might prefer:
import sys
from functools import partial

def importer(modulename, symbols=None):
    u"importer('foo') returns module foo; importer('foo', ['bar']) returns {'bar': object}"
    if modulename in sys.modules:
        module = sys.modules[modulename]
    else:
        module = __import__(modulename, fromlist=['*'])
    if symbols is None:
        return module
    else:
        return dict(zip(symbols, map(partial(getattr, module), symbols)))
So these would all be basically equivalent:
from mymodule.mysubmodule import myfunction
myfunction = importer('mymodule.mysubmodule').myfunction
globals()['myfunction'] = importer('mymodule.mysubmodule', ['myfunction'])['myfunction']