How to inherit from a multiprocessing queue? - python

With the following code, it seems that the queue instance passed to the worker isn't initialized:
from multiprocessing import Process
from multiprocessing.queues import Queue

class MyQueue(Queue):
    def __init__(self, name):
        Queue.__init__(self)
        self.name = name

def worker(queue):
    print queue.name

if __name__ == "__main__":
    queue = MyQueue("My Queue")
    p = Process(target=worker, args=(queue,))
    p.start()
    p.join()
This throws:
... line 14, in worker
print queue.name
AttributeError: 'MyQueue' object has no attribute 'name'
I can't re-initialize the queue inside the worker, because I'd lose the original value of queue.name. Even passing the queue's name as a separate argument to the worker (which should work) isn't a clean solution.
So, how can I inherit from multiprocessing.queues.Queue without getting this error?

On POSIX, Queue objects are shared with the child processes by simple inheritance.*
On Windows, that isn't possible, so it has to pickle the Queue, send it over a pipe to the child, and unpickle it.
(This may not be obvious, because if you actually try to pickle a Queue, you get an exception: RuntimeError: MyQueue objects should only be shared between processes through inheritance. If you look through the source, you'll see that this is really a lie—it only raises this exception if you try to pickle a Queue when multiprocessing is not in the middle of spawning a child process.)
Of course generic pickling and unpickling wouldn't do any good, because you'd end up with two identical queues, not the same queue in two processes. So, multiprocessing extends things a bit, by adding a register_after_fork mechanism for objects to use when unpickling.** If you look at the source for Queue, you can see how it works.
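As a quick aside, the exception from the parenthetical above is easy to see for yourself: pickling a Queue outside of a child spawn should raise it. A minimal sketch, assuming a stock multiprocessing:

import pickle
from multiprocessing import Queue

if __name__ == "__main__":
    q = Queue()
    try:
        pickle.dumps(q)  # not in the middle of spawning a child process
    except RuntimeError as e:
        print(e)  # Queue objects should only be shared between processes through inheritance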
But you don't really need to know how it works to hook it; you can hook it the same way as any other class's pickling. For example, this should work:***
def __getstate__(self):
    return self.name, super(MyQueue, self).__getstate__()

def __setstate__(self, state):
    self.name, state = state
    super(MyQueue, self).__setstate__(state)
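Putting the question's class and these two hooks together gives something like the following (a minimal sketch, assuming Python 2.7 as in the question; on POSIX the fork path never exercises the hooks, but the Windows spawn path does):

from multiprocessing import Process
from multiprocessing.queues import Queue

class MyQueue(Queue):
    def __init__(self, name):
        Queue.__init__(self)
        self.name = name

    # Carry self.name across the pickling that happens when the queue is
    # sent to a child process by spawning (e.g. on Windows).
    def __getstate__(self):
        return self.name, super(MyQueue, self).__getstate__()

    def __setstate__(self, state):
        self.name, state = state
        super(MyQueue, self).__setstate__(state)

def worker(queue):
    print queue.name

if __name__ == "__main__":
    queue = MyQueue("My Queue")
    p = Process(target=worker, args=(queue,))
    p.start()
    p.join()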
For more details, the pickle documentation explains the different ways you can influence how your class is pickled.
(If it doesn't work, and I haven't made a stupid mistake… then you do have to know at least a little about how it works to hook it… but most likely just to figure out whether to do your extra work before or after the _after_fork(), which would just require swapping the last two lines…)
* I'm not sure it's actually guaranteed to use simple fork inheritance on POSIX platforms. That happens to be true on 2.7 and 3.3. But there's a fork of multiprocessing that uses the Windows-style pickle-everything on all platforms for consistency, and another one that uses a hybrid on OS X to allow using CoreFoundation in single-threaded mode, or something like that, and it's clearly doable that way.
** Actually, I think Queue is only using register_after_fork for convenience, and could be rewritten without it… but it's depending on the magic that Pipe does in its _after_fork on Windows, or Lock and BoundedSemaphore on POSIX.
*** This is only correct because I happen to know, from reading the source, that Queue is a new-style class, doesn't override __reduce__ or __reduce_ex__, and never returns a falsey value from __getstate__. If you didn't know that, you'd have to write more code.

Related

Change default multiprocessing unpickler class

I have a multiprocessing program on Device A which uses a queue and a SyncManager to make this accessible over the network. The queue stores a custom class from a module on the device which gets automatically pickled by the multiprocessing package as module.class.
On another device reading the queue via a SyncManager, I have the same module as part of a package instead of top-level as it was on Device A. This means I get a ModuleNotFoundError when I attempt to read an item from the queue as the unpickler doesn't know the module is now package.module.
I've seen this work-around, which uses a new class based on pickle.Unpickler and seems the least hacky and most extensible: https://stackoverflow.com/a/53327348/5683049
However, I don't know how to specify the multiprocessing unpickler class to use.
I see this can be done for the reducer class so I assume there is a way to also set the unpickler?
I have never seen a way to do this. You may have to hack around it: let the multiprocessing machinery think you're passing byte strings or byte arrays, and have your own code perform the pickling and unpickling.
A hack? Yes. But not much worse than what you already have to do.
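A minimal sketch of that idea (the helper names are illustrative, and the renaming Unpickler borrows the approach from the linked answer): the queue only ever carries bytes, and the application code serializes on one side and deserializes, with a custom find_class, on the other.

import io
import pickle

class RenameUnpickler(pickle.Unpickler):
    # Redirect the old top-level module name to its new package location.
    def find_class(self, module, name):
        if module == "old_module_name":
            module = "new_package.module_name"
        return super(RenameUnpickler, self).find_class(module, name)

def put_item(queue, obj):
    # The managed queue only sees plain bytes, which pickle without trouble.
    queue.put(pickle.dumps(obj))

def get_item(queue):
    return RenameUnpickler(io.BytesIO(queue.get())).load()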
Using a mixture of:
How to change the serialization method used by the multiprocessing module?
https://stackoverflow.com/a/53327348/5683049
I was able to get this working using code similar to the following:
import io
import pickle
import multiprocessing
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class RenameUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        renamed_module = module
        if module == "old_module_name":
            renamed_module = "new_package.module_name"
        return super(RenameUnpickler, self).find_class(renamed_module, name)

class MyForkingPickler(ForkingPickler):
    # Signature taken from pickle._loads; loads() is called unbound by
    # multiprocessing, so the first positional argument is the pickled data.
    @staticmethod
    def loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
              buffers=None):
        if isinstance(s, str):
            raise TypeError("Can't load pickle from unicode string")
        file = io.BytesIO(s)
        return RenameUnpickler(file, fix_imports=fix_imports, buffers=buffers,
                               encoding=encoding, errors=errors).load()

class MyPickleReducer(AbstractReducer):
    ForkingPickler = MyForkingPickler
    register = MyForkingPickler.register

# Install the custom reducer on the default multiprocessing context.
multiprocessing.context._default_context.reducer = MyPickleReducer()
This could be useful if you want to further override how the unpickling is performed, but in my original case it is probably just easier to redirect the module using:

import sys
from new_package import module_name

sys.modules['old_module_name'] = module_name

Can't pickle Pyparsing expression with setParseAction() method. Needed for multiprocessing

My original issue is that I am trying to do the following:
def submit_decoder_process(decoder, input_line):
    decoder.process_line(input_line)
    return decoder

self.pool = Pool(processes=num_of_processes)
self.pool.apply_async(submit_decoder_process, [decoder, input_line]).get()
decoder is a bit involved to describe here, but the important thing is that it is an object initialized with a PyParsing expression that calls setParseAction(). This breaks the pickling that multiprocessing relies on, which in turn breaks the code above.
Now, here is the pickle/PyParsing problem that I have isolated and simplified.
The following code yields an error message due to pickle failure.
import pickle
from pyparsing import *

def my_pa_func():
    pass

pickle.dumps(Word(nums).setParseAction(my_pa_func))
Error message:
pickle.PicklingError: Can't pickle <function wrapper at 0x00000000026534A8>: it's not found as pyparsing.wrapper
Now, if you remove the call to .setParseAction(my_pa_func), it works with no problems:
pickle.dumps(Word(nums))
How can I get around it? Multiprocessing uses pickle, so I can't avoid it, I guess. The pathos package, which supposedly uses dill, doesn't seem mature enough; at least, I am having problems installing it on 64-bit Windows. I am really scratching my head here.
OK, here is the solution inspired by rocksportrocker: Python multiprocessing pickling error
The idea is to dill the object that can't be pickled while passing it back and forth between processes and then "undill" it after it has been passed:
from multiprocessing import Pool
import dill

def submit_decoder_process(decoder_dill, input_line):
    decoder = dill.loads(decoder_dill)  # undill after it was passed to a pool process
    decoder.process_line(input_line)
    return dill.dumps(decoder)  # dill before passing back to parent process

self.pool = Pool(processes=num_of_processes)

# Dill before sending to a pool process
decoder_processed = dill.loads(self.pool.apply_async(submit_decoder_process, [dill.dumps(decoder), input_line]).get())
https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
The multiprocessing.Pool uses the pickle protocol to serialize the function and module names (in your example, setParseAction and pyparsing), which are delivered through a Pipe to the child process.
Once the child process receives them, it imports the module and tries to call the function. The problem is that what you're passing is not a function but a method. To resolve it, the pickle protocol would have to be clever enough to build the 'Word' object with the user's parameters and then call the setParseAction method. As handling such cases is too complicated, the pickle protocol prevents you from serializing anything other than top-level functions.
To solve your issue, either instruct the pickle module how to serialize the setParseAction method (https://docs.python.org/2/library/pickle.html#pickle-protocol) or refactor your code so that whatever is passed to Pool.apply_async is serializable.
What if you pass the Word object to the child process and let it call Word().setParseAction() itself?
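For example, one way to apply that suggestion is to attach the parse action inside the worker, so only plain, picklable arguments cross the process boundary. A minimal sketch, with worker and the sample inputs as illustrative names:

from multiprocessing import Pool
from pyparsing import Word, nums

def my_pa_func(tokens):
    return tokens

def worker(input_line):
    # Build the expression and attach the parse action in the child process,
    # so nothing un-picklable ever has to be sent over the pipe.
    expr = Word(nums).setParseAction(my_pa_func)
    return expr.parseString(input_line).asList()

if __name__ == "__main__":
    pool = Pool(processes=2)
    print(pool.map(worker, ["123", "456"]))
    pool.close()
    pool.join()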
I'd suggest pathos.multiprocessing, as you mention. Of course, I'm the pathos author, so I guess that's not a surprise. It appears that there might be a distutils bug that you are running into, as referenced here: https://github.com/uqfoundation/pathos/issues/49.
Your solution using dill is a good workaround. You also might be able to forgo installing the entire pathos package and just install the pathos fork of the multiprocessing package (which uses dill instead of pickle). You can find it here: http://dev.danse.us/packages or here: https://github.com/uqfoundation/pathos/tree/master/external.

Use multiprocessing to get information from multiple sockets

I'm working on developing a little IRC client in Python 2.7. I had hoped to use multiprocessing to read from all the servers I'm currently connected to, but I'm running into an issue.
import socket
import multiprocessing as mp
import types
import copy_reg
import pickle

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()
Whenever it hits the line pool.map(self.method, self.servers.keys()) I get the error
TypeError: expected string or Unicode object, NoneType found
From what I've read, this is what happens when I try to pickle something that isn't picklable. To resolve this, I first added _pickle_method and _unpickle_method as described here. Then I realized that I was (originally) trying to pass pool.map() a list of sockets (very much not picklable), so I changed it to the list of hostnames, since strings can be pickled. I still get this error, however.
I then tried calling pickle.dumps() directly on self.method, self.servers.keys(), and self.servers.keys()[0]. As expected it worked fine for the latter two, but from the first I get
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.
Some more research lead me to this question, which seems to indicate that the issue is with the use of sockets (and gnibbler's answer to that question would seem to confirm it).
Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read pathos.multiprocessing might be what I need, but I'd really like to stick to the standard library if at all possible.
I'm also not set on using multiprocessing - if multithreading would work better and avoid this issue then I'm more than open to those solutions.
Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.
In more detail:
Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.
You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket; it's just a wrapper around a type defined in a C extension module, which has its own problems with pickling. You might be able to work around that as well…
But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to re-construct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.
One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.
A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.
You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.
If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install off PyPI. It includes a ThreadPoolExecutor which is a slightly higher-level abstraction around a pool which may be simpler to use. But Pool should also work fine for you here, and you've already written the code.
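To make that concrete, here's a minimal sketch of the question's class with the thread pool swapped in (keeping example.com and the IRC commands from the question as placeholders); because threads share the process, the sockets never need to be pickled:

import socket
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {"example.com": sock1}

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = Pool()
        pool.map(self.method, self.servers.keys())  # no pickling: threads share memory
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()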
If you do want to try jumping out of the standard library, then the following code for pathos.multiprocessing (as you mention) should not throw pickling errors, as the dill serializer knows how to serialize sockets and file handles.
>>> import socket
>>> import pathos.multiprocessing as mp
>>> import types
>>> import dill as pickle
>>>
>>> class a(object):
...     def __init__(self):
...         sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
...         sock1.connect((socket.gethostbyname("example.com"), 6667))
...         self.servers = {}
...         self.servers["example.com"] = sock1
...     def method(self, hostname):
...         self.servers[hostname].send("JOIN DAN\r\n")
...         print "1"
...     def oth_method(self):
...         pool = mp.ProcessingPool()
...         pool.map(self.method, self.servers.keys())
...         pool.close()
...         pool.join()
...
>>> b = a()
>>> b.oth_method()
One issue however is that you need serialization with multiprocessing, and in many cases the sockets will serialize so that the deserialized socket is closed. The reason is primarily because the file descriptor isn't copied as expected, it's copied by reference. With dill you can customize the serialization of file handles, so that the content does get transferred as opposed to using a reference… however, this doesn't translate well for a socket (at least at the moment).
I'm the dill and pathos author, and I'd have to agree with @abarnert that you probably don't want to do this with multiprocessing (at least not storing a map of servers and sockets). If you want to use multiprocessing's threading interface, and you find you run into any serialization concerns, pathos.multiprocessing does have mp.ThreadingPool() instead of mp.ProcessingPool(), so that you can access a wrapper around multiprocessing.dummy.Pool, but still get the additional features that pathos provides (such as multi-argument pools for blocking or asynchronous pipes and maps, etc).
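For completeness, switching to the thread-based pool only changes oth_method in the example above; a quick sketch, assuming a reasonably recent pathos:

...     def oth_method(self):
...         pool = mp.ThreadingPool()   # threads instead of processes
...         pool.map(self.method, self.servers.keys())  # sockets shared, never serialized
...         pool.close()
...         pool.join()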

Using Python's multiprocessing.Process class

This is a newbie question:
A class is an object, so I can create a class called pippo() and add functions and parameters inside it. What I don't understand is whether the functions inside pippo are executed top to bottom when I assign x = pippo(), or whether I must call them explicitly as x.dosomething() outside of pippo.
Working with Python's multiprocessing package, is it better to define a big function and create the object using the target argument in the call to Process(), or to create your own process class by inheriting from Process class?
I often wondered why Python's doc page on multiprocessing only shows the "functional" approach (using the target parameter). Probably because terse, succinct code snippets are best for illustration purposes. For small tasks that fit in one function, I can see how that is the preferred way:
from multiprocessing import Process

def f():
    print('hello')

p = Process(target=f)
p.start()
p.join()
But when you need greater code organization (for complex tasks), making your own class is the way to go:
from multiprocessing import Process

class P(Process):
    def __init__(self):
        super(P, self).__init__()

    def run(self):
        print('hello')

p = P()
p.start()
p.join()
Bear in mind that each spawned process is initialized with a copy of the memory footprint of the master process, and that the constructor code (i.e. the stuff inside __init__()) is executed in the master process -- only the code inside run() executes in separate processes.
Therefore, if a process (master or spawned) changes its member variables, the change will not be reflected in the other processes. This, of course, is only true for built-in types like bool, string, list, etc. You can, however, import "special" data structures from the multiprocessing module which are transparently shared between processes (see Sharing state between processes), or you can create your own channels of IPC (inter-process communication) such as multiprocessing.Pipe and multiprocessing.Queue.
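As an illustration of the last point, here is a minimal sketch of the subclass approach combined with one of those shared structures (Counter and shared_total are illustrative names, not part of the question):

from multiprocessing import Process, Value

class Counter(Process):
    def __init__(self, shared_total):
        super(Counter, self).__init__()
        self.shared_total = shared_total  # created in the master process

    def run(self):
        # Runs in the child; updates to the shared Value are visible to the parent.
        for i in range(5):
            with self.shared_total.get_lock():
                self.shared_total.value += i

if __name__ == "__main__":
    total = Value('i', 0)
    p = Counter(total)
    p.start()
    p.join()
    print(total.value)  # 10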

Tracing an ignored exception in Python?

My app has a custom audio library that itself uses the BASS library.
I create and destroy BASS stream objects throughout the program.
When my program exits, randomly (I haven't figured out the pattern yet) I get the following notice on my console:
Exception TypeError: "'NoneType' object is not callable" in <bound method stream.__del__ of <audio.audio_player.stream object at 0xaeda2f0>> ignored
My audio library (audio/audio_player.py [class Stream]) contains a class that creates a BASS stream object and then allows the code to manipulate it. When the class instance is destroyed (in the __del__ routine) it calls BASS_StreamFree to clear any resources BASS might have allocated.
(audio_player.py)
from pybass import *
from ctypes import pointer, c_float, c_long, c_ulong, c_buffer
import os.path, time, threading

# initialize the BASS engine
BASS_Init(-1, 44100, 0, 0, None)

class stream(object):
    """Represents a single audio stream"""

    def __init__(self, file):
        # check for file existence
        if (os.path.isfile(file) == False):
            raise ValueError("File %s not found." % file)
        # initialize a bass channel
        self.cAddress = BASS_StreamCreateFile(False, file, 0, 0, 0)

    def __del__(self):
        BASS_StreamFree(self.cAddress)

    def play(self):
        BASS_ChannelPlay(self.cAddress, True)
        while (self.playing == False):
            pass

..more code..
My first inclination, based on this message, is that somewhere in my code an instance of my stream class is being orphaned (no longer assigned to a variable), Python still tries to call its __del__ method when the app closes, but by that time the object has gone away.
This app does use wxWidgets and thus involves some threading. The fact that I'm not being given an actual variable name leads me to believe what I stated in the previous paragraph.
I'm not sure exactly what code would be relevant to debug this. The message does seem harmless but I don't like the idea of an "ignored" exception in the final production code.
Is there any tips anyone has for debugging this?
The message that the exception was ignored is because all exceptions raised in a __del__ method are ignored to keep the data model sane. Here's the relevant portion of the docs:
Warning: Due to the precarious circumstances under which __del__() methods are invoked, exceptions that occur during their execution are ignored, and a warning is printed to sys.stderr instead. Also, when __del__() is invoked in response to a module being deleted (e.g., when execution of the program is done), other globals referenced by the __del__() method may already have been deleted or in the process of being torn down (e.g. the import machinery shutting down). For this reason, __del__() methods should do the absolute minimum needed to maintain external invariants. Starting with version 1.5, Python guarantees that globals whose name begins with a single underscore are deleted from their module before other globals are deleted; if no other references to such globals exist, this may help in assuring that imported modules are still available at the time when the __del__() method is called.
As for debugging it, you could start by putting a try/except block around the code in your __del__ method and printing out more information about the program's state at the time it occurs. Or you could consider doing less in the __del__ method, or getting rid of it entirely!
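A minimal, generic sketch of that debugging suggestion (Stream and cleanup are illustrative stand-ins for the question's stream class and BASS_StreamFree): wrap the cleanup call so a failure is reported with some context instead of being reduced to the bare "ignored" notice.

import sys
import traceback

def cleanup(handle):
    print("freed %r" % handle)

class Stream(object):
    def __init__(self, handle):
        self.handle = handle

    def __del__(self):
        try:
            cleanup(self.handle)  # stands in for BASS_StreamFree(self.cAddress)
        except Exception:
            # Report which instance failed and what the module global looked
            # like at the time; during interpreter shutdown it may already be None.
            traceback.print_exc(file=sys.stderr)
            sys.stderr.write("__del__ failed for %r; cleanup=%r\n"
                             % (self, globals().get("cleanup")))

s = Stream(42)  # at interpreter exit, module globals may already have been torn down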
