Sorry... it seems I'm asking a popular question, but I cannot find anything helpful for my case on Stack Overflow :P
My code does the following things:
step 1. The parent process writes Task objects into a multiprocessing.JoinableQueue.
step 2. Child processes (more than one) get the Task objects from the JoinableQueue and execute them.
my module structure is:
A.py
    class Task(object)
    class WorkerPool(object)
    class Worker(multiprocessing.Process)
        def run()  # step 2 above is executed here
    class TestGroup()
        def loadTest()  # step 1 above is executed here, i.e. the Task objects are appended
What I understand is that when mp.JoinableQueue is used, the objects put into it should be picklable; I got the meaning of "picklable" from https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
My questions are:
1. Is the Task object picklable in my case?
I got the error below when the code puts Task objects into the JoinableQueue:
File "/usr/lib/python2.6/multiprocessing/queues.py", line 242, in _feed
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object1
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object2
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object3
send(obj)
PicklingError: Can't pickle : attribute lookup pysphere.resources.VimService_services_types.DynamicData_Holder failed
2. What's the general usage of mp.JoinableQueue? In my case, I need to use join() and task_done().
3. When I use Queue.Queue instead of mp.JoinableQueue, the pickling error is gone. However, the logs show that all the child processes keep working on the first object in the queue. What is the possible reason for this?
The multiprocessing module in Python starts multiple processes to run your tasks. Since processes do not share memory, they need to be able to communicate using serialized data. multiprocessing uses the pickle module to do the serialization, thus the requirement that the objects you are passing to the tasks be picklable.
1) Your task object seems to contain an instance from pysphere.resources.VimService_services_types. This is probably a reference to a system resource, such as an open file. It cannot be serialized or passed from one process to another, and therefore it causes the pickling error.
What you can do with mp.JoinableQueue is pass the arguments you need to the task, and have it start the service in the task itself so that it is local to that process.
For example:
import multiprocessing as mp

queue = mp.JoinableQueue()
# not queue.put(task), since the new process will create the task
queue.put(task_args)

def f(queue):
    task_args = queue.get()
    task = Task(task_args)
    ...
    # you can't return the task, unless you've closed all non-serializable parts
    return task.result

process = mp.Process(target=f, args=(queue,))
...
2) Queue.Queue is meant for threading. It relies on the threads sharing the same memory and uses locks to provide atomic operations. However, when you start a new process with multiprocessing, the child is a copy of the parent process, so each child works on its own copy of the queue: every child sees the same first item, and none of their get() calls are visible to the others.
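A minimal sketch of this copy behaviour (my own illustration, assuming a Unix fork start method as in the original Python 2.6 setup; the module is named Queue on Python 2):

import multiprocessing
import queue  # "Queue" on Python 2

q = queue.Queue()
for i in range(5):
    q.put(i)

def worker(name):
    # this get() only drains the child's private copy of q
    print(name, "got", q.get())

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=("worker-%d" % i,)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("parent queue still holds", q.qsize(), "items")  # still 5 in the parent

Every child prints item 0, and the parent's queue is left untouched, which matches the behaviour described in question 3.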
Related
Given an instance method that mutates an instance variable, running this method in the ProcessPoolExecutor does run the method but does not mutate the instance variable.
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        self.started = True

if __name__ == "__main__":
    a = A()
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
    assert a.started
Started...
Traceback (most recent call last):
  File "/path/to/file", line 19, in <module>
    assert a.started
AssertionError
Are only pure functions allowed in ProcessPoolExecutor?
For Windows
Multiprocessing does not share its state with the child processes on Windows systems. This is because the default way to start child processes on Windows is spawn. From the documentation for the spawn start method:
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver
Therefore, when you pass any objects to child processes, they are actually copied, and do not have the same memory address as in the parent process. A simple way to demonstrate this through your example would be to print the objects in the child process and the parent process:
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        print(f'Child proc: {self}')
        self.started = True

if __name__ == "__main__":
    a = A()
    print(f'Parent proc: {a}')
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
Output
Parent proc: <__main__.A object at 0x0000028F44B40FD0>
Started...
Child proc: <__mp_main__.A object at 0x0000019D2B8E64C0>
As you can see, both objects reside at different places in the memory. Altering one would not affect the other whatsoever. This is the reason why you don't see any changes to a.started in the parent process.
Once you understand this, your question becomes how to share the same object, rather than copies, with the child processes. There are numerous ways to go about this, and questions on how to share complex objects like a have already been asked and answered on Stack Overflow.
For UNIX
The same could be said for the other methods of starting new processes that UNIX-based systems have the option of using (I am not sure what the default is for concurrent.futures on OSX). For example, from the documentation for multiprocessing, fork is explained as:
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
So fork creates child processes that share the entire memory space of the parent process at start. However, it uses copy-on-write to do so. This means that if you attempt to modify any shared object from within the child process, a duplicate of that particular object is created so as not to affect the parent process, and that duplicate is local to the child process (much like what spawn does at start).
Hence the answer still stands: if you plan to modify the objects passed to the child process, or if you are not on a UNIX system, you will need to share the objects yourself to have them point to the same memory address.
Further reading on start methods.
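One hedged sketch of such sharing (my own choice of approach, not something from the question): keep the mutable state in a manager process and give the workers a picklable proxy to it.

from concurrent.futures import ProcessPoolExecutor
import multiprocessing

class A:
    def __init__(self, started_flag):
        # started_flag is a proxy to a value living in the manager process,
        # so updates made in a worker are visible to the parent as well
        self.started = started_flag

    def method(self):
        print("Started...")
        self.started.value = True

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        a = A(manager.Value('b', False))
        with ProcessPoolExecutor() as executor:
            executor.submit(a.method).result()  # .result() re-raises any worker exception
        assert a.started.value  # passes: the flag was mutated through the proxy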
I want to execute some tasks in parallel in multiple subprocesses and time out if the tasks were not completed within some delay.
A first approach consists of forking and joining the subprocesses individually, with remaining timeouts computed with respect to the global timeout, as suggested in this answer. It works fine for me.
A second approach, which I want to use here, consists of creating a pool of subprocesses and waiting with the global timeout, as suggested in this answer.
However I have a problem with the second approach: after feeding the pool of subprocesses with tasks that have multiprocessing.Event() objects, waiting for their completion raises this exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
Here is the Python code snippet:
import multiprocessing.pool
import time

class Worker:
    def __init__(self):
        self.event = multiprocessing.Event()  # commenting this out removes the RuntimeError

    def work(self, x):
        time.sleep(1)
        return x * 10

if __name__ == "__main__":
    pool_size = 2
    timeout = 5
    with multiprocessing.pool.Pool(pool_size) as pool:
        result = pool.map_async(Worker().work, [4, 5, 2, 7])
        print(result.get(timeout))  # raises the RuntimeError
In the "Programming guidlines" section of the multiprocessing — Process-based parallelism documentation, there is this paragraph:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So multiprocessing.Event() caused a RuntimeError because it is not picklable, as demonstrated by the following Python code snippet:
import multiprocessing
import pickle
pickle.dumps(multiprocessing.Event())
which raises the same exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
A solution is to use a proxy object:
A proxy is an object which refers to a shared object which lives (presumably) in a different process.
because:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
multiprocessing.Manager().Event() creates a shared threading.Event() object and returns a proxy for it, so replacing this line:
self.event = multiprocessing.Event()
with the following line in the Python code snippet of the question solves the problem:
self.event = multiprocessing.Manager().Event()
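As a quick sanity check (a sketch mirroring the failing snippet above, not code from the question), the proxy version pickles without raising:

import multiprocessing
import pickle

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    proxy_event = manager.Event()      # a proxy to an Event living in the manager process
    data = pickle.dumps(proxy_event)   # succeeds, unlike pickle.dumps(multiprocessing.Event())
    print(type(proxy_event), len(data), "bytes")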
I am currently creating a class which is supposed to execute some methods in a multi-threaded way, using the multiprocessing module. I execute the real computation using a Pool of n workers. Now I wanted to assign each of the currently n active workers an index between 0 and n for some other calculation. To do this, I wanted to use a shared Queue to assign an index in such a way that at any time no two workers have the same id. To share the same Queue inside the class between the different threads, I wanted to store it inside a Manager.Namespace(). But doing this, I got some problems with the Queue. Therefore, I created a minimal version of my problem and ended up with something like this:
from multiprocess import Process, Queue, Manager, Pool, cpu_count

class A(object):
    def __init__(self):
        manager = Manager()
        self.ns = manager.Namespace()
        self.ns.q = manager.Queue()

    def foo(self):
        for i in range(10):
            print(i)
            self.ns.q.put(i)
            print(self.ns.q.get())
            print(self.ns.q.qsize())

a = A()
a.foo()
In this code, the execution stops before the second print statement - therefore, I think that no data is actually written to the Queue. When I remove the namespace-related code, everything works flawlessly. Is this the intended behaviour of the multiprocessing objects, and am I doing something wrong? Or is this some kind of bug?
Yes, you should not use Namespace here. When you put a Queue object into manager.Namespace(), each process gets a new Queue instance, and all the writers/readers of those newly created queue objects have no connection to the parent process, so no message will be received by the worker processes. Share a Queue directly instead.
By the way, you mentioned "thread" many times, but in the context of the multiprocess module a worker is a process, not a thread.
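A minimal sketch of that fix, keeping the shape of the class from the question (shown here with the standard multiprocessing module; the managed queue is held directly on the instance instead of inside a Namespace):

from multiprocessing import Manager

class A(object):
    def __init__(self):
        manager = Manager()
        self.q = manager.Queue()  # a queue proxy: picklable and shared via the manager process

    def foo(self):
        for i in range(10):
            print(i)
            self.q.put(i)
            print(self.q.get())
            print(self.q.qsize())

if __name__ == "__main__":
    a = A()
    a.foo()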
I'm kind of new to multiprocessing. However, assume that we have a program as below. The program seems to work fine. Now to the question. In my opinion we will have 4 instances of SomeKindOfClass with the same name (a) at the same time. How is that possible? Moreover, is there a potential risk with this kind of programming?
from multiprocessing.dummy import Pool
import numpy
from theFile import SomeKindOfClass

n = 8
allOutputs = numpy.zeros(n)

def work(index):
    a = SomeKindOfClass()
    a.theSlowFunction()
    allOutputs[index] = a.output

pool = Pool(processes=4)
pool.map(work, range(0, n))
The name a is only local in scope within your work function, so there is no conflict of names here. Internally, Python keeps track of each class instance with a unique identifier. If you wanted to check this, you could inspect the object id using the id function:
print(id(a))
I don't see any issues with your code.
Actually, you will have 8 instances of SomeKindOfClass (one for each call to work), but only 4 will ever be active at the same time.
multiprocessing vs multiprocessing.dummy
Your program will only work if you continue to use the multiprocessing.dummy module, which is just a wrapper around the threading module. You are still using "python threads" (not separate processes). "Python threads" share the same global state; "Processes" don't. Python threads also share the same GIL, so they're still limited to running one python bytecode statement at a time, unlike processes, which can all run python code simultaneously.
If you were to change your import to from multiprocessing import Pool, you would notice that the allOutputs array remains unchanged after all the workers finish executing (also, you would likely get an error because you're creating the pool in the global scope; you should probably put that inside a main() function). This is because multiprocessing makes a new copy of the entire global state when it makes a new process. When a worker modifies the global allOutputs, it modifies a copy of that initial global state. When the process ends, nothing is returned to the main process, and the global state of the main process remains unchanged.
Sharing State Between Processes
Unlike threads, processes don't share the same memory.
If you want to share state between processes, you have to explicitly declare shared variables and pass them to each process, or use pipes or some other method to allow the worker processes to communicate with each other or with the main process.
There are several ways to do this, but perhaps the simplest is using the Manager class
import multiprocessing

def worker(args):
    index, array = args
    a = SomeKindOfClass()
    a.some_expensive_function()
    array[index] = a.output

def main():
    n = 8
    manager = multiprocessing.Manager()
    array = manager.list([0] * n)
    pool = multiprocessing.Pool(4)
    pool.map(worker, [(i, array) for i in range(n)])
    print(array)

if __name__ == "__main__":
    main()
You can declare class instances inside the pool workers, because each instance has a separate place in memory so they don't conflict. The problem is if you declare a class instance first, then try to pass that one instance into multiple pool workers. Then each worker has a pointer to the same place in memory, and it will fail (this can be handled, just not this way).
Basically pool workers must not have overlapping memory anywhere. As long as the workers don't try to share memory somewhere, or perform operations that may result in collisions (like printing to the same file), there shouldn't be any problem.
Make sure whatever they're supposed to do (like something you want printed to a file, or added to a broader namespace somewhere) is returned as a result at the end, which you then iterate through.
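A short sketch of that "return results instead of sharing state" pattern (my own illustration; SomeKindOfClass and theSlowFunction are the stand-ins from the question):

from multiprocessing import Pool

from theFile import SomeKindOfClass  # stand-in import from the question

def work(index):
    a = SomeKindOfClass()   # each worker process builds its own instance
    a.theSlowFunction()
    return a.output         # return the result instead of writing to shared state

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        allOutputs = pool.map(work, range(8))  # results come back in input order
    print(allOutputs)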
If you are using multiprocessing you shouldn't worry: processes don't share memory (by default). So there is no risk in having several independent objects of class SomeKindOfClass; each of them will live in its own process. How does it work? Python runs your program and then starts 4 child processes. That's why it's very important to have the if __name__ == '__main__' guard before pool.map(work, range(0, n)). Otherwise you will get an infinite loop of process creation.
Problems could arise if SomeKindOfClass keeps state on disk, for example writing something to a file or reading from it.
I have a few basic questions when it comes to using Python's multiprocessing module:
import multiprocessing

class Someparallelworkerclass(object):

    def __init__(self):
        self.num_workers = 4
        self.work_queue = multiprocessing.JoinableQueue()
        self.result_queue = multiprocessing.JoinableQueue()

    def someparallellazymethod(self):
        p = multiprocessing.Process(target=self.worktobedone)
        p.start()

    def worktobedone(self):
        # get data from work_queue
        # put back result in result_queue
        pass
Is it necessary to pass work_queue and result_queue as args to Process? Does the answer depend on the OS? The more fundamental question is: does the child process get a copied (COW) address space from the parent process, and hence know the definition of the class and its methods? If so, how does it know that the queues are to be shared for IPC, and that it shouldn't make duplicates of work_queue and result_queue in the child process? I tried searching this online, but most of the documentation I found was vague and didn't go into enough detail about what exactly is happening underneath.
It's actually not necessary to include the queues in the args argument in this case, no matter what platform you're using. The reason is that even though it doesn't look like you're explicitly passing the two JoinableQueue instances to the child, you actually are - via self. Because self is explicitly being passed to the child, and the two queues are a part of self, they end up being passed along to the child.
On Linux, this happens via os.fork(), which means that file descriptors used by the multiprocessing.connection.Connection objects that the Queue uses internally for inter-process communication are inherited by the child (not copied). Other parts of the Queue become copy-on-write, but that's ok; multiprocessing.Queue is designed so that none of the pieces that need to be copied actually need to stay in sync between the two processes. In fact, many of the internal attributes get reset after the fork occurs:
def _after_fork(self):
    debug('Queue._after_fork()')
    self._notempty = threading.Condition(threading.Lock())
    self._buffer = collections.deque()
    self._thread = None
    self._jointhread = None
    self._joincancelled = False
    self._closed = False
    self._close = None
    self._send = self._writer.send  # _writer is a Connection object
    self._recv = self._reader.recv
    self._poll = self._reader.poll
So that covers Linux. How about Windows? Windows doesn't have fork, so it will need to pickle self to send it to the child, and that includes pickling our Queues. Now, normally if you try to pickle a multiprocessing.Queue, it fails:
>>> import multiprocessing
>>> q = multiprocessing.Queue()
>>> import pickle
>>> pickle.dumps(q)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/local/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
    dict = getstate()
  File "/usr/local/lib/python2.7/multiprocessing/queues.py", line 77, in __getstate__
    assert_spawning(self)
  File "/usr/local/lib/python2.7/multiprocessing/forking.py", line 52, in assert_spawning
    ' through inheritance' % type(self).__name__
RuntimeError: Queue objects should only be shared between processes through inheritance
But this is actually an artificial limitation. multiprocessing.Queue objects can be pickled in some cases - how else could they be sent to child processes in Windows? And indeed, we can see that if we look at the implementation:
def __getstate__(self):
    assert_spawning(self)
    return (self._maxsize, self._reader, self._writer,
            self._rlock, self._wlock, self._sem, self._opid)

def __setstate__(self, state):
    (self._maxsize, self._reader, self._writer,
     self._rlock, self._wlock, self._sem, self._opid) = state
    self._after_fork()
__getstate__, which is called when pickling an instance, has an assert_spawning call in it, which makes sure we're actually spawning a process while attempting the pickle*. __setstate__, which is called while unpickling, is responsible for calling _after_fork.
So how are the Connection objects used by the queues maintained when we have to pickle? It turns out there's a multiprocessing sub-module that does exactly that - multiprocessing.reduction. The comment at the top of the module states it pretty clearly:
#
# Module to allow connection and socket objects to be transferred
# between processes
#
On Windows, the module ultimately uses the DuplicateHandle API provided by Windows to create a duplicate handle that the child process' Connection object can use. So while each process gets its own handle, they're exact duplicates - any action made on one is reflected on the other:
The duplicate handle refers to the same object as the original handle. Therefore, any changes to the object are reflected through both handles. For example, if you duplicate a file handle, the current file position is always the same for both handles.
* See this answer for more information about assert_spawning
The child process doesn't have the queues in its closure. Its instances of the queues reference different areas of memory. When using queues the way you intend, you must pass them as args to the function. One solution I like is to use functools.partial to curry your functions with the queues you want, adding them permanently to their closure and letting you spin up multiple processes to perform the same task with the same IPC channel.
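A hedged sketch of that functools.partial idea (my own example, not the asker's class): the queues are bound into the target once, and every process started with that partial talks over the same IPC channel.

import functools
import multiprocessing

def worktobedone(work_queue, result_queue):
    # get data from work_queue, put back the result in result_queue
    item = work_queue.get()
    result_queue.put(item * 2)  # placeholder "work"
    work_queue.task_done()

if __name__ == "__main__":
    work_queue = multiprocessing.JoinableQueue()
    result_queue = multiprocessing.JoinableQueue()
    bound_work = functools.partial(worktobedone, work_queue, result_queue)

    work_queue.put(21)
    p = multiprocessing.Process(target=bound_work)
    p.start()
    work_queue.join()          # returns once the child has called task_done()
    print(result_queue.get())  # -> 42
    p.join()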
The child process does not get a copied address space. The child is a completely separate python process with nothing shared. Yes, you have to pass the queues to the child. When you do so, multiprocessing automatically handles the sharing via IPC. See https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes.