ProcessPoolExecutor does not mutate instance variable when submitting instance method - python

Given an instance method that mutates an instance variable, submitting that method to a ProcessPoolExecutor does run it, but the instance variable is not mutated.
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        self.started = True

if __name__ == "__main__":
    a = A()
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
    assert a.started
Started...
Traceback (most recent call last):
File "/path/to/file", line 19, in <module>
assert a.started
AssertionError
Are only pure functions allowed in ProcessPoolExecutor?

For Windows
Multiprocessing does not share its state with child processes on Windows systems, because the default way to start child processes on Windows is spawn. From the documentation for the spawn start method:
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver
Therefore, when you pass any objects to child processes, they are actually copied, and do not have the same memory address as in the parent process. A simple way to demonstrate this through your example would be to print the objects in the child process and the parent process:
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        print(f'Child proc: {self}')
        self.started = True

if __name__ == "__main__":
    a = A()
    print(f'Parent proc: {a}')
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
Output
Parent proc: <__main__.A object at 0x0000028F44B40FD0>
Started...
Child proc: <__mp_main__.A object at 0x0000019D2B8E64C0>
As you can see, the two objects reside at different places in memory. Altering one does not affect the other at all, which is why you don't see any change to a.started in the parent process.
Once you understand this, your question becomes how to share the same object with the child processes rather than copies. There are numerous ways to go about this, and questions about sharing complex objects like a have already been asked and answered on Stack Overflow.
For UNIX
Much the same applies to the other start methods that UNIX-based systems have the option of using (I am not sure which one concurrent.futures defaults to on OSX). For example, the multiprocessing documentation explains fork as follows:
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
So fork creates child processes that start out sharing the entire memory space of the parent process, but it does so using copy-on-write. This means that if you modify a shared object from within the child process, the child gets its own private copy of that object, so the parent process is not affected (much like what spawn does up front).
Hence the answer still stands: if you plan to modify the objects passed to the child process, or if you are not on a UNIX system, you will need to explicitly share the objects yourself (for example through a manager) so that changes made in one process are visible in the other.
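A minimal sketch of one such approach, assuming a Manager-backed namespace is acceptable (the shared attribute name and the Namespace choice are illustrative, not from the question): attribute writes go through a proxy to the manager process, so the parent sees the child's change.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

class A:
    def __init__(self, shared):
        # 'shared' is a Manager Namespace proxy; attribute writes are
        # forwarded to the manager process instead of staying local
        self.shared = shared
        self.shared.started = False

    def method(self):
        print("Started...")
        self.shared.started = True

if __name__ == "__main__":
    with Manager() as manager:
        a = A(manager.Namespace())
        with ProcessPoolExecutor() as executor:
            executor.submit(a.method).result()
        assert a.shared.started  # passes: the state lives in the manager process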
Further reading on start methods.

Related

Multiprocessing issue: function calling [duplicate]

This question already has answers here:
Appending to the same list from different processes using multiprocessing
(3 answers)
Closed 2 years ago.
from multiprocessing import Process

a = []

def one():
    for i in range(3):
        a.append(i)

def main():
    p1 = Process(target=one)
    p1.start()

if __name__ == '__main__':
    main()
    print('After calling from Multi-process')
    print(a)
    one()
    print('Calling outside Multi-process')
    print(a)
Output:
After calling from Multi-process
[]
Calling outside Multi-process
[0, 1, 2]
Why are elements not appended to a when the function one is called from a Process?
multiprocessing.Process creates a subprocess in which all relevant memory is copied and then modified separately; it does not share the memory locations of global variables with the parent.
If you really want to make this work, you can use threading instead of Process. Threads do share the global memory locations rather than working on their own copies of the global variables.
Do from threading import Thread and p1=Thread(target=one) instead, as in the sketch below.
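A minimal sketch of that change applied to the question's code (the join() call is an addition, so the appends are guaranteed to have happened before printing):
from threading import Thread

a = []

def one():
    for i in range(3):
        a.append(i)

if __name__ == '__main__':
    p1 = Thread(target=one)
    p1.start()
    p1.join()  # threads share the module-level list, so a is visible here
    print('After calling from a Thread')
    print(a)   # [0, 1, 2]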
I checked the documentation and I have some understanding, which may be helpful to you.
The Python documentation has the following description of the start methods used by the multiprocessing package:
Contexts and start methods
Depending on the platform, multiprocessing supports three ways to start a process. These start methods are
spawn
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
Available on Unix and Windows. The default on Windows and macOS.
fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
Available on Unix only. The default on Unix.
forkserver
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Available on Unix platforms which support passing file descriptors over Unix pipes.
From this description we can see that, on Unix, the Process class uses os.fork by default to run the method specified by target.
os.fork copies all of the parent process's objects for the child to use; resources such as file descriptors are inherited rather than duplicated.
Therefore, the list a that the one function operates on in the child process lives in the child's own memory space, and appending to it does not change the list a owned by the parent process.
To verify this, we can simply modify the one function (adding import os for the pid helpers), for example:
import os

def one():
    for i in range(3):
        a.append(i)
        print('pid:', os.getpid(), 'ppid:', os.getppid(), 'list:', a)
Then we run this script again, and we will get the following results:
After calling from Multi-process
[] // Print directly in the parent process
pid: 6990 ppid: 1419 list: [0] // Call the `one` method in the parent process
pid: 6990 ppid: 1419 list: [0, 1] // Call the `one` method in the parent process
pid: 6990 ppid: 1419 list: [0, 1, 2] // Call the `one` method in the parent process
Calling outside Multi-process
[0, 1, 2] // Print directly in the parent process
pid: 6991 ppid: 6990 list: [0] // Call the `one` method in the child process
pid: 6991 ppid: 6990 list: [0, 1] // Call the `one` method in the child process
pid: 6991 ppid: 6990 list: [0, 1, 2] // Call the `one` method in the child process
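If the list really must be filled from another process (rather than switching to threads), one option is a manager-backed list; a minimal sketch, not taken from either answer above:
from multiprocessing import Manager, Process

def one(shared):
    for i in range(3):
        shared.append(i)

if __name__ == '__main__':
    with Manager() as manager:
        a = manager.list()  # proxy to a list living in the manager process
        p1 = Process(target=one, args=(a,))
        p1.start()
        p1.join()
        print(list(a))      # [0, 1, 2]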

Child process not inheriting Python global variable on Windows

I know that child processes won't see changes made after a fork/spawn, and Windows processes don't inherit globals not using shared memory. But what I have is a situation where the children can't see changes to a global variable in shared memory made before the fork/spawn.
Simple demonstration:
from multiprocessing import Process, Value

global foo
foo = Value('i', 1)

def printfoo():
    global foo
    with foo.get_lock():
        print(foo.value)

if __name__ == '__main__':
    with foo.get_lock():
        foo.value = 2
    Process(target=printfoo).start()
On Linux and MacOS, this displays the expected 2. On Windows, it displays 1, even though the modification to the global Value is made before the call to Process. How can I make the change visible to the child process on Windows, too?
The problem here is that your child process creates a new shared value, rather than using the one the parent created. Your parent process needs to explicitly send the Value to the child, for example, as an argument to the target function:
from multiprocessing import Process, Value

def use_shared_value(val):
    val.value = 2

if __name__ == '__main__':
    val = Value('i', 1)
    p = Process(target=use_shared_value, args=(val,))
    p.start()
    p.join()
    print(val.value)
(Unfortunately, I don't have a Windows Python install to test this on.)
Child processes cannot inherit globals on Windows, regardless of whether those globals are initialized to multiprocessing.Value instances. multiprocessing.Value does not change the fact that the child re-executes your file, and re-executing the Value construction doesn't use the shared resources the parent allocated.
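A small sketch (an illustrative variant of the question's code, not from the answer) that makes this re-execution visible on any OS by forcing the spawn start method: the child re-imports the module, rebuilds foo from scratch, and therefore prints 1.
from multiprocessing import Value, get_context

foo = Value('i', 1)   # re-created independently in every spawned child

def printfoo():
    with foo.get_lock():
        print(foo.value)

if __name__ == '__main__':
    ctx = get_context('spawn')       # force Windows-like behaviour everywhere
    with foo.get_lock():
        foo.value = 2                # only the parent's copy is changed
    p = ctx.Process(target=printfoo)
    p.start()
    p.join()                         # the child prints 1, not 2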

How to fork and join multiple subprocesses with a global timeout in Python?

I want to execute some tasks in parallel in multiple subprocesses and time out if the tasks were not completed within some delay.
A first approach consists of forking and joining the subprocesses individually, with the remaining timeouts computed with respect to the global timeout, as suggested in this answer. It works fine for me.
A second approach, which I want to use here, consists of creating a pool of subprocesses and waiting with the global timeout, as suggested in this answer.
However I have a problem with the second approach: after feeding the pool of subprocesses with tasks that have multiprocessing.Event() objects, waiting for their completion raises this exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
Here is the Python code snippet:
import multiprocessing.pool
import time

class Worker:
    def __init__(self):
        self.event = multiprocessing.Event()  # commenting this removes the RuntimeError

    def work(self, x):
        time.sleep(1)
        return x * 10

if __name__ == "__main__":
    pool_size = 2
    timeout = 5
    with multiprocessing.pool.Pool(pool_size) as pool:
        result = pool.map_async(Worker().work, [4, 5, 2, 7])
        print(result.get(timeout))  # raises the RuntimeError
In the "Programming guidelines" section of the multiprocessing — Process-based parallelism documentation, there is this paragraph:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So multiprocessing.Event() caused a RuntimeError because it is not picklable, as demonstrated by the following Python code snippet:
import multiprocessing
import pickle
pickle.dumps(multiprocessing.Event())
which raises the same exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
A solution is to use a proxy object:
A proxy is an object which refers to a shared object which lives (presumably) in a different process.
because:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
multiprocessing.Manager().Event() creates a shared threading.Event() object and returns a proxy for it, so replacing this line:
self.event = multiprocessing.Event()
by the following line in the Python code snippet of the question solves the problem:
self.event = multiprocessing.Manager().Event()
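For completeness, a sketch of the question's snippet with that single line changed and nothing else:
import multiprocessing
import multiprocessing.pool
import time

class Worker:
    def __init__(self):
        # a picklable proxy to an Event that lives in the manager process
        self.event = multiprocessing.Manager().Event()

    def work(self, x):
        time.sleep(1)
        return x * 10

if __name__ == "__main__":
    pool_size = 2
    timeout = 5
    with multiprocessing.pool.Pool(pool_size) as pool:
        result = pool.map_async(Worker().work, [4, 5, 2, 7])
        print(result.get(timeout))  # [40, 50, 20, 70]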

Do I need to explicitly pass multiprocessing.Queue instance variables to a child Process executing on an instance method?

I have a few basic questions when it comes to using Python's multiprocessing module:
import multiprocessing

class Someparallelworkerclass(object):

    def __init__(self):
        self.num_workers = 4
        self.work_queue = multiprocessing.JoinableQueue()
        self.result_queue = multiprocessing.JoinableQueue()

    def someparallellazymethod(self):
        p = multiprocessing.Process(target=self.worktobedone)
        p.start()

    def worktobedone(self):
        # get data from work_queue
        # put back result in result queue
        pass
Is it necessary to pass work_queue and result_queue as args to Process? Does the answer depend on the OS? The more fundamental question is: does the child process get a copied (COW) address space from the parent process, and hence know the definition of the class and the class method? If so, how does it know that the queues are to be shared for IPC, and that it shouldn't make duplicates of work_queue and result_queue in the child process? I tried searching online, but most of the documentation I found was vague and didn't go into enough detail about what exactly happens underneath.
It's actually not necessary to include the queues in the args argument in this case, no matter what platform you're using. The reason is that even though it doesn't look like you're explicitly passing the two JoinableQueue instances to the child, you actually are - via self. Because self is explicitly being passed to the child, and the two queues are a part of self, they end up being passed along to the child.
On Linux, this happens via os.fork(), which means that file descriptors used by the multiprocessing.connection.Connection objects that the Queue uses internally for inter-process communication are inherited by the child (not copied). Other parts of the Queue become copy-on-write, but that's ok; multiprocessing.Queue is designed so that none of the pieces that need to be copied actually need to stay in sync between the two processes. In fact, many of the internal attributes get reset after the fork occurs:
def _after_fork(self):
    debug('Queue._after_fork()')
    self._notempty = threading.Condition(threading.Lock())
    self._buffer = collections.deque()
    self._thread = None
    self._jointhread = None
    self._joincancelled = False
    self._closed = False
    self._close = None
    self._send = self._writer.send  # _writer is a multiprocessing.connection.Connection
    self._recv = self._reader.recv
    self._poll = self._reader.poll
So that covers Linux. How about Windows? Windows doesn't have fork, so it will need to pickle self to send it to the child, and that includes pickling our Queues. Now, normally if you try to pickle a multiprocessing.Queue, it fails:
>>> import multiprocessing
>>> q = multiprocessing.Queue()
>>> import pickle
>>> pickle.dumps(q)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/usr/local/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/local/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/local/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
dict = getstate()
File "/usr/local/lib/python2.7/multiprocessing/queues.py", line 77, in __getstate__
assert_spawning(self)
File "/usr/local/lib/python2.7/multiprocessing/forking.py", line 52, in assert_spawning
' through inheritance' % type(self).__name__
RuntimeError: Queue objects should only be shared between processes through inheritance
But this is actually an artificial limitation. multiprocessing.Queue objects can be pickled in some cases - how else could they be sent to child processes in Windows? And indeed, we can see that if we look at the implementation:
def __getstate__(self):
    assert_spawning(self)
    return (self._maxsize, self._reader, self._writer,
            self._rlock, self._wlock, self._sem, self._opid)

def __setstate__(self, state):
    (self._maxsize, self._reader, self._writer,
     self._rlock, self._wlock, self._sem, self._opid) = state
    self._after_fork()
__getstate__, which is called when pickling an instance, has an assert_spawning call in it, which makes sure we're actually spawning a process while attempting the pickle*. __setstate__, which is called while unpickling, is responsible for calling _after_fork.
So how are the Connection objects used by the queues maintained when we have to pickle? It turns out there's a multiprocessing sub-module that does exactly that - multiprocessing.reduction. The comment at the top of the module states it pretty clearly:
#
# Module to allow connection and socket objects to be transferred
# between processes
#
On Windows, the module ultimately uses the DuplicateHandle API provided by Windows to create a duplicate handle that the child process' Connection object can use. So while each process gets its own handle, they're exact duplicates - any action made on one is reflected on the other:
The duplicate handle refers to the same object as the original handle.
Therefore, any changes to the object are reflected through both
handles. For example, if you duplicate a file handle, the current file
position is always the same for both handles.
* See this answer for more information about assert_spawning
The child process doesn't have the queues in its closure. Its instances of the queues reference different areas of memory. When using queues the way you intend, you must pass them as args to the function. One solution I like is to use functools.partial to curry your functions with the queues you want, adding them permanently to their closure and letting you spin up multiple processes to perform the same task with the same IPC channel.
The child process does not get a copied address space. The child is a completely separate python process with nothing shared. Yes, you have to pass the queues to the child. When you do so, multiprocessing automatically handles the sharing via IPC. See https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes.
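A minimal, self-contained sketch of passing the queues explicitly (the function name and item values are illustrative); because the queues travel through the args of Process, this works with any start method:
import multiprocessing

def worktobedone(work_queue, result_queue):
    item = work_queue.get()      # get data from work_queue
    result_queue.put(item * 2)   # put back result in result queue
    work_queue.task_done()

if __name__ == '__main__':
    work_queue = multiprocessing.JoinableQueue()
    result_queue = multiprocessing.JoinableQueue()
    work_queue.put(21)
    p = multiprocessing.Process(target=worktobedone,
                                args=(work_queue, result_queue))
    p.start()
    work_queue.join()            # returns once task_done() has been called
    print(result_queue.get())    # 42
    p.join()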

python multiprocessing JoinableQueue PicklingError

Sorry... it seems that I asked a popular question, but I cannot find anything helpful for my case on Stack Overflow :P
So my code does the following:
Step 1: the parent process writes task objects into a multiprocessing.JoinableQueue.
Step 2: the child processes (more than one) read (get) the task objects from the JoinableQueue and execute the tasks.
My module structure is:
A.py
    Class Task(object)
    Class WorkerPool(object)
    Class Worker(multiprocessing.Process)
        def run()  # here step 2 is executed
    Class TestGroup()
        def loadTest()  # here step 1 above is executed, i.e. the Task objects are appended
What I understand is that when mp.JoinableQueue is used, the objects appended should be picklable; I got the meaning of "picklable" from https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
My questions are:
1. Is the Task object picklable in my case?
I got the error below when the code appends task objects into the JoinableQueue:
File "/usr/lib/python2.6/multiprocessing/queues.py", line 242, in _feed
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object1
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object2
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object3
send(obj)
PicklingError: Can't pickle : attribute lookup pysphere.resources.VimService_services_types.DynamicData_Holder failed
What's the general usage of mp.JoinableQueue? In my case, I need to use join() and task_done().
When I choose to use Queue.Queue instead of mp.JoinableQueue, the pickling error is gone. However, checking the log, I found that all the child processes keep working on the first object in the queue. What's the possible reason for this?
The multiprocessing module in Python starts multiple processes to run your tasks. Since processes do not share memory, they need to be able to communicate using serialized data. multiprocessing uses the pickle module to do the serialization, thus the requirement that the objects you are passing to the tasks be picklable.
1) Your task object seems to contain an instance from pysphere.resources.VimService_services_types. This is probably a reference to a system resource, such as an open file or connection. It cannot be serialized and passed from one process to another, and therefore it causes the pickling error.
What you can do with mp.JoinableQueue is pass the arguments the task needs, and have the child process construct the task and start the service inside it, so that the service is local to that process.
For example:
queue = mp.JoinableQueue()
# not queue.put(task), since the new process will create the task itself
queue.put(task_args)

def f(queue):
    # build the task inside the child, so the non-picklable service is local to it
    task_args = queue.get()
    task = Task(task_args)
    ...
    queue.task_done()
    # you can't return the task itself unless you've closed all non-serializable parts
    return task.result

process = mp.Process(target=f, args=(queue,))
...
2) Queue.Queue is meant for threading. It relies on the threads of a single process sharing memory, plus synchronization primitives, to provide atomic operations. However, when you start a new process with multiprocessing, the child gets a copy of the parent's memory, so each child works on its own copy of the queue containing the same items; that is why all of your child processes appear to keep working on the same first object.
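A small demonstration of that copying effect, assuming the default fork start method on Linux (the names and items are illustrative, not from the question):
import multiprocessing
import queue  # Queue.Queue in Python 2

def worker(q):
    # under fork, each child has its own private copy of q with all three
    # items still in it, so every child's first get() returns 'first'
    print(multiprocessing.current_process().name, 'got', q.get())

if __name__ == '__main__':
    q = queue.Queue()
    for item in ('first', 'second', 'third'):
        q.put(item)
    procs = [multiprocessing.Process(target=worker, args=(q,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # every process prints 'first'; a multiprocessing.JoinableQueue would
    # hand out distinct items instead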
