multiprocessing ignores "__setstate__" - python

I assumed that the multiprocessing package used pickle to send things between processes. However, pickle pays attention to the __getstate__ and __setstate__ methods of an object. Multiprocessing seems to ignore them. Is this correct? Am I confused?
To replicate, install Docker and type the following at the command line:
$ docker run python:3.4 python -c "import pickle
import multiprocessing
import os

class Tricky:
    def __init__(self, x):
        self.data = x
    def __setstate__(self, d):
        self.data = 10
    def __getstate__(self):
        return {}

def report(ar, q):
    print('running report in pid %d, hailing from %d' % (os.getpid(), os.getppid()))
    q.put(ar.data)

print('module loaded in pid %d, hailing from pid %d' % (os.getpid(), os.getppid()))

if __name__ == '__main__':
    print('hello from pid %d' % os.getpid())
    ar = Tricky(5)
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=report, args=(ar, q))
    p.start()
    p.join()
    print(q.get())
    print(pickle.loads(pickle.dumps(ar)).data)"
You should get something like
module loaded in pid 1, hailing from pid 0
hello from pid 1
running report in pid 5, hailing from 1
5
10
I would have thought it would have been "10" "10", but instead it is "5" "10". What could this mean?
(note: code edited to comply with programming guidelines, as suggested by user3667217)

The multiprocessing module can start processes in one of three ways: spawn, fork, or forkserver. By default on Unix, it forks. That means there's no need to pickle anything that's already loaded into RAM at the moment the new process is born.
If you need more direct control over how the new process is created, change the start method to spawn. To do this, create a context
ctx = multiprocessing.get_context('spawn')
and replace all calls to multiprocessing.foo() with calls to ctx.foo(). When you do this, every new process is born as a fresh Python instance; everything that gets sent into it is sent via pickle, instead of a direct memory copy.
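For example, a minimal sketch applied to the Tricky example above (untested; it assumes Tricky and report are defined at module level in an importable file, which spawn requires):

import multiprocessing

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    ar = Tricky(5)
    q = ctx.Queue()
    p = ctx.Process(target=report, args=(ar, q))
    p.start()
    p.join()
    print(q.get())  # 10 under spawn: the argument took a pickle round trip, so __setstate__ ran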

Reminder: when you're using multiprocessing, you need to start a process inside an if __name__ == '__main__': block (see the programming guidelines):
import pickle
import multiprocessing

class Tricky:
    def __init__(self, x):
        self.data = x
    def __setstate__(self, d):
        print('setstate happening')
        self.data = 10
    def __getstate__(self):
        print('getstate happening')
        return self.data

def report(ar, q):
    q.put(ar.data)

if __name__ == '__main__':
    ar = Tricky(5)
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=report, args=(ar, q))
    print('now starting process')
    p.start()
    print('now joining process')
    p.join()
    print('now getting results from queue')
    print(q.get())
    print('now getting pickle dumps')
    print(pickle.loads(pickle.dumps(ar)).data)
On Windows, I see:
now starting process
now joining process
setstate happening
now getting results from queue
10
now getting pickle dumps
setstate happening
10
On Ubuntu, I see:
now starting process
now joining process
now getting results from queue
5
now getting pickle dumps
getstate happening
setstate happening
10
I suppose this should answer your question: multiprocessing invokes the __setstate__ method on Windows but not on Linux. And on Linux, the pickle.loads(pickle.dumps(ar)) round trip first calls __getstate__, then __setstate__. It's interesting to see how the multiprocessing module behaves differently on different platforms.

Related

How is the multiprocessing.Queue instance serialized when passed as an argument to a multiprocessing.Process?

A related question came up at Why I can't use multiprocessing.Queue with ProcessPoolExecutor?. I provided a partial answer along with a workaround but admitted that the question raises another question, namely why a multiprocessing.Queue instance can be passed as the argument to a multiprocessing.Process worker function.
For example, the following code fails under platforms that use either the spawn or fork method of creating new processes:
from multiprocessing import Pool, Queue

def worker(q):
    print(q.get())

with Pool(1) as pool:
    q = Queue()
    q.put(7)
    pool.apply(worker, args=(q,))
The above raises:
RuntimeError: Queue objects should only be shared between processes through inheritance
Yet the following program runs without a problem:
from multiprocessing import Process, Queue

def worker(q):
    print(q.get())

q = Queue()
q.put(7)
p = Process(target=worker, args=(q,))
p.start()
p.join()
It appears that arguments to a multiprocessing pool worker function ultimately get put on the pool's input queue, which is implemented as a multiprocessing.Queue, and you cannot put a multiprocessing.Queue instance onto a multiprocessing.Queue instance, which uses a ForkingPickler for serialization.
So how is the multiprocessing.Queue serialized when passed as an argument to a multiprocessing.Process that allows it to be used in this way?
I wanted to expand on the accepted answer so I added my own which also details a way to make queues, locks, etc. picklable and able to be sent through a pool.
Why this happens
Basically, it's not that Queues cannot be serialized; it's that multiprocessing is only equipped to serialize them when it knows sufficient information about the target process they will be sent to (whether that is the current process or some other process). That is why it works when you are spawning a process yourself (using the Process class) but not when you are simply putting the queue on another queue (as happens when using a Pool).
Look over the source code for multiprocessing.queues.Queue (or other connection objects like Condition). You'll find that in their __getstate__ method (the method called when a Queue instance is being pickled), there is a call to the function multiprocessing.context.assert_spawning. This "assertion" only passes if the current thread is spawning a process. If that is not the case, multiprocessing raises the error you see and quits.
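A minimal sketch of that failure, for illustration: pickling a Queue by hand, outside of process creation, trips the same assertion.

import pickle
from multiprocessing import Queue

if __name__ == '__main__':
    q = Queue()
    try:
        pickle.dumps(q)  # Queue.__getstate__ calls assert_spawning, which raises
    except RuntimeError as e:
        print(e)  # Queue objects should only be shared between processes through inheritance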
Now, the reason multiprocessing does not even bother to pickle the queue when the assertion fails is that it does not have access to the Popen object created when a thread creates a subprocess (on Windows, you can find this at multiprocessing.popen_spawn_win32.Popen). This object stores data about the target process, including its pid and process handle. Multiprocessing requires this information because a Queue contains mutexes, and to successfully pickle and later rebuild them, multiprocessing must call DuplicateHandle through winapi with the information from the Popen object. Without this object present, multiprocessing does not know what to do and raises an error. So this is where our problem lies, but it is fixable if we can teach multiprocessing a different approach: steal the duplicated handles from inside the target process itself, without ever requiring the target's information in advance.
Making Picklable Queues
Pay attention to the class multiprocessing.synchronize.SemLock. It's the base class for all multiprocessing locks, so its objects are present in queues, pipes, and so on. The way it's currently pickled is as described above: it requires the target process's handle to create a duplicate handle. However, we can instead define a __reduce__ method for SemLock in which we create a duplicate handle using the current process's handle, and then, from the target process, duplicate the previously created handle, which will now be valid in the target process's context. It's quite a mouthful, but a similar approach is actually used to pickle PipeConnection objects as well, except that it uses a dispatch table instead of a __reduce__ method.
Once this is done, we can subclass Queue and remove the call to assert_spawning, since it will no longer be required. This way, we can successfully pickle locks, queues, pipes, etc. Here's the code, with examples:
import os, pickle
from multiprocessing import Pool, Lock, synchronize, get_context
import multiprocessing.queues
import _winapi


def work(q):
    print("Worker: Main says", q.get())
    q.put('haha')


class DupSemLockHandle(object):
    """
    Picklable wrapper for a handle. Attempts to mirror how PipeConnection objects are pickled using appropriate api
    """

    def __init__(self, handle, pid=None):
        if pid is None:
            # We just duplicate the handle in the current process and
            # let the receiving process steal the handle.
            pid = os.getpid()
        proc = _winapi.OpenProcess(_winapi.PROCESS_DUP_HANDLE, False, pid)
        try:
            self._handle = _winapi.DuplicateHandle(
                _winapi.GetCurrentProcess(),
                handle, proc, 0, False, _winapi.DUPLICATE_SAME_ACCESS)
        finally:
            _winapi.CloseHandle(proc)
        self._pid = pid

    def detach(self):
        """
        Get the handle, typically from another process
        """
        # retrieve handle from process which currently owns it
        if self._pid == os.getpid():
            # The handle has already been duplicated for this process.
            return self._handle
        # We must steal the handle from the process whose pid is self._pid.
        proc = _winapi.OpenProcess(_winapi.PROCESS_DUP_HANDLE, False,
                                   self._pid)
        try:
            return _winapi.DuplicateHandle(
                proc, self._handle, _winapi.GetCurrentProcess(),
                0, False, _winapi.DUPLICATE_CLOSE_SOURCE | _winapi.DUPLICATE_SAME_ACCESS)
        finally:
            _winapi.CloseHandle(proc)


def reduce_lock_connection(self):
    sl = self._semlock
    dh = DupSemLockHandle(sl.handle)
    return rebuild_lock_connection, (dh, type(self), (sl.kind, sl.maxvalue, sl.name))


def rebuild_lock_connection(dh, t, state):
    handle = dh.detach()  # Duplicated handle valid in current process's context
    # Create a new instance without calling __init__ because we'll supply the state ourselves
    lck = t.__new__(t)
    lck.__setstate__((handle,) + state)
    return lck


# Add our own reduce function to pickle SemLock and its child classes
synchronize.SemLock.__reduce__ = reduce_lock_connection


class PicklableQueue(multiprocessing.queues.Queue):
    """
    A picklable Queue that skips the call to context.assert_spawning because it's no longer needed
    """

    def __init__(self, *args, **kwargs):
        ctx = get_context()
        super().__init__(*args, **kwargs, ctx=ctx)

    def __getstate__(self):
        return (self._ignore_epipe, self._maxsize, self._reader, self._writer,
                self._rlock, self._wlock, self._sem, self._opid)


def is_locked(l):
    """
    Returns whether the given lock is acquired or not.
    """
    locked = l.acquire(block=False)
    if locked is False:
        return True
    else:
        l.release()
        return False


if __name__ == '__main__':
    # Example that shows that you can now pickle/unpickle locks and they'll still point towards the same object
    l1 = Lock()
    p = pickle.dumps(l1)
    l2 = pickle.loads(p)
    print('before acquiring, l1 locked:', is_locked(l1), 'l2 locked', is_locked(l2))
    l2.acquire()
    print('after acquiring l1 locked:', is_locked(l1), 'l2 locked', is_locked(l2))

    # Example that shows how you can pass a queue to Pool and it will work
    with Pool() as pool:
        q = PicklableQueue()
        q.put('laugh')
        pool.map(work, (q,))
        print("Main: Worker says", q.get())
Output
before acquiring, l1 locked: False l2 locked False
after acquiring l1 locked: True l2 locked True
Worker: Main says laugh
Main: Worker says haha
Disclaimer: The above code will only work on Windows. If you are on UNIX, you may try using @Booboo's modified code below (reported working, but it has not been adequately tested; full code link here):
import os, pickle
from multiprocessing import Pool, Lock, synchronize, get_context, Process
import multiprocessing.queues
import sys

_is_windows = sys.platform == 'win32'
if _is_windows:
    import _winapi

...

class DupSemLockHandle(object):
    """
    Picklable wrapper for a handle. Attempts to mirror how PipeConnection objects are pickled using appropriate api
    """

    def __init__(self, handle, pid=None):
        if pid is None:
            # We just duplicate the handle in the current process and
            # let the receiving process steal the handle.
            pid = os.getpid()
        if _is_windows:
            proc = _winapi.OpenProcess(_winapi.PROCESS_DUP_HANDLE, False, pid)
            try:
                self._handle = _winapi.DuplicateHandle(
                    _winapi.GetCurrentProcess(),
                    handle, proc, 0, False, _winapi.DUPLICATE_SAME_ACCESS)
            finally:
                _winapi.CloseHandle(proc)
        else:
            self._handle = handle
        self._pid = pid

    def detach(self):
        """
        Get the handle, typically from another process
        """
        # retrieve handle from process which currently owns it
        if self._pid == os.getpid():
            # The handle has already been duplicated for this process.
            return self._handle
        if not _is_windows:
            return self._handle
        # We must steal the handle from the process whose pid is self._pid.
        proc = _winapi.OpenProcess(_winapi.PROCESS_DUP_HANDLE, False,
                                   self._pid)
        try:
            return _winapi.DuplicateHandle(
                proc, self._handle, _winapi.GetCurrentProcess(),
                0, False, _winapi.DUPLICATE_CLOSE_SOURCE | _winapi.DUPLICATE_SAME_ACCESS)
        finally:
            _winapi.CloseHandle(proc)
When a multiprocessing.Queue is serialized to be passed to a multiprocessing.Process.run method, it is not the queue itself that is being serialized. The queue is implemented with an open pipe (the type depends on the platform), represented by a file descriptor, and a lock that serializes access to the pipe. It is the file descriptor and lock that are serialized/de-serialized, from which the original queue can then be reconstructed.
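A rough way to see this, relying on the same CPython internals referenced in the PicklableQueue example above (treat the attribute names as implementation details, not public API):

from multiprocessing import Queue

if __name__ == '__main__':
    q = Queue()
    print(q._reader, q._writer)  # Connection objects wrapping the underlying pipe
    print(q._rlock, q._wlock)    # locks serializing access to the pipe (_wlock is None on Windows)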

Real time multiprocess stdout monitoring

Right now, I'm using subprocess to run a long-running job in the background. For multiple reasons (PyInstaller + AWS CLI) I can't use subprocess anymore.
Is there an easy way to achieve the same thing as below? That is, running a long-running Python function in a multiprocessing pool (or something else) and doing real-time processing of its stdout/stderr?
import subprocess

process = subprocess.Popen(
    ["python", "long-job.py"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    shell=True,
)
while True:
    out = process.stdout.read(2000).decode()
    if not out:
        err = process.stderr.read().decode()
    else:
        err = ""
    if (out == "" or err == "") and process.poll() is not None:
        break
    live_stdout_process(out)
Thanks
Getting this to work cross-platform is messy; for a start, the Windows implementation of non-blocking pipes is neither user-friendly nor portable.
One option is to just have your application read its command line arguments and conditionally execute a file; then you get to keep using subprocess, since you will be launching yourself with different arguments.
But to keep it to multiprocessing:
The output must be logged to queues instead of pipes.
You need the child to execute a Python file; this can be done using runpy to execute the file as __main__.
That runpy call should run under a multiprocessing child, and the child must first redirect its stdout and stderr in the initializer.
When an error happens, your main application must catch it, but if it is too busy reading the output it won't be able to wait for the error, so a child thread has to start the multiprocessing worker and wait for the error.
The main process has to create the queues, launch the child thread, and read the output.
putting it all together:
import multiprocessing
from multiprocessing import Queue
import sys
import concurrent.futures
import threading
import traceback
import runpy
import time


class StdoutQueueWrapper:
    def __init__(self, queue: Queue):
        self._queue = queue

    def write(self, text):
        self._queue.put(text)

    def flush(self):
        pass


def function_to_run():
    # runpy.run_path("long-job.py", run_name="__main__")  # run long-job.py
    print("hello")   # print something
    raise ValueError  # error out


def initializer(stdout_queue: Queue, stderr_queue: Queue):
    sys.stdout = StdoutQueueWrapper(stdout_queue)
    sys.stderr = StdoutQueueWrapper(stderr_queue)


def thread_function(child_stdout_queue, child_stderr_queue):
    with concurrent.futures.ProcessPoolExecutor(1, initializer=initializer,
                                                initargs=(child_stdout_queue, child_stderr_queue)) as pool:
        result = pool.submit(function_to_run)
        try:
            result.result()
        except Exception as e:
            child_stderr_queue.put(traceback.format_exc())


if __name__ == "__main__":
    child_stdout_queue = multiprocessing.Queue()
    child_stderr_queue = multiprocessing.Queue()

    child_thread = threading.Thread(target=thread_function,
                                    args=(child_stdout_queue, child_stderr_queue),
                                    daemon=True)
    child_thread.start()

    while True:
        while not child_stdout_queue.empty():
            var = child_stdout_queue.get()
            print(var, end='')
        while not child_stderr_queue.empty():
            var = child_stderr_queue.get()
            print(var, end='')
        if not child_thread.is_alive():
            break
        time.sleep(0.01)  # check output every 0.01 seconds
Note that a direct consequence of running the job under multiprocessing is that if the child runs into a segmentation fault or some other unrecoverable error, the parent will also die; hence running yourself under subprocess might be the better option if segfaults are expected.

How can I check that the Process class from Python Multiprocessing has worked?

I've written the following code, which runs a function that performs a stochastic simulation of a series of chemical reactions:
v = range(1, 51)

def parallelfunc(*v):
    gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)

def info(title):
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

if __name__ == '__main__':
    info('main line')
    start = datetime.utcnow()
    p = Process(target=parallelfunc, args=(v))
    p.start()
    p.join()
    end = datetime.utcnow()
    sim_time = end - start
    print(f"Simualtion utc time:\n{sim_time}")
I'm using the Process class from the multiprocessing library and am trying to run gillespie_tau_leaping 50 times.
Only I'm not sure if it's working: gillespie_tau_leaping prints a number of values to the terminal, but these values are only printed once, and I'd expect them to be printed 50 times.
I tried using the getpid etc command and this returns the following to the terminal:
main line
module name: __main__
parent process: 6188
process id: 27920
How can I tell if my code has worked, and how can I get it to print the values from gillespie_tau_leaping 50 times to the terminal?
Cheers
Your code is running just one process: the call to Process spawns a new process, but you are doing it only once (not in a loop).
I would suggest you use a multiprocessing Pool.
Your code can be something like this:
from multiprocessing import Pool

def parallelfunc(*args):
    do_something()

def main():
    # create a list of lists of args for the function invocations
    func_args = [['arg1call1', 'arg2call1', 'arg3call1'], ['arg1call2', 'arg2call2', 'arg3call2']]
    with Pool() as p:
        results = p.map(parallelfunc, func_args)
    # do something with results, which is a list of results
A multiprocessing Pool by default creates the same number of processes as you have CPU cores and manages the pool until the end of processing, taking care of all the inter-process communication.
This is really handy because synchronizing processes can be hard.
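Applied to your case, a rough sketch (untested, and assuming gillespie_tau_leaping and its four arguments are importable from your own module) might look like this:

from multiprocessing import Pool

def run_once(i):
    # i is just the run index; each call performs one independent simulation
    return gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(run_once, range(1, 51))  # 50 runs, spread over CPU cores
    print(f"completed {len(results)} runs")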
Hope this helps

Multiprocessing callback message

I have a long-running process, and I want to keep track of which state it is currently in. There are N processes running at the same time, hence the multiprocessing issue.
I pass a Queue into the process to report messages about its state, and this Queue is then read (if not empty) in a thread every couple of seconds.
I'm using Spyder on Windows as the environment, and the behavior described below is in its console. I did not try it in a different environment.
from multiprocessing import Process, Queue, Lock
import time
from tqdm import tqdm

def test(process_msg: Queue):
    try:
        process_msg.put('Inside process message')
        # process...
        return  # to have exitstate = 0
    except Exception as e:
        process_msg.put(e)

callback_msg = Queue()

if __name__ == '__main__':
    p = Process(target=test,
                args=(callback_msg,))
    p.start()

    time.sleep(5)
    print(p)
    while not callback_msg.empty():
        msg = callback_msg.get()
        if type(msg) != Exception:
            tqdm.write(str(msg))
        else:
            raise msg
The problem is that whatever I do with the code, it never reads what is inside the Queue (and the child never puts anything in it). It only works when I switch to the dummy version, which runs similarly to threading on a single CPU: from multiprocessing.dummy import Process, Queue, Lock
Apparently the test function has to be in a separate file.
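A minimal sketch of that arrangement (the file names are hypothetical): put the worker in its own module and import it, so the child process can locate it by import.

# worker.py
from multiprocessing import Queue

def test(process_msg: Queue):
    process_msg.put('Inside process message')

# main.py
import time
from multiprocessing import Process, Queue
from worker import test

if __name__ == '__main__':
    callback_msg = Queue()
    p = Process(target=test, args=(callback_msg,))
    p.start()
    time.sleep(5)
    while not callback_msg.empty():
        print(callback_msg.get())
    p.join()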

With python.multiprocessing, how do I create a proxy in the current process to pass to other processes?

I'm using the multiprocessing library in Python. I can see how to define that objects returned from functions should have proxies created, but I'd like to have objects in the current process turned into proxies so I can pass them as parameters.
For example, running the following script:
from multiprocessing import current_process
from multiprocessing.managers import BaseManager

class ProxyTest(object):
    def call_a(self):
        print 'A called in %s' % current_process()

    def call_b(self, proxy_test):
        print 'B called in %s' % current_process()
        proxy_test.call_a()

class MyManager(BaseManager):
    pass

MyManager.register('proxy_test', ProxyTest)

if __name__ == '__main__':
    manager = MyManager()
    manager.start()

    pt1 = ProxyTest()
    pt2 = manager.proxy_test()

    pt1.call_a()
    pt2.call_a()
    pt1.call_b(pt2)
    pt2.call_b(pt1)
... I get the following output ...
A called in <_MainProcess(MainProcess, started)>
A called in <Process(MyManager-1, started)>
B called in <_MainProcess(MainProcess, started)>
A called in <Process(MyManager-1, started)>
B called in <Process(MyManager-1, started)>
A called in <Process(MyManager-1, started)>
... but I want that final line of output coming from _MainProcess.
I could just create another Process and run it from there, but I'm trying to keep the amount of data that needs to be passed between processes to a minimum. The documentation for the Manager object mentioned a serve_forever method, but it doesn't seem to be supported. Any ideas?
Does anyone know?
Why do you say serve_forever is not supported?
manager = MyManager()
s = manager.get_server()
s.serve_forever()
should work.
See the managers.BaseManager.get_server documentation for official examples.
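For completeness, a minimal sketch of serving and connecting, reusing MyManager and ProxyTest from the question; the address and authkey are arbitrary placeholders:

# server side: serve ProxyTest instances forever
from multiprocessing.managers import BaseManager

class ProxyTest(object):
    def call_a(self):
        print('A called')

class MyManager(BaseManager):
    pass

MyManager.register('proxy_test', ProxyTest)

if __name__ == '__main__':
    manager = MyManager(address=('127.0.0.1', 50000), authkey=b'secret')
    server = manager.get_server()
    server.serve_forever()

# client side (another process): register the typeid without a callable, then connect
#   MyManager.register('proxy_test')
#   m = MyManager(address=('127.0.0.1', 50000), authkey=b'secret')
#   m.connect()
#   pt = m.proxy_test()
#   pt.call_a()   # executes in the server process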
