The multiprocessing documentation says the following about picklability:
Picklability
Ensure that the arguments to the methods of proxies are picklable.
More picklability
Ensure that all arguments to Process.init() are picklable. Also, if you subclass Process then make sure that instances will be picklable when the Process.start method is called.
I think this basically means that whatever is passed through the arguments of Process will be pickled and unpickled.
But the "Better to inherit than pickle/unpickle" section states
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
I conducted the following experiment, which shows the output Read successfully.
import multiprocessing as mp
from pathlib import Path

import rasterio
from rasterio.windows import Window

def read_dataset(dataset, window):
    return dataset.read(window=window)

if __name__ == "__main__":
    mp.set_start_method("fork")
    with rasterio.open(Path("test.tiff").absolute()) as dataset:
        window = Window(col_off=0, row_off=0, width=100, height=100)
        p1 = mp.Process(target=read_dataset, args=(dataset, window))
        p1.start()
        p1.join()
    print("Read successfully.")
But when changing to mp.set_start_method("spawn"), it shows the error below.
Traceback (most recent call last):
File "test.py", line 88, in <module>
p1.start()
File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "stringsource", line 2, in rasterio._io.DatasetReaderBase.__reduce_cython__
TypeError: self._hds cannot be converted to a Python object for pickling
My question is the following.
When a child process is created with fork, the variable is inherited instead of being pickled/unpickled. But when a child process is created with spawn, the arguments are sent through pickling/unpickling. Where can I find the implementation details for this? Thanks.
The implementation lives in CPython's multiprocessing package, in context.py, popen_fork.py, and popen_spawn_posix.py (popen_spawn_win32.py on Windows).
fork copies the parent process's memory, so the child starts with the parent's objects already in place and simply uses them.
spawn launches a fresh interpreter via a command line that re-imports your main module, and the Process object together with its target and arguments is sent to the child through a pipe, which is why it all has to be picklable.
Because the child re-reads the source, if you edit and save the source code just before spawning, the child runs the modified code.
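As a minimal illustration of the difference (my own sketch, not part of the linked source): an unpicklable object such as a lambda can be handed to the child under fork, because the child inherits it from the parent's memory, while the same call fails under spawn when the arguments are pickled for the pipe.

import multiprocessing as mp

def call(fn):
    print("child result:", fn())

if __name__ == "__main__":
    unpicklable = lambda: 42  # lambdas cannot be pickled

    p = mp.get_context("fork").Process(target=call, args=(unpicklable,))
    p.start()  # works: the child inherits the lambda from the parent's memory
    p.join()

    p = mp.get_context("spawn").Process(target=call, args=(unpicklable,))
    p.start()  # raises a PicklingError in the parent while dumping the args to the pipe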
Related
I am trying to understand how concurrency works in general and, in this case, how it works specifically in Python.
I have been using the inputs library for a while now and have always had to "cheat" when spawning processes that use it, by executing the script with subprocess.Popen. Today I, without much thought, placed a single line of code in a different place and managed to successfully spawn a Process targeting a function. But I don't understand why it works...
The following code defines two simple classes: one holds a reference to controller in self, and the other one doesn't (it uses the global reference declared in the module):
import inputs
import multiprocessing
import time

controller = inputs.devices.gamepads[0]

class TestBroken:
    def __init__(self):
        self.controller = controller

    def read(self):
        while True:
            ev = self.controller.read()[0]
            print(ev.code, ev.state)

class TestWorking:
    def read(self):
        while True:
            ev = controller.read()[0]
            print(ev.code, ev.state)

if __name__ == '__main__':
    t = TestWorking()
    # Uncomment the line below to get the errors
    #t = TestBroken()

    multiprocessing.Process(target=t.read).start()
    while True:
        print("I'm alive!")
        time.sleep(1)
The error after uncommenting #t = TestBroken() is as follows:
Traceback (most recent call last):
File "C:/Coding/...", line 31, in <module>
multiprocessing.Process(target=t.read).start()
File "C:\Python\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Python\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Python\lib\multiprocessing\context.py", line 326, in _Popen
return Popen(process_obj)
File "C:\Python\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Python\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'
I can't quite understand how storing a reference to an object on the instance makes pickle go bonkers, while storing the same reference at module level is allowed. I kindly request your assistance in uncovering the mystery behind this issue.
When multiprocessing starts child processes, each child gets its own copy of the parent's module-level globals (under spawn, the default on Windows, they are re-created when the child re-imports the module), so changes made in the parent are not reflected in the children's copies. The constructor of TestBroken assigns the global controller to an instance attribute, so pickling the target t.read also has to pickle the controller object, whose ctypes internals cannot be pickled. TestWorking keeps no such reference; its read method just creates a local variable ev and reads the controller from the global scope at call time, so nothing unpicklable ever needs to be pickled.
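One way around this (a sketch of my own, not part of the answer above) is to avoid storing the device anywhere in the parent and instead acquire it inside the child process, so nothing unpicklable ever has to cross the process boundary:

import multiprocessing
import time

class TestFixed:
    def read(self):
        import inputs                            # acquire the device inside the child
        controller = inputs.devices.gamepads[0]  # created here, never pickled
        while True:
            ev = controller.read()[0]
            print(ev.code, ev.state)

if __name__ == '__main__':
    t = TestFixed()
    multiprocessing.Process(target=t.read).start()
    while True:
        print("I'm alive!")
        time.sleep(1)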
Hi, I'm trying to write a module that lets me read and send data via pyserial. I have to be able to read the data in parallel with my main script. With the help of a Stack Overflow user, I have a basic and working skeleton of the program, but when I try adding a class I created that uses pyserial (it handles finding the port, speed, etc.; found here), I get the following error:
File "<ipython-input-1-830fa23bc600>", line 1, in <module>
runfile('C:.../pythonInterface1/Main.py', wdir='C:/Users/Daniel.000/Desktop/Daniel/Python/pythonInterface1')
File "C:...\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:...\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Daniel.000/Desktop/Daniel/Python/pythonInterface1/Main.py", line 39, in <module>
p.start()
File "C:...\Anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:...\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:...\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:...\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "C:...\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
This is the code I am using to call the class in SerialConnection.py
import multiprocessing
from time import sleep
from operator import methodcaller

from SerialConnection import SerialConnection as SC

class Spawn:
    def __init__(self, _number, _max):
        self._number = _number
        self._max = _max
        # Don't call update here

    def request(self, x):
        print("{} was requested.".format(x))

    def update(self):
        while True:
            print("Spawned {} of {}".format(self._number, self._max))
            sleep(2)

if __name__ == '__main__':
    '''
    spawn = Spawn(1, 1)  # Create the object as normal
    p = multiprocessing.Process(target=methodcaller("update"), args=(spawn,))  # Run the loop in the process
    p.start()
    while True:
        sleep(1.5)
        spawn.request(2)  # Now you can reference the "spawn"
    '''
    device = SC()
    print(device.Port)
    print(device.Baud)
    print(device.ID)
    print(device.Error)
    print(device.EMsg)

    p = multiprocessing.Process(target=methodcaller("ReadData"), args=(device,))  # Run the loop in the process
    p.start()
    while True:
        sleep(1.5)
        device.SendData('0003')
What am I doing wrong for this class to be giving me problems? Is there some restriction on using pyserial and multiprocessing together? I know it can be done, but I don't understand how...
Here is the traceback I get from Python:
Traceback (most recent call last):
File "C:...\Python\pythonInterface1\Main.py", line 45, in <module>
p.start()
File "C:...\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:...\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:...\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:...\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:...\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
You are trying to pass a SerialConnection instance to another process as an argument. To do that, Python first has to serialize (pickle) the object, and that is not possible for SerialConnection objects.
As said in Rob Streeting's answer, a possible solution would be to allow the SerialConnection object to be copied to the other process's memory using the fork that occurs when multiprocessing.Process.start is invoked, but this will not work on Windows, as it does not use fork.
A simpler, cross-platform and more efficient way to achieve parallelism in your code would be to use a thread instead of a process: threads run in the same memory space, so the device object is shared directly and never has to be pickled. The changes to your code are minimal:
import threading
p = threading.Thread(target=methodcaller("ReadData"), args=(device,))
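Assembled with the rest of your Main.py, the thread-based version would look roughly like this (my sketch; it only swaps the Process for a Thread and otherwise assumes the same SerialConnection interface as in your script):

import threading
from time import sleep
from operator import methodcaller
from SerialConnection import SerialConnection as SC

if __name__ == '__main__':
    device = SC()
    p = threading.Thread(target=methodcaller("ReadData"), args=(device,))  # a thread shares memory, so nothing is pickled
    p.start()
    while True:
        sleep(1.5)
        device.SendData('0003')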
I think the problem is due to something inside device being unpicklable (i.e., not serializable by python). Take a look at this page to see if you can see any rules that may be broken by something in your device object.
So why does device need to be picklable at all?
When a multiprocessing.Process is started, it uses fork() at the operating system level (unless otherwise specified) to create the new process. What this means is that the whole context of the parent process is "copied" over to the child. This does not require pickling, as it's done at the operating system level.
(Note: On Unix at least, this "copy" is actually a pretty cheap operation because it uses a feature called "copy-on-write". This means that both parent and child processes actually read from the same memory until one or the other modifies it, at which point the affected pages are copied for the process that changed them.)
However, the arguments of the function that you want the process to take care of do have to be pickled, because they are not part of the main process's context. So, that includes your device variable.
I think you might be able to resolve your issue by allowing device to be copied as part of the fork operation rather than passing it in as a variable. To do this though, you'll need a wrapper function around the operation you want your process to do, in this case methodcaller("ReadData"). Something like this:
if __name__ == "__main__":
    device = SC()

    def call_read_data():
        device.ReadData()

    ...

    p = multiprocessing.Process(target=call_read_data)  # Run the loop in the process
    p.start()
I am trying to use the example Pika Async consumer (http://pika.readthedocs.io/en/0.10.0/examples/asynchronous_consumer_example.html) as a multiprocessing process (by making the ExampleConsumer class subclass multiprocessing.Process). However, I'm running into some issues with gracefully shutting down everything.
Let's say for example I have defined my procs as below:
for k, v in queues_callbacks.iteritems():
    proc = ExampleConsumer(queue, k, v, rabbit_user, rabbit_pw, rabbit_host, rabbit_port)
"queues_callbacks" is basically just a dictionary of exchange : callback_function (ideally I'd like to be able to connect to several exchanges with this architecture).
Then I start the processes in the normal Python way:
try:
    for proc in self.consumers:
        proc.start()

    for proc in self.consumers:
        proc.join()
except KeyboardInterrupt:
    for proc in self.consumers:
        proc.terminate()
        proc.join(1)
The issue comes when I try to stop everything. Let's say I've overridden the "terminate" method to call the consumer's "stop" method and then continue with the normal terminate of Process. With this structure, I am getting some strange attribute errors:
Traceback (most recent call last):
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 154, in <module>
main()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 150, in main
mybot.start()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 71, in start
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 53, in stop
self.__stop_consumers__()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 130, in __stop_consumers__
self.consumers[0].terminate()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 414, in terminate
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 399, in stop
self._connection.ioloop.start()
AttributeError: 'NoneType' object has no attribute 'ioloop'
It's as if these attributes somehow disappear at some point. In the particular case above, _connection is initialized as None, but then gets set when the Consumer is started. However, by the time the "stop" method is called, it has already reverted back to None (with nothing explicitly setting it back). I'm also observing other strange behavior, such as times when it appears that things are getting called twice (even though "stop" is called only once). Any ideas as to what is going on here, or is this not the proper way of architecting this?
Thanks!
I want to iterate over a list with two functions using multiprocessing: one function iterates over the sample list (g) from the front and the other from the back. Each time a function takes an element from g, it should append it to the shared main list, until one of them hits an element that is already in the main list; at that point I want to terminate both processes and return the elements each process has seen.
I expect the first process to return:
['a', 'b', 'c', 'd', 'e', 'f']
And the second to return:
['l', 'k', 'j', 'i', 'h', 'g']
This is my code, which raises an error:
from multiprocessing import Process, Manager

manager = Manager()
d = manager.list()

# Fn definitions and such
def a(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'a'
        if i in main_path:
            return l
        main_path.append(i)

def b(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'b'
        if i in main_path:
            return l
        main_path.append(i)

g = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
g2 = g[::-1]

p1 = Process(target=a, args=(d, g))
p2 = Process(target=b, args=(d, g2))
p1.start()
p2.start()
And this is the Traceback:
a
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/bluebird/Desktop/persiantext.py", line 17, in a
if i in main_path:
File "<string>", line 2, in __contains__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
b
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/bluebird/Desktop/persiantext.py", line 27, in b
if i in main_path:
File "<string>", line 2, in __contains__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
Note that I also have no idea how to terminate both processes once one of them finds a duplicated element!
There are all kinds of other problems in your code, but since I already explained them on your other question, I won't get into them here.
The new problem is that you're not joining your child processes. In your threaded version, this wasn't an issue just because your main thread accidentally had a "block forever" before the end. But here, you don't have that, so the main process reaches the end of the script while the background processes are still running.
When this happens, it's not entirely defined what your code will do.* But basically, you're destroying the manager object, which shuts down the manager server while the background processes are still using it, so they're going to raise exceptions the next time they try to access a managed object.
The solution is to add p1.join() and p2.join() to the end of your script.
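In other words, the end of the script should look something like this (a minimal sketch):

p1.start()
p2.start()
p1.join()  # wait for both children to finish...
p2.join()  # ...before the script ends and the Manager is torn down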
But that really only gets you back to the same situation as your threaded code (except not blocking forever at the end). You've still got code that's completely serialized, and a big race condition, and so on.
If you're curious why this happens:
At the end of the script, all of your module's globals go out of scope.** Since those variables are the only reference you have to the manager and process objects, those objects get garbage-collected, and their destructors get called.
For a manager object, the destructor shuts down the server.
For a process object, I'm not entirely sure, but I think the destructor does nothing (rather than joining it and/or interrupting it). Instead, there's an atexit function that runs after all of the destructors and joins any still-running processes.***
So, first the manager goes away, then the main process starts waiting for the children to finish; the next time each one tries to access a managed object, it fails and exits. Once all of them do that, the main process finishes waiting and exits.
* The multiprocessing changes in 3.2 and the shutdown changes in 3.4 make things a lot cleaner, so if we weren't talking about 2.7, there would be less "here's what usually happens but not always" and "here's what happens in one particular implementation on one particular platform".
** This isn't actually guaranteed by 2.7, and garbage-collecting all of the modules' globals doesn't always happen. But in this particular simple case, I'm pretty sure it will always work this way, at least in CPython, although I don't want to try to explain why.
*** That's definitely how it works with threads, at least on CPython 2.7 on Unix… again, this isn't at all documented in 2.x, so you can only tell by reading the source or experimenting on the platforms/implementations/versions that matter to you… And I don't want to track this through the source unless there's likely to be something puzzling or interesting to find.
I have created this sample program to generalize the issue I am facing:
import multiprocessing
from multiprocessing import Manager

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    dict = manager.dict()
    dict['process_obj'] = multiprocessing.current_process()
    print dict

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function',))
    process.start()
    process.join()
So how do I store a process object in multiprocessing Manager.dict()?
I assume you're talking about getting this error:
hello function
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mp2.py", line 8, in f
dict['process_obj'] = multiprocessing.current_process()
File "<string>", line 2, in __setitem__
File "/usr/local/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
(it's generally a good idea to include "what I got" and "what I expected to get instead" in the question).
The fundamental problem here is that the object returned by multiprocessing.current_process() contains instance methods. Instance methods don't pickle properly, and multiprocessing has to save (pickle) and load (unpickle) shared data items to communicate their values from one process to another. See, e.g., Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map() and Overcoming Python's limitations regarding instance methods. Note in particular one of the answers in the second: it might be better to figure out some state to send/share, rather than an entire instance. For instance, if the ident of a process suffices, you can do this:
dict['process_obj'] = multiprocessing.current_process().ident
which works fine.
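For completeness, here is the sample program from the question with essentially that one change (my own quick sketch; I also renamed dict to d so it doesn't shadow the built-in), which runs without the pickling error:

import multiprocessing

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    d = manager.dict()
    d['process_obj'] = multiprocessing.current_process().ident  # an int, which pickles fine
    print d

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function',))
    process.start()
    process.join()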