I have created this sample program to generalize the issue I am facing:
import multiprocessing
from multiprocessing import Manager

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    dict = manager.dict()
    dict['process_obj'] = multiprocessing.current_process()
    print dict

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()
So how do I store a process object in multiprocessing Manager.dict()?
I assume you're talking about getting this error:
hello function
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mp2.py", line 8, in f
dict['process_obj'] = multiprocessing.current_process()
File "<string>", line 2, in __setitem__
File "/usr/local/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
(it's generally a good idea to include "what I got" and "what I expected to get instead" in the question).
The fundamental problem here is that the Process object returned by multiprocessing.current_process() is not picklable: among other things, it carries instance methods, and instance methods don't pickle properly. multiprocessing has to save (pickle) and load (unpickle) shared data items to communicate their values from one process to another. See, e.g., Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map() and Overcoming Python's limitations regarding instance methods. Note in particular one of the answers in the second: it might be better to figure out some state to send/share, rather than an entire instance. For instance, if the ident of a process suffices, you can do this:
dict['process_obj'] = multiprocessing.current_process().ident
which works fine.
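For completeness, here is the original snippet with that one-line change applied (a minimal sketch in the question's Python 2 style; only picklable state goes into the managed dict):

import multiprocessing

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    d = manager.dict()
    # store only picklable state (an int), not the Process object itself
    d['process_obj'] = multiprocessing.current_process().ident
    print d

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()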
Related
When I run the below code:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

q = Queue()

def my_task(x, queue):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i, q) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
I get this error:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/nn.py", line 14, in <module>
print(task.result())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
What is the purpose of multiprocessing.Queue if I cannot use it for multiprocessing? How can I make this work? In my real code, every worker needs to update a queue frequently with its task status so that another thread can read from that queue and feed a progress bar.
Short Explanation
Why can't you pass a multiprocessing.Queue as a worker function argument? The short answer is that submitted tasks are placed on an internal input queue from which the pool processes fetch the next task to perform, and everything placed on that queue must be serializable with pickle. A multiprocessing.Queue is not picklable in general. It is picklable only in the special case of being passed to a child process as an argument of a multiprocessing.Process: those arguments are stored as an attribute of the instance when it is created, and when start is called the instance's state must be serialized into the new address space before the run method executes there. Why this serialization works for that case but not the general one is unclear to me; I would have to spend a lot of time looking at the multiprocessing source to come up with a definitive answer.
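To illustrate the special case that does work, here is a minimal sketch (the worker and queue names are just illustrative) of passing a queue directly as an argument to an explicit multiprocessing.Process:

from multiprocessing import Process, Queue

def worker(q):
    # the queue arrives as a Process argument, which is the supported path
    q.put("hello from the child")

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # prints: hello from the child
    p.join()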
See what happens when I try to put a queue instance to a queue:
>>> from multiprocessing import Queue
>>> q1 = Queue()
>>> q2 = Queue()
>>> q1.put(q2)
>>> Traceback (most recent call last):
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Program Files\Python38\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>> import pickle
>>> b = pickle.dumps(q2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>>
How to Pass the Queue via Inheritance
First of all, your code will run more slowly using multiprocessing than if you had just called my_task in a loop, because multiprocessing introduces additional overhead (starting processes and moving data across address spaces) that must be more than offset by whatever you gain from running my_task in parallel. In your case it isn't, because my_task is not sufficiently CPU-intensive to justify multiprocessing.
That said, when you want your pool processes to use a multiprocessing.Queue instance, it cannot be passed as an argument to a worker function (unlike the case where you are explicitly using multiprocessing.Process instances instead of a pool). Instead, you must initialize a global variable in each pool process with the queue instance.
If you are running under a platform that uses fork to create new processes, then you can just create queue as a global and it will be inherited by each pool process:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

queue = Queue()

def my_task(x):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
    # This queue must be read before the pool terminates:
    for _ in range(10):
        print(queue.get())
Prints:
1
0
2
3
6
5
4
7
8
9
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
If you need portability with platforms that do not use the fork method to create processes, such as Windows (which uses the spawn method), then you cannot allocate the queue as a global, since each pool process would create its own queue instance. Instead, the main process must create the queue and then initialize each pool process's global queue variable using the initializer and initargs arguments:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    queue.put("Task Complete")
    return x

# Windows compatibility
if __name__ == '__main__':
    q = Queue()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        for task in as_completed(tasks):
            print(task.result())
        # This queue must be read before the pool terminates:
        for _ in range(10):
            print(q.get())
If all you want is to advance a progress bar as each task completes (you haven't precisely stated how the bar is to advance; see my comment to your question), then the following shows that a queue is not necessary. If, however, each submitted task consisted of N parts (for a total of 10 * N parts, since there are 10 tasks) and you would like to see a single progress bar advance as each part completes, then a queue is probably the most straightforward way of signaling a part completion back to the main process (see the sketch after the code below).
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def my_task(x):
    return x

# Windows compatibility
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        with tqdm(total=10) as bar:
            tasks = [executor.submit(my_task, i) for i in range(10)]
            for _ in as_completed(tasks):
                bar.update()
        # To get the results in task submission order:
        results = [task.result() for task in tasks]
        print(results)
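For the multi-part case, here is a minimal sketch of the queue-based variant (the part count and the work done per part are made up for illustration): a background thread in the main process drains the queue and advances the bar while the workers run.

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Queue
from threading import Thread
from tqdm import tqdm

N_PARTS = 5  # hypothetical number of parts per task

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    for _ in range(N_PARTS):
        # ... do one part of the work here ...
        queue.put("part done")  # signal completion of one part
    return x

def progress_reader(q, total):
    with tqdm(total=total) as bar:
        for _ in range(total):
            q.get()
            bar.update()

if __name__ == '__main__':
    q = Queue()
    reader = Thread(target=progress_reader, args=(q, 10 * N_PARTS))
    reader.start()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        results = [task.result() for task in tasks]
    reader.join()
    print(results)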
The multiprocessing documentation's programming guidelines state:
Picklability
Ensure that the arguments to the methods of proxies are picklable.
More picklability
Ensure that all arguments to Process.__init__() are picklable. Also, if you subclass Process then make sure that instances will be picklable when the Process.start() method is called.
I think it basically means that whatever is sent through arguments of Process will be pickled/unpickled.
But the Better to inherit than pickle/unpickle section states:
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
I conducted the experiment below, which prints Read successfully.:
import multiprocessing as mp
from pathlib import Path

import rasterio
from rasterio.windows import Window

def read_dataset(dataset, window):
    return dataset.read(window=window)

if __name__ == "__main__":
    mp.set_start_method("fork")
    with rasterio.open(Path("test.tiff").absolute()) as dataset:
        window = Window(col_off=0, row_off=0, width=100, height=100)
        p1 = mp.Process(target=read_dataset, args=(dataset, window))
        p1.start()
        p1.join()
        print("Read successfully.")
But when changing to mp.set_start_method("spawn"), it shows the error below.
Traceback (most recent call last):
File "test.py", line 88, in <module>
p1.start()
File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "stringsource", line 2, in rasterio._io.DatasetReaderBase.__reduce_cython__
TypeError: self._hds cannot be converted to a Python object for pickling
My question is the following.
When a child process is generated with fork, the variable is inherited instead of pickled/unpickled. But when a child process is generated with spawn, then the arguments are sent through pickling/unpickling. Where can I find such implementation detail? Thanks.
The relevant implementation is in the multiprocessing package of the CPython standard library, in the files your tracebacks already name:
multiprocessing context: Lib/multiprocessing/context.py (start-method selection)
popen fork: Lib/multiprocessing/popen_fork.py
popen spawn: Lib/multiprocessing/popen_spawn_posix.py (and popen_spawn_win32.py on Windows)
fork makes the child a copy of the parent's memory and starts it from there, so objects are simply inherited.
spawn is implemented by launching a fresh interpreter via a command line and sending the process state (including target and args) to the child through a pipe, which is where the pickling happens. Because spawn re-imports the main module in the new interpreter, if you edit and save the source code just before spawning, you can get the result of the modified code.
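As a self-contained illustration of the difference (a sketch assuming a POSIX system where both start methods are available; an ordinary open file object stands in for the unpicklable rasterio dataset):

import multiprocessing as mp

def use_handle(f):
    # just touch the object that was inherited / transferred
    print("child sees:", f.name)

if __name__ == "__main__":
    with open("test.txt", "w") as f:
        # fork: the child is a copy of the parent, nothing is pickled
        p = mp.get_context("fork").Process(target=use_handle, args=(f,))
        p.start()
        p.join()  # works and prints the file name

        # spawn: the args must be pickled, and an open file object is not picklable
        p = mp.get_context("spawn").Process(target=use_handle, args=(f,))
        p.start()  # raises TypeError: cannot pickle '_io.TextIOWrapper' object
        p.join()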
As I was trying out the multiprocessing Pool module, I noticed that it does not work when I am loading/opening any kind of file. The code below works as expected, but when I uncomment the two commented-out lines (marked "This is the culprit"), the script skips the pool.apply_async call and loopingTest never runs.
import time
from multiprocessing import Pool


class MultiClass:
    def __init__(self):
        file = 'test.txt'
        # with open(file, 'r') as f:  # This is the culprit
        #     self.d = f
        self.n = 50000000
        self.cases = ['1st time', '2nd time']
        self.multiProc(self.cases)
        print("It's done")

    def loopingTest(self, cases):
        print(f"looping start for {cases}")
        n = self.n
        while n > 0:
            n -= 1
        print(f"looping done for {cases}")

    def multiProc(self, cases):
        test = False
        pool = Pool(processes=2)
        if not test:
            for i in cases:
                pool.apply_async(self.loopingTest, (i,))
        pool.close()
        pool.join()


if __name__ == '__main__':
    start = time.time()
    w = MultiClass()
    end = time.time()
    print(f'Script finished in {end - start} seconds')
You see this behavior because apply_async fails when you save the open file object (self.d) on your instance. When you call apply_async(self.loopingTest, ...), Python needs to pickle self.loopingTest to send it to the worker process, which also requires pickling self. With the open file object saved as an attribute of self, that pickling fails, because open file objects can't be pickled. You'll see this for yourself if you use apply instead of apply_async in your sample code; you'll get an error like this:
Traceback (most recent call last):
File "a.py", line 36, in <module>
w = MultiClass()
File "a.py", line 12, in __init__
self.multiProc(self.cases)
File "a.py", line 28, in multiProc
out.get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
You need to change your code to either avoid saving the file object on self (create it only in the worker method, if that's where you need it) or use the tools Python provides to control the pickle/unpickle process for your class (__getstate__ and __setstate__). Depending on the use case, you can also turn the method you're passing to apply_async into a top-level function, so that self doesn't need to be pickled at all.
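Here is a minimal sketch of the __getstate__/__setstate__ approach, assuming the worker can simply reopen the file by path after unpickling (the class, attribute, and file names are illustrative):

import pickle


class FileHolder:
    """Keeps an open file handle but stays picklable."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, 'r')

    def __getstate__(self):
        # drop the unpicklable file object; keep only the path
        state = self.__dict__.copy()
        del state['f']
        return state

    def __setstate__(self, state):
        # reopen the file in the receiving process after unpickling
        self.__dict__.update(state)
        self.f = open(self.path, 'r')


if __name__ == '__main__':
    with open('test.txt', 'w') as f:
        f.write('hello')
    holder = FileHolder('test.txt')
    clone = pickle.loads(pickle.dumps(holder))  # succeeds now
    print(clone.f.read())  # prints: hello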
The code is below. When I copy and paste it into my cmd prompt, it throws 'module' object has no attribute 'func', but when I save it as a .py file and run python test.py, it works fine.
import multiprocessing
import time

def func(msg):
    for i in xrange(3):
        print msg
        time.sleep(1)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    for i in xrange(5):
        msg = "hello %d" % (i)
        pool.apply_async(func, (msg, ))
    pool.close()
    pool.join()
    print "Sub-process(es) done."
Could anyone explain the difference between running Python code at the interactive prompt and running it from a file? Thanks a lot!
This is happening because, on Windows, func needs to be pickled and sent to the child process via IPC. In order for the child to unpickle func, it needs to be able to import it from the parent's __main__ module. When this happens in a normal Python script, the child can re-import your script, and __main__ will contain all the functions declared at the top level of your script, so it works fine. In the interactive interpreter, however, functions you've defined can't simply be re-imported from a file the way they can in a normal script, so they will not be in __main__ in the child. This is clearer if you use multiprocessing.Process directly to recreate the issue:
>>> def f():
... print "HI"
...
>>> import multiprocessing
>>> p = multiprocessing.Process(target=f)
>>> p.start()
>>> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\python27\lib\multiprocessing\forking.py", line 381, in main
self = load(from_parent)
File "C:\python27\lib\pickle.py", line 1378, in load
return Unpickler(file).load()
File "C:\python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\python27\lib\pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "C:\python27\lib\pickle.py", line 1126, in find_class
klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'f'
This way, it's more clear that pickle can't find the module. If you add some tracing to pickle.py you can see that 'module' is referring to __main__:
def load_global(self):
    module = self.readline()[:-1]
    name = self.readline()[:-1]
    print("module {} name {}".format(module, name))  # I added this.
    klass = self.find_class(module, name)
    self.append(klass)
Re-running the same code with that extra print statement yields this:
module multiprocessing.process name Process
module __main__ name f
< same traceback as before>
It's worth noting that this example actually works fine on POSIX platforms, because os.fork() is used to spawn the child processes, which means that any function defined prior to the Pool being created will be available in the child's __main__ module. So, while the above example will work there, this one will still fail, because the worker function is defined after creating the Pool (which means after os.fork() is called):
>>> import multiprocessing
>>> p = multiprocessing.Pool(2)
>>> def f(a):
... print(a)
...
>>> p.apply(f, "hi")
Process PoolWorker-1:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 231, in _bootstrap
self.run()
File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/usr/lib64/python2.6/multiprocessing/queues.py", line 339, in get
return recv()
AttributeError: 'module' object has no attribute 'f'
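A common workaround for interactive use is to define the worker in a module that the children can import. A minimal sketch, assuming a file named worker_mod.py in the directory where you start the interpreter:

# worker_mod.py (hypothetical helper module)
def func(msg):
    return "hello " + msg

Then, at the prompt, the worker is pickled by reference (worker_mod.func) and the child processes can import it:

>>> import multiprocessing
>>> from worker_mod import func
>>> pool = multiprocessing.Pool(processes=4)
>>> print(pool.map(func, ["a", "b", "c"]))
['hello a', 'hello b', 'hello c']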
I have a function that takes a list of URLs and adds a header to each one. The url_list can be around 25,000 URLs long, so I want to use multiprocessing. I have tried two ways, both of which fail.
First way: the url_list is not passed correctly, and the function only receives the first letter 'h' of a URL:
import multiprocessing
import requests

headers = {}
header_token = {}

def do_it(url_list):
    for i in url_list:
        print "adding header to: \n" + i
        requests.post(i, headers=headers)
    print "done!"

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(do_it, url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "head.py", line 95, in <module>
pool.map(do_it, url_list)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
requests.exceptions.MissingSchema: Invalid URL u'h': No schema supplied
The second way, which I prefer since I don't have to make the headers dictionary global, gives a pickle error:
import multiprocessing
import requests

def wrapper(headers):
    def do_it(url_list):
        for i in url_list:
            print "adding header to: \n" + i
            requests.post(i, headers=headers)
        print "done!"
    return do_it

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(wrapper(headers), url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 761, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
If you are looking to use your second implementation, then I think you should be able to use dill to serialize your wrapper function. dill can serialize almost anything in Python, and it has some good tools for helping you understand what is causing your pickling to fail. dill has the same interface as Python's pickle, but also provides some additional methods. If you want to use dill for serialization with multiprocessing, all you have to do is:
>>> import dill
>>> # your code goes here (as above)
And if that doesn't work for some reason, you could swap out multiprocessing for pathos, which is built to do multiprocessing using dill and provides a map function that accepts multiple arguments (much like the standard Python map).
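A minimal sketch of the pathos route, assuming pathos is installed (pip install pathos) and that headers and url_list are defined as in your script; the pool size and placeholder values are illustrative:

import requests
from pathos.multiprocessing import ProcessingPool

headers = {'X-Example-Header': 'value'}  # placeholder
url_list = ['http://example.com/a', 'http://example.com/b']  # placeholder

def wrapper(headers):
    # each worker call receives a single URL from the map
    def do_it(url):
        requests.post(url, headers=headers)
        return url
    return do_it

if __name__ == '__main__':
    pool = ProcessingPool(nodes=8)
    # dill (used by pathos) can serialize the closure returned by wrapper()
    results = pool.map(wrapper(headers), url_list)
    print(results)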
You need to use a Queue from the multiprocessing package. The datatype that you're pulling from or adding to needs to be thread- and process-safe; a Queue is both.
http://docs.python.org/2/library/queue.html
http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes