I would like to be able to create a new multiprocessing.Value or multiprocessing.Array after a process has started, as in this example:
# coding: utf-8
import multiprocessing
shared = {
    'foo': multiprocessing.Value('i', 42),
}

def job(pipe):
    while True:
        shared_key = pipe.recv()
        print(shared[shared_key].value)

process_read_pipe, process_write_pipe = multiprocessing.Pipe(duplex=False)
process = multiprocessing.Process(
    target=job,
    args=(process_read_pipe, )
)
process.start()
process_write_pipe.send('foo')
shared['bar'] = multiprocessing.Value('i', 24)
process_write_pipe.send('bar')
Output:
42
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/bux/Projets/synergine2/p.py", line 12, in job
print(shared[shared_key].value)
KeyError: 'bar'
Process finished with exit code 0
The problem here is: the shared dict is copied into the process when it starts. But if I add a key to the shared dict afterwards, the process can't see it. How can the already-started process be informed about the existence of the new multiprocessing.Value('i', 24)?
It can't be sent through the pipe, because:
Synchronized objects should only be shared between processes through inheritance
Any ideas?
It looks like you are assuming the shared dict itself is accessible from both processes. It isn't: only shared['foo'], which existed before the process started, is accessible from both. You need to share a dictionary.
Here is an example: Python multiprocessing: How do I share a dict among multiple processes?
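For this particular example, a minimal sketch with a Manager dict might look like the following (an illustration, not the linked answer's code: it stores plain integers instead of multiprocessing.Value objects, since the manager proxy already mediates access between processes, and it passes the shared dict to the child explicitly):
import multiprocessing
import time

def job(pipe, shared):
    while True:
        shared_key = pipe.recv()
        print(shared[shared_key])

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict()
    shared['foo'] = 42

    read_end, write_end = multiprocessing.Pipe(duplex=False)
    process = multiprocessing.Process(target=job, args=(read_end, shared))
    process.start()

    write_end.send('foo')
    shared['bar'] = 24          # added after start(): the child sees it via the proxy
    write_end.send('bar')

    time.sleep(1)               # give the child time to print, then shut it down
    process.terminate()
    process.join()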
When I run the below code:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

q = Queue()

def my_task(x, queue):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i, q) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
I get this error:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/nn.py", line 14, in <module>
print(task.result())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
What is the purpose of multiprocessing.Queue if I cannot use it for multiprocessing? How can I make this work? In my real code, I need every worker to frequently put task-status updates on a queue so that another thread can read from that queue and feed a progress bar.
Short Explanation
Why can't you pass a multiprocessing.Queue as a worker function argument? The short answer is that submitted tasks are placed on an internal input queue from which the pool processes get the next task to perform. The arguments must therefore be serializable with pickle, and a multiprocessing.Queue is not serializable in general. It is serializable only in the special case of being passed to a child process as a function argument: arguments to a multiprocessing.Process are stored as an attribute of the instance when it is created, and when start is called on the instance, that state is serialized to the new address space before the run method is called there. Why this serialization works for that case but not the general case is unclear to me; I would have to spend a lot of time looking at the interpreter's source to come up with a definitive answer.
See what happens when I try to put a queue instance onto a queue:
>>> from multiprocessing import Queue
>>> q1 = Queue()
>>> q2 = Queue()
>>> q1.put(q2)
>>> Traceback (most recent call last):
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Program Files\Python38\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>> import pickle
>>> b = pickle.dumps(q2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>>
How to Pass the Queue via Inheritance
First of all, your code will run more slowly using multiprocessing than if you had just called my_task in a loop, because multiprocessing introduces additional overhead (starting processes and moving data across address spaces); the gain from running my_task in parallel has to more than offset that overhead. In your case it doesn't, because my_task is not sufficiently CPU-intensive to justify multiprocessing.
That said, when you want your pool processes to use a multiprocessing.Queue instance, it cannot be passed as an argument to a worker function (unlike the case where you explicitly use multiprocessing.Process instances instead of a pool). Instead, you must initialize a global variable in each pool process with the queue instance.
If you are running under a platform that uses fork to create new processes, then you can just create queue as a global and it will be inherited by each pool process:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

queue = Queue()

def my_task(x):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
    # This queue must be read before the pool terminates:
    for _ in range(10):
        print(queue.get())
Prints:
1
0
2
3
6
5
4
7
8
9
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
If you need portability with platforms that do not use the fork method to create processes, such as Windows (which uses the spawn method), then you cannot allocate the queue as a global, since each pool process would create its own queue instance. Instead, the main process must create the queue and then initialize each pool process's global queue variable using the initializer and initargs arguments:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    queue.put("Task Complete")
    return x

# Required for Windows compatibility
if __name__ == '__main__':
    q = Queue()

    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        for task in as_completed(tasks):
            print(task.result())
        # This queue must be read before the pool terminates:
        for _ in range(10):
            print(q.get())
If you want to advance a progress bar as each task completes (you haven't precisely stated how the bar is to advance; see my comment to your question), then the following shows that a queue is not necessary. But if each submitted task consisted of N parts (for a total of 10 * N parts, since there are 10 tasks) and you would like to see a single progress bar advance as each part completes, then a queue is probably the most straightforward way of signaling a part completion back to the main process.
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def my_task(x):
    return x

# Required for Windows compatibility
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        with tqdm(total=10) as bar:
            tasks = [executor.submit(my_task, i) for i in range(10)]
            for _ in as_completed(tasks):
                bar.update()
        # To get the results in task submission order:
        results = [task.result() for task in tasks]
        print(results)
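For the N-parts-per-task case described above, a minimal sketch (an illustration, not part of the answer; N and the placement of the per-part work are assumed) would have each worker put a token on the queue after finishing a part, while the main process drains the queue to advance the bar:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Queue
from tqdm import tqdm

N = 5  # assumed number of parts per task

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    for _ in range(N):
        # ... do one part of the work here ...
        queue.put(1)          # signal that one part has completed
    return x

if __name__ == '__main__':
    q = Queue()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        with tqdm(total=10 * N) as bar:
            for _ in range(10 * N):
                q.get()       # blocks until the next part-completion token arrives
                bar.update()
    results = [task.result() for task in tasks]
    print(results)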
Apologies if the title is not descriptive; I struggled to find a good title for this question.
My question involves Python, CouchDB (to a lesser degree), multiprocessing and networking. It started out as I was trying to debug a co-worker's program that uses Python's multiprocessing module to parallelize requests to a CouchDB database via couchdb-python. I created a minimal program to exhibit the bug and eventually solved the issue, but the solution raised another question which I was not able to answer to the best of my knowledge. I'm hoping experts on SO can help me with this, so here it goes.
The premise of the problem is pretty simple. We have n resources, all of which can be retrieved concurrently. Instead of making n serial requests, my co-worker is using the multiprocessing module to fetch all n resources in parallel. Here's a program I wrote to demonstrate the issue:
The Script (bug.py)
import couchdb
import multiprocessing

server = couchdb.Server(SERVER)

try:
    database = server.create('test')
except:
    server.delete('test')
    database = server.create('test')

database.save({'_id': '1', 'type': 'dog', 'name': 'chase'})
database.save({'_id': '2', 'type': 'dog', 'name': 'rubble'})
database.save({'_id': '3', 'type': 'cat', 'name': 'kali'})

def query_id(id):
    print(dict(database[id]))

def main():
    args = [
        ['dog', 'chase'],
        ['dog', 'rubble'],
        ['cat', 'kali'],
    ]
    print('-' * 80)
    processes = []
    for id_ in ['1', '2', '3']:
        proc = multiprocessing.Process(target=query_id, args=(id_,))
        processes.append(proc)
        proc.start()
    for proc in processes:
        proc.join()

if __name__ == '__main__':
    main()
Pretty innocent code, right? Well, running it on the latest couchdb and couchdb-python gives the following error:
The output
--------------------------------------------------------------------------------
Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
File "bug.py", line 25, in query_id
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
return Document(data)
TypeError: 'ResponseBody' object is not iterable
return Document(data)
TypeError: 'ResponseBody' object is not iterable
After some digging, I finally found out that couchdb-python's implementation of ConnectionPool is not multiprocess-safe. See this PR for more details. Basically, all processes share the same ConnectionPool object and are handed the same httplib.HTTPConnection object, and when they all try to read from the connection simultaneously, the data being returned is garbled, hence the bug. You can see evidence of this if you put print(os.getpid(), line) inside the httplib.HTTPResponse._read_status method. Here's a sample output after the print statement is added:
(26490, 'TP1.120 O\r\n')
(26489, 'T/ 0KServer: CouchDB/1.6.1 (Erlang OTP/17)\r\n')
Process Process-2:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
return Document(data)
TypeError: 'ResponseBody' object is not iterable
return Document(data)
TypeError: 'ResponseBody' object is not iterable
As seen here, the first lines read by the sub-processes are only partial, indicating a race condition. If I inspect the HTTPConnection object further, I can see that all three processes share the same connection object, the same socket to the server, and the same file descriptor from that socket being used for reading.
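(As an aside, here is a sketch of the most direct workaround, not the connection-pool fix itself: give each worker its own connection so nothing is shared between processes; SERVER is the same constant used in the script above.)
import couchdb
import multiprocessing

def query_id(id):
    # Creating the Server inside the worker gives each process its own
    # ConnectionPool, HTTPConnection and socket, so nothing is shared.
    server = couchdb.Server(SERVER)
    database = server['test']
    print(dict(database[id]))

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=query_id, args=(id_,))
                 for id_ in ['1', '2', '3']]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()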
Puzzle
So far so good. I've identified the root cause of the problem and put together a fix. However, a complication arises when I put the couchdb instance behind a reverse proxy. In this case, I'm using haproxy. Here's a sample config:
global
    ...

defaults
    ...

listen couchdb
    bind *:9999
    mode http
    stats enable
    option httpclose
    option forwardfor
    server couchdb-1 127.0.0.1:5984 check
I pointed the couchdb server URL to http://localhost:9999 in the bug script, reran the script, and everything was fine! I also inspected the connection object, the socket and the file descriptor, and they were likewise shared among all processes.
This got me puzzled. I brought up mitmproxy and inspected what's going on in the two cases: with or without haproxy.
Without haproxy
When the parallel requests are made without haproxy, I observed the following in the mitmproxy details tab (I'm showing a single request, but the timing sequence is the same for all 3 concurrent requests):
The event sequence here suggests a blocking synchronous request.
With haproxy
You can see the sequence here is different from that without haproxy. Request is considered complete without the server connection being initiated.
Question
I'm not used to working at this low level, so I know my knowledge in this area is pretty lacking. I want to understand what difference putting haproxy in front of CouchDB made that masked the multiprocessing bug in couchdb-python. haproxy is event-based, so I suspect that has something to do with it, but I would really appreciate someone explaining the difference!
Thanks a bunch in advance!
I'm trying to make a multiprocessing Queue in Python 2.7 that fills up to its maxsize with processes, and then, while there are more processes to be done that haven't yet been put into the Queue, refills the Queue whenever any of the current procs finish. I'm trying to maximize performance, so the size of the Queue is numCores on the PC, so that each core is always doing work (ideally the CPU will be at 100% use the whole time). I'm also trying to avoid context switching, which is why I only want this many in the Queue at any time.
For example, say there are 50 tasks to be done and the CPU has 4 cores, so the Queue has maxsize 4. We start by filling the Queue with 4 processes, and immediately upon any of those 4 finishing (at which point there will be 3 in the Queue), a new proc is generated and sent to the queue. It continues doing this until all 50 tasks have been generated and completed.
This task is proving to be difficult since I'm new to multiprocessing, and it also seems the join() function will not work for me, since that blocks until ALL of the procs in the Queue have completed, which is NOT what I want.
Here is my code right now:
def queuePut(q, thread):
    q.put(thread)

def launchThreads(threadList, performanceTestList, resultsPath, cofluentExeName):
    numThreads = len(threadList)
    threadsLeft = numThreads
    print "numThreads: " + str(numThreads)
    cpuCount = multiprocessing.cpu_count()
    q = multiprocessing.Queue(maxsize=cpuCount)
    count = 0
    while count != numThreads:
        while not q.full():
            thread = threadList[numThreads - threadsLeft]
            p = multiprocessing.Process(target=queuePut, args=(q, thread))
            print "Starting thread " + str(numThreads - threadsLeft)
            p.start()
            threadsLeft -= 1
            count += 1
            if threadsLeft == 0:
                threadsLeft += 1
                break
Here is where it gets called in code:
for i in testNames:
    p = multiprocessing.Process(target=worker, args=(i, paths[0], cofluentExeName,))
    jobs.append(p)

launchThreads(jobs, testNames, testDirectory, cofluentExeName)
The procs seem to get created and put into the queue; for an example where there are 12 tasks and 40 cores, the output is as follows, followed by the error below:
numThreads: 12
Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
Starting thread 8
Starting thread 9
Starting thread 10
Starting thread 11
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Why don't you use a multiprocessing Pool to accomplish this?
import multiprocessing
pool = multiprocessing.Pool()
pool.map(your_function, dataset) ##dataset is a list; could be other iterable object
pool.close()
pool.join()
The multiprocessing.Pool() can have the argument processes=# where you specify the # of jobs you want to start. If you don't specify this parameter, it will start as many jobs as you have cores (so if you have 4 cores, 4 jobs). When one job finishes it'll automatically start the next one; you don't have to manage that.
Multiprocessing: https://docs.python.org/2/library/multiprocessing.html
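Applied to the code in the question, a rough sketch might look like the following (worker, testNames, paths and cofluentExeName are the names from the question; their definitions are assumed):
import multiprocessing

if __name__ == '__main__':
    # One worker per core; the pool hands out the next task as soon as a core frees up
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    async_results = [pool.apply_async(worker, (i, paths[0], cofluentExeName))
                     for i in testNames]
    pool.close()
    pool.join()
    # Calling get() surfaces any exception raised inside a worker
    results = [r.get() for r in async_results]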
I am trying to use the example Pika Async consumer (http://pika.readthedocs.io/en/0.10.0/examples/asynchronous_consumer_example.html) as a multiprocessing process (by making the ExampleConsumer class subclass multiprocessing.Process). However, I'm running into some issues with gracefully shutting down everything.
Let's say for example I have defined my procs as below:
for k, v in queues_callbacks.iteritems():
    proc = ExampleConsumer(queue, k, v, rabbit_user, rabbit_pw, rabbit_host, rabbit_port)
"queues_callbacks" is basically just a dictionary of exchange : callback_function (ideally I'd like to be able to connect to several exchanges with this architecture).
Then I start the processes in the usual Python way:
try:
    for proc in self.consumers:
        proc.start()
    for proc in self.consumers:
        proc.join()
except KeyboardInterrupt:
    for proc in self.consumers:
        proc.terminate()
        proc.join(1)
The issue comes when I try to stop everything. Let's say I've overridden the "terminate" method to call the consumer's "stop" method and then continue with the normal Process terminate. With this structure, I am getting some strange attribute errors:
Traceback (most recent call last):
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 154, in <module>
main()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 150, in main
mybot.start()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 71, in start
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 53, in stop
self.__stop_consumers__()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 130, in __stop_consumers__
self.consumers[0].terminate()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 414, in terminate
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 399, in stop
self._connection.ioloop.start()
AttributeError: 'NoneType' object has no attribute 'ioloop'
It's as if these attributes somehow disappear at some point. In the particular case above, _connection is initialized as None, but then gets set when the consumer is started. However, when the "stop" method is called, it has already reverted back to None (with nothing in the code setting it back). I'm also observing other strange behavior, such as times when it appears that things are getting called twice (even though "stop" is called once). Any ideas as to what is going on here, or is this not the proper way of architecting this?
Thanks!
I have created this sample program to generalize the issue I am facing:
import multiprocessing
from multiprocessing import Manager

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    dict = manager.dict()
    dict['process_obj'] = multiprocessing.current_process()
    print dict

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()
So how do I store a process object in multiprocessing Manager.dict()?
I assume you're talking about getting this error:
hello function
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mp2.py", line 8, in f
dict['process_obj'] = multiprocessing.current_process()
File "<string>", line 2, in __setitem__
File "/usr/local/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
(it's generally a good idea to include "what I got" and "what I expected to get instead" in the question).
The fundamental problem here is that the object returned by multiprocessing.current_process() does not pickle: its state includes an instance method, and instance methods don't pickle properly, while multiprocessing has to save (pickle) and load (unpickle) shared data items to communicate their values from one process to another. See, e.g., Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map() and Overcoming Python's limitations regarding instance methods. Note in particular one of the answers in the second: it might be better to figure out some state to send/share, rather than an entire instance. For instance, if the ident of a process suffices, you can do this:
dict['process_obj'] = multiprocessing.current_process().ident
which works fine.
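Putting that together with the sample program (a sketch of the adjusted version, keeping the question's Python 2 style; the dict is renamed d to avoid shadowing the builtin):
import multiprocessing

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    d = manager.dict()
    # Store picklable state (the ident) rather than the Process object itself
    d['process_obj'] = multiprocessing.current_process().ident
    print d

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()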