I'm trying to make a multiprocessing Queue in Python 2.7 that fills up to its maxsize with processes, and then, while there are more tasks to be done that haven't yet been put into the Queue, refills the Queue whenever any of the current procs finish. I'm trying to maximize performance, so the size of the Queue is the number of cores on the PC, so that each core always has work (ideally the CPU will be at 100% use the whole time). I'm also trying to avoid context switching, which is why I only want this many in the Queue at any time.
An example: say there are 50 tasks to be done and the CPU has 4 cores, so the Queue will have maxsize 4. We start by filling the Queue with 4 processes, and immediately upon any of those 4 finishing (at which time there will be 3 in the Queue), a new proc is generated and sent to the queue. It continues doing this until all 50 tasks have been generated and completed.
This task is proving to be difficult since I'm new to multiprocessing, and it also seems the join() function will not work for me, since that blocks until ALL of the procs in the Queue have completed, which is NOT what I want.
Here is my code right now:
def queuePut(q, thread):
    q.put(thread)

def launchThreads(threadList, performanceTestList, resultsPath, cofluentExeName):
    numThreads = len(threadList)
    threadsLeft = numThreads
    print "numThreads: " + str(numThreads)
    cpuCount = multiprocessing.cpu_count()
    q = multiprocessing.Queue(maxsize=cpuCount)
    count = 0
    while count != numThreads:
        while not q.full():
            thread = threadList[numThreads - threadsLeft]
            p = multiprocessing.Process(target=queuePut, args=(q, thread))
            print "Starting thread " + str(numThreads - threadsLeft)
            p.start()
            threadsLeft -= 1
            count += 1
            if threadsLeft == 0:
                threadsLeft += 1
                break
Here is where it gets called in code:
for i in testNames:
    p = multiprocessing.Process(target=worker, args=(i, paths[0], cofluentExeName,))
    jobs.append(p)

launchThreads(jobs, testNames, testDirectory, cofluentExeName)
The procs seem to get created and put into the queue; for an example where there are 12 tasks and 40 cores, the output is as follows, followed by the error below:
numThreads: 12
Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
Starting thread 8
Starting thread 9
Starting thread 10
Starting thread 11
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Why don't you use a multiprocessing Pool to accomplish this?
import multiprocessing
pool = multiprocessing.Pool()
pool.map(your_function, dataset) ##dataset is a list; could be other iterable object
pool.close()
pool.join()
multiprocessing.Pool() accepts a processes argument where you specify the number of worker processes you want to start. If you don't specify it, the pool starts as many workers as you have cores (so if you have 4 cores, 4 workers). When one job finishes, the pool automatically hands that worker the next one; you don't have to manage that yourself.
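For the scenario in the question, a minimal sketch might look like the following (worker and the list of test names are stand-ins for the question's actual function and data):
import multiprocessing

def worker(test_name):
    # Stand-in for the real per-test work that should saturate one core.
    return test_name

if __name__ == '__main__':
    test_names = ['test_{}'.format(i) for i in range(50)]
    # One worker process per core; the pool hands each worker a new task
    # as soon as it finishes the previous one, keeping every core busy.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = pool.map(worker, test_names)
    pool.close()
    pool.join()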
Multiprocessing: https://docs.python.org/2/library/multiprocessing.html
Related
When I run the below code:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue
q = Queue()
def my_task(x, queue):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i, q) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
I get this error:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/nn.py", line 14, in <module>
print(task.result())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
What is the purpose of multiprocessing.Queue if I cannot use it for multiprocessing? How can I make this work? In my real code, I need every worker to update a queue frequently about the task status, so another thread can read data from that queue to feed a progress bar.
Short Explanation
Why can't you pass a multiprocessing.Queue as a worker function argument? The short answer is that submitted tasks and their arguments are placed on the pool's internal input queue, from which the pool processes get the next task to be performed. Those arguments must be serializable with pickle, and a multiprocessing.Queue is not in general serializable. It is, however, serializable for the special case of being passed as a function argument to a child process: arguments to a multiprocessing.Process are stored as an attribute of the instance when it is created, and when start is called on the instance, its state must be serialized into the new address space before the run method is called there. Why this serialization works for that case but not the general case is unclear to me; I would have to spend a lot of time looking at the interpreter source to come up with a definitive answer.
See what happens when I try to put a queue instance onto a queue:
>>> from multiprocessing import Queue
>>> q1 = Queue()
>>> q2 = Queue()
>>> q1.put(q2)
>>> Traceback (most recent call last):
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Program Files\Python38\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>> import pickle
>>> b = pickle.dumps(q2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>>
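By contrast, passing the same queue directly as an argument to a multiprocessing.Process does work, because that is the special case handled when the child process is started. A minimal sketch:
from multiprocessing import Process, Queue

def worker(q):
    # The queue was passed at process-creation time, so it is usable here.
    q.put('hello from the child')

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))  # serialized as part of process start-up
    p.start()
    print(q.get())  # prints: hello from the child
    p.join()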
How to Pass the Queue via Inheritance
First of all, your code will run more slowly using multiprocessing than if you had just called my_task in a loop, because multiprocessing introduces additional overhead (starting processes and moving data across address spaces). That overhead has to be more than offset by what you gain from running my_task in parallel, and in your case it isn't, because my_task is not sufficiently CPU-intensive to justify multiprocessing.
That said, when you wish to have your pool processes use a multiprocessing.Queue instance, it cannot be passed as an argument to a worker function (unlike the case when you are explicitly using multiprocessing.Process instances instead of a pool). Instead, you must initialize a global variable in each pool process with the queue instance.
If you are running under a platform that uses fork to create new processes, then you can just create queue as a global and it will be inherited by each pool process:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue
queue = Queue()
def my_task(x):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
    # This queue must be read before the pool terminates:
    for _ in range(10):
        print(queue.get())
Prints:
1
0
2
3
6
5
4
7
8
9
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
If you need portability with platforms that do not use the fork method to create processes, such as Windows (which uses the spawn method), then you cannot allocate the queue as a global since each pool process will create its own queue instance. Instead, the main process must create the queue and then initialize each pool process' global queue variable by using the initializer and initargs:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue
def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    queue.put("Task Complete")
    return x

# Windows compatibility
if __name__ == '__main__':
    q = Queue()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        for task in as_completed(tasks):
            print(task.result())
        # This queue must be read before the pool terminates:
        for _ in range(10):
            print(q.get())
If you want to advance a progress bar as each task completes (you haven't precisely stated how the bar is to advance; see my comment to your question), then the following shows that a queue is not necessary. However, if each submitted task consisted of N parts (for a total of 10 * N parts, since there are 10 tasks) and you would like a single progress bar to advance as each part is completed, then a queue is probably the most straightforward way of signaling a part completion back to the main process.
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm
def my_task(x):
    return x

# Windows compatibility
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        with tqdm(total=10) as bar:
            tasks = [executor.submit(my_task, i) for i in range(10)]
            for _ in as_completed(tasks):
                bar.update()
        # To get the results in task submission order:
        results = [task.result() for task in tasks]
        print(results)
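For the per-part case, here is a sketch of one way to wire it up, reusing the pool-initializer pattern from above; N (the number of parts per task) and the listener thread are illustrative assumptions, not something taken from the question:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Queue
from threading import Thread
from tqdm import tqdm

N = 5  # hypothetical number of parts per task

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    for _ in range(N):
        # ... do one part of the work here ...
        queue.put(None)  # signal that one part has completed
    return x

def progress_listener(q, total):
    # Runs in a thread of the main process and advances the bar once per part.
    with tqdm(total=total) as bar:
        for _ in range(total):
            q.get()
            bar.update()

if __name__ == '__main__':
    q = Queue()
    listener = Thread(target=progress_listener, args=(q, 10 * N))
    listener.start()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        results = [task.result() for task in tasks]
    listener.join()
    print(results)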
I'm trying to add spawned subprocesses to a queue so that only one can execute at a time. I don't want to wait for the process to execute with each iteration of the for loop because there is some code at the beginning of the loop that can run in parallel with the processes.
Example code:
from multiprocessing import Queue
import subprocess
q = Queue()
hrs = range(0,12)
for hr in hrs:
    print(hr)
    # There will be other code here that takes some time to run
    p = subprocess.Popen(['python', 'test.py', '--hr={}'.format(hr)])
    q.put(p)
This results in:
Traceback (most recent call last):
File "/home/kschneider/anaconda3/envs/ewall/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/kschneider/anaconda3/envs/ewall/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object
Is there a different way to set this up that will not result in the thread lock errors?
I would like to be able to create a new multiprocessing.Value or multiprocessing.Array after a process has started, like in this example:
# coding: utf-8
import multiprocessing
shared = {
    'foo': multiprocessing.Value('i', 42),
}

def job(pipe):
    while True:
        shared_key = pipe.recv()
        print(shared[shared_key].value)

process_read_pipe, process_write_pipe = multiprocessing.Pipe(duplex=False)
process = multiprocessing.Process(
    target=job,
    args=(process_read_pipe, )
)
process.start()
process_write_pipe.send('foo')
shared['bar'] = multiprocessing.Value('i', 24)
process_write_pipe.send('bar')
Output:
42
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/bux/Projets/synergine2/p.py", line 12, in job
print(shared[shared_key].value)
KeyError: 'bar'
Process finished with exit code 0
The problem here is that the shared dict is copied into the process when it starts, but if I add a key to the shared dict afterwards, the process can't see it. How can the already-started process be informed about the existence of the new multiprocessing.Value('i', 24)?
It can't be sent through the pipe because:
Synchronized objects should only be shared between processes through inheritance
Any ideas?
It looks like you are assuming the shared variable is accessible by both processes. Only shared['foo'] is accessible by both, because it already existed when the child process was started. You need to share a dictionary instead.
Here is an example: Python multiprocessing: How do I share a dict among multiple processes?
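For instance, here is a sketch of the question's program reworked to use a manager dict; plain integers are stored instead of multiprocessing.Value objects, since synchronized objects run into the same inheritance restriction when sent to the manager (the sentinel-based shutdown is just an illustrative choice):
import multiprocessing

def job(shared, pipe):
    while True:
        shared_key = pipe.recv()
        if shared_key is None:  # sentinel: stop the worker
            break
        # Every lookup goes through the manager process, so keys added
        # after this process started are visible here.
        print(shared[shared_key])

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict()
    shared['foo'] = 42

    read_end, write_end = multiprocessing.Pipe(duplex=False)
    process = multiprocessing.Process(target=job, args=(shared, read_end))
    process.start()

    write_end.send('foo')   # child prints 42
    shared['bar'] = 24      # added after the child started
    write_end.send('bar')   # child prints 24
    write_end.send(None)    # tell the child to exit
    process.join()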
I am trying to use the example Pika Async consumer (http://pika.readthedocs.io/en/0.10.0/examples/asynchronous_consumer_example.html) as a multiprocessing process (by making the ExampleConsumer class subclass multiprocessing.Process). However, I'm running into some issues with gracefully shutting down everything.
Let's say for example I have defined my procs as below:
for k, v in queues_callbacks.iteritems():
    proc = ExampleConsumer(queue, k, v, rabbit_user, rabbit_pw, rabbit_host, rabbit_port)
"queues_callbacks" is basically just a dictionary of exchange : callback_function (ideally I'd like to be able to connect to several exchanges with this architecture).
Then I do the normal python way of dealing with starting processes:
try:
    for proc in self.consumers:
        proc.start()
    for proc in self.consumers:
        proc.join()
except KeyboardInterrupt:
    for proc in self.consumers:
        proc.terminate()
        proc.join(1)
The issue comes when I try to stop everything. Let's say I've overridden the "terminate" method to call the consumer's "stop" method and then continue on with the normal terminate of Process. With this structure, I am getting some strange attribute errors:
Traceback (most recent call last):
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 154, in <module>
main()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 150, in main
mybot.start()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 71, in start
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 53, in stop
self.__stop_consumers__()
File "/Users/christopheralexander/PycharmProjects/new_bot/abstract_bot.py", line 130, in __stop_consumers__
self.consumers[0].terminate()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 414, in terminate
self.stop()
File "/Users/christopheralexander/PycharmProjects/new_bot/rabbit_consumer.py", line 399, in stop
self._connection.ioloop.start()
AttributeError: 'NoneType' object has no attribute 'ioloop'
It's as if these attributes somehow disappear at some point. In the particular case above, _connection is initialized as None, but then gets set when the Consumer is started. However, when the "stop" method is called, it has already reverted back to None (with nothing set to do so). I'm also observing other strange behavior, such as times when it appears that things are getting called twice (even though "stop" is called once). Any ideas as to what is going on here, or is this not the proper way of architecting this?
Thanks!
I have created this sample program to generalize the issue I am facing:
import multiprocessing
from multiprocessing import Manager
def f(_print):
    print _print
    manager = multiprocessing.Manager()
    dict = manager.dict()
    dict['process_obj'] = multiprocessing.current_process()
    print dict

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()
So how do I store a process object in multiprocessing Manager.dict()?
I assume you're talking about getting this error:
hello function
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mp2.py", line 8, in f
dict['process_obj'] = multiprocessing.current_process()
File "<string>", line 2, in __setitem__
File "/usr/local/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
(it's generally a good idea to include "what I got" and "what I expected to get instead" in the question).
The fundamental problem here is that the value returned by multiprocessing.current_process() does not pickle: pickling it trips over an instance method, and instance methods don't pickle properly. multiprocessing has to save (pickle) and load (unpickle) shared data items to communicate their values from one process to another. See, e.g., Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map() and Overcoming Python's limitations regarding instance methods. Note in particular one of the answers to the second: it might be better to figure out some state to send/share, rather than an entire instance. For instance, if the ident of a process suffices, you can do this:
dict['process_obj'] = multiprocessing.current_process().ident
which works fine.
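Putting that together, here is a sketch of the sample program with just that change (same Python 2 syntax as the question; the dict is renamed to d only to avoid shadowing the built-in):
import multiprocessing

def f(_print):
    print _print
    manager = multiprocessing.Manager()
    d = manager.dict()
    # Store picklable state (the ident) rather than the Process object itself.
    d['process_obj'] = multiprocessing.current_process().ident
    print d

if __name__ == '__main__':
    process = multiprocessing.Process(target=f, args=('hello function', ))
    process.start()
    process.join()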