Knowing when you've read everything off a multiprocessing Queue - python

I have some code that farms out work to tasks. The tasks put their results on a queue, and the main thread reads these results from the queue and deals with them.
from multiprocessing import Process, Queue, Pool, Manager
import uuid

def handle_task(arg, queue, end_marker):
    ... add some number of results to the queue . . .
    queue.put(end_marker)

def main(tasks):
    manager = Manager()
    queue = manager.Queue()
    count = len(tasks)
    end_marker = uuid.uuid4()
    with Pool() as pool:
        pool.starmap(handle_task, ((task, queue, end_marker) for task in tasks))
        while count > 0:
            value = queue.get()
            if value == end_marker:
                count -= 1
            else:
                ... deal with value ...
This code works, but it is incredibly kludgy and inelegant. What if tasks is an iterator? Why do I need to know how many tasks there are ahead of time and keep track of each of them?
Is there a cleaner way of reading from a Queue and knowing that every process that will write to that queue is done, and that you've read everything they've written?

First of all, operations on a managed queue are very slow compared to a multiprocessing.Queue instance. But why are you even using an additional queue to return results when a multiprocessing pool already uses such a queue for returning results? Instead of having handle_task write some number of result values to a queue, it could simply return a list of these values. For example,
from multiprocessing import Pool

def handle_task(arg):
    results = []
    # Add some number of results to the results list:
    results.append(arg + arg)
    results.append(arg * arg)
    return results

def main(tasks):
    with Pool() as pool:
        map_results = pool.map(handle_task, tasks)
        for results in map_results:
            for value in results:
                # Deal with value:
                print(value)

if __name__ == '__main__':
    main([7, 2, 3])
Prints:
14
49
4
4
6
9
As a side benefit, the results returned will be in task-submission order, which one day might be important. If you want to be able to process the returned values as they become available, then you can use pool.imap or pool.imap_unordered (if you don't care about the order of the returned values, which seems to be the case):
from multiprocessing import Pool

def handle_task(arg):
    results = []
    # Add some number of results to the results list:
    results.append(arg + arg)
    results.append(arg * arg)
    return results

def main(tasks):
    with Pool() as pool:
        for results in pool.imap_unordered(handle_task, tasks):
            for value in results:
                # Deal with value:
                print(value)

if __name__ == '__main__':
    main([7, 2, 3])
If the number of tasks being submitted is "large", then you should probably use the chunksize argument of the imap_unordered method. A reasonable value would be len(tasks) / (4 * pool_size) where you are using by default a value of multiprocessing.cpu_count() for your pool size. This is more or less how a chunksize value is computed when you use the map or starmap methods and you have not specified the chunksize argument.
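As a rough sketch of that rule of thumb, the helper below divides the task count by four times the pool size and rounds up, which is how I understand map to pick its default. The name default_chunksize is made up for this illustration and is not part of multiprocessing, and the tasks have to be a list (or similar) because len() is needed.

```python
from multiprocessing import Pool, cpu_count


def default_chunksize(num_tasks, pool_size):
    # Hypothetical helper: ceiling of num_tasks / (4 * pool_size)
    chunksize, extra = divmod(num_tasks, 4 * pool_size)
    if extra:
        chunksize += 1
    return chunksize


def handle_task(arg):
    return [arg + arg, arg * arg]


def main(tasks):
    pool_size = cpu_count()
    # imap_unordered requires chunksize >= 1, hence the max()
    chunksize = max(1, default_chunksize(len(tasks), pool_size))
    with Pool(pool_size) as pool:
        for results in pool.imap_unordered(handle_task, tasks, chunksize=chunksize):
            for value in results:
                print(value)


if __name__ == '__main__':
    main(list(range(100)))
```

Even with a chunksize greater than 1, imap_unordered still hands back one result per task; the chunking only affects how tasks are batched on their way to the workers.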
Using a multiprocessing.Queue instance
from multiprocessing import Pool, Queue
from queue import Empty

def init_pool_processes(q):
    global queue
    queue = q

def handle_task(arg):
    # Add some number of results to the queue:
    queue.put(arg + arg) # Referencing the global queue
    queue.put(arg * arg)

def main(tasks):
    queue = Queue()
    with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
        pool.map(handle_task, tasks)
        try:
            while True:
                value = queue.get_nowait()
                print(value)
        except Empty:
            pass

if __name__ == '__main__':
    main([7, 2, 3])
Although calling queue.empty() is not supposed to be reliable for a multiprocessing.Queue instance, as long as you are doing this after all the tasks have finished processing, it seems no more unreliable than relying on non-blocking get calls raising an exception only after all items have been retrieved:
from multiprocessing import Pool, Queue

def init_pool_processes(q):
    global queue
    queue = q

def handle_task(arg):
    # Add some number of results to the queue:
    queue.put(arg + arg) # Referencing the global queue
    queue.put(arg * arg)

def main(tasks):
    queue = Queue()
    with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
        pool.map(handle_task, tasks)
        while not queue.empty():
            value = queue.get_nowait()
            print(value)

if __name__ == '__main__':
    main([7, 2, 3])
But if you want to do everything strictly according to what the documentation implies is the only reliable method when using a multiprocessing.Queue instance, that would be by using sentinels as you already are doing:
from multiprocessing import Pool, Queue

class Sentinel:
    pass

SENTINEL = Sentinel()

def init_pool_processes(q):
    global queue
    queue = q

def handle_task(arg):
    # Add some number of results to the queue:
    queue.put(arg + arg) # Referencing the global queue
    queue.put(arg * arg)
    queue.put(SENTINEL)

def main(tasks):
    queue = Queue()
    with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
        pool.map_async(handle_task, tasks) # Does not block
        # Read inside the with block: leaving it terminates the pool
        sentinel_count = len(tasks)
        while sentinel_count != 0:
            value = queue.get()
            if isinstance(value, Sentinel):
                sentinel_count -= 1
            else:
                print(value)

if __name__ == '__main__':
    main([7, 2, 3])
Conclusion
If you need to use a queue for output, I would recommend a multiprocessing.Queue. In this case using sentinels is really the only 100% correct way of proceeding. I would also use the map_async method so that you can start processing results as they are returned.
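To address the "what if tasks is an iterator?" part of the question, here is a minimal sketch that counts tasks as they are submitted instead of calling len(). It assumes None never occurs as a real result, so it can serve as the sentinel; the drain helper and the submitted counter are names invented for this example.

```python
from multiprocessing import Pool, Queue

SENTINEL = None  # Assumption: None is never a real result value


def init_pool_processes(q):
    global queue
    queue = q


def handle_task(arg):
    queue.put(arg + arg)
    queue.put(arg * arg)
    queue.put(SENTINEL)  # One sentinel per finished task


def drain(q, expected_sentinels):
    """Yield values from q until expected_sentinels sentinels have been seen."""
    while expected_sentinels > 0:
        value = q.get()
        if value is SENTINEL:
            expected_sentinels -= 1
        else:
            yield value


def main(tasks):
    q = Queue()
    with Pool(initializer=init_pool_processes, initargs=(q,)) as pool:
        submitted = 0
        for task in tasks:  # Any iterable works, including a generator
            pool.apply_async(handle_task, (task,))
            submitted += 1
        for value in drain(q, submitted):
            print(value)


if __name__ == '__main__':
    main(iter([7, 2, 3]))
```

One caveat with this sketch: if a task raises before putting its sentinel, drain will wait forever, so in real code handle_task would probably want a try/finally around its work.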
Using a Managed Queue
This is Pingu's answer, which remains deleted for now:
from multiprocessing import Pool, Manager
from random import randint

def process(n, q):
    for x in range(randint(1, 10)):
        q.put((n, x))

def main():
    with Manager() as manager:
        queue = manager.Queue()
        with Pool() as pool:
            pool.starmap(process, [(n, queue) for n in range(5)])
        while not queue.empty():
            print(queue.get())

if __name__ == '__main__':
    main()

Related

Python dynamic MultiThread with Queue - Class

I had been struggling to implement a proper dynamic multi-thread system until now. The idea is to spin up multiple new pools of sub-threads from the main thread (each pool has its own number of threads and queue size) to run functions, and the user can decide whether the main thread should wait for the sub-threads to finish or just move on to the next line after starting them. This multi-thread logic helps to extract data in parallel and at a high frequency.
The solution to my issue is shared below for everyone who wants it. If you have any doubts and questions, please let me know.
# -*- coding: utf-8 -*-
"""
Created on Mon Jul 5 00:00:51 2021
#author: Tahasanul Abraham
"""
#%% Initialization of Libraries
import sys, os, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir)
parentdir_1up = os.path.dirname(parentdir)
sys.path.insert(0,parentdir_1up)
from queue import Queue
from threading import Thread, Lock

class Worker(Thread):
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.lock = Lock()
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                if func.lower() == "terminate":
                    # Mark the terminate request as done so a later join() does not hang
                    self.tasks.task_done()
                    break
            except AttributeError:
                # func is a callable rather than a string, so run it
                try:
                    with self.lock:
                        func(*args, **kargs)
                except Exception as exception:
                    print(exception)
            self.tasks.task_done()

class ThreadPool:
    def __init__(self, num_threads, num_queue=None):
        if num_queue is None or num_queue < num_threads:
            num_queue = num_threads
        self.tasks = Queue(num_queue)
        self.threads = num_threads
        for _ in range(num_threads):
            Worker(self.tasks)

    # This function can be called to terminate all the worker threads of the queue
    def terminate(self):
        self.wait_completion()
        for _ in range(self.threads):
            self.add_task("terminate")
        return None

    # This function can be called to add new work to the queue
    def add_task(self, func, *args, **kargs):
        self.tasks.put((func, args, kargs))

    # This function can be called to wait until all the workers are done processing the pending work. If it is called, the main thread will not run any further lines until all the workers are done with the pending work.
    def wait_completion(self):
        self.tasks.join()

    # This function can be called to check whether there is any pending/running work in the queue. It returns True if any work is pending, otherwise False.
    def is_alive(self):
        return self.tasks.unfinished_tasks != 0
#%% Standalone Run
if __name__ == "__main__":
    import time

    def test_return(x,d):
        print (str(x) + " - pool completed")
        d[str(x)] = x
        time.sleep(5)

    # 2 threads and a queue size of 1000000000
    pool = ThreadPool(2,1000000000)
    r ={}
    for i in range(10):
        pool.add_task(test_return, i, r)
        print (str(i) + " - pool added")
    print ("Waiting for completion")
    pool.wait_completion()
    print ("pool done")

    # 1 thread and a queue size of 2
    pool = ThreadPool(1,2)
    r ={}
    for i in range(10):
        pool.add_task(test_return, i, r)
        print (str(i) + " - pool added")
    print ("Waiting for completion")
    pool.wait_completion()
    print ("pool done")

    # 2 threads and a queue size of 1 (ThreadPool raises it to 2, the thread count)
    pool = ThreadPool(2,1)
    r ={}
    for i in range(10):
        pool.add_task(test_return, i, r)
        print (str(i) + " - pool added")
    print ("Waiting for completion")
    pool.wait_completion()
    print ("pool done")
Making a new Pool
Using the above classes, one can make a pool of their own choice, with the number of parallel threads they want and the size of the queue. Example of creating a pool of 10 threads with a queue size of 200:
pool = ThreadPool(10,200)
Adding work to Pool
Once a pool is created, one can use pool.add_task to hand it work. In my example I used the pool to call a function with its arguments: I called the test_return function with the arguments i and r.
pool.add_task(test_return, i, r)
Waiting for the pool to complete its work
If a pool is given some work to do, the user can either move on to other lines of code or wait for the pool to finish its work before the next lines are executed. To wait for the pool to finish its work before continuing, call wait_completion. Example:
pool.wait_completion()
Terminate and close down the pool threads
Once the pool threads are no longer needed, it is possible to terminate and close them down to free memory and release the blocked threads. This can be done by calling the following function.
pool.terminate()
Checking if there are any pending works from the pool
There is a function that can be called to check whether there is any pending/running work in the queue. It returns True if any work is pending, otherwise False. To check whether the pool is still working, call the following function.
pool.is_alive()

Force Multiprocessing Pool to iterate over argument

I'm using a multiprocessing Pool to run a function over many arguments, again and again. Jobs are kept in a list that is filled by another thread, and a job_handler function handles each job. My problem is that when the list becomes empty, the Pool finishes. I want to keep the pool alive and have it wait until the list fills up again. I can see two ways to approach this.
1. Use one pool, which ends once the list becomes empty:
from multiprocessing import Pool
from threading import Thread
from time import sleep

def job_handler(i):
    print("Doing job:", i)
    sleep(0.5)

def job_adder():
    i = 0
    while True:
        jobs.append(i)
        i += 1
        sleep(0.1)

if __name__ == "__main__":
    pool = Pool(4)
    jobs = []
    thr = Thread(target=job_adder)
    thr.start()
    # wait for job_adder to add to list
    sleep(1)
    pool.map_async(job_handler, jobs)
    while True:
        pass
2. Multiple map_async calls:
from multiprocessing import Pool
from threading import Thread
from time import sleep

def job_handler(i):
    print("Doing job:", i)
    sleep(0.5)

def job_adder():
    i = 0
    while True:
        jobs.append(i)
        i += 1
        sleep(0.1)

if __name__ == "__main__":
    pool = Pool(4)
    jobs = []
    thr = Thread(target=job_adder)
    thr.start()
    while True:
        for job in jobs:
            pool1 = pool.map_async(job_handler, (job,))
            jobs.remove(job)
What is the difference between the two? I think the first option would be nicer because the map itself would handle the iteration. My aim is to get better performance to handle each job separately.
The need to “slow down” a Pool comes up in a number of situations. This case is easier than some:
import queue
q = queue.Queue()
m = pool.imap(job_handler, iter(q.get, None))
Here job_adder would call q.put(i) instead of appending to a list. You can also use imap_unordered; None is a sentinel that ends the stream of tasks once it is put on the queue. The Pool has to use a thread to collect the tasks (since those functions are “lazier [than] map()”), and it will block on q as needed.
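Put together, the idea might look like the following sketch. It adapts job_handler and job_adder from the question; the count parameter and the returned value are additions for this example so that the program ends on its own.

```python
import queue
from multiprocessing import Pool
from threading import Thread
from time import sleep


def job_handler(i):
    return i * 2


def job_adder(q, count):
    # Producer thread: feed jobs into the queue, then signal the end
    for i in range(count):
        q.put(i)
        sleep(0.01)
    q.put(None)  # Sentinel: iter(q.get, None) stops when it sees this


def main():
    q = queue.Queue()
    Thread(target=job_adder, args=(q, 5)).start()
    with Pool(4) as pool:
        # The pool pulls jobs from q as they arrive and blocks while q is empty
        for result in pool.imap(job_handler, iter(q.get, None)):
            print("Done:", result)


if __name__ == "__main__":
    main()
```

The queue.Queue lives only in the parent process; the workers never touch it, so a plain thread-safe queue is enough here.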

python - multiprocessing with queue

Here is my code. I put strings on queue and expect dowork2 to process them and put a character on shared_queue,
but shared_queue is always empty when I reach while not shared_queue.empty().
Please point me in the right direction. Thanks.
import time
import multiprocessing as mp

class Test(mp.Process):
    def __init__(self, **kwargs):
        mp.Process.__init__(self)
        self.daemon = False
        print('dosomething')

    def run(self):
        manager = mp.Manager()
        queue = manager.Queue()
        shared_queue = manager.Queue()
        # shared_list = manager.list()
        pool = mp.Pool()
        results = []
        results.append(pool.apply_async(self.dowork2,(queue,shared_queue)))
        while True:
            time.sleep(0.2)
            t =time.time()
            queue.put('abc')
            queue.put('def')
            l = ''
            while not shared_queue.empty():
                l = l + shared_queue.get()
            print(l)
            print( '%.4f' %(time.time()-t))
        pool.close()
        pool.join()

    def dowork2(queue,shared_queue):
        while True:
            path = queue.get()
            shared_queue.put(path[-1:])

if __name__ == '__main__':
    t = Test()
    t.start()
    # t.join()
    # t.run()
I managed to get it to work by moving your dowork2 outside the class. If you declare dowork2 as a function before the Test class and call it as
results.append(pool.apply_async(dowork2, (queue, shared_queue)))
it works as expected. I am not 100% sure, but it probably goes wrong because your Test class is already subclassing Process. Now when your pool creates a subprocess and initialises the same class in the subprocess, something gets overridden somewhere.
Overall I wonder if Pool is really what you want to use here. Your worker seems to be in an infinite loop, indicating you do not expect a return value from the worker, only the result in the return queue. If this is the case, you can remove Pool.
I also managed to get it to work while keeping your worker function within the class when I scrapped the Pool and replaced it with another subprocess:
foo = mp.Process(group=None, target=self.dowork2, args=(queue, shared_queue))
foo.start()
# results.append(pool.apply_async(Test.dowork2, (queue, shared_queue)))
while True:
    ....
(You need to add self to your worker, though, or declare it as a static method:)
def dowork2(self, queue, shared_queue):
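One thing that makes this kind of problem hard to spot: an exception raised for an apply_async task is stored in the returned AsyncResult and only re-raised when you call get() on it. The question's code never does, so any failure stays silent. A minimal sketch, where broken_worker is a made-up stand-in for a worker that fails:

```python
import multiprocessing as mp


def broken_worker(x):
    # Stand-in for a worker that fails, e.g. because of a wrong signature
    raise ValueError("worker failed for %r" % (x,))


def main():
    with mp.Pool(1) as pool:
        result = pool.apply_async(broken_worker, (1,))
        # Without this get(), the failure would go unnoticed
        try:
            result.get(timeout=10)
        except ValueError as exc:
            print("surfaced in parent:", exc)


if __name__ == "__main__":
    main()
```

Calling get() (or passing an error_callback to apply_async) while debugging is a cheap way to find out whether the worker ever started doing its job.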

Delete Objects in a List as Passed to Multiprocessing

I need to pass each object in a large list to a function. After the function completes I no longer need the object passed to the function and would like to delete the object to save memory. If I were working with a single process I would do the following:
result = []
while len(mylist) > 0:
    result.append(myfunc(mylist.pop()))
As I loop over mylist I pop off each object in the list such that the object is no longer stored in mylist after it's passed to my function. How do I achieve this same effect in parallel using multiprocessing?
A simple consumer example (credits go here), updated for Python 3:
import multiprocessing
import time
import random

class Consumer(multiprocessing.Process):
    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            task = self.task_queue.get()
            if task is None:
                # Poison pill means shutdown
                self.task_queue.task_done()
                break
            answer = task.process()
            self.task_queue.task_done()
            self.result_queue.put(answer)
        return

class Task(object):
    def process(self):
        time.sleep(0.1) # pretend to take some time to do the work
        return random.randint(0, 100)

if __name__ == '__main__':
    # Establish communication queues
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()
    # Start consumers
    num_consumers = multiprocessing.cpu_count() * 2
    consumers = [Consumer(tasks, results) for i in range(num_consumers)]
    for consumer in consumers:
        consumer.start()
    # Enqueue jobs
    num_jobs = 10
    for _ in range(num_jobs):
        tasks.put(Task())
    # Add a poison pill for each consumer
    for _ in range(num_consumers):
        tasks.put(None)
    # Wait for all tasks to finish
    tasks.join()
    # Start printing results
    while num_jobs:
        result = results.get()
        print('Result:', result)
        num_jobs -= 1
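If a Pool is acceptable, something closer to the single-process loop in the question is to feed the pool from a generator that pops items off the list. This is only a sketch: pop_all is a name invented here, myfunc is a placeholder, and the pool reads ahead of the workers, so an item that has been popped may still sit in the pool's internal buffers for a while before it is processed.

```python
from multiprocessing import Pool


def pop_all(items):
    """Yield items from the end of the list, removing each one as it is handed out."""
    while items:
        yield items.pop()


def myfunc(obj):
    # Placeholder for the real per-object work
    return obj * 2


def main():
    mylist = list(range(10))
    with Pool(2) as pool:
        # imap consumes the generator lazily, so mylist shrinks as tasks are sent
        result = list(pool.imap(myfunc, pop_all(mylist)))
    print(result, mylist)


if __name__ == '__main__':
    main()
```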

Can I use a multiprocessing Queue in a function called by Pool.imap?

I'm using Python 2.7 and trying to run some CPU-heavy tasks in their own processes. I would like to be able to send messages back to the parent process to keep it informed of the current status of the process. The multiprocessing Queue seems perfect for this, but I can't figure out how to get it to work.
So, this is my basic working example minus the use of a Queue.
import multiprocessing as mp
import time

def f(x):
    return x*x

def main():
    pool = mp.Pool()
    results = pool.imap_unordered(f, range(1, 6))
    time.sleep(1)
    print str(results.next())
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
I've tried passing the Queue in several ways, and they get the error message "RuntimeError: Queue objects should only be shared between processes through inheritance". Here is one of the ways I tried based on an earlier answer I found. (I get the same problem trying to use Pool.map_async and Pool.imap)
import multiprocessing as mp
import time

def f(args):
    x = args[0]
    q = args[1]
    q.put(str(x))
    time.sleep(0.1)
    return x*x

def main():
    q = mp.Queue()
    pool = mp.Pool()
    results = pool.imap_unordered(f, ([i, q] for i in range(1, 6)))
    print str(q.get())
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Finally, the 0 fitness approach (make it global) doesn't generate any messages, it just locks up.
import multiprocessing as mp
import time

q = mp.Queue()

def f(x):
    q.put(str(x))
    return x*x

def main():
    pool = mp.Pool()
    results = pool.imap_unordered(f, range(1, 6))
    time.sleep(1)
    print q.get()
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
I'm aware that it will probably work with multiprocessing.Process directly and that there are other libraries to accomplish this, but I hate to back away from the standard library functions that are a great fit until I'm sure it's not just my lack of knowledge keeping me from being able to exploit them.
Thanks.
The trick is to pass the Queue as an argument to the initializer. Appears to work with all the Pool dispatch methods.
import multiprocessing as mp

def f(x):
    f.q.put('Doing: ' + str(x))
    return x*x

def f_init(q):
    f.q = q

def main():
    jobs = range(1,6)
    q = mp.Queue()
    p = mp.Pool(None, f_init, [q])
    results = p.imap(f, jobs)
    p.close()
    for i in range(len(jobs)):
        print q.get()
        print results.next()

if __name__ == '__main__':
    main()
With the fork start method (i.e., on Unix platforms), you do NOT need to use the initializer trick from the top answer.
Just define the mp.Queue as a global variable and it will be correctly inherited by the child processes.
OP's example works fine using Python 3.9.7 on Linux (code slightly adjusted):
import multiprocessing as mp
import time

q = mp.Queue()

def f(x):
    q.put(str(x))
    return x * x

def main():
    pool = mp.Pool(5)
    pool.imap_unordered(f, range(1, 6))
    time.sleep(1)
    for _ in range(1, 6):
        print(q.get())
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Output:
2
1
3
4
5
It's been 12 years, but I'd like to make sure any Linux user who comes across this question knows that the top answer's trick is only needed if you cannot use fork.
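If the code has to run where fork is not the start method (Windows, or macOS by default), the initializer approach is the portable one. Here is a hedged Python 3 sketch that picks the start method explicitly through mp.get_context; the start_method parameter of main is something added for this example.

```python
import multiprocessing as mp


def f_init(q):
    # Runs once in every worker; stores the queue on the function object
    f.q = q


def f(x):
    f.q.put(str(x))
    return x * x


def main(start_method):
    ctx = mp.get_context(start_method)
    q = ctx.Queue()
    with ctx.Pool(2, f_init, [q]) as pool:
        squares = pool.map(f, range(1, 6))
        # Each of the five tasks puts exactly one message on the queue
        messages = sorted(q.get() for _ in range(5))
    return squares, messages


if __name__ == '__main__':
    print(main("spawn"))
```

With "spawn" a module-level queue would be re-created in each child on import rather than inherited, which is why passing it through the initializer matters there.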
