Multiprocessing Pool return results as available - python

I'm trying to implement multiprocessing and struggling to get where I need to be.
Some background: I have previously done multiprocessing with Celery, so I am used to being able to send jobs to a worker, poll for when each one is done, and get the results of a job even while other jobs are still running. I'm trying to translate that to multiprocessing. Here is what I have so far, pieced together from various sites I have found...
import urllib2
import time
from multiprocessing.dummy import Pool as ThreadPool
import random
def openurl(url):
    time.sleep(random.randrange(1, 10))
    print url
    return urllib2.urlopen(url)
urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]
pool = ThreadPool(20)
results = pool.map(openurl, urls)
pool.close()
pool.join()
print 'now what'
So I am kicking off the openurl function on my URLs, but if I set a breakpoint at "print 'now what'", it is not reached until all my jobs are complete.
How can I 'poll' my threads and return the results as they come in?
Thanks!

pool.map distributes the elements of the iterable over the Pool's workers and aggregates the results once they are all ready.
Moreover, pool.close and pool.join instruct the Pool to wait until all the tasks are done.
If you want to handle the results as they come in, you have to use pool.apply_async with a callback. Alternatively, you can collect the AsyncResult objects returned by pool.apply_async and iterate over them to check when each one is ready, but the logic becomes fairly cumbersome (a rough sketch of that approach follows the example below).
from multiprocessing.pool import ThreadPool

pool = ThreadPool(20)
tasks = []

def callback(result):
    # handle the result of your function here
    print result

for url in urls:
    pool.apply_async(openurl, args=[url], callback=callback)

pool.close()
pool.join()
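For completeness, here is a minimal sketch of the second approach mentioned above (collecting the AsyncResult objects and polling them yourself). It reuses openurl and urls from the question; the polling interval is arbitrary:

from multiprocessing.pool import ThreadPool
import time

pool = ThreadPool(20)
results = [pool.apply_async(openurl, args=[url]) for url in urls]
pool.close()

pending = list(results)
while pending:
    still_pending = []
    for result in pending:
        if result.ready():
            print(result.get())  # handle each result as soon as it is available
        else:
            still_pending.append(result)
    pending = still_pending
    time.sleep(0.1)  # avoid a busy loop

pool.join()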

Related

How to add tqdm here?

How do I add tqdm to the multiprocessing for loop here? Namely, I want to wrap urls in tqdm():
jobs = []
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
for url in urls:
    job = pool.apply_async(worker, (url, q))
    jobs.append(job)

for job in jobs:
    job.get()

pool.close()
pool.join()
The suggested solution on GitHub is this:
pbar = tqdm(total=100)

def update(*a):
    pbar.update()
    # tqdm.write(str(a))

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)

pool.close()
pool.join()
But my iterable is a list of URLs as opposed to a range like in the above. How do I translate the above solution to my for loop?
The easiest solution that is compatible with your current code is to just specify the callback argument to apply_async (and, if there is a possibility of an exception in worker, specify the error_callback argument too).
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    # So that the progress bar does not advance too quickly
    # for demo purposes:
    import time
    time.sleep(1)

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    def my_callback(result):
        pbar.update()

    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    with tqdm(total=len(urls)) as pbar:
        pool = Pool()
        jobs = [
            pool.apply_async(worker, (url,), callback=my_callback, error_callback=my_callback)
            for url in urls
        ]
        # You can delete the next two statements if you don't need
        # to save the value of job.get(), since the calls to
        # pool.close() and pool.join() will wait for all submitted
        # tasks to complete:
        for job in jobs:
            job.get()
        pool.close()
        pool.join()
Or instead of using apply_async, use imap (or imap_unordered if you do not care either about the results or the order of the results):
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # so that the progress bar does not advance too quickly
    return url

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    pool = Pool()
    results = list(tqdm(pool.imap(worker, urls), total=len(urls)))
    print(results)
    pool.close()
    pool.join()
Note
If you won't or can't use apply_async with a callback, then imap_unordered is preferable to imap, assuming you don't need the results returned in task-submission order, which imap is obliged to do. The potential problem with imap is that if, for some reason, the first task submitted to the pool were the last to complete, no results could be returned until that first submitted task finishes. By that point all of the other submitted tasks would already have completed, so your progress bar would not move at all and would then suddenly jump from 0% to 100% as fast as you can iterate the results.
Admittedly, the above scenario is an extreme case not likely to occur often, but you would still like the progress bar to advance as tasks complete, regardless of the order of completion. For this, plus getting results back in task-submission order, apply_async with a callback is probably best. The only drawback of apply_async is that if you have a very large number of tasks to submit, they cannot be "chunked up" (see the chunksize argument to imap and imap_unordered) without your doing your own chunking logic.
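For reference, a minimal sketch of the imap_unordered variant with an explicit chunksize (the chunksize value here is only an illustrative choice for this demo):

from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)
    return url

if __name__ == '__main__':
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with Pool() as pool:
        results = []
        # chunksize=4 is an arbitrary demo value; tune it for your workload
        for result in tqdm(pool.imap_unordered(worker, urls, chunksize=4), total=len(urls)):
            results.append(result)
    print(results)  # note: completion order, not submission order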
You can use Parallel and delayed from joblib and use tqdm in the following manner:
from multiprocessing import cpu_count

import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm

def process_urls(urls, i):
    # define your function here
    pass

Call the function using:

urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
Parallel(n_jobs=cpu_count(), prefer='processes')(
    delayed(process_urls)(urls, i) for i in tqdm(range(len(urls)))
)

limit number of threads used by a child process launched with `multiprocessing.Process`

I'm trying to launch a function (my_function) and stop its execution after a certain amount of time has passed.
So I tried the multiprocessing library, and everything works well. Here is the code, where my_function() has been changed to only create a dummy message.
from multiprocessing import Queue, Process
from multiprocessing.queues import Empty
import time

timeout = 1
# timeout = 3

def my_function(something):
    time.sleep(2)
    return f'my message: {something}'

def wrapper(something, queue):
    message = "too late..."
    try:
        message = my_function(something)
        return message
    finally:
        queue.put(message)

try:
    queue = Queue()
    params = ("hello", queue)
    child_process = Process(target=wrapper, args=params)
    child_process.start()
    output = queue.get(timeout=timeout)
    print(f"ok: {output}")
except Empty:
    timeout_message = f"Timeout {timeout}s reached"
    print(timeout_message)
finally:
    if 'child_process' in locals():
        child_process.kill()
You can test and verify that, depending on whether timeout=1 or timeout=3, I can trigger the error or not.
My main problem is that the real my_function() is a torch model inference for which I would like to limit the number of threads (to 4, let's say).
One can easily do so when my_function runs in the main process, but in my example I tried a lot of tricks to limit it in the child process, without any success (using threadpoolctl.threadpool_limits(4), torch.set_num_threads(4), os.environ["OMP_NUM_THREADS"]=4, os.environ["MKL_NUM_THREADS"]=4).
I'm completely open to other solutions that can monitor the execution time of a function while limiting the number of threads used by that function.
Thanks
Regards
You can limit the number of simultaneous processes with Pool (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool).
You can also set the maximum number of tasks completed per child with maxtasksperchild. Check it out.
Here is a sample from superfastpython.com by Jason Brownlee:
# SuperFastPython.com
# example of limiting the number of tasks per child in the process pool
from time import sleep
from multiprocessing.pool import Pool
from multiprocessing import current_process

# task executed in a worker process
def task(value):
    # get the current process
    process = current_process()
    # report a message
    print(f'Worker is {process.name} with {value}', flush=True)
    # block for a moment
    sleep(1)

# protect the entry point
if __name__ == '__main__':
    # create and configure the process pool
    with Pool(2, maxtasksperchild=3) as pool:
        # issue tasks to the process pool
        for i in range(10):
            pool.apply_async(task, args=(i,))
        # close the process pool
        pool.close()
        # wait for all tasks to complete
        pool.join()
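Separately, for the original goal of capping the number of threads used inside the child process, a minimal sketch (an untested suggestion, not verified against your torch model) is to export the thread-count environment variables in the parent before Process.start(), so the numerical libraries see them when they initialise in the child, and to call torch.set_num_threads at the top of wrapper:

import os

# Assumption: these limits must be visible before the numerical libraries
# initialise in the child, so set them in the parent before Process.start()
# (the child inherits the parent's environment).
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

def wrapper(something, queue):
    # Assumption: torch is imported here so the limit is applied in the child.
    import torch
    torch.set_num_threads(4)  # cap intra-op threads for this process
    message = "too late..."
    try:
        message = my_function(something)
    finally:
        queue.put(message)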

How to tell if an apply_async function has started or if it's still in the queue with multiprocessing.Pool

I'm using python's multiprocessing.Pool and apply_async to call a bunch of functions.
How can I tell whether a function has started processing by a member of the pool or whether it is sitting in a queue?
For example:
import multiprocessing
import time
def func(t):
    # take some time processing
    print 'func({}) started'.format(t)
    time.sleep(t)
pool = multiprocessing.Pool()
results = [pool.apply_async(func, [t]) for t in [100]*50] #adds 50 func calls to the queue
For each AsyncResult in results you can call ready() or get(0) to see if the func finished running. But how do you find out whether the func started but hasn't finished yet?
i.e. for a given AsyncResult object (i.e. a given element of results) is there a way to see whether the function has been called or if it's sitting in the pool's queue?
First, remove completed jobs from results list
results = [r for r in results if not r.ready()]
Number of processes pending is length of results list:
pending = len(results)
And number pending but not started is total pending - pool_size
not_started = pending - pool_size
pool_size will be multiprocessing.cpu_count() if the Pool is created with the default argument, as you did.
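Putting those snippets together, a rough polling loop might look like this (a sketch that reuses the results list from the question; not_started is only a lower bound, since at most pool_size tasks can be running at once):

import multiprocessing
import time

pool_size = multiprocessing.cpu_count()  # default Pool size

while results:
    results = [r for r in results if not r.ready()]
    pending = len(results)
    not_started = max(0, pending - pool_size)
    print('pending: {}, still queued (at least): {}'.format(pending, not_started))
    time.sleep(1)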
UPDATE:
After initially misunderstanding the question, here's a way to do what OP was asking about.
I suspect this functionality could be added to the Pool class without too much trouble, because AsyncResult is implemented by Pool with a Queue; that queue could also be used internally to indicate whether a task has started or not.
But here's a way to implement it using Pool and Pipe. NOTE: this doesn't work in Python 2.x -- not sure why. Tested in Python 3.8.
import multiprocessing
import time
import os

def worker_function(pipe):
    pipe.send('started')
    print('[{}] started pipe={}'.format(os.getpid(), pipe))
    time.sleep(3)
    pipe.close()

def test():
    pool = multiprocessing.Pool(processes=2)
    print('[{}] pool={}'.format(os.getpid(), pool))
    workers = []
    for x in range(1, 4):
        parent, child = multiprocessing.Pipe()
        pool.apply_async(worker_function, (child,))
        worker = {'name': 'worker{}'.format(x), 'pipe': parent, 'started': False}
        workers.append(worker)
    pool.close()
    while True:
        for worker in workers:
            if worker.get('started'):
                continue
            pipe = worker.get('pipe')
            if pipe.poll(0.1):
                message = pipe.recv()
                print('[{}] {} says {}'.format(os.getpid(), worker.get('name'), message))
                worker['started'] = True
                pipe.close()
        count_in_queue = len(workers)
        for worker in workers:
            if worker.get('started'):
                count_in_queue -= 1
        print('[{}] count_in_queue = {}'.format(os.getpid(), count_in_queue))
        if not count_in_queue:
            break
        time.sleep(0.5)
    pool.join()

if __name__ == '__main__':
    test()

Python multiprocessing wait for sleep

I'm trying to find out how multiprocessing works in Python.
The following example is what I made:
import requests
from multiprocessing import Process
import time

def f(name):
    print 'hello', name
    time.sleep(15)
    print 'ended', name

if __name__ == '__main__':
    urls = [
        'http://python-requests.org',
        'http://httpbin.org',
        'http://python-guide.org'
    ]
    for url in urls:
        p = Process(target=f, args=(url,))
        p.start()
        p.join()
    print("finished")
What I tried to simulate in f is a request to a URL that has a timeout of 15 seconds. What I expected to happen is that all the requests would start at almost the same time and finish at the same time. But what actually happens is that they all start one after another, each waiting until the previous one is finished. So the result is:
hello http://python-requests.org
ended http://python-requests.org
hello http://httpbin.org
ended http://httpbin.org
hello http://python-guide.org
ended http://python-guide.org
So what actually happens? And why would one use the code above instead of just doing:
for url in urls:
    f(url)
The problem is your loop:
for url in urls:
    p = Process(target=f, args=(url,))
    p.start()
    p.join()
You start a process, then wait for it to complete, and only then start the next one...
Instead, create your process list, start them, and wait for them:
pl = [Process(target=f, args=(url,)) for url in urls]

for p in pl:
    p.start()

for p in pl:
    p.join()
Note that in this case, using Process is probably overkill, since threads would do the job very well (no heavy Python computation involved, only system calls & networking).
To switch to threads, just use multiprocessing.dummy instead so your program structure remains the same.
import multiprocessing.dummy as multiprocessing
You effectively only run one process at a time. Thus, that process (a single worker) takes the first input, runs f, sleeps for 15 seconds, exits f, and only then takes the second input (cf. the docs).
You could instead map your function f over the inputs. In the example below, you spawn 2 processes (2 workers).
import multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(processes=2) as p:
        p.map(f, urls)

Issue with Pool and Queue of multiprocessing module in Python

I am new to multiprocessing in Python, and I wrote the tiny script below:
import multiprocessing
import os

def task(queue):
    print(100)

def run(pool):
    queue = multiprocessing.Queue()
    for i in range(os.cpu_count()):
        pool.apply_async(task, args=(queue, ))

if __name__ == '__main__':
    multiprocessing.freeze_support()
    pool = multiprocessing.Pool()
    run(pool)
    pool.close()
    pool.join()
I am wondering why the task() method is not executed and there is no output after running this script. Could anyone help me?
It is running, but it's dying with an error outside the main thread, and so you don't see the error. For that reason, it's always good to .get() the result of an async call, even if you don't care about the result: the .get() will raise the error that's otherwise invisible.
For example, change your loop like so:
tasks = []
for i in range(os.cpu_count()):
    tasks.append(pool.apply_async(task, args=(queue,)))
for t in tasks:
    t.get()
Then the new t.get() will blow up, ending with:
RuntimeError: Queue objects should only be shared between processes through inheritance
In short, passing Queue objects to Pool methods isn't supported.
But you can pass them to multiprocessing.Process(), or to a Pool initialization function. For example, here's a way to do the latter:
import multiprocessing
import os

def pool_init(q):
    global queue  # make queue global in workers
    queue = q

def task():
    # can use `queue` here if you like
    print(100)

def run(pool):
    tasks = []
    for i in range(os.cpu_count()):
        tasks.append(pool.apply_async(task))
    for t in tasks:
        t.get()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(initializer=pool_init, initargs=(queue,))
    run(pool)
    pool.close()
    pool.join()
On Linux-y systems, you can - as the original error message suggested - use process inheritance instead (but that's not possible on Windows).
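And for the other option mentioned above, here is a minimal sketch of passing the Queue to multiprocessing.Process() directly; the queue is handed to each child at creation time, so the restriction on Pool methods does not apply:

import multiprocessing
import os

def task(queue):
    queue.put(100)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=task, args=(queue,))
        for _ in range(os.cpu_count())
    ]
    for p in processes:
        p.start()
    for _ in processes:
        print(queue.get())  # one item per child process
    for p in processes:
        p.join()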
