How to add tqdm here? - python

How do I add tqdm to the multiprocessing for loop here? Namely, I want to wrap urls in tqdm():
jobs = []
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
for url in urls:
    job = pool.apply_async(worker, (url, q))
    jobs.append(job)
for job in jobs:
    job.get()
pool.close()
pool.join()
The suggested solution on GitHub is this:
pbar = tqdm(total=100)

def update(*a):
    pbar.update()
    # tqdm.write(str(a))

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)
pool.close()
pool.join()
But my iterable is a list of URLs as opposed to a range like in the above. How do I translate the above solution to my for loop?

The easiest solution that is compatible with your current code is to just specify the callback argument to apply_async (and if there is a possibility of an exception in worker, then specify the error_callback argument too).
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    # So that the progress bar does not proceed too quickly
    # for demo purposes:
    import time
    time.sleep(1)

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    def my_callback(result):
        pbar.update()

    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with tqdm(total=len(urls)) as pbar:
        pool = Pool()
        jobs = [
            pool.apply_async(worker, (url,), callback=my_callback, error_callback=my_callback)
            for url in urls
        ]
        # You can delete the next two statements if you don't need
        # to save the value of job.get(), since the calls to
        # pool.close() and pool.join() will wait for all submitted
        # tasks to complete:
        for job in jobs:
            job.get()
        pool.close()
        pool.join()
Or instead of using apply_async, use imap (or imap_unordered if you do not care either about the results or the order of the results):
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # so that the progress bar does not proceed too quickly
    return url

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')
    pool = Pool()
    results = list(tqdm(pool.imap(worker, urls), total=len(urls)))
    print(results)
    pool.close()
    pool.join()
Note
If you won't or can't use apply_async with a callback, then imap_unordered is to be preferred over imap, assuming you don't need the results returned in task-submission order, which imap is obliged to do. The potential problem with imap is that if, for some reason, the first task submitted to the pool were the last to complete, no results could be returned until that first task finishes. By then all the other submitted tasks will have already completed, so your progress bar will not move at all and will then jump from 0% to 100% as quickly as you can iterate the results.
Admittedly, the above scenario is an extreme case not likely to occur very often, but you would still like the progress bar to advance as tasks complete, regardless of the order of completion. For that, and for getting results back in task-submission order, apply_async with a callback is probably best. The only drawback to apply_async is that, if you have a very large number of tasks to submit, they cannot be "chunked up" (see the chunksize argument to imap and imap_unordered) without writing your own chunking logic.
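For reference, here is a minimal sketch (reusing the demo worker and urls from above, with an arbitrary chunksize of 4) of imap_unordered driving the progress bar in completion order:
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)
    return url

if __name__ == '__main__':
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with Pool() as pool, tqdm(total=len(urls)) as pbar:
        results = []
        # Results arrive in completion order; tasks are sent to workers 4 at a time:
        for result in pool.imap_unordered(worker, urls, chunksize=4):
            results.append(result)
            pbar.update()
    print(results)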

You can use Parallel and delayed from joblib together with tqdm in the following manner:
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from tqdm import tqdm

def process_urls(urls, i):
    # define your function here
Call the function using:
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
Parallel(n_jobs=cpu_count(), prefer='processes')(
    delayed(process_urls)(urls, i) for i in tqdm(range(len(urls)))
)
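If each task only needs a single URL, a slightly simpler variant (a sketch; process_url below is an assumed per-URL worker, not a function from the question) is to wrap the iterable itself in tqdm:
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from tqdm import tqdm

def process_url(url):
    # placeholder per-URL work; replace with the real worker
    return url

urls = ['https://example.com/a', 'https://example.com/b']
results = Parallel(n_jobs=cpu_count(), prefer='processes')(
    delayed(process_url)(url) for url in tqdm(urls)
)
Note that, as with the snippet above, tqdm here tracks the dispatch of tasks rather than strictly their completion.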

Related

Python ThreadPoolExecutor terminate all threads

I am running a piece of Python code in which multiple threads are run through a ThreadPoolExecutor. Each thread is supposed to perform a task (fetch a webpage, for example). What I want to be able to do is to terminate all threads, even if one of the threads fails. For instance:
with ThreadPoolExecutor(self._num_threads) as executor:
    jobs = []
    for path in paths:
        kw = {"path": path}
        jobs.append(executor.submit(start, **kw))
    for job in futures.as_completed(jobs):
        result = job.result()
        print(result)

def start(*args, **kwargs):
    # fetch the page
    if success:
        return True
    else:
        # Signal all threads to stop
Is it possible to do so? The results returned by the threads are useless to me unless all of them are successful, so if even one of them fails, I would like to save the execution time of the remaining threads and terminate them immediately. The actual code is obviously doing relatively lengthy tasks with a couple of failure points.
If you are done with threads and want to look into processes, then this piece of code looks very promising and simple, with almost the same syntax as threads, but using the multiprocessing module.
When the timeout expires, the process is terminated, which is very convenient.
import multiprocessing

def get_page(*args, **kwargs):
    # your web page downloading code goes here
    pass

def start_get_page(timeout, *args, **kwargs):
    p = multiprocessing.Process(target=get_page, args=args, kwargs=kwargs)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # stop the downloading 'thread'
        p.terminate()
        # and then do any post-error processing here

if __name__ == "__main__":
    start_get_page(timeout, *args, **kwargs)
I have created an answer for a similar question I had, which I think will work for this question.
Terminate executor using ThreadPoolExecutor from concurrent.futures module
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

NUM_REQUESTS = 100

def long_request(id):
    sleep(1)
    # Simulate bad response
    if id == 10:
        return {"data": {"valid": False}}
    else:
        return {"data": {"valid": True}}

def check_results(results):
    valid = True
    for result in results:
        valid = result["data"]["valid"]
    return valid

def main():
    futures = []
    responses = []
    num_requests = 0
    with ThreadPoolExecutor(max_workers=10) as executor:
        for request_index in range(NUM_REQUESTS):
            future = executor.submit(long_request, request_index)
            # Future list
            futures.append(future)
        for future in as_completed(futures):
            is_responses_valid = check_results(responses)
            # Cancel all future requests if one invalid
            if not is_responses_valid:
                executor.shutdown(wait=False)
            else:
                # Append valid responses
                num_requests += 1
                responses.append(future.result())
    return num_requests

if __name__ == "__main__":
    requests = main()
    print("Num Requests: ", requests)
In my code I used multiprocessing:
import multiprocessing as mp

pool = mp.Pool()
for i in range(threadNumber):
    pool.apply_async(publishMessage, args=(map_metrics, connection_parameters...,))
pool.close()
pool.terminate()
This is how I would do it:
import concurrent.futures

def start(*args, **kwargs):
    # fetch the page
    if success:
        return True
    else:
        return False

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = [executor.submit(start, {"path": path}) for path in paths]
    concurrent.futures.wait(results, timeout=10, return_when=concurrent.futures.FIRST_COMPLETED)
    for f in concurrent.futures.as_completed(results):
        f_success = f.result()
        if not f_success:
            executor.shutdown(wait=False, cancel_futures=True)  # shut down if one fails
        else:
            # do stuff here
            pass
If any result is not True, everything will be shut down immediately. Note that the cancel_futures argument to shutdown() requires Python 3.9 or later.
You can try to use StoppableThread from func-timeout.
But terminating threads is strongly discouraged, and if you need to kill a thread, you probably have a design problem. Look at the alternatives: asyncio coroutines, or multiprocessing, which offers legitimate cancel/terminate functionality.
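As an illustration of the cooperative approach that avoids killing threads outright, here is a minimal sketch (not the original poster's code; the work loop and paths are placeholders) in which each worker checks a shared threading.Event and exits early once any worker signals failure:
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import time

stop_event = threading.Event()

def start(path):
    for _ in range(100):
        if stop_event.is_set():
            return False            # another worker failed, bail out early
        time.sleep(0.01)            # stand-in for one chunk of real work
    success = (path != "bad")       # stand-in for the real success condition
    if not success:
        stop_event.set()            # signal all other workers to stop
    return success

paths = ["a", "b", "bad", "c"]
with ThreadPoolExecutor(4) as executor:
    jobs = [executor.submit(start, path) for path in paths]
    results = [job.result() for job in as_completed(jobs)]
print(results)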

How to tell if an apply_async function has started or if it's still in the queue with multiprocessing.Pool

I'm using python's multiprocessing.Pool and apply_async to call a bunch of functions.
How can I tell whether a function has started processing by a member of the pool or whether it is sitting in a queue?
For example:
import multiprocessing
import time

def func(t):
    # take some time processing
    print 'func({}) started'.format(t)
    time.sleep(t)

pool = multiprocessing.Pool()
results = [pool.apply_async(func, [t]) for t in [100]*50]  # adds 50 func calls to the queue
For each AsyncResult in results you can call ready() or get(0) to see if the func finished running. But how do you find out whether the func started but hasn't finished yet?
i.e. for a given AsyncResult object (i.e. a given element of results) is there a way to see whether the function has been called or if it's sitting in the pool's queue?
First, remove completed jobs from the results list:
results = [r for r in results if not r.ready()]
The number of pending jobs is the length of the results list:
pending = len(results)
And the number of jobs pending but not yet started is total pending minus the pool size:
not_started = pending - pool_size
pool_size will be multiprocessing.cpu_count() if the Pool is created with its default argument, as you did.
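For concreteness, a minimal self-contained sketch of that bookkeeping (the func, sleep times and task count below are illustrative, not taken from the question):
import multiprocessing
import time

def func(t):
    time.sleep(t)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pool_size = multiprocessing.cpu_count()               # default Pool size
    results = [pool.apply_async(func, [5]) for _ in range(50)]
    time.sleep(1)                                          # let some tasks start
    results = [r for r in results if not r.ready()]        # drop finished jobs
    pending = len(results)                                 # submitted but not finished
    not_started = max(pending - pool_size, 0)              # still waiting in the queue
    print('pending: {}, not yet started: {}'.format(pending, not_started))
    pool.terminate()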
UPDATE:
After initially misunderstanding the question, here's a way to do what the OP was asking about.
I suspect this functionality could be added to the Pool class without too much trouble, since AsyncResult is implemented by Pool with a Queue; that queue could also be used internally to indicate whether a task has started or not.
But here's a way to implement it using Pool and Pipe. NOTE: this doesn't work in Python 2.x -- not sure why. Tested in Python 3.8.
import multiprocessing
import time
import os

def worker_function(pipe):
    pipe.send('started')
    print('[{}] started pipe={}'.format(os.getpid(), pipe))
    time.sleep(3)
    pipe.close()

def test():
    pool = multiprocessing.Pool(processes=2)
    print('[{}] pool={}'.format(os.getpid(), pool))
    workers = []
    for x in range(1, 4):
        parent, child = multiprocessing.Pipe()
        pool.apply_async(worker_function, (child,))
        worker = {'name': 'worker{}'.format(x), 'pipe': parent, 'started': False}
        workers.append(worker)
    pool.close()
    while True:
        for worker in workers:
            if worker.get('started'):
                continue
            pipe = worker.get('pipe')
            if pipe.poll(0.1):
                message = pipe.recv()
                print('[{}] {} says {}'.format(os.getpid(), worker.get('name'), message))
                worker['started'] = True
                pipe.close()
        count_in_queue = len(workers)
        for worker in workers:
            if worker.get('started'):
                count_in_queue -= 1
        print('[{}] count_in_queue = {}'.format(os.getpid(), count_in_queue))
        if not count_in_queue:
            break
        time.sleep(0.5)
    pool.join()

if __name__ == '__main__':
    test()

Python multiprocessing wait for sleep

I'm trying to find out how multiprocessing works in Python.
The following example is what I made:
import requests
from multiprocessing import Process
import time

def f(name):
    print 'hello', name
    time.sleep(15)
    print 'ended', name

if __name__ == '__main__':
    urls = [
        'http://python-requests.org',
        'http://httpbin.org',
        'http://python-guide.org'
    ]
    for url in urls:
        p = Process(target=f, args=(url,))
        p.start()
        p.join()
    print("finished")
What I tried to simulate in f is a request to a URL that has a timeout of 15 seconds. What I expected to happen is that all the requests would start at almost the same time and finish at the same time. But what actually happens is that they all start one after another, each waiting until the previous one is finished. So the result is:
hello http://python-requests.org
ended http://python-requests.org
hello http://httpbin.org
ended http://httpbin.org
hello http://python-guide.org
ended http://python-guide.org
So what actually happens? And why would one use the code above instead of just doing:
for url in urls:
    f(url)
The problem is your loop:
for url in urls:
    p = Process(target=f, args=(url,))
    p.start()
    p.join()
You're starting a process, waiting for it to complete, and only then starting the next one...
Instead, create your process list, start them, and wait for them:
pl = [Process(target=f, args=(url,)) for url in urls]
for p in pl:
    p.start()
for p in pl:
    p.join()
Note that in this case, using Process is probably overkill, since threads would do the job very well (there is no heavy Python computation involved, only system calls and networking).
To switch to threads, just use multiprocessing.dummy instead, so your program structure remains the same:
import multiprocessing.dummy as multiprocessing
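For illustration, a minimal thread-backed sketch (the f and urls below mirror the question's definitions, ported to Python 3 prints):
import multiprocessing.dummy as multiprocessing  # thread-backed clones of the Process/Pool API
import time

def f(name):
    print('hello', name)
    time.sleep(15)
    print('ended', name)

urls = ['http://python-requests.org', 'http://httpbin.org', 'http://python-guide.org']
pl = [multiprocessing.Process(target=f, args=(url,)) for url in urls]
for p in pl:
    p.start()
for p in pl:
    p.join()
print("finished")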
You only ever run one process at a time: the process (a single worker) takes the first input, runs f, sleeps for 15 seconds, and quits f; only then is the second input processed (cf. the docs).
You could instead map your function f over the inputs. In the example below, you spawn 2 processes (2 workers).
import multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(processes=2) as p:
        p.map(f, urls)
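A minimal self-contained variant of the same idea: with as many workers as URLs, all of the sleeps overlap, so the whole run takes roughly 15 seconds instead of 45 (the f below mirrors the question's function):
import multiprocessing as mp
import time

def f(name):
    print('hello', name)
    time.sleep(15)
    print('ended', name)

if __name__ == '__main__':
    urls = ['http://python-requests.org', 'http://httpbin.org', 'http://python-guide.org']
    with mp.Pool(processes=len(urls)) as p:
        p.map(f, urls)
    print("finished")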

Python: concurrent.futures How to make it cancelable?

Python concurrent.futures and ProcessPoolExecutor provide a neat interface to schedule and monitor tasks. Futures even provide a .cancel() method:
cancel(): Attempt to cancel the call. If the call is currently being executed and cannot be cancelled then the method will return False, otherwise the call will be cancelled and the method will return True.
Unfortunately, in a similar question (concerning asyncio) the answer claims that running tasks are uncancelable, citing this snippet of the documentation, but the docs don't say that; they only describe the case where a call is running AND cannot be cancelled.
Submitting multiprocessing.Event objects to the processes is also not trivially possible (doing so via parameters, as with multiprocessing.Process, raises a RuntimeError).
What am I trying to do? I would like to partition a search space and run a task for every partition, but it is enough to have ONE solution, and the process is CPU intensive. So is there a comfortable way to accomplish this that does not offset the gains of using a ProcessPool to begin with?
Example:
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

# function that profits from partitioned search space
def m_run(partition):
    for elem in partition:
        if elem == 135135515:
            return elem
    return False

futures = []
# used to create the partitions
steps = 100000000
with ProcessPoolExecutor(max_workers=4) as pool:
    for i in range(4):
        # run 4 tasks with a partition, but only *one* solution is needed
        partition = range(i*steps, (i+1)*steps)
        futures.append(pool.submit(m_run, partition))

    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    for d in done:
        print(d.result())
    print("---")
    for d in not_done:
        # will return false for Cancel and Result for all futures
        print("Cancel: " + str(d.cancel()))
        print("Result: " + str(d.result()))
I don't know why concurrent.futures.Future does not have a .kill() method, but you can accomplish what you want by shutting down the process pool with pool.shutdown(wait=False), and killing the remaining child processes by hand.
Create a function for killing child processes (psutil is a third-party package, installable with pip install psutil):
import signal, psutil

def kill_child_processes(parent_pid, sig=signal.SIGTERM):
    try:
        parent = psutil.Process(parent_pid)
    except psutil.NoSuchProcess:
        return
    children = parent.children(recursive=True)
    for process in children:
        process.send_signal(sig)
Run your code until you get the first result, then kill all remaining child processes:
import os
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

# function that profits from partitioned search space
def m_run(partition):
    for elem in partition:
        if elem == 135135515:
            return elem
    return False

futures = []
# used to create the partitions
steps = 100000000
pool = ProcessPoolExecutor(max_workers=4)
for i in range(4):
    # run 4 tasks with a partition, but only *one* solution is needed
    partition = range(i*steps, (i+1)*steps)
    futures.append(pool.submit(m_run, partition))

done, not_done = wait(futures, timeout=3600, return_when=FIRST_COMPLETED)

# Shut down pool
pool.shutdown(wait=False)

# Kill remaining child processes
kill_child_processes(os.getpid())
Unfortunately, running Futures cannot be cancelled. I believe the core reason is to ensure the same API over different implementations (it's not possible to interrupt running threads or coroutines).
The Pebble library was designed to overcome this and other limitations.
from pebble import ProcessPool

def function(foo, bar=0):
    return foo + bar

with ProcessPool() as pool:
    future = pool.schedule(function, args=[1])
    # if running, the container process will be terminated
    # a new process will be started consuming the next task
    future.cancel()
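A hedged sketch of how this could be applied to the question's partitioned search (reusing the question's m_run and steps; pebble futures subclass concurrent.futures.Future, so wait works on them):
from concurrent.futures import wait, FIRST_COMPLETED
from pebble import ProcessPool

def m_run(partition):
    for elem in partition:
        if elem == 135135515:
            return elem
    return False

if __name__ == '__main__':
    steps = 100000000
    with ProcessPool(max_workers=4) as pool:
        futures = [pool.schedule(m_run, args=[range(i*steps, (i+1)*steps)])
                   for i in range(4)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # with pebble, this terminates the worker process if the task is running
        for f in done:
            print(f.result())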
I found your question interesting, so here are my findings.
The behaviour of the .cancel() method is as stated in the Python documentation. As for your running concurrent functions, unfortunately they could not be cancelled even after they were told to do so. If my finding is correct, then I reason that Python does require a more effective .cancel() method.
Run the code below to check my finding.
from concurrent.futures import ProcessPoolExecutor, as_completed
from time import time

# function that profits from partitioned search space
def m_run(partition):
    for elem in partition:
        if elem == 3351355150:
            return elem
            break  # Added to terminate loop once found
    return False

start = time()
futures = []
# used to create the partitions
steps = 1000000000
with ProcessPoolExecutor(max_workers=4) as pool:
    for i in range(4):
        # run 4 tasks with a partition, but only *one* solution is needed
        partition = range(i*steps, (i+1)*steps)
        futures.append(pool.submit(m_run, partition))

    ### New Code: Start ###
    for f in as_completed(futures):
        print(f.result())
        if f.result():
            print('break')
            break

    for f in futures:
        print(f, 'running?', f.running())
        if f.running():
            f.cancel()
            print('Cancelled? ', f.cancelled())

    print('New Instruction Ended at = ', time()-start)
print('Total Compute Time = ', time()-start)
Update:
It is possible to forcefully terminate the concurrent processes via bash, but the consequence is that the main Python program will terminate too. If this isn't an issue for you, then try the code below.
You have to add the code below between the last two print statements to see this for yourself. Note: this code works only if you aren't running any other python3 program.
import subprocess, os, signal

result = subprocess.run(['ps', '-C', 'python3', '-o', 'pid='],
                        stdout=subprocess.PIPE).stdout.decode('utf-8').split()
print('result =', result)
for i in result:
    print('PID = ', i)
    if i != result[0]:
        os.kill(int(i), signal.SIGKILL)
        try:
            os.kill(int(i), 0)
            raise Exception("""wasn't able to kill the process
                HINT: use signal.SIGKILL or signal.SIGABRT""")
        except OSError as ex:
            continue

Multiprocessing Pool return results as available

I'm trying to implement multiprocessing and struggling to get where I need to be.
Some background: I have previously done multiprocessing with Celery, so I am used to being able to send jobs to a worker, poll when a job is done, and get its results even while other jobs are still running. I'm trying to relate this to multiprocessing. Here is what I have so far, dug up from various sites I have found...
import urllib2
import time
from multiprocessing.dummy import Pool as ThreadPool
import random

def openurl(url):
    time.sleep(random.randrange(1, 10))
    print url
    return urllib2.urlopen(url)

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

pool = ThreadPool(20)
results = pool.map(openurl, urls)

pool.close()
pool.join()

print 'now what'
print 'now what'
So I am kicking off the openurl function on my URLs, but if I break at "print 'now what'", execution does not get there until all my jobs are complete.
How can I 'poll' my threads and return the results as they come in?
Thanks!
pool.map distributes the elements of the iterable over a pool of workers and aggregates the results once they are all ready.
Moreover, pool.close and pool.join instruct the Pool to wait until all the tasks are done.
If you want to handle the results as they come in, you have to use pool.apply_async with a callback. Alternatively, you could collect the AsyncResult objects returned by pool.apply_async and iterate over them to see when each of them is ready, but the whole logic would be quite cumbersome.
from multiprocessing.pool import ThreadPool

pool = ThreadPool(20)
tasks = []

def callback(result):
    # handle the result of your function here
    print result

for url in urls:
    pool.apply_async(openurl, args=[url], callback=callback)

pool.close()
pool.join()
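Another option, worth a brief sketch, is Pool.imap_unordered, which yields results in completion order as they become available (the openurl below is a simplified stand-in for the question's function):
import time
import random
from multiprocessing.dummy import Pool as ThreadPool

def openurl(url):
    time.sleep(random.randrange(1, 10))
    return url

urls = ['http://www.python.org', 'http://www.python.org/about/', 'http://www.python.org/doc/']

pool = ThreadPool(20)
# Each result is handled as soon as its task finishes, regardless of submission order:
for result in pool.imap_unordered(openurl, urls):
    print(result)
pool.close()
pool.join()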
