I need to pass two different functions to ThreadPoolExecutor and wait for both of them to complete,
without a for loop blocking on the first or second group of futures, as these are very long-running tasks.
How can I achieve this with ThreadPoolExecutor?
from concurrent.futures import ThreadPoolExecutor, as_completed

def perform_set():
    pass

def perform_get():
    pass

with ThreadPoolExecutor(max_workers=4) as executor:
    futures_set = [executor.submit(perform_set) for i in range(2)]
    futures_get = [executor.submit(perform_get) for i in range(2)]
    #for f in as_completed(futures_set):
    #    print(f.result())
Regards
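A minimal sketch of one way to do this, assuming you just want to process each result as soon as it is ready: submit both groups and iterate as_completed over the combined list, so neither group blocks on the other.

from concurrent.futures import ThreadPoolExecutor, as_completed

def perform_set():
    return "set done"

def perform_get():
    return "get done"

with ThreadPoolExecutor(max_workers=4) as executor:
    futures_set = [executor.submit(perform_set) for _ in range(2)]
    futures_get = [executor.submit(perform_get) for _ in range(2)]
    # as_completed yields futures from both groups in the order they
    # finish, so neither group blocks the other.
    for f in as_completed(futures_set + futures_get):
        print(f.result())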
Related
How do I add tqdm to the multiprocessing for loop here? Namely, I want to wrap urls in tqdm():
jobs = []
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
for url in urls:
    job = pool.apply_async(worker, (url, q))
    jobs.append(job)

for job in jobs:
    job.get()

pool.close()
pool.join()
The suggested solution on GitHub is this:
pbar = tqdm(total=100)

def update(*a):
    pbar.update()
    # tqdm.write(str(a))

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)
pool.close()
pool.join()
But my iterable is a list of URLs as opposed to a range like in the above. How do I translate the above solution to my for loop?
The easiest solution that is compatible with your current code is to just specify the callback argument to apply_async (and if there is a possibility of an exception in worker, then specify the error_callback argument too).
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    # So that the progress bar does not advance too quickly
    # for demo purposes:
    import time
    time.sleep(1)

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    def my_callback(result):
        pbar.update()

    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    with tqdm(total=len(urls)) as pbar:
        pool = Pool()
        jobs = [
            pool.apply_async(worker, (url,), callback=my_callback, error_callback=my_callback)
            for url in urls
        ]
        # You can delete the next two statements if you don't need
        # to save the value of job.get(), since the calls to
        # pool.close() and pool.join() will wait for all submitted
        # tasks to complete:
        for job in jobs:
            job.get()
        pool.close()
        pool.join()
Or instead of using apply_async, use imap (or imap_unordered if you do not care either about the results or the order of the results):
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # so that the progress bar does not advance too quickly
    return url

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    pool = Pool()
    results = list(tqdm(pool.imap(worker, urls), total=len(urls)))
    print(results)
    pool.close()
    pool.join()
Note
If you won't or can't use apply_async with a callback, then imap_unordered is to be preferred over imap, assuming you don't need the results returned in task-submission order, which imap is obliged to do. The potential problem with imap is that if for some reason the first task submitted to the pool were the last to complete, no results could be returned until that first task finished. By then all the other submitted tasks would already have completed, so your progress bar would not move at all and then suddenly jump from 0% to 100% as quickly as you can iterate the results.
Admittedly the above scenario is an extreme case not likely to occur often, but you would still like the progress bar to advance as tasks complete, regardless of the order of completion. For this, and for getting results back in task-submission order, apply_async with a callback is probably best. The only drawback to apply_async is that if you have a very large number of tasks to submit, they cannot be "chunked up" (see the chunksize argument to imap and imap_unordered) without implementing your own chunking logic.
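For reference, a minimal sketch of the imap_unordered variant with an explicit chunksize, reusing the demo names from the snippets above (worker and the dummy urls are stand-ins, not your real code):

from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # simulate a slow request
    return url

if __name__ == '__main__':
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with Pool() as pool:
        # chunksize batches the submissions; the bar advances in
        # completion order, not submission order.
        results = list(tqdm(pool.imap_unordered(worker, urls, chunksize=4),
                            total=len(urls)))
    print(results)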
You can use Parallel and delayed from joblib and use tqdm in the following manner:
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from tqdm import tqdm

def process_urls(urls, i):
    # define your function here
    pass

Call the function using:

urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
Parallel(n_jobs=cpu_count(), prefer='processes')(
    delayed(process_urls)(urls, i) for i in tqdm(range(len(urls)))
)
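A variant closer to the original question's wish to wrap urls in tqdm() is to wrap the input iterable itself. This is only a sketch, and process_url here is a hypothetical per-URL worker, not a function from the question:

from joblib import Parallel, delayed
from tqdm import tqdm

def process_url(url):
    # hypothetical per-URL worker; put your real logic here
    return url

urls = list('abcdefghijklmnopqrstuvwxyz')  # stand-in for the CSV column
# Note: wrapping the input iterable makes the bar advance as tasks are
# dispatched, not necessarily as they finish.
results = Parallel(n_jobs=-1, prefer='processes')(
    delayed(process_url)(url) for url in tqdm(urls)
)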
I want to call 4 methods at once so they run in parallel in Python. These methods make HTTP calls and do some basic operations like verifying the response. I want to call them at once so the total time taken is less: if each method takes ~20 min to run, I want all 4 methods to return a response in 20 min, not 20 * 4 = 80 min.
It is important to note that the 4 methods I'm trying to run in parallel are async functions. When I tried using ThreadPoolExecutor to run the 4 methods in parallel, I didn't see much difference in time taken.
Example code, edited from @tomerar's comment below:
from concurrent.futures import ThreadPoolExecutor

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [await foo_1, await foo_2, await foo_3, await foo_4]:
        executor.submit(foo)
Looking for suggestions
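Since the four methods are coroutines, submitting them to a thread pool just creates coroutine objects that are never awaited. A minimal sketch of the usual alternative is to run them concurrently on the event loop with asyncio.gather (the foo_* names mirror the question's example):

import asyncio

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

async def main():
    # Schedule all four coroutines concurrently and wait for them all.
    await asyncio.gather(foo_1(), foo_2(), foo_3(), foo_4())

asyncio.run(main())

Note that this only overlaps the work if the coroutines actually await their I/O (for example with an async HTTP client); a coroutine that makes blocking calls internally will still run sequentially.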
You can use ThreadPoolExecutor from concurrent.futures:
from concurrent.futures import ThreadPoolExecutor

def foo_1():
    print("foo_1")

def foo_2():
    print("foo_2")

def foo_3():
    print("foo_3")

def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [foo_1, foo_2, foo_3, foo_4]:
        executor.submit(foo)
You can use "multiprocessing" in python.
it's so simple
from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(solve1, [A]) # evaluate "solve1(A)"
result2 = pool.apply_async(solve2, [B]) # evaluate "solve2(B)"
answer1 = result1.get(timeout=10)
answer2 = result2.get(timeout=10)
You can see the full details in the multiprocessing documentation.
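For a self-contained version of the snippet above, a minimal sketch with trivial stand-in implementations for solve1 and solve2 (your real functions go here):

from multiprocessing import Pool

def solve1(a):
    return a * 2

def solve2(b):
    return b + 1

if __name__ == '__main__':
    with Pool() as pool:
        result1 = pool.apply_async(solve1, [10])  # evaluate solve1(10) in a worker
        result2 = pool.apply_async(solve2, [20])  # evaluate solve2(20) in a worker
        print(result1.get(timeout=10), result2.get(timeout=10))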
I am running a piece of Python code in which multiple threads are run through a thread pool executor. Each thread is supposed to perform a task (fetch a webpage, for example). What I want to be able to do is terminate all threads if even one of them fails. For instance:
with ThreadPoolExecutor(self._num_threads) as executor:
    jobs = []
    for path in paths:
        kw = {"path": path}
        jobs.append(executor.submit(start, **kw))

    for job in futures.as_completed(jobs):
        result = job.result()
        print(result)

def start(*args, **kwargs):
    # fetch the page
    if success:
        return True
    else:
        # Signal all threads to stop
Is it possible to do so? The results returned by threads are useless to me unless all of them are successful, so if even one of them fails, I would like to save some execution time of the rest of the threads and terminate them immediately. The actual code obviously is doing relatively lengthy tasks with a couple of failure points.
If you are done with threads and want to look into processes, then this piece of code looks very promising and simple, with almost the same syntax as threads, but using the multiprocessing module.
When the timeout expires, the process is terminated, which is very convenient.
import multiprocessing

def get_page(*args, **kwargs):
    # your web page downloading code goes here
    pass

def start_get_page(timeout, *args, **kwargs):
    p = multiprocessing.Process(target=get_page, args=args, kwargs=kwargs)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # stop the downloading 'thread'
        p.terminate()
        # and then do any post-error processing here

if __name__ == "__main__":
    start_get_page(timeout, *args, **kwargs)
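A hypothetical call, assuming get_page takes the URL as its first positional argument and the timeout is given in seconds:

start_get_page(5, "https://example.com")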
I have created an answer for a similar question I had, which I think will work for this question.
Terminate executor using ThreadPoolExecutor from concurrent.futures module
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

NUM_REQUESTS = 100

def long_request(id):
    sleep(1)
    # Simulate a bad response
    if id == 10:
        return {"data": {"valid": False}}
    else:
        return {"data": {"valid": True}}

def check_results(results):
    valid = True
    for result in results:
        valid = valid and result["data"]["valid"]
    return valid

def main():
    futures = []
    responses = []
    num_requests = 0
    with ThreadPoolExecutor(max_workers=10) as executor:
        for request_index in range(NUM_REQUESTS):
            future = executor.submit(long_request, request_index)
            # Future list
            futures.append(future)

        for future in as_completed(futures):
            is_responses_valid = check_results(responses)
            # Cancel all future requests if one is invalid
            if not is_responses_valid:
                executor.shutdown(wait=False)
            else:
                # Append valid responses
                num_requests += 1
                responses.append(future.result())
    return num_requests

if __name__ == "__main__":
    requests = main()
    print("Num Requests: ", requests)
In my code I used multiprocessing
import multiprocessing as mp

pool = mp.Pool()
for i in range(threadNumber):
    pool.apply_async(publishMessage, args=(map_metrics, connection_parameters...,))
pool.close()
pool.terminate()
This is how I would do it:
import concurrent.futures

def start(*args, **kwargs):
    # fetch the page
    if success:
        return True
    else:
        return False

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = [executor.submit(start, {"path": path}) for path in paths]
    concurrent.futures.wait(results, timeout=10, return_when=concurrent.futures.FIRST_COMPLETED)
    for f in concurrent.futures.as_completed(results):
        f_success = f.result()
        if not f_success:
            executor.shutdown(wait=False, cancel_futures=True)  # shut down if one fails
        else:
            pass  # do stuff here
If any result is not True, everything will be shut down immediately. Note that the cancel_futures argument to shutdown() requires Python 3.9 or later.
You can try to use StoppableThread from func-timeout.
But terminating threads is strongly discouraged, and if you need to kill a thread, you probably have a design problem. Look at the alternatives: asyncio coroutines, or multiprocessing, which has proper cancel/terminate functionality.
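To illustrate the cooperative alternative with threads, here is a minimal sketch using a threading.Event that each worker checks periodically; the fetching logic and paths are placeholders, not code from the question:

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

stop_event = threading.Event()

def start(path):
    for _ in range(100):             # placeholder for chunks of real work
        if stop_event.is_set():      # another task failed: stop cooperatively
            return None
        # ... fetch part of the page here ...
    success = True                   # placeholder outcome
    if not success:
        stop_event.set()             # tell every other worker to wind down
    return success

paths = ["a", "b", "c"]              # placeholder paths
with ThreadPoolExecutor(max_workers=4) as executor:
    jobs = [executor.submit(start, p) for p in paths]
    for job in as_completed(jobs):
        print(job.result())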
I'm trying to use concurrent.futures with the example below, but my job never gets submitted. I don't see the print statement in load_url.
import sys
from concurrent import futures
import multiprocessing
import time
import queue

def load_url(url, q):
    # it will take 2 seconds to process a URL
    print('load_url')
    try:
        time.sleep(2)
        # put some dummy results in the queue
        for x in range(5):
            print('put in queue')
            q.put(x)
    except Exception as e:
        print('exception')

def main():
    print('start')
    manager = multiprocessing.Manager()
    e = manager.Event()
    q = queue.Queue()
    with futures.ProcessPoolExecutor(max_workers=5) as executor:
        livefutures = {executor.submit(load_url, url, q): url
                       for url in ['a', 'b']}
        runningfutures = True
        print('check_futures')
        while runningfutures:
            print('here')
            runningfutures = [f for f in livefutures if f.running()]
            if not runningfutures:
                print('not running futures == ', q.empty())
                while not q.empty():
                    print('not running futures1')
                    yield q.get(False)

if __name__ == '__main__':
    for x in main():
        print('x=', x)
Probably a bit late, but I just ran into your post.
ProcessPoolExecutor is a bit picky: it requires the workers to execute simple functions, and it also sometimes behaves differently on Windows and Linux.
ThreadPoolExecutor is more permissive.
If you replace futures.ProcessPoolExecutor with futures.ThreadPoolExecutor, it seems to work.
You are passing Python's standard Queue to your asynchronous processes rather than a multiprocessing-safe Queue implementation. Therefore, your asynchronous job fails with: TypeError: cannot pickle '_thread.lock' object. However, because you are not calling .result on the future object, this exception is never raised in the main process.
Instantiate your queue with manager.Queue() and the code works.
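In other words, the only change needed in the question's main() is to build the queue from the manager, roughly:

manager = multiprocessing.Manager()
e = manager.Event()
q = manager.Queue()  # multiprocessing-safe queue that can be shared with workers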
I'm struggling to get multithreading working in Python. I have a function which I want to execute on 5 threads based on a parameter. It also needs 2 parameters that are the same for every thread. This is what I have:
from concurrent.futures import ThreadPoolExecutor

def do_something_parallel(sameValue1, sameValue2, differentValue):
    print(str(sameValue1))      # same every time
    print(str(sameValue2))      # same every time
    print(str(differentValue))  # different

def main():
    differentValues = ["1000ms", "100ms", "10ms", "20ms", "50ms"]
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(do_something_parallel, sameValue1, sameValue2, differentValue) for differentValue in differentValues]
But I don't know what to do next.
If you don't care about the order, you can now do:
from concurrent.futures import as_completed

# The rest of your code here

for f in as_completed(futures):
    # Do what you want with f.result(), for example:
    print(f.result())
Otherwise, if you care about order, it might make sense to use ThreadPoolExecutor.map with functools.partial to fill in the arguments that are always the same:
from functools import partial

# The rest of your code...

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(
        partial(do_something_parallel, sameValue1, sameValue2),
        differentValues
    )
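executor.map returns the results lazily and in the same order as differentValues, so you can then simply iterate over them:

for result in results:
    print(result)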