Python multithreading with concurrent.futures

My problem is that whenever I use thr.result() the program acts like it's running on one thread, but when I don't use thr.result() it uses all the threads.
So if I remove my if statement it runs on 10 threads; if I leave it in, it acts like it's on a single thread.
import requests
from concurrent.futures import ThreadPoolExecutor

def search(query):
    r = requests.get("https://www.google.com/search?q=" + query)
    return r.status_code

pool = ThreadPoolExecutor(max_workers=10)

for i in range(50):
    thr = pool.submit(search, "stocks")
    print(i)
    if thr.result() != 404:
        print("Ran")

pool.shutdown(wait=True)

That's because result() will wait for the future to complete:
Return the value returned by the call. If the call hasn’t yet completed then this method will wait up to timeout seconds. If the call hasn’t completed in timeout seconds, then a concurrent.futures.TimeoutError will be raised. timeout can be an int or float. If timeout is not specified or None, there is no limit to the wait time.
When you call result() inside the loop, you submit a task, wait for it to complete, and only then submit the next one, so there can be only one task running at a time.
Update: You can either store the returned futures in a list and iterate over them once you have submitted all the tasks, or you can use map:
from concurrent.futures import ThreadPoolExecutor
import time

def square(x):
    time.sleep(0.3)
    return x * x

print(time.time())
with ThreadPoolExecutor(max_workers=3) as pool:
    for res in pool.map(square, range(10)):
        print(res)
print(time.time())
Output:
1485845609.983702
0
1
4
9
16
25
36
49
64
81
1485845611.1942203
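The other option mentioned above, storing the returned futures in a list and only calling result() after everything has been submitted, might look like this; a minimal sketch reusing the same square helper:
from concurrent.futures import ThreadPoolExecutor
import time

def square(x):
    time.sleep(0.3)
    return x * x

with ThreadPoolExecutor(max_workers=3) as pool:
    # submit everything first so the workers can run in parallel...
    futures = [pool.submit(square, x) for x in range(10)]
    # ...and only then block on the results
    for fut in futures:
        print(fut.result())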

Related

Python Multiprocessing or Thread

My code
import time
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor

def print_function(tests):
    while True:
        print tests
        time.sleep(2)

executor = ThreadPoolExecutor(max_workers=2)
for i in range(5):
    a = executor.submit(print_function(i))
Output:
0 0 0 0 0 0 0 0...
but I want the output to be 012345, 012345, 012345...
How can I do this?
In the line
a = executor.submit(print_function(i))
^^^^^^^^^^^^^^^^^
you are calling the function already. Since it has a while True, it will never finish and thus submit() will never be reached.
The solution is to pass the function as a reference and the argument separately:
a = executor.submit(print_function, i)
However, notice that you will not get the output you want (012345), since a) the range stops at 4, b) you only start 2 workers, and c) the operating system decides which thread runs next, so the ordering will be seemingly random (more like 310254).
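A minimal sketch of the corrected submission, assuming the same print_function but with Python 3 print and a finite inner loop so the example terminates (those two changes are mine, not the asker's):
import time
from concurrent.futures import ThreadPoolExecutor

def print_function(tests):
    # finite loop so the sketch terminates; the original used `while True`
    for _ in range(3):
        print(tests)
        time.sleep(2)

with ThreadPoolExecutor(max_workers=5) as executor:
    for i in range(5):
        # pass the callable and its argument separately; do not call it here
        executor.submit(print_function, i)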

How to correctly handle exceptions in multiprocessing

It is possible to retrieve the outputs of workers with Pool.map, but when one worker fails, an exception is raised and it's not possible to retrieve the outputs anymore. So, my idea was to log the outputs in a process-synchronized queue so as to retrieve the outputs of all successful workers.
The following snippet seems to work:
from multiprocessing import Pool, Manager
from functools import partial

def f(x, queue):
    if x == 4:
        raise Exception("Error")
    queue.put_nowait(x)

if __name__ == '__main__':
    queue = Manager().Queue()
    pool = Pool(2)
    try:
        pool.map(partial(f, queue=queue), range(6))
        pool.close()
        pool.join()
    except:
        print("An error occurred")
    while not queue.empty():
        print("Output => " + str(queue.get()))
But I was wondering whether a race condition could occur during the queue polling phase. I'm not sure whether the queue process will necessarily be alive when all workers have completed. Do you think my code is correct from that point of view?
As far as "how to correctly handle exceptions", which is your main question:
First, in your case you will never get to execute pool.close and pool.join. But pool.map will not return until all the submitted tasks have either returned their results or generated an exception, so you don't really need to call these to be sure that all of your submitted tasks have completed. If it weren't for worker function f writing the results to a queue, you would never be able to get any results back using map as long as any of your tasks resulted in an exception. You would instead have to submit individual tasks with apply_async and get an AsyncResult instance for each one.
So I would say that a better way of handling exceptions in your worker functions, without having to resort to a queue, would be as follows. But note that with apply_async tasks are submitted one at a time, which can result in many shared memory accesses. This only really becomes a performance issue when the number of tasks being submitted is very large. In that case it would be better for the worker functions to handle the exceptions themselves and somehow pass back an error indication, allowing the use of map or imap, where you could specify a chunksize.
When using a queue, also be aware that writing to a managed queue has a fair bit of overhead. The second piece of code shows how you can reduce that overhead a bit by using a multiprocessing.Queue instance, which, unlike the managed queue, does not use a proxy. Note the output order, which is not the order in which the tasks were submitted but rather the order in which they completed -- another potential downside or upside to using a queue (you can use a callback function with apply_async if you want the results in completion order). Even with your original code you should not depend on the order of results in the queue.
from multiprocessing import Pool

def f(x):
    if x == 4:
        raise Exception("Error")
    return x

if __name__ == '__main__':
    pool = Pool(2)
    results = [pool.apply_async(f, args=(x,)) for x in range(6)]
    for x, result in enumerate(results):  # result is an AsyncResult instance
        try:
            return_value = result.get()
        except:
            print(f'An error occurred for x = {x}')
        else:
            print(f'For x = {x} the return value is {return_value}')
Prints:
For x = 0 the return value is 0
For x = 1 the return value is 1
For x = 2 the return value is 2
For x = 3 the return value is 3
An error occurred for x = 4
For x = 5 the return value is 5
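As a sketch of the alternative mentioned above, where the worker function handles its own exceptions and passes back an error indication so that map or imap with a chunksize can still be used (this variant is my illustration, not from the original answer):
from multiprocessing import Pool

def f(x):
    # handle the exception inside the worker and return it as a value
    try:
        if x == 4:
            raise Exception("Error")
        return x
    except Exception as e:
        return e

if __name__ == '__main__':
    with Pool(2) as pool:
        for x, result in enumerate(pool.imap(f, range(6), chunksize=2)):
            if isinstance(result, Exception):
                print(f'An error occurred for x = {x}')
            else:
                print(f'For x = {x} the return value is {result}')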
OP's Original Code Modified to Use multiprocessing.Queue
from multiprocessing import Pool, Queue

def init_pool(q):
    global queue
    queue = q

def f(x):
    if x == 4:
        raise Exception("Error")
    queue.put_nowait(x)

if __name__ == '__main__':
    queue = Queue()
    pool = Pool(2, initializer=init_pool, initargs=(queue,))
    try:
        pool.map(f, range(6))
    except:
        print("An error occurred")
    while not queue.empty():
        print("Output => " + str(queue.get()))
Prints:
An error occurred
Output => 0
Output => 2
Output => 3
Output => 1
Output => 5

Run process after process in a queue using python

I have a queue of 500 processes that I want to run through a Python script, and I want to run N of them in parallel at a time.
What my python script does so far:
It runs N processes in parallel, waits for all of them to terminate, then runs the next N files.
What I need to do:
When one of the N processes is finished, another process from the queue is automatically started, without waiting for the rest of the processes to terminate.
Note: I do not know how much time each process will take, so I can't schedule a process to run at a particular time.
Following is the code that I have.
I am currently using subprocess.Popen, but I'm not limited to its use.
for i in range(0, len(queue), N):
    batch = []
    for _ in range(int(jobs)):
        batch.append(queue.pop(0))
    for process in batch:
        p = subprocess.Popen([process])
        ps.append(p)
    for p in ps:
        p.communicate()
I believe this should work:
import subprocess
import time

def check_for_done(l):
    for i, p in enumerate(l):
        if p.poll() is not None:  # poll() returns the exit code once the process has finished
            return True, i
    return False, False

processes = list()
N = 5
queue = list()  # fill this with the commands you want to run

for process in queue:
    p = subprocess.Popen(process)
    processes.append(p)
    if len(processes) == N:
        wait = True
        while wait:
            done, num = check_for_done(processes)
            if done:
                processes.pop(num)
                wait = False
            else:
                time.sleep(0.5)  # set this so the CPU does not go crazy
So you have a list of active processes, and the check_for_done function loops through it: poll() returns None if the process has not finished and its return code if it has. So when poll() returns something other than None, the process is done (without knowing whether it was successful or not). You then remove that process from the list, allowing the loop to start another one.
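Note that once the loop over queue ends, up to N processes may still be running; a minimal way to wait for those stragglers (a drain step I'm adding, not part of the answer above) would be:
# continuation of the snippet above: wait for any processes still running
for p in processes:
    p.wait()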
Assuming Python 3, you could make use of ThreadPoolExecutor from concurrent.futures like this:
$ cat run.py
from subprocess import Popen, PIPE
from concurrent.futures import ThreadPoolExecutor
import time

def exec_(cmd):
    proc = Popen(cmd, stdout=PIPE, stderr=PIPE)
    stdout, stderr = proc.communicate()
    #print(stdout, stderr)

def main():
    with ThreadPoolExecutor(max_workers=4) as executor:
        # to demonstrate, it will take a batch of 4 jobs at the same time
        cmds = [['sleep', '4'] for i in range(10)]
        start = time.time()
        futures = executor.map(exec_, cmds)
        for future in futures:
            pass
        end = time.time()
    print(f'Took {end-start} seconds')

if __name__ == '__main__':
    main()
This will process 4 tasks at a time, and since the number of tasks is 10, it should only take around 4 + 4 + 4 = 12 seconds:
First 4 seconds for the first 4 tasks
Second 4 seconds for the next 4 tasks
And the final 4 seconds for the last 2 remaining tasks
Output:
$ python run.py
Took 12.005989074707031 seconds

Multiprocessing Running Slower than a Single Process

I'm attempting to use multiprocessing to run many simulations across multiple processes; however, the code I have written only uses 1 of the processes as far as I can tell.
Updated
I've gotten all the processes to work (I think) thanks to @PaulBecotte; however, the multiprocessing version seems to run significantly slower than its non-multiprocessing counterpart.
For instance, not including the function and class declarations/implementations and imports, I have:
def monty_hall_sim(num_trial, player_type='AlwaysSwitchPlayer'):
    if player_type == 'NeverSwitchPlayer':
        player = NeverSwitchPlayer('Never Switch Player')
    else:
        player = AlwaysSwitchPlayer('Always Switch Player')
    return (MontyHallGame().play_game(player) for trial in xrange(num_trial))
def do_work(in_queue, out_queue):
    while True:
        try:
            f, args = in_queue.get()
            ret = f(*args)
            for result in ret:
                out_queue.put(result)
        except:
            break
def main():
    logging.getLogger().setLevel(logging.ERROR)
    always_switch_input_queue = multiprocessing.Queue()
    always_switch_output_queue = multiprocessing.Queue()
    total_sims = 20
    num_processes = 5
    process_sims = total_sims/num_processes

    with Timer(timer_name='Always Switch Timer'):
        for i in xrange(num_processes):
            always_switch_input_queue.put((monty_hall_sim, (process_sims, 'AlwaysSwitchPlayer')))

        procs = [multiprocessing.Process(target=do_work, args=(always_switch_input_queue, always_switch_output_queue)) for i in range(num_processes)]

        for proc in procs:
            proc.start()

        always_switch_res = []
        while len(always_switch_res) != total_sims:
            always_switch_res.append(always_switch_output_queue.get())

        always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))

        print '\tLength of Always Switch Result List: {alw_sw_len}'.format(alw_sw_len=len(always_switch_res))
        print '\tThe success average of switching doors was: {alw_sw_prob}'.format(alw_sw_prob=always_switch_success)
which yields:
Time Elapsed: 1.32399988174 seconds
Length: 20
The success average: 0.6
However, I am attempting to use this for total_sims = 10,000,000 over num_processes = 5, and doing so has taken significantly longer than using 1 process (1 process returned in ~3 minutes). The non-multiprocessing counterpart I'm comparing it to is:
def main():
    logging.getLogger().setLevel(logging.ERROR)
    with Timer(timer_name='Always Switch Monty Hall Timer'):
        always_switch_res = [MontyHallGame().play_game(AlwaysSwitchPlayer('Monty Hall')) for x in xrange(10000000)]

        always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))

        print '\n\tThe success average of not switching doors was: {not_switching}' \
              '\n\tThe success average of switching doors was: {switching}'.format(not_switching=never_switch_success,
                                                                                   switching=always_switch_success)
You could try importing "process" under some if statements.
EDIT- you changed some stuff, let me try and explain a bit better.
Each message you put into the input queue will cause the monty_hall_sim function to get called and send num_trial messages to the output queue.
So your original implementation was right- to get 20 output messages, send in 5 input messages.
However, your function is slightly wrong.
for trial in xrange(num_trial):
    res = MontyHallGame().play_game(player)
    yield res
This will turn the function into a generator that will provide a new value on each next() call- great! The problem is here
while True:
    try:
        f, args = in_queue.get(timeout=1)
        ret = f(*args)
        out_queue.put(ret.next())
    except:
        break
Here, on each pass through the loop you create a NEW generator with a NEW message. The old one is thrown away. So here, each input message only adds a single output message to the queue before you throw it away and get another one. The correct way to write this is-
while True:
    try:
        f, args = in_queue.get(timeout=1)
        ret = f(*args)
        for result in ret:
            out_queue.put(result)
    except:
        break
Doing it this way will continue to yield output messages from the generator until it finishes (after yielding 4 messages in this case)
I was able to get my code to run significantly faster by changing monty_hall_sim's return to a list comprehension, having do_work add the whole lists to the output queue, and then extending the results list in main with the lists returned by the output queue. That made it run in ~13 seconds.
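A rough, self-contained sketch of that change, using a dummy simulate_batch as a stand-in for the asker's Monty Hall classes (the stand-in function and names here are mine): the worker returns a whole list, do_work puts the list on the output queue in one shot, and the main process extends its results with each batch.
import multiprocessing
import random

def simulate_batch(num_trial):
    # stand-in for the asker's simulation: return a whole list of results
    return [random.random() < 2.0 / 3.0 for _ in range(num_trial)]

def do_work(in_queue, out_queue):
    while True:
        try:
            f, args = in_queue.get(timeout=1)
            out_queue.put(f(*args))  # one message per batch instead of per result
        except Exception:
            break

if __name__ == '__main__':
    total_sims, num_processes = 20, 5
    in_q, out_q = multiprocessing.Queue(), multiprocessing.Queue()
    for _ in range(num_processes):
        in_q.put((simulate_batch, (total_sims // num_processes,)))
    procs = [multiprocessing.Process(target=do_work, args=(in_q, out_q))
             for _ in range(num_processes)]
    for p in procs:
        p.start()
    results = []
    while len(results) != total_sims:
        results.extend(out_q.get())  # extend with whole batches
    print(len(results), float(results.count(True)) / len(results))
    for p in procs:
        p.join()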

Python wait for processes in multiprocessing Pool to complete without either closing the Pool or using map()

I have a code piece like below
pool = multiprocessing.Pool(10)
for i in range(300):
    for m in range(500):
        data = do_some_calculation(resource)
        pool.apply_async(paralized_func, data, callback=update_resource)
    # need to wait for all processes finish
    # {...}
    # Summarize resource
    do_something_with_resource(resource)
So basically I have 2 loops. I initialize the process pool once outside the loops to avoid the overhead of recreating it. At the end of the 2nd loop, I want to summarize the results of all processes.
The problem is that I can't use pool.map() to wait because the data input varies, and I can't use pool.close() and pool.join() either because I still need the pool in the next iteration of the 1st loop.
What is the good way to wait for processes to finish in this case?
I tried checking for pool._cache at the end of 2nd loop.
while len(process_pool._cache) > 0:
    sleep(0.001)
This way works but look weird. Is there a better way to do this?
apply_async returns an AsyncResult object, which has a wait([timeout]) method you can use.
Example:
pool = multiprocessing.Pool(10)
for i in range(300):
    results = []
    for m in range(500):
        data = do_some_calculation(resource)
        result = pool.apply_async(paralized_func, data, callback=update_resource)
        results.append(result)
    [result.wait() for result in results]
    # need to wait for all processes finish
    # {...}
    # Summarize resource
    do_something_with_resource(resource)
I haven't checked this code as it is not executable, but it should work.
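For a self-contained illustration of the same pattern (the square helper below is hypothetical, not from the question):
import multiprocessing
import time

def square(x):
    time.sleep(0.1)
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    results = [pool.apply_async(square, (x,)) for x in range(8)]
    [result.wait() for result in results]        # block until every task has finished
    print([result.get() for result in results])  # [0, 1, 4, 9, 16, 25, 36, 49]
    pool.close()
    pool.join()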
There's an issue with the most upvoted answer:
[result.wait() for result in results]
will not work as a roadblock if some of the workers raised an exception: for wait(), an exception counts as completion, so execution proceeds even though some tasks failed. Here's a possible check that all workers finished processing successfully.
while True:
    time.sleep(1)
    # catch exception if results are not ready yet
    try:
        ready = [result.ready() for result in results]
        successful = [result.successful() for result in results]
    except Exception:
        continue
    # exit loop if all tasks returned success
    if all(successful):
        break
    # raise exception reporting exceptions received from workers
    if all(ready) and not all(successful):
        raise Exception(f'Workers raised following exceptions {[result._value for result in results if not result.successful()]}')
Or you can use a callback to record how many returns you have got.
pool = multiprocessing.Pool(10)

for i in range(300):
    finished = [0]  # mutable counter the callback can increment

    def count_result(result):
        update_resource(result)  # keep the work the original callback did
        finished[0] += 1

    for m in range(500):
        data = do_some_calculation(resource)
        pool.apply_async(paralized_func, data, callback=count_result)

    # need to wait for all processes finish
    while finished[0] < 500:
        pass

    # Summarize resource
    do_something_with_resource(resource)
