Python - How to schedule ThreadPoolExecutor?

I'm using a concurrent.futures.ThreadPoolExecutor with a Queue; the code is something like this:
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def func(parent):
    return parent//2, parent//2, parent<=2

def worker(parent, q):
    child1, child2, end = func(parent)
    print(parent)
    if not end:
        q.put(child1)
        q.put(child2)

if __name__ == "__main__":
    q = Queue()
    q.put(100)
    executor = ThreadPoolExecutor(max_workers=6)
    while True:
        parent = q.get()
        future = executor.submit(worker, parent, q)
        if q.empty() and future.done():
            break
The problem with this code is that future.done() is never True, and I cannot get out of this infinite while loop.
My expected outcome is to wait until there's nothing left to process, i.e. the queue is empty, all workers have finished their jobs, and nothing further will be put into the queue. Then I can stop the loop and do something else.
P.S. The actual func I'm using is more complex than the example above, but the problem is the same.

Your problem is that you're not patient enough: the call to executor.submit returns immediately, even if the implied call to worker hasn't happened yet -- that is exactly the point of asynchronous constructs like futures. So when you check future.done() directly afterwards, there is a good chance the check runs before the executor has had time to execute worker, meaning your future won't be done yet.
You can verify this by inserting the following code between the call to submit and your if statement:
import time
time.sleep(0.1)
This will usually make the check succeed, but it doesn't really solve your problem in an elegant way.
Looking deeper, your problem is that your scheduled tasks may generate new tasks, and you only know whether they did once they have completed. This means you have to wait until the task you just scheduled has executed before you can decide whether to stop scheduling new tasks:
if __name__ == "__main__":
    q = Queue()
    q.put(100)
    with ThreadPoolExecutor(max_workers=6) as executor:
        while not q.empty():
            parent = q.get()
            future = executor.submit(worker, parent, q)
            future.result()  # Wait for task
Also make sure to call Executor.shutdown, or better, use the executor in a context manager (as shown above) so that all resources are freed correctly once you're done.
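If you also want several workers to run at the same time (the loop above effectively processes one task at a time because it waits on each future), one possible variation -- sketched here as my own suggestion, not taken from the answer above, and reusing func and worker from the question -- is to keep a set of pending futures and only stop once the queue is empty and no submitted worker is still running:
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from queue import Queue, Empty

def func(parent):
    return parent // 2, parent // 2, parent <= 2

def worker(parent, q):
    child1, child2, end = func(parent)
    print(parent)
    if not end:
        q.put(child1)
        q.put(child2)

if __name__ == "__main__":
    q = Queue()
    q.put(100)
    pending = set()
    with ThreadPoolExecutor(max_workers=6) as executor:
        while True:
            # Submit a worker for everything currently sitting in the queue.
            try:
                while True:
                    pending.add(executor.submit(worker, q.get_nowait(), q))
            except Empty:
                pass
            if not pending:
                break  # queue is empty and no running worker can refill it
            # Wait until at least one worker finishes, then re-check the queue.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
The break condition is safe because a worker can only add children to the queue while it is still in the pending set.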

Related

Python, ThreadPoolExecutor, pool execution doesn't terminate

I have some simple code modelling a more complicated problem I need to solve. There are three functions: a worker, a task submitter (which looks for tasks and puts them on a queue as it finds them), and a function that creates a pool and adds new tasks to it. But the code does not finish running after the queue becomes empty and all the tasks in the list are finished. I cannot see why the while loop's condition never terminates it; I have tried coding this in different ways and nothing works.
from concurrent.futures import ThreadPoolExecutor as Tpe
import time
import random
import queue
import threading

def task_submit(q):
    for i in range(7):
        threading.currentThread().setName('task_submit')
        new_task = random.randint(10, 20)
        q.put_nowait(new_task)
        print(f' {i} new task with argument {new_task} has been added to queue')
        time.sleep(5)

def worker(t):
    threading.currentThread().setName(f'worker {t}')
    print(f'{threading.currentThread().getName()} started')
    time.sleep(t)
    print(f'{threading.currentThread().getName()} FINISHED!')

def execution():
    executor = Tpe(max_workers=4)
    q = queue.Queue(maxsize=100)
    q_thread = executor.submit(task_submit, q)
    tasks = [executor.submit(worker, q.get())]
    execution_finished = False
    while not execution_finished:  # all([task.done() for task in tasks]):
        if not all([task.done() for task in tasks]):
            print(' still in progress .....................')
            tasks.append(executor.submit(worker, q.get()))
        else:
            print(' all done!')
            executor.shutdown()
            execution_finished = True

execution()
It doesn't terminate because you are trying to remove an item from an empty queue. The problem is here:
while not execution_finished:
    if not all([task.done() for task in tasks]):
        print(' still in progress .....................')
        tasks.append(executor.submit(worker, q.get()))
The last line here submits a new work item to the executor. Suppose that happens to be the last item in the queue. At that moment, the executor is not finished and will not be finished for a few seconds. Your main thread goes back to the while not execution_finished line, and the if statement evaluates true because some of the tasks are still running. So you try to submit one more item but you can't, because the queue is now empty. The call to q.get blocks the main loop until the queue contains an item, which never happens. The other threads finish but the program doesn't exit because the main thread is blocked.
Perhaps you should check for an empty queue, but I'm not sure that's the right idea because I probably don't understand your requirements. In any case, that's why your script doesn't exit.
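If checking for an empty queue is indeed what you want, one way to restructure the while loop -- shown here only as a rough sketch that reuses q, q_thread, tasks, worker and executor from the question -- is to use a timeout on q.get() so the main thread never blocks forever:
import queue   # for the queue.Empty exception

while not execution_finished:
    try:
        t = q.get(timeout=1)                      # 1 second is an arbitrary choice
        tasks.append(executor.submit(worker, t))  # only submit when there is work
    except queue.Empty:
        pass
    # Done only when the submitter has finished, the queue is drained,
    # and every worker has completed.
    if q_thread.done() and q.empty() and all(task.done() for task in tasks):
        print(' all done!')
        executor.shutdown()
        execution_finished = True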

Python multiprocessing map using with statement does not stop

I am using multiprocessing python module to run parallel and unrelated jobs with a function similar to the following example:
import numpy as np
from multiprocessing import Pool

def myFunction(arg1):
    name = "file_%s.npy" % arg1
    A = np.load(arg1)
    A[A<0] = np.nan
    np.save(arg1, A)

if(__name__ == "__main__"):
    N = list(range(50))
    with Pool(4) as p:
        p.map_async(myFunction, N)
        p.close()  # I tried with and without that statement
        p.join()   # I tried with and without that statement
    DoOtherStuff()
My problem is that the function DoOtherStuff is never executed; the processes switch to sleep mode (as seen in top) and I need to kill the script with Ctrl+C to stop it.
Any suggestions?
You have at least a couple of problems. First, you are using map_async(), which does not block until the tasks are complete. So what you're doing is starting the tasks with map_async() but then immediately closing and terminating the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a process pool with methods like map_async, they are put on a task queue that is handled by a worker thread, which takes tasks off that queue and farms them out to worker processes, possibly spawning new processes as needed (actually a separate thread handles that part).
Point being, you have a race condition where you're terminating the Pool likely before any tasks are even started. If you want your script to block until all the tasks are done just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool

def myFunction(N):
    A = np.load(f'file_{N:02}.npy')
    A[A<0] = np.nan
    np.save(f'file2_{N:02}.npy', A)

def DoOtherStuff():
    print('done')

if __name__ == "__main__":
    N = range(50)
    with Pool(4) as p:
        p.map(myFunction, N)
    DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
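As a small aside (my own addition, not part of the original snippet above): calling result.get() instead of result.wait() also re-raises any exception that occurred inside a worker, which is often what you want when managing the AsyncResult yourself:
result = pool.map_async(myFunction, N)
DoOtherStuff()
try:
    result.get()          # blocks like wait(), but surfaces worker exceptions
except Exception as exc:
    print(f'a worker failed: {exc!r}')
pool.close()
pool.join()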
I don't know why in your attempt you got a deadlock -- I was not able to reproduce that. It's possible there was a bug at some point that was then fixed, though you were also possibly invoking undefined behavior with your race condition, as well as calling terminate() on a pool after it had already been join()ed. As for why your answer did anything at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.

How to return values from Process- or Thread instances?

So I want to run a function which can either search for information on the web or directly from my own mysql database.
The first process will be time-consuming, the second relatively fast.
With this in mind I create a process which starts this compound search (find_compound_view). If the process finishes relatively fast it means it's present on the database so I can render the results immediately. Otherwise, I will render "drax_retrieving_data.html".
The stupid solution I came up with was to run the function twice, once to check whether the process takes a long time, and once to actually get the return values of the function. This is pretty much because I don't know how to return the values from my find_compound_view function. I've tried googling, but I can't seem to find how to return values from the Process class specifically.
p = Process(target=find_compound_view, args=(form,))
p.start()
is_running = p.is_alive()
start_time = time.time()
while is_running:
    time.sleep(0.05)
    is_running = p.is_alive()
    if time.time() - start_time > 10:
        print('Timer exceeded, DRAX is retrieving info!', time.time() - start_time)
        return render(request, 'drax_internal_dbs/drax_retrieving_data.html')

compound = find_compound_view(form, use_email=False)
if compound:
    data = *****
    return render(request, 'drax_internal_dbs/result.html', data)
You will need a multiprocessing.Pipe or a multiprocessing.Queue to send the results back to your parent process. If you are just doing I/O, you should use a Thread instead of a Process, since it's more lightweight and most of the time will be spent waiting. I'm showing you how it's done for Process and Thread in general.
Process with Queue
The multiprocessing queue is built on top of a pipe, and access is synchronized with locks/semaphores. Queues are thread- and process-safe, meaning you can use one queue for multiple producer/consumer processes and even multiple threads in those processes. Adding the first item to the queue also starts a feeder thread in the calling process. The additional overhead of a multiprocessing.Queue makes a pipe preferable and more performant for single-producer/single-consumer scenarios.
Here's how to send and retrieve a result with a multiprocessing.Queue:
from multiprocessing import Process, Queue

SENTINEL = 'SENTINEL'

def sim_busy(out_queue, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    p = Process(target=sim_busy, args=(out_queue, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel breaks the loop
        print(result)
The queue is passed as an argument into the function, results are .put() on the queue, and the parent .get()s from the queue. .get() is a blocking call; execution does not resume until there is something to get (a timeout parameter can be specified). Note that the work sim_busy does here is CPU-intensive; that's when you would choose processes over threads.
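As a small illustration of that timeout parameter (my own addition, not part of the example above): get() raises the queue.Empty exception if nothing arrives in time.
import queue

try:
    result = out_queue.get(timeout=5)   # 5 seconds is an arbitrary choice
except queue.Empty:
    print('no result arrived within 5 seconds')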
Process & Pipe
For one-to-one connections a pipe is enough. The setup is nearly identical, just the methods are named differently and a call to Pipe() returns two connection objects. In duplex mode both objects are read-write ends; with duplex=False (simplex) the first connection object is the read end of the pipe and the second is the write end. In this basic scenario we just need a simplex pipe:
from multiprocessing import Process, Pipe

SENTINEL = 'SENTINEL'

def sim_busy(write_conn, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    write_conn.send(result)
    # If all results are sent, send a sentinel-value to let the parent know
    # no more results will come.
    write_conn.send(SENTINEL)

if __name__ == '__main__':
    # duplex=False because we just need one-way communication in this case.
    read_conn, write_conn = Pipe(duplex=False)
    p = Process(target=sim_busy, args=(write_conn, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(read_conn.recv, SENTINEL):  # sentinel breaks the loop
        print(result)
Thread & Queue
For use with threading, you want to switch to queue.Queue. queue.Queue is built on top of a collections.deque, adding some locks to make it thread-safe. Unlike with multiprocessing's queue and pipe, objects put on a queue.Queue won't get pickled. Since threads share the same memory address space, serialization for memory-copying is unnecessary; only pointers are transmitted.
from threading import Thread
from queue import Queue
import time

SENTINEL = 'SENTINEL'

def sim_io(out_queue, query):
    time.sleep(1)
    result = query + '_result'
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    t = Thread(target=sim_io, args=(out_queue, 'my_query'))
    t.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel-value breaks the loop
        print(result)
Read here why for result in iter(out_queue.get, SENTINEL): should be preferred over a while True ... break setup, where possible.
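For reference, the iter()-based loop used in the examples above is equivalent to this more verbose form:
while True:
    result = out_queue.get()
    if result == SENTINEL:
        break
    print(result)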
Read here why you should use if __name__ == '__main__': in all your scripts and especially in multiprocessing.
More about get()-usage here.

How do I feed an infinite generator to eventlet (or gevent)?

The docs of both eventlet and gevent have several examples on how to asynchronously spawn IO tasks and get the results later.
But so far, in all the examples where a value should be returned from the async call, I always find a blocking call after all the calls to spawn(): either join(), joinall(), wait(), or waitall().
This assumes that calling the functions that use IO is immediate and we can jump right into the point where we are waiting for the results.
But in my case I want to get the jobs from a generator that can be slow and or arbitrarily large or even infinite.
I obviously can't do this
pile = eventlet.GreenPile(pool)
for url in mybiggenerator():
    pile.spawn(fetch_title, url)
titles = '\n'.join(pile)
because mybiggenerator() can take a long time before it is exhausted. So I have to start consuming the results while I am still spawning async calls.
This is probably usually done with recourse to queues, but I'm not really sure how. Say I create a queue to hold jobs, push a bunch of jobs from a greenlet called P and pop them from another greenlet C.
When in C, if I find that the queue is empty, how do I know if P has pushed every job it had to push or if it is just in the middle of an iteration?
Alternatively, eventlet allows me to loop through a pile to get the return values, but can I start doing this without having spawned all the jobs I have to spawn? How? This would be a simpler alternative.
You don't need any pool or pile by default. They're just convenient wrappers to implement a particular strategy. First you should get an idea of how exactly your code must work under all circumstances, that is: when and why you start another greenthread, and when and why you wait for something.
When you have some answers to these questions and doubts about others, ask away. In the meanwhile, here's a prototype that processes an infinite "generator" (actually a queue).
import eventlet

queue = eventlet.queue.Queue(10000)
wait = eventlet.semaphore.CappedSemaphore(1000)

def fetch(url):
    # httplib2.Http().request
    # or requests.get
    # or urllib.urlopen
    # or whatever API you like
    return response

def crawl(url):
    with wait:
        response = fetch(url)
        links = parse(response)
        for url in links:
            queue.put(url)

def spawn_crawl_next():
    try:
        url = queue.get(block=False)
    except eventlet.queue.Empty:
        return False
    # use another CappedSemaphore here to limit number of outstanding connections
    eventlet.spawn(crawl, url)
    return True

def crawler():
    while True:
        if spawn_crawl_next():
            continue
        while wait.balance != 0:
            eventlet.sleep(1)
        # if last spawned `crawl` enqueued more links -- process them
        if not spawn_crawl_next():
            break

def main():
    queue.put('http://initial-url')
    crawler()
Re: "concurrent.futures from Python3 does not really apply to "eventlet or gevent" part."
In fact, eventlet can be combined with concurrent.futures to use the ThreadPoolExecutor as a GreenThread executor.
See: https://github.com/zopefiend/green-concurrent.futures-with-eventlet/commit/aed3b9f17ac27eeaf8c56210e0c8e4aff2ecbdb5
I had the same problem and it has been super difficult to find any answers.
I think I managed to get something working by having a consumer running on a separate thread and using Event for synchronization. Seems to work fine.
Only caveat is that you have to be careful with monkey-patching. If you monkey-patch threading facilities this will probably not work.
import gevent
import gevent.queue
import threading
import time

q = gevent.queue.JoinableQueue()
queue_not_empty = threading.Event()

def run_task(task):
    print(f"Started task {task} # {time.time()}")
    # Use whatever has been monkey-patched with gevent here
    gevent.sleep(1)
    print(f"Finished task {task} # {time.time()}")

def consumer():
    while True:
        print("Waiting for item in queue")
        queue_not_empty.wait()
        try:
            task = q.get()
            print(f"Dequed task {task} for consumption # {time.time()}")
        except gevent.exceptions.LoopExit:
            queue_not_empty.clear()
            continue
        try:
            gevent.spawn(run_task, task)
        finally:
            q.task_done()
        gevent.sleep(0)  # Kickstart task

def enqueue(item):
    q.put(item)
    queue_not_empty.set()

# Run consumer on separate thread
consumer_thread = threading.Thread(target=consumer, daemon=True)
consumer_thread.start()

# Add some tasks
for i in range(5):
    enqueue(i)

time.sleep(2)
Output:
Waiting for item in queue
Dequed task 0 for consumption # 1643232632.0220542
Started task 0 # 1643232632.0222237
Waiting for item in queue
Dequed task 1 for consumption # 1643232632.0222733
Started task 1 # 1643232632.0222948
Waiting for item in queue
Dequed task 2 for consumption # 1643232632.022315
Started task 2 # 1643232632.02233
Waiting for item in queue
Dequed task 3 for consumption # 1643232632.0223525
Started task 3 # 1643232632.0223687
Waiting for item in queue
Dequed task 4 for consumption # 1643232632.022386
Started task 4 # 1643232632.0224123
Waiting for item in queue
Finished task 0 # 1643232633.0235817
Finished task 1 # 1643232633.0236874
Finished task 2 # 1643232633.0237293
Finished task 3 # 1643232633.0237558
Finished task 4 # 1643232633.0237799
Waiting for item in queue
With the new concurrent.futures module in Py3k, I would say (assuming that the processing you want to do is actually something more complex than join):
with concurrent.futures.ThreadPoolExecutor(max_workers=foo) as wp:
    res = [wp.submit(fetchtitle, url) for url in mybiggenerator()]
    ans = '\n'.join(a.result() for a in concurrent.futures.as_completed(res))
This will allow you to start processing results before all of your fetchtitle calls complete. However, it will require you to exhaust mybiggenerator before you continue -- it's not clear how you want to get around this, unless you want to set some max_urls parameter or similar. That would still be something you could do with your original implementation, though.
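If exhausting mybiggenerator up front is not acceptable, one possible workaround -- sketched here as my own extension of the idea, not something from the answer above -- is to cap the number of in-flight futures and drain completed ones as you go. max_pending and max_workers are arbitrary choices; fetchtitle and mybiggenerator come from the question.
import concurrent.futures

def process_lazily(max_pending=100):
    pending = set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as wp:
        for url in mybiggenerator():
            pending.add(wp.submit(fetchtitle, url))
            if len(pending) >= max_pending:
                # Block until at least one fetch finishes, then handle it.
                done, pending = concurrent.futures.wait(
                    pending, return_when=concurrent.futures.FIRST_COMPLETED)
                for fut in done:
                    print(fut.result())
        # Drain whatever is still outstanding once the generator is exhausted.
        for fut in concurrent.futures.as_completed(pending):
            print(fut.result())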

Return whichever expression returns first

I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.
I would simply use a Queue for this. Start the threads and the first one which has a result ready writes to the queue.
Code
from threading import Thread
from time import sleep
from queue import Queue

def firstresult(*functions):
    queue = Queue()
    threads = []
    for f in functions:
        def thread_main(f=f):  # bind f now, not when the thread runs
            queue.put(f())
        thread = Thread(target=thread_main)
        threads.append(thread)
        thread.start()
    result = queue.get()
    return result

def slow():
    sleep(1)
    return 42

def fast():
    return 0

if __name__ == '__main__':
    print(firstresult(slow, fast))
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For this you need to add some state variable to the threads, which needs to be checked at regular intervals. As I want to keep this example short, I simply omitted that part and assumed that all workers get the time to finish their work even though the result is never read.
Skipping the discussion about the GIL as requested by the questioner. ;-)
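For what it's worth, the "state variable" mentioned in the first note could be a threading.Event that the worker checks between steps; this is only a sketch of the idea, not part of the example above:
import threading
import time

stop_requested = threading.Event()

def cancellable_slow():
    # Do the work in small steps so the stop flag is checked regularly.
    for _ in range(10):
        if stop_requested.is_set():
            return None          # give up early, the result is no longer needed
        time.sleep(0.1)
    return 42

# Once the winner's result has been read, signal the others to stop:
stop_requested.set()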
Now - unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time

def firstresult(func1, func2):
    queue = Queue()
    proc1 = Process(target=func1, args=(queue,))
    proc2 = Process(target=func2, args=(queue,))
    proc1.start(); proc2.start()
    result = queue.get()
    proc1.terminate(); proc2.terminate()
    return result

def algo1(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 1")

def algo2(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 2")

if __name__ == '__main__':
    print(firstresult(algo1, algo2))
Run each function in a new worker thread; the two worker threads send the result back to the main thread via a one-item queue or something similar. When the main thread receives the result from the winner, it kills (do Python threads support kill yet? lol) both worker threads to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.
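One more sketch of that approach using concurrent.futures (my own variant, not from the answers above): wait for whichever future finishes first and return its result. Since threads cannot be killed, the losing function simply keeps running in the background until it finishes on its own.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def firstresult(*functions):
    executor = ThreadPoolExecutor(max_workers=len(functions))
    futures = [executor.submit(f) for f in functions]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    executor.shutdown(wait=False)   # don't block on the slower function
    return next(iter(done)).result()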
You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithreading won't help much - exactly due to the GIL - so multiprocessing is the way to go.
If the return value is a pickleable (serializable) object, I have this decorator I created that simply runs the function in background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want - both calls are non-blocking and start executing right away. The trick with this decorator is that it blocks (and waits for the function to complete) when you try to use the return value.
But on the other hand - it is just a decorator that does all the work.
