asyncio: Scheduling work items that schedule other work items

I am writing a Python program that schedules a number of asynchronous, I/O-bound work items, many of which will also schedule other, similar work items. The work items are completely independent of one another: they do not require each other's results, nor do I need to gather any results from them for any sort of local output (beyond logging, which takes place as part of the work items themselves).
I was originally using a pattern like this:
async def some_task(foo):
    pending = []
    for x in foo:
        # ... do some work ...
        if some_condition:
            pending.append(some_task(bar))
    if pending:
        await asyncio.wait(pending)
However, I was running into trouble with some of the nested asyncio.wait(pending) calls sometimes hanging forever, even though the individual things being awaited always completed (according to the debug output produced when I used KeyboardInterrupt to list the state of the un-gathered results, which showed all of the futures as being in the done state). When I asked others for help they said I should be using asyncio.create_task instead, but I am not finding any useful information about how to do this, nor have I been able to get clarification from the people who suggested it.
So, how can I satisfy this use case?

Python's asyncio.Queue may help tie your program's processing to its completion. It has a join() method that blocks until all items in the queue have been received and processed.
Another benefit I like is that the worker becomes more explicit: it pulls from the queue, processes, potentially adds more items, and then acks - but this is just personal preference.
async def worker(q):
    while True:
        item = await q.get()
        # process the item, potentially queueing more work
        if some_condition:
            await q.put('something new')
        q.task_done()

async def run():
    queue = asyncio.Queue()
    worker_task = asyncio.ensure_future(worker(queue))
    await queue.join()
    worker_task.cancel()

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
loop.close()
The example above was adapted from the asyncio producer/consumer example and modified since your worker both consumes and produces:
https://asyncio.readthedocs.io/en/latest/producer_consumer.html
I'm not entirely sure how to fix your specific example, but I would definitely look at the primitives that asyncio offers to help the event loop hook into your program state, notably join and using a Queue. A sketch of how that might apply to your case follows below.
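As a rough sketch of how the recursive some_task pattern could map onto the queue worker (my own illustration, not the poster's code - process is a placeholder for whatever I/O-bound work an item needs): every newly discovered work item goes onto the queue instead of into a local pending list, and queue.join() only returns once every item, including items enqueued by other items, has been marked done.
import asyncio

async def worker(queue):
    while True:
        item = await queue.get()
        try:
            # placeholder for the real I/O-bound work on this item
            new_items = await process(item)
            for new_item in new_items:
                # work discovered while processing is simply enqueued;
                # queue.join() keeps waiting until these are done too
                await queue.put(new_item)
        finally:
            queue.task_done()

async def main(initial_items):
    queue = asyncio.Queue()
    for item in initial_items:
        await queue.put(item)
    workers = [asyncio.ensure_future(worker(queue)) for _ in range(10)]
    await queue.join()   # returns only when every item, recursive ones included, is done
    for w in workers:
        w.cancel()

# e.g. asyncio.run(main(initial_work_items))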

Python asyncio: Enter into a temporary async context?

I want to write a library that mixes synchronous and asynchronous work, like:
def do_things():
    # 1) do sync things
    # 2) launch a bunch of slow async tasks and block until they are all
    #    complete or an exception is thrown
    # 3) more sync work
    # ...
I started implementing this using asyncio as an excuse to learn the library, but as I learn more it seems like this may be the wrong approach. My problem is that there doesn't seem to be a clean way to do 2, because it depends on the context of the caller. For example:
I can't use asyncio.run(), because the caller could already have a running event loop and you can only have one loop per thread.
Marking do_things as async is too heavy because it shouldn't require the caller to be async. Plus, if do_things was async, calling synchronous code (1 & 3) from an async function seems to be bad practice.
asyncio.get_event_loop() also seems wrong, because it may create a new loop, which if left running would prevent the caller from creating their own loop after calling do_things (though arguably they shouldn't do that). And based on the documentation of loop.close, it looks like starting/stopping multiple loops in a single thread won't work.
Basically it seems like if I want to use asyncio at all, I am forced to use it for the entire lifetime of the program, and therefore all libraries like this one have to be written as either 100% synchronous or 100% asynchronous. But the behavior I want is: Use the current event loop if one is running, otherwise create a temporary one just for the scope of 2, and don't break client code in doing so. Does something like this exist, or is asyncio the wrong choice?
I can't use asyncio.run(), because the caller could already have a running event loop and you can only have one loop per thread.
If the caller has a running event loop, you shouldn't run blocking code in the first place because it will block the caller's loop!
With that in mind, your best option is to indeed make do_things async and call sync code using run_in_executor which is designed precisely for that use case:
async def do_things():
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, sync_stuff)
    await async_func()
    await loop.run_in_executor(None, more_sync_stuff)
This version of do_things is usable from async code as await do_things() and from sync code as asyncio.run(do_things()).
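For instance, a minimal self-contained run might look like this (sync_stuff, async_func, and more_sync_stuff are stand-ins of my own, not functions defined in the question):
import asyncio
import time

def sync_stuff():
    time.sleep(0.1)            # ordinary blocking code, runs in a worker thread

async def async_func():
    await asyncio.sleep(0.1)   # asynchronous work on the event loop

def more_sync_stuff():
    time.sleep(0.1)

async def do_things():
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, sync_stuff)
    await async_func()
    await loop.run_in_executor(None, more_sync_stuff)

asyncio.run(do_things())       # from sync code; from async code you would `await do_things()`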
Having said that... if you know that the sync code will run very briefly, or you are for some reason willing to block the caller's event loop, you can work around the limitation by starting an event loop in a separate thread:
import asyncio
import threading

def run_async(aw):
    result = None
    async def run_and_store_result():
        nonlocal result
        result = await aw
    t = threading.Thread(target=asyncio.run, args=(run_and_store_result(),))
    t.start()
    t.join()
    return result
do_things can then look like this:
def do_things():
    sync_stuff()
    run_async(async_func())
    more_sync_stuff()
It will be callable from both sync and async code, but the cost will be that:
it will create a brand new event loop each and every time (though you can cache the event loop and never exit it - see the sketch below);
when called from async code, it will block the caller's event loop, thus effectively breaking its asyncio usage, even if most of the time is actually spent inside its own async code.
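One way to address the first cost - purely my own sketch, not part of the answer above - is to keep a single background event loop alive in a daemon thread and submit coroutines to it with asyncio.run_coroutine_threadsafe:
import asyncio
import threading

# A single long-lived loop running in a daemon thread (an assumption of this
# sketch, not code from the answer above).
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_async(coro, timeout=None):
    # Blocks the calling thread until the coroutine finishes on the background loop.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result(timeout)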

Handling Timeouts with asyncio

Disclaimer: this is my first time experimenting with the asyncio module.
I'm using asyncio.wait in the following manner to try to support a timeout feature waiting for all results from a set of async tasks. This is part of a larger library so I'm omitting some irrelevant code.
Note that the library already supports submitting tasks and using timeouts with ThreadPoolExecutors and ProcessPoolExecutors, so I'm not really interested in suggestions to use those instead or questions about why I'm doing this with asyncio. On to the code...
import asyncio
from contextlib import suppress
...
class AsyncIOSubmit(Node):
    def get_results(self, futures, timeout=None):
        loop = asyncio.get_event_loop()
        finished, unfinished = loop.run_until_complete(
            asyncio.wait(futures, timeout=timeout)
        )
        if timeout and unfinished:
            # Code options in question would go here...see below.
            raise asyncio.TimeoutError
At first I was not worrying about cancelling pending tasks on timeout, but then I got the warning Task was destroyed but it is pending! on program exit or loop.close. After researching a bit I found multiple ways to cancel tasks and wait for them to actually be cancelled:
Option 1:
[task.cancel() for task in unfinished]
for task in unfinished:
    with suppress(asyncio.CancelledError):
        loop.run_until_complete(task)
Option 2:
[task.cancel() for task in unfinished]
loop.run_until_complete(asyncio.wait(unfinished))
Option 3:
# Not really an option for me, since I'm not in an `async` method
# and don't want to make get_results an async method.
[task.cancel() for task in unfinished]
for task in unfinished:
    await task
Option 4:
Some sort of while loop like in this answer. Seems like my other options are better but including for completeness.
Options 1 and 2 both seem to work fine so far. Either option may be "right", but with asyncio evolving over the years the examples and suggestions around the net are either outdated or vary quite a bit. So my questions are...
Question 1
Are there any practical differences between Options 1 and 2? I know run_until_complete will run until the future has completed, so since Option 1 is looping in a specific order I suppose it could behave differently if earlier tasks take longer to actually complete. I tried looking at the asyncio source code to understand if asyncio.wait just effectively does the same thing with its tasks/futures under the hood, but it wasn't obvious.
Question 2
I assume if one of the tasks is in the middle of a long-running blocking operation it may not actually cancel immediately? Perhaps that just depends on if the underlying operation or library being used will raise the CancelledError right away or not? Maybe that should never happen with libraries designed for asyncio?
Since I'm trying to implement a timeout feature here I'm somewhat sensitive to this. If it's possible these things could take a long time to cancel I'd consider calling cancel and not waiting for it to actually happen, or setting a very short timeout to wait for the cancels to finish.
Question 3
Is it possible loop.run_until_complete (or really, the underlying call to asyncio.wait) returns values in unfinished for a reason other than a timeout? If so I'd obviously have to adjust my logic a bit, but from the docs it seems like that is not possible.
Are there any practical differences between Options 1 and 2?
No. Option 2 looks nicer and might be marginally more efficient, but their net effect is the same.
I know run_until_complete will run until the future has completed, so since Option 1 is looping in a specific order I suppose it could behave differently if earlier tasks take longer to actually complete.
It seems that way at first, but it's not actually the case because loop.run_until_complete runs all tasks submitted to the loop, not just the one passed as argument. It merely stops once the provided awaitable completes - that is what "run until complete" refers to. A loop calling run_until_complete over already scheduled tasks is like the following async code:
ts = [asyncio.create_task(asyncio.sleep(i)) for i in range(1, 11)]
# takes 10s, not 55s
for t in ts:
    await t
which is in turn semantically equivalent to the following threaded code:
ts = []
for i in range(1, 11):
    t = threading.Thread(target=time.sleep, args=(i,))
    t.start()
    ts.append(t)
# takes 10s, not 55s
for t in ts:
    t.join()
In other words, await t and run_until_complete(t) block until t has completed, but allow everything else - such as tasks previously scheduled using asyncio.create_task() - to run during that time as well. So the total run time will equal the run time of the longest task, not their sum. For example, if the first task happens to take a long time, all the others will have finished in the meantime, and their awaits won't sleep at all.
All this only applies to awaiting tasks that have been previously scheduled. If you try to apply that to coroutines, it won't work:
# runs for 55s, as expected
for i in range(1, 11):
    await asyncio.sleep(i)

# also 55s - we didn't call create_task() so it's equivalent to the above
ts = [asyncio.sleep(i) for i in range(1, 11)]
for t in ts:
    await t

# also 55s
for i in range(1, 11):
    t = threading.Thread(target=time.sleep, args=(i,))
    t.start()
    t.join()
This is often a sticking point for asyncio beginners, who write code equivalent to that last asyncio example and expect it to run in parallel. The fix, sketched below, is to schedule the coroutines as tasks (or hand them to gather) before awaiting them.
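A one-line illustration of the concurrent version, in the same style as the snippets above (i.e. assumed to run inside a coroutine); this is my own addition, consistent with the create_task example earlier:
# runs in ~10s: gather() wraps the bare coroutines in tasks, so they run concurrently
await asyncio.gather(*(asyncio.sleep(i) for i in range(1, 11)))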
I tried looking at the asyncio source code to understand if asyncio.wait just effectively does the same thing with its tasks/futures under the hood, but it wasn't obvious.
asyncio.wait is just a convenience API that does two things:
converts the input arguments to something that implements Future. For coroutines that means that it submits them to the event loop, as if with create_task, which allows them to run independently. If you give it tasks to begin with, as you do, this step is skipped.
uses add_done_callback to be notified when the futures are done, at which point it resumes its caller.
So yes, it does the same things, but with a different implementation because it supports many more features.
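To make those two steps concrete, here is a rough conceptual sketch of an "all completed" wait - emphatically not asyncio's actual implementation, and ignoring timeout and return_when:
import asyncio

async def naive_wait(aws):
    # step 1: coroutines become tasks so they run independently;
    # objects that are already tasks/futures pass through ensure_future unchanged
    futures = [asyncio.ensure_future(aw) for aw in aws]
    if not futures:
        return set(), set()
    waiter = asyncio.get_running_loop().create_future()
    remaining = len(futures)

    def on_done(_f):
        nonlocal remaining
        remaining -= 1
        if remaining == 0 and not waiter.done():
            waiter.set_result(None)

    # step 2: add_done_callback resumes the caller once everything is done
    for f in futures:
        f.add_done_callback(on_done)
    await waiter
    return set(futures), set()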
I assume if one of the tasks is in the middle of a long-running blocking operation it may not actually cancel immediately?
In asyncio there shouldn't be "blocking" operations, only those that suspend, and they should be cancelled immediately. The exception to this is blocking code tacked onto asyncio with run_in_executor, where the underlying operation won't cancel at all, but the asyncio coroutine will immediately get the exception.
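A self-contained illustration of that run_in_executor behavior (my own sketch, using the default thread pool executor):
import asyncio
import time

def blocking_func():
    time.sleep(2)               # a genuinely blocking call running in a worker thread
    return "finished anyway"

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.run_in_executor(None, blocking_func)
    await asyncio.sleep(0.1)
    fut.cancel()                # the thread keeps running blocking_func...
    try:
        await fut
    except asyncio.CancelledError:
        print("...but the awaiting coroutine gets CancelledError immediately")

asyncio.run(main())             # the interpreter still waits for the worker thread before exiting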
Perhaps that just depends on if the underlying operation or library being used will raise the CancelledError right away or not?
The library doesn't raise CancelledError, it receives it at the await point where it happened to be suspended when the cancellation occurred. For the library, the effect of the cancellation is that the await ... is interrupted and immediately raises CancelledError. Unless caught, the exception propagates through function and await calls all the way up to the top-level coroutine, and the top-level coroutine raising CancelledError is what marks the whole task as cancelled. Well-behaved asyncio code will do just that, possibly using finally to release any OS-level resources it holds. When CancelledError is caught, the code can choose not to re-raise it, in which case the cancellation is effectively ignored.
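A small self-contained illustration of that propagation (my own example):
import asyncio

async def well_behaved():
    try:
        await asyncio.sleep(10)          # the await where CancelledError gets delivered
    finally:
        print("finally runs, so resources can be released")
    # the exception is not caught/ignored, so it propagates and the task ends up cancelled

async def main():
    task = asyncio.ensure_future(well_behaved())
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("task.cancelled() ->", task.cancelled())   # True

asyncio.run(main())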
Is it possible loop.run_until_complete (or really, the underlying call to asyncio.wait) returns values in unfinished for a reason other than a timeout?
If you're using return_when=asyncio.ALL_COMPLETED (the default), that shouldn't be possible. It is, however, quite possible with return_when=asyncio.FIRST_COMPLETED, independently of any timeout.
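For example (my own quick illustration, not part of the original answer):
import asyncio

async def main():
    tasks = [asyncio.ensure_future(asyncio.sleep(d)) for d in (0.1, 5)]
    # no timeout here, yet `unfinished` is non-empty because of FIRST_COMPLETED
    finished, unfinished = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print(len(finished), len(unfinished))   # 1 1
    for t in unfinished:
        t.cancel()
    await asyncio.gather(*unfinished, return_exceptions=True)

asyncio.run(main())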

Start processing async tasks while adding them to the event loop

A common pattern with asyncio, like the one shown here, is to add a collection of coroutines to a list, and then asyncio.gather them.
For instance:
async def some_task(i):
    # Do something asynchronously with i
    ...

tasks = [some_task(i) for i in range(100)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
Here, the execution order of this code is such that none of the tasks are running while we build up the list. We add task 1 to the list, then task 2, etc. and then we add the tasks 1-100 to the event loop.
However, I want task creation itself to be part of the event loop. I want task 1 to be scheduled immediately as it's created, and then, when task 1 is waiting for something on another thread, return to task creation and create task 2 and add it to the event loop.
I believe this would give me better concurrency from my async code. Is this possible?
For example, my first thought would be to put task creation into a coroutine and schedule tasks as they are created:
async def some_task(i):
    # Do something asynchronously with i
    ...

async def generate_tasks(loop):
    tasks = []
    for i in range(100):
        task = loop.create_task(some_task(i))
        tasks.append(task)
    await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(generate_tasks(loop))
However, because my generate_tasks never uses await, execution is never passed back to the event loop, so the entirety of generate_tasks will run before some_task() is run at all.
But then, if I await each task as they are created, it will wait for each task to complete before moving on to the next task, giving me no concurrency at all!
async def generate_tasks(loop):
    tasks = []
    for i in range(100):
        await some_task(i)

loop.run_until_complete(generate_tasks(loop))
However, because my generate_tasks never uses await, execution is never passed back to the event loop
You can use await asyncio.sleep(0) to force yielding to the event loop inside the for loop. But that is unlikely to make a difference; creating a task/coroutine pair is really cheap.
Before optimizing this, measure (with something as simple as time.time if need be) how much time it takes to execute the [some_task(i) for i in range(100)] list comprehension. Then consider whether dispersing that time (possibly making the whole thing take longer due to increased scheduling overhead) will make any difference for your application. The results might surprise you.
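For completeness, a small sketch of what the sleep(0) approach would look like (my own illustration; some_task here is a stand-in for the real work):
import asyncio

async def some_task(i):
    await asyncio.sleep(0.01)          # stand-in for real asynchronous work

async def generate_tasks():
    tasks = []
    for i in range(100):
        tasks.append(asyncio.ensure_future(some_task(i)))
        await asyncio.sleep(0)         # yield so already-created tasks can start running
    await asyncio.gather(*tasks)

asyncio.run(generate_tasks())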

asyncio with multiple processors [duplicate]

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like construct, and when someone tries to connect, accept that connection and pass it along to another thread-like construct for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2 MB cap on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    # pool = multiprocessing.Pool()
    # out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demonstrating that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1,2,3,4,5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.async(example(queue, event, lock)),
        asyncio.async(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance, you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel (a rough sketch of this follows below). Once a worker has finished processing a request, it can go accept the next connection, so you still use fewer resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effect of memory leaks.
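A rough, hedged sketch of that pre-fork accept model (my own illustration, assuming a Unix-like platform where multiprocessing uses the fork start method so the listening socket is inherited by the workers; the port and handler are arbitrary):
import asyncio
import multiprocessing
import os
import socket

async def handle(reader, writer):
    data = await reader.readline()
    writer.write(b"handled by pid %d: " % os.getpid() + data)
    await writer.drain()
    writer.close()

def worker(sock):
    async def serve():
        # each worker runs its own event loop and accepts from the shared socket
        server = await asyncio.start_server(handle, sock=sock)
        async with server:
            await server.serve_forever()
    asyncio.run(serve())

if __name__ == "__main__":
    # one listening socket, shared by several worker processes that accept in parallel
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", 8888))
    sock.listen(100)
    workers = [multiprocessing.Process(target=worker, args=(sock,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()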
Based on #dano's answer above, I wrote this function to replace the places where I used to use a multiprocessing pool + map.
import asyncio
from typing import Callable
from concurrent.futures import ProcessPoolExecutor

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    by letting the caller drop in a replacement:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that Executor is defined in concurrent.futures, so I suggest you also have a look there.
