I am writing a function for my team that will download some data from the cloud. The function itself is a regular Python function, but under the hood it uses asyncio: I create an event loop within the function and have async coroutines do the downloading concurrently. After the data is downloaded, I process it and return the results.
My function works as expected when I call it from any other Python function. But when I try to parallelize it using multiprocessing, I occasionally see IOErrors.
I tried to search for an example of how to achieve this but couldn't find any. I only see recommendations to use concurrent.futures and have the event loop's run_in_executor do the parallelization. That is not an option for me because I want to hide all the async stuff from my team and just provide them with this simple Python function that they can call from their code (possibly with multiprocessing). I have seen arguments online for why this is a bad idea and why I shouldn't conceal the async stuff, but in my case my team aren't savvy programmers. They would never use (or bother to understand) asyncio, so a simple Python function is what works best for us.
Lastly, here is a pseudo example that shows what I am trying to do:
import asyncio
import aiohttp
from typing import List

async def _async_fetch_data(symbol: str) -> bytes:
    '''
    Download stock data for given symbol from yahoo finance.
    '''
    async with asyncio.BoundedSemaphore(50), aiohttp.ClientSession() as session:
        try:
            url = f'https://query1.finance.yahoo.com/v8/finance/chart/{symbol}?symbol={symbol}&period1=0&period2=9999999999&interval=1d'
            async with session.get(url) as response:
                return await response.read()
        except:
            return None

def fetch_data(symbols: List[str]) -> List[bytes]:
    '''
    Gateway function that wraps the under the hood async stuff
    '''
    coroutine_list = [_async_fetch_data(x) for x in symbols]
    if len(coroutine_list) == 0:
        return []
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    data = loop.run_until_complete(asyncio.wait(coroutine_list))[0]
    loop.close()
    return [d.result() for d in data if d.result() is not None]
This works alright if I run it as
>>> data = fetch_data(['AAPL', 'GOOG'])
But I am not sure whether it will run alright when I do
>>> from multiprocessing import Pool as ProcessPool
>>> with ProcessPool(2) as pool:
...     data = [j for i in pool.map(fetch_data, [['AAPL', 'GOOG'], ['AMZN', 'MSFT']]) for j in i]
I am seeing occasional IOErrors, but I cannot reproduce them, and I am not sure whether they happen because I am mixing asyncio with multiprocessing or because of something else.
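For reference, here is a variant I have been experimenting with: every call builds a fresh event loop via asyncio.run(), and the session and semaphore are created inside that loop, so no loop-bound state can leak across process boundaries. I don't know yet whether this eliminates the IOErrors:

import asyncio
import aiohttp
from typing import List, Optional

async def _fetch_one(session: aiohttp.ClientSession,
                     semaphore: asyncio.Semaphore,
                     symbol: str) -> Optional[bytes]:
    url = (f'https://query1.finance.yahoo.com/v8/finance/chart/{symbol}'
           f'?symbol={symbol}&period1=0&period2=9999999999&interval=1d')
    try:
        async with semaphore:
            async with session.get(url) as response:
                return await response.read()
    except aiohttp.ClientError:
        return None

async def _fetch_all(symbols: List[str]) -> List[Optional[bytes]]:
    # The session and semaphore are created inside the running loop,
    # so nothing loop-bound survives between calls or processes.
    semaphore = asyncio.Semaphore(50)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(_fetch_one(session, semaphore, s) for s in symbols))

def fetch_data(symbols: List[str]) -> List[bytes]:
    if not symbols:
        return []
    # asyncio.run() creates and tears down a fresh event loop per call.
    results = asyncio.run(_fetch_all(symbols))
    return [r for r in results if r is not None]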
I have been looking for a Python equivalent of JavaScript's await Promise.all() functionality, which led me to asyncio.gather(). After reading a few explanations and following a few examples, I haven't managed to get anything working asynchronously.
The task is straightforward: extract values remotely from multiple files in S3, then collect the results when all are finished. I have done this in JS, where it takes little over a second to read from 12 files.
The code is written for FastAPI, and a simplified form of it is below. The reason I know this is not working asynchronously is that the more files in S3 it reads from, the longer it takes.
I have seen documentation for this kind of thing, but as it is not working for me I am not sure whether I am doing something wrong or whether this just won't work in my use case. I am worried that streaming from a remote file using rasterio just doesn't work here.
How can I change the code below so that it calls the functions concurrently and collects all the responses when they are all completed? I haven't used this feature in Python before, so I just need a little more clarification.
async def read_from_file(s3_path):
    # The important thing to note here is that it
    # is streaming from a file in s3 given an s3 path
    with rasterio.open(s3_path) as src:
        values = src.read(1, window=Window(1, 2, 1, 1))
        return values[0][0]

@app.get("/get-all")
async def get_all():
    start_time = datetime.datetime.now()
    # example paths
    s3_paths = [
        "s3:file-1",
        "s3:file-2",
        "s3:file-3",
        "s3:file-4",
        "s3:file-5",
        "s3:file-6",
    ]
    values = await asyncio.gather(
        read_from_file(s3_paths[0]),
        read_from_file(s3_paths[1]),
        read_from_file(s3_paths[2]),
        read_from_file(s3_paths[3]),
        read_from_file(s3_paths[4]),
        read_from_file(s3_paths[5]),
    )
    end_time = datetime.datetime.now()
    logger.info(f"duration: {end_time-start_time}")
Python asyncio has a mechanism to run non-async code, like the calls to the rasterio lib, in other threads, so that the async loop is not blocked.
Try this code:
import asyncio
from functools import partial

import rasterio
from rasterio.windows import Window

async def read_from_file(s3_path):
    # The important thing to note here is that it
    # is streaming from a file in s3 given an s3 path
    loop = asyncio.get_running_loop()
    src = await loop.run_in_executor(None, rasterio.open, s3_path)
    try:
        values = await loop.run_in_executor(
            None, partial(src.read, 1, window=Window(1, 2, 1, 1)))
    finally:
        src.close()  # might be interesting to parallelize this as well
    return values[0][0]
If it needs to be faster, you can create a custom executor: the default one only uses a limited number of threads (based on the CPU count), which might slow things down when the bottleneck is network latency; somewhere around 20 threads might be interesting. This executor should be either a global resource or passed as a parameter to your read_from_file, and it is a plain concurrent.futures.ThreadPoolExecutor (https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor).
As for the run_in_executor, check https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
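For illustration, here is a rough sketch of what such a shared custom executor could look like (the size of 20 is just a guess for network-bound work, not a measured value):

import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import rasterio
from rasterio.windows import Window

# Shared, module-level executor sized for network-bound work
# (20 is a starting guess; tune it for your workload).
io_executor = ThreadPoolExecutor(max_workers=20)

async def read_from_file(s3_path):
    loop = asyncio.get_running_loop()
    # Both the open and the read run on the shared executor's threads.
    src = await loop.run_in_executor(io_executor, rasterio.open, s3_path)
    try:
        values = await loop.run_in_executor(
            io_executor, partial(src.read, 1, window=Window(1, 2, 1, 1)))
    finally:
        src.close()
    return values[0][0]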
I've written a Python library that currently makes several independent HTTP requests in serial. I'd like to parallelize these requests without altering how the library is called by users, or requiring users to be aware that calls are being made asynchronously under the hood. The library is meant for novice/intermediate Python users mostly using Jupyter, and I'd like it to work without introducing them to unfamiliar async/await semantics.
The following example, which works in Jupyter, illustrates what I'd like to achieve but requires use of await to invoke the code on the final line:
import asyncio

async def first_request():
    await asyncio.sleep(2)  # Simulate request time
    return "First request response"

async def second_request():
    await asyncio.sleep(2)
    return "Second request response"

async def make_requests_in_parallel():
    """Make requests in parallel and return the responses."""
    return await asyncio.gather(first_request(), second_request())

results = await make_requests_in_parallel()  # Undesirable use of `await`
I've found previous answers describing how to call async code from synchronous code using asyncio.run(). In the Jupyter example above, I can replace the final line with the following to create a working, importable Python module:
def main():
    """Make results available to async-naive users"""
    return asyncio.run(make_requests_in_parallel())

results = main()  # No `await` needed to get results -- good!
This seems to be what I want. However, in Jupyter, the code will produce an error:
RuntimeError: asyncio.run() cannot be called from a running event loop
A comment on the same answer explains that because Jupyter runs its own async event loop, there is no need (or, apparently, option) to start another one, so async code can "simply" be called using await. In my situation, though, avoiding await is the whole reason I wanted to use asyncio.run() in the first place.
This seems to suggest that existing synchronous libraries cannot, by any means, internally parallelize any operation using asyncio without altering their public API to require use of await. Is this true?
If so, are there more practical alternatives to asyncio that would let me parallelize a group of requests in an internal function without educating my users about async/await?
I found a great solution for this: nest_asyncio.
Once installed, the working solution in Jupyter is as follows:
import asyncio
import nest_asyncio

nest_asyncio.apply()

async def first_request():
    await asyncio.sleep(2)  # Simulate request time
    return "First request response"

async def second_request():
    await asyncio.sleep(2)
    return "Second request response"

async def make_requests_in_parallel():
    """Make requests in parallel and return the responses."""
    return await asyncio.gather(first_request(), second_request())

def main():
    """Make results available to async-naive users"""
    return asyncio.run(make_requests_in_parallel())

results = main()  # No `await` needed to get results
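An alternative I have seen, if you would rather not patch Jupyter's loop with nest_asyncio, is to hand the coroutine to a worker thread that runs its own event loop; asyncio.run() is then never invoked from inside a running loop. A minimal sketch:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def main():
    """Run the coroutine on a worker thread with its own event loop."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # asyncio.run() executes in the worker thread, so it never
        # collides with Jupyter's already-running loop.
        return pool.submit(asyncio.run, make_requests_in_parallel()).result()

results = main()  # Works in plain scripts and in Jupyter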
I have async code that looks like this:
There's a third-party function that performs some operations on a string and returns a modified string; for the purpose of this question, it's something like non_async_func.
I have an async def async_func_single function that wraps around non_async_func and performs a single operation.
Then there is another async def async_func_batch function that nested-wraps around async_func_single to perform the function for a batch of data.
The code kind of works, but I would like to know more about why/how. My questions are:
Is it necessary to create the async_func_single and have async_func_batch wrap around it?
Can I directly just feed in a batch of data in async_func_batch to call non_async_func?
I have a per_chunk function that feeds in the data in batches; are there any asyncio operations/functions that could avoid the need to pre-batch the data I want to send to async_func_batch?
import nest_asyncio
nest_asyncio.apply()

import asyncio
from itertools import zip_longest

from loremipsum import get_sentences

def per_chunk(iterable, n=1, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def non_async_func(text):
    return text[::-1]

async def async_func_single(text):
    # Perform some string operation.
    return non_async_func(text)

async def async_func_batch(batch):
    tasks = [async_func_single(text) for text in batch]
    return await asyncio.gather(*tasks)

# Create some random inputs
thousand_texts = get_sentences(1000)

# Loop through 20 sentences at a time.
for batch in per_chunk(thousand_texts, n=20):
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(async_func_batch(batch))
    for i, o in zip(thousand_texts, results):
        print(i, o)
Note that marking your functions with "async def" rather than "def" doesn't automatically make them asynchronous - you can have "async def" functions that are synchronous. The difference between asynchronous functions and synchronous ones is that asynchronous functions define places (using "await") where they wait on either another asynchronous function or on an asynchronous IO operation.
Also note that asyncio is not magic - it is basically a scheduler that schedules asynchronous functions to be run based on whether the function/operation that is being "awaited" has completed. And, as the scheduler and the asynchronous functions all run on a single thread, then at any given moment, only a single asynchronous function can be running.
So, going back to your code, the only thing your "async_func_single" function does is call a synchronous function, so despite being marked "async def", it is still a synchronous function. The same logic applies to "async_func_batch": the "async_func_single" tasks passed to "asyncio.gather" are all synchronous, so "asyncio.gather" just runs each task synchronously (offering no benefit over a simple for loop that waits on each task), and so "async_func_batch" is also a synchronous function. Because you are only calling synchronous functions, asyncio is not offering any benefits to your program.
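To make that concrete, here is a small self-contained demonstration: both functions below are "async def" but never await anything that yields, so gather() runs them back to back and the total time is roughly the sum, not the maximum.

import asyncio
import time

async def looks_async(name):
    time.sleep(1)  # Blocking call: never yields to the event loop.
    return name

async def main():
    start = time.perf_counter()
    await asyncio.gather(looks_async("a"), looks_async("b"))
    # Prints ~2.0s: the two "async" functions ran one after the other.
    print(f"elapsed: {time.perf_counter() - start:.1f}s")

asyncio.run(main())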
If you want multiple synchronous functions to run at the same time, you don't use asynchronous functions; you need to run them in parallel processes/threads:
import os
import itertools
import concurrent.futures

from loremipsum import get_sentences

executor = concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count())

def per_chunk(iterable, n=1):
    # Take an iterator over the data so islice() consumes it,
    # rather than re-reading the first n items forever.
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if chunk:
            yield chunk
        else:
            break

def non_async_func(text):
    return text[::-1]

def process_batch(batch):
    # Apply the synchronous function to every text in the batch.
    return [non_async_func(text) for text in batch]

def process_batches(batches):
    futures = [
        executor.submit(process_batch, batch)
        for batch in batches
    ]
    concurrent.futures.wait(futures)

thousand_texts = get_sentences(1000)
process_batches(per_chunk(thousand_texts, n=20))
If you still want to use an asynchronous function to process the batches, then asyncio provides asynchronous wrappers around the concurrent futures:
import asyncio

async def process_batches(batches):
    event_loop = asyncio.get_running_loop()
    futures = [
        event_loop.run_in_executor(executor, process_batch, batch)
        for batch in batches
    ]
    await asyncio.wait(futures)

thousand_texts = get_sentences(1000)
asyncio.run(process_batches(per_chunk(thousand_texts, n=20)))
but it gives no advantage unless you have other asynchronous functions that can run while it is waiting.
I have tried to answer your questions below.
The code kind of works, but I would like to know more about why/how. My questions are:
Is it necessary to create the async_func_single and have async_func_batch wrap around it?
No, this is absolutely not necessary.
Can I directly just feed in a batch of data in async_func_batch to call non_async_func?
You could do something like the example 1 below, where you feed all the data directly.
I have a per_chunk function that feeds in the data in batches; are there any asyncio operations/functions that could avoid the need to pre-batch the data I want to send to async_func_batch?
It's possible to use Asyncio Queues with a max size and then process data until the queue is empty and fill it up again. Check out example 2.
Example 1
import asyncio
from concurrent.futures import ThreadPoolExecutor

from loremipsum import get_sentences

def non_async_func(text):
    return text[::-1]

async def async_func_batch(batch):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=20) as executor:
        futures = [loop.run_in_executor(executor, non_async_func, text) for text in batch]
        return await asyncio.gather(*futures)

# Create some random inputs
thousand_texts = get_sentences(1000)

# Process all the sentences at once, 20 threads at a time.
loop = asyncio.get_event_loop()
results = loop.run_until_complete(async_func_batch(thousand_texts))
for i, o in zip(thousand_texts, results):
    print(i, o)
Example 2
Queues can be unbounded. If you do not specify maxsize, the queue will accumulate all elements before processing. If you remove maxsize, you need to move the join outside of the for loop and remove the taskQueue.full() check.
import asyncio

from loremipsum import get_sentences

async def async_func(text, taskQueue, resultsQueue):
    await resultsQueue.put(text[::-1])  # Add the result to the resultsQueue
    taskQueue.task_done()   # Tell the taskQueue that the task is finished
    taskQueue.get_nowait()  # Don't wait for it (unblocking)

async def main():
    taskQueue = asyncio.Queue(maxsize=20)
    resultsQueue = asyncio.Queue()
    thousand_texts = get_sentences(1000)
    results = []
    for text in thousand_texts:
        await taskQueue.put(asyncio.create_task(async_func(text, taskQueue, resultsQueue)))
        if taskQueue.full():  # If maxsize is reached
            await taskQueue.join()  # Will block until finished
    while not resultsQueue.empty():
        results.append(await resultsQueue.get())
    for i, o in zip(thousand_texts, results):
        print(i, o)

if __name__ == "__main__":
    asyncio.run(main())
I launch a bunch of requests using aiohttp. Is there a way to get the results one by one, as soon as each request completes?
Perhaps using something like async for? Or Python 3.6 async generators?
Currently I await asyncio.gather(*requests) and process them when all of them are completed.
asyncio has the as_completed function that probably does what you need. Note that it returns a regular iterator, not an async one.
Here's an example of usage:
import asyncio

async def test(i):
    await asyncio.sleep(i)
    return i

async def main():
    fs = [
        test(1),
        test(2),
        test(3),
    ]
    for f in asyncio.as_completed(fs):
        i = await f  # Await for next result.
        print(i, 'done')

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    loop.run_until_complete(main())
finally:
    loop.run_until_complete(loop.shutdown_asyncgens())
    loop.close()
Output:
1 done
2 done
3 done
The canonical way is to push results into an asyncio.Queue, as in the crawler example.
It is also wise to run a limited number of download tasks that pull new jobs from an input queue, rather than spawning a million new tasks; a rough sketch follows.
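Here is a sketch of that pattern (fetch() below is a hypothetical stand-in for your actual aiohttp request coroutine): a fixed pool of workers pulls jobs from an input queue and pushes results to an output queue as they complete.

import asyncio

async def worker(in_queue, out_queue):
    # Each worker keeps pulling jobs until it is cancelled.
    while True:
        url = await in_queue.get()
        try:
            result = await fetch(url)  # hypothetical request coroutine
            await out_queue.put(result)
        finally:
            in_queue.task_done()

async def crawl(urls, concurrency=10):
    in_queue, out_queue = asyncio.Queue(), asyncio.Queue()
    for url in urls:
        in_queue.put_nowait(url)
    workers = [asyncio.ensure_future(worker(in_queue, out_queue))
               for _ in range(concurrency)]
    await in_queue.join()  # Blocks until every job is marked done.
    for w in workers:
        w.cancel()  # The workers loop forever, so cancel them.
    return [out_queue.get_nowait() for _ in range(out_queue.qsize())]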
As I understand from the docs, requests are Futures (or can easily be converted to a Future using asyncio.ensure_future).
A Future object has a method .add_done_callback.
So you can add your callback to every request, and then gather.
Docs for Future.add_done_callback
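A small sketch of that approach (test() is just a stand-in coroutine): each callback fires as soon as its own task finishes, before gather() returns.

import asyncio

async def test(i):
    await asyncio.sleep(i)  # Stand-in for an actual aiohttp request.
    return i

def on_done(task):
    # Fires as soon as this particular task completes.
    print(task.result(), 'done')

async def main():
    tasks = [asyncio.ensure_future(test(i)) for i in (3, 2, 1)]
    for t in tasks:
        t.add_done_callback(on_done)
    await asyncio.gather(*tasks)

asyncio.run(main())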
I've written a library of objects, many of which make HTTP / IO calls. I've been looking at moving over to asyncio due to the mounting overheads, but I don't want to rewrite the underlying code.
I've been hoping to wrap asyncio around my code in order to perform functions asynchronously without replacing all of my deep / low-level code with await / yield.
I began by attempting the following:
async def my_function1(some_object, some_params):
    # Lots of existing code which uses existing objects
    # No await statements
    return output_data

async def my_function2():
    # Does more stuff
    ...

while True:
    loop = asyncio.get_event_loop()
    tasks = my_function1(some_object, some_params), my_function2()
    output_data = loop.run_until_complete(asyncio.gather(*tasks))
    print(output_data)
I quickly realised that while this code runs, nothing actually happens asynchronously; the functions complete synchronously. I'm very new to asynchronous programming, but I think this is because neither of my functions uses the keyword await or yield, and thus these functions are not coroutines and never yield control, so there is no opportunity to switch to a different coroutine. Please correct me if I am wrong.
My question is: is it possible to wrap complex functions (which deep within make HTTP / IO calls) in an asyncio await keyword, e.g.
async def my_function():
    print("Welcome to my function")
    data = await bigSlowFunction()
UPDATE - Following Karlson's Answer
Following, and thanks to, Karlson's accepted answer, I used the following code which works nicely:
from concurrent.futures import ThreadPoolExecutor
import time

# Some vars
a_var_1 = 0
a_var_2 = 10

pool = ThreadPoolExecutor(3)
future = pool.submit(my_big_function, object, a_var_1, a_var_2)
while not future.done():
    print("Waiting for future...")
    time.sleep(0.01)
print("Future done")
print(future.result())
This works really nicely, and the future.done() / sleep loop gives you an idea of how many CPU cycles you get to use by going async.
The short answer is that you can't have the benefits of asyncio without explicitly marking the points in your code where control may be passed back to the event loop. This is done by turning your IO-heavy functions into coroutines, just as you assumed.
Without changing existing code you might achieve your goal with greenlets (have a look at eventlet or gevent).
Another possibility is to use Python's Future implementation: wrap calls to your already-written functions, pass them to a ThreadPoolExecutor, and await the resulting Future. Be aware that this comes with all the caveats of multi-threaded programming, though.
Something along the lines of
import asyncio
from concurrent.futures import ThreadPoolExecutor

from thinair import big_slow_function

executor = ThreadPoolExecutor(max_workers=5)

async def big_slow_coroutine():
    # A concurrent.futures.Future is not directly awaitable;
    # wrap it so the event loop can wait on it.
    return await asyncio.wrap_future(executor.submit(big_slow_function))
As of Python 3.9 you can wrap a blocking (non-async) function in a coroutine to make it awaitable using asyncio.to_thread(). The example given in the official documentation is:
import asyncio
import time

def blocking_io():
    print(f"start blocking_io at {time.strftime('%X')}")
    # Note that time.sleep() can be replaced with any blocking
    # IO-bound operation, such as file operations.
    time.sleep(1)
    print(f"blocking_io complete at {time.strftime('%X')}")

async def main():
    print(f"started main at {time.strftime('%X')}")
    await asyncio.gather(
        asyncio.to_thread(blocking_io),
        asyncio.sleep(1))
    print(f"finished main at {time.strftime('%X')}")

asyncio.run(main())

# Expected output:
#
# started main at 19:50:53
# start blocking_io at 19:50:53
# blocking_io complete at 19:50:54
# finished main at 19:50:54
This seems like a more joined-up approach than using concurrent.futures to make a coroutine, but I haven't tested it extensively.
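One detail worth knowing: asyncio.to_thread() forwards positional and keyword arguments to the wrapped function and returns its result, so adapting an existing blocking helper can be a one-liner. A small sketch (download_blocking is a made-up stand-in for existing synchronous code):

import asyncio
import urllib.request

def download_blocking(url, timeout=30):
    # Existing synchronous code; any blocking call works here.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

async def download(url):
    # to_thread() passes *args/**kwargs through and returns the result.
    return await asyncio.to_thread(download_blocking, url, timeout=10)

async def main():
    data = await download("https://example.com")
    print(len(data), "bytes")

asyncio.run(main())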