Objective
I am processing some files then uploading them to the cloud. My objective is to have an async producer and consumer. The producer copies data from a database, and compresses files, while the consumer uploads files to the cloud. My objective is to produce tasks as soon as they are available but only consume them one by one (I am trying to avoid sending too many files to s3 as I get throttled by aws).
What I have tried to do
I have successfully created a queue of a consumer and producer; however, my producer seems to be producing one item at a time. Here is the code I have tried (please let me know if I can add any further code):
import asyncio
import concurrent.futures
import logging
from asyncio import Queue
import aiobotocore
import asyncpg
import uvloop
from asyncpg.pool import Pool
from aws_async import upload_to_s3
from helpers import gather_with_concurrency, chunk_query, build_query, parquet_data
from pg_async import pg_copy
from resources.config import ARGS, PG_CONFIG, AWS_S3_CONFIG
# Defaults to CPU count and spawns up to 32 threads
EXECUTOR = concurrent.futures.ThreadPoolExecutor()
async def producer(query: str, pool: Pool, q: Queue):
# I expect all that loop to run async but it runs 1 by 1..
for qr in query:
# Process query and place (asynchronously) into queue
query_data = await pg_copy(pool, qr)
await q.put(query_data)
async def consume(queue, client, bucket):
while True:
item = await queue.get()
if item is None:
break
await upload_to_s3(client, bucket, item)
queue.task_done()
async def main():
# Define parameters based on parsed args
queries = build_query(ARGS.query, bool(ARGS.chunks), chunk_query, chunks=ARGS.chunks, chunk_col=ARGS.chunk_col)
# Define pool
pool = await asyncpg.create_pool(**PG_CONFIG)
# Define Aiobotocore async client
async with aiobotocore.get_session().create_client(**AWS_S3_CONFIG) as client:
# Prepare tasks from queries
queue = asyncio.Queue()
consumer = asyncio.create_task(consume(queue, client, bucket=ARGS.bucket))
await gather_with_concurrency(ARGS.parallelism, producer(queries, pool, queue))
# wait for the remaining tasks to be processed
await queue.join()
consumer.cancel()
uvloop.install()
asyncio.run(main())
Thanks to this answer, my gather_with_concurrency (which I use to throttle concurrent tasks running at a time) looks like:
async def gather_with_concurrency(n: int, *tasks):
semaphore = asyncio.Semaphore(n)
async def sem_task(task):
async with semaphore:
return await task
return await asyncio.gather(*(sem_task(task) for task in tasks))
What I am trying to achieve
The current behavior is that the producer is producing items 1 by 1. I want to have my producer processing all tasks asynchronously and place them into the queue as soon as they are made available (while still keeping the throttle that I have created), but my consumer consuming 1 task at a time.
Happy to add any further information if needed = )
Edit:
Thanks to dirn for helping me clarify my question earlier
Related
I have multiple couroutines each of which waits for content in a queue to start processing.
The content for the queues is populated by channel subscribers whose job is only to receive messages a push an item in the appropriate queue.
After the data is consumed by one queue processor and new data is generated it's dispatched to the appropriate message channel where this process is repeated until the data is ready to be relayed to an api that provisions it.
import asyncio
from random import randint
from Models.ConsumerStrategies import Strategy
from Helpers.Log import Log
import Connectors.datastore as ds
import json
__name__ = "Consumer"
MIN = 1
MAX = 4
async def consume(configuration: dict, queue: str, processor: Strategy) -> None:
"""Consumes new items in queue and publish a message into the appropriate channel with the data generated for the next consumer,
if no new content is available sleep for a random number of seconds between MIN and MAX global variables
Args:
configuration (dict): configuration dictionary
queue (str): queue being consumed
processor (Strategy): consumer strategy
"""
logger = Log().get_logger(processor.__name__, configuration['logFolder'], configuration['logFormat'], configuration['USE'])
while True:
try:
ds_handle = await ds.get_datastore_handle(ds.get_uri(conf=configuration))
token = await ds_handle.lpop(queue)
if token is not None:
result = await processor.consume(json.loads(token), ds_handle)
status = await processor.relay(result, ds_handle)
logger.debug(status)
else:
wait_for = randint(MIN,MAX)
logger.debug(f'queue: {queue} empty waiting: {wait_for} before retry')
await asyncio.sleep(wait_for)
ds_handle.close()
except Exception as e:
logger.error(f"{e}")
logger.error(f"{e.with_traceback}")
What I'm noticing is that after a 24h run I'm getting these errors:
Task was destroyed but it is pending!
task: <Task pending name='Task-2' coro=<consume() running at Services/Consumer.py:26> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f86bc29cbe0>()]> cb=[_chain_future.<locals>._call_set_state() at asyncio/futures.py:391]>
Task was destroyed but it is pending!
task: <Task pending name='Task-426485' coro=<RedisConnection._read_data() done, defined at aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f86bc29ccd0>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at aioredis/connection.py:168]>
Which I'm not sure on how to interpret, resolve or recover from, my assumption is that first I should probably switch to redis streams instead of using channels and queues.
But, going back to this scenarios I have channel subscribers on different processes while the consumer run in the same process as different tasks in the loop.
What I'm assuming is happening here is that since the consumer is basically polling a queue at some point the connection pool manager or redis itself eventually starts hanging up on the connection open of the consumer and it gets cancelled.
Cause I'm not seeing any further message from that queue processor, but I also see that wait_for_future which I'm uncertain it may come from the subscriber ensure_future on the message reader
import asyncio
from multiprocessing import process
from Helpers.Log import Log
import Services.Metas as metas
import Models.SubscriberStrategies as processor
import Connectors.datastore as ds_linker
import Models.Exceptions as Exceptions
async def subscriber(conf: dict, channel: str, processor: processor.Strategy) -> None:
"""Subscription handler. Receives the channel name, datastore connection and a parsing strategy.
Creates a task that listens on the channel and process every message and processing strategy for the specific message
Args:
conf (dict): configuration dictionary
channel (str): channel to subscribe to
ds (aioredis.connection): connection handler to datastore
processor (processor.Strategy): processor message handler
"""
async def reader(ch):
while await ch.wait_message():
msg = await ch.get_json()
await processor.handle_message(msg=msg)
ds_uri = ds_linker.get_uri(conf=conf)
ds = await ds_linker.get_datastore_handle(ds_uri)
pub = await ds.subscribe(channel)
ch = pub[0]
tsk = asyncio.ensure_future(reader(ch))
await tsk
I could use some help to sort this out and properly understand what's happening under the hood. thanks
Took a few days to solve just to reproduce the issue, I've found people with the same problem in the issues for the aioredis github repo.
So I had to go through all the connection open/close with redis to be sure added:
ds_handle.close()
await ds_handle.wait_closed()
I also proceeded to improve the exception management in the consumer:
while True:
try:
ds_handle = await ds.get_datastore_handle(ds.get_uri(conf=configuration))
token = await ds_handle.lpop(queue)
if token is not None:
result = await processor.consume(json.loads(token), ds_handle)
status = await processor.relay(result, ds_handle)
logger.debug(status)
else:
wait_for = randint(MIN,MAX)
logger.debug(f'queue: {queue} empty waiting: {wait_for} before retry')
await asyncio.sleep(wait_for)
except Exception as e:
logger.error(f"{e}")
logger.error(f"{traceback.print_exc()}")
finally:
ds_handle.close()
await ds_handle.wait_closed()
and the same for the producer:
try:
async def reader(ch):
while await ch.wait_message():
msg = await ch.get_json()
await processor.handle_message(msg=msg)
ds_uri = ds_linker.get_uri(conf=conf)
ds = await ds_linker.get_datastore_handle(ds_uri)
pub = await ds.subscribe(channel)
ch = pub[0]
tsk = asyncio.ensure_future(reader(ch))
await tsk
except Exception as e:
logger.debug(f'{e}')
logger.error(f'{traceback.format_exc()}')
finally:
ds.close()
await ds.wait_closed()
so there are never connections left open with redis that might end up killing one of the processor's coroutines as time goes by.
For me it solved the issue, since at the time I'm writing this it has been more than 2 weeks uptime with no more reported accidents of the same kind.
Anyway, there is also a new aioredis major release, it's really recent news (this was on 1.3.1 and 2.0.0 should work using the same model as redis-py, so things have changed as well by this time).
I'm watching a video(a) on YouTube about asyncio and, at one point, code like the following is presented for efficiently handling multiple HTTP requests:
# Need an event loop for doing this.
loop = asyncio.get_event_loop()
# Task creation section.
tasks = []
for n in range(1, 50):
tasks.append(loop.create_task(get_html(f"https://example.com/things?id={n}")))
# Task processing section.
for task in tasks:
html = await task
thing = get_thing_from_html(html)
print(f"Thing found: {thing}", flush=True)
I realise that this is efficient in the sense that everything runs concurrently but what concerns me is a case like:
the first task taking a full minute; but
all the others finishing in under three seconds.
Because the task processing section awaits completion of the tasks in the order in which they entered the list, it appears to me that none will be reported as complete until the first one completes.
At that point, the others that finished long ago will also be reported. Is my understanding correct?
If so, what is the normal way to handle that scenario, so that you're getting completion notification for each task the instant that task finishes?
(a) From Michael Kennedy of "Talk Python To Me" podcast fame. The video is Demystifying Python's Async and Await Keywords if you're interested. I have no affiliation with the site other than enjoying the podcast, so heartily recommend it.
If you just need to do something after each task, you can create another async function that does it, and run those in parallel:
async def wrapped_get_html(url):
html = await get_html(url)
thing = get_thing_from_html(html)
print(f"Thing found: {thing}")
async def main():
# shorthand for creating tasks and awaiting them all
await asyncio.gather(*
[wrapped_get_html(f"https://example.com/things?id={n}")
for n in range(50)])
asyncio.run(main())
If for some reason you need your main loop to be notified, you can do that with as_completed:
async def main():
for next_done in asyncio.as_completed([
get_html(f"https://example.com/things?id={n}")
for n in range(50)]):
html = await next_done
thing = get_thing_from_html(html)
print(f"Thing found: {thing}")
asyncio.run(main())
You can make the tasks to run in parallel with the code example below. I introduced asyncio.gather to make task run concurrently. Also I demonstrated poison pill technique and daemon task technique.
Please follow comments in code and feel free to ask questions if you have any.
import asyncio
from random import randint
WORKERS_NUMBER = 5
URL_NUM = 20
async def producer(task_q: asyncio.Queue) -> None:
"""Produce tasks and send them to workers"""
print("Producer-Task Started")
# imagine that it is a list of urls
for i in range(URL_NUM):
await task_q.put(i)
# send poison pill to workers
for i in range(WORKERS_NUMBER):
await task_q.put(None)
print("Producer-Task Finished")
async def results_shower(result_q: asyncio.Queue) -> None:
"""Receives results from worker tasks and show the result"""
while True:
res = await result_q.get()
print(res)
result_q.task_done() # confirm that task is done
async def worker(
name: str,
task_q: asyncio.Queue,
result_q: asyncio.Queue,
) -> None:
"""Get's tasks from task_q, do some job and send results to result_q"""
print(f"Worker {name} Started")
while True:
task = await task_q.get()
# if worker received poison pill - break
if task is None:
break
await asyncio.sleep(randint(1, 10))
result = task ** 2
await result_q.put(result)
print(f"Worker {name} Finished")
async def amain():
"""Wrapper around all async ops in the app"""
_task_q = asyncio.Queue(maxsize=5) # just some random maxsize
_results_q = asyncio.Queue(maxsize=5) # just some random maxsize
# we run results_shower as a "daemon task", so we never await
# if asyncio loop has nothing else to do, loop stops
# without waiting for "daemon task"
asyncio.create_task(results_shower(_results_q))
# gather block means that we run task in parallel and wait till all the task are finished
await asyncio.gather(
producer(_task_q),
*[worker(f"W-{i}", _task_q, _results_q) for i in range(WORKERS_NUMBER)]
)
# q.join() prevents loop from stopping, until results_shower print all task result
# it has some internal counter, which is decreased by task_done and increases
# q.put(). If counter is 0, the q can join.
await _results_q.join()
print("All work is finished!")
if __name__ == '__main__':
asyncio.run(amain())
I want to run a simple background task in FastAPI, which involves some computation before dumping it into the database. However, the computation would block it from receiving any more requests.
from fastapi import BackgroundTasks, FastAPI
app = FastAPI()
db = Database()
async def task(data):
otherdata = await db.fetch("some sql")
newdata = somelongcomputation(data,otherdata) # this blocks other requests
await db.execute("some sql",newdata)
#app.post("/profile")
async def profile(data: Data, background_tasks: BackgroundTasks):
background_tasks.add_task(task, data)
return {}
What is the best way to solve this issue?
Your task is defined as async, which means fastapi (or rather starlette) will run it in the asyncio event loop.
And because somelongcomputation is synchronous (i.e. not waiting on some IO, but doing computation) it will block the event loop as long as it is running.
I see a few ways of solving this:
Use more workers (e.g. uvicorn main:app --workers 4). This will allow up to 4 somelongcomputation in parallel.
Rewrite your task to not be async (i.e. define it as def task(data): ... etc). Then starlette will run it in a separate thread.
Use fastapi.concurrency.run_in_threadpool, which will also run it in a separate thread. Like so:
from fastapi.concurrency import run_in_threadpool
async def task(data):
otherdata = await db.fetch("some sql")
newdata = await run_in_threadpool(lambda: somelongcomputation(data, otherdata))
await db.execute("some sql", newdata)
Or use asyncios's run_in_executor directly (which run_in_threadpool uses under the hood):
import asyncio
async def task(data):
otherdata = await db.fetch("some sql")
loop = asyncio.get_running_loop()
newdata = await loop.run_in_executor(None, lambda: somelongcomputation(data, otherdata))
await db.execute("some sql", newdata)
You could even pass in a concurrent.futures.ProcessPoolExecutor as the first argument to run_in_executor to run it in a separate process.
Spawn a separate thread / process yourself. E.g. using concurrent.futures.
Use something more heavy-handed like celery. (Also mentioned in the fastapi docs here).
If your task is CPU bound you could use multiprocessing, there is way to do that with Background task in FastAPI:
https://stackoverflow.com/a/63171013
Although you should consider to use something like Celery if there are lot of cpu-heavy tasks.
Read this issue.
Also in the example below, my_model.function_b could be any blocking function or process.
TL;DR
from starlette.concurrency import run_in_threadpool
#app.get("/long_answer")
async def long_answer():
rst = await run_in_threadpool(my_model.function_b, arg_1, arg_2)
return rst
This is a example of Background Task To FastAPI
from fastapi import FastAPI
import asyncio
app = FastAPI()
x = [1] # a global variable x
#app.get("/")
def hello():
return {"message": "hello", "x":x}
async def periodic():
while True:
# code to run periodically starts here
x[0] += 1
print(f"x is now {x}")
# code to run periodically ends here
# sleep for 3 seconds after running above code
await asyncio.sleep(3)
#app.on_event("startup")
async def schedule_periodic():
loop = asyncio.get_event_loop()
loop.create_task(periodic())
if __name__ == "__main__":
import uvicorn
uvicorn.run(app)
I have subscribed to a MQ queue. Every time I get a message, I pass it a function that then performs a number of time-consuming I/O actions on it.
The issue is that everything happens serially.
A request comes in, it picks up the request, performs the action by calling the function, and then picks up the next request.
I want to do this asynchronously so that multiple requests can be dealt with in an async manner.
results = []
queue = queue.subscribe(name)
async for message in queue:
yield my_funcion(message)
The biggest issue is that my_function is slow because it calls external web services and I want my code to process other messages in the meantime.
I tried to implement it above but it doesn't work! I am not sure how to implement async here.
I can't create a task because I don't know how many requests will be received. It's a MQ which I have subscribed to. I loop over each message and perform an action. I don't want for the function to complete before I perform the action on the next message. I want it to happen asynchronously.
If I understand your request, what you need is a queue that your request handlers fill, and that you read from from the code that needs to do something with the results.
If you insist on an async iterator, it is straightforward to use a generator to expose the contents of a queue. For example:
def make_asyncgen():
queue = asyncio.Queue(1)
async def feed(item):
await queue.put(item)
async def exhaust():
while True:
item = await queue.get()
yield item
return feed, exhaust()
make_asyncgen returns two objects: an async function and an async generator. The two are connected in such a way that, when you call the function with an item, the item gets emitted by the generator. For example:
import random, asyncio
# Emulate a server that takes some time to process each message,
# and then provides a result. Here it takes an async function
# that it will call with the result.
async def serve(server_ident, on_message):
while True:
await asyncio.sleep(random.uniform(1, 5))
await on_message('%s %s' % (server_ident, random.random()))
async def main():
# create the feed function, and the generator
feed, get = make_asyncgen()
# subscribe to serve several requests in parallel
asyncio.create_task(serve('foo', feed))
asyncio.create_task(serve('bar', feed))
asyncio.create_task(serve('baz', feed))
# process results from all three servers as they arrive
async for msg in get:
print('received', msg)
asyncio.run(main())
I adapted this code for using Google Cloud PubSub in Async Python: https://github.com/cloudfind/google-pubsub-asyncio
import asyncio
import datetime
import functools
import os
from google.cloud import pubsub
from google.gax.errors import RetryError
from grpc import StatusCode
async def message_producer():
""" Publish messages which consist of the current datetime """
while True:
await asyncio.sleep(0.1)
async def proc_message(message):
await asyncio.sleep(0.1)
print(message)
message.ack()
def main():
""" Main program """
loop = asyncio.get_event_loop()
topic = "projects/{project_id}/topics/{topic}".format(
project_id=PROJECT, topic=TOPIC)
subscription_name = "projects/{project_id}/subscriptions/{subscription}".format(
project_id=PROJECT, subscription=SUBSCRIPTION)
subscription = make_subscription(
topic, subscription_name)
def create_proc_message_task(message):
""" Callback handler for the subscription; schedule a task on the event loop """
print("Task created!")
task = loop.create_task(proc_message(message))
subscription.open(create_proc_message_task)
# Produce some messages to consume
loop.create_task(message_producer())
print("Subscribed, let's do this!")
loop.run_forever()
def make_subscription(topic, subscription_name):
""" Make a publisher and subscriber client, and create the necessary resources """
subscriber = pubsub.SubscriberClient()
try:
subscriber.create_subscription(subscription_name, topic)
except:
pass
subscription = subscriber.subscribe(subscription_name)
return subscription
if __name__ == "__main__":
main()
I basically removed the publishing code and only use the subscription code.
However, initially I did not include the loop.create_task(message_producer()) line. I figured that tasks were created as they were supposed to however they never actually run themselves. Only if I add said line the code properly executes and all created Tasks run. What causes this behaviour?
PubSub is calling the create_proc_message_task callback from a different thread. Since create_task is not thread-safe, it must only be called from the thread that runs the event loop (typically the main thread). To correct the issue, replace loop.create_task(proc_message(message)) with asyncio.run_coroutine_threadsafe(proc_message(message), loop) and message_producer will no longer be needed.
As for why message_producer appeared to fix the code, consider that run_coroutine_threadsafe does two additional things compared to create_task:
It operates in a thread-safe fashion, so the event loop data structures are not corrupted when this is done concurrently.
It ensures that the event loop wakes up at the soonest possible opportunity, so that it can process the new task.
In your case create_task added the task to the loop's runnable queue (without any locking), but failed to ensure the wakeup, because that is not needed when running in the event loop thread. The message_producer then served to force the loop to wake up in regular intervals, which is when it also checks and executes the runnable tasks.