I'm learning about the magic of concurrency in Python and I have a script in which I've been using await (via FastAPI's framework).
I need to fetch several result sets from my database and do something like:
DBresults1 = await db_conn.fetch_rows(query_statement1)
DBresults2 = await db_conn.fetch_rows(query_statement2)
DBresults3 = await db_conn.fetch_rows(query_statement3)
#then on to processing...
The problem is that my awaits cause the queries to run sequentially, which is getting slow.
Is there a way to start my 3 queries (or any function calls, for that matter) without awaiting each one individually, and instead await until the entire group is done, so they execute concurrently?
You can use gather to await multiple tasks non-sequentially. You mentioned FastAPI, so this example might help. It does a 3 second task and a 2 second task in ~3 seconds total. Although the first task takes longer, gather will give you the results in the order you list them, not the order they complete.
For this to work, the long-running part needs to be actual awaitable IO (simulated with asyncio.sleep here), not CPU-bound work.
If you run this in a terminal and then call GET localhost:8080 from Postman (or whatever is convenient for you), you'll see what's happening in the logs.
import logging
import asyncio
import time

from fastapi import FastAPI

logging.basicConfig(level=logging.INFO, format="%(levelname)-9s %(asctime)s - %(name)s - %(message)s")
LOGGER = logging.getLogger(__name__)

app = FastAPI()


async def takes_three():
    LOGGER.info("will sleep for 3s...")
    await asyncio.sleep(3)
    LOGGER.info("takes_three awake!")
    return "takes_three done"


async def takes_two():
    LOGGER.info("will sleep for 2s...")
    await asyncio.sleep(2)
    LOGGER.info("takes_two awake!")
    return "takes_two done"


@app.get("/")
async def root():
    start_time = time.time()
    LOGGER.info("starting...")
    results = await asyncio.gather(*[takes_three(), takes_two()])
    duration = time.time() - start_time
    LOGGER.info(f"results: {results} after {duration:.2f}")
    return f"finished after {duration:.2f}s!"


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8080)
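Applied to the original question, the same pattern collapses the three sequential awaits into one. A minimal sketch, assuming db_conn.fetch_rows is a coroutine function as in the question:

# Start all three queries and wait for the whole group to finish.
# gather returns the results in the order the coroutines are passed in.
DBresults1, DBresults2, DBresults3 = await asyncio.gather(
    db_conn.fetch_rows(query_statement1),
    db_conn.fetch_rows(query_statement2),
    db_conn.fetch_rows(query_statement3),
)
# then on to processing...

Whether the three queries actually overlap depends on the driver: some async drivers serialize queries issued on a single connection, in which case you'd want a connection pool so each query gets its own connection.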
You can use asyncio.wait like in the example below.
async def get_suggest(entity, ids, company_id):
    ...
    return ...


# asyncio.wait expects Tasks (passing bare coroutines is deprecated and removed
# in Python 3.11), so wrap each call in asyncio.create_task().
jobs = []
for entity, ids in dict(data.filters).items():
    jobs.append(asyncio.create_task(get_suggest(entity, ids, company_id)))

result = {}
done, _ = await asyncio.wait(jobs)
for job in done:
    values, entity = job.result()
    result[entity] = values
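Note that the done set from asyncio.wait is unordered, which is why the snippet keys the results by entity. If you would rather get the results back in the same order as the jobs list, asyncio.gather is an alternative:

# gather preserves the order of the awaitables it is given,
# so results[i] corresponds to jobs[i].
results = await asyncio.gather(*jobs)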
I built an API wrapper module in Python with aiohttp that allows me to significantly speed up the process of making multiple GET requests and retrieving data. Every data response is turned into a pandas DataFrame.
Using asyncio I do something that looks like this:
import asyncio

from custom_module import CustomAioClient

id_list = ["123", "456"]


async def main():
    client = CustomAioClient()

    tasks = []
    for id in id_list:
        task = asyncio.ensure_future(client.get_latest_value(id=id))
        tasks.append(task)

    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Close the session
    await client.close_session()
    return responses


if __name__ == "__main__":
    asyncio.run(main())
This returns a list of pandas DataFrames with time series for each id in the id_list that I want to save as csv files. I am a bit confused on how to proceed here.
Obviously I could just iterate over the list and save every DataFrame iteratively, but this seems highly inefficient to me. Is there a way to improve things here?
Edit
I did the following to save things, and it is much faster than just iterating over multiple URLs, getting the data and saving it. I'm not sure whether this fully makes use of the asynchronous functionality, though.
import asyncio

from custom_module import CustomAioClient


async def fetch(client: CustomAioClient, id: str):
    df = await client.get_latest_value(id=id)
    df.to_csv(f"C:/{id}.csv")
    print(df)


async def main():
    client = CustomAioClient()
    id_list = ["123", "456"]

    tasks = []
    for id in id_list:
        task = asyncio.ensure_future(fetch(client=client, id=id))
        tasks.append(task)

    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Close the session
    await client.close_session()


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
Check out the example at the end of the asyncio article on Real Python.
That example takes the approach of defining a function that performs a single operation (fetching the data and then writing it to a file), and then a bulk method that fans out multiple such requests.
Also, the 'with' keyword should be used for actions that require specific setup and cleanup, such as opening a file or making a connection to a server.
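In that spirit, here is a minimal sketch of the single-operation-plus-bulk pattern applied to the question's CustomAioClient (carried over from the question as an assumption). It uses asyncio.to_thread (Python 3.9+) so the blocking df.to_csv call doesn't stall the event loop:

import asyncio

from custom_module import CustomAioClient


async def fetch_and_save(client: CustomAioClient, id: str) -> None:
    # Single operation: get the data, then write it out.
    df = await client.get_latest_value(id=id)
    # df.to_csv blocks, so push it onto a worker thread.
    await asyncio.to_thread(df.to_csv, f"{id}.csv")


async def bulk_fetch_and_save(id_list: list[str]) -> None:
    client = CustomAioClient()
    try:
        await asyncio.gather(*(fetch_and_save(client, id) for id in id_list))
    finally:
        await client.close_session()


if __name__ == "__main__":
    asyncio.run(bulk_fetch_and_save(["123", "456"]))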
You could declare a simple function that downloads the DataFrame and saves it to a CSV file. Then you could call this function using a ThreadPoolExecutor and the asyncio event loop, something like this:
import asyncio
from concurrent.futures import ThreadPoolExecutor

from custom_module import CustomAioClient


def download_to_csv(client: CustomAioClient, id: str) -> None:
    # Assumes a blocking variant of get_latest_value here; the async client
    # from the question would have to be awaited instead.
    df = client.get_latest_value(id=id)
    df.to_csv(f"{id}.csv")


async def process(id_list: list[str]) -> None:
    client = CustomAioClient()
    with ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        # Pass the callable and its arguments; don't call it here.
        tasks = [loop.run_in_executor(executor, download_to_csv, client, id) for id in id_list]
        await asyncio.gather(*tasks)


id_list = ["123", "456"]

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(process(id_list))
    loop.run_until_complete(future)
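As a side note, on Python 3.7+ the driver block at the bottom can be simplified; asyncio.run creates, runs and closes the event loop for you:

if __name__ == "__main__":
    asyncio.run(process(id_list))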
There is a function that blocks the event loop (e.g. it makes an API request). I need to make a continuous stream of requests that run in parallel, but not synchronously, so that each new request is started before the previous one has finished.
I found an answered question with the loop.run_in_executor() solution and used it at first:
import asyncio
import requests

# blocking_request_func() defined somewhere


async def main():
    loop = asyncio.get_event_loop()
    future1 = loop.run_in_executor(None, blocking_request_func, 'param')
    future2 = loop.run_in_executor(None, blocking_request_func, 'param')
    response1 = await future1
    response2 = await future2
    print(response1)
    print(response2)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
This works well and the requests run in parallel, but there is a problem for my task: in this example the whole group of tasks/futures is built up front and then run together. What I need is something like this:
1. Send request_1 without waiting for it to finish.
(AFTER step 1 has started, but NOT at the same moment it starts:)
2. Send request_2 without waiting for it to finish.
(AFTER step 2 has started, but NOT at the same moment it starts:)
3. Send request_3 without waiting for it to finish.
(Request 1, or any other, returns its response.)
(AFTER step 3 has started, but NOT at the same moment it starts:)
4. Send request_4 without waiting for it to finish.
(Request 2, or any other, returns its response.)
and so on...
I tried using asyncio.TaskGroup():
async def request_func():
    global result  # the list of request results, defined somewhere in the global area
    loop = asyncio.get_event_loop()
    result.append(await loop.run_in_executor(None, blocking_request_func, 'param'))
    await asyncio.sleep(0)  # adding or removing this line gives the same result


async def main():
    async with asyncio.TaskGroup() as tg:
        for i in range(0, 10):
            tg.create_task(request_func())
All of these approaches gave the same result: the group of tasks/futures is defined first, and only then is the whole group run concurrently. Is there a way to run all these requests concurrently but "in a stream", starting them one after another?
I tried to make a visualization in case my explanation is not clear enough:
[diagram: What I have for now]
[diagram: What I need]
================ Update with the answer ===================
The closest answer, though with some limitations:
import asyncio
import concurrent.futures
import random
import time


def blockme(n):
    x = random.random() * 2.0
    time.sleep(x)
    return n, x


def cb(fut):
    print("Result", fut.result())


async def main():
    # You need to control the number of threads
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
    loop = asyncio.get_event_loop()
    futs = []
    n = 0

    # You need to control requests per second
    delay = 0.5
    while await asyncio.sleep(delay, result=True):
        fut = loop.run_in_executor(pool, blockme, n)
        fut.add_done_callback(cb)
        futs.append(fut)
        n += 1

        # You need to control the number of outstanding futures, e.g. like this:
        if len(futs) > 40:
            done, pending = await asyncio.wait(futs,
                                               timeout=5,
                                               return_when=asyncio.FIRST_COMPLETED)
            futs = list(pending)


asyncio.run(main())
I think this might be what you want. You don't have to await each request - the run_in_executor function returns a Future. Instead of awaiting that, you can attach a callback function:
import asyncio
import random
import time


def blockme(n):
    x = random.random() * 2.0
    time.sleep(x)
    return n, x


def cb(fut):
    print("Result", fut.result())


async def main():
    loop = asyncio.get_event_loop()
    futs = []
    for n in range(20):
        fut = loop.run_in_executor(None, blockme, n)
        fut.add_done_callback(cb)
        futs.append(fut)
    await asyncio.gather(*futs)
    # await asyncio.sleep(10)


asyncio.run(main())
All the requests are started at the beginning, but they don't all execute in parallel because the number of threads is limited by the ThreadPool. You can adjust the number of threads if you want.
Here I simulated a blocking call with time.sleep. I needed a way to prevent main() from ending before all the callbacks occurred, so I used gather for that purpose. You can also wait for some length of time, but gather is cleaner.
Apologies if I don't understand what you want. But I think you want to avoid using await for each call, and I tried to show one way you can do that.
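If you want to cap how many blocking calls run at once (instead of relying on the default executor's thread count), you can pass an explicit ThreadPoolExecutor to run_in_executor. A small variation of the main() above, assuming the same blockme and cb:

import asyncio
import concurrent.futures


async def main():
    loop = asyncio.get_event_loop()
    # At most 4 blocking calls run at the same time; the rest queue up.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
    futs = []
    for n in range(20):
        fut = loop.run_in_executor(pool, blockme, n)
        fut.add_done_callback(cb)
        futs.append(fut)
    await asyncio.gather(*futs)


asyncio.run(main())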
This is taken directly from the Python documentation. The snippet from the asyncio library docs shows how you can run blocking code concurrently using asyncio; it uses the to_thread method to create the task.
You can find more here: https://docs.python.org/3/library/asyncio-task.html#running-in-threads
import asyncio
import time


def blocking_io():
    print(f"start blocking_io at {time.strftime('%X')}")
    # Note that time.sleep() can be replaced with any blocking
    # IO-bound operation, such as file operations.
    time.sleep(1)
    print(f"blocking_io complete at {time.strftime('%X')}")


async def main():
    print(f"started main at {time.strftime('%X')}")

    await asyncio.gather(
        asyncio.to_thread(blocking_io),
        asyncio.sleep(1))

    print(f"finished main at {time.strftime('%X')}")


asyncio.run(main())
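For the "stream" behaviour asked about above, the same to_thread building block (Python 3.9+) can be combined with a small delay between task creations, so each request starts shortly after the previous one without waiting for it to finish. A rough sketch, assuming a blocking blocking_request_func as in the question:

import asyncio


async def main():
    tasks = []
    for i in range(10):
        # Start the next blocking call in a worker thread without awaiting it...
        tasks.append(asyncio.create_task(asyncio.to_thread(blocking_request_func, 'param')))
        # ...then pause briefly before launching the next one.
        await asyncio.sleep(0.5)

    # Finally wait for everything that is still in flight.
    results = await asyncio.gather(*tasks)
    print(results)


asyncio.run(main())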
I want to run a simple background task in FastAPI, which involves some computation before dumping it into the database. However, the computation blocks the server from receiving any more requests.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
db = Database()


async def task(data):
    otherdata = await db.fetch("some sql")
    newdata = somelongcomputation(data, otherdata)  # this blocks other requests
    await db.execute("some sql", newdata)


@app.post("/profile")
async def profile(data: Data, background_tasks: BackgroundTasks):
    background_tasks.add_task(task, data)
    return {}
What is the best way to solve this issue?
Your task is defined as async, which means fastapi (or rather starlette) will run it in the asyncio event loop.
And because somelongcomputation is synchronous (i.e. not waiting on some IO, but doing computation) it will block the event loop as long as it is running.
I see a few ways of solving this:
Use more workers (e.g. uvicorn main:app --workers 4). This will allow up to 4 somelongcomputation calls to run in parallel.
Rewrite your task to not be async (i.e. define it as def task(data): ... etc). Then starlette will run it in a separate thread.
Use fastapi.concurrency.run_in_threadpool, which will also run it in a separate thread. Like so:
from fastapi.concurrency import run_in_threadpool

async def task(data):
    otherdata = await db.fetch("some sql")
    newdata = await run_in_threadpool(lambda: somelongcomputation(data, otherdata))
    await db.execute("some sql", newdata)
Or use asyncios's run_in_executor directly (which run_in_threadpool uses under the hood):
import asyncio

async def task(data):
    otherdata = await db.fetch("some sql")
    loop = asyncio.get_running_loop()
    newdata = await loop.run_in_executor(None, lambda: somelongcomputation(data, otherdata))
    await db.execute("some sql", newdata)
You could even pass in a concurrent.futures.ProcessPoolExecutor as the first argument to run_in_executor to run it in a separate process (a sketch of that follows after this list).
Spawn a separate thread / process yourself. E.g. using concurrent.futures.
Use something more heavy-handed like celery. (Also mentioned in the fastapi docs here).
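Here is a rough sketch of the ProcessPoolExecutor variant mentioned above. A lambda can't be pickled, so the function and its arguments are passed to run_in_executor directly, and the executor is created once rather than per request:

import asyncio
from concurrent.futures import ProcessPoolExecutor

process_pool = ProcessPoolExecutor()  # create once, reuse across requests


async def task(data):
    otherdata = await db.fetch("some sql")
    loop = asyncio.get_running_loop()
    # Arguments are pickled and sent to the worker process, so they must be picklable.
    newdata = await loop.run_in_executor(process_pool, somelongcomputation, data, otherdata)
    await db.execute("some sql", newdata)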
If your task is CPU bound, you could use multiprocessing; there is a way to do that with background tasks in FastAPI:
https://stackoverflow.com/a/63171013
Although you should consider using something like Celery if there are a lot of CPU-heavy tasks.
Read this issue.
Also in the example below, my_model.function_b could be any blocking function or process.
TL;DR
from starlette.concurrency import run_in_threadpool


@app.get("/long_answer")
async def long_answer():
    rst = await run_in_threadpool(my_model.function_b, arg_1, arg_2)
    return rst
This is an example of a background task in FastAPI.
from fastapi import FastAPI
import asyncio

app = FastAPI()

x = [1]  # a global variable x


@app.get("/")
def hello():
    return {"message": "hello", "x": x}


async def periodic():
    while True:
        # code to run periodically starts here
        x[0] += 1
        print(f"x is now {x}")
        # code to run periodically ends here
        # sleep for 3 seconds after running above code
        await asyncio.sleep(3)


@app.on_event("startup")
async def schedule_periodic():
    loop = asyncio.get_event_loop()
    loop.create_task(periodic())


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app)
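On recent FastAPI versions the on_event startup hook is deprecated in favour of a lifespan handler; an equivalent sketch of just the scheduling part (reusing the periodic coroutine above) would look roughly like this:

from contextlib import asynccontextmanager

from fastapi import FastAPI
import asyncio


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once at startup: schedule the periodic task on the running loop.
    task = asyncio.create_task(periodic())
    yield
    # Runs once at shutdown: stop the background task.
    task.cancel()


app = FastAPI(lifespan=lifespan)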
I am running a program that makes three different requests to a REST API. The data, indicator, and request functions all fetch data from BitMEX's API using a wrapper I've made.
I have used asyncio to try to speed up the process so that while I am waiting on a response from the previous request, it can begin to make another one.
However, my asynchronous version is not running any quicker for some reason. The code works and, as far as I know, I have set everything up correctly, but there could be something wrong with how I am setting up the coroutines?
Here is the asynchronous version:
import time
import asyncio

from bordemwrapper import BitMEXData, BitMEXFunctions

'''
asynchronous I/O
'''


async def data():
    data = BitMEXData().get_ohlcv(symbol='XBTUSD', timeframe='1h',
                                  instances=25)
    await asyncio.sleep(0)
    return data


async def indicator():
    indicator = BitMEXData().get_indicator(symbol='XBTUSD',
                                           timeframe='1h', indicator='RSI', period=20, source='close',
                                           instances=25)
    await asyncio.sleep(0)
    return indicator


async def request():
    request = BitMEXFunctions().get_price()
    await asyncio.sleep(0)
    return request


async def chain():
    data_ = await data()
    indicator_ = await indicator()
    request_ = await request()
    return data_, indicator_, request_


async def main():
    await asyncio.gather(chain())


if __name__ == '__main__':
    start = time.perf_counter()
    asyncio.run(main())
    end = time.perf_counter()
    print('process finished in {} seconds'.format(end - start))
Unfortunately, asyncio isn't magic. Although you've put them in async functions, the BitMEXData().get_<foo> functions are not themselves async (i.e. you can't await them), and therefore block while they run. The concurrency in asyncio can only occur while awaiting something.
You'll need a library which makes the actual HTTP requests asynchronously, like aiohttp. It sounds like you wrote bordemwrapper yourself - you should rewrite the get_<foo> functions to use asynchronous HTTP requests. Feel free to submit a separate question if you need help with that.
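Note also that chain() awaits the three coroutines one after another, so even with awaitable calls they would still run sequentially as written. Until the wrapper has real async support, one stopgap (not the aiohttp rewrite recommended above, just a workaround) is to push the existing blocking calls onto threads with asyncio.to_thread (Python 3.9+) and gather them. A rough sketch using the wrapper calls from the question:

import asyncio

from bordemwrapper import BitMEXData, BitMEXFunctions


async def main():
    # Each blocking wrapper call runs in its own worker thread,
    # so the three requests overlap instead of running back to back.
    data_, indicator_, request_ = await asyncio.gather(
        asyncio.to_thread(BitMEXData().get_ohlcv, symbol='XBTUSD',
                          timeframe='1h', instances=25),
        asyncio.to_thread(BitMEXData().get_indicator, symbol='XBTUSD',
                          timeframe='1h', indicator='RSI', period=20,
                          source='close', instances=25),
        asyncio.to_thread(BitMEXFunctions().get_price),
    )
    return data_, indicator_, request_


asyncio.run(main())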
I'm writing a script to make millions of API calls in parallel.
I'm using Python 3.6 with aiohttp for this purpose.
I was expecting that uvloop would make it faster, but it seems to have made it slower. Am I doing something wrong?
with uvloop: 22 seconds
without uvloop: 15 seconds
import asyncio
import aiohttp
import uvloop
import time
import logging

from aiohttp import ClientSession, TCPConnector

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

urls = ["http://www.yahoo.com", "http://www.bbcnews.com", "http://www.cnn.com", "http://www.buzzfeed.com", "http://www.walmart.com", "http://www.emirates.com", "http://www.kayak.com", "http://www.expedia.com", "http://www.apple.com", "http://www.youtube.com"]
bigurls = 10 * urls


def run(enable_uvloop):
    try:
        if enable_uvloop:
            loop = uvloop.new_event_loop()
        else:
            loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        start = time.time()
        conn = TCPConnector(limit=5000, use_dns_cache=True, loop=loop, verify_ssl=False)
        with ClientSession(connector=conn) as session:
            tasks = asyncio.gather(*[asyncio.ensure_future(do_request(url, session)) for url in bigurls])  # tasks to do
            results = loop.run_until_complete(tasks)  # loop until done
            end = time.time()
            logger.debug('total time:')
            logger.debug(end - start)
            return results
        loop.close()
    except Exception as e:
        logger.error(e, exc_info=True)


async def do_request(url, session):
    """
    """
    try:
        async with session.get(url) as response:
            resp = await response.text()
            return resp
    except Exception as e:
        logger.error(e, exc_info=True)


run(True)
# run(False)
aiohttp recommends using aiodns.
Also, as I remember, the with ClientSession(connector=conn) as session: line should use async with.
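For reference, a minimal sketch of the corrected session handling (async context manager plus an awaited gather inside a coroutine), assuming the same do_request and bigurls as in the question:

async def run_all():
    conn = TCPConnector(limit=5000, use_dns_cache=True)
    # ClientSession is an async context manager, so enter it with "async with".
    async with ClientSession(connector=conn) as session:
        # The coroutines can be passed to gather directly; ensure_future isn't needed here.
        return await asyncio.gather(*[do_request(url, session) for url in bigurls])


results = asyncio.run(run_all())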
You're not alone; I actually just got similar results (which led me to google my findings and brought me here).
My experiment involves running 500 concurrent GET requests to Google.com using aiohttp.
Here is the code for reference:
import asyncio, aiohttp, concurrent.futures
from datetime import datetime
import uvloop


class UVloopTester():
    def __init__(self):
        self.timeout = 20
        self.threads = 500
        self.totalTime = 0
        self.totalRequests = 0

    @staticmethod
    def timestamp():
        return f'[{datetime.now().strftime("%H:%M:%S")}]'

    async def getCheck(self):
        async with aiohttp.ClientSession() as session:
            response = await session.get('https://www.google.com', timeout=self.timeout)
            response.close()
            await session.close()
            return True

    async def testRun(self, id):
        now = datetime.now()
        try:
            if await self.getCheck():
                elapsed = (datetime.now() - now).total_seconds()
                print(f'{self.timestamp()} Request {id} TTC: {elapsed}')
                self.totalTime += elapsed
                self.totalRequests += 1
        except concurrent.futures._base.TimeoutError:
            print(f'{self.timestamp()} Request {id} timed out')

    async def main(self):
        await asyncio.gather(*[asyncio.ensure_future(self.testRun(x)) for x in range(self.threads)])

    def start(self):
        # comment these lines to toggle
        uvloop.install()
        asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

        loop = asyncio.get_event_loop()
        now = datetime.now()
        loop.run_until_complete(self.main())
        elapsed = (datetime.now() - now).total_seconds()
        print(f'{self.timestamp()} Main TTC: {elapsed}')
        print()
        print(f'{self.timestamp()} Average TTC per Request: {self.totalTime / self.totalRequests}')

        if len(asyncio.Task.all_tasks()) > 0:
            for task in asyncio.Task.all_tasks():
                task.cancel()
            try:
                loop.run_until_complete(asyncio.gather(*asyncio.Task.all_tasks()))
            except asyncio.CancelledError:
                pass
        loop.close()


test = UVloopTester()
test.start()
I haven't planned out and executed any sort of careful experiment where I'm logging my findings and calculating standard deviations and p-values. But I have run this a (tiring) number of times and have come up with the following results.
Running without uvloop:
loop.run_until_complete(main()) takes about 10 seconds.
The average time to complete a request is about 4 seconds.
Running with uvloop:
loop.run_until_complete(main()) takes about 16 seconds.
The average time to complete a request is about 8.5 seconds.
I've shared this code with a friend of mine who is actually the one who suggested I try uvloop (since he gets a speed boost from it). Upon running it several times, his results confirm that he does in fact see an increase in speed from using uvloop (shorter time to complete for both main() and requests on average).
Our findings lead me to believe that the differences in our findings have to do with our setups: I'm using a Debian virtual machine with 8 GB RAM on a mid-tier laptop while he's using a native Linux desktop with a lot more 'muscle' under the hood.
My answer to your question is: No, I do not believe you are doing anything wrong because I am experiencing the same results and it does not appear that I am doing anything wrong although any constructive criticism is welcome and appreciated.
I wish I could be of more help; I hope my chiming in can be of some use.
I tried a similar experiment and see no real difference between uvloop and asyncio event loops for parallel http GET's:
asyncio event loop: avg=3.6285968542099 s. stdev=0.5583842811362075 s.
uvloop event loop: avg=3.419699764251709 s. stdev=0.13423859428541632 s.
It might be that the noticeable benefits of uvloop come into play when it is used in server code, i.e. for handling many incoming requests.
Code:
import time
from statistics import mean, stdev
import asyncio
import uvloop
import aiohttp

urls = [
    'https://aws.amazon.com', 'https://google.com', 'https://microsoft.com', 'https://www.oracle.com/index.html',
    'https://www.python.org', 'https://nodejs.org', 'https://angular.io', 'https://www.djangoproject.com',
    'https://reactjs.org', 'https://www.mongodb.com', 'https://reinvent.awsevents.com',
    'https://kafka.apache.org', 'https://github.com', 'https://slack.com', 'https://authy.com',
    'https://cnn.com', 'https://fox.com', 'https://nbc.com', 'https://www.aljazeera.com',
    'https://fly4.emirates.com', 'https://www.klm.com', 'https://www.china-airlines.com',
    'https://en.wikipedia.org/wiki/List_of_Unicode_characters', 'https://en.wikipedia.org/wiki/Windows-1252'
]


def timed(func):
    async def wrapper():
        start = time.time()
        await func()
        return time.time() - start
    return wrapper


@timed
async def main():
    conn = aiohttp.TCPConnector(use_dns_cache=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        coroutines = [fetch(session, url) for url in urls]
        await asyncio.gather(*coroutines)


async def fetch(session, url):
    async with session.get(url) as resp:
        await resp.text()


asyncio_results = [asyncio.run(main()) for i in range(10)]
print(f'asyncio event loop: avg={mean(asyncio_results)} s. stdev={stdev(asyncio_results)} s.')

# Change to uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

uvloop_results = [asyncio.run(main()) for i in range(10)]
print(f'uvloop event loop: avg={mean(uvloop_results)} s. stdev={stdev(uvloop_results)} s.')