FastAPI parallelism with pandas.read_sql() - python

I have the following class in my code, which manages all the database connections:
class Database:
    # There's more code here, but the important part is the one below
    def get_status(self) -> dict:
        self.connect_db()  # Connects to the database and stores its connection in self.conn
        df_data = pd.read_sql("""SELECT col1, col2, col3 FROM table1""", self.conn)
        df_info = pd.read_sql("""SELECT col1, col2, col3 FROM table2""", self.conn)
        self.disconnect_db()  # Closes the database connection
        return [df_data.to_dict(orient="list"), df_info.to_dict(orient="list")]

db = Database()
I have to call db.get_status() in a FastAPI route:
@app.get("/api/get_status")
async def get_status_api():
    return db.get_status()
The problem is that it takes a long time to complete, and while it is running the entire website is blocked.
I tried parallelism with asyncio, but get_status() takes a long time because of a CPU-intensive operation, not because of the database request.
Besides asyncio, I've already tried the following:
@app.get("/api/get_status")
async def get_status_api():
    data = {}
    thread = threading.Thread(target=db.get_status, args=(data,))  # Passing data as an argument to simulate the return value
    thread.start()
    thread.join()
    return data
@app.get("/api/get_status")
async def get_status_api():
    data = {}
    thread = multiprocessing.Process(target=db.get_status, args=(data,))  # Passing data as an argument to simulate the return value
    thread.start()
    thread.join()
    return data
@app.get("/api/get_status")
async def get_status_api():
    with ThreadPoolExecutor() as executor:
        data = list(executor.map(db.get_status, [None]))[0]  # Altered the db.get_status() signature to get_status(self, _)
    return data
But no luck so far. So, how can I avoid blocking the entire website while pd.read_sql() is running? Taking a long time to run the query is fine, as long as the app can handle parallel requests.

As MatsLindh pointed out, the solution was simply to remove the async from the route function, since FastAPI already runs non-async route functions in a thread pool.
@app.get("/api/get_status")
def get_status_api():
    return db.get_status()
Altering the route function to the above worked.
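If you prefer to keep the route defined with async def (for example, to await other coroutines in the same handler), another option is to offload the blocking call to a worker thread yourself. The sketch below assumes the db object from the question; run_in_threadpool is provided by Starlette and re-exported by FastAPI.
# Alternative sketch: keep the route async and explicitly offload the blocking
# call to the thread pool, so the event loop stays free for other requests.
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

@app.get("/api/get_status")
async def get_status_api():
    # db.get_status() is the blocking method from the question's Database class
    return await run_in_threadpool(db.get_status)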

Related

ConnectionAbortedError: [WinError 10053] when trying to connect to itself with a web app

I have just run into a funny situation when testing my FastAPI Python application and thought it might be useful for some of the people who reuse sessions in their apps and want to test requests against the same app, but get stuck on weird errors like the one in the title.
I would also like to know what is happening here.
Context
I have an async FastAPI application that schedules multiple requests based on an unimportant configuration. After the list of request definitions is prepared, a session is created and the requests are sent, possibly with delays so I can spread them out in time.
To test that the requests are getting through, I have created routes in my own app so I can send the testing requests back to my own application. The application basically talks to itself.
It was listening on 127.0.0.1:8000 at the time of testing.
I have the following functions defined for building async tasks:
def optional_session(func):
    async def wrapper(*args, **kwargs):
        if 'session' not in kwargs or kwargs['session'] is None:
            async with ClientSession() as session:
                kwargs['session'] = session
                return await func(*args, **kwargs)
        else:
            return await func(*args, **kwargs)
    return wrapper
@optional_session
async def post_json_with_time_from_url(url: str, data: dict, session: ClientSession = None) -> Tuple[Union[dict, None], float]:
    """
    A method that performs a request to a specified URL and reads the response as JSON data.
    If the request is successful the data is returned. If an error occurs it is logged and the returned data is None.
    :param data: data to send in the request
    :param url: The URL to retrieve the image from
    :return: A valid response or None
    :param session:
    """
    result = None, time.time()
    try:
        async with session.post(url, data=data) as response:  # type: ClientResponse
            # check if the response is valid
            if response.status == 200:
                try:
                    # we have to read the response before leaving the response context manager
                    result = await response.json(), time.time()
                except Exception as e:
                    logger.error("...")
            else:
                logger.error("...")
    except InvalidURL as e:
        logger.error(f"...")
    except Exception as e:
        logger.error("...")
    return result
def delay(func, seconds: int):
    """
    This decorator adds a time delay to an async function.
    """
    if seconds is None:
        seconds = 0

    async def wrapper(*args, **kwargs):
        await asyncio.sleep(seconds)
        return await func(*args, **kwargs)
    return wrapper
def parse_get_post_request(config: ConfigContext, session: aiohttp.ClientSession = None) -> asyncio.Task:
    """
    Parses the get/post request from the configuration dictionary and creates an async task for it.
    """
    request_type = config.extract_key('request_type', True).lower()
    delay_ = config.extract_key('delay')
    url_base_ = config.extract_key('request_url_base', True)
    url_suffix_ = config.extract_key('request_url_suffix', True)
    url_ = urljoin(base=url_base_, url=url_suffix_)
    if request_type == 'get':
        return asyncio.ensure_future(
            delay(get_json_with_time_from_url, delay_)(url=url_, session=session)
        )
    elif request_type == 'post':
        return asyncio.ensure_future(
            delay(post_json_with_time_from_url, delay_)(url=url_, session=session, data=config.extract_key('request_data'))
        )
    else:
        raise ValueError(f"Unsupported request type: {request_type}")
I am creating an aiohttp session like this:
async with aiohttp.ClientSession() as session:
    ...
and then reusing it throughout the context code block, something like this:
single_request_tasks = []
...
for config in configs:
    single_request_tasks.append(parse_get_post_request(config=plan_config, session=session))
...
responses = await asyncio.gather(*single_request_tasks)
...
Problem
Somehow, when I send the requests all together and one of them arrives back at the app at the same time as another, an exception is thrown:
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
It turns out that, for some reason, the session I share for all the requests is terminated when multiple requests arrive at the same time using the same ClientSession instance.
I am not really sure why this happens exactly, apart from suspecting some port clash shenanigans, but it is resolved when I use a separate session for each request, or when I spread the requests out in time with an interval of one second (for example).
Workaround
I have used separate sessions for each request when looping back to localhost (see the sketch below).
I also avoided the issue when I spread the requests out in time, so each one has time to complete before the next one is sent, but timing is not a reliable mechanism (OS task scheduling, concurrency in asyncio, network latency, etc.).
This problem does not occur when sharing a session with a different host (for example when scraping images from imgur.com), so I believe the problem is related to the fact that I am looping back to localhost.
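For reference, the workaround amounts to passing session=None for loopback requests so that the @optional_session wrapper opens a fresh ClientSession per call; a minimal sketch under that assumption (prepared_requests is a hypothetical list of (url, data, delay) tuples, not from the original code):
# Workaround sketch: no shared session for loopback requests; each call gets
# its own ClientSession via the @optional_session wrapper (session=None).
single_request_tasks = [
    asyncio.ensure_future(
        delay(post_json_with_time_from_url, delay_)(url=url_, data=data_, session=None)
    )
    for url_, data_, delay_ in prepared_requests  # hypothetical: built from the configs
]
responses = await asyncio.gather(*single_request_tasks)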
Question
Why does this happen, exactly? Why is the session closed by the software in the situation I described?
Is there anything I am doing wrong with the session? How does Starlette handle loopback connections? Is this case-dependent, and do I need to do more detective work, or is this a generally recognized, platform-independent behaviour?

How to send a progress of operation in a FastAPI app?

I have deployed a FastAPI endpoint:
from fastapi import FastAPI, UploadFile
from typing import List

app = FastAPI()

@app.post('/work/test')
async def testing(files: List[UploadFile]):
    for i in files:
        .......
        # do a lot of operations on each file
        # after that I am just writing the processed data into the mysql database
        # cur.execute(...)
        # cur.commit()
        .......
    # just returning "OK" to confirm data is written into mysql
    return {"response": "OK"}
I can request output from the API endpoint and it's working perfectly fine for me.
Now, the biggest challenge is knowing how much time each iteration takes, because in the UI part (for those who are accessing my API endpoint) I want to show them a progress bar (TIME TAKEN) for each iteration/file being processed.
Is there any possible way for me to achieve this? If so, please help me out on how I can proceed.
Thank you.
Approaches
Polling
The preferred approach for tracking the progress of a task is polling:
After receiving a request to start a task on the backend:
Create a task object in storage (e.g. in-memory, Redis, etc.). The task object must contain the following data: task ID, status (pending, completed), result, and others.
Run the task in the background (coroutines, threading, multiprocessing, or a task queue like Celery, arq, aio-pika, dramatiq, etc.).
Respond immediately with 202 (Accepted), returning the previously generated task ID.
Update the task status:
This can be done from within the task itself, if it knows about the task store and has access to it; the task periodically updates information about itself.
Or use a task monitor (observer, producer-consumer pattern), which watches the status of the task and its result and updates the information in the storage.
On the client side (front end), start a polling cycle for the task status against an endpoint /task/{ID}/status, which takes its information from the task storage.
Streaming response
Streaming is a less convenient way of periodically getting the status of request processing: responses are pushed gradually without closing the connection. It has a number of significant disadvantages; for example, if the connection is broken you can lose information. A streaming API is a different approach from a REST API.
Websockets
You can also use websockets for real-time notifications and bidirectional communication.
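A hedged sketch of what that could look like in FastAPI (this is not part of the original answer; it assumes the app and jobs objects from the polling demo shown below):
# Hedged sketch: push job progress over a websocket instead of polling.
# Assumes the `app` and `jobs` objects defined in the polling demo below.
import asyncio
from uuid import UUID

from fastapi import WebSocket

@app.websocket("/ws/task/{uid}")
async def task_progress_ws(websocket: WebSocket, uid: UUID):
    await websocket.accept()
    while True:
        job = jobs[uid]
        await websocket.send_json({"status": job.status, "progress": job.progress})
        if job.status == "complete":
            break
        await asyncio.sleep(1)  # push an update roughly once per second
    await websocket.close()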
Links:
Examples of the polling approach for a progress bar, and a more detailed description for Django + Celery, can be found at these links:
https://www.dangtrinh.com/2013/07/django-celery-display-progress-bar-of.html
https://buildwithdjango.com/blog/post/celery-progress-bars/
I have provided simplified examples of running background tasks in FastAPI using multiprocessing here:
https://stackoverflow.com/a/63171013/13782669
Old answer:
You could run a task in the background, return its ID, and provide a /status endpoint that the front end would call periodically. In the status response, you could return the current state of the task (for example, pending, along with the number of the currently processed file). I provided a few simple examples here.
Demo
Polling
Demo of the approach using asyncio tasks (single worker solution):
import asyncio
from http import HTTPStatus
from typing import Dict, List
from uuid import UUID, uuid4

import uvicorn
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel, Field

class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    progress: int = 0
    result: int = None

app = FastAPI()
jobs: Dict[UUID, Job] = {}  # Dict as job storage

async def long_task(queue: asyncio.Queue, param: int):
    for i in range(1, param):  # do work and return our progress
        await asyncio.sleep(1)
        await queue.put(i)
    await queue.put(None)

async def start_new_task(uid: UUID, param: int) -> None:
    queue = asyncio.Queue()
    task = asyncio.create_task(long_task(queue, param))
    while progress := await queue.get():  # monitor task progress
        jobs[uid].progress = progress
    jobs[uid].status = "complete"

@app.post("/new_task/{param}", status_code=HTTPStatus.ACCEPTED)
async def task_handler(background_tasks: BackgroundTasks, param: int):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(start_new_task, new_task.uid, param)
    return new_task

@app.get("/task/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
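A hypothetical client for the demo above just creates a job and then polls the status endpoint; the sketch below uses the requests library and assumes the app is served locally on port 8000 (e.g. via uvicorn main:app):
# Hypothetical polling client for the demo above (base URL is an assumption).
import time

import requests

BASE = "http://127.0.0.1:8000"

job = requests.post(f"{BASE}/new_task/10").json()  # 202 Accepted, returns the Job object
while True:
    status = requests.get(f"{BASE}/task/{job['uid']}/status").json()
    print(status["progress"], status["status"])
    if status["status"] == "complete":
        break
    time.sleep(1)  # poll roughly once per second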
Adapted example for the loop from the question
The background processing function is defined with a plain def, so FastAPI runs it in the thread pool.
import time
from http import HTTPStatus
from typing import Dict, List
from uuid import UUID, uuid4

from fastapi import BackgroundTasks, FastAPI, File, UploadFile
from pydantic import BaseModel, Field

class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    processed_files: List[str] = Field(default_factory=list)

app = FastAPI()
jobs: Dict[UUID, Job] = {}

def process_files(task_id: UUID, files: List[UploadFile]):
    for i in files:
        time.sleep(5)  # pretend long task
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        jobs[task_id].processed_files.append(i.filename)
    jobs[task_id].status = "completed"

@app.post('/work/test', status_code=HTTPStatus.ACCEPTED)
async def work(background_tasks: BackgroundTasks, files: List[UploadFile] = File(...)):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(process_files, new_task.uid, files)
    return new_task

@app.get("/work/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
Streaming
import asyncio

from fastapi.responses import StreamingResponse

async def process_files_gen(files: List[UploadFile]):
    for i in files:
        await asyncio.sleep(5)  # pretend long task (non-blocking, so the event loop stays free)
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        yield f"{i.filename} processed\n"
    yield f"OK\n"

@app.post('/work/stream/test', status_code=HTTPStatus.ACCEPTED)
async def work(files: List[UploadFile] = File(...)):
    return StreamingResponse(process_files_gen(files))
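On the client side, the progressive output can be read line by line; a hypothetical sketch with the requests library (file name and base URL are placeholders):
# Hypothetical client for the streaming endpoint above: each "<name> processed"
# line arrives as soon as that file has been handled.
import requests

with requests.post("http://127.0.0.1:8000/work/stream/test",
                   files=[("files", open("a.txt", "rb"))], stream=True) as r:
    for line in r.iter_lines():
        print(line.decode())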
Below is a solution that uses unique identifiers and a globally available dictionary that holds information about the jobs:
NOTE: The code below is safe to use as long as you use dynamic key values (the sample uses uuid) and keep the application within a single process.
1. To start the app, create a file main.py.
2. Run uvicorn main:app --reload.
3. Create a job entry by accessing http://127.0.0.1:8000/.
4. Repeat step 3 to create multiple jobs.
5. Go to the http://127.0.0.1:8000/status page to see the statuses of all jobs.
6. Go to http://127.0.0.1:8000/status/{identifier} to see the progress of a job by its job id.
Code of the app:
from fastapi import FastAPI, UploadFile
import uuid
from typing import List
import asyncio

context = {'jobs': {}}

app = FastAPI()

async def do_work(job_key, files=None):
    iter_over = files if files else range(100)
    for file_number, file in enumerate(iter_over):  # enumerate yields (index, item)
        jobs = context['jobs']
        job_info = jobs[job_key]
        job_info['iteration'] = file_number
        job_info['status'] = 'inprogress'
        await asyncio.sleep(1)
    context['jobs'][job_key]['status'] = 'done'

@app.post('/work/test')
async def testing(files: List[UploadFile]):
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier, files), loop=asyncio.get_running_loop())
    return {"identifier": identifier}

@app.get('/')
async def get_testing():
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier), loop=asyncio.get_running_loop())
    return {"identifier": identifier}

@app.get('/status')
def status():
    return {
        'all': list(context['jobs'].values()),
    }

@app.get('/status/{identifier}')
async def status(identifier):
    return {
        "status": context['jobs'].get(identifier, 'job with that identifier is undefined'),
    }

Concurrent HTTP and SQL requests using async Python 3

First time trying asyncio and aiohttp.
I have the following code that gets URLs from the MySQL database for GET requests, gets the responses, and pushes them back to the MySQL database.
if __name__ == "__main__":
    database_name = 'db_name'
    company_name = 'company_name'
    my_db = Db(database=database_name)  # wrapper class for mysql.connector
    urls_dict = my_db.get_rest_api_urls_for_specific_company(company_name=company_name)
    update_id = my_db.get_updateid()
    my_db.get_connection(dictionary=True)
    for url in urls_dict:
        url_id = url['id']
        url = url['url']
        table_name = my_db.make_sql_table_name_by_url(url)
        insert_query = my_db.get_sql_for_insert(table_name)
        r = requests.get(url=url).json()  # make the request
        args = [json.dumps(r), update_id, url_id]
        my_db.db_execute_one(insert_query, args, close_conn=False)
    my_db.close_conn()
This works fine, but how can I run it asynchronously to speed it up?
I have looked here, here and here but can't seem to get my head around it.
Here is what I have tried based on @Raphael Medaer's answer.
async def fetch(url):
    async with ClientSession() as session:
        async with session.request(method='GET', url=url) as response:
            json = await response.json()
            return json

async def process(url, update_id):
    table_name = await db.make_sql_table_name_by_url(url)
    result = await fetch(url)
    print(url, result)

if __name__ == "__main__":
    """Get urls from DB"""
    db = Db(database="fuse_src")
    urls = db.get_rest_api_urls()  # This returns a list of dictionaries
    update_id = db.get_updateid()
    url_list = []
    for url in urls:
        url_list.append(url['url'])
    print(update_id)
    asyncio.get_event_loop().run_until_complete(
        asyncio.gather(*[process(url, update_id) for url in url_list]))
I get an error in the process method:
TypeError: object str can't be used in 'await' expression
Not sure what the problem is.
Any code example specific to this would be highly appreciated.
Making this code asynchronous will not speed it up by itself, except insofar as you run parts of your code in "parallel". For instance, you can run multiple (SQL or HTTP) queries at the "same time". Asynchronous programming does not actually execute code at the "same time"; rather, you take advantage of long IO waits to execute other parts of your code while you're waiting for IO.
First of all, you'll have to use asynchronous libraries (instead of synchronous ones).
mysql.connector could be replaced by aiomysql from aio-libs.
requests could be replaced by aiohttp.
To execute multiple asynchronous tasks in "parallel" (for instance to replace your loop for url in urls_dict:), you have to read carefully about asyncio tasks and the gather function.
I will not (re)write your code in an asynchronous way, however here are a few lines of pseudo code which could help you:
async def process(url):
    result = await fetch(url)
    await db.commit(result)

async def main():
    urls = await db.fetch_all_urls()
    await asyncio.gather(*[process(url) for url in urls])

if __name__ == "__main__":
    db = MyDbConnection()
    asyncio.get_event_loop().run_until_complete(main())
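To make the pseudo code more concrete, here is a hedged sketch of what the rewrite could look like using aiohttp together with aiomysql; the connection parameters, table and column names are illustrative placeholders rather than values from the question:
# Hedged sketch of the suggested rewrite with aiohttp + aiomysql.
# Connection parameters, table and column names are placeholders.
import asyncio
import json

import aiohttp
import aiomysql

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def process(session, pool, url_row, update_id):
    result = await fetch(session, url_row['url'])
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "INSERT INTO responses (payload, update_id, url_id) VALUES (%s, %s, %s)",
                (json.dumps(result), update_id, url_row['id']),
            )
        await conn.commit()

async def main():
    pool = await aiomysql.create_pool(host="localhost", user="user", password="pass", db="db_name")
    urls = [...]  # e.g. fetched up front with an aiomysql SELECT
    update_id = 1  # placeholder
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[process(session, pool, url_row, update_id) for url_row in urls])
    pool.close()
    await pool.wait_closed()

asyncio.run(main())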

How to trigger a function after return statement in Flask

I have 2 functions.
The 1st function stores the data received in a list, and the 2nd function writes the data into a CSV file.
I'm using Flask. Whenever the web service is called it stores the data and sends a response; as soon as it sends the response, it should trigger the 2nd function.
My Code:
from flask import Flask, flash, request, redirect, url_for, session
import json
import pandas as pd

app = Flask(__name__)

arr = []

@app.route("/test", methods=['GET', 'POST'])
def check():
    arr.append(request.form['a'])
    arr.append(request.form['b'])
    res = {'Status': True}
    return json.dumps(res)

def trigger():
    df = pd.DataFrame({'x': arr})
    df.to_csv("docs/xyz.csv", index=False)
    return
Obviously the 2nd function is not called.
Is there a way to achieve this?
P.S.: My real-life problem is different: the trigger function is time consuming and I don't want the user to wait for it to finish execution.
One solution would be to have a background thread that watches a queue. You put your CSV data in the queue and the background thread consumes it. You can start such a thread before the first request:
import threading
import time
from queue import Queue  # queue.Queue supports task_done()/join(); multiprocessing.Queue does not

import pandas as pd

class CSVWriterThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        threading.Thread.__init__(self, *args, **kwargs)
        self.input_queue = Queue()

    def send(self, item):
        self.input_queue.put(item)

    def close(self):
        self.input_queue.put(None)
        self.input_queue.join()

    def run(self):
        while True:
            csv_array = self.input_queue.get()
            if csv_array is None:
                break
            # Do something here ...
            df = pd.DataFrame({'x': csv_array})
            df.to_csv("docs/xyz.csv", index=False)
            self.input_queue.task_done()
            time.sleep(1)
        # Done
        self.input_queue.task_done()
        return

@app.before_first_request
def activate_job_monitor():
    thread = CSVWriterThread()
    app.csvwriter = thread
    thread.start()
And in your code put the message in the queue before returning:
@app.route("/test", methods=['GET', 'POST'])
def check():
    arr.append(request.form['a'])
    arr.append(request.form['b'])
    res = {'Status': True}
    app.csvwriter.send(arr)
    return json.dumps(res)
P.S: My real life problem is different where trigger function is time consuming and I don't want user to wait for it to finish execution.
Consider using Celery, which is made for the very problem you're trying to solve. From the docs:
Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
I recommend you integrate Celery with your Flask app as described here. Your trigger method would then become a straightforward Celery task that you can execute without having to worry about long response times.
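A hedged sketch of what that could look like; the Redis broker URL and the single-module layout are assumptions for illustration only:
# Hedged sketch: the trigger logic as a Celery task (broker URL is an assumption).
import json

import pandas as pd
from celery import Celery
from flask import Flask, request

app = Flask(__name__)
celery = Celery(app.name, broker="redis://localhost:6379/0")

@celery.task
def write_csv(rows):
    df = pd.DataFrame({'x': rows})
    df.to_csv("docs/xyz.csv", index=False)

@app.route("/test", methods=['POST'])
def check():
    rows = [request.form['a'], request.form['b']]
    write_csv.delay(rows)  # queued; a Celery worker writes the file later
    return json.dumps({'Status': True})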
I'm actually working on another interesting case on my side, where I pass the work off to a Python worker that sends the job to a Redis queue. There are some great blogs on using Redis with Flask; you basically need to ensure Redis is running (and that you can connect on port 6379).
The worker would look something like this:
import os

import redis
from rq import Worker, Queue, Connection

listen = ['default']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()
In my example I have a function that queries a database for usage, and since it might be a lengthy process I pass it off to the worker (running as a separate script).
def post(self):
    data = Task.parser.parse_args()
    job = q.enqueue_call(
        func=migrate_usage, args=(my_args),
        result_ttl=5000
    )
    print("Job ID is: {}".format(job.get_id()))
    job_key = job.get_id()
    print(str(Job.fetch(job_key, connection=conn).result))
    if job:
        return {"message": "Job : {} added to queue".format(job_key)}, 201
Credit due to the following article:
https://realpython.com/flask-by-example-implementing-a-redis-task-queue/#install-requirements
You can try using streaming. See the next example:
import time

from flask import Flask, Response

app = Flask(__name__)

@app.route('/')
def main():
    return '''<div>start</div>
    <script>
        var xhr = new XMLHttpRequest();
        xhr.open('GET', '/test', true);
        xhr.onreadystatechange = function(e) {
            var div = document.createElement('div');
            div.innerHTML = '' + this.readyState + ':' + this.responseText;
            document.body.appendChild(div);
        };
        xhr.send();
    </script>
    '''

@app.route('/test')
def test():
    def generate():
        app.logger.info('request started')
        for i in range(5):
            time.sleep(1)
            yield str(i)
        app.logger.info('request finished')
        yield ''
    return Response(generate(), mimetype='text/plain')

if __name__ == '__main__':
    app.run('0.0.0.0', 8080, True)
All the magic in this example is in the generator: you can start sending response data, then do some other work, and finally yield empty data to end your stream.
For details look at http://flask.pocoo.org/docs/patterns/streaming/.
You can defer route-specific actions with limited context by combining after_this_request and response.call_on_close. Note that the request and response context won't be available, but the route function's context remains available, so you'll need to copy any request/response data you need into local variables for deferred access.
I moved your array to a local variable to show how the function context is preserved. You could change your CSV write to an append so you're not pushing data endlessly into memory.
import flask
from flask import Flask, flash, request, redirect, url_for, session
import json
import pandas as pd

app = Flask(__name__)

@app.route("/test", methods=['GET', 'POST'])
def check():
    arr = []
    arr.append(request.form['a'])
    arr.append(request.form['b'])
    res = {'Status': True}

    @flask.after_this_request
    def add_close_action(response):
        @response.call_on_close
        def process_after_request():
            df = pd.DataFrame({'x': arr})
            df.to_csv("docs/xyz.csv", index=False)
        return response

    return json.dumps(res)

Am I using aiohttp together with psycopg2 correctly?

I'm very new to using asyncio/aiohttp, but I have a Python script that reads a batch of URLs from a Postgres table, downloads the URLs, runs a processing function on each download (not relevant to the question), and saves the results of the processing back to the table.
In simplified form it looks like this:
import asyncio

import psycopg2
from aiohttp import ClientSession, TCPConnector

BATCH_SIZE = 100

def _get_pgconn():
    return psycopg2.connect()

def db_conn(func):
    def _db_conn(*args, **kwargs):
        with _get_pgconn() as conn:
            with conn.cursor() as cur:
                return func(cur, *args, **kwargs)
            conn.commit()
    return _db_conn

async def run():
    async with ClientSession(connector=TCPConnector(ssl=False, limit=100)) as session:
        while True:
            count = await run_batch(session)
            if count == 0:
                break

async def run_batch(session):
    tasks = []
    for url in get_batch():
        task = asyncio.ensure_future(process_url(url, session))
        tasks.append(task)
    await asyncio.gather(*tasks)
    results = [task.result() for task in tasks]
    save_batch_result(results)
    return len(results)

async def process_url(url, session):
    try:
        async with session.get(url, timeout=15) as response:
            body = await response.read()
            return process_body(body)
    except:
        return {...}

@db_conn
def get_batch(cur):
    sql = "SELECT id, url FROM db.urls WHERE processed IS NULL LIMIT %s"
    cur.execute(sql, (BATCH_SIZE,))
    return cur.fetchall()

@db_conn
def save_batch_result(cur, results):
    sql = "UPDATE db.urls SET a = %(a)s, processed = true WHERE id = %(id)s"
    cur.executemany(sql, tuple(results))

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
But I have the feeling that I must be missing something here. The script runs, but it seems to become slower and slower with each batch. In particular, the call to the process_url function seems to get slower over time. Also, the memory usage keeps growing, so I'm guessing there is something I fail to clean up properly between runs?
I also have problems increasing the batch size much; if I go much over 200 I seem to get a much higher proportion of exceptions from the call to session.get. I have tried playing with the limit argument to the TCPConnector, setting it both higher and lower, but I can't see that it helps much. I have also tried running it on a few different servers but it seems to be the same. Is there some way to think about how to set these values more effectively?
I would be grateful for some pointers to what I might be doing wrong here!
The problem with your code is mixing the asynchronous aiohttp library with the synchronous psycopg2 client.
As a consequence, calls to the DB block the event loop entirely, affecting all other parallel tasks.
To solve it you need to use an asynchronous DB client: aiopg (a wrapper around psycopg2's async mode) or asyncpg (it has a different API but works faster).
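As a hedged sketch of that direction, the two @db_conn helpers could be rewritten with asyncpg roughly as follows (the DSN is a placeholder, and asyncpg uses $1-style parameters instead of psycopg2's %s / %(name)s):
# Hedged sketch: the database helpers rewritten with asyncpg so they no longer
# block the event loop. The DSN is a placeholder.
import asyncpg

BATCH_SIZE = 100

async def get_batch(pool):
    async with pool.acquire() as conn:
        return await conn.fetch(
            "SELECT id, url FROM db.urls WHERE processed IS NULL LIMIT $1", BATCH_SIZE
        )

async def save_batch_result(pool, results):
    # results is assumed to be an iterable of (a, id) tuples here
    async with pool.acquire() as conn:
        await conn.executemany(
            "UPDATE db.urls SET a = $1, processed = true WHERE id = $2", results
        )

async def run():
    pool = await asyncpg.create_pool(dsn="postgresql://user:pass@localhost/db")
    ...  # same aiohttp logic as before, awaiting get_batch(pool) and save_batch_result(pool, results)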
