Is doing asynchronous database calls in a Flask application possible?

I am working on a Flask application that interacts with Microsoft SQL Server using the pypyodbc library. Several pages require a long database query to load. We ran into the problem that while Python is waiting for an answer from the database, no other requests are served.
So far our attempt at running the queries in an asynchronous way is captured in this testcase:
from aiohttp import web
from multiprocessing.pool import ThreadPool
import asyncio
import _thread
import pypyodbc

pool = ThreadPool(processes=1)

def query_db(query, args=(), one=False):
    conn = pypyodbc.connect(CONNECTION_STRING)
    cur = conn.cursor()
    cur.execute(query, args)
    result = cur.fetchall()
    conn.commit()
    return result

def get(data):
    # Base query
    query = ...  # query that takes ~10 seconds
    result = query_db(query, [])
    return result

def slow(request):
    loop = asyncio.get_event_loop()
    #result = loop.run_in_executor(None, get, [])
    result = pool.apply_async(get, (1,))
    x = result.get()
    return web.Response(text="slow")

def fast(request):
    return web.Response(text="fast")

if __name__ == '__main__':
    app = web.Application()
    app.router.add_get('/slow', slow)
    app.router.add_get('/fast', fast)
    web.run_app(app, host='127.0.0.1', port=5561)
This did not work: requesting the slow and the fast page in that order still left the fast page waiting until the slow one was done.
I looked for alternatives, but could not find a solution that both works and fits our environment:
aiopg.sa can run queries asynchronously, but switching from SQL Server to PostgreSQL is not an option
uWSGI can serve Flask with multiple threads, but it cannot be pip-installed on Windows
Celery is similar to our current approach, but it needs a message broker, which is non-trivial to set up on our system

It sounds like the issue isn't the Flask application itself but rather your WSGI server. A single Flask worker handles one request at a time. To serve the app so that multiple people can hit it simultaneously, configure the WSGI server to use more workers. If a request hits the server while your long query is running, another worker spins up an instance of the app and serves it. This is easy to set up in IIS.
Of course, if you have four workers and four clients run the long function simultaneously, you're back in the same situation. If that will happen frequently, assign more workers or move to a framework that supports async, like Quart or Sanic.
The approach you've outlined above should speed up the execution of the long query itself, but Flask is not designed to await: it holds the thread until the handler is finished.
More details in this answer: https://stackoverflow.com/a/19411051/5093960
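As for the aiohttp testcase itself: a likely fix is that the handlers must be declared `async def` and must await the executor call, instead of blocking the event loop on `pool.apply_async(...).get()`. A minimal stdlib-only sketch of the pattern, with the ~10 second pypyodbc query simulated by `time.sleep`:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def blocking_query():
    # stand-in for the long pypyodbc query
    time.sleep(0.2)
    return "rows"

async def slow():
    loop = asyncio.get_running_loop()
    # await suspends only this coroutine; the event loop stays free
    # to serve other handlers while the query runs in a thread
    return await loop.run_in_executor(executor, blocking_query)

async def fast():
    return "fast"

async def main():
    # both handlers run concurrently; fast is not blocked by slow
    return await asyncio.gather(slow(), fast())
```

In the aiohttp app this means `async def slow(request)` with the awaited `run_in_executor` call inside, so `/fast` responds while `/slow` is still waiting on the database.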

Related

How can I disable threading in Flask?

I have the following bit of code:
import sqlite3
from flask import Flask

app = Flask(__name__)
db = sqlite3.connect('/etc/db.sqlite')

@app.route('/')
def handle():
    # run a query and return a response
    ...

if __name__ == '__main__':
    app.run('0.0.0.0', 8080, debug=True)
However, when I try to perform some operations on the database object in the request handler, sqlite3 raises the following exception, because it is not a thread-safe library and the query runs in a thread that Flask spawns, not in the main thread:
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139886422697792 and this is thread id 139886332843776
I am aware that the "proper" way to do this is to have a function to create an instance of the sqlite3.Connection object and store it in the Flask g global, as outlined here: http://flask.pocoo.org/docs/1.0/patterns/sqlite3/. However, when running this application on production, I use gunicorn -w 4 -b 0.0.0.0:8080 app:app, and there it works fine, because the threads are spawned at the beginning in this case.
While the Flask g global method works in all cases, I would really like to avoid the overhead of creating and destroying sqlite3.Connection objects with every request. So, I would like to disable threading in Flask so that the above code can run without causing issues.
However, even when I change the last line of the above code to app.run(..., threaded=False), I am unable to avoid this error. It seems that Flask still spawns a thread for handling requests.
So, how can I disable threading with Flask?
Don't use the sqlite3 module directly in Flask. Use Flask-SQLAlchemy.
I had lots of trouble trying to set up SQLite databases without it. As soon as I made the switch it was so much easier. You can connect to multiple types of SQL database too!
Flask sqlalchemy:
http://flask-sqlalchemy.pocoo.org/2.3/
Really the best guide for flask out there:
https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-iv-database
Use scoped_session to avoid this:
session = scoped_session(sessionmaker(bind=engine))()
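If you really do want to keep a single module-level sqlite3 connection, as in the question, one hedged option is sqlite3's `check_same_thread=False` flag combined with an explicit lock. This is a sketch of that workaround, not a recommendation over Flask-SQLAlchemy or the `g` pattern:

```python
import sqlite3
import threading

# check_same_thread=False disables sqlite3's same-thread guard, so the
# connection may be used from Flask's worker threads; in exchange, we
# must serialize access ourselves with a lock.
db = sqlite3.connect(':memory:', check_same_thread=False)
db_lock = threading.Lock()

def run_query(sql, args=()):
    with db_lock:  # only one thread touches the connection at a time
        return db.execute(sql, args).fetchall()
```

The lock makes all database access sequential, so this trades the per-request connection overhead for reduced concurrency on the database side.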

Unresponsive requests- understanding the bottleneck (Flask + Oracle + Gunicorn)

I'm new to Flask/Gunicorn and have a very basic understanding of SQL.
I have a Flask app that connects to a remote Oracle database with cx_Oracle. Depending on the app route selected, it runs one of two queries. I run the app using gunicorn -w 4 flask:app. The first query is a simple query on a table with ~70,000 rows and is very responsive. The second is more complex and queries several tables, one of which contains ~150 million rows. By sprinkling print statements around, I noticed that the second query sometimes never even starts, especially if it is not the first app.route selected and both are meant to run concurrently. Opening app.route('/') multiple times triggers its query repeatedly and runs the copies in parallel, but app.route('/2') does not behave that way. I have multiple workers enabled and threaded=True for Oracle. Why is this happening? Is it doomed to be slow or downright unresponsive because of the table's size?
import cx_Oracle
from flask import Flask
import pandas as pd

app = Flask(__name__)
connection = cx_Oracle.connect("name", "pwd", threaded=True)

@app.route('/')
def Q1():
    print("start q1")
    querystring = """select to_char(to_date(col1, 'mm/dd/yy'), 'Month'), sum(col2)
                     FROM tbl1"""
    df = pd.read_sql(querystring, con=connection)
    print("q1 complete")
    return df.to_html()  # handlers must return something

@app.route('/2')
def Q2():
    print("start q2")
    querystring = """select tbl2.col1,
                            tbl2.col2,
                            tbl3.col3
                     FROM tbl2 INNER JOIN
                          tbl3 ON tbl2.col1 = tbl3.col1
                     WHERE tbl2.col2 like 'X%' AND
                           tbl2.col4 >= 20180101"""
    df = pd.read_sql(querystring, con=connection)
    print("q2 complete")
    return df.to_html()
I have tried exporting the datasets for each query as CSVs and having pandas read the CSVs instead; in that scenario both reads can run concurrently without missing a beat. Is this a SQL issue, a thread issue, or a worker issue?
Be aware that a connection can only process one thing at a time. If the connection is busy executing one of the queries, it cannot execute the other one. Once execution is complete and fetching has begun, the two can interleave, but each must wait for the other to finish its current fetch before it can proceed. To get around this, use a session pool (http://cx-oracle.readthedocs.io/en/latest/module.html#cx_Oracle.SessionPool) and then, in each of your routes, add this code:
    connection = pool.acquire()
None of that will help the performance of the one slow query, but at least it will prevent the two queries from interfering with each other!
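To see why a pool removes the interference, here is a toy, pure-Python pool illustrating the acquire/release discipline that `cx_Oracle.SessionPool` implements for real Oracle sessions. The `MiniPool` class is illustrative only, not part of cx_Oracle:

```python
import queue

class MiniPool:
    """Toy pool showing the acquire/release discipline of a session pool."""

    def __init__(self, factory, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(factory())  # pre-create `size` connections

    def acquire(self):
        # blocks until some other request releases a connection, so each
        # route gets exclusive use of one connection at a time
        return self._free.get()

    def release(self, conn):
        self._free.put(conn)
```

Because each route acquires its own connection, Q1 and Q2 no longer queue behind one another on a single shared connection. Remember to `release()` (or `pool.release(connection)` with cx_Oracle) when the route finishes.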

Leaking database connections: PostgreSQL, SQLAlchemy, Flask

I'm running PostgreSQL 9.3 and SQLAlchemy 0.8.2 and am experiencing leaking database connections. After deploying, the app consumes around 240 connections. Over the next 30 hours this number gradually grows to 500, at which point PostgreSQL starts dropping connections.
I use SQLAlchemy thread-local sessions:
import os
from sqlalchemy import orm, create_engine

engine = create_engine(os.environ['DATABASE_URL'], echo=False)
Session = orm.scoped_session(orm.sessionmaker(engine))
For the Flask web app, the .remove() call on the Session proxy object is sent during request teardown:
@app.teardown_request
def teardown_request(exception=None):
    if not app.testing:
        Session.remove()
This should be the same as what Flask-SQLAlchemy is doing.
I also have some periodic tasks that run in a loop, and I call .remove() for every iteration of the loop:
def run_forever():
while True:
do_stuff(Session)
Session.remove()
What am I doing wrong which could lead to a connection leak?
If I remember correctly from my experiments with SQLAlchemy, scoped_session() is used to create sessions that you can access from multiple places. That is, you create a session in one method and use it in another without explicitly passing the session object around.
It does that by keeping a registry of sessions keyed by a "scope ID". By default it uses the current thread ID as the scope ID, so you get one session per thread. You can supply a scopefunc to provide, for example, one ID per request:
# This is (approx.) what flask-sqlalchemy does:
from flask import _request_ctx_stack as context_stack

Session = orm.scoped_session(orm.sessionmaker(engine),
                             scopefunc=context_stack.__ident_func__)
Also, take note of the other answers and comments about doing background tasks.
First of all, a bare while-loop is a really bad way to run background tasks. Try an async task queue like Celery.
Not 100% sure, so this is a bit of a guess based on the information provided, but I wonder if each page load is starting a new database connection which then listens for notifications. If so, the connection is effectively removed from the pool and a new one gets created on the next page load.
If that is the case, my recommendation would be a separate database handle dedicated to listening for notifications, so those connections are not held in the pool. This might be done outside your workflow.
Also
In particular, the leak happens when making more than one simultaneous request. At the same time, I could see that some requests were left with uncompleted query executions and timed out. You may need to write something to manage this yourself.
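Whatever the source of the leak, you can also bound and recycle connections on the SQLAlchemy engine itself, so a leak surfaces as a pool timeout in the app instead of unbounded connection growth on the PostgreSQL side. A sketch of the relevant `create_engine` parameters (the `sqlite://` URL is only so the sketch runs anywhere; substitute your `DATABASE_URL`):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite://",        # placeholder; use os.environ['DATABASE_URL']
    poolclass=QueuePool,
    pool_size=5,        # steady-state connections held per process
    max_overflow=10,    # cap on temporary extra connections under load
    pool_recycle=1800,  # drop and replace connections older than 30 min
)
```

With `pool_size + max_overflow` capped per process, the total connection count across all web and worker processes becomes predictable, which makes it much easier to spot which component is actually leaking.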

Enable multithreading of my web app using python Bottle framework

I have a web app written with the Bottle framework. It has a global somedict dictionary accessed by multiple HTTP queries.
After some research, I found that the Bottle framework only supports running my app with one thread in one process (I don't quite believe that is true; perhaps migrating to another framework like Flask is a good idea).
1. To enable multi-threading, I found a WSGI solution, but with multiple processes (one thread per process) a global variable like somedict in my app does not work, because each process re-initializes the dict every time a query is handled. How can I handle this issue?
2. Is there any solution other than WSGI that enables this app to serve multiple HTTP queries at once?
from bottle import request, route
import threading

somedict = {}
somedict_lock = threading.Lock()

@route("/read")
def read():
    with somedict_lock:
        return somedict

@route("/write", method="POST")
def write():
    with somedict_lock:
        somedict[request.forms.get("key1")] = request.forms.get("value1")
        somedict[request.forms.get("key2")] = request.forms.get("value2")
It's best to serve a WSGI app via a server like gunicorn or waitress, which will handle your concurrency needs, but almost no matter what you do for concurrency your global queue in memory will not work the way you want it to. You need to use an external memory store like memcached, redis, etc. Static data is one thing, but mutable state should never be shared between web app processes. That's contrary to Python web server idioms and the typical execution model of Python web apps.
I'm not saying it's literally impossible to do in Python, but it's not the way Python solves this problem.
You can also process incoming requests asynchronously; Celery is currently well suited to running such asynchronous tasks. Read up on how Celery can do this.
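The external-store advice above can be sketched with the standard library alone. In production you would reach for redis or memcached, but the principle is the same: workers share state through a store addressed by name, not through process memory. The `SharedDict` class and its schema below are purely illustrative:

```python
import sqlite3

class SharedDict:
    """Dict-like store any worker process can open by file path."""

    def __init__(self, path):
        self.path = path
        with sqlite3.connect(self.path) as db:
            db.execute(
                "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def set(self, key, value):
        with sqlite3.connect(self.path) as db:  # commits on exit
            db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key):
        with sqlite3.connect(self.path) as db:
            row = db.execute(
                "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
            return row[0] if row else None
```

The Bottle routes would then call `store.set(...)` and `store.get(...)` instead of touching the global `somedict`, and every gunicorn/waitress worker sees the same data.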

Django Asynchronous Processing

I have a bunch of Django requests which execute some mathematical computations (written in C and executed via a Cython module) which may take an indeterminate amount of time (on the order of 1 second) to execute. Also, the requests don't need to access the database and are all independent of each other and of Django.
Right now everything is synchronous (using Gunicorn with sync worker types), but I'd like to make this asynchronous and nonblocking. In short, I'd like to do something like this:
Receive the AJAX request
Allocate task to an available worker ( without blocking the main Django web application )
Worker executes task in some unknown amount of time
Django returns the result of the computation (a list of strings) as JSON whenever the task completes
I am very new to asynchronous Django, and so my question is what is the best stack for doing this.
Is this sort of process something a task queue is well suited for? Would anyone recommend Tornado + Celery + RabbitMQ, or perhaps something else?
Thanks in advance!
Celery would be perfect for this.
Since what you're doing is relatively simple (read: you don't need complex rules about how tasks should be routed), you can probably get away with using the Redis backend, which means you don't need to set up or configure RabbitMQ (which, in my experience, is more difficult).
I use Redis with a dev build of Celery, and here are the relevant bits of my config:
# Use redis as a queue
BROKER_BACKEND = "kombu.transport.pyredis.Transport"
BROKER_HOST = "localhost"
BROKER_PORT = 6379
BROKER_VHOST = "0"
# Store results in redis
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = "0"
I'm also using django-celery, which makes the integration with Django smooth.
Comment if you need any more specific advice.
Since you are planning to make it async (presumably using something like gevent), you could also consider making a threaded/forked backend web service for the computational work.
The async frontend server could handle all the light work, get data from databases that are suitable for async (redis or mysql with a special driver), etc. When a computation has to be done, the frontend server can post all input data to the backend server and retrieve the result when the backend server is done computing it.
Since the frontend server is async, it will not block while waiting for the results. The advantage of this as opposed to using celery, is that you can return the result to the client as soon as it becomes available.
client browser <> async frontend server <> backend server for computations
