I'm new to Flask/Gunicorn and have a very basic understanding of SQL.
I have a Flask app that connects to a remote Oracle database with cx_Oracle. Depending on the app route selected, it runs one of two queries. I run the app using gunicorn -w 4 flask:app. The first query is a simple query on a table with ~70000 rows and is very responsive. The second one is more complex and queries several tables, one of which contains ~150 million rows. By sprinkling print statements around, I noticed that sometimes the second query never even starts, especially if it is not the first route the user selects and both queries are supposed to run concurrently. Opening app.route('/') multiple times triggers its query multiple times quickly and runs it in parallel, but the same is not true for app.route('/2'). I have multiple workers enabled, and threaded=True for Oracle. Why is this happening? Is it doomed to be slow or downright unresponsive due to the size of the table?
import cx_Oracle
from flask import Flask
import pandas as pd
app = Flask(__name__)
connection = cx_Oracle.connect("name","pwd", threaded=True)
@app.route('/')
def Q1():
    print("start q1")
    querystring = """select to_char(to_date(col1,'mm/dd/yy'),'Month'), sum(col2)
                     FROM tbl1"""
    df = pd.read_sql(querystring, con=connection)
    print("q1 complete")
    return "q1 complete"
@app.route('/2')
def Q2():
    print("start q2")
    querystring = """select tbl2.col1,
                            tbl2.col2,
                            tbl3.col3
                     FROM tbl2 INNER JOIN
                          tbl3 ON tbl2.col1 = tbl3.col1
                     WHERE tbl2.col2 like 'X%' AND
                           tbl2.col4 >= 20180101"""
    df = pd.read_sql(querystring, con=connection)
    print("q2 complete")
    return "q2 complete"
I have tried exporting the dataset for each query as a CSV and having pandas read the CSVs instead; in that scenario, both reads run concurrently without missing a beat. Is this a SQL issue, a thread issue, or a worker issue?
Be aware that a connection can only process one thing at a time. If the connection is busy executing one of the queries, it can't execute the other one. Once execution is complete and fetching has begun, the two can operate together, but each one has to wait for the other to finish its current fetch before it can start its own. To get around this, you should use a session pool (http://cx-oracle.readthedocs.io/en/latest/module.html#cx_Oracle.SessionPool) and then, in each of your routes, add this code:
connection = pool.acquire()
None of that will help the performance of the slow query, but at least it will stop it from interfering with the other one!
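For illustration, a minimal sketch of what the pooled version could look like; the pool sizing, DSN, and query text are placeholders, not taken from the original code:

import cx_Oracle
from flask import Flask
import pandas as pd

app = Flask(__name__)

# Create the pool once at startup; user, password, DSN and sizing are placeholders.
pool = cx_Oracle.SessionPool("name", "pwd", "dsn", min=2, max=5, increment=1,
                             threaded=True)

@app.route('/')
def Q1():
    connection = pool.acquire()   # each request gets its own session
    try:
        df = pd.read_sql("select col1, col2 FROM tbl1", con=connection)
        return "q1 complete"
    finally:
        pool.release(connection)  # hand the session back to the pool

With one session per request, the slow query no longer shares a connection with the fast one, so they can execute and fetch independently.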
I want to execute multiple queries without each blocking the other. I created multiple cursors and did the following, but got mysql.connector.errors.OperationalError: 2013 (HY000): Lost connection to MySQL server during query.
import mysql.connector as mc
from threading import Thread
conn = mc.connect()  # ...username, password
cur1 = conn.cursor()
cur2 = conn.cursor()
e1 = Thread(target=cur1.execute, args=("do sleep(30)",)) # A 'time taking' task
e2 = Thread(target=cur2.execute, args=("show databases",)) # A simple task
e1.start()
e2.start()
But I got that OperationalError. Reading a few other questions, some suggest that using multiple connections is better than multiple cursors. So should I use multiple connections?
I don't have the full context of your situation to understand the performance considerations. Yes, starting a new connection can be considered heavy if you are operating under timing constraints that are short relative to the time it takes to open a connection and you are forced to do that for every query...
But you can mitigate that with a shared connection pool that you create ahead of time, and then distribute your queries (in separate threads) over those connections as resources allow.
On the other hand, if all of your query times are fairly long relative to the time it takes to create a new connection, and you aren't looking to run more than a handful of queries in parallel, then it can be a reasonable option to create connections on demand. Just be aware that you will run into limits with the number of open connections if you try to go too far, as well as resource limitations on the database system itself. You probably don't want to do something like that against a shared database. Again, this is only a reasonable option within some very specific contexts.
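For example, a shared pool created ahead of time, with each thread borrowing its own connection, might look roughly like this (pool size and connection details are placeholders):

import mysql.connector
from mysql.connector import pooling
from threading import Thread

# One pool for the whole process; credentials and sizing are placeholders.
pool = pooling.MySQLConnectionPool(pool_name="mypool", pool_size=4,
                                   host="localhost", user="user",
                                   password="pwd", database="db")

def run_query(sql):
    conn = pool.get_connection()  # each thread borrows its own connection
    try:
        cur = conn.cursor()
        cur.execute(sql)
        if cur.with_rows:
            cur.fetchall()
    finally:
        conn.close()              # returns the connection to the pool

t1 = Thread(target=run_query, args=("do sleep(30)",))
t2 = Thread(target=run_query, args=("show databases",))
t1.start(); t2.start()
t1.join(); t2.join()

Because the two statements run on separate connections, the quick one no longer has to wait behind the slow one.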
I'm currently using mysql.connector in a Python Flask project and, after users enter their information, the following query is executed:
"SELECT first, last, email, {} FROM {} WHERE {} <= {} AND ispaired IS NULL".format(key, db, class_data[key], key)
It would pose a problem if this query was executed in 2 threads concurrently, and returned the same row in both threads. I was wondering if there was a way to prevent SELECT mysql queries from executing concurrently, or if this was already the default behavior of mysql.connector? For additional information, all mysql.connector queries are executed after being authenticated with the same account credentials.
It is hard to say from your description, but if you're using Flask, you're most probably using (or will use in production) multiple processes, and you probably have a connection pool (i.e. multiple connections) in each process. So while each connection executes queries sequentially, the same query can be run concurrently by multiple connections at the same time.
To prevent your application from obtaining the same row at the same time while handling different requests, you should use transactions and techniques like SELECT FOR UPDATE. The exact solution depends on your exact use case.
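A rough sketch of that pattern with mysql.connector; the table, columns, and claiming logic here are placeholders, not taken from the question:

import mysql.connector

# Connection details are placeholders.
conn = mysql.connector.connect(host="localhost", user="user",
                               password="pwd", database="db")
cur = conn.cursor()

conn.start_transaction()
# Lock the candidate row so a concurrent request cannot claim it as well.
cur.execute("SELECT id, first, last, email FROM people "
            "WHERE ispaired IS NULL LIMIT 1 FOR UPDATE")
row = cur.fetchone()
if row:
    cur.execute("UPDATE people SET ispaired = 1 WHERE id = %s", (row[0],))
conn.commit()  # committing releases the row lock

A second request running the same block at the same time will either block on the lock until the first transaction commits or see the row as already claimed.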
I am working on a Flask application that interacts with Microsoft SQL Server using the pypyodbc library. Several pages require a long database query to load, and we ran into the problem that while Python is waiting for an answer from the database, no other requests are served.
So far our attempt at running the queries in an asynchronous way is captured in this testcase:
from aiohttp import web
from multiprocessing.pool import ThreadPool
import asyncio
import pypyodbc

pool = ThreadPool(processes=1)

def query_db(query, args=(), one=False):
    conn = pypyodbc.connect(CONNECTION_STRING)
    cur = conn.cursor()
    cur.execute(query, args)
    result = cur.fetchall()
    conn.commit()
    return result

def get(data):
    # Base query
    query = "..."  # query that takes ~10 seconds
    result = query_db(query, [])
    return result

async def slow(request):
    loop = asyncio.get_event_loop()
    # result = loop.run_in_executor(None, get, [])
    result = pool.apply_async(get, (1,))
    x = result.get()
    return web.Response(text="slow")

async def fast(request):
    return web.Response(text="fast")

if __name__ == '__main__':
    app = web.Application()
    app.router.add_get('/slow', slow)
    app.router.add_get('/fast', fast)
    web.run_app(app, host='127.0.0.1', port=5561)
This did not work, as requesting the slow and fast page in that order still had the fast load waiting until the slow load was done.
I tried to find alternatives, but could not find a solution that works and fits our environment:
aiopg.sa can do asynchronous queries, but switching away from SQL server to PostgreSQL is not an option
uwsgi seems to be usable with Flask to support multiple threads, but it cannot be pip-installed on Windows
Celery seems similar to our current approach, but it would need a message broker, which is non-trivial to set up on our system
It sounds like the issue isn’t the Flask application itself but rather your WSGI server. A single Flask worker handles one request at a time. To serve the app so that multiple people can hit it simultaneously, configure the WSGI server to use more workers. If a request hits the server while your long query is running, another worker (with its own instance of the app) can serve it. This is easy to set up in IIS.
Of course, if you have four workers and four clients run the long function simultaneously, you’re back in this situation. If that will happen frequently, you can assign more workers or move to an async-capable framework like Quart or Sanic.
The approach you’ve outlined above should speed up the execution of the long process, though. But Flask itself is not designed to await. It holds the thread until it’s finished.
More details in this answer: https://stackoverflow.com/a/19411051/5093960
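Since the answer mentions Quart, here is a minimal sketch (not from the original post) of how the slow route might look there, pushing the blocking pypyodbc call onto a thread so the event loop stays responsive. The query is a placeholder and query_db is assumed to be the helper from the question:

import asyncio
from quart import Quart

app = Quart(__name__)

@app.route('/slow')
async def slow():
    loop = asyncio.get_running_loop()
    # Run the blocking database call in a worker thread so other requests
    # (like /fast) keep being served while we wait.
    result = await loop.run_in_executor(None, query_db, "select ...", ())
    return "slow"

@app.route('/fast')
async def fast():
    return "fast"

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5561)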
I'm running PostgreSQL 9.3 and SQLAlchemy 0.8.2 and I'm seeing database connections leak. After deploying, the app consumes around 240 connections. Over the next 30 hours this number gradually grows to 500, at which point PostgreSQL starts dropping connections.
I use SQLAlchemy thread-local sessions:
import os
from sqlalchemy import orm, create_engine

engine = create_engine(os.environ['DATABASE_URL'], echo=False)
Session = orm.scoped_session(orm.sessionmaker(engine))
For the Flask web app, .remove() is called on the Session proxy object during request teardown:
@app.teardown_request
def teardown_request(exception=None):
    if not app.testing:
        Session.remove()
This should be the same as what Flask-SQLAlchemy is doing.
I also have some periodic tasks that run in a loop, and I call .remove() for every iteration of the loop:
def run_forever():
    while True:
        do_stuff(Session)
        Session.remove()
What am I doing wrong which could lead to a connection leak?
If I remember correctly from my experiments with SQLAlchemy, the scoped_session() is used to create sessions that you can access from multiple places. That is, you create a session in one method and use it in another without explicitly passing the session object around.
It does that by keeping a registry of sessions and associating each with a "scope ID". By default it uses the current thread ID as the scope ID, so you get one session per thread. You can supply a scopefunc to provide, for example, one ID per request:
# This is (approx.) what flask-sqlalchemy does:
from flask import _request_ctx_stack as context_stack

Session = orm.scoped_session(orm.sessionmaker(engine),
                             scopefunc=context_stack.__ident_func__)
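For illustration, a rough sketch of how the scoped session is then used inside a request; the route and model are hypothetical, and app/Session come from the code above:

# Hypothetical usage: any code that runs during the request gets the same
# session back from the scoped_session registry without passing it around.
@app.route('/users')
def list_users():
    users = Session.query(User).all()  # User is a placeholder model
    return ", ".join(u.name for u in users)

# Session.remove() in teardown_request then closes that session and returns
# its connection to the engine's pool.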
Also, take note of the other answers and comments about doing background tasks.
First of all, this is a really bad way to run background tasks. Try an async task queue like Celery instead.
Not 100% sure, so this is a bit of a guess based on the information provided, but I wonder if each page load is starting a new db connection which then listens for notifications. If this is the case, the db connection is effectively removed from the pool and a new one gets created on the next page load.
If this is the case, my recommendation would be to have a separate DBI database handle dedicated to listening for notifications so that these are not active in the queue. This might be done outside your workflow.
Also, the leak happens particularly when making more than one simultaneous request. At the same time, I could see that some of the requests were left with uncompleted query executions and timed out. You can write something to manage this yourself.
I'm using the concurrent.futures module to run jobs in parallel. It runs quite well.
The start time and completion time get updated in the MySQL database whenever a job starts or ends. Also, each job gets its input files from the database and saves its output files to the database. I'm getting the errors
"Error 2006:MySQL server has gone away"
and
"Error 2013: Lost connection to MySQL server during query" while running the script.
I don't face these errors while running a single job.
Sample Script:
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=pool_size)
futures = []
for i in self.parent_job.child_jobs:
    futures.append(executor.submit(invokeRunCommand, i))

def invokeRunCommand(self):
    self.saveStartTime()
    self.getInputFiles()
    runShellCommand()
    self.saveEndTime()
    self.saveOutputFiles()
I'm using a single database connection and cursor to execute all the queries. Some queries are time-consuming. I'm not sure why I'm hitting these errors. Could someone clarify?
-Thanks
Yes, a single connection to the database is not thread-safe, so if you're using the same database connection for multiple threads, things will fail.
If your pseudocode is representative, just open and use a separate database connection for each thread in your invokeRunCommand and things should be fine.
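A minimal sketch of that suggestion; the connection details, table, and bookkeeping queries are placeholders, loosely following the pseudocode in the question:

import concurrent.futures
import mysql.connector

def invoke_run_command(job_id):
    # Each worker thread opens and owns its own connection and cursor.
    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="pwd", database="db")
    try:
        cur = conn.cursor()
        cur.execute("UPDATE jobs SET start_time = NOW() WHERE id = %s", (job_id,))
        conn.commit()
        # ... fetch input files, run the shell command, save output files ...
        cur.execute("UPDATE jobs SET end_time = NOW() WHERE id = %s", (job_id,))
        conn.commit()
    finally:
        conn.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(invoke_run_command, j) for j in (1, 2, 3)]

Because no connection is shared across threads, the "server has gone away" and "lost connection" errors caused by interleaved use of one connection should disappear.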