Connection pooling is supposed to improve the throughput of Postgres, or at least that is what everyone says when googling what pooling is and what its benefits are. However, whenever I experiment with connection pooling in Flask, the result is always that it is significantly slower than just opening a connection and a cursor at the beginning of the file and never closing them. If my web app is constantly getting requests from users, why do we even close the connection and the cursor? Isn't it better to create a connection and a cursor once and then, whenever we get a request (whether GET or POST), simply use the existing cursor and connection? Am I missing something here?
Here is the timing for each approach, and below is the code I ran to benchmark them:
it took 16.537524700164795 seconds to finish 100000 queries with one database connection that we opened once at the beginning of the flask app and never closed
it took 38.07477355003357 seconds to finish 100000 queries with the psycopg2 pooling approach
it took 52.307902574539185 seconds to finish 100000 queries with the pgbouncer pooling approach
Also, here is a video of the test running, with the results, in case it is of any help:
https://youtu.be/V2kzKApDs8Y
The Flask app that I used to benchmark each approach is:
import psycopg2
import time
from psycopg2 import pool
from flask import Flask

app = Flask(__name__)

connection_pool = pool.SimpleConnectionPool(1, 50, host="localhost", database="test", user="postgres", password="test", port="5432")

connection = psycopg2.connect(host="127.0.0.1", database="test", user="postgres", password="test", port="5432")
cursor = connection.cursor()

pgbouncerconnection_pool = pool.SimpleConnectionPool(1, 50, host="127.0.0.1", database="test", user="postgres", password="test", port="6432")

@app.route("/poolingapproach")
def zero():
    start = time.time()
    for x in range(100000):
        with connection_pool.getconn() as connectionp:
            with connectionp.cursor() as cursorp:
                cursorp.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = cursorp.fetchone()
        connection_pool.putconn(connectionp)
    y = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with the pooling approach"
    return str(y), 200

@app.route("/pgbouncerpooling")
def one():
    start = time.time()
    for x in range(100000):
        with pgbouncerconnection_pool.getconn() as pgbouncer_connection:
            with pgbouncer_connection.cursor() as pgbouncer_cursor:
                pgbouncer_cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = pgbouncer_cursor.fetchone()
        pgbouncerconnection_pool.putconn(pgbouncer_connection)
    a = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with pgbouncer pooling approach"
    return str(a), 200

@app.route("/oneconnection_at_the_begining")
def two():
    start = time.time()
    for x in range(100000):
        cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
        result = cursor.fetchone()
    end = time.time()
    x = "it took " + str(end - start) + " seconds to finish 100000 queries with one database connection that we don't close"
    return str(x), 200

if __name__ == "__main__":
    app.run()
I am not sure how you're testing, but the idea of a connection pool is to address basically two things:
you have way more users trying to connect to the DB than the DB can directly handle (e.g. your DB allows 100 connections and there are 1000 users trying to use it at the same time), and
saving the time you would spend opening and closing connections, because pooled connections are always open already.
So I believe your tests should focus on these two situations.
If your test is just one user issuing a lot of queries, a connection pool is pure overhead, because it opens extra connections that you simply don't need.
If your test always returns the same data, your DBMS will probably cache the results and the query plan, which will skew the results. In fact, in order to scale things, you could even benefit from a secondary cache such as Elasticsearch.
Now, if you want to perform a realistic test, you should mix read and write operations, add some random values (forcing the DB not to cache results or query plans) and try incremental loads, so you can see how the system behaves each time you add more simultaneous clients performing requests; a rough sketch of such a test follows below.
And because these clients also add CPU load to the test, you may also consider running the clients on a different machine from the one that serves your DB, to keep the results fair.
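To make that concrete, here is a rough sketch of such a test (my own, not from the question) that reuses the question's connection_pool and tb1 table and simply varies the number of concurrent clients; one_query, benchmark and the worker/query counts are arbitrary names and numbers:

import time
from concurrent.futures import ThreadPoolExecutor

def one_query(i):
    # check a connection out, run one query, hand the connection back
    conn = connection_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM tb1 WHERE id = %s", [i % 100])
            cur.fetchone()
    finally:
        connection_pool.putconn(conn)

def benchmark(workers, queries=10000):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(one_query, range(queries)))
    return time.time() - start

for workers in (1, 10, 50):
    print(workers, "concurrent clients:", benchmark(workers), "seconds")

With a single worker the pool should still lose to the never-closed connection; the interesting numbers are the ones with many workers, where a single shared connection would serialize the queries while the pool can serve them concurrently.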
Related
I am doing a PoC where I want to implement connection pooling in Python. I took the help of snowflake-sqlalchemy, as my target is a Snowflake database. I wanted to check the following scenarios:
Run multiple stored procedures/SQL statements in parallel.
If the number of SPs/SQL statements that need to execute in parallel is greater than the pool connections, the code should wait. I am writing a while loop for this purpose.
I have defined a function func1 and execute it in parallel multiple times to simulate multi-threaded processing.
Problem description:
Everything is working fine. The pool size is 2 and max_overflow is 5. When the code runs, it runs 7 parallel SPs/SQL statements instead of 10. This is understood: the pool can only support 7 connections running in parallel, and the rest have to wait until one of the thread executions is complete. On the Snowflake side, the session is established, the query executes, and then a rollback is triggered. I am unable to figure out why the rollback is triggered. I tried explicitly issuing a commit statement from my code as well; it executes the commit and then the rollback is still triggered. The second problem is that even though I am not closing the connection, after the thread execution is complete the connection is closed automatically and the next set of statements is triggered. Ideally, I was expecting the thread not to release the connection, since the connection is never closed.
Summary: I want to know two things.
Why is the rollback happening?
Why is the connection released after the thread execution completes, even though the connection is not returned to the pool?
May I please request help in understanding the execution process? I am stuck and unable to figure out what's happening.
import time
from concurrent.futures import ThreadPoolExecutor, wait
from functools import partial

from sqlalchemy.pool import QueuePool
from snowflake.connector import connect


def get_conn():
    return connect(
        user='<username>',
        password="<password>",
        account='<account>',
        warehouse='<WH>',
        database='<DB>',
        schema='<Schema>',
        role='<role>'
    )


pool = QueuePool(get_conn, max_overflow=5, pool_size=2, timeout=10.0)


def func1():
    conn = pool.connect()
    print("Connection is created and SP is triggered")
    print("Pool size is:", pool.size())
    print("Checked OUt connections are:", pool.checkedout())
    print("Checked in Connections are:", pool.checkedin())
    print("Overflow connections are:", pool.overflow())
    print("current status of pool is:", pool.status())
    c = conn.cursor()
    # c.execute("select current_session()")
    c.execute("call SYSTEM$WAIT(60)")
    # c.execute("commit")
    # conn.close()


func_list = []
for _ in range(0, 10):
    func_list.append(partial(func1))
print(func_list)

proc = []
with ThreadPoolExecutor() as executor:
    for func in func_list:
        print("Calling func")
        while pool.checkedout() == pool.size() + pool._max_overflow:
            print("Max connections reached. Waiting for an existing connection to release")
            time.sleep(15)
        p = executor.submit(func)
        proc.append(p)
    wait(proc)

pool.dispose()
print("process complete")
Using Google Cloud PostgreSQL, the execution time of a simple query is 3 ms on the Google Console, but when I send the request from Python on my Mac, it takes 4 seconds to get the response and print it.
I do create a connection every time I run the script:
engine = sqlalchemy.create_engine('postgresql://' + username + ':' + password + '@' + host + ':' + port + '/' + database)
and then I send the query with pandas:
df = pd.read_sql_query(e, engine)
Is this something that slows down the round trip? Should I not create a connection every time? How can I get a much faster response?
The 3 ms for the query grows to 4 seconds to get the final response; is this going to improve if I run it as a web client sending a normal REST API request?
I recommend you test the performance inside the code. That way you can measure the time each part of your code takes and decide whether the slowness comes from your environment (for example, the startup time of your script) or from waiting for the response from the server side.
To do that you can use the time library from the Python standard library. An example of how to do this:
import time
start_time = time.time()
#
# Your code stuff
#
elapsed_time = time.time() - start_time
print('It took {} seconds to run your code'.format(elapsed_time))
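Applied to your script, that could look roughly like the sketch below (it reuses the username, password, host, port, database and query variable e from your question). It separates building the engine, the first query (which also pays the connection setup cost, since SQLAlchemy connects lazily), and a second query that reuses the already open connection:

import time
import pandas as pd
import sqlalchemy

t0 = time.time()
engine = sqlalchemy.create_engine(
    'postgresql://' + username + ':' + password + '@' + host + ':' + port + '/' + database)
t1 = time.time()
df = pd.read_sql_query(e, engine)   # first query: includes opening the connection
t2 = time.time()
df = pd.read_sql_query(e, engine)   # second query: connection is already open
t3 = time.time()

print('create_engine:          {:.3f} s'.format(t1 - t0))
print('first read_sql_query:   {:.3f} s'.format(t2 - t1))
print('second read_sql_query:  {:.3f} s'.format(t3 - t2))

If the second query is much faster than the first, most of the 4 seconds is connection setup (network round trips, SSL handshake), and keeping a long-lived engine, for example inside a web service, should help far more than tuning the query itself.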
I have a Flask web app that might need to execute some heavy SQL queries via SQLAlchemy given user input. I would like to set a timeout for the query, let's say 20 seconds, so if a query takes more than 20 seconds the server will display an error message to the user so they can try again later or with a smaller input.
I have tried both the multiprocessing and threading modules, with both the Flask development server and Gunicorn, without success: the server keeps blocking and no error message is returned. You'll find an excerpt of the code below.
How do you handle slow SQL queries in Flask in a user-friendly way?
Thanks.
from multiprocessing import Process

@app.route("/long_query")
def long_query():
    query = db.session.query(User)

    def run_query():
        nonlocal query
        query = query.all()

    p = Process(target=run_query)
    p.start()
    p.join(20)  # timeout of 20 seconds
    if p.is_alive():
        p.terminate()
        return render_template("error.html", message="please try with smaller input")
    return render_template("result.html", data=query)
I would recommend using Celery or something similar (people use python-rq for simple workflows).
Take a look at Flask documentation regarding Celery: http://flask.pocoo.org/docs/0.12/patterns/celery/
As for dealing with the results of a long-running query: you can create an endpoint for requesting task results and have the client application periodically check this endpoint until the results are available.
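A rough sketch of that pattern, assuming a celery object configured as in the linked documentation and the question's app, db, User and render_template; the run_query task, the /result/<task_id> route and the template names are made up for illustration, and the task returns plain ids because task results must be serializable:

from celery.result import AsyncResult
from flask import request

@celery.task
def run_query(user_input):
    # the slow SQLAlchemy query runs in the worker process, not in the web process
    users = db.session.query(User).filter(User.name == user_input).all()
    return [u.id for u in users]

@app.route("/long_query")
def long_query():
    task = run_query.delay(request.args.get("q", ""))
    return render_template("pending.html", task_id=task.id)

@app.route("/result/<task_id>")
def result(task_id):
    res = AsyncResult(task_id, app=celery)
    if not res.ready():
        return render_template("pending.html", task_id=task_id), 202
    return render_template("result.html", data=res.get())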
As leovp mentioned, Celery is the way to go if you are working on a long-term project. However, if you are working on a small project where you want something easy to set up, I would suggest going with RQ and its Flask plugin. Also, think very seriously about terminating processes while they are querying the database, since they might not be able to clean up after themselves (e.g. releasing locks they hold on the DB).
Well if you really want to terminate queries on a timeout, I would suggest you use a database that supports it (PostgreSQL is one). I will assume you use PostgreSQL for this section.
from sqlalchemy.interfaces import ConnectionProxy


class ConnectionProxyWithTimeouts(ConnectionProxy):
    def cursor_execute(self, execute, cursor, statement, parameters, context, executemany):
        timeout = context.execution_options.get('timeout', None)
        if timeout:
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO %d;' % int(timeout * 1000))
            c.close()
            ret = execute(cursor, statement, parameters, context)
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO 0')
            c.close()
            return ret
        else:
            return execute(cursor, statement, parameters, context)
Then, when you create the engine, you would pass in your own connection proxy:
engine = create_engine(URL, proxy=ConnectionProxyWithTimeouts(), pool_size=1, max_overflow=0)
And then you could query like this:
User.query.execution_options(timeout=20).all()
If you want to use the code above, use it only as a base for your own implementation, since I am not 100% sure it's bug free.
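A simpler variant of the same idea, if one timeout for every query on the engine is enough (a sketch, not the answer's code): psycopg2 lets you pass libpq options at connect time, so you can ask PostgreSQL itself to cancel anything that runs longer than 20 seconds.

from sqlalchemy import create_engine

# DATABASE_URL is a placeholder, e.g. postgresql://user:password@host/dbname
engine = create_engine(
    DATABASE_URL,
    connect_args={"options": "-c statement_timeout=20000"},  # value in milliseconds
)

Queries that exceed the limit are cancelled server-side and raise an error in the Flask view, which you can catch and turn into the "please try with smaller input" page.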
I am using Google's python API client library on App Engine to run a number of queries in Big Query to generate live analytics. The calls take roughly two seconds each and with five queries, this is too long, so I looked into ways to speed things up and thought running queries asynchronously would be a solid improvement. The thinking was that I could insert the five queries at once and Google would do some magic to run them all at the same time and then use jobs.getQueryResults(jobId) to get the results for each job. I decided to test the theory out with a proof of concept by timing the execution of two asynchronous queries and comparing it to running queries synchronously. The results:
synchronous: 3.07 seconds (1.34s and 1.29s for each query)
asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())
Which is only a difference of 0.68 seconds. So while asynchronous queries are faster, they aren't achieving the goal of Google parallel magic to cut down on total execution time. So first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that
An asynchronous query returns a response immediately, generally before
the query completes.
Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the Big Query team will be the only ones that can answer this, but I welcome any answers!
EDIT NOTES:
Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response and found:
creationTime to startTime: 462ms, 387ms (timing for queries 1 and 2)
startTime to endTime: 744ms, 1005ms
Though I'm not sure if that adds anything to the story as it's the timing between issuing insert() and the call completing that I'm wondering about.
From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:
You can run multiple jobs concurrently in BigQuery
CODE:
For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3, but replicated the results. In my research I also found out about partitioned tables, which I wish I had known about before (and which may well end up being my solution), but this question stands on its own. Here is my code; I am omitting the actual SQL because it is irrelevant in this case:
def test_sync(self, request):
    t0 = time.time()
    request = bigquery.jobs()
    data = {'query': (sql)}
    response = request.query(projectId=project_id, body=data).execute()
    t1 = time.time()
    data = {'query': (sql)}
    response = request.query(projectId=project_id, body=data).execute()
    t2 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("elapsed: " + str(t2 - t0))


def test_async(self, request):
    job_ids = {}
    t0 = time.time()
    job_id = async_query(sql)
    job_ids['a'] = job_id
    print("job_id: " + job_id)
    t1 = time.time()
    job_id = async_query(sql)
    job_ids['b'] = job_id
    print("job_id: " + job_id)
    t2 = time.time()
    for key, value in job_ids.iteritems():
        response = bigquery.jobs().getQueryResults(
            jobId=value,
            projectId=project_id).execute()
    t3 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("2-3: " + str(t3 - t2))
    print("elapsed: " + str(t3 - t0))


def async_query(sql):
    job_data = {
        'jobReference': {
            'projectId': project_id
        },
        'configuration': {
            'query': {
                'query': sql,
                'priority': 'INTERACTIVE'
            }
        }
    }
    response = bigquery.jobs().insert(
        projectId=project_id,
        body=job_data).execute()
    job_id = response['jobReference']['jobId']
    return job_id
The answer to whether running queries in parallel will speed up the results is, of course, "it depends".
When you use the asynchronous job API there is about a half a second of built-in latency that gets added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.
The half second latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds round-trip time, but if you're in Brazil, it might be 100 ms.
Moreover, when you do jobs.query(), the BigQuery API server that receives the request is the same one that starts the query. It can return the results as soon as the query is done. But when you use the asynchronous api, your getQueryResults() request is going to go to a different server. That server has to either poll for the job state or find the server that is running the request to get the status. This takes time.
So if you're running a bunch of queries in parallel, where each one takes 1-2 seconds, but you're adding half a second to each one, plus half a second for the initial request, you're not likely to see a whole lot of speedup. If your queries, on the other hand, take 5 or 10 seconds each, the fixed overhead is a smaller percentage of the total time.
My guess is that if you ran a larger number of queries in parallel, you'd see more speedup. The other option is to use the synchronous version of the API, but use multiple threads on the client to send multiple requests in parallel, as sketched below.
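A sketch of that second option using plain threads and the same synchronous jobs.query() call as the question's test_sync (the thread count, the responses dict and the reuse of a single sql variable are placeholders for the five real analytics queries):

import threading

responses = {}

def run_one(key, sql):
    # one synchronous jobs.query() per thread; each blocks until its query is done
    body = {'query': sql}
    responses[key] = bigquery.jobs().query(
        projectId=project_id, body=body).execute()

threads = [threading.Thread(target=run_one, args=(i, sql)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()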
There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.
That said, 2000 slots is a lot. In a very rough estimate, 2000 slots can process hundreds of Gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.
On my local machine the script runs fine, but in the cloud it returns a 500 error every time. This is a cron task, so I don't really mind if it takes 5 minutes...
<class 'google.appengine.runtime.DeadlineExceededError'>:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 seconds, but you can indirectly increase the timeout by employing task queues: write a task that gradually iterates through your data set and processes it. Each such task run should of course fit within the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
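A rough sketch of that approach with the deferred library, which re-queues a task carrying the cursor from the previous batch; the Foo model, do_something_with and BATCH_SIZE are placeholders, not from the original answer:

from google.appengine.ext import deferred

BATCH_SIZE = 500

def process_chunk(start_cursor=None):
    query = Foo.all()                  # placeholder model
    if start_cursor:
        query.with_cursor(start_cursor)
    batch = query.fetch(BATCH_SIZE)
    for entity in batch:
        do_something_with(entity)      # placeholder per-entity work
    if len(batch) == BATCH_SIZE:
        # more to do: hand the cursor to a fresh task so each run stays under the deadline
        deferred.defer(process_chunk, query.cursor())

# kicked off once from the cron handler
deferred.defer(process_chunk)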
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live for more than about 2 minutes, and a batch cannot live for more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()
    if start_cursor:
        query.with_cursor(start_cursor)
    return query


batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1

        # Do something

        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor


def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break