Using Google Cloud PostgreSQL, the execution time of a simple query is 3 ms in the Google Console, but when I send the request from Python on my Mac, it takes 4 seconds to get the response and print it.
I create a connection every time I run the script:
engine = sqlalchemy.create_engine('postgresql://' + username + ':' + password + '@' + host + ':' + port + '/' + database)
and then I send query with pandas:
df = pd.read_sql_query(e, engine)
Is this something that slows down the round trip? Should I avoid creating a connection every time? How can I get a much faster response?
3 ms for the query grows to 4 seconds to get the final response. Is this going to improve if I run it as a web client sending a normal REST API request?
I recommend testing the performance inside the code. This way you can measure the time each part of your code takes and decide whether the delay comes from your environment (for example, the startup time of your script) or from the server's response.
To do that you can use the time Python library. An example of how to do this could be:
import time
start_time = time.time()
#
# Your code stuff
#
elapsed_time = time.time() - start_time
print('It took {} seconds to run your code'.format(elapsed_time))
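To narrow it down further in your specific case, you could time each stage separately: creating the engine, opening the actual connection, and running the query. The sketch below reuses the variables from your question (username, password, host, port, database and the query string e), so treat it as an outline rather than a drop-in script:
import time
import pandas as pd
import sqlalchemy

t0 = time.time()
# create_engine() is cheap; the real network connection is opened lazily on first use
engine = sqlalchemy.create_engine('postgresql://' + username + ':' + password + '@' + host + ':' + port + '/' + database)
t1 = time.time()

# Force the TCP/SSL handshake and authentication to happen now
with engine.connect() as conn:
    t2 = time.time()
    # The query itself; this is the part that shows ~3 ms in the console
    df = pd.read_sql_query(e, conn)
    t3 = time.time()

print('create_engine: {:.3f}s'.format(t1 - t0))
print('connect:       {:.3f}s'.format(t2 - t1))
print('query + fetch: {:.3f}s'.format(t3 - t2))
If the connect step dominates, the usual fix is to create the engine once and reuse it (and its pooled connections) across queries instead of rebuilding it for every script run.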
Connection pooling is supposed to improve the throughput of Postgres, or at least that is what everyone says when googling what pooling is and its benefits. However, whenever I experiment with connection pooling in Flask, the result is always that it is significantly slower than just opening a connection and a cursor at the beginning of the file and never closing them. If my web app is constantly getting requests from users, why do we even close the connection and the cursor? Isn't it better to create a connection and a cursor once and then, whenever we get a request, whether GET or POST, simply use the existing cursor and connection? Am I missing something here?
Here is the timing for each approach, and below is the code I ran to benchmark each one:
it took 16.537524700164795 seconds to finish 100000 queries with one database connection that we opened once at the beginning of the flask app and never closed
it took 38.07477355003357 seconds to finish 100000 queries with the psycopg2 pooling approach
it took 52.307902574539185 seconds to finish 100000 queries with the pgbouncer pooling approach
Here is also a video running the test with the results, in case it is of any help:
https://youtu.be/V2kzKApDs8Y
The Flask app that I used to benchmark each approach is:
import psycopg2
import time
from psycopg2 import pool
from flask import Flask

app = Flask(__name__)

connection_pool = pool.SimpleConnectionPool(1, 50, host="localhost", database="test", user="postgres", password="test", port="5432")
connection = psycopg2.connect(host="127.0.0.1", database="test", user="postgres", password="test", port="5432")
cursor = connection.cursor()
pgbouncerconnection_pool = pool.SimpleConnectionPool(1, 50, host="127.0.0.1", database="test", user="postgres", password="test", port="6432")

@app.route("/poolingapproach")
def zero():
    start = time.time()
    for x in range(100000):
        with connection_pool.getconn() as connectionp:
            with connectionp.cursor() as cursorp:
                cursorp.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = cursorp.fetchone()
        connection_pool.putconn(connectionp)
    y = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with the pooling approach"
    return str(y), 200

@app.route("/pgbouncerpooling")
def one():
    start = time.time()
    for x in range(100000):
        with pgbouncerconnection_pool.getconn() as pgbouncer_connection:
            with pgbouncer_connection.cursor() as pgbouncer_cursor:
                pgbouncer_cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = pgbouncer_cursor.fetchone()
        pgbouncerconnection_pool.putconn(pgbouncer_connection)
    a = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with pgbouncer pooling approach"
    return str(a), 200

@app.route("/oneconnection_at_the_begining")
def two():
    start = time.time()
    for x in range(100000):
        cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
        result = cursor.fetchone()
    end = time.time()
    x = 'it took ' + str(end - start) + ' seconds to finish 100000 queries with one database connection that we don\'t close'
    return str(x), 200

if __name__ == "__main__":
    app.run()
I am not sure how you're testing, but the idea of the connection pool is to address basically 2 things:
you have way more users trying to connect to a db than the db can directly handle (e.g. your db allows 100 connections and there are 1000 users trying to use it at the same time), and
save time that you would spend opening and closing connections because they're always open already.
So I believe your tests must be focused on these two situations.
If your test is just one user issuing a lot of queries, a connection pool is just overhead, because it will open extra connections that you don't need.
If your test always returns the same data, your DBMS will probably cache the results and the query plan, so that will affect the results. In fact, in order to scale things, you could even benefit from a secondary cache such as Elasticsearch.
Now, if you want to perform a realistic test, you must mix read and write operations, add some random variables (forcing the DB not to cache results or query plans) and try incremental loads, so you can see how the system behaves each time you add more simultaneous clients performing requests.
And because these clients also add CPU load to the test, you may also consider running the clients on a different machine than the one that serves your DB, to keep the results fair.
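As a rough illustration of that kind of test, you could drive the pool from several threads instead of a single loop, so the pool actually has concurrent borrowers. This sketch assumes the same connection_pool, table and credentials as the benchmark app above; the thread and query counts are arbitrary:
import threading
import time

def worker(n_queries):
    # Each thread borrows its own connection from the pool for every query,
    # which is the concurrent situation pooling is actually designed for.
    for x in range(n_queries):
        conn = connection_pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT * from tb1 where id = %s", [x % 100])
                cur.fetchone()
        finally:
            connection_pool.putconn(conn)

start = time.time()
threads = [threading.Thread(target=worker, args=(5000,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("20 concurrent clients, 5000 queries each:", time.time() - start, "seconds")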
I am using Google's python API client library on App Engine to run a number of queries in Big Query to generate live analytics. The calls take roughly two seconds each and with five queries, this is too long, so I looked into ways to speed things up and thought running queries asynchronously would be a solid improvement. The thinking was that I could insert the five queries at once and Google would do some magic to run them all at the same time and then use jobs.getQueryResults(jobId) to get the results for each job. I decided to test the theory out with a proof of concept by timing the execution of two asynchronous queries and comparing it to running queries synchronously. The results:
synchronous: 3.07 seconds (1.34s and 1.29s for each query)
asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())
Which is only a difference of 0.68 seconds. So while asynchronous queries are faster, they aren't achieving the goal of Google parallel magic to cut down on total execution time. So first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that
An asynchronous query returns a response immediately, generally before
the query completes.
Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the Big Query team will be the only ones that can answer this, but I welcome any answers!
EDIT NOTES:
Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response and found:
creationTime to startTime: 462ms, 387ms (timing for queries 1 and 2)
startTime to endTime: 744ms, 1005ms
Though I'm not sure if that adds anything to the story as it's the timing between issuing insert() and the call completing that I'm wondering about.
From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:
You can run multiple jobs concurrently in BigQuery
CODE:
For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3, but replicated the results. In my research I also found out about partitioned tables, which I wish I had known about before (and which may well end up being my solution), but this question stands on its own. Here is my code; I am omitting the actual SQL because it is irrelevant in this case:
def test_sync(self, request):
    t0 = time.time()
    request = bigquery.jobs()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t1 = time.time()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t2 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("elapsed: " + str(t2 - t0))

def test_async(self, request):
    job_ids = {}
    t0 = time.time()
    job_id = async_query(sql)
    job_ids['a'] = job_id
    print("job_id: " + job_id)
    t1 = time.time()
    job_id = async_query(sql)
    job_ids['b'] = job_id
    print("job_id: " + job_id)
    t2 = time.time()
    for key, value in job_ids.iteritems():
        response = bigquery.jobs().getQueryResults(
            jobId=value,
            projectId=project_id).execute()
    t3 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("2-3: " + str(t3 - t2))
    print("elapsed: " + str(t3 - t0))

def async_query(sql):
    job_data = {
        'jobReference': {
            'projectId': project_id
        },
        'configuration': {
            'query': {
                'query': sql,
                'priority': 'INTERACTIVE'
            }
        }
    }
    response = bigquery.jobs().insert(
        projectId=project_id,
        body=job_data).execute()
    job_id = response['jobReference']['jobId']
    return job_id
The answer to whether running queries in parallel will speed up the results is, of course, "it depends".
When you use the asynchronous job API there is about a half a second of built-in latency that gets added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.
The half second latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds round-trip time, but if you're in Brazil, it might be 100 ms.
Moreover, when you do jobs.query(), the BigQuery API server that receives the request is the same one that starts the query. It can return the results as soon as the query is done. But when you use the asynchronous API, your getQueryResults() request is going to go to a different server. That server has to either poll for the job state or find the server that is running the request to get the status. This takes time.
So if you're running a bunch of queries in parallel, where each one takes 1-2 seconds but you're adding half a second to each, plus another half second for the initial request, you're not likely to see a whole lot of speedup. If your queries, on the other hand, take 5 or 10 seconds each, the fixed overhead would be smaller as a percentage of the total time.
My guess is that if you ran a larger number of queries in parallel, you'd see more speedup. The other option is to use the synchronous version of the API, but use multiple threads on the client to send multiple requests in parallel.
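For example, a rough sketch of that last option with plain threads, reusing the same synchronous jobs.query() call as in the question (sql_1 and sql_2 are placeholder query strings, and depending on the client library version each thread may need its own service/http object rather than sharing one):
import threading

responses = {}

def run_query(key, sql):
    # Synchronous jobs.query() call; each thread blocks until its own query finishes
    responses[key] = bigquery.jobs().query(
        projectId=project_id, body={'query': sql}).execute()

threads = [threading.Thread(target=run_query, args=(i, sql))
           for i, sql in enumerate([sql_1, sql_2])]
for t in threads:
    t.start()
for t in threads:
    t.join()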
There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.
That said, 2000 slots is a lot. In a very rough estimate, 2000 slots can process hundreds of Gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.
I have a Django web app which is used by embedded systems to upload regular data, currently every 2 minutes, to the server where Django just pops it into a database.
I'd like to create an alert system where by if there's no data uploaded from the remote system in a time period, say 10 minutes for example, I raise an alarm on the server, via email or something.
In other programming languages/environments I'd create a 10 minute timer to execute a function in 10 minutes, but every time data is uploaded I'd restart the timer. Thus hopefully the timer would never expire and the expiry function would never get called.
I might well have missed something obvious, but this just does not seem possible in Python. Have I missed something?
At present it looks like I need an external daemon monitoring the database :-(
You could use the time module for this:
import time

def didEventHappen():
    # insert appropriate logic here to check
    # for what you want to check for every 10 minutes
    value = True  # this is just a placeholder so the code runs
    return value

def notifyServer():
    print("Hello server, the event happened")

start = time.monotonic()  # wall-clock timer that cannot jump backwards
delay = 10 * 60  # 10 minutes, converted to seconds

while True:
    interval = time.monotonic() - start
    eventHappened = False
    if interval >= delay:
        eventHappened = didEventHappen()
        start = time.monotonic()  # reset the timer
    if eventHappened:
        notifyServer()
    else:
        print("event did not happen")
    time.sleep(1)  # avoid burning CPU in a tight loop
Alternatively, you could use the sched module.
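If you specifically want the restartable-timer behaviour described in the question, a threading.Timer is another way to sketch it. This assumes a single long-lived process, which is not always true for typical Django deployments, so treat it as an illustration rather than a production design:
import threading

ALARM_DELAY = 10 * 60  # 10 minutes, in seconds

_timer = None
_lock = threading.Lock()

def raise_alarm():
    # Runs only if no upload arrived within ALARM_DELAY seconds
    print("No data received for 10 minutes; send the alert email here")

def data_received():
    # Call this from the upload view every time the device posts data
    global _timer
    with _lock:
        if _timer is not None:
            _timer.cancel()  # restart the countdown
        _timer = threading.Timer(ALARM_DELAY, raise_alarm)
        _timer.daemon = True
        _timer.start()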
Through a Python program I am sending a command to a specific device, and that device responds to the command. Now I have to calculate the timing between send and receive (that is, how much time it takes to get the response to the command).
Example:
device IP: 10.0.0.10
Transmitting the 'L004' command from our local system to 10.0.0.10.
Receiving the 'L' response from 10.0.0.10.
So now I have to calculate the time difference between the start time and the end time.
Please suggest an API with which I can calculate this.
import time
t1 = time.time()
# some time-demanding operations
t2 = time.time()
print "operation took around {0} seconds to complete".format(t2 - t1)
time.time() returns the current unix timestamp as a float number. Store this number at given points of your code and calculate the difference. You will get the time difference in seconds (and fractions).
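Applied to the device example, you would take one timestamp just before sending the command and another right after the reply arrives. The sketch below assumes the command goes over a plain TCP socket and uses a made-up port, since the question does not say how the command is transmitted:
import socket
import time

sock = socket.create_connection(('10.0.0.10', 5000))  # port 5000 is a placeholder
start = time.time()
sock.sendall(b'L004')        # transmit the command
reply = sock.recv(1024)      # wait for the device's 'L' response
elapsed = time.time() - start
print('device answered {!r} in {:.3f} seconds'.format(reply, elapsed))
sock.close()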
The timeit standard module makes it easy to do this kind of task.
Just Use "timeit" module. It works with both Python 2 And Python 3
import timeit

start = timeit.default_timer()
# ALL THE PROGRAM STATEMENTS
stop = timeit.default_timer()
execution_time = stop - start
print("Program executed in " + str(execution_time))  # It returns time in seconds
It returns the time in seconds, so you have your execution time. It is simple, but you should put these lines in the main function that starts program execution. If you want to get the execution time even when you hit an error, then pass the start parameter into the function and calculate it there, like this:
def sample_function(start, **kwargs):
    try:
        # your statements
        pass
    except Exception:
        # your exception handling
        stop = timeit.default_timer()
        execution_time = stop - start
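If what you are timing is a small snippet rather than a whole program, timeit can also run the statement repeatedly and you can average the cost yourself; a minimal sketch:
import timeit

# Runs the statement 1000 times and returns the total time in seconds
total = timeit.timeit("sum(range(1000))", number=1000)
print("average per run: {:.6f} seconds".format(total / 1000))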
On my local machine the script runs fine, but in the cloud it returns a 500 every time. This is a cron task, so I don't really mind if it takes 5 minutes...
<class 'google.appengine.runtime.DeadlineExceededError'>:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 seconds, but you can indirectly increase the timeout by employing task queues, writing tasks that gradually iterate through your data set and process it. Each such task run should of course fit into the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
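A rough sketch of what that chaining can look like with the deferred library; MyModel and the batch size are placeholders, and each task re-schedules itself with the cursor where the previous one stopped:
from google.appengine.ext import db, deferred

class MyModel(db.Model):
    pass  # placeholder entity class

BATCH_SIZE = 100

def process_chunk(cursor=None):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)
    entities = query.fetch(BATCH_SIZE)
    for entity in entities:
        pass  # do the per-entity work here, keeping each batch well under the deadline
    if len(entities) == BATCH_SIZE:
        # More data left: schedule another task that resumes at the cursor,
        # so no single request has to fit the whole job into its deadline.
        deferred.defer(process_chunk, query.cursor())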
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1
        # Do something with each resource here
        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break