Running asynchronous queries in BigQuery not noticeably faster - python

I am using Google's Python API client library on App Engine to run a number of queries in BigQuery to generate live analytics. The calls take roughly two seconds each and, with five queries, that is too long, so I looked into ways to speed things up and thought running the queries asynchronously would be a solid improvement. The thinking was that I could insert the five queries at once, Google would do some magic to run them all at the same time, and I could then use jobs.getQueryResults(jobId) to get the results for each job. I decided to test the theory with a proof of concept, timing the execution of two asynchronous queries and comparing it to running the same queries synchronously. The results:
synchronous: 3.07 seconds (1.34s and 1.29s for each query)
asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())
Which is only a difference of 0.68 seconds. So while asynchronous queries are faster, they aren't achieving the goal of Google parallel magic to cut down on total execution time. So first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that
An asynchronous query returns a response immediately, generally before the query completes.
Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the Big Query team will be the only ones that can answer this, but I welcome any answers!
EDIT NOTES:
Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response and found:
creationTime to startTime: 462ms, 387ms (timing for queries 1 and 2)
startTime to endTime: 744ms, 1005ms
Though I'm not sure if that adds anything to the story as it's the timing between issuing insert() and the call completing that I'm wondering about.
From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:
You can run multiple jobs concurrently in BigQuery
CODE:
For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3, but replicated the results. In my research I also found out about partitioned tables, which I wish I had known about before (and which may well end up being my solution), but this question stands on its own. Here is my code; I am omitting the actual SQL because it is irrelevant in this case:
def test_sync(self, request):
    t0 = time.time()
    request = bigquery.jobs()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t1 = time.time()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t2 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("elapsed: " + str(t2 - t0))

def test_async(self, request):
    job_ids = {}
    t0 = time.time()
    job_id = async_query(sql)
    job_ids['a'] = job_id
    print("job_id: " + job_id)
    t1 = time.time()
    job_id = async_query(sql)
    job_ids['b'] = job_id
    print("job_id: " + job_id)
    t2 = time.time()
    for key, value in job_ids.iteritems():
        response = bigquery.jobs().getQueryResults(
            jobId=value,
            projectId=project_id).execute()
    t3 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("2-3: " + str(t3 - t2))
    print("elapsed: " + str(t3 - t0))

def async_query(sql):
    job_data = {
        'jobReference': {
            'projectId': project_id
        },
        'configuration': {
            'query': {
                'query': sql,
                'priority': 'INTERACTIVE'
            }
        }
    }
    response = bigquery.jobs().insert(
        projectId=project_id,
        body=job_data).execute()
    job_id = response['jobReference']['jobId']
    return job_id

The answer to whether running queries in parallel will speed up the results is, of course, "it depends".
When you use the asynchronous job API there is about a half a second of built-in latency that gets added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.
The half second of latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to Google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds of round-trip time, but if you're in Brazil, it might be 100 ms.
Moreover, when you do jobs.query(), the BigQuery API server that receives the request is the same one that starts the query, so it can return the results as soon as the query is done. But when you use the asynchronous API, your getQueryResults() request is going to go to a different server. That server has to either poll for the job state or find the server that is running the request in order to get the status. This takes time.
So if you're running a bunch of queries in parallel, each taking 1-2 seconds, and you're adding half a second to each one plus another half second for the initial request, you're not likely to see a whole lot of speedup. If your queries, on the other hand, take 5 or 10 seconds each, the fixed overhead would be a smaller percentage of the total time.
My guess is that if you ran a larger number of queries in parallel, you'd see more speedup. The other option is to use the synchronous version of the API, but use multiple threads on the client to send multiple requests in parallel.
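A minimal sketch of that threaded approach (not from the original answer, and assuming the same bigquery service object, project_id, and SQL strings used in the question's code) might look like this:
import threading
import time

def sync_query(sql, results, index):
    # One synchronous jobs.query() call, exactly as in test_sync() above.
    data = {'query': sql}
    results[index] = bigquery.jobs().query(
        projectId=project_id, body=data).execute()

def run_parallel(sqls):
    # Each call still pays its own HTTP round trip, but the requests overlap.
    t0 = time.time()
    results = [None] * len(sqls)
    threads = [threading.Thread(target=sync_query, args=(sql, results, i))
               for i, sql in enumerate(sqls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("elapsed: " + str(time.time() - t0))
    return results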
There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.
That said, 2000 slots is a lot. As a very rough estimate, 2000 slots can process hundreds of gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.

Related

why does connection pooling slow down the sql queries in flask

Connection pooling is supposed to improve the throughput of Postgres, or at least that's what everyone says when you google what pooling is and what its benefits are. However, whenever I experiment with connection pooling in Flask, the result is always that it is significantly slower than just opening a connection and a cursor at the beginning of a file and never closing them. If my web app is constantly getting requests from users, why do we even close the connection and the cursor? Isn't it better to create a connection and a cursor once and then, whenever we get a request (whether GET or POST), simply use the existing cursor and connection? Am I missing something here?
Here is the timing for each approach and below is the code I ran to benchmark each approach
it took 16.537524700164795 seconds to finish 100000 queries with one database connection that we opened once at the beginning of the flask app and never closed
it took 38.07477355003357 seconds to finish 100000 queries with the psycopg2 pooling approach
it took 52.307902574539185 seconds to finish 100000 queries with the pgbouncer pooling approach
Also, here is a video running the test with the results, in case it is of any help:
https://youtu.be/V2kzKApDs8Y
The flask app that I used to benchmark each approach is
import psycopg2
import time
from psycopg2 import pool
from flask import Flask

app = Flask(__name__)
connection_pool = pool.SimpleConnectionPool(1, 50, host="localhost", database="test", user="postgres", password="test", port="5432")
connection = psycopg2.connect(host="127.0.0.1", database="test", user="postgres", password="test", port="5432")
cursor = connection.cursor()
pgbouncerconnection_pool = pool.SimpleConnectionPool(1, 50, host="127.0.0.1", database="test", user="postgres", password="test", port="6432")

@app.route("/poolingapproach")
def zero():
    start = time.time()
    for x in range(100000):
        with connection_pool.getconn() as connectionp:
            with connectionp.cursor() as cursorp:
                cursorp.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = cursorp.fetchone()
        connection_pool.putconn(connectionp)
    y = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with the pooling approach"
    return str(y), 200

@app.route("/pgbouncerpooling")
def one():
    start = time.time()
    for x in range(100000):
        with pgbouncerconnection_pool.getconn() as pgbouncer_connection:
            with pgbouncer_connection.cursor() as pgbouncer_cursor:
                pgbouncer_cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
                result = pgbouncer_cursor.fetchone()
        pgbouncerconnection_pool.putconn(pgbouncer_connection)
    a = "it took " + str(time.time() - start) + " seconds to finish 100000 queries with pgbouncer pooling approach"
    return str(a), 200

@app.route("/oneconnection_at_the_begining")
def two():
    start = time.time()
    for x in range(100000):
        cursor.execute("SELECT * from tb1 where id = %s", [x % 100])
        result = cursor.fetchone()
    end = time.time()
    x = 'it took ' + str(end - start) + ' seconds to finish 100000 queries with one database connection that we don\'t close'
    return str(x), 200

if __name__ == "__main__":
    app.run()
I am not sure how you're testing, but the idea of the connection pool is to address basically 2 things:
you have way more users trying to connect to a db than the db can directly handle (e.g. your db allows 100 connections and there are 1000 users trying to use it at the same time), and
save time that you would spend opening and closing connections because they're always open already.
So I believe your tests must be focused on these two situations.
If your test is just one user issuing a lot of queries, a connection pool will just be overhead, because it will try to open extra connections that you simply don't need.
If your test always returns the same data, your DBMS will probably cache the results and the query plan, which will affect the results. In fact, in order to scale things, you could even benefit from a secondary cache such as Elasticsearch.
Now, if you want to perform a realistic test, you must mix read and write operations, add some random variables into it (forcing the DB not to cache results or query plans) and try for incremental loads, so you can see how the system behaves each time you add more simultaneous clients performing requests.
And because these clients also add CPU load to the test, you may also consider running the clients on a different machine from the one that serves your DB, to keep the results fair.
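As an illustration of that kind of test, here is a minimal sketch (not the poster's benchmark) that exercises the pool from many concurrent clients, which is the scenario pooling is designed for; the table name and connection parameters are simply the ones from the question:
import random
import threading
import time

from psycopg2 import pool

# Same connection parameters as the question; ThreadedConnectionPool is the
# thread-safe variant of SimpleConnectionPool.
test_pool = pool.ThreadedConnectionPool(1, 50, host="localhost", database="test",
                                        user="postgres", password="test", port="5432")

def client_thread(n_queries):
    # Each simulated client borrows a connection per query and returns it.
    for _ in range(n_queries):
        conn = test_pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT * FROM tb1 WHERE id = %s", [random.randint(0, 99)])
                cur.fetchone()
        finally:
            test_pool.putconn(conn)

start = time.time()
threads = [threading.Thread(target=client_thread, args=(1000,)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("50 concurrent clients, 1000 queries each:", time.time() - start, "seconds")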

Google Cloud Datastore: Counting all entities in a certain state

Background
I need to send out a large batch of notifications to around ~1 mil devices and I'm building it out using Google Cloud Functions.
In the current setup I enqueue each device token as a PubSub message that:
stores a pending notification in DataStore, used for keeping track of retries and success status
attempts to send the notification
marks the notification as either successful or failed if it's retried enough and hasn't gone through
This works more or less fine and I get decent performance out of it, something like 1.5K tokens processed per second.
Issue
I want to keep track of the current progress of the whole job. Given that I know how many notifications I'm expecting to process, I want to be able to report something like x/1_000_000 processed, and then consider the job done when the sum of failures and successes equals the total I wanted to process.
The DataStore documentation suggests not running a count on the entities themselves because it won't be performant, which I can confirm. I implemented a counter following their example documentation of a sharded counter which I'm including at the end.
The issue I'm seeing is that it is both quite slow and very prone to returning 409 Contention errors, which make my function invocations retry. That is not ideal, given that the count itself is not essential to the process and there is only a limited retry budget per notification. In practice the thing that fails most often is incrementing the counter, which happens at the end of the process; this increases the load from the notification reads that check status on retry, and means I end up with a counter that is lower than the actual number of successful notifications.
I ran a quick benchmark using wrk and seem to get around 400 RPS out of incrementing the counter, with an average latency of 250ms. This is quite slow compared to the notification logic itself, which does around 3 DataStore queries per notification and is presumably more complex than incrementing a counter. Combined with the contention errors, I end up with an implementation that I don't consider stable. I understand that Datastore usually auto-scales with continuous heavy usage, but this service is used very rarely and only for a whole batch of tokens at a time, so there would not be any previous traffic to scale it up.
Questions
Is there something I'm missing about the counter implementation that could be improved to make it less slow?
Is there a different approach I should consider to get what I want?
Code
The code that interacts with datastore
import hashlib
import random

from google.cloud import datastore

# Assumes a module-level client, e.g. client = datastore.Client()
DATASTORE_READ_BATCH_SIZE = 100

class Counter():
    kind = "counter"
    shards = 2000

    @staticmethod
    def _key(namespace, shard):
        return hashlib.sha1(":".join([str(namespace), str(shard)]).encode('utf-8')).hexdigest()

    @staticmethod
    def count(namespace):
        keys = []
        total = 0
        for shard in range(Counter.shards):
            if len(keys) == DATASTORE_READ_BATCH_SIZE:
                counters = client.get_multi(keys)
                total = total + sum([int(c["count"]) for c in counters])
                keys = []
            keys.append(client.key(Counter.kind, Counter._key(namespace, shard)))
        if len(keys) != 0:
            counters = client.get_multi(keys)
            total = total + sum([int(c["count"]) for c in counters])
        return total

    @staticmethod
    def increment(namespace):
        key = client.key(Counter.kind, Counter._key(namespace, random.randint(0, Counter.shards - 1)))
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity.update({
                    "count": 0,
                })
            entity.update({
                "count": entity["count"] + 1,
            })
            client.put(entity)
This is called from a Google Cloud Function like so
from flask import abort, jsonify, make_response

from src.notify import FCM, APNS
from src.lib.datastore import Counter

def counter(request):
    args = request.args
    if args.get("platform"):
        Counter.increment(args["platform"])
        return
    return jsonify({
        FCM: Counter.count(FCM),
        APNS: Counter.count(APNS)
    })
This is used both for incrementing and reading the counts and is split by platform for iOS and Android.
In the end I gave up on the counter and started also saving the status of the notifications in BigQuery. The pricing is still reasonable as it's pay-per-use, and the streaming flavor of data insertion seems fast enough that it doesn't cause me any issues in practice.
With this I can use a simple SQL query to count all the entities matching a batched job. This ends up taking around 3 seconds for all the entities which, compared to the alternative, is acceptable performance for me given that this is only for internal use.
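A rough sketch of that approach (the table, field names, and batch_id parameter are made up for illustration, not taken from the actual code): the worker streams one row per processed notification, and a single aggregate query then reports progress.
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my_project.notifications.status"  # hypothetical table

def record_status(batch_id, token, status):
    # Streaming insert: rows become queryable within seconds.
    errors = client.insert_rows_json(TABLE_ID, [{
        "batch_id": batch_id,
        "token": token,
        "status": status,  # e.g. "success" or "failed"
    }])
    if errors:
        raise RuntimeError(errors)

def progress(batch_id):
    # One aggregate query replaces all the sharded counter reads.
    query = """
        SELECT status, COUNT(*) AS n
        FROM `my_project.notifications.status`
        WHERE batch_id = @batch_id
        GROUP BY status
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("batch_id", "STRING", batch_id),
    ])
    return {row.status: row.n for row in client.query(query, job_config=job_config)}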

What's causing so much overhead in Google BigQuery query?

I am running the following function to profile a BigQuery query:
# q = "SELECT * FROM bqtable LIMIT 1'''
def run_query(q):
t0 = time.time()
client = bigquery.Client()
t1 = time.time()
res = client.query(q)
t2 = time.time()
results = res.result()
t3 = time.time()
records = [_ for _ in results]
t4 = time.time()
print (records[0])
print ("Initialize BQClient: %.4f | ExecuteQuery: %.4f | FetchResults: %.4f | PrintRecords: %.4f | Total: %.4f | FromCache: %s" % (t1-t0, t2-t1, t3-t2, t4-t3, t4-t0, res.cache_hit))
And, I get something like the following:
Initialize BQClient: 0.0007 | ExecuteQuery: 0.2854 | FetchResults: 1.0659 | PrintRecords: 0.0958 | Total: 1.4478 | FromCache: True
I am running this on a GCP machine and it is only fetching ONE result in location US (same region, etc.), so the network transfer should (I hope?) be negligible. What's causing all the overhead here?
I tried this on the GCP console and it says the cache hit takes less than 0.1s to return, but in actuality, it's over a second. Here is an example video to illustrate: https://www.youtube.com/watch?v=dONZH1cCiJc.
Notice for the first query, for example, it says it returned in 0.253s from cache:
However, if you view the above video, the query actually STARTED at 7 seconds and 3 frames --
And it COMPLETED at 8 seconds and 13 frames --
That is well over a second -- almost a second and a half!! That number is similar to what I get when I execute a query from the command-line in python.
So why then does it report that it only took 0.253s when in actuality, to do the query and return the one result, it takes over five times that amount?
In other words, it seems like there's about a second overhead REGARDLESS of the query time (which are not noted at all in the execution details). Are there any ways to reduce this time?
The UI is reporting the query execution time, not the total time.
Query execution time is how long it takes BigQuery to actually scan the data and compute the result. If it's just reading from cache then it will be very quick and usually under 1 second, which reflects the timing you're seeing.
However that doesn't include downloading the result table and displaying it in the UI. You actually measured this in your Python script which shows the FetchResults step taking over 1 second, and this is the same thing that's happening in the browser console. For example, a cached query result containing millions of rows will be executed very quickly but might take 30 seconds to fully download.
BigQuery is a large-scale analytical (OLAP) system and is designed for throughput rather than latency. It uses a distributed design with an intensive planning process and writes all results to temporary tables. This allows it to process petabytes in seconds but the trade-off is that every query will take a few seconds to run, no matter how small.
You can look at the official documentation for more info on query planning and performance, but in this situation there is no way to reduce the latency any further. A few seconds is currently the best case scenario for BigQuery.
If you need lower response times for repeated queries then you can look into storing the results in your own caching layer (like Redis), or use BigQuery to aggregate data into a much smaller dataset and then store that in a traditional relational database (like Postgres or MySQL).
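For instance, a minimal sketch of such a caching layer, assuming a local Redis instance and the redis-py client (none of this is from the original post), could look like this:
import json

import redis
from google.cloud import bigquery

bq = bigquery.Client()
cache = redis.Redis(host="localhost", port=6379)

def cached_query(sql, ttl_seconds=300):
    # Serve repeated queries from Redis; only hit BigQuery on a cache miss.
    hit = cache.get(sql)
    if hit is not None:
        return json.loads(hit)
    rows = [dict(row) for row in bq.query(sql).result()]
    cache.setex(sql, ttl_seconds, json.dumps(rows, default=str))
    return rows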

How to best share static data between ipyparallel client and remote engines?

I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. The process of copying the data (if pickled, data is 40MB) is only a few seconds. However, It appears that if the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))
    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)
    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print 'Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id)
        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)
    print 'Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time)
If simulation counts are small (~50), it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine and I don't see a response until all of the tasks assigned to that engine are completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, which had to complete them all before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs; this may have been the time delay of writing the output pickle files.
Lastly, host1 also runs the ipcontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engines don't really show much memory usage at all. What would cause this spike in memory?
I used this logic in some code a couple of years ago, and this is what I ended up with. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

engines[:].push(shared_dict)

results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
"If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes."
In my case, my_func() was a complex method where I put lots of logging messages written into a file, so I had my print statements.
About the task assignment, since I used load_balanced_view(), I left it to the library to find its way, and it did great.
"If simulation counts are large (~1000), it takes a long time to get started, i see the CPUs throttle up on all engines, but no progress print statements are seen until a long time (~40mins), and when I do see progress, it appears a large block (>100) of tasks went to same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, i saw the ar object provided new responses ever 4 secs - this may have been the time delay to write the output pickle files."
About the long startup time, I haven't experienced that, so I can't say anything about it.
I hope this casts some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I admit I haven't tried sharing a big, read-only dataset as a global variable with it, but I would give it a try, because it seems to work.
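For completeness, here is a minimal sketch of that multiprocessing.Pool idea (hypothetical code, not from my setup): the DataFrame is handed to each worker once via the pool initializer, so it is copied at most once per process rather than once per task.
import multiprocessing as mp
import pandas as pd

data = None  # module-level, read-only DataFrame filled in by each worker

def init_worker(df):
    # Runs once per worker process, not once per task.
    global data
    data = df

def do_simulation(tweaks):
    # Read-only access to the shared DataFrame.
    subset = data[data['i'] == tweaks['i']]
    return tweaks, len(subset)

if __name__ == '__main__':
    big_df = pd.DataFrame({'i': range(1000), 'value': range(1000)})
    tweak_list = [dict(i=i, j=j) for i in range(4) for j in range(5)]
    pool = mp.Pool(processes=4, initializer=init_worker, initargs=(big_df,))
    try:
        for tweaks, n_rows in pool.imap_unordered(do_simulation, tweak_list):
            print(tweaks, n_rows)
    finally:
        pool.close()
        pool.join()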
Sometimes you need to scatter your data grouped by a category, so that you are sure that each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the scatter function I'm suggesting makes sure each cluster will host a similar number of observations, which is usually a good idea.

Is it possible to increase the response timeout in Google App Engine?

On my local machine the script runs fine, but in the cloud it 500s all the time. This is a cron task, so I don't really mind if it takes 5 min...
< class 'google.appengine.runtime.DeadlineExceededError' >:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 secs, but you can indirectly increase the timeout by employing task queues and writing tasks that gradually iterate through your data set and process it. Each such task run should of course fit within the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
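A rough sketch of that task-queue idea (the handler URL, the Foo model, and the process() helper are hypothetical, for illustration only): each task processes one batch and then enqueues a follow-up task carrying the cursor to resume from.
import webapp2
from google.appengine.api import taskqueue

class ProcessBatchHandler(webapp2.RequestHandler):
    def post(self):
        start_cursor = self.request.get('cursor') or None
        query = Foo.all()  # Foo is a hypothetical db.Model
        if start_cursor:
            query.with_cursor(start_cursor)
        batch = query.fetch(500)
        for entity in batch:
            process(entity)  # hypothetical per-entity work; must fit the deadline
        if len(batch) == 500:
            # More work may remain: enqueue another task to resume from the cursor.
            taskqueue.add(url='/tasks/process-batch',
                          params={'cursor': query.cursor()})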
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()  # query over the Foo model (a db.Model subclass)
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1
        # Do something
        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break
