What's causing so much overhead in Google BigQuery query? - python

I am running the following function to profile a BigQuery query:
import time
from google.cloud import bigquery

# q = "SELECT * FROM bqtable LIMIT 1"
def run_query(q):
    t0 = time.time()
    client = bigquery.Client()
    t1 = time.time()
    res = client.query(q)
    t2 = time.time()
    results = res.result()
    t3 = time.time()
    records = [_ for _ in results]
    t4 = time.time()
    print(records[0])
    print("Initialize BQClient: %.4f | ExecuteQuery: %.4f | FetchResults: %.4f | PrintRecords: %.4f | Total: %.4f | FromCache: %s"
          % (t1 - t0, t2 - t1, t3 - t2, t4 - t3, t4 - t0, res.cache_hit))
And, I get something like the following:
Initialize BQClient: 0.0007 | ExecuteQuery: 0.2854 | FetchResults: 1.0659 | PrintRecords: 0.0958 | Total: 1.4478 | FromCache: True
I am running this on a GCP machine and it is only fetching ONE result in location US (same region, etc.), so the network transfer should (I hope?) be negligible. What's causing all the overhead here?
I tried this on the GCP console and it says the cache hit takes less than 0.1s to return, but in actuality, it's over a second. Here is an example video to illustrate: https://www.youtube.com/watch?v=dONZH1cCiJc.
Notice for the first query, for example, it says it returned in 0.253s from cache:
However, if you view the above video, the query actually STARTED at 7 seconds and 3 frames --
And it COMPLETED at 8 seconds and 13 frames --
That is well over a second -- almost a second and a half!! That number is similar to what I get when I execute a query from the command-line in python.
So why then does it report that it only took 0.253s when in actuality, to do the query and return the one result, it takes over five times that amount?
In other words, it seems like there's about a second overhead REGARDLESS of the query time (which are not noted at all in the execution details). Are there any ways to reduce this time?

The UI is reporting the query execution time, not the total time.
Query execution time is how long it takes BigQuery to actually scan the data and compute the result. If it's just reading from cache then it will be very quick and usually under 1 second, which reflects the timing you're seeing.
However that doesn't include downloading the result table and displaying it in the UI. You actually measured this in your Python script which shows the FetchResults step taking over 1 second, and this is the same thing that's happening in the browser console. For example, a cached query result containing millions of rows will be executed very quickly but might take 30 seconds to fully download.
BigQuery is a large-scale analytical (OLAP) system and is designed for throughput rather than latency. It uses a distributed design with an intensive planning process and writes all results to temporary tables. This allows it to process petabytes in seconds but the trade-off is that every query will take a few seconds to run, no matter how small.
You can look at the official documentation for more info on query planning and performance, but in this situation there is no way to reduce the latency any further. A few seconds is currently the best case scenario for BigQuery.
If you need lower response times for repeated queries then you can look into storing the results in your own caching layer (like Redis), or use BigQuery to aggregate data into a much smaller dataset and then store that in a traditional relational database (like Postgres or MySQL).
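The caching-layer idea can be sketched with a small in-process TTL cache in front of the query function. This is a minimal illustration, not a production cache; `TTLCache` and `cached_query` are hypothetical names, and `run_query` stands in for the real BigQuery call. Redis gives you the same pattern (`GET`, then `SETEX` on a miss) shared across processes.

```python
import time

class TTLCache:
    """Minimal in-process cache mapping query text to (timestamp, rows)."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.time() - ts > self.ttl:
            del self.store[key]  # stale entry: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time(), value)

def cached_query(cache, q, run_query):
    """Serve repeated queries from the cache, skipping the ~1s round trip."""
    rows = cache.get(q)
    if rows is None:
        rows = run_query(q)  # e.g. list(client.query(q).result())
        cache.put(q, rows)
    return rows
```

The trade-off is staleness: results can be up to `ttl_seconds` old, which is usually acceptable for dashboards built on analytical queries.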

Related

Python concurrent futures multiprocessing Pool does not scale with the number of processors

I have written a simple function to demonstrate this behavior which iteratively creates a list and I pass that function to the concurrent.futures.ProcessPoolExecutor. The actual function isn't important as this seems to happen for a wide variety of functions I've tested. As I increase the number of processors it takes longer to run the underlying function. At only 10 processors the total execution time per processor increases by 2.5 times! For this function it continues to increase at a rate of about 15% per processor up to the capacity limits of my machine. I have a Windows machine with 48 processors and my total CPU and memory usage doesn't exceed 25% for this test. I have nothing else running. Is there some blocking lurking somewhere?
from datetime import datetime
import concurrent.futures

def process(num_jobs=1, **kwargs):
    from functools import partial
    iterobj = range(num_jobs)
    args = []
    func = globals()['test_multi']
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_jobs) as ex:
        ## using map
        result = ex.map(partial(func, *args, **kwargs), iterobj)
    return result

def test_multi(*args, **kwargs):
    starttime = datetime.utcnow()
    iternum = args[-1]
    test = []
    for i in range(200000):
        test = test + [i]
    return iternum, (datetime.utcnow() - starttime)

if __name__ == '__main__':
    max_processors = 10
    for i in range(max_processors):
        starttime = datetime.utcnow()
        result = process(i + 1)
        finishtime = datetime.utcnow() - starttime
        if i == 0:
            chng = 0
            total = 0
            firsttime = finishtime
        else:
            chng = finishtime / lasttime * 100 - 100
            total = finishtime / firsttime * 100 - 100
        lasttime = finishtime
        print(f'Multi took {finishtime} for {i+1} processes changed by {round(chng,2)}%, total change {round(total,2)}%')
This gives the following results on my machine:
Multi took 0:00:52.433927 for 1 processes changed by 0%, total change 0%
Multi took 0:00:52.597822 for 2 processes changed by 0.31%, total change 0.31%
Multi took 0:01:13.158140 for 3 processes changed by 39.09%, total change 39.52%
Multi took 0:01:26.666043 for 4 processes changed by 18.46%, total change 65.29%
Multi took 0:01:43.412213 for 5 processes changed by 19.32%, total change 97.22%
Multi took 0:01:41.687714 for 6 processes changed by -1.67%, total change 93.93%
Multi took 0:01:38.316035 for 7 processes changed by -3.32%, total change 87.5%
Multi took 0:01:51.106467 for 8 processes changed by 13.01%, total change 111.9%
Multi took 0:02:15.046646 for 9 processes changed by 21.55%, total change 157.56%
Multi took 0:02:13.467514 for 10 processes changed by -1.17%, total change 154.54%
The increases are not linear and vary from test to test but always end up significantly increasing the time to run the function. Given the ample free resources on this machine and very simple function I would have expected the total time to remain fairly constant or perhaps slightly increase with the spawning of new processes, not to increase dramatically from pure calculation.
Yes there is, and it's called MEMORY BANDWIDTH.
While the memory controller is good at pipelining read/write instructions to improve throughput for parallel programs, if too many processes are reading and writing to your RAM at once, you are going to see a slowdown, because the RAM pipeline is being bombarded by all of them at the same time.
Other applications running at the same time may not use RAM as heavily, because each core has caches (L1, L2 and a shared L3) that keep applications running without contending for RAM bandwidth. Only applications that do heavy memory operations contend for it, and your application is clearly contending with itself.
This is one of the hard limits on parallel programs, and the solution is to make the application more "cache friendly" so that it reaches out to RAM less often.
"Cache friendly" applications are more easily written in C/C++ than Python, but they are totally doable in Python, as CPUs have a few MB of cache, which can hold an entire working set in a lot of cases.
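To illustrate: the `test = test + [i]` pattern in the question copies the entire list on every iteration, so memory traffic grows quadratically with list length. Appending in place does amortized constant work per element and puts far less pressure on RAM bandwidth. A rough sketch (with a smaller n than the question's 200000, so it finishes quickly):

```python
import timeit

def build_by_concat(n):
    # Copies the whole list each iteration: O(n^2) memory traffic.
    test = []
    for i in range(n):
        test = test + [i]
    return test

def build_by_append(n):
    # Amortized O(1) per element: far less RAM traffic.
    test = []
    for i in range(n):
        test.append(i)
    return test

if __name__ == '__main__':
    n = 20000
    t_concat = timeit.timeit(lambda: build_by_concat(n), number=1)
    t_append = timeit.timeit(lambda: build_by_append(n), number=1)
    print('concat: %.4fs  append: %.4fs' % (t_concat, t_append))
```

Both produce the same list, but the concat version is orders of magnitude slower at this size, and it is exactly this kind of memory-heavy inner loop that makes parallel workers contend for bandwidth.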

Why is traceback.extract_stack() in Python so slow?

During tests I found out that calling traceback.extract_stack() is very slow. The price of getting a stack trace is comparable to executing a database query.
I'm wondering if I'm doing something wrong or missing something. What's surprising to me is that I would assume extract_stack() is an internal Python call, executed at runtime in memory, and so should be super fast if not instant. By contrast, a database query involves an external service (network communication), etc.
Example code is below. You can try how much time it takes to retrieve the traceback in, let's say, 20,000 iterations, and how much faster it is to retrieve just the first few items from the stack trace - set the limit=None parameter to something else.
My tests showed various results on various systems/configurations, but all have in common that retrieving a stack trace is not orders of magnitude cheaper; it's almost the same as an SQL insert.
            20k SQL inserts | 20k stack traces
Win         5.4 sec         | 14.4 sec
FreeBSD     5.0 sec         |  3.7 sec
Ubuntu GCP  16.6 sec        |  2.4 sec
Windows: laptop, local SSD. FreeBSD: server, local SSD. Ubuntu: Google Cloud, shared SSD.
Am I doing something wrong or is there any explanation why is traceback.extract_stack() so slow? Can I retrieve stack trace somehow faster?
Example code. Run $ pip install pytest and then $ pytest -s -v
import datetime
import unittest
import traceback

class TestStackTrace(unittest.TestCase):
    def test_stack_trace(self):
        start_time = datetime.datetime.now()
        iterations = 20000
        for i in range(0, iterations):
            stack_list = traceback.extract_stack(limit=None)  # set 0, 1, 2...
            stack_len = len(stack_list)
            self.assertEqual(1, 1)
        finish_time = datetime.datetime.now()
        print('\nStack length: {}, iterations: {}'.format(stack_len, iterations))
        print('Trace elapsed time: {}'.format(finish_time - start_time))
You don't need it, but if you want the comparison with the SQL insert, here it is. Just insert it as a second test method in the TestStackTrace class (it requires psycopg2 and a local PostgreSQL server). Run CREATE DATABASE pytest1; and CREATE TABLE "test_table1" (num_value BIGINT, str_value VARCHAR(10));
import psycopg2  # add at the top of the file

    def test_sql_query(self):
        start_time = datetime.datetime.now()
        con_str = "host='127.0.0.1' port=5432 user='postgres' password='postgres' dbname='pytest1'"
        con = psycopg2.connect(con_str)
        con.autocommit = True
        con.set_session(isolation_level='READ COMMITTED')
        cur = con.cursor()
        for i in range(0, 20000):
            cur.execute('INSERT INTO test_table1 (num_value, str_value) VALUES (%s, %s) ', (i, i))
        finish_time = datetime.datetime.now()
        print('\nSQL elapsed time: {}'.format(finish_time - start_time))
traceback.extract_stack() is not an internal call implemented in C. The entire traceback module is implemented in Python, which is why it is relatively slow. Since stack traces are typically only needed during debugging, their performance usually isn't a concern. You may have to re-implement it as a C/C++ extension on your own if you really need a high-performance version of it.
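Before reaching for a C extension, it may be enough to skip the expensive parts. Much of extract_stack()'s cost goes into building FrameSummary objects and looking up source lines via linecache; if you only need filename, line number and function name, walking the frames yourself with `sys._getframe()` is typically far cheaper. A sketch (`fast_stack` is an illustrative name, not a standard function):

```python
import sys

def fast_stack(limit=None):
    """Collect (filename, lineno, function) tuples by walking frames directly.

    Unlike traceback.extract_stack(), this never touches source files on
    disk, so it only pays for the frame walk itself.
    """
    frames = []
    f = sys._getframe(1)  # skip fast_stack's own frame
    while f is not None and (limit is None or len(frames) < limit):
        frames.append((f.f_code.co_filename, f.f_lineno, f.f_code.co_name))
        f = f.f_back
    return frames
```

Note that `sys._getframe()` is a CPython implementation detail (hence the underscore), so this is a pragmatic trade-off rather than a portable API.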

Running asynchronous queries in BigQuery not noticeably faster

I am using Google's python API client library on App Engine to run a number of queries in Big Query to generate live analytics. The calls take roughly two seconds each and with five queries, this is too long, so I looked into ways to speed things up and thought running queries asynchronously would be a solid improvement. The thinking was that I could insert the five queries at once and Google would do some magic to run them all at the same time and then use jobs.getQueryResults(jobId) to get the results for each job. I decided to test the theory out with a proof of concept by timing the execution of two asynchronous queries and comparing it to running queries synchronously. The results:
synchronous: 3.07 seconds (1.34s and 1.29s for each query)
asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())
Which is only a difference of 0.68 seconds. So while asynchronous queries are faster, they aren't achieving the goal of Google parallel magic to cut down on total execution time. So first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that
An asynchronous query returns a response immediately, generally before
the query completes.
Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the Big Query team will be the only ones that can answer this, but I welcome any answers!
EDIT NOTES:
Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response and found:
creationTime to startTime: 462ms, 387ms (timing for queries 1 and 2)
startTime to endTime: 744ms, 1005ms
Though I'm not sure if that adds anything to the story as it's the timing between issuing insert() and the call completing that I'm wondering about.
From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:
You can run multiple jobs concurrently in BigQuery
CODE:
For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3, but replicated the results. In my research I also found out about partitioned tables, which I wish I knew about before (which may well end up being my solution) but this question stands on its own. Here is my code. I am omitting the actual SQL because they are irrelevant in this case:
def test_sync(self, request):
    t0 = time.time()
    request = bigquery.jobs()
    data = {'query': (sql)}
    response = request.query(projectId=project_id, body=data).execute()
    t1 = time.time()
    data = {'query': (sql)}
    response = request.query(projectId=project_id, body=data).execute()
    t2 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("elapsed: " + str(t2 - t0))
def test_async(self, request):
    job_ids = {}
    t0 = time.time()
    job_id = async_query(sql)
    job_ids['a'] = job_id
    print("job_id: " + job_id)
    t1 = time.time()
    job_id = async_query(sql)
    job_ids['b'] = job_id
    print("job_id: " + job_id)
    t2 = time.time()
    for key, value in job_ids.iteritems():
        response = bigquery.jobs().getQueryResults(
            jobId=value,
            projectId=project_id).execute()
    t3 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("2-3: " + str(t3 - t2))
    print("elapsed: " + str(t3 - t0))

def async_query(sql):
    job_data = {
        'jobReference': {
            'projectId': project_id
        },
        'configuration': {
            'query': {
                'query': sql,
                'priority': 'INTERACTIVE'
            }
        }
    }
    response = bigquery.jobs().insert(
        projectId=project_id,
        body=job_data).execute()
    job_id = response['jobReference']['jobId']
    return job_id
The answer to whether running queries in parallel will speed up the results is, of course, "it depends".
When you use the asynchronous job API there is about a half a second of built-in latency that gets added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.
The half second of latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to Google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds of round-trip time, but if you're in Brazil, it might be 100 ms.
Moreover, when you do jobs.query(), the BigQuery API server that receives the request is the same one that starts the query. It can return the results as soon as the query is done. But when you use the asynchronous API, your getQueryResults() request is going to go to a different server. That server has to either poll for the job state or find the server that is running the request in order to get the status. This takes time.
So if you're running a bunch of queries in parallel, each one takes 1-2 seconds, but you're adding half of a second to each one, plus it takes a half a second in the initial request, you're not likely to see a whole lot of speedup. If your queries, on the other hand, take 5 or 10 seconds each, the fixed overhead would be smaller as a percentage of the total time.
My guess is that if you ran a larger number of queries in parallel, you'd see more speedup. The other option is to use the synchronous version of the API, but use multiple threads on the client to send multiple requests in parallel.
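That last option can be sketched with a thread pool. Here `run_query` is a stand-in for whichever synchronous call you use (jobs.query().execute() in the question's API, or `client.query(q).result()` in the modern google-cloud-bigquery client); the helper name is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries_in_parallel(queries, run_query, max_workers=5):
    """Issue synchronous queries from worker threads so they overlap.

    The GIL is not a problem here: each thread spends its time blocked
    on network I/O rather than executing Python bytecode, so N queries
    take roughly as long as the slowest one instead of the sum.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with queries
        return list(pool.map(run_query, queries))
```

For the five dashboard queries in the question, this keeps the simple one-round-trip synchronous path while still running them concurrently.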
There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.
That said, 2000 slots is a lot. In a very rough estimate, 2000 slots can process hundreds of Gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.

How to best share static data between ipyparallel client and remote engines?

I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations that run on them. Copying the data (40 MB pickled) takes only a few seconds. However, it appears that if the number of simulations grows, memory usage grows very large. I imagine this shared data is being copied for each task rather than once per engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))
    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)
    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print 'Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id)
        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)
    print 'Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time)
If simulation counts are small (~50), it takes a while to get started, but then I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine, and I don't see a response until all of the tasks assigned to that engine have completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine and awaited completion from that one engine before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs - this may have been the time it took to write the output pickle files.
Lastly, host1 also runs the ipcontroller task, and its memory usage goes up like crazy (a Python task shows >6 GB RAM in use, a kernel task shows 3 GB). The host2 engines don't really show much memory usage at all. What would cause this spike in memory?
I used this logic in some code a couple of years ago, and this is what I ended up with. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

engines[:].push(shared_dict)

results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
"If simulation counts are small (~50), then it takes a while to get started, but i start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes."
In my case, my_func() was a complex method that wrote lots of logging messages to a file, so that's where my progress reporting came from.
About the task assignment: since I used load_balanced_view(), I left the library to find its own way, and it did great.
"If simulation counts are large (~1000), it takes a long time to get started, i see the CPUs throttle up on all engines, but no progress print statements are seen until a long time (~40mins), and when I do see progress, it appears a large block (>100) of tasks went to same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, i saw the ar object provided new responses ever 4 secs - this may have been the time delay to write the output pickle files."
About the long startup time, I haven't experienced that, so I can't say anything.
I hope this sheds some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work.
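A minimal sketch of that multiprocessing.Pool approach: the initializer/initargs pattern ships the read-only data once per worker process rather than once per task, which is the "copy once per engine" behaviour the question asks for. (On Linux with the 'fork' start method, a module-level object created before the pool is even inherited copy-on-write, avoiding the pickle entirely.) The names here are illustrative, and a plain list stands in for the DataFrame:

```python
import multiprocessing as mp

shared_data = None  # populated in each worker by init_worker

def init_worker(data):
    # Runs once per worker process; stores the read-only data globally
    # so individual tasks don't need it pickled into their arguments.
    global shared_data
    shared_data = data

def do_simulation(tweaks):
    # Workers only read shared_data; they must never mutate it.
    return sum(shared_data) * tweaks['scale']

if __name__ == '__main__':
    data = list(range(100))  # stands in for the read-only DataFrame
    tweak_list = [{'scale': s} for s in range(4)]
    with mp.Pool(processes=2, initializer=init_worker, initargs=(data,)) as pool:
        results = pool.map(do_simulation, tweak_list)
    print(results)
```

Unlike ipyparallel this only spans one host, but on a single machine it sidesteps the per-task copying the question describes.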
Sometimes you need to scatter your data grouped by a category, so that you are sure each subgroup will be entirely contained in a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the function I'm suggesting makes sure each cluster will host a similar number of observations, which is usually a good idea.

Is it possible to increase the response timeout in Google App Engine?

On my local machine the script runs fine, but in the cloud it returns a 500 error every time. This is a cron task, so I don't really mind if it takes 5 min...
< class 'google.appengine.runtime.DeadlineExceededError' >:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 secs, but you can indirectly increase the timeout by employing task queues - writing tasks that gradually iterate through your data set and process it. Each such task run should of course fit into the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1

        # Do something

        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break
