Google Cloud Datastore: Counting all entities in a certain state


Background
I need to send out a large batch of notifications to roughly 1 million devices, and I'm building it out using Google Cloud Functions.
In the current setup I enqueue each device token as a PubSub message that:
stores a pending notification in DataStore, used for keeping track of retries and success status
attempts to send the notification
marks the notification as successful, or as failed if it has been retried enough times without going through
This works more or less fine and I get decent performance out of it, something like 1.5K tokens processed per second.
Issue
I want to keep track of the current progress of the whole job. Given that I know how many notifications I'm expecting to process, I want to be able to report something like x/1_000_000 processed and then consider the job done when the sum of failures + successes equals the number I wanted to process.
The DataStore documentation suggests not running a count on the entities themselves because it won't be performant, which I can confirm. I implemented a counter following their example documentation of a sharded counter which I'm including at the end.
The issue I'm seeing is that it is both quite slow and very prone to returning 409 Contention errors. These make my function invocations retry, which is not ideal given that the count itself is not essential to the process and there's only a limited retry budget per notification. In practice the step that fails most often is incrementing the counter, which happens at the end of the process. A failure there increases load on notification reads (to check their status on retry) and leaves me with a counter that is lower than the actual number of successful notifications.
I ran a quick benchmark using wrk and get around 400 RPS out of incrementing the counter, with an average latency of 250ms. This is quite slow compared to the notification logic itself, which does around 3 DataStore queries per notification and is presumably more complex than incrementing a counter. Combined with the contention errors, I end up with an implementation that I don't consider stable. I understand that Datastore usually auto-scales under continuous heavy usage, but this service is used very rarely and for the whole batch of tokens at once, so there would not be any previous traffic to scale it up.
Questions
Is there something I'm missing about the counter implementation that could be improved to make it less slow?
Is there a different approach I should consider to get what I want?
Code
The code that interacts with datastore
import hashlib
import random
from google.cloud import datastore
# Assumed: a module-level client; its creation is not shown in the original.
client = datastore.Client()
DATASTORE_READ_BATCH_SIZE = 100
class Counter():
    kind = "counter"
    shards = 2000
    @staticmethod
    def _key(namespace, shard):
        return hashlib.sha1(":".join([str(namespace), str(shard)]).encode('utf-8')).hexdigest()
    @staticmethod
    def count(namespace):
        # Read the shards in batches and sum their counts.
        keys = []
        total = 0
        for shard in range(Counter.shards):
            if len(keys) == DATASTORE_READ_BATCH_SIZE:
                counters = client.get_multi(keys)
                total = total + sum([int(c["count"]) for c in counters])
                keys = []
            keys.append(client.key(Counter.kind, Counter._key(namespace, shard)))
        if len(keys) != 0:
            counters = client.get_multi(keys)
            total = total + sum([int(c["count"]) for c in counters])
        return total
    @staticmethod
    def increment(namespace):
        # Pick a random shard and increment it inside a transaction.
        key = client.key(Counter.kind, Counter._key(namespace, random.randint(0, Counter.shards - 1)))
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity.update({
                    "count": 0,
                })
            entity.update({
                "count": entity["count"] + 1,
            })
            client.put(entity)
This is called from a Google Cloud Function like so
from flask import abort, jsonify, make_response
from src.notify import FCM, APNS
from src.lib.datastore import Counter
def counter(request):
    args = request.args
    if args.get("platform"):
        Counter.increment(args["platform"])
        return
    return jsonify({
        FCM: Counter.count(FCM),
        APNS: Counter.count(APNS)
    })
This is used both for incrementing and reading the counts and is split by platform for iOS and Android.

In the end I gave up on the counter and started also saving the status of the notifications in BigQuery. The pricing is still reasonable, as it's still per use, and streaming inserts seem to be fast enough that they don't cause me any issues in practice.
With this I can use a simple SQL query to count all the entities matching a batched job. This takes around 3 seconds for all the entities, which, compared to the alternative, is acceptable performance for me given that this is only for internal use.

Related

Running asynchronous queries in BigQuery not noticeably faster

I am using Google's Python API client library on App Engine to run a number of queries in BigQuery to generate live analytics. The calls take roughly two seconds each, and with five queries this is too long, so I looked into ways to speed things up and thought running queries asynchronously would be a solid improvement. The thinking was that I could insert the five queries at once, Google would do some magic to run them all at the same time, and I would then use jobs.getQueryResults(jobId) to get the results for each job. I decided to test the theory with a proof of concept by timing the execution of two asynchronous queries and comparing it to running the queries synchronously. The results:
synchronous: 3.07 seconds (1.34s and 1.29s for each query)
asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())
Which is only a difference of 0.68 seconds. So while asynchronous queries are faster, they aren't achieving the goal of Google parallel magic to cut down on total execution time. So first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that
An asynchronous query returns a response immediately, generally before the query completes.
Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the BigQuery team will be the only ones that can answer this, but I welcome any answers!
EDIT NOTES:
Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response and found:
creationTime to startTime: 462ms, 387ms (timing for queries 1 and 2)
startTime to endTime: 744ms, 1005ms
Though I'm not sure if that adds anything to the story as it's the timing between issuing insert() and the call completing that I'm wondering about.
From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:
You can run multiple jobs concurrently in BigQuery
CODE:
For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3, but replicated the results. In my research I also found out about partitioned tables, which I wish I had known about before (and which may well end up being my solution), but this question stands on its own. Here is my code. I am omitting the actual SQL because it is irrelevant in this case:
import time
def test_sync(self, request):
    t0 = time.time()
    request = bigquery.jobs()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t1 = time.time()
    data = { 'query': (sql) }
    response = request.query(projectId=project_id, body=data).execute()
    t2 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("elapsed: " + str(t2 - t0))
def test_async(self, request):
    job_ids = {}
    t0 = time.time()
    job_id = async_query(sql)
    job_ids['a'] = job_id
    print("job_id: " + job_id)
    t1 = time.time()
    job_id = async_query(sql)
    job_ids['b'] = job_id
    print("job_id: " + job_id)
    t2 = time.time()
    # Fetch the results of both jobs; items() works on Python 2 and 3.
    for key, value in job_ids.items():
        response = bigquery.jobs().getQueryResults(
            jobId=value,
            projectId=project_id).execute()
    t3 = time.time()
    print("0-1: " + str(t1 - t0))
    print("1-2: " + str(t2 - t1))
    print("2-3: " + str(t3 - t2))
    print("elapsed: " + str(t3 - t0))
def async_query(sql):
    job_data = {
        'jobReference': {
            'projectId': project_id
        },
        'configuration': {
            'query': {
                'query': sql,
                'priority': 'INTERACTIVE'
            }
        }
    }
    response = bigquery.jobs().insert(
        projectId=project_id,
        body=job_data).execute()
    job_id = response['jobReference']['jobId']
    return job_id

The answer to whether running queries in parallel will speed up the results is, of course, "it depends".
When you use the asynchronous job API there is about a half a second of built-in latency that gets added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.
The half-second latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to Google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds round-trip time, but if you're in Brazil, it might be 100 ms.
Moreover, when you do jobs.query(), the BigQuery API server that receives the request is the same one that starts the query. It can return the results as soon as the query is done. But when you use the asynchronous api, your getQueryResults() request is going to go to a different server. That server has to either poll for the job state or find the server that is running the request to get the status. This takes time.
So if you're running a bunch of queries in parallel, each one takes 1-2 seconds, but you're adding half of a second to each one, plus it takes a half a second in the initial request, you're not likely to see a whole lot of speedup. If your queries, on the other hand, take 5 or 10 seconds each, the fixed overhead would be smaller as a percentage of the total time.
My guess is that if you ran a larger number of queries in parallel, you'd see more speedup. The other option is to use the synchronous version of the API, but use multiple threads on the client to send multiple requests in parallel.
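The client-side threading option mentioned above could look roughly like the sketch below. It is only an illustration: run_sync_query is a hypothetical stand-in that sleeps instead of calling jobs.query(), and the pool size of 5 is an assumption chosen to match the five queries in the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_sync_query(sql):
    """Hypothetical stand-in for a blocking jobs.query(...).execute() call."""
    time.sleep(0.2)  # pretend the query takes 0.2 s end to end
    return "rows for: " + sql


def run_queries_in_parallel(queries, max_workers=5):
    """Send each query from its own thread; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_sync_query, queries))


queries = ["SELECT %d" % i for i in range(5)]
started = time.time()
results = run_queries_in_parallel(queries)
elapsed = time.time() - started
print("ran %d queries in %.2f s" % (len(results), elapsed))
```

Because each thread spends its time waiting on the network, the total wall time should be close to that of the slowest query rather than the sum of all of them.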
There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.
That said, 2000 slots is a lot. In a very rough estimate, 2000 slots can process hundreds of Gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.

How to best share static data between ipyparallel client and remote engines?

I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. Copying the data (if pickled, data is 40MB) takes only a few seconds. However, it appears that as the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd
rc = Client()
view = rc[:] # use all engines
view.scatter('id', rc.ids, flatten=True) # So we can track which engine performed what task
def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks
if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = [] # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))
    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)
    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print('Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id))
        # Store results as a pickle for the future (one file per i, j, k combination)
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)
    print('Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time))
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements are seen for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs; this may have been the time delay to write the output pickle files.
Lastly, host1 also runs the ipycontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engine doesn't really show much memory usage at all. What would cause this spike in memory?

I used this approach in some code a couple of years ago, and it worked for me. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}
balancer = engines.load_balanced_view()
with engines[:].sync_imports(): # your 'view' variable
    import pandas as pd
    import ujson as json
engines[:].push(shared_dict)
results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
"If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes."
In my case, my_func() was a complex method that wrote lots of logging messages to a file, so those were my print statements.
As for the task assignment: since I used load_balanced_view(), I left it to the library to find its way, and it did great.
"If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements are seen for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs; this may have been the time delay to write the output pickle files."
As for the long startup time, I haven't experienced that, so I can't say anything about it.
I hope this sheds some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I haven't tried sharing a big, read-only dataset as a global variable with it, but I would give it a try, because it seems to work.

Sometimes you need to scatter your data grouped by a category, so that you are sure each subgroup is entirely contained in a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])
# Define the scatter_by function
def scatter_by(df,grouper,name='df'):
    # Group keys ordered by group size, then dealt out round-robin
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids),grouper,core))
        client[core].push({name:df[df[grouper].isin(ids)]})
# Scatter the dataframe df grouping by `year`
scatter_by(df,'year')
Notice that the function I'm suggesting makes sure each cluster hosts a similar number of observations, which is usually a good idea.
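To see what the sz[core::CORES] slicing does without a running cluster, here is a small pure-Python sketch of the same round-robin idea. The function name assign_groups and the sample data are made up for illustration; it uses plain lists instead of a DataFrame.

```python
from collections import Counter


def assign_groups(labels, n_engines):
    """Deal group keys, sorted by group size, round-robin across engines."""
    sizes = Counter(labels)
    # Sort by size, then by key so the order is deterministic.
    ordered = sorted(sizes, key=lambda key: (sizes[key], key))
    return [ordered[engine::n_engines] for engine in range(n_engines)]


# Hypothetical data: one label per observation.
years = [2001] * 5 + [2002] * 1 + [2003] * 4 + [2004] * 2
assignment = assign_groups(years, 2)
sizes = Counter(years)
loads = [sum(sizes[key] for key in keys) for keys in assignment]
print(assignment)  # which years each engine receives
print(loads)       # how many observations each engine receives
```

The loads come out comparable rather than identical: every group stays whole on one engine, and the engine that draws later in each round gets the slightly larger groups.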

gevent / requests hangs while making lots of head requests

I need to make 100k head requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import datetime
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests
def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None
def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
    chunk_list = lambda l, n: ( l[i:i+n] for i in range(0, len(l), n) )
    p = pool.Pool(chunk_size)
    print('Expanding %d short_urls' % len(short_urls))
    results = {}
    for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
        print('\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now())))
        jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
        gevent.joinall(jobs, timeout=timeout)
        results.update({_short_url:job.get().url for _short_url, job in zip(_short_urls_chunked, jobs) if job.get() is not None and job.get().status_code==200})
    return results
I've tried grequests, but it's been abandoned, and I've gone through the github pull requests, but they all have issues too.

The RAM usage you are observing mainly stems from all the data that piles up while storing 100,000 response objects, and all the underlying overhead. I have reproduced your application case, and fired off HEAD requests against 15,000 URLs from the top Alexa ranking. It did not really matter:
whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets, all requesting multiple URLs
how large I set the pool size
In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already led to a reduction in RAM usage, by about a factor of two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
from gevent import Timeout
def gethead(url):
    result = None
    try:
        with Timeout(5, False):
            result = requests.head(url)
    except Exception as e:
        result = e
    return result
This might look tricky, but it returns either None (after quite precisely 5 seconds, indicating a timeout), an exception object representing a communication error, or the response. Works great!
Although this likely is not part of the issue, in such cases I recommend keeping workers alive and letting them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:
from gevent import spawn, joinall
from gevent.queue import Queue, Empty
def qworker(qin, qout):
    # Keep pulling URLs until the input queue is empty.
    while True:
        try:
            qout.put(gethead(qin.get(block=False)))
        except Empty:
            break
qin = Queue()
qout = Queue()
for url in urls:
    qin.put(url)
workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling there, yielding non-standard issues. When I reproduced your scenario with a timeout of 20 s and 100 workers and 15000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10,000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors, and tuned the limits up via sysctl. When you request 100,000 URLs you will probably hit such limits, too, and you need to come up with measures to prevent starving the system.
Also note that, the way you are using requests, it automatically follows redirects from HTTP to HTTPS and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you are also affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or database) without much RAM usage in between.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes with nasty problems and drives your system to its limits.
your RAM usage issue: Start off by using urllib2, if you can. Then, if memory still accumulates too much, you need to work against the accumulation. Try to produce a steady state: you might want to start writing data off to disk and generally work towards a situation where objects can be garbage collected.
your code "eventually hangs": this is probably due to your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.
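One way to work towards the steady state described above is to write each result to disk as soon as it arrives and keep only the fields you need. The following is a minimal sketch under stated assumptions: fake_results is a hypothetical stand-in for whatever yields (url, status) pairs from the workers, and the JSON-lines file format is just one convenient choice.

```python
import json
import os
import tempfile


def stream_results_to_disk(results, path):
    """Write one JSON line per result as it arrives; keep only a running count."""
    written = 0
    with open(path, "w") as handle:
        for url, status in results:
            handle.write(json.dumps({"url": url, "status": status}) + "\n")
            written += 1
    return written


def fake_results(n):
    """Hypothetical stand-in for responses arriving from the workers."""
    for i in range(n):
        yield ("http://example.com/%d" % i, 200)


out_path = os.path.join(tempfile.mkdtemp(), "heads.jsonl")
count = stream_results_to_disk(fake_results(1000), out_path)
print("wrote %d results to %s" % (count, out_path))
```

Since nothing holds a reference to a result after its line is written, memory use stays roughly flat no matter how many URLs are processed.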

I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
    # Pool() automatically limits your process to chunk_size greenlets running concurrently
    # thus you don't need to do all that chunking business you were doing in your for loop
    p = pool.Pool(chunk_size)
    print('Expanding %d short_urls' % len(short_urls))
    # spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object
    # NOT the value your function, get_head, will return
    threads = [p.spawn(get_head, short_url) for short_url in short_urls]
    p.join()
    # to access the returned value of your function, access the Greenlet.value property
    results = {short_url: thread.value.url for short_url, thread in zip(short_urls, threads)
               if thread.value is not None and thread.value.status_code == 200}
    return results

Setting App Engine mapreduce shard size

Does the App Engine Mapreduce API decide the shard size according to its own logic in the final reduce job?
I am using the App Engine mapreduce API and have supplied the shard_size kwarg to set my mapreduce shard size.
The shard size is particularly important in my mapreduce job because I don't want to batch too many results into any one given execution of the final step of my reduce function. In other words, I'm hardcoding the shard size to evenly divide the users up according to an external constraint on the system.
The map job seems to shard out just fine, but the reducer uses only a fraction of the shards I've designated.
Here is a rough outline of the sort of code I am dealing with:
import random
SHARD_SIZE = 42
def map_fun(entity):
    shard_key = random.randint(1, SHARD_SIZE)
    yield (
        shard_key,
        db.model_to_protobuf(entity).SerializeToString().encode('base64')
    )
def reduce_fun(key, entities):
    batch = []
    for entity in entities:
        #check for stuff
        batch.append(entity)
    expensive_side_effect(batch)
class MyGreatPipeline(base_handler.PipelineBase):
    def run(self, *args, **kw):
        yield mapreduce_pipeline.MapreducePipeline(
            'label',
            'path.to.map_fun',
            'path.to.reduce_fun',
            'mapreduce.input_readers.DatastoreInputReader',
            'mapreduce.output_writers.BlobstoreOutputWriter',
            mapper_params={
                'entity_kind': 'path.to.entity',
                'queue_name': 'coolQueue'
            },
            reducer_params={},
            shard_size = SHARD_SIZE
        )
map_fun specifically assigns each entity a shard that is determined randomly according to the shard size. I'm confused about why my reducer would have fewer shards than SHARD_SIZE given that there are many entities and it is exceedingly unlikely that the same integers were picked repeatedly.
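As a rough sanity check of the "exceedingly unlikely" claim, the sketch below bounds the chance that at least one of the 42 keys is never drawn and also simulates the draw. The entity counts of 1000 and 10000 are assumptions for illustration, not numbers from my job.

```python
import random


def missing_key_bound(n_entities, n_keys):
    """Union bound on the chance that at least one key is never drawn."""
    return min(1.0, n_keys * (1.0 - 1.0 / n_keys) ** n_entities)


def distinct_keys(n_entities, n_keys, seed=0):
    """Simulate map_fun's randint draw and count the distinct keys produced."""
    rng = random.Random(seed)
    return len({rng.randint(1, n_keys) for _ in range(n_entities)})


print(missing_key_bound(1000, 42))  # upper bound for 1000 entities
print(distinct_keys(10000, 42))     # distinct keys seen in one simulated run
```

So with thousands of entities the mapper should emit every key, which suggests the missing reduce shards are not explained by the random assignment itself.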

I'm puzzling over what you're doing here. Using the map phase to group stuff onto a small, sharded key and then processing those keys at reduce time looks odd. You're going to end up with too much work to do per key, even if you do engage as many reduce workers as you do mapper workers.
The batch being processed is arbitrary, so I assume that expensive_side_effect() isn't dependent on the content of the batch. Why not do that work at map time instead, emitting something that a reducer could pass through to the output writer?

Is it possible to increase the response timeout in Google App Engine?

On my local machine the script runs fine, but in the cloud it returns a 500 every time. This is a cron task, so I don't really mind if it takes 5 min...
< class 'google.appengine.runtime.DeadlineExceededError' >:
Any idea whether it's possible to increase the timeout?
Thanks,
rui

You cannot go beyond 30 secs, but you can indirectly increase the timeout by employing task queues: write tasks that gradually iterate through your data set and process it. Each such task run should of course fit into the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
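The chaining idea can be sketched without any App Engine APIs. In the illustration below a deque stands in for the task queue and a list offset stands in for the datastore query cursor; BATCH_SIZE, process_batch and run_chain are hypothetical names, and the doubling is a placeholder for the real per-item work.

```python
from collections import deque

BATCH_SIZE = 3  # hypothetical; pick a size that finishes well inside the deadline


def process_batch(items, cursor, handled):
    """Process one batch starting at cursor; return the next cursor or None."""
    batch = items[cursor:cursor + BATCH_SIZE]
    handled.extend(item * 2 for item in batch)  # stand-in for the real work
    next_cursor = cursor + len(batch)
    return next_cursor if next_cursor < len(items) else None


def run_chain(items):
    """Each task handles one batch, then enqueues its successor with the saved cursor."""
    handled = []
    queue = deque([0])  # stand-in for a task queue holding cursors
    while queue:
        next_cursor = process_batch(items, queue.popleft(), handled)
        if next_cursor is not None:
            queue.append(next_cursor)
    return handled


print(run_chain(list(range(8))))
```

On App Engine each call to process_batch would be its own task run, so no single request has to outlive the deadline.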

The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo()
    if start_cursor:
        query.with_cursor(start_cursor)
    return query
batch_size = 1000
start_cursor = None
while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit = batch_size):
        results_fetched += 1
        # Do something
        if results_fetched == batch_size:
            # Full batch: remember where we stopped and start a new query
            start_cursor = query.cursor()
            break
    else:
        # The query ran out before filling a batch, so we are done
        break

Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor
def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break
