Performance issues with App Engine memcache / ndb.get_multi - python

I'm seeing very poor performance when fetching multiple keys from Memcache using ndb.get_multi() in App Engine (Python).
I am fetching ~500 small objects, all of which are in memcache. If I do this using ndb.get_multi(keys), it takes 1500ms or more. Here is typical output from App Stats:
[App Stats screenshots]
As you can see, all the data is served from memcache. Most of the time is reported as being outside of RPC calls. However, my code is about as minimal as you can get, so if the time is spent on CPU it must be somewhere inside ndb:
# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get ~500 small items from memcache. This is very slow (~1500ms).
items = ndb.get_multi(item_keys)
The first memcache.get you see in App Stats is the single fetch to get a set of keys. The second memcache.get is the ndb.get_multi call.
The items I am fetching are super-simple:
class Item(ndb.Model):
    name = ndb.StringProperty(indexed=False)
    image_url = ndb.StringProperty(indexed=False)
    image_width = ndb.IntegerProperty(indexed=False)
    image_height = ndb.IntegerProperty(indexed=False)
Is this some kind of known ndb performance issue? Something to do with deserialization cost? Or is it a memcache issue?
I found that if instead of fetching 500 objects, I instead aggregate all the data into a single blob, my function runs in 20ms instead of >1500ms:
# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get individual item data.
# If we get all the data from memcache as a single blob it is very fast (~20ms).
item_data = memcache.get(items_data_key)
if not item_data:
    items = ndb.get_multi(item_keys)
    flat_data = json.dumps([{'name': item.name} for item in items])
    memcache.add(items_data_key, flat_data)
This is interesting, but isn't really a solution for me since the set of items I need to fetch isn't static.
Is the performance I'm seeing typical/expected? All these measurements are on the default App Engine production config (F1 instance, shared memcache). Is it deserialization cost, or is it due to fetching multiple keys from memcache?
I don't think the issue is instance ramp-up time. I profiled the code line by line using time.clock() calls and I see roughly similar numbers (3x faster than what I see in AppStats, but still very slow). Here's a typical profile:
# Fetch keys: 20 ms
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
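The timing itself is nothing fancy; a minimal sketch of this kind of bracketing looks roughly like the following (it reuses item_keys from the snippet above, and the log format is just illustrative; note that time.clock() measures CPU time on App Engine, which is why these numbers come in below the wall-clock figures from App Stats):

import logging
import time

start = time.clock()
items = ndb.get_multi(item_keys)
elapsed_ms = (time.clock() - start) * 1000.0
logging.info('ndb.get_multi: %.1f ms for %d keys (%.2f ms/key)',
             elapsed_ms, len(item_keys), elapsed_ms / len(item_keys))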
Update: Out of interest I also profiled this with all the App Engine performance settings increased to maximum (F4 instance, 2400 MHz, dedicated memcache). The performance wasn't much better. On the faster instance the App Stats timings now match my time.clock() profile (so 500ms to fetch 500 small objects instead of 1500ms). However, it still seems extremely slow.

I investigated this in a bit of detail, and the problem is ndb and Python, not memcache. The reason things are so incredibly slow is partly deserialization (explains about 30% of the time), and the rest seems to be overhead in ndb's task queue implementation.
This means that, if you really want to, you can avoid ndb and instead fetch and deserialize from memcache directly. In my test case with 500 small entities, this gives a massive 2.5x speedup (650ms vs 1600ms on an F1 instance in production, or 200ms vs 500ms on an F4 instance).
This gist shows how to do it:
https://gist.github.com/mcummins/600fa8852b4741fb2bb1
Here is the appstats output for the manual memcache fetch and deserialization:
[App Stats screenshot: manual memcache fetch + deserialization]
Now compare this to fetching exactly the same entities using ndb.get_multi(keys):
[App Stats screenshot: ndb.get_multi fetch]
Almost a 3x difference!
Profiling each step is shown below. Note the timings don't match appstats because they're running on an F1 instance, so real time is 3x clock time.
Manual version:
# memcache.get_multi: 50.0 ms
# Deserialization: 140.0 ms
# Number of keys is 521, fetch time per key is 0.364683301344 ms
vs ndb version:
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
So ndb takes 1ms per entity fetched, even if the entity has one single property and is in memcache. That's on an F4 instance. On an F1 instance it takes 3ms. This is a serious practical limitation: if you want to maintain reasonable latency, you can't fetch more than ~100 entities of any kind when handling a user request on an F1 instance.
Clearly ndb is doing something really expensive and (at least in this case) unnecessary. I think it has something to do with its task queue and all the futures it sets up. Whether it is worth going around ndb and doing things manually depends on your app. If you have some memcache misses then you will have to go do the datastore fetches. So you essentially end up partly reimplementing ndb. However, since ndb seems to have such massive overhead, this may be worth doing. At least it seems so based on my use case of a lot of get_multi calls for small objects, with a high expected memcache hit rate.
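As a rough illustration of what "partly reimplementing ndb" might look like, here is an untested sketch of a get_multi wrapper that checks memcache first and falls back to ndb for misses. The names (fast_get_multi, CACHE_PREFIX) are made up, and it keeps its own pickle-based cache entries rather than reading ndb's internal ones, so treat it as a starting point rather than a drop-in replacement:

import pickle
from google.appengine.api import memcache
from google.appengine.ext import ndb

CACHE_PREFIX = 'entcache:'  # our own namespace, separate from ndb's internal cache keys

def fast_get_multi(keys):
    """Fetch entities by key, trying memcache first and falling back to ndb for misses."""
    cache_keys = {CACHE_PREFIX + k.urlsafe(): k for k in keys}
    cached = memcache.get_multi(list(cache_keys))

    entities = {}
    for ck, blob in cached.iteritems():
        entities[cache_keys[ck]] = pickle.loads(blob)  # ndb entities are picklable

    missing = [k for k in keys if k not in entities]
    if missing:
        # Fall back to ndb for the misses. use_memcache=False avoids
        # double-caching in ndb's own memcache layer.
        fetched = ndb.get_multi(missing, use_memcache=False)
        to_cache = {}
        for k, ent in zip(missing, fetched):
            entities[k] = ent
            if ent is not None:
                to_cache[CACHE_PREFIX + k.urlsafe()] = pickle.dumps(ent, 2)
        if to_cache:
            memcache.set_multi(to_cache)

    return [entities.get(k) for k in keys]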
It also seems to suggest that if Google were to implement some key bits of ndb and/or deserialization as C modules, Python App Engine could be massively faster.

Related

Google Cloud Datastore latency

I have an API deployed on Cloud Run where each request results in a read + write to Cloud Datastore. A non-trivial amount of the requests are first timers (read from Datastore will return null), so adding caching in front of it may not be too helpful.
Over the past month, the average wall time for calling Datastore and getting the data back (data = client.get(key, eventual=True)) has been 48ms. The payloads are small (a list of dicts, with 10 elements on average, and each dict has two floats).
I'm not sure if I should say that latency is high, but my API has a budget of 100ms to do all that it needs to do and return. If just the data fetching takes ~50% of that time, I'm looking for ways to optimize things.
Questions:
In general, how does 50ms sound for fairly small payloads, fetched by key, from within GCP?
What should I expect (in terms of latency) from Memorystore within GCP?
Assuming you are using Cloud Run and Datastore in the same location, I would say that 50ms is around the expected latency for a Datastore read; the size of the payload does not matter as much for reads (10 vs. 1,000 document reads does not make a big difference in processing/propagation time).
Since you have such a small window for your API to operate in, this could indeed be a problem if some unexpected delays occur.
I have never used Memorystore, so I can't say what actual latency you could expect, but it might be a better option given that every millisecond counts for your app.
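If you do try Memorystore, a read-through cache in front of Datastore is only a few lines. This is an untested sketch; the Redis host IP is a placeholder for your Memorystore instance, reachable from Cloud Run through a Serverless VPC connector:

import json
import redis
from google.cloud import datastore

# Placeholder host: Memorystore exposes a private IP reachable via a VPC connector.
cache = redis.Redis(host='10.0.0.3', port=6379)
client = datastore.Client()

def get_with_cache(key, ttl=300):
    # Read-through cache: try Redis first, fall back to Datastore and populate Redis.
    cache_key = 'ds:' + key.to_legacy_urlsafe().decode()
    blob = cache.get(cache_key)
    if blob is not None:
        return json.loads(blob)
    entity = client.get(key, eventual=True)
    if entity is None:
        return None
    data = dict(entity)
    cache.setex(cache_key, ttl, json.dumps(data))
    return data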

GAE: how to quantify Frontend Instance Hours usage?

We are developing a Python server on Google App Engine that should be capable of handling incoming HTTP POST requests (around 1,000 to 3,000 per minute in total). Each of the requests will trigger some datastore writing operations. In addition we will write a web-client as a human-usable interface for displaying and analyse stored data.
First we are trying to estimate usage for GAE to have at least an approximation about the costs we would have to cover in future based on the number of requests. As for datastore write operations and data storage size it is fairly easy to come up with an approximate number, though it is not so obvious for the frontend and backend instance hours.
As far as I understand, each time a request comes in, an instance is started, and it then keeps running for 15 minutes; if another request comes in within those 15 minutes, the same instance is reused. And now it gets a bit tricky, I think: if two requests come in at the very same time (which is not so odd with 3,000 requests per minute), does Google fire up another instance, and hence count an additional (at least) 0.25 instance hours? I am also not quite sure how a web client that is constantly performing read operations on the datastore in order to display and analyse data would increase the instance hours.
Does anyone know a reliable way of counting instance hours and creating meaningful estimations? We would use that information to know how expensive it would be to run an application on GAE in comparison to just ordering a web server.
There's no 100% sure way to assess the number of frontend instance hours. An instance can serve more than one request at a time. In addition, the algorithm of the scheduler (the system that starts the instances) is not documented by Google.
Depending on how demanding your code is, I think you can expect a standard F1 instance to hold up to 5 requests in parallel, that's a maximum. 2 is a safer bet.
My recommendation, if possible, would be to simulate standard interaction on your website with a limited number of users, see how the number of instances grows, and then extrapolate.
For example, let's say you simulate 100 requests per minute for 2 hours and see that GAE spawns 5 instances for that; then you can extrapolate that a continuous load of 3,000 requests per minute would require 150 instances over the same 2 hours. I would then double this number for safety and end up with an estimate of 300 instances.
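The arithmetic of that extrapolation, spelled out (the numbers are just the ones from the example above):

# Hypothetical load-test result from the example above
observed_rpm = 100        # requests per minute during the 2-hour test
observed_instances = 5    # instances GAE spawned for that load
target_rpm = 3000

estimated = observed_instances * (target_rpm / float(observed_rpm))   # 150 instances
safe_estimate = 2 * estimated                                         # doubled for safety: 300
print("Estimated: %d instances, with safety margin: %d" % (estimated, safe_estimate))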

How to import/sync data to App Engine datastore without excessive datastore reads or timeouts

I am writing an application that uses a remote API that serves up fairly static data (though it can still update several times a day). The problem is that the API is quite slow, and I'd much rather import that data into my own datastore anyway, so that I can also query the data on my end.
The problem is that the results contain ~700 records that need to be sync'd every 5 hours or so. This involves adding new records, updating old records and deleting stale ones.
I have a simple solution that works -- but it's slow as molasses, and uses 30,000 datastore read operations before it times out (after about 500 records).
The worst part about this is that the 700 records are for a single client, and I was doing it as a test. In reality, I would want to do the same thing for hundreds or thousands of clients with a similar number of records... you can see how that is not going to scale.
Here is my entity class definition:
class Group(ndb.Model):
    groupid = ndb.StringProperty(required=True)
    name = ndb.StringProperty(required=True)
    date_created = ndb.DateTimeProperty(required=True, auto_now_add=True)
    last_updated = ndb.DateTimeProperty(required=True, auto_now=True)
Here is my sync code (Python):
from datetime import datetime

currentTime = datetime.now()
groups = get_list_of_groups_from_api(clientid)  # [{'groupname': 'Group Name', 'id': '12341235'}, ...]

for group in groups:
    groupid = group["id"]
    groupObj = Group.get_or_insert(groupid, groupid=group["id"], name=group["name"])
    groupObj.put()

staleGroups = Group.query(Group.last_updated < currentTime)
for staleGroup in staleGroups:
    staleGroup.key.delete()
I can't tell you why you are getting 30,000 read operations.
You should start by running appstats and profiling this code, to see where the datastore operations are being performed.
That being said I can see some real inefficiencies in your code.
For instance your delete stale groups code is horribly inefficient.
You should be doing a keys_only query, and then doing batch deletes.
What you are doing is really slow with lots of latency for each delete() in the loop.
Also, get_or_insert uses a transaction, and if the group didn't exist a put has already been done inside it, so your explicit put() is a second write; if you don't need transactions you will find things run faster. The fact that you are not storing any additional data means you could just blindly write the groups (no initial get/read), unless you want to preserve date_created.
Other ways of making this faster would be by doing batch gets/puts on the list of keys.
Then for all the entities that didn't exist, do a batch put()
Again this would be much faster than iterating over each key.
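Putting those two suggestions together, a rough, untested sketch of the sync loop using batch gets/puts plus a keys_only query and batch delete might look like this (it assumes the Group model and get_list_of_groups_from_api helper from the question):

from datetime import datetime
from google.appengine.ext import ndb

def sync_groups(clientid):
    current_time = datetime.now()
    groups = get_list_of_groups_from_api(clientid)

    # Batch-read existing entities instead of one get_or_insert (transaction) per group.
    keys = [ndb.Key(Group, group["id"]) for group in groups]
    existing = ndb.get_multi(keys)

    to_put = []
    for group, key, entity in zip(groups, keys, existing):
        if entity is None:
            entity = Group(key=key, groupid=group["id"], name=group["name"])
        else:
            entity.name = group["name"]  # put() below refreshes last_updated via auto_now
        to_put.append(entity)
    ndb.put_multi(to_put)  # one batch write instead of hundreds of individual put() calls

    # Delete stale groups with a keys_only query and a single batch delete.
    stale_keys = Group.query(Group.last_updated < current_time).fetch(keys_only=True)
    ndb.delete_multi(stale_keys)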
In addition, you should use a TaskQueue to run this code; you then have a 10-minute processing window.
After that further scaling can be achieved by splitting the process into two tasks. The first creates/updates the group entities. Once that completes you start the task that deletes stale groups - passing the datetime as an argument to the next task.
If you have even more entities than can be processed in this simple model then start looking at MapReduce.
But for starters just concentrate on making the job you are currently running more efficient.

ndb.query.count() failed with 60s query deadline on large entities

For 100k+ entities in Google Datastore, ndb.query().count() gets cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.
How can I count large entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which we can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like:
from google.appengine.ext import deferred

# Somewhere in your request handler (SomeModel stands in for whichever ndb.Model kind you want to count):
deferred.defer(count_large_query, SomeModel.query())
While I'm still not sure whether count() is going to return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        ndbs, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor, keys_only=True)
        total += len(ndbs)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check whether you can see the 'Total entities: ##' line in your console.
As Guido mentioned in the comment, this will not scale either:

"This still doesn't scale (though it may postpone the problem). A task has a 10 minute limit instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work)."
So you might want to take a look at the best practices for writing scalable applications, and especially at sharded counters.
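For reference, the core of a sharded counter is small. This is a generic, untested sketch (names like CounterShard and NUM_SHARDS are made up; see the App Engine sharding-counters article for a fuller version). It only helps if you increment the counter at write time, rather than counting existing entities after the fact:

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, but reads have to sum more entities

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    # Pick a random shard so concurrent increments rarely contend on the same entity.
    shard_id = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shard = ndb.Key(CounterShard, shard_id).get()
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(counter_name):
    keys = [ndb.Key(CounterShard, '%s-%d' % (counter_name, i)) for i in range(NUM_SHARDS)]
    return sum(shard.count for shard in ndb.get_multi(keys) if shard is not None)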
This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.
It would be nice if count() supported cursors somehow so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors, you could continue across all 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you can greatly reduce the waste and hopefully increase the speed of the RPC calls, e.g., it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
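A sketch of that "pass the baton" approach with the deferred library might look like this (untested; count_in_chunks and BATCH are made-up names, the kind is passed as a string so the deferred arguments stay picklable, and since count() doesn't take a cursor it falls back to keys-only pages):

import logging
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import deferred
from google.appengine.ext import ndb

BATCH = 1000

def count_in_chunks(kind, cursor_urlsafe=None, total=0):
    query = ndb.Query(kind=kind)
    cursor = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    keys, next_cursor, more = query.fetch_page(BATCH, start_cursor=cursor, keys_only=True)
    total += len(keys)
    if more and next_cursor:
        # Hand the cursor and running total to a fresh task before this one hits its deadline.
        deferred.defer(count_in_chunks, kind, next_cursor.urlsafe(), total)
    else:
        logging.info('Final count for %s: %d', kind, total)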
Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).
It is better to use Google App Engine backends.
Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and run indefinitely.
Please take a look at the document here: https://developers.google.com/appengine/docs/java/backends/overview

How to minimize datastore writes initiated by the mapreduce library?

I've got 3 parts to this question:
I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To do the timeout, I have a cron job that runs once a minute to clear out the expired objects.
Most of the time right now, I don't have any active objects. In this case, the mapreduce handler checks the entity it gets, and does nothing if it's not active, no writes. However, my free datastore write quota is running out from the mapreduce calls after about 7 hours. According to my rough estimate, it looks like just running mapreduce causes ~ 120 writes/call. (Rough math, 60 calls/hr * 7 hr = 420 calls, 50k ops limit / 420 calls ~ 120 writes/call)
Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?
To get around it, I'm checking the datastore before I kick off the mapreduce:
def cronhandler():
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        shards = (count / 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()
Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking it might be possible with a better custom InputReader.
Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting shards by the expected number of objects I need to write appropriately?
What if you kept your objects in memcache instead of the datastore? My only worry is whether memcache is consistent across all instances running a given application, but if it is, the problem has a very neat solution.
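A minimal sketch of that memcache-only approach (hypothetical helper names; the expiry does the cleanup for you, but remember memcache can evict entries at any time, so this only works if occasionally losing an object is acceptable):

from google.appengine.api import memcache

TIMEOUT = 5 * 60  # objects are only valid for five minutes anyway

def save_object(obj_id, data):
    # memcache expires the entry after TIMEOUT, so no cron/mapreduce cleanup pass is needed.
    memcache.set('obj:%s' % obj_id, data, time=TIMEOUT)

def load_object(obj_id):
    return memcache.get('obj:%s' % obj_id)  # None once expired or evicted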
This doesn't exactly answer your question, but could you reduce the frequency of the cron job?
Instead of deleting models as soon as they become invalid, simply remove them from the queries that your users see.
For example:
import datetime

now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)

q = model.all()
q.filter('created_at >=', five_minutes_ago)
Or, if you don't want to use an inequality filter, you could use an equality filter (==) based on five-minute blocks.
Then, you run your cron every hour or so to clean out the inactive models.
The downside to this approach is that the entities would still be returned by a fetch by key, in which case you would need to verify that they are still valid before returning them to the user.
I'm assuming what I've done is the best way to go about doing things. It looks like the Mapreduce API uses the datastore to keep track of the jobs launched and synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but that reduces wall time performance as well.
