Optimizing Push Task Queues - python

I use Google App Engine (Python) as the backend of a mobile game, which includes social network integration (Twitter) and global and relative leaderboards. My application uses two task queues: one for building out the relationships between players, and one for updating those objects when a player's score changes.
Model
class RelativeUserScore(ndb.Model):
    ID_FORMAT = "%s:%s"  # "friend_id:follower_id"

    # --- NDB Properties
    follower_id = ndb.StringProperty(indexed=True)          # the follower
    user_id = ndb.StringProperty(indexed=True)              # the followed (AKA friend)
    points = ndb.IntegerProperty(indexed=True)              # user data denormalization
    screen_name = ndb.StringProperty(indexed=False)         # user data denormalization
    profile_image_url = ndb.StringProperty(indexed=False)   # user data denormalization
This allows me to build the relative leaderboards by querying for objects where the requesting user is the follower.
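For reference, the relative leaderboard query looks roughly like this (a sketch; the limit and the ordering by points are assumptions):

def relative_leaderboard(requesting_user_id, limit=50):
    # All score rows where the requesting user is the follower,
    # highest score first.
    return (RelativeUserScore.query(
                RelativeUserScore.follower_id == requesting_user_id)
            .order(-RelativeUserScore.points)
            .fetch(limit))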
Push Task Queues
I basically have two major tasks to be performed:
sync-twitter tasks fetch a user's friends / followers from Twitter's API and build out the relative user score models. Friends are checked on user sign-up, and again if the user's Twitter friend count changes. Followers are only checked on user sign-up. This runs in its own module with F4 instances and min_idle_instances set to 20 (I'd like to reduce both settings if possible, though instance memory usage requires at least F2 instances).
- name: sync-twitter
  target: sync-twitter            # target version / module
  bucket_size: 100                # default is 5, max is 100?
  max_concurrent_requests: 1000   # default is 1000. what is the max?
  rate: 100/s                     # default is 5/s. what is the max?
  retry_parameters:
    min_backoff_seconds: 30
    max_backoff_seconds: 900
update-leaderboard tasks update all of the user's objects after they play a game (a game only takes about two minutes to play). This runs in its own module with F2 instances and min_idle_instances set to 10 (I'd like to reduce both settings if possible).
- name: update-leaderboard
  target: update-leaderboard      # target version / module
  bucket_size: 100                # default is 5, max is 100?
  max_concurrent_requests: 1000   # default is 1000. what is the max?
  rate: 100/s                     # default is 5/s. what is the max?
I've already optimized these tasks to run asynchronously and have reduced their run time significantly; most of the time the tasks take between 0.5 and 5 seconds. I've also put each task queue on its own dedicated module with automatic scaling turned up pretty high (using F4 and F2 instance classes respectively). However, I'm still running into a few issues.
As you can also see, I've tried to max out bucket_size and max_concurrent_requests so that these tasks run as fast as possible.
Problems
1. Every once in a while I get a DeadlineExceededError on the request handler that initiates the call: "The API call taskqueue.BulkAdd() took too long to respond and was cancelled." (One mitigation I'm considering is sketched right after this list.)
2. Every once in a while I get a chunk of similar errors within the tasks themselves (for both task types): "Process terminated because the request deadline was exceeded during a loading request". (Note that this isn't reported as a DeadlineExceededError.) The logs show these tasks took up the entire 600 seconds allowed. They end up getting rescheduled, and when they re-run they only take the expected 0.5 to 5 seconds. I've tried using AppStats to gain more insight into what's going on, but these calls never get recorded because they are killed before AppStats is able to save.
3. With users updating their score as frequently as every two minutes, the update-leaderboard queue starts to fall behind somewhere around 10K CCU. I'd ideally like to be prepared for at least 100K CCU. (By CCU I mean actual users playing our game, not the number of concurrent requests, which is only about 500 front-end API requests/second at 25K users; I use locust.io to load test.)
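For problem #1, one mitigation I'm considering is to enqueue in smaller batches with the async task queue API, so no single BulkAdd RPC has to cover the whole fan-out. A rough sketch, assuming the handler currently adds many tasks in one call (the batch size, queue name, and task URL are placeholders):

from google.appengine.api import taskqueue

BATCH_SIZE = 50  # assumption: small enough that each BulkAdd RPC stays well under the deadline

def enqueue_leaderboard_updates(user_ids):
    queue = taskqueue.Queue('update-leaderboard')
    futures = []
    for i in range(0, len(user_ids), BATCH_SIZE):
        batch = [taskqueue.Task(url='/tasks/update-leaderboard',
                                params={'user_id': uid})
                 for uid in user_ids[i:i + BATCH_SIZE]]
        # add_async returns immediately; the per-batch RPCs run in parallel.
        futures.append(queue.add_async(batch))
    for future in futures:
        future.get_result()  # surfaces any per-batch errors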
Potential Optimizations / Questions
My first thought is that maybe the first two issues stem from only having a single task queue for each task type. Maybe this is happening because the underlying Bigtable is splitting during these calls? (See this article, specifically "Queue Sharding for Stable Performance".)
So maybe sharding each queue into 10 different queues would help, and I'd think problem #3 would also benefit from queue sharding (a rough sketch of what I mean follows the questions below). So...
1. Any idea as to the underlying causes of problems #1 and #2? Would sharding actually help eliminate these errors?
2. If I do queue sharding, could I keep all the queues on their same respective module, and rely on its autoscaling to meet the demand? Or would I be better off with module per shard?
3. Any way to dynamically scale the sharding?
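To make questions #2 and #3 concrete, this is the kind of sharding I have in mind (a rough sketch, not production code; the shard count and the shard queue names are assumptions, and each shard queue would need its own entry in queue.yaml):

import zlib
from google.appengine.api import taskqueue

NUM_SHARDS = 10  # assumption: queues update-leaderboard-0 .. update-leaderboard-9

def add_leaderboard_task(user_id, params):
    # Pick a shard deterministically from the user id so one user's
    # updates always land on the same queue.
    shard = (zlib.crc32(str(user_id)) & 0xffffffff) % NUM_SHARDS
    taskqueue.add(queue_name='update-leaderboard-%d' % shard,
                  url='/tasks/update-leaderboard',
                  params=params)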
My next thought is to reduce the number of calls to update-leaderboard tasks: not every completed game needs to translate directly into a leaderboard update. But I'd need something where, even if the user only plays one game, the objects are guaranteed to be updated eventually. Any suggestions on implementing this reduction?
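One idea I've sketched (untried at scale, so treat it as a hypothesis) is to coalesce updates with named tasks: a task name can only be enqueued once while pending, and re-using a recently executed name raises TombstonedTaskError, so naming the task after the user plus a time bucket collapses bursts of games into one update per bucket while still guaranteeing that a single game eventually triggers an update. The bucket size and handler URL below are placeholders:

import time
from google.appengine.api import taskqueue

BUCKET_SECONDS = 120  # assumption: at most one leaderboard update per user per two minutes

def schedule_leaderboard_update(user_id):
    bucket = int(time.time()) // BUCKET_SECONDS
    try:
        taskqueue.add(
            queue_name='update-leaderboard',
            url='/tasks/update-leaderboard',      # assumed handler URL
            params={'user_id': user_id},
            name='lb-%s-%d' % (user_id, bucket),  # dedupes within the bucket
            countdown=BUCKET_SECONDS)             # run after the bucket closes
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # An update for this user/bucket is already scheduled; the score it
        # reads at run time will include this game, so nothing more to do.
        pass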
Finally, all of the modules' auto-scaling parameters and the queues' parameters were set somewhat arbitrarily, erring on the side of maxing them out. Any advice on setting these appropriately so that I'm not spending more resources than I need?

Related

How to autoscale Python webjob based on Service Bus Queues in Azure?

When a service bus queue contains any messages, I want my python webjob to scale out so the messages are processed faster.
I have a python webjob which feeds off a Service Bus Queue. The queue is populated each day at midnight and can have between 0 and around 400k messages added to it.
The bottleneck in the current processing is where some data needs to be downloaded, which means that scaling up the webjob won't help as much as parallelising it.
I scaled it up to 10 instances from 1, but that doesn't appear to affect the rate at which messages are consumed from the queue, which suggests that this isn't working the way I expect. When it was on 1 instance it processed ~1.53k messages in an hour; in the hour since scaling out to 10 instances it processed ~1.5k messages (so, basically, no difference).
The code I'm using to interface with the queue is this (if there is a better way of doing this in Python please let me know!):
import time
from azure.servicebus import ServiceBusService, Message, Queue

bus_service = ServiceBusService(
    service_namespace=<namespace>,
    shared_access_key_name='RootManageSharedAccessKey',
    shared_access_key_value=<key>)

while True:
    msg = bus_service.receive_queue_message(<queue>, peek_lock=False, timeout=1)
    if msg.body is None:
        print("No messages in queue")
        time.sleep(5)
    else:
        number = int(msg.body.decode('utf-8'))
        print(number)
I know in C# there is a QueueTrigger attribute for webjobs but I don't know of anything similar for Python.
I would expect that the more instances running in the app service, the faster messages would be processed, so why isn't that what I see?
The bottleneck in the program was the database, which was already at maximum capacity. Adding more instances just increased the number of calls to the database and therefore slowed down each instance.
Scaling up the database and optimising the database calls improved performance, and it also means that multiple instances can now be spun up to further increase throughput.

GAE: how to quantify Frontend Instance Hours usage?

We are developing a Python server on Google App Engine that should be capable of handling incoming HTTP POST requests (around 1,000 to 3,000 per minute in total). Each request will trigger some datastore write operations. In addition, we will write a web client as a human-usable interface for displaying and analysing the stored data.
First we are trying to estimate GAE usage so that we have at least an approximation of the costs we would have to cover, based on the number of requests. For datastore write operations and data storage size it is fairly easy to come up with an approximate number, but it is not so obvious for frontend and backend instance hours.
As far as I understand it, each time a request comes in, an instance is started, which then keeps running for 15 minutes. If another request comes in within these 15 minutes, the same instance is used. Now it gets a bit tricky, I think: if two requests come in at the very same time (which is not so odd with 3,000 requests per minute), does Google fire up another instance, and hence count an additional (at least) 0.25 instance hours? Also, I am not quite sure how a web client that constantly performs read operations on the datastore in order to display and analyse data would increase the instance hours.
Does anyone know a reliable way of counting instance hours and creating meaningful estimations? We would use that information to know how expensive it would be to run an application on GAE in comparison to just ordering a web server.
There's no 100% sure way to assess the number of frontend instance hours. An instance can serve more than one request at a time. In addition, the algorithm of the scheduler (the system that starts the instances) is not documented by Google.
Depending on how demanding your code is, I think you can expect a standard F1 instance to handle up to 5 requests in parallel; that's a maximum, and 2 is a safer bet.
My recommendation, if possible, would be to simulate standard interaction on your website with a limited number of users, see how the number of instances grows, and then extrapolate.
For example, let's say you simulate 100 requests per minute during 2 hours, and you see that GAE spawns 5 instances for that, then you can extrapolate that a continuous load of 3000 requests per minute would require 150 instances during the same 2 hours. Then I would double this number for safety, and end up with an estimate of 300 instances.

ndb.query.count() failed with 60s query deadline on large entities

For 100k+ entities in the Google datastore, ndb.query().count() gets cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.
How can I count large entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which we can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like:
from google.appengine.ext import deferred
# Somewhere in your request:
deferred.defer(count_large_query, ndb.query())
While I'm still not sure whether count() will return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        ndbs, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor, keys_only=True)
        total += len(ndbs)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check whether you can see the Total entities: ## line in your console.
As Guido mentioned in the comments, this is not going to scale either:
"This still doesn't scale (though it may postpone the problem). A task has a 10 minute deadline instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work)."
So you might want to take a look at the best practices for writing scalable applications, and especially at sharded counters.
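The sharded counter pattern from that article boils down to roughly the following (a minimal sketch with a fixed shard count; the full recipe in the docs also caches the total in memcache):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumption: fixed number of shards

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    # Spread contention by incrementing one randomly chosen shard.
    shard_id = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id) or CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(counter_name):
    keys = [ndb.Key(CounterShard, '%s-%d' % (counter_name, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)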
This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.
It would be nice if count() supported cursors somehow so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors, you could continue across all 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you can greatly reduce the waste and hopefully increase the speed of the RPC calls, e.g., it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).
It is better to use Google App Engine backends.
Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and run indefinitely.
Please take a look at the document here: https://developers.google.com/appengine/docs/java/backends/overview

How to minimize datastore writes initiated by the mapreduce library?

I've got 3 parts to this question:
I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To do the timeout, I have a cron job that runs once a minute to clear out the expired objects.
Most of the time right now, I don't have any active objects. In that case, the mapreduce handler checks each entity it gets and does nothing if it's not active, so no writes. However, my free datastore write quota is being exhausted by the mapreduce calls after about 7 hours. According to my rough estimate, just running mapreduce causes ~120 writes per call. (Rough math: 60 calls/hr * 7 hr = 420 calls; 50k ops limit / 420 calls ~ 120 writes/call.)
Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?
To get around it, I'm checking the datastore before I kick off the mapreduce:
def cronhandler():
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        shards = (count / 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()
Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking it might be possible with a better custom InputReader.
Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting shards by the expected number of objects I need to write appropriately?
What if you kept your objects in memcache instead of the datastore? My only worry is whether memcache is consistent across all instances running a given application, but if it is, the problem has a very neat solution.
This doesn't exactly answer your question, but could you reduce the frequency of the cron job?
Instead of deleting models as soon as they become invalid, simply remove them from the queries that your Users see.
For example:
import datetime

now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)

q = model.all()
q.filter('created_at >=', five_minutes_ago)
Or, if you don't want to use an inequality filter, you could use == based on five-minute blocks.
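For example, something along these lines (a sketch; the block property is hypothetical and would have to be set when the entity is written):

import datetime

def five_minute_block(dt):
    # Round a datetime down to the start of its five-minute block,
    # e.g. 12:07:41 -> 12:05:00.
    return dt.replace(minute=(dt.minute // 5) * 5, second=0, microsecond=0)

# Store the block on the entity at write time, then query with equality:
q = model.all()
q.filter('block =', five_minute_block(datetime.datetime.now()))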
Then, you run your cron every hour or so to clean out the inactive models.
The downside to this approach is that entities fetched by key (rather than through the query) would still need to be verified as valid before being returned to the user.
I'm assuming what I've done is the best way to go about doing things. It looks like the mapreduce API uses the datastore to keep track of the jobs launched and to synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but it reduces wall-time performance as well.

How to schedule a periodic Celery task per Django model instance?

I have a bunch of Feed objects in my database, and I'm trying to get each Feed to be updated every hour. My issue here is that I need to make sure there aren't any duplicate updates -- it needs to happen no more than once an hour, but I also don't want feeds waiting two hours for an update. (It's okay if it happens every hour +/- a few minutes, but twice in a few minutes is bad.)
I'm using Django and Celery with Amazon SQS as a broker. I have the feed update code set up as a Celery task, but I'm failing to find a way to prevent duplicates while remaining compatible with Celery running on multiple nodes.
My current solution is to add a last_update_scheduled attribute to the Feed model and run the following task every 5 minutes (pseudo-code):
now = datetime.now()
threshold = now - timedelta(seconds=3600)
for f in Feed.objects.filter(Q(last_update_scheduled__lt=threshold) |
                             Q(last_update_scheduled=None)):
    updateFeed.delay(f)
    f.last_update_scheduled = now
    f.save()
This is susceptible to a number of synchronization issues. For example, if my task queues get backed up, this task could run twice at the same time, causing duplicate updates. I've seen some solutions for this (like Celery's recipe and an adaptation on Stack Overflow), but the memcached solution isn't reliable, e.g. duplicates could happen when restarting memcached or if it happens to run out of memory and purge old data. Not to mention I'd hate to have to add memcached to my production configuration just for a simple lock.
In a perfect world, I'd like to be able to say:
@modelTask(Feed, run_every=3600)
def updateFeed(feed):
    # do something expensive
But so far my imagination fails me on how to implement that decorator.
To be clear, the Celery recipe is not using memcached per se, but rather Django's caching middleware. There are a number of other caching methods that would suit your needs without the downside of memcached. See the Django caching documentation for details.
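For completeness, the core of that recipe is a cache-based lock along these lines (a sketch using Django's cache framework; the key format and timeout are arbitrary):

from contextlib import contextmanager
from django.core.cache import cache

LOCK_EXPIRE = 60 * 5  # seconds; should comfortably exceed one feed update

@contextmanager
def feed_lock(feed_id):
    lock_id = 'feed-update-lock-%s' % feed_id
    # cache.add is atomic: it only sets the key if it doesn't already exist,
    # so exactly one worker acquires the lock at a time.
    acquired = cache.add(lock_id, 'locked', LOCK_EXPIRE)
    try:
        yield acquired
    finally:
        if acquired:
            cache.delete(lock_id)

# Inside the Celery task:
#   with feed_lock(feed.pk) as acquired:
#       if acquired:
#           do_the_update(feed)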
