How to minimize datastore writes initiated by the mapreduce library? - python

I've got 3 parts to this question:
I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To do the timeout, I have a cron job that runs once a minute to clear out the expired objects.
Most of the time right now, I don't have any active objects. In this case, the mapreduce handler checks the entity it gets, and does nothing if it's not active, no writes. However, my free datastore write quota is running out from the mapreduce calls after about 7 hours. According to my rough estimate, it looks like just running mapreduce causes ~ 120 writes/call. (Rough math, 60 calls/hr * 7 hr = 420 calls, 50k ops limit / 420 calls ~ 120 writes/call)
Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?
To get around it, I'm checking the datastore before I kick off the mapreduce:
def cronhandler():
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        shards = (count / 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()
Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking it might be possible with a custom InputReader.
Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting the shard count to the expected number of objects I need to write an appropriate approach?

What if you kept your objects in memcache instead of the datastore? My only worry is whether memcache is consistent between all instances running a given application, but if it is, the problem has a very neat solution.
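Here is a minimal sketch of that idea (the key scheme and payload shape are assumptions for illustration). App Engine's memcache service is shared by all instances of an application, though entries can be evicted before they expire, so you would still need to tolerate missing objects:

from google.appengine.api import memcache

TIMEOUT_SECONDS = 5 * 60  # objects become invalid after 5 minutes

def save_object(object_id, data):
    # memcache expires the entry itself, so no cron cleanup is needed
    memcache.set('obj:%s' % object_id, data, time=TIMEOUT_SECONDS)

def get_object(object_id):
    # returns None once the object has expired (or was evicted early)
    return memcache.get('obj:%s' % object_id)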

This doesn't exactly answer your question, but could you reduce the frequency of the cron job?
Instead of deleting models as soon as they become invalid, simply remove them from the queries that your Users see.
For example:
import datetime

now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)
q = model.all()
q.filter('created_at >=', five_minutes_ago)
Or if you don't want to use an inequality filter you could use == based on five minute blocks.
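A rough sketch of that equality-filter variant (the created_block property is an assumption added for illustration; objects from the previous block still need the validity check mentioned below):

import datetime

def five_minute_block(dt):
    # e.g. 12:07 -> '2013-01-01T12:05:00'
    return dt.replace(minute=dt.minute - dt.minute % 5,
                      second=0, microsecond=0).isoformat()

# at creation time: obj.created_block = five_minute_block(datetime.datetime.now())

now = datetime.datetime.now()
blocks = [five_minute_block(now),
          five_minute_block(now - datetime.timedelta(minutes=5))]
q = model.all().filter('created_block IN', blocks)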
Then, you run your cron every hour or so to clean out the inactive models.
The downside to this approach is that the entities would still be returned by a get-by-key fetch, in which case you would need to verify that they were still valid before returning them to the user.

I'm assuming what I've done is the best way to go about doing things. It looks like the Mapreduce API uses the datastore to keep track of the jobs launched and synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but that reduces wall time performance as well.

Related

Optimizing Push Task Queues

I use Google App Engine (Python) as the backend of a mobile game, which includes social network integration (twitter) and global & relative leaderboards. My application makes use of two task queues, one for building out the relationships between players, and one for updating those objects when a player's score changes.
Model
class RelativeUserScore(ndb.Model):
    ID_FORMAT = "%s:%s"  # "friend_id:follower_id"

    # --- NDB Properties
    follower_id = ndb.StringProperty(indexed=True)            # the follower
    user_id = ndb.StringProperty(indexed=True)                # the followed (AKA friend)
    points = ndb.IntegerProperty(indexed=True)                # user data denormalization
    screen_name = ndb.StringProperty(indexed=False)           # user data denormalization
    profile_image_url = ndb.StringProperty(indexed=False)     # user data denormalization
This allows me to build the relative leaderboards by querying for objects where the requesting user is the follower.
Push Task Queues
I basically have two major tasks to be performed:
sync-twitter tasks will fetch the friends / followers from twitter's API, and build out the relative user score models. Friends are checked on user sign up, and again if their twitter friend count changes. Followers are only checked on user sign up. This runs in its own module with F4 instances, and min_idle_instances set to 20 (I'd like to reduce both settings if possible, though the instance memory usage requires at least F2 instances).
- name: sync-twitter
  target: sync-twitter            # target version / module
  bucket_size: 100                # default is 5, max is 100?
  max_concurrent_requests: 1000   # default is 1000. what is the max?
  rate: 100/s                     # default is 5/s. what is the max?
  retry_parameters:
    min_backoff_seconds: 30
    max_backoff_seconds: 900
update-leaderboard tasks will update all the user's objects after they play a game (which only takes about 2 minutes to do). This runs in its own module with F2 instances, and min_idle_instances set to 10 (I'd like to reduce both settings if possible).
- name: update-leaderboard
  target: update-leaderboard      # target version / module
  bucket_size: 100                # default is 5, max is 100?
  max_concurrent_requests: 1000   # default is 1000. what is the max?
  rate: 100/s                     # default is 5/s. what is the max?
I've already optimized these tasks to make them run asynchronously, and have reduced their run time significantly. Most of the time, the tasks take between 0.5 and 5 seconds. I've also put both task queues on their own dedicated modules, with automatic scaling turned up pretty high (using the F4 and F2 server types respectively). However, I'm still running into a few issues.
As you can also see I've tried to max out the bucket_size and max_concurrent_requests, so that these tasks run as fast as possible.
Problems
Every once in a while I get a DeadlineExceededError on the request handler that initiates the call. DeadlineExceededErrors: The API call taskqueue.BulkAdd() took too long to respond and was cancelled.
Every once in a while I get a chunk of similar errors within the tasks themselves (for both task types): "Process terminated because the request deadline was exceeded during a loading request". (Note that this isn't listed as a DeadlineExceededError.) The logs show these tasks took up the entire 600 seconds allowed. They end up getting rescheduled, and when they re-run, they only take the expected 0.5 to 5 seconds. I've tried using AppStats to gain more insight into what's going on, but these calls never get recorded, as they get killed before AppStats is able to save.
With users updating their score as frequently as every two minutes, the update-leaderboard queue starts to fall behind somewhere around 10K CCU. I'd ideally like to be prepared for at least 100K CCU. (By CCU I'm meaning actual users playing our game, not number of concurrent requests, which is only about 500 front-end api requests/second at 25K users. - I use locust.io to load test)
Potential Optimizations / Questions
My first thought is that the first two issues might stem from having only a single task queue for each of the task types. Maybe this is happening because the underlying Bigtable is splitting during these calls? (See this article, specifically "Queue Sharding for Stable Performance".)
So maybe I should shard each queue into 10 different queues. I'd think problem #3 would also benefit from this queue sharding. So...
1. Any idea as to the underlying causes of problems #1 and #2? Would sharding actually help eliminate these errors?
2. If I do queue sharding, could I keep all the queues on their same respective module, and rely on its autoscaling to meet the demand? Or would I be better off with module per shard?
3. Any way to dynamically scale the sharding?
My next thought is to try to reduce the calls to update-leaderboard tasks: something where not every completed game translates directly into a leaderboard update. But I'd need something where, if the user only plays one game, it's guaranteed to update the objects eventually. Any suggestions on implementing this reduction?
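To illustrate the kind of reduction I have in mind (a sketch only; the handler URL and window length are placeholders), one option would be to coalesce updates into a single named task per user per short window, relying on the task queue's name de-duplication:

import time
from google.appengine.api import taskqueue

WINDOW_SECONDS = 120  # assumed coalescing window

def schedule_leaderboard_update(user_id):
    # Name the task after the end of the current window and delay it until
    # then, so every game finished inside the window folds into one task.
    now = int(time.time())
    boundary = (now // WINDOW_SECONDS + 1) * WINDOW_SECONDS
    try:
        taskqueue.add(queue_name='update-leaderboard',
                      url='/tasks/update_leaderboard',  # assumed handler URL
                      params={'user_id': user_id},
                      name='lb-%s-%d' % (user_id, boundary),
                      countdown=boundary - now)
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # An update covering this window is already queued (or just ran);
        # the next game will schedule the next window's task.
        pass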
Finally, all of the modules' auto scaling parameters and the queue's parameters were set arbitrarily, trying to err on the side of maxing these out. Any advice on setting these appropriately so that I'm not spending any more resources than I need?

How to import/sync data to App Engine datastore without excessive datastore reads or timeouts

I am writing an application that uses a remote API serving up fairly static data (though it can still update several times a day). The problem is that the API is quite slow, and I'd much rather import that data into my own datastore anyway, so that I can actually query the data on my end as well.
The problem is that the results contain ~700 records that need to be sync'd every 5 hours or so. This involves adding new records, updating old records and deleting stale ones.
I have a simple solution that works -- but it's slow as molasses, and uses 30,000 datastore read operations before it times out (after about 500 records).
The worst part about this is that the 700 records are for a single client, and I was doing it as a test. In reality, I would want to do the same thing for hundreds or thousands of clients with a similar number of records... you can see how that is not going to scale.
Here is my entity class definition:
class Group(ndb.Model):
    groupid = ndb.StringProperty(required=True)
    name = ndb.StringProperty(required=True)
    date_created = ndb.DateTimeProperty(required=True, auto_now_add=True)
    last_updated = ndb.DateTimeProperty(required=True, auto_now=True)
Here is my sync code (Python):
currentTime = datetime.now()
groups = get_list_of_groups_from_api(clientid)  # [{'groupname':'Group Name','id':'12341235'}, ...]

for group in groups:
    groupid = group["id"]
    groupObj = Group.get_or_insert(groupid, groupid=group["id"], name=group["name"])
    groupObj.put()

staleGroups = Group.query(Group.last_updated < currentTime)
for staleGroup in staleGroups:
    staleGroup.key.delete()
I can't tell you why you are getting 30,000 read operations.
You should start by running appstats and profiling this code, to see where the datastore operations are being performed.
That being said I can see some real inefficiencies in your code.
For instance, your stale-group deletion code is horribly inefficient.
You should be doing a keys_only query, and then doing batch deletes.
What you are doing is really slow with lots of latency for each delete() in the loop.
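For example (a sketch reusing the names from the question's code):

from google.appengine.ext import ndb

# One keys-only query plus one batched delete RPC, instead of N delete() calls
stale_keys = Group.query(Group.last_updated < currentTime).fetch(keys_only=True)
ndb.delete_multi(stale_keys)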
Also, get_or_insert uses a transaction (and if the group didn't exist, a put has already been done, so your second put() is redundant); if you don't need transactions, you will find things run faster. The fact that you are not storing any additional data means you could just blind-write the groups (no initial get/read), unless you want to preserve date_created.
Other ways of making this faster would be to do batch gets/puts on the list of keys, and then a batch put() for all the entities that didn't exist. Again, this would be much faster than iterating over each key.
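A sketch of that batch approach, assuming the group id is used as the key name (which is what get_or_insert does in the question's code):

from google.appengine.ext import ndb

keys = [ndb.Key(Group, g["id"]) for g in groups]
existing = ndb.get_multi(keys)  # one batched read instead of one get per group

to_put = []
for g, entity in zip(groups, existing):
    if entity is None:
        entity = Group(id=g["id"], groupid=g["id"], name=g["name"])
    else:
        entity.name = g["name"]  # last_updated is refreshed by auto_now on put
    to_put.append(entity)

ndb.put_multi(to_put)  # one batched write for everything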
In addition, you should use a TaskQueue to run this set of code; you then have a 10-minute processing window.
After that further scaling can be achieved by splitting the process into two tasks. The first creates/updates the group entities. Once that completes you start the task that deletes stale groups - passing the datetime as an argument to the next task.
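A sketch of that chaining with the deferred library (function names are made up for illustration):

from google.appengine.ext import deferred, ndb

def sync_groups(clientid, sync_started_at):
    groups = get_list_of_groups_from_api(clientid)
    # ... batch create/update the Group entities as sketched above ...
    # then hand the timestamp to the cleanup task
    deferred.defer(delete_stale_groups, sync_started_at)

def delete_stale_groups(sync_started_at):
    stale_keys = Group.query(Group.last_updated < sync_started_at).fetch(keys_only=True)
    ndb.delete_multi(stale_keys)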
If you have even more entities than can be processed in this simple model then start looking at MapReduce.
But for starters just concentrate on making the job you are currently running more efficient.

ndb.query.count() failed with 60s query deadline on large entities

For 100k+ entities in the Google datastore, ndb.query().count() gets cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() or fetch_page() return a cursor; count() doesn't.
How can I count large entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which we can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like:
from google.appengine.ext import deferred
# Somewhere in your request:
deferred.defer(count_large_query, ndb.query())
While I'm still not sure whether count() is going to return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        ndbs, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor,
                                              keys_only=True)
        total += len(ndbs)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check that the Total entities: ## line shows up in your console.
As Guido mentioned in the comment, this is not going to scale either:
"This still doesn't scale (though it may postpone the problem). A task has a 10 minute deadline instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work)."
So you might want to take a look at the best practices for writing scalable applications, and especially at sharded counters.
This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.
It would be nice if count() supported cursors somehow so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors, you could continue across all 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you can greatly reduce the waste and hopefully increase the speed of the RPC calls, e.g., it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
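For what it's worth, here is a rough sketch of that "pass the baton" pattern applied to the cursor loop from the answer above (MyModel is a placeholder; this still pays the keys-only fetch cost described in this paragraph):

import logging
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import deferred

BATCH = 1000

def count_baton(urlsafe_cursor=None, total=0):
    # MyModel is a placeholder for whatever kind you are counting.
    cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
    keys, next_cursor, more = MyModel.query().fetch_page(
        BATCH, start_cursor=cursor, keys_only=True)
    total += len(keys)
    if more:
        # Hand the baton to a fresh task before this one hits its deadline.
        deferred.defer(count_baton, next_cursor.urlsafe(), total)
    else:
        logging.info('Final count: %d', total)

# kick it off from a request or cron handler:
# deferred.defer(count_baton)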
Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).
It is better to use Google App Engine backends.
Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and run indefinitely.
Please take a look at the document here: https://developers.google.com/appengine/docs/java/backends/overview

How to schedule a periodic Celery task per Django model instance?

I have a bunch of Feed objects in my database, and I'm trying to get each Feed to be updated every hour. My issue here is that I need to make sure there aren't any duplicate updates -- it needs to happen no more than once an hour, but I also don't want feeds waiting two hours for an update. (It's okay if it happens every hour +/- a few minutes, but twice in a few minutes is bad.)
I'm using Django and Celery with Amazon SQS as a broker. I have the feed update code set up as a Celery task, but I'm failing to find a way to prevent duplicates while remaining compatible with Celery running on multiple nodes.
My current solution is to add a last_update_scheduled attribute to the Feed model and run the following task every 5 minutes (pseudo-code):
from datetime import datetime, timedelta
from django.db.models import Q

now = datetime.now()
threshold = now - timedelta(seconds=3600)

for f in Feed.objects.filter(Q(last_update_scheduled__lt=threshold) |
                             Q(last_update_scheduled=None)):
    updateFeed.delay(f)
    f.last_update_scheduled = now
    f.save()
This is susceptible to a number of synchronization issues. For example, if my task queues get backed up, this task could run twice at the same time, causing duplicate updates. I've seen some solutions for this (like Celery's recipe and an adaptation on Stack Overflow), but the memcached solution isn't reliable, e.g. duplicates could happen when restarting memcached or if it happens to run out of memory and purge old data. Not to mention I'd hate to have to add memcached to my production configuration just for a simple lock.
In a perfect world, I'd like to be able to say:
@modelTask(Feed, run_every=3600)
def updateFeed(feed):
    # do something expensive
But so far my imagination fails me on how to implement that decorator.
To be clear, the Celery recipe is not using memcached per se, but rather Django's caching middleware. There are a number of other caching methods that would suit your needs without the downside of memcached. See the Django caching documentation for details.
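A minimal sketch of that cache-lock idea using Django's cache framework (the lock-id scheme is an assumption; any configured cache backend works):

from django.core.cache import cache

LOCK_EXPIRE = 60 * 60  # one hour, matching the desired update interval

def update_feed_once(feed):
    lock_id = 'feed-update-lock-%s' % feed.pk
    # cache.add is atomic: it only sets the key if it does not already exist,
    # so at most one worker per hour gets past this check for a given feed.
    if not cache.add(lock_id, 'locked', LOCK_EXPIRE):
        return
    # ... do the expensive feed update here ...

The Celery task body would then just call update_feed_once(feed).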

Google App Engine - design considerations about cron tasks

I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), so that whenever you update a record, it checks whether it's been more than an hour since the last snapshot and, if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
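A sketch of that idea with the old db API (model and property names are assumptions; it also assumes UserRank entities are created with a key_name such as the user id):

import datetime
from google.appengine.ext import db

SNAPSHOT_INTERVAL = datetime.timedelta(hours=1)

class UserRankSnapshot(db.Model):
    user_id = db.StringProperty()
    score = db.IntegerProperty()
    taken_at = db.DateTimeProperty()

class UserRank(db.Model):
    # assumed to be created with key_name=<user id>
    score = db.IntegerProperty()
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if (self.last_snapshot is None
                or now - self.last_snapshot >= SNAPSHOT_INTERVAL):
            # Key name derived from the hour: concurrent updates in the same
            # hour write to the same snapshot key and harmlessly overwrite it.
            UserRankSnapshot(
                key_name='%s:%s' % (self.key().name(), now.strftime('%Y%m%d%H')),
                user_id=self.key().name(),
                score=self.score,
                taken_at=now).put()
            self.last_snapshot = now
        return super(UserRank, self).put(**kwargs)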
Have you considered using the remote API instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful, and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of cron jobs and the looping URL fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely; make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
