appengine-mapreduce on Datastore: memory problems - python

I am currently working with two Datastore entities, Filing and Document. A Filing can have multiple Documents:
Filing - 8 million entities
Document - 35 million entities; they can be quite big, but are always under 1 MB.
Currently our users can look up Filings, but while adding the data I noticed a problem with some of the Document entities: the KeyProperty pointing to the Filing is missing (problems with the parser).
Since my documents have an individual ID that is of the format FilingID_documentNumber I decided to use appengine-mapreduce to add the missing KeyProperty pointing to the Filing so I can get all Documents for a given Filing.
So I created the following MapReduce job:
@ndb.toplevel
def map_callback(ed):
    gc.collect()
    try:
        if ed.filing is None:
            ed.filing = ndb.Key("Filing", ed.key.id().split("_")[0])
            yield op.db.Put(ed)
            yield op.counters.Increment("n_documents_fixed")
    except:
        yield op.counters.Increment("saving_error")
class FixDocs(base_handler.PipelineBase):
    def run(self, *args, **kwargs):
        """ run """
        mapper_params = {
            "input_reader": {
                "entity_kind": "datamodels.Document",
                "batch_size": 10,  # default is 50
            }
        }
        yield mapreduce_pipeline.MapperPipeline(
            "Fix Documents",
            handler_spec="search.mappers.fix_documents.map_callback",
            input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
            params=mapper_params,
            shards=128)
My problem is that I am currently not able to run this mapper, as I am running into a lot of memory errors. In the logs I notice a lot of the shards getting the following error:
Exceeded soft private memory limit of 128 MB with 154 MB after servicing 2 requests total.
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
I tried:
to add @ndb.toplevel on the mapper
to reduce the batch_size, as the default value is 50 and I thought that maybe the documents could get considerably big (one of their fields is a BlobProperty(compressed=True, indexed=False))
But neither modification seemed to improve the execution of the job.
Does anyone have an idea of how this could be improved? Also, I read that Dataflow might be suited for this; does anyone have experience using Dataflow to update Datastore?

The error message indicates that your app uses either an F1 (the default for automatic scaling) or a B1 instance class, both of which have a 128 MB memory limit.
One thing you could try would be to configure an instance class with more memory (which also happens to be faster) in your app.yaml file. See also the instance_class row in the Syntax table.
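For example, the relevant app.yaml change might be just the line below (F2 is only an illustrative choice with a 256 MB limit; pick whatever class fits your workload and budget):
instance_class: F2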
Side note: when I bumped my instance class higher for the same reason, I also noticed that the gc.collect() calls I had in my code started to be visibly effective. I'm not entirely sure why; I suspect the faster instance and the higher memory limit gave them enough time to kick in before the instance was killed. This should help as well.

Related

Avoiding or handling "BadRequestError: The requested query has expired."?

I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
def loop_assets(cursor=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)
        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()
    except db.Timeout:
        cursor = assets.cursor()
        deferred.defer(loop_assets, cursor=cursor, _countdown=3, _target=version, _retry_options=dont_retry)
        return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a coincidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it can't be the same issue.
You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data
in a parallel and distributed fashion. It is useful for large,
long-running jobs that cannot be handled within the scope of a single
request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that if the cursor 'expires', the query can restart without a cursor where it left off. Effectively, change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.
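A rough sketch of that restartable version, building on the loop from the question (the exception handling and the last_seen_size bookkeeping here are illustrative, not the exact code I ran):
def loop_assets(cursor=None, last_seen_size=None):
    assets = models.Asset.all().order('-size')
    if last_seen_size is not None:
        assets.filter('size <', last_seen_size)
    if cursor:
        assets.with_cursor(cursor)
    try:
        for asset in assets.run():
            last_seen_size = asset.size
            if asset.is_special():
                asset.yay = True
                asset.put()  # idempotent, so reprocessing an asset is harmless
    except db.Timeout:
        # Normal case: resume from the cursor, exactly as before.
        deferred.defer(loop_assets, cursor=assets.cursor(), last_seen_size=last_seen_size,
                       _countdown=3, _target=version, _retry_options=dont_retry)
    except db.BadRequestError:
        # Cursor expired: restart without a cursor, below the last size we saw.
        deferred.defer(loop_assets, last_seen_size=last_seen_size,
                       _countdown=3, _target=version, _retry_options=dont_retry)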

Chaining or chording very large job groups/tasks with Celery & Redis

I'm working on a project to parallelize some heavy simulation jobs. Each run takes about two minutes, takes 100% of the available CPU power, and generates over 100 MB of data. In order to execute the next step of the simulation, those results need to be combined into one huge result.
Note that this will be run on performant systems (currently testing on a machine with 16 GB ram and 12 cores, but will probably upgrade to bigger HW)
I can use a celery job group to easily dispatch about 10 of these jobs, and then chain that into the concatenation step and the next simulation. (Essentially a Celery chord) However, I need to be able to run at least 20 on this machine, and eventually 40 on a beefier machine. It seems that Redis doesn't allow for large enough objects on the result backend for me to do anything more than 13. I can't find any way to change this behavior.
I am currently doing the following, and it works fine:
test_a_group = celery.group(test_a(x) for x in ['foo', 'bar'])
test_a_result = test_a_group.apply_async(add_to_parent=False)
return test_b(test_a_result.get())
What I would rather do:
return chord(test_a_group, test_b())
The second one works for small datasets, but not large ones. It gives me a non-verbose 'Celery ChordError 104: connection refused' with large data.
Test B returns very small data, essentially a pass/fail, and I am only passing the group result into B, so it should work, except that I think the entire group is being appended to the result of B, in the form of a parent, making it too big. I can't figure out how to prevent this from happening.
The first one works great, and I would be okay, except that it complains, saying:
[2015-01-04 11:46:58,841: WARNING/Worker-6] /home/cmaclachlan/uriel-venv/local/lib/python2.7/site-packages/celery/result.py:45:
RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
In Celery 3.2 this will result in an exception being
raised instead of just being a warning.
warnings.warn(RuntimeWarning(E_WOULDBLOCK))
What the link essentially suggests is what I want to do, but can't.
I think I read somewhere that Redis has a limit of 500 mb on size of data pushed to it.
Any advice on this hairiest of problems?
Celery isn't really designed to address this problem directly. Generally speaking, you want to keep the inputs/outputs of tasks small.
Every input or output has to be serialized (pickle by default) and transmitted through the broker, such as RabbitMQ or Redis. Since the broker needs to queue the messages when there are no clients available to handle them, you end up potentially paying the hit of writing/reading the data to disk anyway (at least for RabbitMQ).
Typically, people store large data outside of celery and just access it within the tasks by URI, ID, or something else unique. Common solutions are to use a shared network file system (as already mentioned), a database, memcached, or cloud storage like S3.
You definitely should not call .get() within a task because it can lead to deadlock.
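As a hedged sketch of that pass-by-reference approach (the task names, the shared directory, and the do_heavy_simulation/merge_result_files helpers are all made up for illustration): each task writes its large output to shared storage and returns only a short path, so the chord callback receives a list of paths instead of hundreds of megabytes passing through the Redis backend.
from celery import Celery, chord

app = Celery('sims', broker='redis://localhost', backend='redis://localhost')

@app.task
def run_simulation(params, out_dir='/mnt/shared/results'):
    result = do_heavy_simulation(params)          # hypothetical simulation call
    path = '%s/%s.dat' % (out_dir, params)
    with open(path, 'wb') as f:
        f.write(result)
    return path                                   # only the small path goes through the backend

@app.task
def combine(paths):
    merge_result_files(paths)                     # hypothetical concatenation step, reads from disk
    return 'ok'                                   # keep the chord result tiny

def run_batch(param_list):
    # Run the simulations in parallel, then hand the list of result paths to combine.
    return chord(run_simulation.s(p) for p in param_list)(combine.s())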

How to import/sync data to App Engine datastore without excessive datastore reads or timeouts

I am writing an application that uses a remote API that serves up fairly static data (but it can still update several times a day). The problem is that the API is quite slow, and I'd much rather import that data into my own datastore anyway, so that I can actually query the data on my end as well.
The problem is that the results contain ~700 records that need to be sync'd every 5 hours or so. This involves adding new records, updating old records and deleting stale ones.
I have a simple solution that works -- but it's slow as molasses, and uses 30,000 datastore read operations before it times out (after about 500 records).
The worst part about this is that the 700 records are for a single client, and I was doing it as a test. In reality, I would want to do the same thing for hundreds or thousands of clients with a similar number of records... you can see how that is not going to scale.
Here is my entity class definition:
class Group(ndb.Model):
    groupid = ndb.StringProperty(required=True)
    name = ndb.StringProperty(required=True)
    date_created = ndb.DateTimeProperty(required=True, auto_now_add=True)
    last_updated = ndb.DateTimeProperty(required=True, auto_now=True)
Here is my sync code (Python):
currentTime = datetime.now()
groups = get_list_of_groups_from_api(clientid)  # [{'groupname': 'Group Name', 'id': '12341235'}, ...]

for group in groups:
    groupid = group["id"]
    groupObj = Group.get_or_insert(groupid, groupid=group["id"], name=group["name"])
    groupObj.put()

staleGroups = Group.query(Group.last_updated < currentTime)
for staleGroup in staleGroups:
    staleGroup.delete()
I can't tell you why you are getting 30,000 read operations.
You should start by running appstats and profiling this code, to see where the datastore operations are being performed.
That being said, I can see some real inefficiencies in your code.
For instance, your delete-stale-groups code is horribly inefficient.
You should be doing a keys_only query, and then doing batch deletes.
What you are doing is really slow with lots of latency for each delete() in the loop.
Also, get_or_insert uses a transaction (and if the group didn't exist, a put has already been done, yet you then do a second put()); if you don't need transactions, you will find things run faster. The fact that you are not storing any additional data means you could just blind-write the groups (so no initial get/read), unless you want to preserve date_created.
Other ways of making this faster would be to do batch gets/puts on the list of keys; then, for all the entities that didn't exist, do a batch put(). Again, this would be much faster than iterating over each key.
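A hedged sketch of both suggestions, reusing the Group model and the groups/currentTime variables from the question (note the blind batch write will reset date_created on groups that already exist):
from google.appengine.ext import ndb

# Blind batch write: build every entity with its known id and write them in one call,
# instead of get_or_insert + put per group.
ndb.put_multi([Group(id=g["id"], groupid=g["id"], name=g["name"]) for g in groups])

# Batch delete of stale groups: fetch only the keys, then delete them in one call.
stale_keys = Group.query(Group.last_updated < currentTime).fetch(keys_only=True)
ndb.delete_multi(stale_keys)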
In addition, you should use a TaskQueue to run this set of code; you then have a 10-minute processing window.
After that, further scaling can be achieved by splitting the process into two tasks: the first creates/updates the group entities, and once that completes you start the task that deletes stale groups, passing the datetime as an argument to the next task.
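A short sketch of that split using the deferred library (the task names are illustrative); the first task does the batch writes sketched above and then chains the cleanup task, handing it the sync start time:
from google.appengine.ext import deferred

def update_groups_task(clientid, sync_start):
    groups = get_list_of_groups_from_api(clientid)
    ndb.put_multi([Group(id=g["id"], groupid=g["id"], name=g["name"]) for g in groups])
    # Only start deleting once every group has been written.
    deferred.defer(delete_stale_groups_task, sync_start)

def delete_stale_groups_task(sync_start):
    stale_keys = Group.query(Group.last_updated < sync_start).fetch(keys_only=True)
    ndb.delete_multi(stale_keys)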
If you have even more entities than can be processed in this simple model then start looking at MapReduce.
But for starters just concentrate on making the job you are currently running more efficient.

How to minimize datastore writes initiated by the mapreduce library?

I've got 3 parts to this question:
I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To do the timeout, I have a cron job that runs once a minute to clear out the expired objects.
Most of the time right now, I don't have any active objects. In this case, the mapreduce handler checks the entity it gets, and does nothing if it's not active, no writes. However, my free datastore write quota is running out from the mapreduce calls after about 7 hours. According to my rough estimate, it looks like just running mapreduce causes ~ 120 writes/call. (Rough math, 60 calls/hr * 7 hr = 420 calls, 50k ops limit / 420 calls ~ 120 writes/call)
Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?
To get around it, I'm checking the datastore before I kick off the mapreduce:
def cronhandler():
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        shards = (count / 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()
Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking it might be possible with a better custom InputReader.
Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting shards by the expected number of objects I need to write appropriately?
What if you kept your objects in memcache instead of the datastore? My only worry is whether memcache is consistent across all instances running a given application, but if it is, the problem has a very neat solution.
This doesn't exactly answer your question, but could you reduce the frequency of the cron job?
Instead of deleting models as soon as they become invalid, simply remove them from the queries that your Users see.
For example:
import datetime

now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)

q = model.all()
q.filter('created_at >=', five_minutes_ago)
Or, if you don't want to use an inequality filter, you could use == based on five-minute blocks (see the sketch below).
Then, you run your cron every hour or so to clean out the inactive models.
The downside to this approach is that the entities would still be returned by a fetch by key, in which case you would need to verify that they were still valid before returning them to the user.
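A hedged sketch of the == / five-minute-block variant mentioned above (the block property and its name are made up for illustration): store the creation time rounded down to a five-minute boundary, query the current and previous blocks with equality filters, and drop anything already expired in code.
import datetime

def five_minute_block(dt):
    # Round a datetime down to its five-minute boundary, e.g. 12:03 -> 12:00.
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0, microsecond=0)

# At creation time, store the block alongside created_at:
#   obj.block = five_minute_block(obj.created_at)

now = datetime.datetime.now()
current_block = five_minute_block(now)
cutoff = now - datetime.timedelta(minutes=5)

candidates = []
for block in (current_block, current_block - datetime.timedelta(minutes=5)):
    q = model.all()
    q.filter('block =', block)
    candidates.extend(q.fetch(1000))

# Anything created in the last five minutes falls in one of these two blocks,
# so a final in-memory check removes entities that have already expired.
active = [o for o in candidates if o.created_at >= cutoff]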
I'm assuming what I've done is the best way to go about doing things. It looks like the MapReduce API uses the datastore to keep track of the jobs it launches and to synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but that reduces wall-time performance as well.

Schema migration on GAE datastore

First off, this is my first post on Stack Overflow, so please forgive any newbish mis-steps. If I can be clearer in terms of how I frame my question, please let me know.
I'm running a large application on Google App Engine, and have been adding new features that are forcing me to modify old data classes and add new ones. In order to clean our database and update old entries, I've been trying to write a script that can iterate through instances of a class, make changes, and then re-save them. The problem is that Google App Engine times out when you make calls to the server that take longer than a few seconds.
I've been struggling with this problem for several weeks. The best solution that I've found is here: http://code.google.com/p/rietveld/source/browse/trunk/update_entities.py?spec=svn427&r=427
I created a version of that code for my own website, which you can see here:
def schema_migration(self, target, batch_size=1000):
    last_key = None
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail, "Payment": Payment, "Promotion": Promotion}
    while True:
        q = calls[target].all()
        if last_key:
            q.filter('__key__ >', last_key)
        q.order('__key__')
        this_batch_size = batch_size
        while True:
            try:
                batch = q.fetch(this_batch_size)
                break
            except (db.Timeout, DeadlineExceededError):
                logging.warn("Query timed out, retrying")
                if this_batch_size == 1:
                    logging.critical("Unable to update entities, aborting")
                    return
                this_batch_size //= 2
        if not batch:
            break
        keys = None
        while not keys:
            try:
                keys = db.put(batch)
            except db.Timeout:
                logging.warn("Put timed out, retrying")
        last_key = keys[-1]
        print "Updated %d records" % (len(keys),)
Strangely, the code works perfectly for classes with between 100 and 1,000 instances, and the script often takes around 10 seconds. But when I try to run the code for classes in our database with more like 100K instances, the script runs for 30 seconds, and then I receive this:
"Error: Server Error
The server encountered an error and could not complete your request.
If the problem persists, please report your problem and mention this error message and the query that caused it.""
Any idea why GAE is timing out after exactly thirty seconds? What can I do to get around this problem?
Thank you!
Keller
You are hitting the second DeadlineExceededError by the sound of it. App Engine requests can only run for 30 seconds each. When DeadlineExceededError is raised, it's your job to stop processing and tidy up as you are running out of time; the next time it is raised, you cannot catch it.
You should look at using the Mapper API to split your migration into batches and run each batch using the Task Queue.
The start of your solution will be to migrate to using GAE's Task Queues. This feature will allow you to queue some more work to happen at a later time.
That won't actually solve the problem immediately, because even task queues are limited to short timeslices. However, you can unroll your loop to process a handful of rows in your database at a time. After completing each batch, your code can check to see how long it has been running, and if it's been long enough, it can start a new task in the queue to continue where the current task will leave off.
An alternative solution is to not migrate the data. Change the application logic so that each entity knows whether or not it has been migrated. Newly created entities, or old entities that get updated, will take the new format. Since GAE doesn't require that entities of a kind all have the same fields, you can do this easily, whereas on a relational database that wouldn't be practical.
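A hedged sketch of that lazy approach for one of the kinds from the question (the schema_version property and the upgrade_affiliate helper are made up for illustration): each entity records which schema version wrote it, and it gets upgraded the first time it is loaded after the new code ships.
CURRENT_SCHEMA = 2

class Affiliate(db.Model):
    # ... existing properties ...
    schema_version = db.IntegerProperty(default=1)

def load_affiliate(key):
    affiliate = Affiliate.get(key)
    if affiliate and affiliate.schema_version < CURRENT_SCHEMA:
        upgrade_affiliate(affiliate)              # hypothetical per-version fix-up
        affiliate.schema_version = CURRENT_SCHEMA
        affiliate.put()                           # migrated lazily, no bulk job needed
    return affiliate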
