Schema migration on GAE datastore - python

First off, this is my first post on Stack Overflow, so please forgive any newbish mis-steps. If I can be clearer in terms of how I frame my question, please let me know.
I'm running a large application on Google App Engine, and have been adding new features that are forcing me to modify old data classes and add new ones. In order to clean our database and update old entries, I've been trying to write a script that can iterate through instances of a class, make changes, and then re-save them. The problem is that Google App Engine times out when you make calls to the server that take longer than a few seconds.
I've been struggling with this problem for several weeks. The best solution that I've found is here: http://code.google.com/p/rietveld/source/browse/trunk/update_entities.py?spec=svn427&r=427
I created a version of that code for my own website, which you can see here:
def schema_migration(self, target, batch_size=1000):
    last_key = None
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail,
             "Payment": Payment, "Promotion": Promotion}
    while True:
        q = calls[target].all()
        if last_key:
            q.filter('__key__ >', last_key)
        q.order('__key__')
        this_batch_size = batch_size
        while True:
            try:
                batch = q.fetch(this_batch_size)
                break
            except (db.Timeout, DeadlineExceededError):
                logging.warn("Query timed out, retrying")
                if this_batch_size == 1:
                    logging.critical("Unable to update entities, aborting")
                    return
                this_batch_size //= 2
        if not batch:
            break
        keys = None
        while not keys:
            try:
                keys = db.put(batch)
            except db.Timeout:
                logging.warn("Put timed out, retrying")
        last_key = keys[-1]
        print "Updated %d records" % (len(keys),)
Strangely, the code works perfectly for classes with between 100 and 1,000 instances, and the script often takes around 10 seconds. But when I try to run the code for classes in our database with more like 100K instances, the script runs for 30 seconds, and then I receive this:
"Error: Server Error
The server encountered an error and could not complete your request.
If the problem persists, please report your problem and mention this error message and the query that caused it.""
Any idea why GAE is timing out after exactly thirty seconds? What can I do to get around this problem?
Thank you!
Keller

You are hitting the second DeadlineExceededError, by the sound of it. App Engine requests can only run for 30 seconds each. When DeadlineExceededError is raised, it's your job to stop processing and tidy up, as you are running out of time; the next time it is raised, you cannot catch it.
You should look at using the Mapper API to split your migration into batches and run each batch using the Task Queue.

The start of your solution will be to migrate to using GAE's Task Queues. This feature will allow you to queue some more work to happen at a later time.
That won't actually solve the problem immediately, because even task queues are limited to short timeslices. However, you can unroll your loop to process a handful of rows in your database at a time. After completing each batch, the task can check how long it has been running, and if it has been running long enough, it can start a new task in the queue to continue where the current task leaves off.
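For illustration, a minimal sketch of that batch-and-requeue pattern using the deferred library (MyModel and compute_new_field are placeholder names, not from the question):

import logging
from google.appengine.ext import db, deferred

BATCH_SIZE = 100  # small enough to finish well inside the request deadline

def migrate_all(cursor=None):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)
    batch = query.fetch(BATCH_SIZE)
    if not batch:
        logging.info("Migration complete")
        return
    for entity in batch:
        entity.new_field = compute_new_field(entity)  # your per-entity migration
    db.put(batch)
    # Hand the rest of the work to a fresh task, continuing from the cursor.
    deferred.defer(migrate_all, cursor=query.cursor())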
An alternative solution is to not migrate the data at all. Change the application logic so that each entity knows whether or not it has been migrated. Newly created entities, and old entities that get updated, will take the new format. Since GAE doesn't require that entities of a kind all have the same fields, you can do this easily, whereas on a relational database it wouldn't be practical.
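A rough sketch of that lazy-migration idea, reusing the Affiliate kind from the question (the schema_version property and the constant are illustrative):

from google.appengine.ext import db

CURRENT_SCHEMA = 2

class Affiliate(db.Model):
    schema_version = db.IntegerProperty(default=1)
    # ... the rest of the properties ...

    def upgrade_if_needed(self):
        # Call this whenever the entity is read or about to be written.
        if self.schema_version < CURRENT_SCHEMA:
            # apply whatever transformation the new schema requires here
            self.schema_version = CURRENT_SCHEMA
            self.put()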

Related

appengine-mapreduce on Datastore: memory problems

I am currently working with two Datastore entities, Filing and Document. A Filing can have multiple Documents:
Filing - 8 million entities
Document - 35 million entities; they might be quite big, but always under 1 MB.
Currently our users can look up Filings, but while adding the data I noticed that there was a problem with some of the Document entities - the KeyProperty pointing to the Filing is missing (problems with the parser).
Since my documents have an individual ID that is of the format FilingID_documentNumber I decided to use appengine-mapreduce to add the missing KeyProperty pointing to the Filing so I can get all Documents for a given Filing.
So I created the following MapReduce job:
@ndb.toplevel
def map_callback(ed):
    gc.collect()
    try:
        if ed.filing is None:
            ed.filing = ndb.Key("Filing", ed.key.id().split("_")[0])
            yield op.db.Put(ed)
            yield op.counters.Increment("n_documents_fixed")
    except:
        yield op.counters.Increment("saving_error")

class FixDocs(base_handler.PipelineBase):
    def run(self, *args, **kwargs):
        """ run """
        mapper_params = {
            "input_reader": {
                "entity_kind": "datamodels.Document",
                "batch_size": 10,  # default is 50
            }
        }
        yield mapreduce_pipeline.MapperPipeline(
            "Fix Documents",
            handler_spec="search.mappers.fix_documents.map_callback",
            input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
            params=mapper_params,
            shards=128)
My problem is that I am currently not able to run this mapper as I am running in a lot of memory errors. In the logs I notice a lot of the shards getting the following error:
Exceeded soft private memory limit of 128 MB with 154 MB after servicing 2 requests total. While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
I tried:
adding @ndb.toplevel to the mapper
reducing the batch_size, as the default value is 50 and I thought that the documents could get considerably big (one of their fields is a BlobProperty(compressed=True, indexed=False))
But neither modification seemed to improve the execution of the job.
Does anyone have an idea of how this could be improved? Also, I read that Dataflow might be suited for this; does anyone have experience using Dataflow to update Datastore?
The error message indicates that your app uses either an F1 (the default for automatic scaling) or a B1 instance class, both of which have a 128 MB memory limit.
One thing you could try is configuring an instance class with more memory (which also happens to be faster) in your app.yaml file; see also the instance_class row in the Syntax table.
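For example, something along these lines in app.yaml (the exact class depends on your scaling mode and memory needs; see the docs for the full list):

instance_class: F4   # automatic scaling; use B4 (or similar) for basic/manual scaling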
Side note: when I bumped my instance class higher for the same reason, I also noticed that the gc.collect() calls I had in my code started to be visibly effective. I'm not entirely sure why; I suspect the faster instance and the higher memory limit gave them enough time to kick in before the instance was killed. This should help as well.

How to show a 'processing' or 'in progress' view while pyramid is running a process?

I've got a simple pyramid app up and running, most of the views are a fairly thin wrapper around an sqlite database, with forms thrown in to edit/add some information.
A couple of times a month a new chunk of data will need to be added to this system (by csv import). The data is saved in an SQL table (the whole process right till commit takes about 4 seconds).
Every time a new chunk of data is uploaded, this triggers a recalculation of other tables in the database. The recalculation process takes a fairly long time (about 21-50 seconds for a month's worth of data).
Currently I just let the browser/client sit there waiting for the process to finish, but I do foresee the calculation process taking more and more time as the system gets more usage. From a UI perspective, this obviously looks like a hung process.
What can I do to indicate to the user:
that the long wait is normal/expected?
how much longer they should have to wait (progress bar etc.)?
Note: I'm not asking about long-polling or websockets here, as this isn't really an interactive application and based on my basic knowledge websockets/async are overkill for my purposes.
I guess a follow-on question at this point: am I doing the wrong thing by running these processes in my view functions? I hardly ever see that being done in examples/tutorials around the web. Am I supposed to be using Celery or similar in this situation?
You're right, doing long calculations in a view function is generally frowned upon - if it's a typical website with random visitors who are able to hang a webserver thread for a minute, that's a recipe for a DoS vulnerability. But in some situations (internal website, few users, only the admin has access to the "upload csv" form) you may get away with it. In fact, I used to have maintenance scripts which ran for hours :)
The trick here is to avoid browser timeouts - at the moment your client sends the data to the server and just sits there waiting for any reply, without any idea whether its request is being processed or not. Generally, at about 60 seconds the browser (or a proxy, or the frontend webserver) may become impatient and close the connection. Your server process will then hit an error when it tries to write anything to the already closed connection.
To prevent this from happening the server needs to write something to the connection periodically, so the client sees that the server is alive and won't close the connection.
"Normal" Pyramid templates are buffered - i.e. the output is not sent to the client until the whole template to generated. Because of that you need to directly use response.app_iter / response.body_file and output some data there periodically.
As an example, you can duplicate the Todo List Application in One File example from Pyramid Cookbook and replace the new_view function with the following code (which itself has been borrowed from this question):
@view_config(route_name='new', request_method='GET', renderer='new.mako')
def new_view(request):
    return {}

@view_config(route_name='new', request_method='POST')
def iter_test(request):
    import time
    if request.POST.get('name'):
        request.db.execute(
            'insert into tasks (name, closed) values (?, ?)',
            [request.POST['name'], 0])
        request.db.commit()

    def test_iter():
        i = 0
        while True:
            i += 1
            if i == 5:
                yield str('<p>Done! Click here to see the results</p>')
                raise StopIteration
            yield str('<p>working %s...</p>' % i)
            print time.time()
            time.sleep(1)

    return Response(app_iter=test_iter())
(Of course, this solution is not too fancy UI-wise, but you said you didn't want to mess with websockets and Celery.)
So is the long-running process triggered by browser action? I.e., the user uploads the CSV, it gets processed, and the view does the processing right there? For short-ish running browser processes I've used a loading indicator via jQuery or JavaScript, basically popping up a modal animated spinner or something while a process runs, then hiding the spinner when it completes.
But if you're getting into longer and longer processes, I think you should really look at some sort of background processing that offloads the work from the UI. It doesn't have to be a message-based worker; it could be as simple as the end user uploading the file and a "to be processed" entry being set in a database. Then you could have a Pyramid script scheduled to run periodically in the background, polling the status table and processing anything it finds. You can move the file processing that is in the view to a separate method, and that can be called from the command-line script. When the processing is finished it can update the status table to indicate it is done, and that feedback could be presented back to the user somewhere, without blocking their UI the whole time.
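As a rough sketch of that approach (save_uploaded_file, run_recalculation and the import_jobs table are placeholder names):

import datetime

def upload_view(request):
    """View: record the upload as a pending job and return immediately."""
    request.db.execute(
        "insert into import_jobs (filename, status, created) values (?, ?, ?)",
        [save_uploaded_file(request), 'pending', datetime.datetime.utcnow()])
    request.db.commit()
    return {'message': 'Your file is queued for processing.'}

def process_pending_jobs(db):
    """Run periodically from a console script or cron, outside any web request."""
    rows = db.execute(
        "select id, filename from import_jobs where status = 'pending'").fetchall()
    for job_id, filename in rows:
        run_recalculation(filename)   # the slow part, moved out of the view
        db.execute("update import_jobs set status = 'done' where id = ?", [job_id])
        db.commit()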

Avoiding or handling "BadRequestError: The requested query has expired."?

I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
def loop_assets(cursor=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)
        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()
    except db.Timeout:
        cursor = assets.cursor()
        deferred.defer(loop_assets, cursor=cursor, _countdown=3,
                       _target=version, _retry_options=dont_retry)
        return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a coincidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it can't be the same issue.
You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that in the case that the cursor 'expires', the query can restart w/o a cursor where it left off. Effectively change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.
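For illustration, the task could then look roughly like this (it assumes the question's models module, and that size values are distinct enough that filtering on the last seen value doesn't skip assets):

from google.appengine.ext import db, deferred

def loop_assets(last_seen_size=None):
    assets = models.Asset.all().order('-size')
    if last_seen_size is not None:
        assets.filter('size <', last_seen_size)
    try:
        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()
            last_seen_size = asset.size
    except (db.Timeout, db.BadRequestError):
        # No cursor needed: the next task restarts the query below the last value seen.
        deferred.defer(loop_assets, last_seen_size=last_seen_size, _countdown=3)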

Task queue for deferred tasks in GAE with python

I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in Python that lets a user upload a file, which is then processed by a piece of Python code, and the resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for making the single deferred task look like this:
_ = deferred.defer(transform_function, filename, from_addr, to_addr, email)
so that the transform_function code gets the values of filename, from_addr, to_addr and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the documentation on Google App Engine that I can think of, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a URL for my task, just a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred.defer to run your next task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, which either continues where it left off or starts the next step, depending on whether the work is a discrete set of steps or one continuous chunk. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter whether it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job; then you can just keep re-running the same task until completion. On completion it marks the job as complete.
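Roughly, that pattern looks like this (JobState, run_step and the time budget are illustrative names, not a real API):

import time
from google.appengine.ext import deferred

TIME_BUDGET = 8 * 60  # seconds; stop well before the 10-minute task deadline

def process_job(job_key):
    start = time.time()
    job = JobState.get(job_key)          # datastore entity describing the work
    while not job.complete:              # run_step sets complete when the work is done
        run_step(job)                    # must be idempotent: a step may be re-run
        if time.time() - start > TIME_BUDGET:
            job.put()                    # persist progress so far
            deferred.defer(process_job, job_key)   # chain the next task
            return
    job.put()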
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrement a shared datastore counter (read the state, decrement and update, all in the same transaction) that was initialized with the number of chunks. If the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces. An alternative to deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
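A rough sketch of that counter idea (ChunkCounter and combine_all_pieces are placeholder names):

from google.appengine.ext import db, deferred

class ChunkCounter(db.Model):
    remaining = db.IntegerProperty(required=True)   # initialized to the number of chunks

def finish_chunk(job_id):
    def txn():
        counter = ChunkCounter.get_by_key_name(job_id)
        counter.remaining -= 1
        counter.put()
        return counter.remaining
    # Read, decrement and update happen atomically inside the transaction.
    if db.run_in_transaction(txn) == 0:
        deferred.defer(combine_all_pieces, job_id)   # all chunks are done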

Google App Engine - design considerations about cron tasks

I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well, no matter what your framework. A more ordinary environment will disguise this by letting you have longer-running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks whether it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
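A rough sketch of that idea with the old db API (UserRank and UserRankSnapshot are illustrative model names):

import datetime
from google.appengine.ext import db

SNAPSHOT_INTERVAL = datetime.timedelta(hours=1)

class UserRankSnapshot(db.Model):
    user_id = db.StringProperty()
    score = db.IntegerProperty()
    taken = db.DateTimeProperty()

class UserRank(db.Model):
    user_id = db.StringProperty()
    score = db.IntegerProperty()
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if self.last_snapshot is None or now - self.last_snapshot > SNAPSHOT_INTERVAL:
            # Key name derived from the hour: concurrent updates overwrite the
            # same snapshot harmlessly instead of creating duplicates.
            UserRankSnapshot(
                key_name='%s_%s' % (self.user_id, now.strftime('%Y%m%d%H')),
                user_id=self.user_id, score=self.score, taken=now).put()
            self.last_snapshot = now
        return super(UserRank, self).put(**kwargs)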
Have you considered using the remote API instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful, and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
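For reference, a rough sketch of wiring up remote_api from a local script (the exact setup varies with SDK version, and it assumes the remote_api endpoint is enabled in app.yaml; MyModel and process are placeholders):

import getpass
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    return raw_input('Email: '), getpass.getpass('Password: ')

# Point local datastore calls at the live application.
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app-id.appspot.com')

# From here on, queries and puts go over HTTP to production, with no
# 30-second request deadline on the local side.
for entity in MyModel.all().run(batch_size=100):
    process(entity)   # your batch operation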
I would use a combination of cron jobs and a looping URL fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since this would put you over your quotas pretty quickly if it never ended.
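A rough sketch of that hand-off; the re-invocation here goes through the task queue rather than a raw URL fetch, and users_remaining / snapshot_next_user are placeholder names:

from google.appengine.api import taskqueue
from google.appengine.runtime import DeadlineExceededError

def update_users(partner_url='/update_b'):
    """Handler body for /update_a; /update_b is the mirror image pointing back."""
    try:
        while users_remaining():         # make sure there is a real end state
            snapshot_next_user()         # must be resumable / idempotent
    except DeadlineExceededError:
        # Ping-pong to the partner URL so App Engine doesn't flag an accidental loop.
        taskqueue.add(url=partner_url)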
