I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
from google.appengine.ext import db, deferred

def loop_assets(cursor=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)
        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()
    except db.Timeout:
        cursor = assets.cursor()
        # version and dont_retry are defined elsewhere in my code
        deferred.defer(loop_assets, cursor=cursor, _countdown=3,
                       _target=version, _retry_options=dont_retry)
        return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a coincidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it can't be the same issue.
You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
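For the use case in the question, the mapper function might look something like this (a sketch modeled on the library's canonical datastore example; it would be registered against the Asset kind in mapreduce.yaml):

from mapreduce import operation as op

def mark_special(asset):
    # invoked once per Asset entity by the mapreduce framework,
    # which handles batching, cursors and retries for you
    if asset.is_special():
        asset.yay = True
        yield op.db.Put(asset)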
When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that in the case that the cursor 'expires', the query can restart w/o a cursor where it left off. Effectively change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.
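Putting the two together, the loop ends up looking roughly like this (a sketch: it assumes processing is idempotent, since restarting from the size filter can re-process, or skip entities that tie on size at the boundary):

def loop_assets(cursor=None, last_seen_size=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)
        elif last_seen_size is not None:
            # no usable cursor: resume just below the last size we processed
            assets.filter('size <', last_seen_size)
        for asset in assets.run():
            last_seen_size = asset.size
            if asset.is_special():
                asset.yay = True
                asset.put()
    except db.Timeout:
        deferred.defer(loop_assets, cursor=assets.cursor(),
                       last_seen_size=last_seen_size, _countdown=3)
    except db.BadRequestError:
        # expired cursor: restart without one, relying on the size filter
        deferred.defer(loop_assets, cursor=None,
                       last_seen_size=last_seen_size, _countdown=3)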
Related
My (CLI) app uses SQLAlchemy 1.3. One of the jobs it has to perform is to query for a large number of records (>300k), then do some calculations on those and insert new records based on the results of the processing.
The app also writes its activities into a log table, so I can see what it is currently doing during that long-running job. I have various pipelines (tasks), so there is a "pipelines" table with a 1:many relationship to a "log_messages" table.
I am using the ORM style, omitting the model classes here. I think it's not relevant but let me know if I should add more details.
So the general flow is something like this:
def perform_task():
    with session_scope() as session:
        # get a pipeline record for our log messages to link to
        pipeline = session.query(Pipelines).filter(Pipelines.name == 'some_name').one()
        # log the start of the work (log_messages is the 1:many relationship collection)
        pipeline.log_messages.append(LogMessage(text="started work"))
        # query the records we are working on (>300k)
        job_input_all = session.query(SomeModel).filter(SomeModel.is_of_interest == True).all()
        for job_input in job_input_all:
            # results is the relationship to the new records we create
            job_input.results.append(SomeOtherModel(
                something_calculated=_do_calculation(job_input, pipeline)))
        pipeline.log_messages.append(LogMessage(text="finished work"))
def _do_calculation(job_input, pipeline):
    # of course this isn't the real calculation, just illustrating that
    # "something happens here"; the real thing is complex, takes a lot of
    # time to compute, and we need to write log messages from time to time
    calculated_value = job_input.value * 1000
    if calculated_value > 100000:
        pipeline.log_messages.append(LogMessage(
            text=f'input value {job_input.value} resulted in bad output {calculated_value}'))
    return calculated_value
If I do it that way, none of the log messages will appear until the session scope ends, which commits everything. As this job takes a long time, it is important that I get the logs updated in real time so I can see what is going on. How would I do this?
If I commit after each pipeline log message is created, I will invalidate (and force to re-query) the row result objects in job_input_all, which would be bad. Even more problematic: I can't commit logs in _do_calculation() because I don't want to commit all the calculated stuff yet.
I have worked with ORMs in other languages before but I am new to SQLA (and Python for that matter) so I'm probably missing something fundamental here. Thanks for your help!
My advice would be to not write logs in this manner. If you instead wrote logs to an ELK (Elasticsearch/Logstash/Kibana) stack you could make it independent of your session and have all the nice inbuilt log-related features that Kibana gives you out of the box, in an easy to use web-GUI.
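As an illustration, the log writes could go straight from the calculation code to Logstash via the standard logging module, completely outside the SQLAlchemy session (a sketch using the third-party python-logstash package; host and port are placeholders):

import logging
import logstash  # third-party: pip install python-logstash

logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)
# ship log records to Logstash over TCP; Kibana picks them up from there
logger.addHandler(logstash.TCPLogstashHandler('localhost', 5959, version=1))

def _do_calculation(job_input, pipeline):
    calculated_value = job_input.value * 1000
    if calculated_value > 100000:
        # visible immediately, independent of any database commit
        logger.warning('input value %s resulted in bad output %s',
                       job_input.value, calculated_value)
    return calculated_value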
When my app runs, I'm very frequently getting issues around the connection pooling (one is "QueuePool limit of size 5 overflow 10 reached", another is "FATAL: remaining connection slots are reserved for non-replication superuser connections").
I have a feeling that it's due to some code not closing connections properly, or other code greedily trying to open new ones when it shouldn't, but I'm using the default SQLAlchemy settings, so I assume the pool connection defaults shouldn't be unreasonable. We are using the scoped_session(sessionmaker()) way of creating the session so multiple threads are supported.
So my main question is whether there is a tool or way to find out where the connections are going. Short of being able to see as soon as a new one is created (that is not supposed to be created), are there any obvious anti-patterns that might result in this effect?
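To illustrate the kind of visibility I'm after, I've considered instrumenting the pool with SQLAlchemy's events to record where each connection gets checked out (a sketch; attach_pool_debugging is just an illustrative helper name):

import traceback
from sqlalchemy import event

def attach_pool_debugging(engine):
    # a debugging aid, not production code
    @event.listens_for(engine, 'checkout')
    def on_checkout(dbapi_conn, conn_record, conn_proxy):
        # remember the stack at the point each connection was checked out
        conn_record.info['checkout_stack'] = traceback.format_stack()

    @event.listens_for(engine, 'checkin')
    def on_checkin(dbapi_conn, conn_record):
        conn_record.info.pop('checkout_stack', None)

Dumping the recorded stacks for connections that never get checked in would point at the code holding them.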
Pyramid is very un-opinionated and with DB connections, there seem to be two main approaches (equally supported by Pyramid it would seem). In our case, the code base when I started the job used one approach (I'll call it the "globals" approach) and we've agreed to switch to another approach that relies less on globals and more on Pythonic idioms.
About our architecture: the application comprises one repo which houses the Pyramid project and then sources a number of other git modules, each of which had its own connection setup. The "globals" way connects to the database in a very non-ORM fashion, e.g.:
(in each repo's __init__ file)

def load_database():
    global tables
    tables['table_name'] = Table(
        'table_name', metadata,
        Column('column_name', String),
    )
There are related globals that are frequently peppered all over the code:
def function_needing_data(field_value):
    global db, tables
    select = sqlalchemy.sql.select(
        [tables['table_name'].c.data],
        tables['table_name'].c.name == field_value)
    return db.execute(select)
Each git repo latches onto this tables variable and adds some more table definitions to it, and somehow the global tables manages to work, providing access to all of the tables.
The approach that we've moved to (although at this time, there are parts of both approaches still in the code) is via a centralised connection, binding all of the metadata to it and then querying the db in an ORM approach:
(model)

class ModelName(MetaDataBase):
    __tablename__ = "models_table_name"
    ... (field values)

(function requiring data)

from models.db import DBSession
from models.model_name import ModelName

def function_needing_data(field_value):
    return DBSession.query(ModelName).filter(
        ModelName.field_value == field_value).all()
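(For reference, DBSession in models.db is set up roughly the way the standard Pyramid alchemy scaffold does it; this is a sketch, not our exact code:)

from sqlalchemy.orm import scoped_session, sessionmaker
from zope.sqlalchemy import ZopeTransactionExtension

# one thread-local session for the whole app, hooked into the transaction
# manager so it can be committed or aborted per request
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))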
We've largely moved the code over to the latter approach which feels right, but perhaps I'm mistaken in my intentions. I don't know if there is anything inherently good or bad in either approach but could this (one of the approaches) be part of the problem so we keep running out of connections? Is there a telltale sign that I should look out for?
It appears that Pyramid functions best (in terms of handling the connection pool) when you use the Pyramid transaction manager (pyramid_tm). This excellent article by Jon Rosebaugh provides some helpful insight into both how Pyramid apps typically set up their database connections and how they should set them up.
In my case, it was necessary to include the pyramid_tm package and then remove a few occurrences where we were manually committing session changes since pyramid_tm will automatically commit changes if it doesn't see a reason not to.
[Update]
I continued to have connection pooling issues, although far fewer of them. After a lot of debugging, I found that the Pyramid transaction manager (if you're using it correctly) should not be the issue at all. The remaining connection pooling issues I had were due to scripts that ran via cron jobs. A script will release its connections when it finishes, but bad code design may result in situations where the same script is started while the previous instance is still running (causing both to run slower, slow enough to have both running while a third instance of the script starts, and so on).
This is a more language- and database-agnostic error since it stems from poor job-scripting design, but it's worth keeping in mind. In my case, the script had an "&" at the end so that each instance started as a background process, waited 10 seconds, then spawned another, rather than making sure the first job started AND completed, then waiting 10 seconds before starting another.
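One simple guard against overlapping cron runs is an exclusive lock file (a sketch; the path and job function are illustrative):

import fcntl
import sys

# open (or create) a well-known lock file for this job
lock_file = open('/tmp/my_cron_job.lock', 'w')
try:
    # take an exclusive, non-blocking lock; this fails if another
    # instance of the script already holds it
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit('previous run still in progress, exiting')

run_the_job()  # placeholder for the actual work
# the lock is released automatically when the process exits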
Hope this helps when debugging this very frustrating and thorny issue.
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.

On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
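If you do need multiple threads writing, the safest pattern is one connection per thread, created in the thread that uses it (a minimal sketch; the database path is a placeholder):

import sqlite3
import threading

_local = threading.local()

def get_conn(db_path='app.db'):
    # lazily open one connection per thread, inside that thread
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn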
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)". Perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues. But your description exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
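My torture test boiled down to something like this (a simplified sketch, not the actual harness; the table and row counts are arbitrary):

import sqlite3
from multiprocessing import Pool

DB_PATH = 'torture_test.db'  # arbitrary test file

def worker(n):
    # each worker process opens its OWN connection to the shared file
    conn = sqlite3.connect(DB_PATH, timeout=30)  # wait out lock contention
    with conn:  # commits on success, rolls back on error
        conn.execute('INSERT INTO results (worker, value) VALUES (?, ?)',
                     (n, n * 2))  # dummy data instead of a web fetch
    conn.close()

if __name__ == '__main__':
    setup = sqlite3.connect(DB_PATH)
    setup.execute('CREATE TABLE IF NOT EXISTS results (worker INT, value INT)')
    setup.commit()
    setup.close()

    pool = Pool(8)
    pool.map(worker, range(1000))
    pool.close()
    pool.join()

    check = sqlite3.connect(DB_PATH)
    print(check.execute('SELECT COUNT(*) FROM results').fetchone()[0])  # expect 1000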
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLite, then I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
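To make the single-writer idea concrete, here is a much-simplified sketch using a multiprocessing.Queue instead of asynchat (the items table is assumed to exist already):

import sqlite3
from multiprocessing import Process, Queue

def db_server(q, db_path):
    # the ONLY process that touches the database file; it executes
    # work items in strict arrival order, so writers never collide
    conn = sqlite3.connect(db_path)
    while True:
        item = q.get()
        if item is None:  # sentinel: shut down cleanly
            break
        sql, params = item
        with conn:  # one transaction per work item
            conn.execute(sql, params)
    conn.close()

if __name__ == '__main__':
    q = Queue()
    server = Process(target=db_server, args=(q, 'app.db'))
    server.start()
    # any number of producer processes can enqueue statements
    q.put(('INSERT INTO items (name) VALUES (?)', ('example',)))
    q.put(None)
    server.join()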
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
First off, this is my first post on Stack Overflow, so please forgive any newbish missteps. If I can be clearer in terms of how I frame my question, please let me know.
I'm running a large application on Google App Engine, and have been adding new features that are forcing me to modify old data classes and add new ones. In order to clean our database and update old entries, I've been trying to write a script that can iterate through instances of a class, make changes, and then re-save them. The problem is that Google App Engine times out when you make calls to the server that take longer than a few seconds.
I've been struggling with this problem for several weeks. The best solution that I've found is here: http://code.google.com/p/rietveld/source/browse/trunk/update_entities.py?spec=svn427&r=427
I created a version of that code for my own website, which you can see here:
def schema_migration(self, target, batch_size=1000):
    last_key = None
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail,
             "Payment": Payment, "Promotion": Promotion}
    while True:
        q = calls[target].all()
        if last_key:
            q.filter('__key__ >', last_key)
        q.order('__key__')
        this_batch_size = batch_size
        while True:
            try:
                batch = q.fetch(this_batch_size)
                break
            except (db.Timeout, DeadlineExceededError):
                logging.warn("Query timed out, retrying")
                if this_batch_size == 1:
                    logging.critical("Unable to update entities, aborting")
                    return
                this_batch_size //= 2
        if not batch:
            break
        keys = None
        while not keys:
            try:
                keys = db.put(batch)
            except db.Timeout:
                logging.warn("Put timed out, retrying")
        last_key = keys[-1]
        print "Updated %d records" % (len(keys),)
Strangely, the code works perfectly for classes with between 100 and 1,000 instances, and the script often takes around 10 seconds. But when I try to run the code for classes in our database with more like 100K instances, the script runs for 30 seconds, and then I receive this:
"Error: Server Error
The server encountered an error and could not complete your request.
If the problem persists, please report your problem and mention this error message and the query that caused it."
Any idea why GAE is timing out after exactly thirty seconds? What can I do to get around this problem?
Thank you!
Keller
You are hitting the second DeadlineExceededError by the sound of it. AppEngine requests can only run for 30 seconds each. When DeadlineExceededError is raised, it's your job to stop processing and tidy up as you are running out of time; the next time it is raised, you cannot catch it.
You should look at using the Mapper API to split your migration into batches and run each batch using the Task Queue.
The start of your solution will be to migrate to using GAE's Task Queues. This feature will allow you to queue some more work to happen at a later time.
That won't actually solve the problem immediately, because even task queues are limited to short timeslices. However, you can unroll your loop to process a handful of rows in your database at a time. After completing each batch, it can check to see how long it has been running, and if it's been long enough, it can start a new task in the queue to continue where the current task will leave off.
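Roughly, that chained-task pattern looks like this (a sketch: the handler URL, model name, and batch size are placeholders):

from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

BATCH_SIZE = 100  # placeholder; tune it to fit well inside the deadline

class MigrateHandler(webapp.RequestHandler):
    def post(self):
        q = MyModel.all()  # MyModel stands in for your kind
        cursor = self.request.get('cursor')
        if cursor:
            q.with_cursor(cursor)
        batch = q.fetch(BATCH_SIZE)
        if not batch:
            return  # migration finished
        for entity in batch:
            entity.migrated = True  # whatever the migration needs to do
        db.put(batch)
        # chain a fresh task (with a fresh deadline) for the next batch
        taskqueue.add(url='/tasks/migrate', params={'cursor': q.cursor()})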
An alternative solution is to not migrate the data. Change the implementing logic so that each entity knows whether or not it has been migrated. Newly created entities, or old entities that get updated, will take the new format. Since GAE doesn't require that entities have all the same fields, you can do this easily, where on a relational database, that wouldn't be practical.
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: Add a 'last snapshot' field, and subclass the put() function of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks if it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
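A sketch of that idea (the model and property names are illustrative, and it assumes the entity already has a key, e.g. it was created with a key_name):

import datetime
from google.appengine.ext import db

class UserScoreSnapshot(db.Model):
    score = db.IntegerProperty()
    taken = db.DateTimeProperty(auto_now_add=True)

class UserScore(db.Model):
    score = db.IntegerProperty(default=0)
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if (self.last_snapshot is None or
                now - self.last_snapshot > datetime.timedelta(hours=1)):
            # key name derived from the hour: concurrent updates write to
            # the same snapshot key, so one harmlessly overwrites the other
            UserScoreSnapshot(
                key_name='%s-%s' % (self.key().name(), now.strftime('%Y%m%d%H')),
                score=self.score).put()
            self.last_snapshot = now
        return super(UserScore, self).put(**kwargs)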
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process; you catch the timeout error and call the process again, masked as a second URL. You have to ping between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since this would put you over your quotas pretty quickly if it never ended.