SQLAlchemy - how can I commit some INSERTs immediately within a transaction? - python

My (CLI) app uses SQLAlchemy 1.3. One of the jobs it has to perform is to query for a large number of records (>300k), then do some calculations on those and insert new records based on the results of the processing.
The app writes its activities into a log table as well, so I can see what it is currently doing during that long-running job. I have various pipelines (tasks), so there is a "pipelines" table with a 1:many relationship to a "log_messages" table.
I am using the ORM style, omitting the model classes here. I think it's not relevant but let me know if I should add more details.
So the general flow is something like this:
def perform_task():
    with session_scope() as session:
        # get a pipeline record for our log messages to link to
        pipeline = session.query(Pipelines).filter(Pipelines.name == 'some_name').one()
        # log the start of the work
        # (append is shorthand for adding to the relationship; models omitted)
        pipeline.append(LogMessage(text="started work"))
        # query the records we are working on (>300k)
        job_input_all = session.query(SomeModel).filter(SomeModel.is_of_interest == True).all()
        for job_input in job_input_all:
            job_input.append(SomeOtherModel(something_calculated=_do_calculation(job_input, pipeline)))
        pipeline.append(LogMessage(text="finished work"))

def _do_calculation(job_input, pipeline):
    # of course this isn't the real calculation, just illustrating that "something happens here"
    # the real stuff is complex and takes a lot of time to compute
    # and we need to write log messages from time to time
    calculated_value = job_input.value * 1000
    if calculated_value > 100000:
        pipeline.append(LogMessage(text=f'input value {job_input.value} resulted in bad output {calculated_value}'))
    return calculated_value
If I do it that way, none of the log messages will appear until the session scope ends, which commits everything. As this job takes a long time, it is important that I get the logs updated in real time so I can see what is going on. How would I do this?
If I commit after each pipeline log message is created, I will invalidate (and force to re-query) the row result objects in job_input_all, which would be bad. Even more problematic: I can't commit logs in _do_calculation() because I don't want to commit all the calculated stuff yet.
I have worked with ORMs in other languages before but I am new to SQLA (and Python for that matter) so I'm probably missing something fundamental here. Thanks for your help!

My advice would be to not write logs in this manner. If you instead wrote logs to an ELK (Elasticsearch/Logstash/Kibana) stack you could make it independent of your session and have all the nice inbuilt log-related features that Kibana gives you out of the box, in an easy to use web-GUI.
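If a full ELK stack is more than you need, the same idea (log writes that do not share the ORM session's transaction) can be sketched with Python's standard logging module. This is only a minimal sketch under assumptions: SQLiteLogHandler, read_messages and the separate SQLite file are hypothetical stand-ins for whatever log store you pick, and none of them is part of SQLAlchemy.

```python
import logging
import os
import sqlite3
import tempfile


class SQLiteLogHandler(logging.Handler):
    """Hypothetical handler: writes each record via its own connection."""

    def __init__(self, path):
        super().__init__()
        # Dedicated connection, never shared with the ORM session.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS log_messages (pipeline TEXT, text TEXT)"
        )
        self._conn.commit()

    def emit(self, record):
        try:
            self._conn.execute(
                "INSERT INTO log_messages (pipeline, text) VALUES (?, ?)",
                (record.name, record.getMessage()),
            )
            # Commit per record so other connections can read it right away.
            self._conn.commit()
        except Exception:
            self.handleError(record)

    def close(self):
        self._conn.close()
        super().close()


def read_messages(path):
    """Read the log table through a separate connection, as a monitor would."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("SELECT text FROM log_messages ORDER BY rowid")
        return [row[0] for row in rows]
    finally:
        conn.close()


# Assumption: the log store is a separate file, not the application database.
log_path = os.path.join(tempfile.mkdtemp(), "pipeline_log.db")
logger = logging.getLogger("some_name")
logger.setLevel(logging.INFO)
logger.addHandler(SQLiteLogHandler(log_path))

logger.info("started work")
# ... the long-running work in the main session would happen here ...
logger.warning("input value %s resulted in bad output %s", 200, 200000)
```

Because the handler owns its connection and commits per record, delaying or rolling back the main session has no effect on what has already been logged.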

Related

What is the use case for Django's on_commit?

Reading this documentation https://docs.djangoproject.com/en/4.0/topics/db/transactions/#django.db.transaction.on_commit
This is the use case for on_commit
with transaction.atomic():  # Outer atomic, start a new transaction
    transaction.on_commit(foo)
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        transaction.on_commit(bar)
        # Do more things...
# foo() and then bar() will be called when leaving the outermost block
But why not just write the code like normal without on_commit hooks? Like this:
with transaction.atomic():  # Outer atomic, start a new transaction
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        # Do more things...
foo()
bar()
# foo() and then bar() will be called when leaving the outermost block
It's easier to read since it doesn't require more knowledge of the Django APIs and the statements are put in the order of when they are executed. It's easier to test since you don't have to use any special test classes for Django.
So what is the use-case for the on_commit hook?
The example code given in the Django docs is transaction.on_commit(lambda: some_celery_task.delay('arg1')) and it's probably specifically because this comes up a lot with celery tasks.
Imagine if you do the following within a transaction:
my_object = MyObject.objects.create()
some_celery_task.delay(my_object.pk)
Then in your celery task you try doing this:
@app.task
def some_celery_task(object_pk):
    my_object = MyObject.objects.get(pk=object_pk)
This may work a lot of the time, but randomly you'll get errors where it's not able to find the object (depending on how fast the work task is run because it's a race condition). This is because you created a MyObject record within a transaction, but it isn't actually available in the database until a COMMIT is run. Celery has no access to that open transaction, so it needs to be run after the COMMIT. There's also the very real possibility that something later on causes a ROLLBACK and that celery task should never actually be called.
So... You need to do:
my_object = MyObject.objects.create()
transaction.on_commit(lambda: some_celery_task.delay(my_object.pk))
Now, the celery task won't be called until the MyObject has actually been saved to the database after the COMMIT was called.
I should note, though, that this is primarily a concern when you aren't using AUTOCOMMIT (which is actually the default). If you're in AUTOCOMMIT mode then you can be certain that a commit has finished as part of a .create() or .save(). However, if your code base has any possibility of being called within a @transaction.atomic block then it's no longer AUTOCOMMIT and you're back to needing .on_commit(), so it's best/safest to always use it.
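To make the semantics concrete, here is a toy model in plain Python. It is not Django's implementation; ToyTransaction and its methods are hypothetical names that only mimic the behaviour described above: hooks wait for the outermost block to finish, hooks from a block that fails are dropped, and with no open transaction a hook runs immediately.

```python
from contextlib import contextmanager


class ToyTransaction:
    """Toy model of on_commit semantics; not Django's implementation."""

    def __init__(self):
        self.depth = 0        # nesting level of atomic() blocks
        self.callbacks = []   # hooks waiting for the outermost commit

    def on_commit(self, func):
        if self.depth == 0:
            func()            # no open transaction: run right away
        else:
            self.callbacks.append(func)

    @contextmanager
    def atomic(self):
        self.depth += 1
        start = len(self.callbacks)   # hooks registered before this block
        try:
            yield
        except Exception:
            # Rollback: drop the hooks registered inside this block.
            del self.callbacks[start:]
            raise
        finally:
            self.depth -= 1
        if self.depth == 0:
            # Outermost block committed: run hooks in registration order.
            pending, self.callbacks = self.callbacks, []
            for func in pending:
                func()


tx = ToyTransaction()
calls = []

with tx.atomic():
    tx.on_commit(lambda: calls.append("foo"))
    with tx.atomic():
        tx.on_commit(lambda: calls.append("bar"))
    ran_inside = list(calls)   # snapshot taken before the outer block ends

try:
    with tx.atomic():
        tx.on_commit(lambda: calls.append("never"))
        raise RuntimeError("simulated failure, so the block rolls back")
except RuntimeError:
    pass
```

In this model nothing runs while the outer block is still open, and the hook registered in the failing block is never called, which is exactly the guarantee you want for something like a Celery task.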
Django documentation:
Django provides the on_commit() function to register callback functions that should be executed after a transaction is successfully committed
It is the main purpose. A transaction is a unit of work that you want to treat atomically. It either happens completely or not at all. The same applies to your code. If something went wrong during DB operations you might not need to do some things.
Let's consider some business logic flow:
1. User sends his registration data to our endpoint, we validate it, etc.
2. We save the new user to our DB.
3. We send him a "hello" letter to email with a link for confirming his account.
If something goes wrong during step 2, we shouldn't go to step 3.
We can think that, well, I'll get an exception and wouldn't execute that code as well. Why do we still need it?
Sometimes you take actions in your code based on an assumption of the transaction being successful before potentially dangerous DB operations. For example, you first want to check whether you can send an email to your user, because you know that your third-party email service often returns a 500. In that case, you want to raise a 500 for the user and ask him to register later (a very bad idea, btw, but it's just a synthetic example).
When your function (e.g. with the @atomic decorator) contains a lot of DB operations, you surely don't want to memorize all the variable states in order to use them after all the DB-related code. Like this:
Validation of user's order.
Checking at DB if it could be completed.
If it could be done we need to send a request to 3rd-party CRM with the order's details.
If it couldn't, then we should create a support ticket in another 3rd-party.
Saving user's order to DB, updating user's model.
Sending a messenger notification to the employee who is responsible for the order.
Saving information, that notification for employee was sent successfully to the DB.
You can imagine what a mess we would have in this situation without on_commit, with everything wrapped in one really big try/except.

Avoiding or handling "BadRequestError: The requested query has expired."?

I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
def loop_assets(cursor=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)
        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()
    except db.Timeout:
        cursor = assets.cursor()
        deferred.defer(loop_assets, cursor=cursor, _countdown=3, _target=version, _retry_options=dont_retry)
        return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a coincidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it can't be the same issue.
You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data
in a parallel and distributed fashion. It is useful for large,
long-running jobs that cannot be handled within the scope of a single
request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that in the case that the cursor 'expires', the query can restart w/o a cursor where it left off. Effectively change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.
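As a rough sketch of that hand-off, with an in-memory list standing in for the datastore (ASSETS, fetch_page, run_task and run_chain are made-up names for illustration, not App Engine APIs): each task processes one page and passes the last size it saw to the next one. Note the assumption that sizes are unique; with a strict 'size <' filter, entities that share the last seen size would be skipped.

```python
# In-memory stand-in for the Asset kind; sizes are assumed to be unique.
ASSETS = [
    {"name": "a", "size": 50},
    {"name": "b", "size": 40},
    {"name": "c", "size": 30},
    {"name": "d", "size": 20},
    {"name": "e", "size": 10},
]


def fetch_page(last_seen_size, limit):
    """Stand-in for the ordered query with the extra 'size <' filter."""
    rows = sorted(ASSETS, key=lambda a: a["size"], reverse=True)
    if last_seen_size is not None:
        rows = [a for a in rows if a["size"] < last_seen_size]
    return rows[:limit]


def run_task(last_seen_size, limit=2):
    """Process one page; return the value to pass to the next task."""
    page = fetch_page(last_seen_size, limit)
    for asset in page:
        # Idempotent update: repeating it after a retry changes nothing.
        asset["yay"] = True
    return page[-1]["size"] if page else None


def run_chain():
    """Stand-in for the chain of deferred tasks; returns how many did work."""
    last_seen_size, tasks = None, 0
    while True:
        next_size = run_task(last_seen_size)
        if next_size is None:
            return tasks
        last_seen_size, tasks = next_size, tasks + 1


tasks_run = run_chain()
```

Since the resume point is a plain value rather than a cursor, a task that dies can simply be restarted with the same argument.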

Django design pattern for web analytics screens that take a really long time to calculate

I have an "analytics dashboard" screen that is visible to my django web applications users that takes a really long time to calculate. It's one of these screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to set up, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to set up.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to set up a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent this, just be aware of this—what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc etc.
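One possible shape for such a locking mechanism, as a sketch only: take a non-blocking exclusive lock on a lock file and skip the run if it is already held. This assumes a Unix-like system (fcntl is not available on Windows); single_instance and the lock file path are hypothetical names.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this run got the lock, False if another run holds it."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except OSError:
            acquired = False
        yield acquired
    finally:
        # Closing the file releases the lock if this run held it.
        handle.close()


# Hypothetical lock file location, for the demonstration only.
lock_file = os.path.join(tempfile.mkdtemp(), "yourcommand.lock")
results = []

with single_instance(lock_file) as first:
    results.append(first)
    # A second run starting while the first one is still busy:
    with single_instance(lock_file) as second:
        results.append(second)

# A later run, after the first one has finished:
with single_instance(lock_file) as third:
    results.append(third)
```

In a management command you would return early when the context manager yields False. A lock held this way goes away when the process exits, so a crashed job does not leave a stale lock behind.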
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regards to who is actually using the system. This is where celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication that the chart is stale, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend. I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle
from django.core.cache import cache
from django.shortcuts import render
from mytasks import calculate_stuff
from celery.task import task

# the task itself, shown here for reference (imported from mytasks above)
@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series
    averages = TransactionAverage.objects.filter(user=user_id, ...)
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)

def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process.
        ctx['stale_chart'] = True  # display a warning, if you like
    ctx['averages'] = averages
    # ... do your other work ...
    return render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as things are updated, so that when the user opens the dashboard you calculate nothing and just display already calculated values.
There are various ways to do that, but the generic concept is to trigger a calculation function in the background whenever something the calculation depends on changes.
For triggering such a calculation in the background I usually use Celery. Suppose the user adds an item foo in view view_foo: we call a Celery task update_foo_count, which runs in the background and updates the foo count. Alternatively, you can have a Celery timer that updates the count, say, every 10 minutes by checking whether a recalculation is needed; the recalculate flag can be set at the various places where the user updates data.
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.

Schema migration on GAE datastore

First off, this is my first post on Stack Overflow, so please forgive any newbish mis-steps. If I can be clearer in terms of how I frame my question, please let me know.
I'm running a large application on Google App Engine, and have been adding new features that are forcing me to modify old data classes and add new ones. In order to clean our database and update old entries, I've been trying to write a script that can iterate through instances of a class, make changes, and then re-save them. The problem is that Google App Engine times out when you make calls to the server that take longer than a few seconds.
I've been struggling with this problem for several weeks. The best solution that I've found is here: http://code.google.com/p/rietveld/source/browse/trunk/update_entities.py?spec=svn427&r=427
I created a version of that code for my own website, which you can see here:
def schema_migration(self, target, batch_size=1000):
    last_key = None
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail, "Payment": Payment, "Promotion": Promotion}
    while True:
        q = calls[target].all()
        if last_key:
            q.filter('__key__ >', last_key)
        q.order('__key__')
        this_batch_size = batch_size
        while True:
            try:
                batch = q.fetch(this_batch_size)
                break
            except (db.Timeout, DeadlineExceededError):
                logging.warn("Query timed out, retrying")
                if this_batch_size == 1:
                    logging.critical("Unable to update entities, aborting")
                    return
                this_batch_size //= 2
        if not batch:
            break
        keys = None
        while not keys:
            try:
                keys = db.put(batch)
            except db.Timeout:
                logging.warn("Put timed out, retrying")
        last_key = keys[-1]
        print "Updated %d records" % (len(keys),)
Strangely, the code works perfectly for classes with between 100 - 1,000 instances, and the script often takes around 10 seconds. But when I try to run the code for classes in our database with more like 100K instances, the script runs for 30 seconds, and then I receive this:
"Error: Server Error
The server encountered an error and could not complete your request.
If the problem persists, please report your problem and mention this error message and the query that caused it."
Any idea why GAE is timing out after exactly thirty seconds? What can I do to get around this problem?
Thank you!
Keller
You are hitting the second DeadlineExceededError, by the sound of it. App Engine requests can only run for 30 seconds each. When DeadlineExceededError is raised, it's your job to stop processing and tidy up, as you are running out of time; the next time it is raised you cannot catch it.
You should look at using the Mapper API to split your migration into batches and run each batch using the Task Queue.
The start of your solution will be to migrate to using GAE's Task Queues. This feature will allow you to queue some more work to happen at a later time.
That won't actually solve the problem immediately, because even task queues are limited to short timeslices. However, you can unroll your loop to process a handful of rows in your database at a time. After completing each batch, it can check to see how long it has been running, and if it's been long enough, it can start a new task in the queue to continue where the current task will leave off.
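The shape of that loop, sketched in plain Python with a list standing in for the datastore and a deque standing in for the task queue (migrate_slice, drain and the fake clock are all hypothetical, purely for illustration):

```python
import itertools
import time
from collections import deque


def migrate_slice(rows, start, budget, batch_size=2, clock=time.monotonic):
    """Update batches from index start; return where to resume, or None."""
    began = clock()
    position = start
    while position < len(rows):
        for row in rows[position:position + batch_size]:
            row["migrated"] = True   # stand-in for the real update and put
        position += batch_size
        if position < len(rows) and clock() - began >= budget:
            return position          # out of time: hand over to a new task
    return None


def drain(rows, budget, clock):
    """Stand-in for the task queue: run until no task re-enqueues itself."""
    queue, runs = deque([0]), 0
    while queue:
        runs += 1
        resume_at = migrate_slice(rows, queue.popleft(), budget, clock=clock)
        if resume_at is not None:
            queue.append(resume_at)  # real code would add a queue task here
    return runs


rows = [{"id": i} for i in range(10)]
ticks = itertools.count()            # fake clock: one "second" per reading
runs = drain(rows, budget=2, clock=lambda: next(ticks))
```

The resume position is the only state passed between tasks, so each task stays well inside its time slice no matter how large the table is.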
An alternative solution is to not migrate the data. Change the implementing logic so that each entity knows whether or not it has been migrated. Newly created entities, or old entities that get updated, will take the new format. Since GAE doesn't require that entities have all the same fields, you can do this easily, where on a relational database, that wouldn't be practical.
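A minimal sketch of that lazy approach, with dicts standing in for entities. The schema_version field, the name split and the helper names are all assumptions made up for the example:

```python
CURRENT_VERSION = 2


def upgrade(entity):
    """Bring one entity to the current format; no-op if already current."""
    if entity.get("schema_version", 1) < 2:
        # Hypothetical change: split "name" into first and last name.
        first, _, last = entity.pop("name", "").partition(" ")
        entity["first_name"], entity["last_name"] = first, last
        entity["schema_version"] = 2
    return entity


def load(store, key):
    """Read an entity, upgrading it the first time it is touched."""
    entity = store[key]
    if entity.get("schema_version", 1) != CURRENT_VERSION:
        store[key] = upgrade(dict(entity))
    return store[key]


# A dict stands in for the datastore; old and new formats live side by side.
store = {
    "old": {"name": "Ada Lovelace"},
    "new": {"first_name": "Alan", "last_name": "Turing", "schema_version": 2},
}
migrated = load(store, "old")
```

Entities that are never read again simply stay in the old format, which is fine as long as every read path goes through the upgrade step.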

Google App Engine - design considerations about cron tasks

I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: Add a 'last snapshot' field, and subclass the put() function of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks if it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
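Roughly, and with plain dicts instead of datastore models (snapshot_key, put_user and snapshot_for are illustrative names, and the hour-based key format is an assumption), the scheme could look like this:

```python
from datetime import datetime, timedelta

snapshots = {}   # key name -> snapshot; a dict stands in for the datastore


def snapshot_key(user_id, when):
    """Key name derived from the user and the hour of the snapshot."""
    return "%s:%s" % (user_id, when.strftime("%Y%m%d%H"))


def put_user(user, now):
    """Save hook: write a snapshot if the last one is an hour old or more."""
    last = user.get("last_snapshot")
    if last is None or now - last >= timedelta(hours=1):
        # Concurrent writers in the same hour produce the same key,
        # so one snapshot just overwrites the other.
        snapshots[snapshot_key(user["id"], now)] = {
            "user": user["id"],
            "score": user["score"],
            "taken": now,
        }
        user["last_snapshot"] = now
    # ... the real put() would store the user record itself here ...


def snapshot_for(user_id, period_start):
    """Oldest snapshot taken at or after the requested time, or None."""
    candidates = [
        s for s in snapshots.values()
        if s["user"] == user_id and s["taken"] >= period_start
    ]
    return min(candidates, key=lambda s: s["taken"]) if candidates else None


start = datetime(2024, 1, 1, 10, 5)
user = {"id": "u1", "score": 10}
put_user(user, start)                              # first snapshot
user["score"] = 20
put_user(user, start + timedelta(minutes=30))      # too soon, no snapshot
user["score"] = 30
put_user(user, start + timedelta(minutes=90))      # second snapshot
```

Users who are never updated never produce snapshots, which is where the space saving comes from.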
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since this would put you over your quotas pretty quickly if it never ended.
