Django: updating many objects with per-object calculation - python

This question is a continuation of one I asked yesterday: I'm still not sure if a post_save handler or a 2nd Celery task is the best way to update many objects based on the results of the first Celery task, but I plan to test performance down the line. Here's a recap of what's happening:
Celery task, every 30s:
    Update page_count field of Book object based on conditions
            |
            |  post_save(Book)
            V
    Update some field on all Reader objects w/ foreign key to updated Book
    (update will have different results per-Reader, thousands of Readers could be FKed to Book)
The first task could save ~10 objects, requiring the update to all related Reader objects for each.
Whichever proves to be better between post_save and another task, it must accomplish the same thing: update potentially tens to hundreds of thousands of objects in a table, with each object's update being unique. It could be that my choice between post_save and a Celery task is determined by which method will actually allow me to accomplish this goal.
Since I can't just use a few queryset update() commands, I need to somehow call a method or function that calculates the value of a field based on the result of the first Celery task as well as some of the values in the object. Here's an example:
class Reader(models.Model):
    book = models.ForeignKey(Book)
    pages_read = models.IntegerField(default=0)
    book_finished = models.BooleanField(default=False)

    def determine_book_finished(self):
        if self.pages_read == self.book.page_count:
            self.book_finished = True
        else:
            self.book_finished = False
This is a contrived example, but if the page_count was updated in the first task, I want all Readers foreign-keyed to the Book to have their book_finished recalculated, and looping over a queryset seems like a really inefficient way to go about it.
My thought was to somehow call a model method such as determine_book_finished() on an entire queryset at once, but I can't find any documentation on how to do something like that; custom querysets don't appear to be intended for actually operating on the objects in the queryset beyond the built-in update() capability.
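One option worth noting, as a hedged sketch rather than anything from the original question: Django's conditional expressions let a single UPDATE carry per-row logic, so a rule like determine_book_finished() can be pushed into the database instead of looped over in Python. Field names follow the example above; the function name is illustrative.
from django.db.models import BooleanField, Case, Value, When

def recalc_book_finished(book):
    # One UPDATE for every Reader of this book; the CASE expression does the per-row branching.
    Reader.objects.filter(book=book).update(
        book_finished=Case(
            When(pages_read=book.page_count, then=Value(True)),
            default=Value(False),
            output_field=BooleanField(),
        )
    )
More complex branching (the elif cases mentioned later in the question) can usually be expressed as additional When() clauses.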
This post using Celery is the most promising thing I've found, and since Django signals are synchronous, using another Celery task would also have the benefit of not holding anything else up. So even though I'd still need to loop over a queryset, it'd be async, and any querysets that needed to be updated could be handled by separate tasks, hopefully in parallel.
On the other hand, this question seems to have a solution too: register the method with the post_save signal, which presumably would run the method on all objects after receiving the signal. Would this be workable with thousands of objects needing an update, as well as potentially other Books being updated by the same task and their thousands of associated Readers then needing an update too?
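For comparison, here is a rough sketch of the post_save-plus-task wiring being weighed here: the receiver only enqueues, so the synchronous signal stays cheap, and the heavy per-Reader work happens in the worker. The task name is hypothetical, and the body reuses the bulk-update idea sketched above.
from celery import shared_task
from django.db.models.signals import post_save
from django.dispatch import receiver

@shared_task
def recalc_readers_for_book(book_id):
    book = Book.objects.get(pk=book_id)   # re-fetch: the worker may run well after the save
    recalc_book_finished(book)            # e.g. the bulk update sketched above

@receiver(post_save, sender=Book)
def on_book_saved(sender, instance, **kwargs):
    recalc_readers_for_book.delay(instance.pk)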
Is there a best practice for doing what I'm trying to do here?
EDIT: I realized I could go about this another way: making the book_finished field a property determined at runtime rather than a static field.
@property
def book_finished(self):
    if self.pages_read == self.book.page_count:
        if self.book.page_count == self.book.planned_pages:
            return True
        else:
            return False
    # any other case falls through and returns None, which is falsy
This is close enough to my actual code; in that version, the first if branch contains a couple of elif branches, each with its own if-else, for a maximum nesting depth of three ifs.
Until I can spin up a lot of test data and simulate many simultaneous users, I may stick with this option as it definitely works (for now). I don't really like having the property recalculated on every access, but from some quick research, it doesn't seem like an overly slow approach.
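If the computed-on-access property ever becomes a hot path over large lists, the same logic can be pushed into the query with an annotation so the database does the branching. This is only a sketch, using the field names from the example above:
from django.db.models import BooleanField, Case, F, Value, When

readers = Reader.objects.annotate(
    finished=Case(
        When(pages_read=F('book__page_count'),
             book__page_count=F('book__planned_pages'),
             then=Value(True)),
        default=Value(False),
        output_field=BooleanField(),
    )
)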

Related

Overriding delete() vs using pre delete signal

I have a model where, when an object gets deleted, I would like a status to be updated instead of the object actually being deleted. This was achieved with the following code:
def delete(self, using=None, keep_parents=False):
    self.status = Booking.DELETED
    self.save()
The manager was updated so that the rest of the application never gets presented with deleted bookings.
class BookingManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset().exclude(status=Booking.DELETED)

class BookingDeletedManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset().filter(status=Booking.DELETED)

class Booking(models.Model):
    PAYED = 0
    PENDING = 1
    OPEN = 2
    CANCELLED = 3
    DELETED = 4

    objects = BookingManager()
    deleted_objects = BookingDeletedManager()
    ...
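A brief usage sketch of how the two managers above behave (the booking variable is illustrative):
active = Booking.objects.all()            # excludes status=DELETED everywhere by default
removed = Booking.deleted_objects.all()   # only the soft-deleted bookings

booking.delete()                          # overridden: flips status to DELETED, no SQL DELETE issued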
Now I have read up on Django signals and was wondering whether it would be better to use the pre_delete signal here. The code could be changed so that the pre_delete receiver creates a duplicate with status 'deleted' and the original booking is actually deleted.
The documentation states that these signals should be used to allow decoupled applications to get notified when actions occur elsewhere in the framework. It seems the signal is a good solution, but this nuance in the documentation makes me think it's maybe not what I want and that overriding might just be the way to go.
That decoupled-applications case doesn't really apply here, since I just want this functionality all the time. So my question is: is there a solid reason why I should not override the delete method and use the pre_delete signal instead, or vice versa?
The documentation states that these signals should be used to allow decoupled applications to get notified when actions occur elsewhere in the framework.
That's indeed the point: allowing one application to get notified of events occurring in another application that knows nothing about the first one - hence avoiding the need to couple the second application to the first one.
In your case, using model signals instead of just overriding the model's method would be an anti-pattern: it would only add overhead and make your code less readable for absolutely no good reason, when the very obvious solution is to just do what you did.

How efficient is passing an object over doing a get?

I have a view function which does a get() on objects (say A, B and C) using their ids.
This view function calls a local function.
Should I be passing the objects to the local function or should I pass the Ids and do a get again there? Which is more efficient?
Which is a bigger overhead, passing an object or retrieving an object using get()?
I'm not sure why you would think there is any overhead in passing an object into a function. That will always be cheaper than querying it from the database again, which would mean constructing the query, calling the database, and instantiating something from the result.
The only time you would definitely need to pass IDs rather than the object is in an asynchronous context like a Celery task; there, you want to be sure that you get the most recent version of the object which might have been changed in the DB by the time the task is processed.
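As a minimal sketch of that asynchronous case: pass the primary key and re-fetch inside the task so the worker sees the current row rather than a stale copy. The task, model, and compute_score names here are hypothetical.
from celery import shared_task

@shared_task
def recalculate_score(obj_id):
    obj = MyModel.objects.get(pk=obj_id)   # fetched when the task runs, not when it was queued
    obj.score = compute_score(obj)         # hypothetical per-object calculation
    obj.save(update_fields=['score'])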
Passing the Python object has practically no overhead, while passing the id and repeating the get() incurs a database lookup, which is much more costly.
However, which one should be preferred depends on the particular case. Consider a multi-user scenario in which two users modify the same object:
from django.db import models

class Movie(models.Model):
    coolness = models.IntegerField()
    times_modified = models.IntegerField()
A view function might call some heavy computation to calculate coolness from a request:
import time

def update_coolness(delta_coolness, id):
    movie = Movie.objects.get(pk=id)  # Read object from database
    movie.times_modified += 1
    movie.save()

    time.sleep(5)  # Heavy computation
    movie.coolness += delta_coolness * 4  # Very cool movie
    movie.save()
If multiple users execute this view at the same time, the results are likely to be wrong because of traditional parallel-computing problems (lost updates).
You can avoid saving the wrong coolness value by repeating the get request, or simply refreshing the object:
time.sleep(5)  # Heavy computation
# movie = Movie.objects.get(pk=id)  # make sure we have the newest version
movie.refresh_from_db()  # make sure we have the newest version
movie.coolness += delta_coolness * 4  # Very cool movie
movie.save()
That massively reduces the window in which other users could interfere with your computation. The cleaner solution, however, is to wrap the critical sections in transaction.atomic() and fetch the row with select_for_update(), which locks it for the duration of the transaction:
from django.db import transaction

def update_coolness(delta_coolness, id):
    with transaction.atomic():
        movie = Movie.objects.get(pk=id)  # Read object from database
        movie.times_modified += 1
        movie.save()

    time.sleep(5)  # Heavy computation

    with transaction.atomic():
        # Re-fetch with a row lock; other writers block until this transaction commits
        movie = Movie.objects.select_for_update().get(pk=id)
        movie.coolness += delta_coolness * 4  # Very cool movie
        movie.save()
Passing arguments around in this way is quite cheap: under the hood, it is implemented in terms of a single additional pointer. This will almost certainly be faster than invoking the Django machinery again, which for a lookup by ID has to involve at minimum a (still fast, but relatively slower) dictionary lookup, or, if it doesn't do caching, could involve re-querying the database (which is going to be noticeably slow, especially if the database is big).
Prefer passing local variables around where possible, unless there is a benefit to code clarity from doing otherwise (though I can't think of a case where local variables wouldn't be the clearer option), or unless the "outside world" captured by that object might have changed in ways you need to be aware of.

Django long-running task - Database consistency

I have a task running in a thread that saves its input data only after quite a while. Django typically saves the whole object, which may have changed in the meantime. I also don't want to hold a transaction open that long, since it would fail or block other tasks. My solution is to reload the data and then save the result. Is this the way to go, or is there an optimistic locking scheme, a partial save, or something else I should use?
My solution:
with transaction.atomic():
    obj = mod.TheModel.objects.get(id=the_id)

# Work the task

with transaction.atomic():
    obj = mod.TheObject.objects.get(id=obj.id)
    obj.result = result
    obj.save()
Generally, if you don't want long operations to block other tasks, you want those operations to be asynchronous.
There are libraries for doing this kind of task with Django. The most famous is probably Celery: http://www.celeryproject.org/
I would recommend using a partial save via the QuerySet update method. There's also an update_fields keyword argument on the instance save() method which limits the fields to be written; however, any logic within the save method itself might rely on other data being up to date (there's the relatively new refresh_from_db() instance method for that purpose). If your save method isn't overridden, then both will produce exactly the same SQL anyway, with update() avoiding any potential issues with data integrity.
Example:
num_changed = mod.TheObject.objects.filter(id=obj.id).update(result=result)
if num_changed == 0:
    # Handle as you would have handled DoesNotExist from your get call
    ...
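For completeness, a small sketch of the save(update_fields=...) alternative mentioned above; it writes only the named columns, so fields changed elsewhere in the meantime are left untouched:
obj = mod.TheObject.objects.get(id=obj.id)  # short re-fetch just before writing
obj.result = result
obj.save(update_fields=['result'])          # UPDATE ... SET result = %s WHERE id = %s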

How to fetch Riak object, change its value and store it back with all indexes in Python

I am using the Riak database to store my Python application objects that are used and processed in parallel by multiple scripts. Because of that, I need to lock them in various places to avoid their being processed by more than one script at once, like this:
riak_bucket = riak_connect('clusters')
cluster = riak_bucket.get(job_uuid).get_data()
cluster['status'] = 'locked'
riak_obj = riak_bucket.new(job_uuid, data=cluster)
riak_obj.add_index('username_bin', cluster['username'])
riak_obj.add_index('hostname_bin', cluster['hostname'])
riak_obj.store()
The thing is, this is quite a bit of code to do one simple, repeatable thing, and given that locking occurs quite often, I would like to find a simpler, cleaner way of doing it. I've tried to write a function to do the locking/unlocking, like this (for a different object, called 'build'):
def build_job_locker(uuid, status='locked'):
    riak_bucket = riak_connect('builds')
    build = riak_bucket.get(uuid).get_data()
    build['status'] = status
    riak_obj = riak_bucket.new(build['uuid'], data=build)
    riak_obj.add_index('cluster_uuid_bin', build['cluster_uuid'])
    riak_obj.add_index('username_bin', build['username'])
    riak_obj.store()
    # when locking, return the locked db object to avoid fetching it again
    if 'locked' in status:
        return build
    else:
        return
but since the objects are obviously quite different from one another, with different indexes and so on, I ended up writing a locking function for every object type... which is almost as messy as not having the functions at all and just repeating the code.
The question is: is there a way to write a general function to do this, knowing that every object has a 'status' field, that would lock the objects in the db while retaining all their indexes and other attributes? Or, perhaps, is there another, easier way I haven't thought about?
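One way to cut the duplication, offered only as a hedged sketch: if you are willing to pass the index field names in explicitly rather than discover them from Riak, a single helper built from the same client calls used above can cover both cases. The function name and parameters are illustrative.
def set_status(bucket_name, key, index_fields, status='locked'):
    riak_bucket = riak_connect(bucket_name)
    data = riak_bucket.get(key).get_data()
    data['status'] = status
    riak_obj = riak_bucket.new(key, data=data)
    for field in index_fields:                      # e.g. ('username', 'hostname')
        riak_obj.add_index('%s_bin' % field, data[field])
    riak_obj.store()
    return data if status == 'locked' else None

# set_status('clusters', job_uuid, ('username', 'hostname'))
# set_status('builds', build_uuid, ('cluster_uuid', 'username'))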
After some more research, and questions asked on various IRC channels it seems that this is not doable, as there's no way to fetch this kind of metadata about objects from Riak.

Per-session transactions in Django

I'm making a Django web-app which allows a user to build up a set of changes over a series of GETs/POSTs before committing them to the database (or reverting) with a final POST. I have to keep the updates isolated from any concurrent database users until they are confirmed (this is a configuration front-end), ruling out committing after each POST.
My preferred solution is to use a per-session transaction. This keeps all the problems of remembering what's changed (and how it affects subsequent queries), together with implementing commit/rollback, in the database where it belongs. Deadlock and long-held locks are not an issue, as due to external constraints there can only be one user configuring the system at any one time, and they are well-behaved.
However, I cannot find documentation on setting up Django's ORM to use this sort of transaction model. I have thrown together a minimal monkey-patch (ew!) to solve the problem, but dislike such a fragile solution. Has anyone else done this before? Have I missed some documentation somewhere?
(My version of Django is 1.0.2 Final, and I am using an Oracle database.)
Multiple, concurrent, session-scale transactions will generally lead to deadlocks or worse (worse == livelock, long delays while locks are held by another session.)
This design is not the best policy, which is why Django discourages it.
The better solution is the following.
1. Design a Memento class that records the user's change. This could be a saved copy of their form input. You may need to record additional information if the state changes are complex; otherwise, a copy of the form input may be enough.
2. Accumulate the sequence of Memento objects in their session. Note that each step in the transaction will involve fetches from the database and validation to see if the chain of mementos will still "work". Sometimes they won't work, because someone else changed something touched by this chain of mementos. What now?
3. When you present the 'ready to commit?' page, you've replayed the sequence of Mementos and are pretty sure they'll work. When they submit "Commit", you have to replay the Mementos one last time, hoping they're still going to work. If they do, great. If they don't, someone changed something and you're back at step 2: what now?
This seems complex.
Yes, it does. However it does not hold any locks, allowing blistering speed and little opportunity for deadlock. The transaction is confined to the "Commit" view function which actually applies the sequence of Mementos to the database, saves the results, and does a final commit to end the transaction.
The alternative -- holding locks while the user steps out for a quick cup of coffee on step n-1 out of n -- is unworkable.
For more information on Memento, see this.
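A minimal sketch of the Memento idea in Django terms, assuming the recorded change is just the cleaned form data and that the replay logic is supplied by the caller; transaction.atomic is the modern API (older Django used commit_on_success), and record_change/commit_changes/apply_change are illustrative names:
from django.db import transaction

def record_change(request, cleaned_data):
    # Each POST appends one memento (a copy of the validated form input) to the session.
    request.session.setdefault('mementos', []).append(cleaned_data)
    request.session.modified = True     # mutating a stored list is not auto-detected

def commit_changes(request, apply_change):
    # Replay every memento inside one short transaction; apply_change should
    # re-validate against current data and raise if the chain no longer works.
    with transaction.atomic():
        for memento in request.session.pop('mementos', []):
            apply_change(memento)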
In case anyone else ever has the exact same problem as me (I hope not), here is my monkeypatch. It's fragile and ugly, and changes private methods, but thankfully it's small. Please don't use it unless you really have to. As mentioned by others, any application using it effectively prevents multiple users doing updates at the same time, on penalty of deadlock. (In my application, there may be many readers, but multiple concurrent updates are deliberately excluded.)
I have a "user" object which persists across a user session, and contains a persistent connection object. When I validate a particular HTTP interaction is part of a session, I also store the user object on django.db.connection, which is thread-local.
def monkeyPatchDjangoDBConnection():
    import django.db

    def validConnection():
        if django.db.connection.connection is None:
            # Reuse the per-user persistent connection stored on the thread-local
            django.db.connection.connection = django.db.connection.user.connection
        return True

    def close():
        # Detach rather than close, so the session-long connection survives
        django.db.connection.connection = None

    django.db.connection._valid_connection = validConnection
    django.db.connection.close = close

monkeyPatchDjangoDBConnection()

def setUserOnThisThread(user):
    import django.db
    django.db.connection.user = user
This last is called automatically at the start of any method annotated with @login_required, so 99% of my code is insulated from the specifics of this hack.
I came up with something similar to the Memento pattern, but different enough that I think it bears posting. When a user starts an editing session, I duplicate the target object to a temporary object in the database. All subsequent editing operations affect the duplicate. Instead of saving the object state in a memento at each change, I store operation objects. When I apply an operation to an object, it returns the inverse operation, which I store.
Saving operations is much cheaper for me than mementos, since the operations can be described with a few small data items, while the object being edited is much bigger. Also I apply the operations as I go and save the undos, so that the temporary in the db always corresponds to the version in the user's browser. I never have to replay a collection of changes; the temporary is always only one operation away from the next version.
To implement "undo," I pop the last undo object off the stack (as it were--by retrieving the latest operation for the temporary object from the db) apply it to the temporary and return the transformed temporary. I could also push the resultant operation onto a redo stack if I cared to implement redo.
To implement "save changes," i.e. commit, I de-activate and time-stamp the original object and activate the temporary in it's place.
To implement "cancel," i.e. rollback, I do nothing! I could delete the temporary, of course, because there's no way for the user to retrieve it once the editing session is over, but I like to keep the canceled edit sessions so I can run stats on them before clearing them out with a cron job.
