How efficient is passing an object over doing a get? - python

I have a view function which does a get on objects (say A, B, and C) using their IDs.
This view function calls a local function.
Should I be passing the objects to the local function, or should I pass the IDs and do a get again there? Which is more efficient?
Which is the bigger overhead: passing an object, or retrieving an object again with get?

I'm not sure why you would think there is any overhead in passing an object into a function. That will always be cheaper than querying it from the database again, which would mean constructing the query, calling the database, and instantiating something from the result.
The only time you would definitely need to pass IDs rather than the object is in an asynchronous context like a Celery task; there, you want to be sure that you get the most recent version of the object which might have been changed in the DB by the time the task is processed.
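A minimal sketch of that asynchronous case (the Book model and task name here are made up): pass only the primary key and let the task re-fetch the row when it actually runs, so it sees the latest data.
from celery import shared_task
from myapp.models import Book   # hypothetical model

@shared_task
def process_book(book_id):
    book = Book.objects.get(pk=book_id)   # fetched fresh at execution time
    ...                                   # work on the up-to-date instance

# In the view: pass the id, not the instance, because the task may run much later.
# process_book.delay(book.pk)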

Passing the Python object has no overhead, while passing the id and repeating the get() incurs a database lookup, which is much more costly.
However it depends on the particular case which one should be preferred. Consider a multiuser case in which two users modify the same object:
from django.db import models

class Movie(models.Model):
    coolness = models.IntegerField()
    times_modified = models.IntegerField()
A view function might call some heavy computation to calculate coolness from a request:
import time

def update_coolness(delta_coolness, id):
    movie = Movie.objects.get(pk=id)  # Read object from database
    movie.times_modified += 1
    movie.save()

    time.sleep(5)  # Heavy computation
    movie.coolness += delta_coolness * 4  # Very cool movie
    movie.save()
If multiple users execute this at the same time, the results are likely to be wrong because of classic race conditions (lost updates).
You can avoid saving the wrong coolness value by repeating the get request, or simply refreshing the object:
time.sleep(5)  # Heavy computation
# movie = Movie.objects.get(pk=id)  # make sure we have the newest version
movie.refresh_from_db()  # make sure we have the newest version
movie.coolness += delta_coolness * 4  # Very cool movie
movie.save()
This massively reduces the window in which other users can interfere with your computation. The cleaner solution, however, is to wrap each read-modify-write in transaction.atomic, which runs it inside a single database transaction:
from django.db import transaction

def update_coolness(delta_coolness, id):
    with transaction.atomic():
        movie = Movie.objects.get(pk=id)  # Read object from database
        movie.times_modified += 1
        movie.save()

    time.sleep(5)  # Heavy computation

    with transaction.atomic():
        movie.refresh_from_db()  # make sure we have the newest version
        movie.coolness += delta_coolness * 4  # Very cool movie
        movie.save()
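Note that atomic() on its own only groups the statements into one transaction; if the row should actually be locked while it is modified, the usual addition (a sketch, not part of the original code above) is select_for_update(), which holds a database row lock until the transaction ends:
with transaction.atomic():
    # select_for_update() blocks other transactions that also lock this row
    # until this block commits, so the read-modify-write cannot be interleaved.
    movie = Movie.objects.select_for_update().get(pk=id)
    movie.coolness += delta_coolness * 4
    movie.save()
Keeping the heavy computation outside this block keeps the lock short-lived.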

Passing arguments around in this way is quite cheap: under the hood, it is implemented in terms of a single additional pointer. This will almost certainly be faster than invoking the django machinery again, which for a lookup by ID has to involve a (still fast, but relatively slower) dictionary lookup at minimum, or if it doesn't do caching, could involve requerying the database (which is going to be noticeably slow, especially if the database is big).
Prefer passing local variables around where possible, unless doing otherwise genuinely improves code clarity (though I can't think of any cases where local variables wouldn't be the clearer option), or unless the "outside world" captured by that object might have changed in ways you need to be aware of.
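Concretely, for the situation in the question, that just means fetching once in the view and handing the instances to the helper (the names here are made up):
def my_view(request, a_id, b_id, c_id):
    a = ModelA.objects.get(pk=a_id)
    b = ModelB.objects.get(pk=b_id)
    c = ModelC.objects.get(pk=c_id)
    return do_work(a, b, c)        # pass the instances themselves

def do_work(a, b, c):
    ...                            # no second .get() needed here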

Related

Performing an operation on every document in a MongoDB instance

I have a MongoDB collection with 1.5 million documents, all of which have the same fields, and I want to take the contents of Field A (which is unique in every document), perform f(A) on it, then create and populate Field B. Pseudocode in Python:
for i in collection.find():
    x = i**2
    collection.update(i, x)  # update i with x
NOTE: I am aware that the update code is probably wrong, but unless it affects the speed of operation, I chose to leave it there for the sake of simplicity
The problem is that this code is really slow: it can run through about 1000 documents in a second, then the server cuts off the cursor for about a minute, then it allows another 1000. I'm wondering if there is any way to optimize this operation, or if I'm stuck with this slow bottleneck.
Additional notes:
I have adjusted batch_size as an experiment; it is faster, but still not efficient, and it still takes hours
I am also aware that SQL could probably do this faster; there are other reasons I am using a NoSQL DB that are not relevant to this problem
The instance is running locally, so for all intents and purposes there is no network latency
I have seen this question, but its answer doesn't really address my problem
Database clients tend to be extremely abstracted from actual database activity, so observed delay behaviors can be deceptive. It's likely that you are actually hammering the database in that time, but the activity is all hidden from the Python interpreter.
That said, there are a couple things you can do to make this lighter.
1) Put an index on the property A that you're basing the update on. This will allow the find to return much faster (a short sketch of creating the index follows this list).
2) Put a projection operator on your find call:
for doc in collection.find(projection=['A']):
That will ensure that you only return the fields you need to, and if you've properly indexed the unique A property, will ensure your results are drawn entirely from the very speedy index.
3) Use an update operator to ensure you only have to send the new field back. Rather than send the whole document, send back the dictionary:
{'$set': {'B': a**2}}
which will create the field B in each document without affecting any of the other content.
So, the whole block will look like this:
for doc in collection.find(projection=['A', '_id']):
    collection.update({'_id': doc['_id']},
                      {'$set': {'B': doc['A']**2}})
That should cut down substantially on the work that Mongo has to do, as well as (currently irrelevant to you) network traffic.
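For point 1, the index itself is a one-liner, and if round trips are still the bottleneck the updates can also be batched. This is a sketch assuming pymongo 3.x (bulk_write is an addition beyond the points above, but it follows the same idea of reducing per-document traffic):
from pymongo import UpdateOne

collection.create_index('A', unique=True)   # point 1: index the unique A field

ops = []
for doc in collection.find(projection=['A', '_id']):
    ops.append(UpdateOne({'_id': doc['_id']}, {'$set': {'B': doc['A'] ** 2}}))
    if len(ops) == 1000:                    # send updates in batches of 1000
        collection.bulk_write(ops, ordered=False)
        ops = []
if ops:
    collection.bulk_write(ops, ordered=False)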
Maybe you should do your updates in multiple threads. It may be better to load the data in one thread, split it into multiple parts, and pass those parts to parallel worker threads that perform the updates. It will be faster.
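A rough sketch of that split-and-dispatch idea, reusing the collection object from the question (pymongo collections are thread-safe; the chunk size and worker count here are arbitrary):
from concurrent.futures import ThreadPoolExecutor

def process_chunk(ids):
    # each worker handles its own slice of _ids
    for doc in collection.find({'_id': {'$in': ids}}, projection=['A']):
        collection.update({'_id': doc['_id']}, {'$set': {'B': doc['A'] ** 2}})

all_ids = [d['_id'] for d in collection.find(projection=['_id'])]      # load in one thread
chunks = [all_ids[i:i + 1000] for i in range(0, len(all_ids), 1000)]   # split into parts
with ThreadPoolExecutor(max_workers=8) as pool:                        # parallel workers
    pool.map(process_chunk, chunks)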
EDIT:
I suggest doing paginated queries.
Python pseudocode:
count = collection.count()
page_size = 20
i = 0
while i < count:
    for row in collection.find().limit(page_size).skip(i):
        x = row['A']**2
        collection.update(row, x)  # pseudocode, as in the question
    i += page_size

Django: updating many objects with per-object calculation

This question is a continuation of one I asked yesterday: I'm still not sure if a post_save handler or a 2nd Celery task is the best way to update many objects based on the results of the first Celery task, but I plan to test performance down the line. Here's a recap of what's happening:
Celery task, every 30s:

    Update page_count field of Book object based on conditions
            |
            |  post_save(Book)
            V
    Update some field on all Reader objects with a foreign key to the updated Book
    (the update will have different results per Reader; thousands of Readers could be FKed to the Book)
The first task could save ~10 objects, requiring the update to all related Reader objects for each.
Whichever proves to be better between post_save and another task, they must accomplish the same thing: update potentially tens to hundreds of thousands of objects in a table, with each object update being unique. It could be that my choice between post_save and Celery task is determined by which method will actually allow me to accomplish this goal.
Since I can't just use a few queryset update() commands, I need to somehow call a method or function that calculates the value of a field based on the result of the first Celery task as well as some of the values in the object. Here's an example:
class Reader(models.Model):
    book = models.ForeignKey(Book)
    pages_read = models.IntegerField(default=0)
    book_finished = models.BooleanField(default=False)

    def determine_book_finished(self):
        if self.pages_read == self.book.page_count:
            self.book_finished = True
        else:
            self.book_finished = False
This is a contrived example, but if the page_count was updated in the first task, I want all Readers foreign-keyed to the Book to have their book_finished recalculated, and looping over a queryset seems like a really inefficient way to go about it.
My thought was to somehow call a model method such as determine_book_finished() on an entire queryset at once, but I can't find any documentation on how to do something like that- custom querysets don't appear to be intended for actually operating on objects in the queryset beyond the built-in update() capability.
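One queryset-level approach I'm considering for the contrived example (a sketch, assuming Django 1.8+ conditional expressions; update_readers_for is a made-up helper) would push the per-object calculation into a single conditional UPDATE:
from django.db.models import BooleanField, Case, Value, When

def update_readers_for(book):
    # One UPDATE statement instead of a Python loop over every Reader.
    Reader.objects.filter(book=book).update(
        book_finished=Case(
            When(pages_read=book.page_count, then=Value(True)),
            default=Value(False),
            output_field=BooleanField(),
        )
    )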
This post using Celery is the most promising thing I've found, and since Django signals are sync, using another Celery task would also have the benefit of not holding anything else up. So even though I'd still need to loop over a queryset, it'd be async and any querysets that needed to be updated could be handled by separate tasks, hopefully in parallel.
On the other hand, this question seems to have a solution too: register the method with the post_save signal, which presumably would run the method on all objects after receiving the signal. Would this be workable with thousands of objects needing updates, as well as potentially other Books being updated by the same task and their thousands of associated Readers then needing updates too?
Is there a best practice for doing what I'm trying to do here?
EDIT: I realized I could go about this another way: making book_finished a property determined at runtime rather than a static field.
@property
def book_finished(self):
    if self.pages_read == self.book.page_count:
        if self.book.page_count == self.book.planned_pages:
            return True
    return False
This is close enough to my actual code; there, the first if branch contains a couple of elif branches, each with its own if-else, for a maximum nesting depth of 3 ifs.
Until I can spin up a lot of test data and simulate many simultaneous users, I may stick with this option as it definitely works (for now). I don't really like having the property calculated on every retrieval, but from some quick research it doesn't seem like an overly slow approach.

Django long running task - Database consistency

I have a task running in a thread that will save its input data after quite a while. Django typically saves the whole object, which may have changed in the meantime. I also don't want to hold the transaction open that long, since it would fail or block other tasks. My solution is to reload the data and save the result. Is this the way to go, or is there some optimistic locking scheme, partial save, or something else I should use?
My solution:
with transaction.atomic():
    obj = mod.TheModel.objects.get(id=the_id)

# Work the task

with transaction.atomic():
    obj = mod.TheObject.objects.get(id=obj.id)
    obj.result = result
    obj.save()
Generally, if you don't want to block other tasks doing long operations, you want these operations to be asynchronous.
There are libraries to do this kind of task with Django. The most famous is probably Celery: http://www.celeryproject.org/
I would recommend a partial save via the update method on Django querysets. The save method on instances also has an update_fields keyword argument which will limit the fields to be saved, but any logic inside the save method itself might rely on other data being up to date; there is the relatively new refresh_from_db instance method for that purpose. If your save method isn't overridden, both approaches produce exactly the same SQL anyway, with update avoiding any potential issues with data integrity.
Example:
num_changed = mod.TheObject.objects.filter(id=obj.id).update(result=result)
if num_changed == 0:
    # Handle this as you would have handled DoesNotExist from your get call
    pass
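For completeness, the instance-level variant mentioned above (only worthwhile if logic in an overridden save() has to run) would look roughly like this:
obj.refresh_from_db()                  # re-read the current row first
obj.result = result
obj.save(update_fields=['result'])     # UPDATE only the result column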

LOTS of calls to the expire method of SQLAlchemy's InstanceState class

I'm performing data crunching tasks in 11 parallel processes, and the result of each computation is logged in an InnoDB table of a MySQL database using the SQLAlchemy ORM. However, processing times are larger than expected. If I profile the execution of one of these parallel processes, I can see that about 30% of the time is spent in the expire method of the InstanceState class, which gets called... 292,957,736 times!
The computation performs a loop with 17,106 iterations, and one commit is performed for each iteration. In the profile, I see that the commit method is called 17,868 times, which seems to be the right order of magnitude (the 761 extra commits probably coming from other parts of the surrounding code). However, it is not clear to me what that expire method does and why it should be called so many times. Is it called on EVERY row of the table at every commit, or what? It looks a bit like that, since 17,106^2 == 292,615,236... Is this behavior normal? Are there any recipes or advice on how to do things better in this kind of situation? The exact code is a bit complicated [it is in the __computeForEvent(...) method of this file], but the SQLAlchemy part is conceptually equivalent to this:
for i in range(17106):
    propagations = []
    for _ in range(19):
        propagations.append(Propagation(...))
    session.add_all(propagations)
    session.commit()
where Propagation is a Base subclass.
Any advice on how to speed things up and avoid this explosion of expire(...) calls would be much appreciated.
292M calls to expire() would indicate that there are this many objects present in memory when commit() is called, which is an unbelievably huge number in fact.
One immediate way to nix these expire calls is just to turn expire_on_commit to False:
sess = Session(expire_on_commit=False)
A more subtle way to get around this, which would require a little more care, is to simply not hold onto all those objects in memory. If we did:
for i in range(17106):
    session.add_all([Propagation() for _ in range(19)])
    session.commit()
then, as long as that list of Propagation() objects is not strongly referenced elsewhere and has no reference cycles, the objects would (assuming CPython) be garbage collected at the point of dereference, and would not be subject to the expiration call within commit().
Still another strategy is to delay the commit() until after the loop, using flush() to process each batch of items instead. That way, most objects will have been garbage collected by the time the commit() is reached.
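That delayed-commit strategy would look roughly like this for the loop in the question (a sketch):
for i in range(17106):
    session.add_all([Propagation() for _ in range(19)])
    session.flush()   # writes the rows; the session keeps only weak references,
                      # so once the list is dropped the objects can be collected
session.commit()      # a single expiration pass at the very end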
expire_on_commit remains the most direct way to solve this, though.

How to fetch Riak object, change its value and store it back with all indexes in Python

I am using the Riak database to store my Python application objects, which are used and processed in parallel by multiple scripts. Because of that, I need to lock them in various places to avoid their being processed by more than one script at once, like this:
riak_bucket = riak_connect('clusters')
cluster = riak_bucket.get(job_uuid).get_data()
cluster['status'] = 'locked'
riak_obj = riak_bucket.new(job_uuid, data=cluster)
riak_obj.add_index('username_bin', cluster['username'])
riak_obj.add_index('hostname_bin', cluster['hostname'])
riak_obj.store()
The thing is, this is quite a bit of code to do one simple, repeatable thing, and given that locking occurs quite often, I would like to find a simpler, cleaner way of doing it. I've tried to write a function to do locking/unlocking, like this (for a different object, called 'build'):
def build_job_locker(uuid, status='locked'):
    riak_bucket = riak_connect('builds')
    build = riak_bucket.get(uuid).get_data()
    build['status'] = status
    riak_obj = riak_bucket.new(build['uuid'], data=build)
    riak_obj.add_index('cluster_uuid_bin', build['cluster_uuid'])
    riak_obj.add_index('username_bin', build['username'])
    riak_obj.store()
    # when locking, return the locked db object to avoid fetching it again
    if 'locked' in status:
        return build
    else:
        return
but since the objects are obviously quite different from one another, with different indexes and so on, I ended up writing a locking function for every object... which is almost as messy as not having the functions at all and repeating the code.
The question is: is there a way to write a general function to do this, knowing that every object has a 'status' field, that would lock them in the DB while retaining all indexes and other attributes? Or, perhaps, is there another, easier way I haven't thought about?
After some more research, and questions asked on various IRC channels, it seems that this is not doable, as there's no way to fetch this kind of metadata about objects from Riak.
