In a technical interview, the interviewer asked me a strange question regarding the execution of querysets. Suppose we have a profile model like below:
class Profile(models.Model):
    user = models.OneToOneField('User').select_related(User)
    surname = models.TextField(null=True)
q = Profile.objects.all()
or
q = Profile.objects.get(id=1)
l = q.filter(active=True)
He asked how many queries had been executed, and I replied that since the Python interpreter executes Profile.objects.all() right at the beginning, one query has already run. However, he answered zero, and one if we evaluate the queryset, something like this:
for a in l:
    a.surname
Is his answer correct in Django?
Another doubt was about models.OneToOneField('User'): why didn't he use django.contrib.auth.models.User, and why did he define models.OneToOneField('User').select_related(User)?
QuerySets are not evaluated until you do something that actually needs them to be evaluated. As the documentation for the class itself states, a QuerySet:
Represent a lazy database lookup for a set of objects.
Emphasis on the word lazy. This is because one often needs to call or chain methods on a queryset; a good example is a group by, which requires subsequent calls to .values() and .annotate(). If querysets were evaluated immediately, every intermediate step would issue an unneeded query to the database, slowing execution to a crawl.
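For example, a group by is built up by chaining calls, none of which hits the database on its own (a minimal sketch reusing the Profile model from the question):

from django.db.models import Count

# Build a GROUP BY through chaining; no query runs until iteration.
surname_counts = (
    Profile.objects
    .values('surname')        # GROUP BY surname
    .annotate(n=Count('id'))  # COUNT per surname
)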
As to when exactly a queryset is evaluated, the short answer is the list below (for the long answer refer to When QuerySets are evaluated [Django docs]); a small demonstration follows the list:
Iterating a queryset
Slicing a queryset (with the step parameter)
Pickling/Caching a queryset
Calling repr(), len(), list(), or bool() on a queryset
Various methods such as get(), first(), last(), latest(), and earliest() also make a query to the database
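A minimal sketch of the lazy behaviour, using the Profile model and active field from the question:

qs = Profile.objects.all()       # no database query yet
qs = qs.filter(active=True)      # still no query; filters just stack
first = qs.first()               # a query runs here (LIMIT 1)
profiles = list(qs)              # a second query fetches all rows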
Related
I am writing a method that computes a complex Django QuerySet.
It starts like this
qs1 = A.objects.filter(b_set__c_obj__user=user)
qs1 will eventually become the result, but before it does, the method goes on with several further steps of filtering and annotation.
b_set is a 1:n relationship, but I know that at most one of the c_obj can actually match.
I need to reference this c_obj, because I need another attribute email from it for one of the filtering steps (which is against instances of another model D selected based on c_obj's email).
user can be either a User model instance or an OuterRef Django ORM expression, because the whole queryset created by the method is subsequently also to be used in a subquery.
Therefore, any solution that involves evaluating querysets is not suitable for me. I can only build a single queryset.
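For context, a hypothetical sketch of the two call styles this implies; build_queryset stands in for the method, and the user field on D is assumed for illustration:

from django.db.models import Exists, OuterRef

# Called with a concrete User instance: use the queryset directly.
qs_direct = build_queryset(user=some_user)

# Called with an OuterRef: the same queryset becomes a correlated subquery.
d_with_flag = D.objects.annotate(
    has_matching_a=Exists(build_queryset(user=OuterRef('user')))
)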
I need to reference this c_obj, because I need another attribute email from it for one of the filtering steps (which is against instances of another model D selected based on c_obj's email).
If you do all the filtering in the same .filter(…) call, then that c_obj will be the same for all conditions. In other words, if you want to filter A such that it has a related B and a related C that has, for example, email='foo@bar.com' and type='user', you can filter with:
qs1 = A.objects.filter(
    b_set__c_obj__email='foo@bar.com',
    b_set__c_obj__type='user'
)
This is thus different from:
qs1 = A.objects.filter(
    b_set__c_obj__email='foo@bar.com'
).filter(
    b_set__c_obj__type='user'
)
since here it will look for an A object that has a related c_obj with email='foo@bar.com' and a related c_obj (not necessarily the same one) with type='user'.
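If in doubt, printing the generated SQL makes the difference visible (a quick debugging sketch; the exact SQL varies by backend):

# The single .filter(...) joins b_set/c_obj once; the chained version
# introduces a second join, so the conditions can match different rows.
print(qs1.query)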
I have a queryset that is used in the code below.
result = 1 if queryset else 0
For a small queryset it's okay, but when the queryset gets bigger (more than 500,000 results) the program freezes and takes a long time to stop.
What is happening behind the scenes when Django's queryset is tested in the code above?
Is some extra work performed during that check?
Even though the queryset is big, there is no problem with calling count(), iterator(), or other methods; it is only in that conditional expression that the problem appears.
Edit:
The queryset is too big: the truth test populates the QuerySet's self._result_cache. The same thing happens with len() and with iterating over the queryset in a for loop.
Python uses either the __bool__ or the __len__ method to test the truth value of an object, and the implementation for the QuerySet class fetches all records:
https://github.com/django/django/blob/master/django/db/models/query.py#L279
def __bool__(self):
    self._fetch_all()
    return bool(self._result_cache)
It might be a better idea to use if queryset.count() or, if you only need to know whether any rows match, if queryset.exists().
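A minimal sketch of the fix, assuming you only need to know whether any rows match:

result = 1 if queryset.exists() else 0   # issues a cheap SELECT ... LIMIT 1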
I tried some code like this:
mymodels = MyModel.objects.filter(status=1)
mymodels.update(status=4)
print(mymodels)
And the result is an empty queryset.
I know that I can use a for loop instead of the bulk update, but that would issue a lot of individual update queries. Is there any way to continue manipulating mymodels after the bulk update?
Remember that Django's QuerySets are lazy:
QuerySets are lazy – the act of creating a QuerySet doesn’t involve any database activity. You can stack filters together all day long, and Django won’t actually run the query until the QuerySet is evaluated
but the update() method is actually applied immediately:
The update() method is applied instantly, and the only restriction on the QuerySet that is updated is that it can only update columns in the model’s main table, not on related models.
So while in your code you apply the update() call after your filter, in reality the update runs first: your objects' status is changed before the filter is (lazily) applied, so no records match any more and the result is empty.
mymodels = MyModel.objects.filter(status=1)
objs = [obj for obj in mymodels] # save the objects you are about to update
mymodels.update(status=4)
print(objs)
should work.
The explanation of why was given by Timmy O'Mahony.
In my Django app very often I need to do something similar to get_or_create(). E.g.,
User submits a tag. Need to see if that tag already is in the database. If not, create a new record for it. If it is, just update the existing record.
But looking into the doc for get_or_create() it looks like it's not threadsafe. Thread A checks and finds Record X does not exist. Then Thread B checks and finds that Record X does not exist. Now both Thread A and Thread B will create a new Record X.
This must be a very common situation. How do I handle it in a threadsafe way?
Since 2013 or so, get_or_create is atomic, so it handles concurrency nicely:
This method is atomic assuming correct usage, correct database configuration, and correct behavior of the underlying database. However, if uniqueness is not enforced at the database level for the kwargs used in a get_or_create call (see unique or unique_together), this method is prone to a race-condition which can result in multiple rows with the same parameters being inserted simultaneously.

If you are using MySQL, be sure to use the READ COMMITTED isolation level rather than REPEATABLE READ (the default), otherwise you may see cases where get_or_create will raise an IntegrityError but the object won’t appear in a subsequent get() call.
From: https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
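For reference, on Django 2.0+ the MySQL isolation level can be set via the database OPTIONS in settings.py (a minimal sketch; the database name is a placeholder):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',  # placeholder
        'OPTIONS': {
            'isolation_level': 'read committed',
        },
    }
}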
Here's an example of how you could do it:
Define a model with either unique=True:
class MyModel(models.Model):
    slug = models.SlugField(max_length=255, unique=True)
    name = models.CharField(max_length=255)
MyModel.objects.get_or_create(slug=<user_slug_here>, defaults={"name": <user_name_here>})
... or by using unique_together:
class MyModel(models.Model):
    prefix = models.CharField(max_length=3)
    slug = models.SlugField(max_length=255)
    name = models.CharField(max_length=255)

    class Meta:
        unique_together = ("prefix", "slug")

MyModel.objects.get_or_create(prefix=<user_prefix_here>, slug=<user_slug_here>, defaults={"name": <user_name_here>})
Note how the non-unique fields are in the defaults dict, NOT among the unique fields in get_or_create. This will ensure your creates are atomic.
Here's how it's implemented in Django: https://github.com/django/django/blob/fd60e6c8878986a102f0125d9cdf61c717605cf1/django/db/models/query.py#L466 - try to create the object, catch the IntegrityError if one occurs, and return the existing copy in that case. In other words: let the database handle atomicity.
This must be a very common situation. How do I handle it in a threadsafe way?
Yes.
The "standard" solution in SQL is to simply attempt to create the record. If it works, that's good. Keep going.
If an attempt to create a record gets a "duplicate" exception from the RDBMS, then do a SELECT and keep going.
Django, however, has an ORM layer with its own cache. So the logic is inverted to make the common case work directly and quickly, and the uncommon case (the duplicate) raise a rare exception.
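A minimal sketch of that "try to insert, fall back to select" pattern; the Tag model and its unique name field are illustrative only:

from django.db import IntegrityError, transaction

def get_or_create_tag(name):
    try:
        # the inner atomic block keeps the failed INSERT from
        # breaking the surrounding transaction
        with transaction.atomic():
            return Tag.objects.create(name=name)
    except IntegrityError:
        return Tag.objects.get(name=name)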
Try the transaction.commit_on_success decorator on the callable where you call get_or_create(**kwargs).
"Use the commit_on_success decorator to use a single transaction for all the work done in a function.If the function returns successfully, then Django will commit all work done within the function at that point. If the function raises an exception, though, Django will roll back the transaction."
Apart from that, in concurrent calls to get_or_create, both threads try to get the object with the arguments passed (except for the "defaults" argument, which is a dict used during the create call in case get() fails to retrieve any object). If get() fails in both threads, both try to create the object, resulting in duplicate rows unless a unique/unique_together constraint is enforced at the database level on the field(s) used in the get() call.
It is similar to this post:
How do I deal with this race condition in django?
So many years have passed, but nobody has written about threading.Lock. If, for legacy reasons, you don't have the opportunity to add a unique or unique_together constraint via a migration, you can use Lock or threading.Semaphore objects. Here is the pseudocode:
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

_lock = Lock()

def get_staff(data: dict):
    _lock.acquire()
    try:
        staff, created = MyModel.objects.get_or_create(**data)
        return staff
    finally:
        _lock.release()

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_staff, get_list_of_some_data())
Say I have 2 models:
class Poll(models.Model):
    category = models.CharField(u"Category", max_length=64)
    [...]

class Choice(models.Model):
    poll = models.ForeignKey(Poll)
    [...]
Given a Poll object, I can query its choices with:
poll.choice_set.all()
But, is there a utility function to query all choices from a set of Poll?
Actually, I'm looking for something like the following (which is not supported, and I don't see how it could be):
polls = Poll.objects.filter(category='foo').select_related('choice_set')
for poll in polls:
    print(poll.choice_set.all())  # this shouldn't perform a SQL query at each iteration
I made an (ugly) function to help me achieve that:
def qbind(objects, target_name, model, field_name):
    # Fetch the parent objects and index them by primary key.
    objects = list(objects)
    objects_dict = dict([(object.id, object) for object in objects])
    # One query fetches all related rows, then they are attached in Python.
    for foreign in model.objects.filter(**{field_name + '__in': objects_dict.keys()}):
        id = getattr(foreign, field_name + '_id')
        if id in objects_dict:
            object = objects_dict[id]
            if hasattr(object, target_name):
                getattr(object, target_name).append(foreign)
            else:
                setattr(object, target_name, [foreign])
    return objects
which is used as follow:
polls = Poll.objects.filter(category = 'foo')
polls = qbind(polls, 'choices', Choice, 'poll')
# Now, each object in polls has a 'choices' member with the list of choices.
# This was achieved with 2 SQL queries only.
Is there something easier already provided by Django? Or at least, a snippet doing the same thing in a better way.
How do you handle this problem usually?
Time has passed and this functionality is now available in Django 1.4 with the introduction of the prefetch_related() QuerySet method. It effectively does what the suggested qbind function performs, i.e. two queries are executed and the join occurs in Python land, but now it is handled by the ORM.
The original query request would now become:
polls = Poll.objects.filter(category = 'foo').prefetch_related('choice_set')
As is shown in the following code sample, the polls QuerySet can be used to obtain all Choice objects per Poll without requiring any further database hits:
for poll in polls:
    for choice in poll.choice_set.all():
        print(choice)
Update: Since Django 1.4, this feature is built in: see prefetch_related.
First answer: don't waste time writing something like qbind until you've already written a working application, profiled it, and demonstrated that N queries is actually a performance problem for your database and load scenarios.
But maybe you've done that. So second answer: qbind() does what you'll need to do, but it would be more idiomatic if packaged in a custom QuerySet subclass, with an accompanying Manager subclass that returns instances of the custom QuerySet. Ideally you could even make them generic and reusable for any reverse relation. Then you could do something like:
Poll.objects.filter(category='foo').fetch_reverse_relations('choices_set')
For an example of the Manager/QuerySet technique, see this snippet, which solves a similar problem but for the case of Generic Foreign Keys, not reverse relations. It wouldn't be too hard to combine the guts of your qbind() function with the structure shown there to make a really nice solution to your problem.
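A minimal sketch of that packaging idea, reusing the qbind() helper from the question; QuerySet.as_manager() (Django 1.7+) stands in for the Manager subclass described above:

from django.db import models

class PollQuerySet(models.QuerySet):
    def fetch_reverse_relations(self, target_name, model, field_name):
        # qbind() returns a plain list, so this must be the last
        # call in the chain
        return qbind(self, target_name, model, field_name)

class Poll(models.Model):
    category = models.CharField(max_length=64)
    objects = PollQuerySet.as_manager()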
I think what you're saying is, "I want all Choices for a set of Polls." If so, try this:
polls = Poll.objects.filter(category='foo')
choices = Choice.objects.filter(poll__in=polls)
I think what you are trying to do is known as "eager loading" of child data: loading the child list (choice_set) for each Poll in the first query to the DB, so that you don't have to make a bunch of queries later on.
If this is correct, then what you are looking for is 'select_related' - see https://docs.djangoproject.com/en/dev/ref/models/querysets/#select-related
I noticed you tried select_related but it didn't work. Can you try doing the select_related and then the filter? That might fix it.
UPDATE: This doesn't work, see comments below.