I've got a Django view that I'm trying to optimise. It shows a list of parent objects on a page, along with their children. The child model has the foreign key back to the parent, so select_related doesn't seem to apply.
class Parent(models.Model):
    name = models.CharField(max_length=31)

class Child(models.Model):
    name = models.CharField(max_length=31)
    parent = models.ForeignKey(Parent)
A naive implementation uses n+1 queries, where n is the number of parent objects, i.e. one query to fetch the parent list, then one query per parent to fetch its children.
I've written a view that does the job in two queries - one to fetch the parent objects, another to fetch the related children, then some Python (that I'm far too embarrassed to post here) to put it all back together again.
Once I found myself importing the standard library's collections module I realised that I was probably doing it wrong. There is probably a much easier way, but I lack the Django experience to find it. Any pointers would be much appreciated!
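For reference, the "two queries plus some Python" approach described above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the dictionaries stand in for rows returned by the two queries, and the name attach_children is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for the results of the two queries.
parents = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
children = [
    {"id": 10, "name": "Carol", "parent_id": 1},
    {"id": 11, "name": "Dave", "parent_id": 1},
    {"id": 12, "name": "Eve", "parent_id": 2},
]


def attach_children(parents, children):
    """Group child rows by parent id, then pair each parent with its children."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child["parent_id"]].append(child)
    # Parents without children get an empty list from the defaultdict.
    return [(parent, by_parent[parent["id"]]) for parent in parents]


grouped = attach_children(parents, children)
```

Each element of grouped is a (parent, children) pair that a template could loop over without triggering further queries.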
Add a related_name to the foreign key, then use the prefetch_related method, which was added in Django 1.4:
Returns a QuerySet that will automatically retrieve, in a single
batch, related objects for each of the specified lookups.
This has a similar purpose to select_related, in that both are
designed to stop the deluge of database queries that is caused by
accessing related objects, but the strategy is quite different:
select_related works by creating a SQL join and including the fields
of the related object in the SELECT statement. For this reason,
select_related gets the related objects in the same database query.
However, to avoid the much larger result set that would result from
joining across a 'many' relationship, select_related is limited to
single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to
prefetch many-to-many and many-to-one objects, which cannot be done
using select_related, in addition to the foreign key and one-to-one
relationships that are supported by select_related. It also supports
prefetching of GenericRelation and GenericForeignKey.
class Parent(models.Model):
    name = models.CharField(max_length=31)

class Child(models.Model):
    name = models.CharField(max_length=31)
    parent = models.ForeignKey(Parent, related_name='children')

>>> Parent.objects.all().prefetch_related('children')
All the relevant children will be fetched in a single query, and used
to make QuerySets that have a pre-filled cache of the relevant
results. These QuerySets are then used in the self.children.all()
calls.
Note 1 that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously
cached results, and retrieve data using a fresh database query.
Note 2 that if you use iterator() to run the query, prefetch_related() calls will be ignored since these two
optimizations do not make sense together.
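To make the strategy quoted above concrete, here is a rough sketch of the two queries it implies, written against an in-memory SQLite database rather than Django itself. The table layout, the sample rows and the run helper are assumptions made for illustration; Django's actual SQL and internals will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, name TEXT,
                        parent_id INTEGER REFERENCES parent(id));
    INSERT INTO parent VALUES (1, 'p1'), (2, 'p2'), (3, 'p3');
    INSERT INTO child VALUES (1, 'c1', 1), (2, 'c2', 1), (3, 'c3', 2);
""")

queries = []


def run(sql, params=()):
    """Execute a query and record it so the total can be counted."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# Query 1: the parents.
parents = run("SELECT id, name FROM parent ORDER BY id")

# Query 2: every child of those parents, fetched in a single batch.
parent_ids = [pid for pid, _ in parents]
placeholders = ", ".join("?" * len(parent_ids))
children = run(
    f"SELECT name, parent_id FROM child "
    f"WHERE parent_id IN ({placeholders}) ORDER BY id",
    parent_ids,
)

# The "join" happens in Python rather than in SQL.
children_of = {pid: [] for pid in parent_ids}
for name, parent_id in children:
    children_of[parent_id].append(name)

result = {name: children_of[pid] for pid, name in parents}
```

However many parents there are, the total stays at two queries, which is the point of prefetching across a 'many' relationship.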
If you ever need to work with more than 2 levels at once, you can consider a different approach to storing trees in the database: MPTT.
In a nutshell, it adds extra data to your model that is kept up to date whenever the tree changes and allows much more efficient retrieval.
Actually, select_related is what you are looking for. select_related creates a JOIN so that all the data that you need is fetched in one statement. prefetch_related runs all the queries at once then caches them.
The trick here is to "join in" only what you absolutely need to in order to reduce the performance penalty of the join. "What you absolutely need to" is the long way of saying that you should pre-select only the fields that you will read later in your view or template. There is good documentation here: https://docs.djangoproject.com/en/1.4/ref/models/querysets/#select-related
This is a snippet from one of my models where I faced a similar problem:
return QuantitativeResult.objects.select_related(
    'enrollment__subscription__configuration__analyte',
    'enrollment__subscription__unit',
    'enrollment__subscription__configuration__analyte__unit',
    'enrollment__subscription__lab',
    'enrollment__subscription__instrument_model',
    'enrollment__subscription__instrument',
    'enrollment__subscription__configuration__method',
    'enrollment__subscription__configuration__reagent',
    'enrollment__subscription__configuration__reagent__manufacturer',
    'enrollment__subscription__instrument_model__instrument__manufacturer'
).filter(<snip, snip - stuff edited out>)
In this pathological case, I went down from 700+ queries to just one. The django debug toolbar is your friend when it comes to this sort of issue.
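The difference a JOIN makes for a forward foreign key can be sketched without Django, using an in-memory SQLite database. The lab and result tables, the sample rows and the run helper below are hypothetical stand-ins, not the models from the snippet above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lab (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result (id INTEGER PRIMARY KEY, value REAL,
                         lab_id INTEGER REFERENCES lab(id));
    INSERT INTO lab VALUES (1, 'north'), (2, 'south');
    INSERT INTO result VALUES (1, 4.2, 1), (2, 5.0, 2), (3, 3.7, 1);
""")

queries = []


def run(sql, params=()):
    """Execute a query and record it so the total can be counted."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


def lazy_lookup():
    """One query for the results, then one more per result for its lab."""
    rows = run("SELECT id, value, lab_id FROM result ORDER BY id")
    return [
        (value, run("SELECT name FROM lab WHERE id = ?", (lab_id,))[0][0])
        for _, value, lab_id in rows
    ]


def joined_lookup():
    """A single query that pulls the related lab in via a JOIN."""
    return run(
        "SELECT result.value, lab.name FROM result "
        "JOIN lab ON lab.id = result.lab_id ORDER BY result.id"
    )


lazy = lazy_lookup()
lazy_count = len(queries)

queries.clear()
joined = joined_lookup()
joined_count = len(queries)
```

Both functions return the same data, but the lazy version issues one query per row on top of the initial one, while the joined version issues exactly one.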
I want to limit the queries for a detail view. I want to access multiple many-to-many fields for one class instance in fewer queries. It seems prefetch_related doesn't work with get, and the server hits the database for every many-to-many field.
JobInstance = Job.objects.get(pk=id).prefetch_related('cities').prefetch_related('experience_level')
You can make it work by reordering the calls, like:
job_instance = Job.objects.prefetch_related('cities', 'experience_level').get(pk=id)
.prefetch_related(..) is defined on a QuerySet. Once you call .get(..), you have fetched the object and are no longer working with a QuerySet.
But for a single object, .prefetch_related(..) will not improve efficiency. After all, it still makes two extra queries here to fetch the related objects, exactly as many as you would make by not prefetching and evaluating the related objects of job_instance later.
.prefetch_related(..) is therefore useful when you want to fetch the related objects of multiple objects in bulk.
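The arithmetic behind that claim can be written down as a small back-of-the-envelope function. It is a simplification that assumes one query per relation per object when nothing is prefetched, and one batched query per relation when prefetching; the function name and the sample numbers are made up for illustration.

```python
def query_count(num_objects, num_relations, prefetch):
    """Rough number of queries needed to load objects plus their related sets."""
    base = 1  # the query that fetches the objects themselves
    if prefetch:
        # One batched query per relation, regardless of how many objects.
        return base + num_relations
    # One query per object for each relation that gets accessed.
    return base + num_objects * num_relations


# A single Job with two many-to-many relations: prefetching changes nothing.
single_lazy = query_count(1, 2, prefetch=False)
single_prefetch = query_count(1, 2, prefetch=True)

# Fifty Jobs with the same two relations: prefetching pays off.
many_lazy = query_count(50, 2, prefetch=False)
many_prefetch = query_count(50, 2, prefetch=True)
```

For the single object both counts come out the same, while for fifty objects the lazy count grows with the number of objects and the prefetched count does not.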
I have a simple setup with django-haystack and the whoosh engine. A search yielding 19 objects took 8 seconds. I used the django-debug-toolbar to determine that I had a bunch of repeated queries.
I then updated my search view to prefetch relations, so that duplicate queries would not happen:
class MySearchView(SearchView):
    template_name = 'search_results.html'
    form_class = SearchForm
    queryset = RelatedSearchQuerySet().load_all().load_all_queryset(
        models.Customer, models.Customer.objects.all().select_related('customer_number').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Contact, models.Contact.objects.all().select_related('customer')
    ).load_all_queryset(
        models.Account, models.Account.objects.all().select_related(
            'customer', 'account_number', 'main_contact', 'main_contact__customer'
        )
    ).load_all_queryset(
        models.Invoice, models.Invoice.objects.all().select_related(
            'customer', 'end_customer', 'customer__original', 'end_customer__original', 'quote_number', 'invoice_number'
        )
    ).load_all_queryset(
        models.File, models.File.objects.all().select_related('file_number', 'customer').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Import, models.Import.objects.all().select_related('import_number', 'customer').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Event, models.Event.objects.all().prefetch_related('customers', 'contracts', 'accounts', 'keywords')
    )
But even then, the search still takes 5 seconds. I then used the profiler from django-debug-toolbar, which gave me this information:
From what I can tell, the issue lies in haystack/query:779::__getitem__, which is hit twice, each costing 1.5 seconds. I have glanced through the code in question, but cannot make sense of it. So where do I go from here?
You say in the question:
I then updated my search view to prefetch relations […]
The code you present, though, does not use QuerySet.prefetch_related for most of them. Instead, your sample code uses QuerySet.select_related for most of them; this does not pre-fetch the objects.
The documentation for each of those methods is extensive and can help to decide which is correct for your case.
In particular, the QuerySet.prefetch_related documentation says:
select_related works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python. This allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related. It also supports prefetching of GenericRelation and GenericForeignKey, however, it must be restricted to a homogeneous set of results. For example, prefetching objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.
Try adding
HAYSTACK_LIMIT_TO_REGISTERED_MODELS = False
to your settings.py. As per the docs,
'If your search index is never used for anything other than the models registered with Haystack, you can turn this off and get a small to moderate performance boost.'
It knocked 3-4 seconds off for my project.
I've got a question about foreign key behaviour in Django.
I've defined a tree hierarchy in my models, where a parent-son relation is represented as a foreign key in the son model. Now, starting at the leaf level, I'd like to retrieve the parent, the parent's parent etc. as the objects I've defined.
This is possible by simply calling Leaf.objects.all() and accessing the objects normally from Python code.
But here comes the trouble. For each such access, Django makes a SELECT query for the appropriate foreign ID. This is obviously terribly slow and inefficient. I'd like to tell Django something like "hey, just fetch me all the data including the foreign keys at once, just do the joins and all the stuff at the database side". Is that somehow possible?
Just use select_related():
Leaf.objects.select_related().all()
I have a Django project with 5 different models in it. All of them have a date field. Let's say I want to get all entries from all models with today's date. Of course, I could just filter every model and put the results in one big list, but I believe that's bad. What would be an efficient way to do that?
I don't think that it's a bad idea to query each model separately - indeed, from a database perspective, I can't see how you'd be able to do otherwise, as each model will need a separate SQL query. Even if, as @Nagaraj suggests, you set up a common Date model every other model references, you'd still need to query each model separately. You are probably correct, however, that putting the results into a list is bad practice, unless you actually need to load every object into memory, as explained here:
Be warned, though, that [evaluating a QuerySet as a list] could have a large memory overhead, because Django will load each element of the list into memory. In contrast, iterating over a QuerySet will take advantage of your database to load data and instantiate objects only as you need them.
It's hard to suggest other options without knowing more about your use case. However, I think I'd probably approach this by making a list or dictionary of QuerySets, which I could then use in my view, e.g.:
querysets = [cls.objects.filter(date=now) for cls in [Model1, Model2, Model3]]
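One way to consume such a collection without first building one big list is to chain the iterables lazily. The sketch below uses plain generators as stand-ins for filtered QuerySets, so the entries helper, the model names and the sample dates are all hypothetical; with real QuerySets the same itertools call would apply.

```python
import datetime
import itertools

# Fixed date so the example is deterministic; a view would use today's date.
today = datetime.date(2024, 1, 15)


def entries(model_name, dates):
    """Lazily yield (model_name, date) pairs dated today, like a filtered QuerySet."""
    return ((model_name, d) for d in dates if d == today)


sources = [
    entries("Model1", [today, datetime.date(2024, 1, 14)]),
    entries("Model2", [datetime.date(2024, 1, 13)]),
    entries("Model3", [today, today]),
]

# chain.from_iterable walks each source in turn without copying it first.
combined = itertools.chain.from_iterable(sources)

# Consumed into a list here only so the result is easy to inspect.
todays = [name for name, _ in combined]
```

In a view or template you would iterate over combined directly, so each source is only read when the loop reaches it.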
Take a look at using Multiple Inheritance (docs here) to define those date fields in a class that you can subclass in the classes you want to return in the query.
For example:
class DateStuff(models.Model):
    date = models.DateField()

class MyClass1(DateStuff):
    ...

class MyClass2(DateStuff):
    ...
I believe Django will let you query over the DateStuff class, and it'll return objects from MyClass1 and MyClass2.
Thanks to @nrabinowitz for pointing out my previous error.
Here is the situation: I have a parent model, say BlogPost. It has many Comments. What I want is the list of BlogPosts ordered by the creation date of their Comments, i.e. the blog post which has the newest comment should be at the top of the list. Is this possible with SQLAlchemy?
http://www.sqlalchemy.org/docs/05/mappers.html#controlling-ordering
As of version 0.5, the ORM does not generate ordering for any query unless explicitly configured.
The “default” ordering for a collection, which applies to list-based collections, can be configured using the order_by keyword argument on relation():
I had the same question as the parent when using the ORM, and GHZ's link contained the answer on how it's possible. In sqlalchemy, assuming BlogPost.comments is a mapped relation to the Comments table, you can't do:
session.query(BlogPost).order_by(BlogPost.comments.creationDate.desc())
, but you can do:
session.query(BlogPost).join(Comments).order_by(Comments.creationDate.desc())
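At the SQL level, a plain join produces one row per comment, so a post with several comments appears several times. One possible way to get each post once, ordered by its newest comment, is to group by post and sort on the latest comment date. The sketch below shows that idea with the standard sqlite3 module rather than SQLAlchemy; the table names, column names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog_post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comment (id INTEGER PRIMARY KEY,
                          post_id INTEGER REFERENCES blog_post(id),
                          creation_date TEXT);
    INSERT INTO blog_post VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO comment VALUES
        (1, 1, '2010-01-01'), (2, 2, '2010-01-05'),
        (3, 1, '2010-01-09'), (4, 3, '2010-01-03');
""")

# One row per post, sorted by the date of its most recent comment.
rows = conn.execute("""
    SELECT blog_post.title, MAX(comment.creation_date) AS newest
    FROM blog_post
    JOIN comment ON comment.post_id = blog_post.id
    GROUP BY blog_post.id
    ORDER BY newest DESC
""").fetchall()

titles = [title for title, _ in rows]
```

Note that the inner join drops posts that have no comments at all; an outer join would be needed to keep them.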