How to optimize DRF number queries when using properties in serializers fields?

I am working on a DRF API and I am not completely familiar with Django properties.
The DB relationships are classic. Companies have different jobs to which candidates can apply. Each job has several matches, a match being a join table between a job and a candidate. Matches have different statuses representing the different phases of an application process.
So here is the deal:
I am using a DRF viewset to get data from the API. This viewset uses a serializer to get specific fields, specifically the number of matches per status for a job. The simplified version of the serializer looks something like this:
class Team2AmBackofficeSerializer(Normal2JobSerializer):
    class Meta:
        model = Job
        fields = (
            'pk',
            'name',
            'company',
            'company_name',
            'job__nb_matches_proposition',
            'job__nb_matches_preselection',
            'job__nb_matches_valides',
            'job__nb_matches_pitches',
            'job__nb_matches_entretiens',
            'job__nb_matches_offre',
        )
The job__xxx fields use the @property decorator, for instance:
@property
def job__nb_matches_offre(self):
    return self.matches.filter(current_status__step_name='Offre').count()
The problem is each time I add one of these properties to my serializer's fields, the number of DB queries increases significantly. This is of course due to the fact that each property calls the DB multiple times. So here is my question:
Is there a way to optimize the number of queries made to the DB, either by changing something in the serializer or by getting the number of matches for a specific status in a different manner?
I have had a look at select_related and prefetch_related. They allow me to reduce the number of queries when getting information about the company, but not really for the number of matches.
Any help is greatly appreciated :)

What you want is to annotate your queryset with these values which will result in the database doing all the counting in just one query. The result is significantly faster than your current solution.
Example:
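from django.db.models import Count, Q

Job.objects.annotate(
    nb_matches_offre=Count(
        'matches',
        filter=Q(matches__current_status__step_name='Offre'),
    ),
    nb_matches_entretiens=Count(...),
)
The resulting queryset will contain Job objects that carry the attributes job_obj.nb_matches_offre and job_obj.nb_matches_entretiens with the counts. This assumes the reverse relation from Job to its matches is named matches, as in the property above. Note that the annotation names are passed as keyword arguments, not as quoted strings.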
See also https://docs.djangoproject.com/en/3.0/topics/db/aggregation/
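To see what "the database doing all the counting in just one query" looks like, here is a minimal sketch that uses the standard library's sqlite3 module instead of Django. The job and job_match tables, their columns, and the sample rows are hypothetical stand-ins for the models in the question, and the conditional COUNT is only an approximation of the kind of SQL a filtered Count can produce, not the exact statement Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE job_match (
        id INTEGER PRIMARY KEY,
        job_id INTEGER REFERENCES job(id),
        step_name TEXT
    );
    INSERT INTO job VALUES (1, 'Backend dev'), (2, 'Designer'), (3, 'Ops');
    INSERT INTO job_match (job_id, step_name) VALUES
        (1, 'Offre'), (1, 'Offre'), (1, 'Entretiens'), (2, 'Entretiens');
""")

# One query returns every job together with one count per status.
# CASE yields NULL for non-matching rows, and COUNT ignores NULLs.
rows = conn.execute("""
    SELECT j.id,
           j.name,
           COUNT(CASE WHEN m.step_name = 'Offre' THEN 1 END)
               AS nb_matches_offre,
           COUNT(CASE WHEN m.step_name = 'Entretiens' THEN 1 END)
               AS nb_matches_entretiens
    FROM job AS j
    LEFT JOIN job_match AS m ON m.job_id = j.id
    GROUP BY j.id, j.name
    ORDER BY j.id
""").fetchall()

for row in rows:
    print(row)
```

Adding another status only adds another column to the same query, whereas each extra property in the serializer adds one more query per job.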

Related

Django performant way to get a queryset of a massive (I'm talking huge) list of ids in order

Pretty much the same flavor as: Django get a QuerySet from array of id's in specific order. I tried https://stackoverflow.com/a/37648265/4810639
But my list of ids is huge (> 50000) and both qs = Foo.objects.filter(id__in=id_list) and qs = qs.order_by(preserved) buckle under the strain.
Note: I need a queryset due to the specific django method I'm overriding so anything returning a list won't work.
EDIT: In response to the comments I'm specifically overriding the get_search_results() in the admin. My search engine returns the id of the model(s) that match the query. But get_search_results() needs to return a queryset. Hence the large list of id's.

I did this by creating a FakeQueryset class that had enough of the functions of a regular queryset that it was able to act like one. Then when I needed to display it I would hand it over to a custom paginator that would only pull a few ids from the database at a time. Duck typing for the win!
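The answer does not show its FakeQueryset, so the following is only a guess at the general shape: a duck-typed object that knows its length and supports slicing, but loads rows only for the slice a paginator asks for. The class name LazyIdSequence, the fetch_many callable, and fake_fetch are hypothetical; in a real project the fetch step might be something like Foo.objects.in_bulk(page_ids), and the admin may need more queryset methods than this sketch provides.

```python
class LazyIdSequence:
    """Queryset look-alike backed by an ordered list of ids (hypothetical)."""

    def __init__(self, ids, fetch_many):
        self._ids = list(ids)          # ordered ids, e.g. from a search engine
        self._fetch_many = fetch_many  # callable: list of ids -> {id: row}

    def __len__(self):
        return len(self._ids)

    def count(self):
        # Paginators typically ask for a count before slicing.
        return len(self._ids)

    def __getitem__(self, key):
        if isinstance(key, slice):
            page_ids = self._ids[key]
            rows = self._fetch_many(page_ids)
            # Re-order fetched rows so they follow the original id order.
            return [rows[i] for i in page_ids if i in rows]
        wanted = self._ids[key]
        return self._fetch_many([wanted])[wanted]


fetched = []  # records which ids were actually loaded


def fake_fetch(ids):
    """Stand-in for a database lookup; remembers what was requested."""
    fetched.extend(ids)
    return {i: {"id": i, "name": f"item {i}"} for i in ids}


results = LazyIdSequence(range(50000, 0, -1), fake_fetch)
page = results[20:30]  # only these ten ids are fetched
print(len(results), len(fetched), page[0]["id"])
```

The full id list stays in memory, but only one page of rows is ever requested from the backing store at a time.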

Is using prefetch_related to retrieve related (parents, children pages) possible In Wagtail Pages?

The problem I am facing is a large number of duplicate queries, which makes the app slow when using get_parent() or get_children() on the Page model. The number also increases if the parent page has an image file that is used in the template.
So I am looking for a way to prefetch related pages without adding a foreign key relation.
Let us say I have a TvSeries Page model and an Episode Page model:
class TvSeries(Page):
    name = models.CharField()
    producer = models.CharField()
    subpage_types = ['Episode']

class Episode(Page):
    title = models.CharField()
    number = models.CharField()
    parent_page_types = ['TvSeries']
I need to prefetch the TvSeries when querying the Episode model. How can I reduce the database calls? Is it possible to use prefetch_related and select_related? If yes, how? And if not, what is the solution to the increased number of queries?

prefetch_related cannot be used for parent/child page relations, because they do not use a standard Django ForeignKey relation - instead, Wagtail (and Treebeard) uses a path field to represent the tree position. This makes it possible to perform queries that can't be done efficiently with a ForeignKey, such as fetching all descendants (at any depth) of a page.
It should be noted that prefetch_related is not "free" - it will generate one extra query for every relation followed. Treebeard's query methods will usually be equal or better than this in efficiency - for example:
series = TvSeries.objects.get(id=123)
episodes = series.get_children()
will fetch a TvSeries and all of its episodes in two queries, just as a (hypothetical) prefetch_related expression would:
# fake code, will not work...
series = TvSeries.objects.filter(id=123).prefetch_related('child_pages')
However, one issue with get_children is that it will only return basic Page instances, so further queries are required to retrieve the specific fields from Episode. You can avoid this by using child_of instead:
series = TvSeries.objects.get(id=123)
episodes = Episode.objects.child_of(series)
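As a rough illustration of why a path field makes descendant lookups cheap, here is a simplified pure-Python sketch of a materialized path tree. It is not Treebeard's code: the fixed step length of 4 characters, the pages dictionary, and the helper names are assumptions made for the example. In SQL, the prefix test below would become a single path LIKE '00010001%' condition.

```python
STEPLEN = 4  # characters per tree level (an assumption for this sketch)

# path -> page title; each child path is its parent's path plus one step
pages = {
    "0001": "Root",
    "00010001": "TvSeries: Dark",
    "000100010001": "Episode 1",
    "000100010002": "Episode 2",
    "00010002": "TvSeries: Lost",
    "000100020001": "Episode 1 (Lost)",
}


def depth(path):
    """Tree depth is just the number of steps in the path."""
    return len(path) // STEPLEN


def descendants(parent_path):
    """All pages below parent_path, at any depth: a plain prefix match."""
    return [
        title
        for path, title in sorted(pages.items())
        if path.startswith(parent_path) and path != parent_path
    ]


def children(parent_path):
    """Direct children only: prefix match plus a depth check."""
    return [
        title
        for path, title in sorted(pages.items())
        if path.startswith(parent_path)
        and depth(path) == depth(parent_path) + 1
    ]


print(children("00010001"))
print(descendants("0001"))
```

With a plain parent foreign key, fetching every descendant would need one query per level (or a recursive query); with a path, one prefix filter covers the whole subtree.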

Django haystack whoosh super slow

I have a simple setup with django-haystack and the Whoosh engine. A search yielding 19 objects took 8 seconds. I used the django-debug-toolbar to determine that I had a bunch of repeated queries.
I then updated my search view to prefetch relations, so that duplicate queries would not happen:
class MySearchView(SearchView):
    template_name = 'search_results.html'
    form_class = SearchForm
    queryset = RelatedSearchQuerySet().load_all().load_all_queryset(
        models.Customer, models.Customer.objects.all().select_related('customer_number').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Contact, models.Contact.objects.all().select_related('customer')
    ).load_all_queryset(
        models.Account, models.Account.objects.all().select_related(
            'customer', 'account_number', 'main_contact', 'main_contact__customer'
        )
    ).load_all_queryset(
        models.Invoice, models.Invoice.objects.all().select_related(
            'customer', 'end_customer', 'customer__original', 'end_customer__original', 'quote_number', 'invoice_number'
        )
    ).load_all_queryset(
        models.File, models.File.objects.all().select_related('file_number', 'customer').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Import, models.Import.objects.all().select_related('import_number', 'customer').prefetch_related(
            'keywords'
        )
    ).load_all_queryset(
        models.Event, models.Event.objects.all().prefetch_related('customers', 'contracts', 'accounts', 'keywords')
    )
But even then, the search still takes 5 seconds. I then used the profiler from django-debug-toolbar, which gave me this information:
From what I can tell, the issue lies in haystack/query:779::__getitem__, which is hit twice, each costing 1.5 seconds. I have glanced through the code in question, but cannot make sense of it. So where do I go from here?

You say in the question:
I then updated my search view to prefetch relations […]
The code you present, though, does not use QuerySet.prefetch_related for most of them. Instead, your sample code uses QuerySet.select_related for most of them; this does not pre-fetch the objects.
The documentation for each of those methods is extensive and can help to decide which is correct for your case.
In particular, the QuerySet.prefetch_related documentation says:
select_related works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python. This allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related. It also supports prefetching of GenericRelation and GenericForeignKey, however, it must be restricted to a homogeneous set of results. For example, prefetching objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.

Try adding
HAYSTACK_LIMIT_TO_REGISTERED_MODELS = False
to your settings.py. As per the docs,
'If your search index is never used for anything other than the models registered with Haystack, you can turn this off and get a small to moderate performance boost.'
It knocked 3-4 seconds off for my project.

Tastypie: queryset = Model.objects.all()

I am a newbie to Tastypie. I see that Tastypie calls Django models using a queryset and displays the data.
My question is: if Tastypie builds the statement queryset = < DJANGO-MODEL >.objects.all(),
will it put a tremendous load on the database/backend if there are 100 million objects?
class RestaurentsResource(ModelResource):
    class Meta:
        queryset = Restaurents.objects.all()
        print(queryset)
        resource_name = 'restaurents'

Django querysets are lazy: https://docs.djangoproject.com/en/dev/topics/db/queries/#querysets-are-lazy, so no database activity will be carried out until the queryset is evaluated.
If you return every object from your REST interface at once, then a 'tremendous' load will be placed on your server. Usually pagination (http://django-tastypie.readthedocs.org/en/latest/paginator.html) or something similar is used to prevent this.
Calling print on the queryset as in the example class above, will force evaluation. Doing this in production code is a bad idea, although it can be handy when debugging or as a learning tool.

The two other answers are correct in terms of QuerySets being lazy. But on top of that, the queryset you specify in the Meta class is the base for the query. In Django, a QuerySet is essentially the representation of a database query, but is not executed. QuerySets can be additionally filtered before a query is executed.
So you could have code that looks like:
Restaurant.objects.all().filter(attribute1=something).filter(attribute2=something_else)
Tastypie just uses the QuerySet you provide as the base. On each API access, it adds additional filters to the base before executing the new query. Tastypie also handles some pagination, so you can get paginated results so not every row is returned.
While using all() is very normal, this feature is most useful if you want to limit your Tastypie results. For example, if your Restaurant resource has a 'hidden' field, you might set:
class Meta:
    queryset = Restaurant.objects.filter(hidden=False)
All queries generated by the API will use the given queryset as the base, and won't show any rows where 'hidden=True'.
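The idea of a base query that is refined per request and only executed when its results are needed can be sketched with a toy class. LazyQuery below is a hypothetical illustration in plain Python, not how Django or Tastypie implement it: filter() returns a new object without doing any work, and the "database" (here a list of dicts) is touched only on iteration.

```python
class LazyQuery:
    """Toy model of a lazy, chainable query (not Django's implementation)."""

    executions = 0  # how many times a query was actually evaluated

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)

    def filter(self, **conditions):
        # Build a new query object; nothing is executed here.
        def predicate(row):
            return all(row.get(k) == v for k, v in conditions.items())

        return LazyQuery(self._rows, self._predicates + (predicate,))

    def __iter__(self):
        # Evaluation happens only when the results are iterated.
        LazyQuery.executions += 1
        matching = [
            row for row in self._rows
            if all(check(row) for check in self._predicates)
        ]
        return iter(matching)


restaurants = [
    {"name": "Chez Max", "hidden": False, "city": "Paris"},
    {"name": "Secret Spot", "hidden": True, "city": "Paris"},
    {"name": "La Table", "hidden": False, "city": "Lyon"},
]

base = LazyQuery(restaurants).filter(hidden=False)  # plays the role of Meta.queryset
per_request = base.filter(city="Paris")             # extra filter added per API call
executions_before = LazyQuery.executions            # nothing has run yet
names = [row["name"] for row in per_request]        # evaluated here
print(executions_before, LazyQuery.executions, names)
```

Defining the base query costs nothing by itself; the cost is paid once, for the narrowed query, when the results are read.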

Django QuerySet objects are evaluated lazily, that is, the result is fetched from the database only when it is really needed. In this case, queryset = Restaurents.objects.all() creates a QuerySet that has not yet been evaluated.
The default implementation of ModelResource usually forces the queryset to be evaluated at dehydration time or paging. The first one requires model objects to be passed, the other one slices the queryset.
Custom views, authorization, or filtering methods can force the evaluation earlier.
That said, after all the filtering and paging, the list of results fetched is considerably smaller than the overall amount of data in the database.

List of parents objects and their children with fewer queries

I've got a Django view that I'm trying to optimise. It shows a list of parent objects on a page, along with their children. The child model has the foreign key back to the parent, so select_related doesn't seem to apply.
class Parent(models.Model):
    name = models.CharField(max_length=31)

class Child(models.Model):
    name = models.CharField(max_length=31)
    parent = models.ForeignKey(Parent)
A naive implementation uses n+1 queries, where n is the number of parent objects, ie. one query to fetch the parent list, then one query to fetch the children of each parent.
I've written a view that does the job in two queries - one to fetch the parent objects, another to fetch the related children, then some Python (that I'm far too embarrassed to post here) to put it all back together again.
Once I found myself importing the standard library's collections module I realised that I was probably doing it wrong. There is probably a much easier way, but I lack the Django experience to find it. Any pointers would be much appreciated!

Add a related_name to the foreign key, then use the prefetch_related method, which was added in Django 1.4. From the documentation:
Returns a QuerySet that will automatically retrieve, in a single
batch, related objects for each of the specified lookups.
This has a similar purpose to select_related, in that both are
designed to stop the deluge of database queries that is caused by
accessing related objects, but the strategy is quite different:
select_related works by creating a SQL join and including the fields
of the related object in the SELECT statement. For this reason,
select_related gets the related objects in the same database query.
However, to avoid the much larger result set that would result from
joining across a 'many' relationship, select_related is limited to
single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to
prefetch many-to-many and many-to-one objects, which cannot be done
using select_related, in addition to the foreign key and one-to-one
relationships that are supported by select_related. It also supports
prefetching of GenericRelation and GenericForeignKey.
class Parent(models.Model):
    name = models.CharField(max_length=31)

class Child(models.Model):
    name = models.CharField(max_length=31)
    parent = models.ForeignKey(Parent, related_name='children')

>>> Parent.objects.all().prefetch_related('children')
All the relevant children will be fetched in a single query, and used
to make QuerySets that have a pre-filled cache of the relevant
results. These QuerySets are then used in the self.children.all()
calls.
Note 1 that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously
cached results, and retrieve data using a fresh database query.
Note 2 that if you use iterator() to run the query, prefetch_related() calls will be ignored since these two
optimizations do not make sense together.
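The strategy the documentation describes (one lookup per relationship, with the joining done in Python) is essentially the two-query approach from the question. A minimal sketch of it with the standard library's sqlite3 module follows; the parent and child tables, the sample rows, and the run helper are hypothetical, and this is an illustration of the idea rather than Django's actual code.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        name TEXT,
        parent_id INTEGER REFERENCES parent(id)
    );
    INSERT INTO parent VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO child (name, parent_id) VALUES
        ('Al', 1), ('Amy', 1), ('Ben', 2);
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, so the total is easy to check."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


# Query 1: the parents.
parents = run("SELECT id, name FROM parent ORDER BY id")

# Query 2: every child of those parents, in a single IN (...) lookup.
parent_ids = [parent_id for parent_id, _ in parents]
placeholders = ", ".join("?" for _ in parent_ids)
children = run(
    "SELECT parent_id, name FROM child "
    f"WHERE parent_id IN ({placeholders}) ORDER BY id",
    parent_ids,
)

# The 'join' happens in Python: group children by their parent's id.
children_by_parent = defaultdict(list)
for parent_id, child_name in children:
    children_by_parent[parent_id].append(child_name)

family = {name: children_by_parent[pid] for pid, name in parents}
print(query_count, family)
```

The query count stays at two no matter how many parents there are, compared with n+1 for the naive loop.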

If you ever need to work with more than 2 levels at once, you can consider a different approach to storing trees in the database, such as MPTT (modified preorder tree traversal).
In a nutshell, it adds extra fields to your model that are maintained on every write and allow much more efficient retrieval.

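Actually, select_related may be what you are looking for, as long as you query from the side that holds the foreign key (here, from child to parent). select_related creates a JOIN so that all the data that you need is fetched in one statement. prefetch_related, by contrast, runs one extra query per relation and caches the results.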
The trick here is to "join in" only what you absolutely need to in order to reduce the performance penalty of the join. "What you absolutely need to" is the long way of saying that you should pre-select only the fields that you will read later in your view or template. There is good documentation here: https://docs.djangoproject.com/en/1.4/ref/models/querysets/#select-related
This is a snippet from one of my models where I faced a similar problem:
return QuantitativeResult.objects.select_related(
    'enrollment__subscription__configuration__analyte',
    'enrollment__subscription__unit',
    'enrollment__subscription__configuration__analyte__unit',
    'enrollment__subscription__lab',
    'enrollment__subscription__instrument_model',
    'enrollment__subscription__instrument',
    'enrollment__subscription__configuration__method',
    'enrollment__subscription__configuration__reagent',
    'enrollment__subscription__configuration__reagent__manufacturer',
    'enrollment__subscription__instrument_model__instrument__manufacturer'
).filter(<snip, snip - stuff edited out>)
In this pathological case, I went down from 700+ queries to just one. The django debug toolbar is your friend when it comes to this sort of issue.
