Django prefetch_related and N+1 - How is it solved? - python

I am sitting with a query looking like this:
# Get the amount of kilo attached to products
product_data = {}
for productSpy in ProductSpy.objects.all():
    product_data[productSpy.product.product_id] = productSpy.kilo  # RERUN
I do not see how I would be able to use prefetch_related on my last line. The examples in the docs are very simplified and somehow make sense, but I do not understand the whole concept well enough to work this out myself. Could someone please explain what is being done and how? I find this very important to understand, and this is the first N+1 problem I have met.
Thank you up front for your time.
models.py
class ProductSpy(models.Model):
    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)

    def __str__(self):
        return self.kilo

class Product(models.Model):
    product_id = models.IntegerField()
    name = models.CharField(max_length=150)

    def __str__(self):
        return self.name

Django fetches related objects lazily, at the moment they are accessed:
each access to productSpy.product issues a query against the product table, looking the row up by productSpy.product_id.
Because every query pays I/O latency, this code is highly inefficient. Using prefetch_related fetches the products for all the ProductSpy objects in one additional query, resulting in better performance.
# Get the amount of kilo attached to products
product_data = {}
# prefetch_related returns a new queryset, so keep the result.
# Only relations can be prefetched; kilo appears to be a plain field and needs no prefetch.
product_spies = ProductSpy.objects.prefetch_related('product')
for productSpy in product_spies:
    product_data[productSpy.product.product_id] = productSpy.kilo  # no extra query per row

When one writes productSpy.product and the related object has not already been fetched, Django automatically makes a query to the database to get the related Product instance. Hence, if ProductSpy.objects.all() returns N instances, writing productSpy.product in a loop makes N more queries, which is what we call the N + 1 problem.
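To make the query counts concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Django itself. The table and column names (product_fk, kilo) and the run helper are assumptions made for illustration, and the batched lookup only approximates what prefetch_related does under the hood.

```python
import sqlite3

# Hypothetical tables mirroring the Product / ProductSpy models above.
# A kilo column is assumed because the question reads productSpy.kilo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, product_id INTEGER, name TEXT);
    CREATE TABLE productspy (id INTEGER PRIMARY KEY, product_fk INTEGER, kilo INTEGER);
    INSERT INTO product VALUES (1, 100, 'apple'), (2, 200, 'pear'), (3, 300, 'plum');
    INSERT INTO productspy VALUES (1, 1, 5), (2, 2, 7), (3, 3, 9);
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, so the strategies can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


# Naive loop: one query for the spies, then one more per spy for its product.
query_count = 0
naive = {}
for spy_id, product_fk, kilo in run("SELECT id, product_fk, kilo FROM productspy"):
    (product_id,) = run("SELECT product_id FROM product WHERE id = ?", (product_fk,))[0]
    naive[product_id] = kilo
naive_queries = query_count  # 1 + N

# Batched lookup, similar in spirit to prefetch_related: two queries in total.
query_count = 0
spies = run("SELECT id, product_fk, kilo FROM productspy")
fks = sorted({fk for _, fk, _ in spies})
placeholders = ",".join("?" * len(fks))
products = dict(run(f"SELECT id, product_id FROM product WHERE id IN ({placeholders})", fks))
batched = {products[fk]: kilo for _, fk, kilo in spies}
batched_queries = query_count

print(naive_queries, batched_queries)  # 4 2
```

With three rows the difference is small, but the naive count grows with the number of ProductSpy rows while the batched count stays at two.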
Moving further, although you can use prefetch_related here (it will use 2 queries in your case), it would be better to use select_related (see the Django docs), which uses a SQL JOIN (an INNER JOIN here, since the foreign key is not nullable) and gets you the related instances in a single query:
product_data = {}
queryset = ProductSpy.objects.select_related('product')
for productSpy in queryset:
    product_data[productSpy.product.product_id] = productSpy.kilo  # No extra queries as we used `select_related`
Note: There seems to be some problem with your logic here though, as multiple ProductSpy instances can have the same Product,
hence your loop might overwrite some values.
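The overwriting issue from the note can be seen in a small sketch, again with plain sqlite3 instead of Django. The schema is hypothetical (in particular the kilo column and the product_fk name), and the JOIN is only comparable to what select_related('product') would issue.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema; kilo is assumed to be a column on the spy table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, product_id INTEGER);
    CREATE TABLE productspy (id INTEGER PRIMARY KEY, product_fk INTEGER, kilo INTEGER);
    INSERT INTO product VALUES (1, 100), (2, 200);
    -- Two spies point at the same product (id 1).
    INSERT INTO productspy VALUES (1, 1, 5), (2, 1, 8), (3, 2, 7);
""")

# One JOIN query, comparable to what select_related('product') would issue.
rows = conn.execute("""
    SELECT p.product_id, s.kilo
    FROM productspy AS s
    JOIN product AS p ON p.id = s.product_fk
    ORDER BY s.id
""").fetchall()

# Plain dict assignment keeps only the last kilo seen for each product.
overwritten = {}
for product_id, kilo in rows:
    overwritten[product_id] = kilo

# Collecting into lists keeps every value.
collected = defaultdict(list)
for product_id, kilo in rows:
    collected[product_id].append(kilo)

print(overwritten)      # {100: 8, 200: 7}
print(dict(collected))  # {100: [5, 8], 200: [7]}
```

Whether you want a list, a sum, or the latest value per product depends on what the kilo figure means in your application.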

Related

Reducing Django Database Queries

I have a very large and growing dataset, and I need to create many filters, but it is going to get out of control quickly. I was hoping someone could help me combine some of the queries into a single call. Below is the start of my view.
Call #1 - for loop to display table of all results
traffic = Traffic.objects.all()
Call #2 - Combined aggregate sum query
totals = Traffic.objects.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
Call #3, #4, #5, #6 and so on... - To filter the database by month and day of week
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=2).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=3).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=4).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=5).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=6).aggregate(Sum('sessions'))
The problem is that I need to create several dozen more filters, because I have 3 years of data with multiple data points per column that we need summed totals for.
Questions:
Can I combine call #1 into call #2?
Can I use call #2 to query the sums for call #3, so I don't have to fetch all objects from the database to filter them and then do this a couple of dozen more times?
As you can see, this is going to get out of control very quickly. Any help would be hugely appreciated. Thank you.
Updated to add
Traffic Model
class Timestamp(models.Model):
    created = models.DateField()

    class Meta:
        abstract = True

class Traffic(Timestamp):
    sessions = models.IntegerField(blank=True, null=True)
    new_users = models.IntegerField(blank=True, null=True)
    reminder = models.IntegerField(blank=True, null=True)
    campaigns = models.IntegerField(blank=True, null=True)
    new_sales = models.IntegerField(blank=True, null=True)
    sales_renewals = models.IntegerField(blank=True, null=True)

    # Meta and String
    class Meta:
        verbose_name = 'Traffic'
        verbose_name_plural = 'Traffic Data'

    def __str__(self):
        return "%s" % self.created
There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them. Here are some quick tips for query optimization:
1) iterator()
Use this if you are accessing the queryset only once. For example,
traffic = Traffic.objects.all()
for t in traffic.iterator():
    ...
    ...
2) db_index=True
Set this when defining the fields of your models. As the Django documentation says,
This is a number one priority, after you have determined from
profiling what indexes should be added. Use Field.db_index or
Meta.index_together to add these from Django. Consider adding indexes
to fields that you frequently query using filter(), exclude(),
order_by(), etc. as indexes may help to speed up lookups.
Hence you can modify your model as below. Keep in mind that an index pays off mainly on fields you filter or order by (created in the queries above); fields that are only summed usually benefit little from one.
class Traffic(Timestamp):
    sessions = models.IntegerField(blank=True, null=True, db_index=True)
    new_users = models.IntegerField(blank=True, null=True, db_index=True)
    reminder = models.IntegerField(blank=True, null=True, db_index=True)
    campaigns = models.IntegerField(blank=True, null=True, db_index=True)
    new_sales = models.IntegerField(blank=True, null=True, db_index=True)
3) prefetch_related() or select_related()
If you have relations within your models, using prefetch_related or select_related would be a choice. As per the Django documentation,
select_related works by creating a SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a 'many' relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related.
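In short, select_related does a join, while prefetch_related does a separate query per relationship. Used where relations are actually traversed, either can noticeably cut the number of queries; the actual gain depends on your data and access pattern.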
4) Django Pagination
If your template design allows you to display results across multiple pages, you can use Pagination.
5) Querysets are Lazy
You also need to understand that Django querysets are lazy, which means a queryset won't query the database until it is used/evaluated. A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example,
traffic = Traffic.objects.all()
The above code doesn't run any database queries. You can take the traffic queryset and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications. To fetch the data from the database, you need to iterate over the queryset:
for t in traffic.iterator():
    print(t.sessions)
6) django-debug-toolbar
Django Debug Toolbar is a configurable set of panels that display various debug information about the current request/response and when clicked, display more details about the panel's content. This includes:
Request timer
SQL queries including time to execute and links to EXPLAIN each query
Modifying your code: (remember Querysets are Lazy)
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
t_2014 = traffic.filter(created__year='2014')
t_sessions_2014_wd2 = t_2014.filter(created__week_day=2).aggregate(Sum('sessions'))
...
...
For Call #1 in template (for loop to display table of all results):
{% for t in traffic.iterator %}
    {{ t.sessions }}
    ...
    ...
{% endfor %}
As for Question 1, it shouldn't be a problem to reuse the queryset from the first call.
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
Because querysets are lazy, building on the same base queryset costs nothing extra, although aggregate() still runs its own query when it is called.
Regarding Question 2, you can again reuse the queryset from the first call, and filter the year, which gives you a new queryset, e.g.
traffic_2014 = traffic.filter(created__year='2014')
You can then continue filtering the days and aggregating with this new queryset, like you did before, or create new querysets for each day, assuming you aggregate multiple attributes every day, thus saving you another dozen database calls.
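Going one step further than reusing the queryset, the per-weekday sums can be collected in a single grouped query instead of one query per day. The sketch below shows the idea with plain sqlite3; the table layout and sample rows are assumptions, and the ORM call in the comment is only a rough equivalent (ExtractWeekDay lives in django.db.models.functions in recent Django versions).

```python
import sqlite3

# Hypothetical table standing in for the Traffic model (only two columns kept).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traffic (id INTEGER PRIMARY KEY, created TEXT, sessions INTEGER);
    INSERT INTO traffic (created, sessions) VALUES
        ('2014-01-06', 10),  -- Monday
        ('2014-01-13', 15),  -- Monday
        ('2014-01-07', 20),  -- Tuesday
        ('2015-01-05', 99);  -- Monday, but a different year
""")

# Rough ORM equivalent (an assumption, not taken from the answers above):
#   Traffic.objects.filter(created__year=2014)
#       .annotate(week_day=ExtractWeekDay('created'))
#       .values('week_day')
#       .annotate(total=Sum('sessions'))
#
# SQLite's %w is 0 for Sunday; adding 1 matches Django's week_day numbering
# (1 = Sunday, 2 = Monday, ...).
rows = conn.execute("""
    SELECT CAST(strftime('%w', created) AS INTEGER) + 1 AS week_day,
           SUM(sessions) AS total
    FROM traffic
    WHERE strftime('%Y', created) = '2014'
    GROUP BY week_day
    ORDER BY week_day
""").fetchall()

sessions_by_week_day = dict(rows)
print(sessions_by_week_day)  # {2: 25, 3: 20}
```

One grouped query per year (or a single query grouped by year and weekday) replaces the several dozen filter-and-aggregate calls.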
I hope this helps you.
Not addressing the questions directly, but I think you should consider a different approach.
Based on my understanding:
The view may be requested often.
The data should change rarely.
There is a need for complicated data manipulation (summing fields by year, month, day etc.)
There is no need to perform the same queries each time someone requests the view.
Load all the data in one step and perform the manipulations inside the view. You can use a library like Pandas to build complicated data sets. The view will now be CPU-bound, so use a caching system like Redis to avoid recalculating, and invalidate the cache when the data has changed.
Another approach: perform the calculations periodically by using a task queue like Celery and populate Redis.
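The compute-once, cache, and invalidate pattern described above can be sketched in plain Python. Everything here is hypothetical: a dict stands in for Redis, a list of dicts stands in for the Traffic rows, and the function names are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory rows standing in for Traffic records.
ROWS = [
    {"created": date(2014, 1, 6), "sessions": 10},
    {"created": date(2014, 1, 7), "sessions": 20},
    {"created": date(2014, 1, 13), "sessions": 15},
]

_cache = {}        # stand-in for Redis; a real setup would use a cache client
compute_calls = 0  # counts how often the expensive path runs


def compute_totals(rows):
    """Sum sessions per (year, ISO weekday) in a single pass over the rows."""
    global compute_calls
    compute_calls += 1
    totals = defaultdict(int)
    for row in rows:
        key = (row["created"].year, row["created"].isoweekday())
        totals[key] += row["sessions"]
    return dict(totals)


def get_totals():
    """Return cached totals, computing them only on a cache miss."""
    if "totals" not in _cache:
        _cache["totals"] = compute_totals(ROWS)
    return _cache["totals"]


def invalidate():
    """Call this whenever the underlying data changes."""
    _cache.pop("totals", None)


first = get_totals()
second = get_totals()  # served from the cache, no recomputation
ROWS.append({"created": date(2014, 1, 20), "sessions": 5})
invalidate()
third = get_totals()   # recomputed after invalidation
print(compute_calls)   # 2
```

In a Django project the invalidation would typically be triggered wherever new Traffic rows are saved, or the totals would be refreshed by the periodic task instead.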

Nested chain vs duplicated information

There is a models.py with 4 models.
The standard version is:
class Main(models.Model):
    stuff = models.IntegerField()

class Second(models.Model):
    nested = models.ForeignKey(Main)
    stuff = models.IntegerField()

class Third(models.Model):
    nested = models.ForeignKey(Second)
    stuff = models.IntegerField()

class Last(models.Model):
    nested = models.ForeignKey(Third)
    stuff = models.IntegerField()
and there is another variant of Last model:
class Last(models.Model):
    nested1 = models.ForeignKey(Main)
    nested2 = models.ForeignKey(Second)
    nested = models.ForeignKey(Third)
    stuff = models.IntegerField()
Will this variant save some database load?
The information in nested1 and nested2 duplicates what is already reachable through Second and Third, and it may even become outdated (fortunately not in my case, as the data will not be changed; only new data is added).
But as I see it, it may save database load when I'm looking up all Last records for a certain Main record, or when I only need the Main.id for a specific Last item.
Am I right?
Will it really save load, or is there a better practice?
It all depends on how you access the data. By default Django will make another call to the database when you access a foreign key. So if you want to make fewer calls to the database, you can use select_related to fetch the models behind the foreign keys in the same query.
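For the standard (non-duplicated) models, the whole chain can still be followed in one query by joining across the three foreign keys. The sketch below uses sqlite3 with hypothetical app_-prefixed table names; in Django the corresponding calls would be along the lines of Last.objects.filter(nested__nested__nested=main) or select_related('nested__nested__nested').

```python
import sqlite3

# Hypothetical tables for the Main -> Second -> Third -> Last chain.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_main   (id INTEGER PRIMARY KEY, stuff INTEGER);
    CREATE TABLE app_second (id INTEGER PRIMARY KEY, nested_id INTEGER, stuff INTEGER);
    CREATE TABLE app_third  (id INTEGER PRIMARY KEY, nested_id INTEGER, stuff INTEGER);
    CREATE TABLE app_last   (id INTEGER PRIMARY KEY, nested_id INTEGER, stuff INTEGER);
    INSERT INTO app_main   VALUES (1, 10), (2, 20);
    INSERT INTO app_second VALUES (1, 1, 11), (2, 2, 21);
    INSERT INTO app_third  VALUES (1, 1, 12), (2, 2, 22);
    INSERT INTO app_last   VALUES (1, 1, 13), (2, 1, 14), (3, 2, 23);
""")


def lasts_for_main(main_id):
    """Return the ids of all Last rows under one Main row, using a single query."""
    rows = conn.execute("""
        SELECT l.id
        FROM app_last AS l
        JOIN app_third  AS t ON t.id = l.nested_id
        JOIN app_second AS s ON s.id = t.nested_id
        JOIN app_main   AS m ON m.id = s.nested_id
        WHERE m.id = ?
        ORDER BY l.id
    """, (main_id,)).fetchall()
    return [row[0] for row in rows]


print(lasts_for_main(1))  # [1, 2]
print(lasts_for_main(2))  # [3]
```

Whether the extra joins cost more than the duplicated foreign keys would save is something to measure on real data before denormalizing.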

Why this is making two queries instead of one?

View:
transfer_details = TransferDetail.objects.filter(user=request.user).select_related('transfermethod_set')
print transfer_details.filter(method__name='PayPal')
Models:
class TransferMethod(models.Model):
    name = models.CharField(max_length=30)
    ...

class TransferDetail(models.Model):
    data = models.TextField()
    ...
    method = models.ForeignKey(TransferMethod)
    user = models.ForeignKey(User)
I expect transfer_details QuerySet from line one to be used without further database calls.
What am I missing?
UPDATE 1
So I discovered when I have these two lines there are no additional queries:
x = transfer_details.filter(method__name='PayPal')
x2 = transfer_details.filter(method__name='Something')
But when I add the following two lines, it's making 2 DB queries:
list(x[:1])
list(x2[:1])
What's happening under the hood, and how can I avoid the extra calls?
UPDATE 2
I tried:
transfer_details.get(method__name='PayPal').data
...
It's also making two queries.
It should be (assuming you also want to get the user data in one query):
transfer_details = TransferDetail.objects.filter(
    user=request.user).select_related('method', 'user')
Filtering on method__name, as in transfer_details.filter(method__name='PayPal'), already joins the method table, but the related object itself is only loaded by select_related('method'). When you call print, TransferDetail's __unicode__ gets invoked, so one reason for an additional query could be that you're outputting some other related data there (e.g. from the User model, which the code above should take care of).
To answer your edited question: If you call list on a queryset the queryset gets evaluated, which means the actual query is made.
Don't know if you are accessing request.user at some point before in your code, but if that is not the case it's possible that the second query is the result of getting the user for the current request.

Django aggregation query on related one-to-many objects

Here is my simplified model:
class Item(models.Model):
    pass

class TrackingPoint(models.Model):
    item = models.ForeignKey(Item)
    created = models.DateField()
    data = models.IntegerField()

    class Meta:
        unique_together = ('item', 'created')
In many parts of my application I need to retrieve a set of Item's and annotate each item with data field from latest TrackingPoint from each item ordered by created field. For example, instance i1 of class Item has 3 TrackingPoint's:
tp1 = TrackingPoint(item=i1, created=date(2010,5,15), data=23)
tp2 = TrackingPoint(item=i1, created=date(2010,5,14), data=21)
tp3 = TrackingPoint(item=i1, created=date(2010,5,12), data=120)
I need a query to retrieve i1 instance annotated with tp1.data field value as tp1 is the latest tracking point ordered by created field. That query should also return Item's that don't have any TrackingPoint's at all. If possible I prefer not to use QuerySet's extra method to do this.
That's what I tried so far... and failed :(
Item.objects.annotate(max_created=Max('trackingpoint__created'),
data=Avg('trackingpoint__data')).filter(trackingpoint__created=F('max_created'))
Any ideas?
Here's a single query that will provide (TrackingPoint, Item)-pairs:
TrackingPoint.objects.annotate(max=Max('item__trackingpoint__created')).filter(max=F('created')).select_related('item').order_by('created')
You would have to query for items without TrackingPoints separately.
This isn't a direct answer to your question, but in case you don't need exactly what you described, you might be interested in a greatest-n-per-group solution. You can take a look at my answer to a similar question:
Django Query That Get Most Recent Objects From Different Categories
-- this should apply directly to your case:
items = Item.objects.annotate(tracking_point_created=Max('trackingpoint__created'))
trackingpoints = TrackingPoint.objects.filter(created__in=[b.tracking_point_created for b in items])
Note that second line can produce ambiguous results if created dates repeat in TrackingPoint model.
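For reference, the result the question asks for (every Item with the data of its latest TrackingPoint, including items that have none) can be expressed in SQL with a correlated subquery. The sketch below runs it on sqlite3 with hypothetical table names. In Django 1.11 or later a similar shape can usually be written with Subquery and OuterRef, as hinted in the comment, but treat that as an assumption to verify against your version.

```python
import sqlite3

# Hypothetical tables for the Item / TrackingPoint models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY);
    CREATE TABLE trackingpoint (
        id INTEGER PRIMARY KEY, item_id INTEGER, created TEXT, data INTEGER,
        UNIQUE (item_id, created)
    );
    INSERT INTO item VALUES (1), (2);  -- item 2 has no tracking points
    INSERT INTO trackingpoint (item_id, created, data) VALUES
        (1, '2010-05-15', 23),
        (1, '2010-05-14', 21),
        (1, '2010-05-12', 120);
""")

# Possible ORM shape (not from the answers above):
#   latest = TrackingPoint.objects.filter(item=OuterRef('pk')).order_by('-created')
#   Item.objects.annotate(latest_data=Subquery(latest.values('data')[:1]))
rows = conn.execute("""
    SELECT i.id,
           (SELECT tp.data
              FROM trackingpoint AS tp
             WHERE tp.item_id = i.id
             ORDER BY tp.created DESC
             LIMIT 1) AS latest_data
    FROM item AS i
    ORDER BY i.id
""").fetchall()

print(rows)  # [(1, 23), (2, None)]
```

Because (item, created) is unique, the subquery picks exactly one row per item, which avoids the ambiguity mentioned for the created__in approach.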

How to do this join query in Django

In Django, I have two models:
class Product(models.Model):
    name = models.CharField(max_length=50)
    categories = models.ManyToManyField(Category)

class ProductRank(models.Model):
    product = models.ForeignKey(Product)
    rank = models.IntegerField(default=0)
I put the rank into a separate table because every view of a page will cause the rank to change and I was worried that all these writes would make my other (mostly read) queries slow down.
I gather a list of Products from a simple query:
cat = Category.objects.get(pk = 1)
products = Product.objects.filter(categories = cat)
I would now like to get all the ranks for these products. I would prefer to do it all in one go (using a SQL join) and was wondering how to express that using Django's query mechanism.
What is the right way to do this in Django?
This can be done in Django, but you will need to restructure your models a little bit differently:
class Product(models.Model):
    name = models.CharField(max_length=50)
    product_rank = models.OneToOneField('ProductRank')

class ProductRank(models.Model):
    rank = models.IntegerField(default=0)
Now, when fetching Product objects, you can follow the one-to-one relationship in one query using the select_related() method:
Product.objects.filter([...]).select_related()
This will produce one query that fetches product ranks using a join:
SELECT "example_product"."id", "example_product"."name", "example_product"."product_rank_id", "example_productrank"."id", "example_productrank"."rank" FROM "example_product" INNER JOIN "example_productrank" ON ("example_product"."product_rank_id" = "example_productrank"."id")
I had to move the relationship field between Product and ProductRank to the Product model because it looks like select_related() follows foreign keys in one direction only.
I haven't checked but:
products = Product.objects.filter(categories__pk=1).select_related()
Should grab every instance.
For Django 2.1
From documentation
This example retrieves all Entry objects with a Blog whose name is 'Beatles Blog':
Entry.objects.filter(blog__name='Beatles Blog')
Doc URL
https://docs.djangoproject.com/en/2.1/topics/db/queries/
Add a call to the QuerySet's select_related() method, though I'm not positive that grabs references in both directions, it is the most likely answer.
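With the original models left unchanged, one way to get products and ranks in a single query is to start from the rank table and join back to the product. Here is a sketch of that join in sqlite3, with hypothetical table names and sample rows. In Django it would look roughly like ProductRank.objects.filter(product__categories=cat).select_related('product'), since select_related follows the foreign key from ProductRank to Product.

```python
import sqlite3

# Hypothetical tables for Category, Product, the many-to-many link, and ProductRank.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_categories (product_id INTEGER, category_id INTEGER);
    CREATE TABLE productrank (id INTEGER PRIMARY KEY, product_id INTEGER, "rank" INTEGER);
    INSERT INTO category VALUES (1), (2);
    INSERT INTO product VALUES (1, 'kettle'), (2, 'toaster'), (3, 'lamp');
    INSERT INTO product_categories VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO productrank (product_id, "rank") VALUES (1, 7), (2, 3), (3, 9);
""")

# Start from the rank rows and join to the products in category 1.
rows = conn.execute("""
    SELECT p.name, r."rank"
    FROM productrank AS r
    JOIN product AS p ON p.id = r.product_id
    JOIN product_categories AS pc ON pc.product_id = p.id
    WHERE pc.category_id = ?
    ORDER BY p.id
""", (1,)).fetchall()

print(rows)  # [('kettle', 7), ('toaster', 3)]
```

Each returned row carries both the product and its rank, so no follow-up query per product is needed.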
