Django prefetch_related optimize query but still very slow - python

I'm experiencing severe performance issues with prefetch_related on a model with 5 m2m fields; I'm also prefetching a few nested m2m fields.
class TaskModelManager(models.Manager):
    def get_queryset(self):
        return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).prefetch_related(
            "parent", "takes", "takes__flags", "assignedUser", "assignedUser__flags",
            "asset", "asset__flags", "status", "approvalWorkflow", "viewers",
            "requires", "linkedTasks", "activities")
class Task(models.Model):
    uuid = models.UUIDField(primary_key=True, default=genOptimUUID, editable=False)
    internalStatus = models.IntegerField(default=0)
    parent = models.ForeignKey("self", blank=True, null=True, related_name="childs")
    name = models.CharField(max_length=45)
    taskType = models.ForeignKey("TaskType", null=True)
    priority = models.IntegerField()
    startDate = models.DateTimeField()
    endDate = models.DateTimeField()
    status = models.ForeignKey("ProgressionStatus")
    assignedUser = models.ForeignKey("Asset", related_name="tasksAssigned")
    asset = models.ForeignKey("Asset", related_name="tasksSubject")
    viewers = models.ManyToManyField("Asset", blank=True, related_name="followedTasks")
    step = models.ForeignKey("Step", blank=True, null=True, related_name="tasks")
    approvalWorkflow = models.ForeignKey("ApprovalWorkflow")
    linkedTasks = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="linkedTo")
    requires = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="depends")
    objects = TaskModelManager()
The number of queries is fine and the database query time is fine too. For example, if I query 700 objects of my model, I get 35 queries and a total query time of 100~200ms, but the total request time is approximately 8 seconds.
I've run some profiling and it pointed out that more than 80% of the time spent was on the prefetch_related_objects call.
I'm using Django==1.8.5 and djangorestframework==3.4.6
I'm open to any way to optimize this.
Thanks in advance for your help.
Edit with select_related:
I've tried the improvement proposed by Alasdair
class TaskModelManager(models.Manager):
    def get_queryset(self):
        return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).select_related(
            "parent", "status", "approvalWorkflow", "step").prefetch_related(
            "takes", "takes__flags", "assignedUser", "assignedUser__flags",
            "asset", "asset__flags", "viewers", "requires", "linkedTasks", "activities")
The new result is still 8 seconds for the request with 32 queries and 150ms of query time.
Edit :
It seems that a ticket was opened on the Django issue tracker 4 years ago and is still open: https://code.djangoproject.com/ticket/20577

I ran into the same problem.
Following the issue you linked, I found that you can improve prefetch_related performance by using a Prefetch object with the to_attr argument.
From the commit that introduces the Prefetch object:
When a Prefetch instance specifies a to_attr argument, the result is
stored in a list rather than a QuerySet. This has the fortunate
consequence of being significantly faster. The preformance improvement
is due to the fact that we save the costly creation of a QuerySet
instance.
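So I significantly improved my code (from about 7 seconds to 0.88 seconds) simply by calling:
from django.db.models import Prefetch
Foo.objects.filter(...).prefetch_related(Prefetch('bars', to_attr='bars_list'))
instead of
Foo.objects.filter(...).prefetch_related('bars')
The prefetched objects are then available as a plain list on foo.bars_list rather than through foo.bars.all().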

Try using select_related for foreign keys like parent and approvalWorkflow instead of prefetch_related.
When you use select_related, Django fetches the related models using a join, unlike prefetch_related, which runs an extra query per relationship. You might find that this improves performance.

If the DB is 150ms but your request is 8 seconds, it's not your query (in itself, at least). A few possible issues:
1) Your HTML or template is too complex, so too much time is spent generating the response. Simplify it, or consider template caching.
2) The objects are complex and you load too many fields, so while the query is fast, transferring the data and building all those objects in Python is slow. Explore only(), defer(), values() or values_list() to load only what you need.
Optimization is hard and we'd need more details to give you a better idea. I'd suggest installing Django Debug Toolbar (Django app) or Opbeat (3rd party utility); they can help you detect where your time is spent, and then you can optimize accordingly.

Related

Limit prefetch_related to 1 by a certain criteria

So I have models like these
class Status(models.Model):
    name = models.CharField(max_length=255, choices=StatusName.choices, unique=True)
class Case(models.Model):
    # has some fields
    pass
class CaseStatus(models.Model):
    case = models.ForeignKey("cases.Case", on_delete=models.CASCADE, related_name="case_statuses")
    status = models.ForeignKey("cases.Status", on_delete=models.CASCADE, related_name="case_statuses")
    created = models.DateTimeField(auto_now_add=True)
I need to filter the cases on the basis of the status of their case-status but the catch is only the latest case-status should be taken into account.
To get Case objects based on all the case-statuses, this query works:
Case.objects.filter(case_statuses__status=status_name)
But I need to get the Case objects such that only their latest case_status object (descending created) is taken into account. Something like this is what I am looking for:
Case.objects.filter(case_statuses__order_by_created_first__status=status_name)
I have tried Prefetch as well, but it doesn't seem to work for my use case:
sub_query = CaseStatus.objects.filter(
    id=CaseStatus.objects.select_related('case').order_by('-created').first().id)
Case.objects.prefetch_related(Prefetch('case_statuses', queryset=sub_query)).filter(
    case_statuses__status=status_name)
This would be easy to solve in raw Postgres by using LIMIT 1, but I'm not sure how to make this work in the Django ORM.
You can annotate your cases with their last status, and then filter on that annotation to get the status you want. Wrapping the inner queryset in Subquery makes the intent explicit:
from django.db.models import OuterRef, Subquery
status_qs = CaseStatus.objects.filter(case=OuterRef('pk')).order_by('-created').values('status__name')[:1]
Case.objects.annotate(last_status=Subquery(status_qs)).filter(last_status=status_name)
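For reference, the annotate-and-filter approach corresponds roughly to a correlated subquery with LIMIT 1, which is the raw SQL the asker had in mind. Below is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are made up for illustration and are not the tables Django would actually generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE statuses (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE cases (id INTEGER PRIMARY KEY);
    CREATE TABLE case_statuses (
        id INTEGER PRIMARY KEY,
        case_id INTEGER REFERENCES cases(id),
        status_id INTEGER REFERENCES statuses(id),
        created TEXT
    );
    INSERT INTO statuses VALUES (1, 'open'), (2, 'closed');
    INSERT INTO cases VALUES (1), (2);
    -- case 1: open, then closed; case 2: closed, then reopened
    INSERT INTO case_statuses (case_id, status_id, created) VALUES
        (1, 1, '2021-01-01'), (1, 2, '2021-02-01'),
        (2, 2, '2021-01-15'), (2, 1, '2021-03-01');
""")

# For each case, the inner SELECT picks the name of its most recent status only.
LATEST_STATUS_SQL = """
    SELECT c.id
    FROM cases AS c
    WHERE (
        SELECT s.name
        FROM case_statuses AS cs
        JOIN statuses AS s ON s.id = cs.status_id
        WHERE cs.case_id = c.id
        ORDER BY cs.created DESC
        LIMIT 1
    ) = ?
    ORDER BY c.id
"""

def cases_with_latest_status(connection, status_name):
    """Return ids of cases whose most recent status has the given name."""
    return [row[0] for row in connection.execute(LATEST_STATUS_SQL, (status_name,))]

closed_now = cases_with_latest_status(conn, "closed")
open_now = cases_with_latest_status(conn, "open")
print(closed_now, open_now)
```

A plain join on case_statuses would report both cases as "closed", since each was closed at some point; the subquery restricts the comparison to the latest row per case.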

Django prefetch_related and N+1 - How is it solved?

I am sitting with a query looking like this:
# Get the amount of kilo attached to products
product_data = {}
for productSpy in ProductSpy.objects.all():
    product_data[productSpy.product.product_id] = productSpy.kilo # RERUN
I do not see how I would be able to use prefetch_related on the last line. The examples in the docs are very simplified and somehow make sense, but I do not understand the whole concept well enough to work this out myself. Could someone please explain what is being done and how? I find this very important to understand, and this is the first N+1 problem I have met.
Thank you up front for your time.
models.py
class ProductSpy(models.Model):
    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    def __str__(self):
        return self.kilo
class Product(models.Model):
    product_id = models.IntegerField()
    name = models.CharField(max_length=150)
    def __str__(self):
        return self.name
Django fetches related objects lazily, at access time:
each access to productSpy.product fetches a row from the product table using productSpy.product_id.
Because of I/O latency, this is highly inefficient. Using prefetch_related fetches the products for all the ProductSpy objects in one extra query, resulting in better performance. Note that prefetch_related returns a new queryset, so its result must be assigned, and that it only takes relation names (kilo is a plain field and needs no prefetching).
# Get the amount of kilo attached to products
product_data = {}
product_spies = ProductSpy.objects.all().prefetch_related('product')
for productSpy in product_spies:
    product_data[productSpy.product.product_id] = productSpy.kilo
When one writes productSpy.product and the related object is not already fetched, Django automatically makes a query to the database to get the related Product instance. Hence, if ProductSpy.objects.all() returns N instances, accessing productSpy.product in a loop makes N more queries, which is what we call the N + 1 problem.
Moving further, although you can use prefetch_related (which will use 2 queries in your case), here it would be better to use select_related [Django docs], which uses a SQL JOIN and gets you the related instances in a single query:
product_data = {}
queryset = ProductSpy.objects.select_related('product')
for productSpy in queryset:
    product_data[productSpy.product.product_id] = productSpy.kilo # No extra queries as we used `select_related`
Note: There seems to be some problem with your logic here though, as multiple ProductSpy instances can have the same Product,
hence your loop might overwrite some values.
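To make the query counts concrete, here is a small sketch at the SQL level using Python's built-in sqlite3 module. It is not Django code: the tables, the product_fk column and the run() helper are hypothetical stand-ins, and the counter simply counts statements the way a query log would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, product_id INTEGER, name TEXT);
    CREATE TABLE product_spy (id INTEGER PRIMARY KEY, product_fk INTEGER, kilo REAL);
    INSERT INTO product VALUES (1, 100, 'apples'), (2, 200, 'pears');
    INSERT INTO product_spy VALUES (1, 1, 2.5), (2, 2, 4.0), (3, 1, 7.0);
""")

query_count = 0

def run(sql, params=()):
    """Execute one statement, count it, and return all rows."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# N+1 pattern: one query for the spies, then one more per spy for its product.
query_count = 0
lazy = {}
for spy_id, product_fk, kilo in run("SELECT id, product_fk, kilo FROM product_spy ORDER BY id"):
    (product_id,) = run("SELECT product_id FROM product WHERE id = ?", (product_fk,))[0]
    lazy[product_id] = kilo
n_plus_one_queries = query_count

# JOIN pattern: the related columns arrive together with each spy row.
query_count = 0
joined = {}
for product_id, kilo in run(
    "SELECT p.product_id, s.kilo FROM product_spy AS s "
    "JOIN product AS p ON p.id = s.product_fk ORDER BY s.id"
):
    joined[product_id] = kilo
join_queries = query_count

print(n_plus_one_queries, join_queries, joined)
```

With three spies the first loop issues four statements and the second issues one. The sample data also shows the overwrite mentioned in the note: two spies point at product 100, so only the last kilo value survives in the dictionary.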

Django ORM really slow iterating over QuerySet

I am working on a new project and had to build an outline of a few pages really quick.
I imported a catalogue of 280k products that I want to search through. I opted for Whoosh and Haystack to provide search, as I am using them on a previous project.
I added definitions for the indexing and kicked off that process. However, it seems that Django is really, really really slow to iterate over the QuerySet.
Initially, I thought the indexing was taking more than 24 hours - which seemed ridiculous, so I tested a few other things. I can now confirm that it would take many hours to iterate over the QuerySet.
Maybe there's something I'm not used to in Django 2.2? I previously used 1.11 but thought I would use a newer version this time.
The model I'm trying to iterate over:
class SupplierSkus(models.Model):
    sku = models.CharField(max_length=20)
    link = models.CharField(max_length=4096)
    price = models.FloatField()
    last_updated = models.DateTimeField("Date Updated", null=True, auto_now=True)
    status = models.ForeignKey(Status, on_delete=models.PROTECT, default=1)
    category = models.CharField(max_length=1024)
    family = models.CharField(max_length=20)
    family_desc = models.TextField(null=True)
    family_name = models.CharField(max_length=250)
    product_name = models.CharField(max_length=250)
    was_price = models.FloatField(null=True)
    vat_rate = models.FloatField(null=True)
    lead_from = models.IntegerField(null=True)
    lead_to = models.IntegerField(null=True)
    deliv_cost = models.FloatField(null=True)
    prod_desc = models.TextField(null=True)
    attributes = models.TextField(null=True)
    brand = models.TextField(null=True)
    mpn = models.CharField(max_length=50, null=True)
    ean = models.CharField(max_length=15, null=True)
    supplier = models.ForeignKey(Suppliers, on_delete=models.PROTECT)
and, as I mentioned, there are roughly 280k lines in that table.
When I do something simple as:
from products.models import SupplierSkus
sku_list = SupplierSkus.objects.all()
len(sku_list)
The process quickly eats up most of the CPU and does not finish. Likewise, I cannot iterate over it:
for i in sku_list:
    print(i.sku)
This also takes hours and does not print a single line. However, I can iterate over it using:
for i in sku_list.iterator():
    print(i.sku)
That doesn't help me very much, as I still need to do the indexing via Haystack and I believe that the issues are related.
This wasn't the case with some earlier projects I've worked with. Even a much more sizeable list (3-5m lines) would be iterated over quite quickly. A query for list length will take a moment, but return the result in seconds rather than hours.
So, I wonder, what's going on?
Is this something someone else has come across?
Okay, I've found the problem was the Python MySQL driver. Without using the .iterator() method a for loop would get stuck on the last element in the QuerySet. I have posted a more detailed answer on an expanded question here.
I was not using the Django recommended mysqlclient. I was using the
one created by Oracle/MySQL. There seems to be a bug that causes an
iterator to get "stuck" on the last element of the QuerySet in a for
loop and be trapped in an endless loop in certain circumstances.
Coming to think of it, it may well be that this is a design feature of the MySQL driver. I remember having a similar issue with a Java version of this driver before. Maybe I should just ditch MySQL and move to PostgreSQL?
I will try to raise a bug with Oracle anyways.

Reducing Django Database Queries

I have a very large and growing dataset, and I need to create many filters, but this is quickly going to get out of control. I was hoping someone could help me combine some of the queries into a single call. Below is the start of my view.
Call #1 - for loop to display table of all results
traffic = Traffic.objects.all()
Call #2 - Combined aggregate sum query
totals = Traffic.objects.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
Call #3, #4, #5, #6 and so on... - To filter the database by month and day of week
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=2).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=3).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=4).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=5).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=6).aggregate(Sum('sessions'))
The problem is, I need to create several dozen more filters, because I have 3 years of data with multiple data points per column that we need the summed totals for.
Questions:
Can I combine call #1 into call #2?
Can I use call #2 to query the sums for call #3, so I don't have to fetch all objects from the database, filter them, and then do this a couple dozen more times?
As you can see, this is going to get out of control very quickly. Any help would be hugely appreciated. Thank you.
Updated to add
Traffic Model
class Timestamp(models.Model):
    created = models.DateField()
    class Meta:
        abstract = True
class Traffic(Timestamp):
    sessions = models.IntegerField(blank=True, null=True)
    new_users = models.IntegerField(blank=True, null=True)
    reminder = models.IntegerField(blank=True, null=True)
    campaigns = models.IntegerField(blank=True, null=True)
    new_sales = models.IntegerField(blank=True, null=True)
    sales_renewals = models.IntegerField(blank=True, null=True)
    # Meta and String
    class Meta:
        verbose_name = 'Traffic'
        verbose_name_plural = 'Traffic Data'
    def __str__(self):
        return "%s" % self.created
There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them. Here are some quick tips for query optimization:
1) iterator()
Use this if you access the queryset only once, since it avoids caching the results on the queryset. For example:
traffic = Traffic.objects.all()
for t in traffic.iterator():
    ...
    ...
2) db_index=True
Set this when defining the fields of your models. As the Django documentation says,
This is a number one priority, after you have determined from
profiling what indexes should be added. Use Field.db_index or
Meta.index_together to add these from Django. Consider adding indexes
to fields that you frequently query using filter(), exclude(),
order_by(), etc. as indexes may help to speed up lookups.
Hence you can modify your model as,
class Traffic(Timestamp):
    sessions = models.IntegerField(blank=True, null=True, db_index=True)
    new_users = models.IntegerField(blank=True, null=True, db_index=True)
    reminder = models.IntegerField(blank=True, null=True, db_index=True)
    campaigns = models.IntegerField(blank=True, null=True, db_index=True)
    new_sales = models.IntegerField(blank=True, null=True, db_index=True)
3) prefetch_related() or select_related()
If you have relations within your models, using prefetch_related or select_related would be a choice. As per the Django documentation,
select_related works by creating a SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a 'many' relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related.
select_related does a join, while prefetch_related runs a separate query for each relationship. Using these you can make your queries up to 30% faster.
4) Django Pagination
If your template design allows you to display results across multiple pages, you can use pagination.
5) Querysets are Lazy
You also need to understand that Django querysets are lazy, which means that the database is not queried until the queryset is used/evaluated. A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example,
traffic = Traffic.objects.all()
The above code doesn’t run any database queries. You can take the traffic queryset and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications. To fetch the data from the database, you need to iterate over the queryset:
for t in traffic.iterator():
    print(t.sessions)
6) django-debug-toolbar
Django Debug Toolbar is a configurable set of panels that display various debug information about the current request/response and when clicked, display more details about the panel's content. This includes:
Request timer
SQL queries including time to execute and links to EXPLAIN each query
Modifying your code: (remember Querysets are Lazy)
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
t_2014 = traffic.filter(created__year='2014')
t_sessions_2014_wd2 = t_2014.filter(created__week_day=2).aggregate(Sum('sessions'))
...
...
For Call #1 in template (for loop to display table of all results):
{% for t in traffic.iterator %}
{{ t.sessions }}
...
...
{% endfor %}
As for Question 1, it shouldn't be a problem to reuse the queryset from the first call:
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
Because querysets are lazy, defining traffic costs nothing by itself. Note, however, that aggregate() still runs its own SQL query, separate from the one that fetches the rows for the table.
Regarding Question 2, you can again reuse the queryset from the first call and filter by year, which gives you a new queryset, e.g.
traffic_2014 = traffic.filter(created__year='2014')
You can then continue filtering by day and aggregating with this new queryset, as you did before. Each aggregate() call is still one query, but the database computes the sum itself, so no objects are loaded into Python.
I hope this helps you.
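One way to avoid a separate query per weekday is to let the database group the rows, so that all weekday sums for a year come back from a single statement. The sketch below shows the idea in plain SQL through Python's built-in sqlite3 module. The table and sample rows are invented for illustration, and the same grouping should be expressible in the ORM with values() and annotate(), which is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traffic (id INTEGER PRIMARY KEY, created TEXT, sessions INTEGER);
    INSERT INTO traffic (created, sessions) VALUES
        ('2014-01-06', 10),  -- Monday
        ('2014-01-13', 5),   -- Monday
        ('2014-01-07', 7),   -- Tuesday
        ('2015-01-05', 99);  -- Monday, but a different year
""")

# SQLite's %w gives 0 (Sunday) .. 6 (Saturday); adding 1 matches Django's
# week_day lookup, which runs from 1 (Sunday) to 7 (Saturday).
rows = conn.execute(
    """
    SELECT CAST(strftime('%w', created) AS INTEGER) + 1 AS week_day,
           SUM(sessions)
    FROM traffic
    WHERE strftime('%Y', created) = ?
    GROUP BY week_day
    ORDER BY week_day
    """,
    ("2014",),
).fetchall()

sessions_by_week_day = dict(rows)
print(sessions_by_week_day)
```

Here one query replaces the five per-weekday aggregate() calls from the question; weekdays without any rows are simply absent from the result.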
Not addressing the questions directly, but I think you should consider a different approach.
Based on my understanding:
The view may be requested often.
The data should change rarely.
There is a need for complicated data manipulation (summing fields by year, month, day etc.)
There is no need to perform the same queries each time someone requests the view.
Load all the data in one step and perform manipulations inside the view. You can use a library like Pandas and create complicated data sets. The view will be now CPU bound so use a caching system like Redis to avoid recalculating. Invalidate when the data has changed.
Another approach: perform the calculations periodically by using a task queue like Celery and populate Redis.
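The compute-once, invalidate-on-change idea can be sketched without any external service. In the example below a module-level dictionary stands in for Redis, and compute_totals(), get_totals() and invalidate() are hypothetical helper names, not part of Django or any caching library.

```python
_cache = {}          # stands in for an external cache such as Redis
compute_calls = 0    # counts how often the expensive work actually runs

def compute_totals(rows):
    """Sum every column across all rows, treating None as 0."""
    global compute_calls
    compute_calls += 1
    totals = {}
    for row in rows:
        for key, value in row.items():
            totals[key] = totals.get(key, 0) + (value or 0)
    return totals

def get_totals(rows, cache_key="traffic_totals"):
    """Return cached totals, computing them only on a cache miss."""
    if cache_key not in _cache:
        _cache[cache_key] = compute_totals(rows)
    return _cache[cache_key]

def invalidate(cache_key="traffic_totals"):
    """Drop the cached value; call this whenever the data changes."""
    _cache.pop(cache_key, None)

rows = [{"sessions": 10, "new_users": 3}, {"sessions": 5, "new_users": None}]
first = get_totals(rows)
second = get_totals(rows)   # served from the cache, no recomputation
calls_before_change = compute_calls

rows.append({"sessions": 1, "new_users": 1})
invalidate()                # the data changed, so the cached totals are stale
third = get_totals(rows)
print(first, third, compute_calls)
```

The important part is the explicit invalidate() call at the point where data is written; without it, readers would keep seeing the old totals.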

Django - very slow query

I've made a module which parses xml file and updates or creates data in django db (pgsql).
When the data import/update is done I try to update some meta data of my objects.
I use django-mptt for tree structures and my meta-data updater is for creating such structures between my objects.
It's really, really slow: it takes about 1 second to populate parent with data from the other foreign key.
How do I optimise this?
for index, place in enumerate(Place.objects.filter(type=Place.TOWN, town_id_equal=True)):
    place.parent = place.second_order_division
    place.save()
    print(index)
    if index % 5000 == 0:
        transaction.commit()
transaction.commit()
transaction.set_autocommit(False)
for index, place in enumerate(Place.objects.filter(type=Place.TOWN, town_id_equal=False,
                                                   parent__isnull=True)):
    place.parent = Place.objects.get(town_id=place.town_id_extra)
    place.save()
    print(index)
    if index % 5000 == 0:
        transaction.commit()
transaction.commit()
class Place(MPTTModel):
    first_order_division = models.ForeignKey("self", null=True, blank=True, verbose_name=u"Województwo",
                                             related_name="voivodeships")
    second_order_division = models.ForeignKey("self", null=True, blank=True, verbose_name=u"Powiat",
                                              related_name="counties")
    parent = TreeForeignKey('self', null=True, blank=True, related_name='children')
Edit:
I updated first function like this:
transaction.set_autocommit(False)
for index, obj in enumerate(Place.objects.filter(type=Place.COUNTY)):
    data = Place.objects.filter(second_order_division=obj, type=Place.TOWN, town_id_equal=True)
    data.update(parent=obj)
    print(index)
transaction.commit()
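Instead of looping, you should do bulk updates.
For the first transaction, you can replace the loop with this Django query:
from django.db.models import F
Place.objects.filter(type=Place.TOWN, town_id_equal=True).update(parent=F('second_order_division'))
For the second transaction, a single bulk update like this does not apply, because each row needs another lookup on the Place model.
Here you should find a way to avoid running the 'Place.objects.get(town_id=place.town_id_extra)' query on every iteration of the loop, for example by loading the needed places once up front.
Keep in mind that update() bypasses save(), so django-mptt does not maintain its tree fields for these rows; the tree has to be rebuilt afterwards.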
or can take help from this blog
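Both ideas can be illustrated at the SQL level with Python's built-in sqlite3 module: a single UPDATE that copies one column into another, and a lookup table loaded once instead of one query per row. The simplified place table and its sample rows below are assumptions made for this sketch, not the schema django-mptt would create.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE place (
        id INTEGER PRIMARY KEY,
        town_id INTEGER,
        town_id_extra INTEGER,
        second_order_division_id INTEGER,
        parent_id INTEGER
    );
    INSERT INTO place (id, town_id, town_id_extra, second_order_division_id) VALUES
        (1, 10, NULL, NULL),   -- a county-like row
        (2, 20, NULL, 1),      -- town whose parent is its second-order division
        (3, 30, 20, NULL),     -- town whose parent is found via town_id_extra
        (4, 40, 20, NULL);
""")

# First case: one UPDATE copies the column for every matching row at once.
conn.execute(
    "UPDATE place SET parent_id = second_order_division_id "
    "WHERE second_order_division_id IS NOT NULL"
)

# Second case: load the town_id -> id mapping once ...
id_by_town_id = dict(conn.execute("SELECT town_id, id FROM place"))
orphans = conn.execute(
    "SELECT id, town_id_extra FROM place "
    "WHERE parent_id IS NULL AND town_id_extra IS NOT NULL"
).fetchall()

# ... then resolve parents in memory and send all updates as one batch.
conn.executemany(
    "UPDATE place SET parent_id = ? WHERE id = ?",
    [(id_by_town_id[extra], place_id) for place_id, extra in orphans],
)
conn.commit()

parents = dict(conn.execute("SELECT id, parent_id FROM place ORDER BY id"))
print(parents)
```

The per-row SELECT from the original loop is replaced by one dictionary lookup, so the number of read queries no longer grows with the number of places.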
Answering a more general question, one tactic to improve performance of almost any type of system is:
Minimize interaction between the dynamic parts of your system
That's it: minimize interaction through HTTP requests, database queries, etc. In your case, you are doing multiple queries to your database that can be easily reduced to fewer (perhaps one or two).
