I have very large dataset and growing, and I need to create many filters but it is going to quickly get out of control and was hoping someone can help me combine some of the queries into a single call. Below is the start of my view.
Call #1 - for loop to display table of all results
traffic = Traffic.objects.all()
Call #2 - Combined aggregate sum query
totals = Traffic.objects.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
Call #3, #4, #5, #6 and so on... - To filter the database by month and day of week
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=2).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=3).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=4).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=5).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=6).aggregate(Sum('sessions'))
The problem is , I need to create several dozen more filters because I have 3 years of data with multiple data points per column that we need totals the sum for.
Questions:
Can I combine call #1 into call #2
Can I use Call #2 to query the sums for call#3 so I don't have to call all objects from the database to filter it and then do this a couple more dozen times?
As you can see, this is going to get out of control very quickly. Any help would be hugely appreciated. Thank you.
Updated to add
Traffic Model
class Timestamp(models.Model):
created = models.DateField()
class Meta:
abstract = True
class Traffic(Timestamp):
sessions = models.IntegerField(blank=True, null=True)
new_users = models.IntegerField(blank=True, null=True)
reminder = models.IntegerField(blank=True, null=True)
campaigns = models.IntegerField(blank=True, null=True)
new_sales = models.IntegerField(blank=True, null=True)
sales_renewals = models.IntegerField(blank=True, null=True)
# Meta and String
class Meta:
verbose_name = 'Traffic'
verbose_name_plural = 'Traffic Data'
def __str__(self):
return "%s" % self.created
There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them. Here's some quick tips for query optimization:
1) iterator()
If you are accessing the queryset only once. So for example you can use this as,
traffic = Traffic.objects.all()
for t in traffic.iterator():
...
...
2) db_index=True
While defining fields of your models. As the Django documentation says,
This is a number one priority, after you have determined from
profiling what indexes should be added. Use Field.db_index or
Meta.index_together to add these from Django. Consider adding indexes
to fields that you frequently query using filter(), exclude(),
order_by(), etc. as indexes may help to speed up lookups.
Hence you can modify your model as,
class Traffic(Timestamp):
sessions = models.IntegerField(blank=True, null=True, db_index=True)
new_users = models.IntegerField(blank=True, null=True, db_index=True)
reminder = models.IntegerField(blank=True, null=True, db_index=True)
campaigns = models.IntegerField(blank=True, null=True, db_index=True)
new_sales = models.IntegerField(blank=True, null=True, db_index=True)
3) prefetch_related() or select_related()
If you have relations within your models, using prefetch_related or select_related would be a choice. As per the Django documentation,
select_related works by creating a SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a 'many' relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related.
select_related does a join, prefetch_related does two separate queries. Using these you can make your queries upto 30% faster.
4) Django Pagination
If your template design allows you to display results in multiple pages your can use Pagination.
5) Querysets are Lazy
You also need to understand that the Django Querysets are lazy which means that it won't query the database untill its being used/evaluated. A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example,
traffic = Traffic.objects.all()
The above code doesn’t run any database queries. You can can take the traffic queryset and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications. To fetch the data from the database, you need to iterate over the queryset:
for t in traffic.iterator():
print(t.sessions)
6) django-debug-toolbar
Django Debug Toolbar is a configurable set of panels that display various debug information about the current request/response and when clicked, display more details about the panel's content. This includes:
Request timer
SQL queries including time to execute and links to EXPLAIN each query
Modifying your code: (remember Querysets are Lazy)
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
t_2014 = traffic.filter(created__year='2014')
t_sessions_2014_wd2 = t_2014.filter(created__week_day=2).aggregate(Sum('sessions'))
...
...
For Call #1 in template (for loop to display table of all results):
{% for t in traffic.iterator %}
{{ t.sessions }}
...
...
{% endfor %}
As for Question 1, it shouldn't be a problem to reuse the queryset from the first call.
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
This should spare you an additional call to the database.
Regarding Question 2, you can again reuse the queryset from the first call, and filter the year, which gives you a new queryset, e.g.
traffic_2014 = traffic.filter(created__year='2014')
You can then continue filtering the days and aggregating with this new queryset, like you did before, or create new querysets for each day, assuming you aggregate multiple attributes every day, thus saving you another dozen database calls.
I hope this helps you.
Not addressing the questions directly, but I think you should consider a different approach.
Based on my understanding:
The view may be requested often.
The data should change rarely.
There is a need for complicated data manipulation (summing fields by year, month, day etc.)
There is no need to perform the same queries each time someone requests the view.
Load all the data in one step and perform manipulations inside the view. You can use a library like Pandas and create complicated data sets. The view will be now CPU bound so use a caching system like Redis to avoid recalculating. Invalidate when the data has changed.
Another approach: perform the calculations periodically by using a task queue like Celery and populate Redis.
Related
I'm trying to return a list of users that have recently made a post, but the order_by method makes it return too many items.
there is only 2 accounts total, but when I call
test = Account.objects.all().order_by('-posts__timestamp')
[print(i) for i in test]
it will return the author of every post instance, and its duplicates. Not just the two account instances.
test#test.example
test#test.example
test#test.example
test#test.example
foo#bar.example
Any help?
class Account(AbstractBaseUser):
...
class Posts(models.Model):
author = models.ForeignKey('accounts.Account',on_delete=models.RESTRICT, related_name="posts")
timestamp = models.DateTimeField(auto_now_add=True)
title = ...
content = ...
This is totally normal. You should understand how is the SQL query generated.
Yours should look something like that:
select *
from accounts
left join post on post.account_id = account.id
order by post.timestamp
You are effectively selecting every post with its related users. It is normal that you have some duplicated users.
What you could do is ensure that your are selecting distinct users: Account.objects.order_by('-posts__timestamp').distinct('pk')
What I would do is cache this information in the account (directly on the acount model or in another model that has a 1-to-1 relashionship with your users.
Adding a last_post_date to your Account model would allow you to have a less heavy request.
Updating the Account.last_post_date every time a Post is created can be a little tedious, but you can abstract this by using django models signals.
I am sitting with a query looking like this:
# Get the amount of kilo attached to products
product_data = {}
for productSpy in ProductSpy.objects.all():
product_data[productSpy.product.product_id] = productSpy.kilo # RERUN
I do not see how I on my last line would be able to use prefetch_related. In the examples in the docs it's very simplified and somehow makes sense, but I do not understand the whole concept enough to see myself out of this. Could I please get explained what's being done and how? I find this very important to understand, and where met by my first N+1 here.
Thank you up front for your time.
models.py
class ProductSpy(models.Model):
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
product = models.ForeignKey(Product, on_delete=models.CASCADE)
def __str__(self):
return self.kilo
class Product(models.Model):
product_id = models.IntegerField()
name = models.CharField(max_length=150)
def __str__(self):
return self.name
Django fetches related tables at runtime:
each call to productSpy.product will fetch from the table product using productSpy.id
The latency in I/O operation means that this code is highly inefficient. using prefetch_related will fetch product for all the product spy objects in one shot resulting in better performance.
# Get the amount of kilo attached to products
product_data = {}
product_spies = ProductSpy.objects.all()
product_spies.prefetch_related('product')
product_spies.prefetch_related('kilo')
for productSpy in product_spies:
product_data[productSpy.product.product_id] = productSpy.kilo # RERUN
When one writes productSpy.product if the related object is not already fetched, Django makes automatically will make a query to the database to get the related Product instance. Hence if ProductSpy.objects.all() returned N instances by writing productSpy.product in a loop we will be making N more queries which is what we call N + 1 problem.
Moving further although you can use prefetch_related (will use 2 queries in your case) here it would be better for you to use select_related [Django docs] which will use a LEFT JOIN and get you the related instances in 1 query itself:
product_data = {}
queryset = ProductSpy.objects.select_related('product')
for productSpy in queryset:
product_data[productSpy.product.product_id] = productSpy.kilo # No extra queries as we used `select_related`
Note: There seems to be some problem with your logic here though, as multiple ProductSpy instances can have the same Product,
hence your loop might overwrite some values.
I am trying to build a tool that, at a simple level, tries to analyse how to buy a flat. DB = POSTGRES
So the model basically is:
class Property(models.Model):
address = CharField(max_length = 200)
price = IntegerField()
user = ForeignKey(User) # user who entered the property in the database
#..
#..
# some more fields that are common across all flats
#However, users might have their own way of analysing
# one user might want to put
estimated_price = IntegerField() # his own estimate of the price, different from the zoopla or rightmove listing price
time_to_purchase = IntegerField() # his own estimate on how long it will take to purchase
# another user might want to put other fields
# might be his purchase process requires sorting or filtering based on these two fields
number_of_bedrooms = IntegerField()
previous_owner_name = CharField()
How do I give such flexiblity to users? They should be able to sort , filter and query their own rows (in the Property table) by these custom fields. The only option I can think of now is the JSONField Postgres field
Any advice? I am surprised this is not solved readily in Django - I am sure lots of other people would have come across this problem already
Thanks
Edit: As the comments point out. JSON field is a better idea in this case.
Simple. Use Relations.
Create a model called attributes.
It will have a foreign key to a Property, a name field and a value field.
Something like,
class Attribute(models.Model):
property = models.ForiegnKey(Property)
name = models.CharField(max_length=50)
value = models.CharField(max_length=150)
Create an object each for all custom attributes of a property.
When using database queries use select_related of prefetch_related for faster response, less db operations.
I'm experiencing some severe performances issues with prefetch_related on a Model with 5 m2m fields and I'm pre-fetching also few nested m2m fields.
class TaskModelManager(models.Manager):
def get_queryset(self):
return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).prefetch_related("parent", "takes", "takes__flags", "assignedUser", "assignedUser__flags", "asset", "asset__flags", "status", "approvalWorkflow", "viewers", "requires", "linkedTasks", "activities")
class Task(models.Model):
uuid = models.UUIDField(primary_key=True, default=genOptimUUID, editable=False)
internalStatus = models.IntegerField(default=0)
parent = models.ForeignKey("self", blank=True, null=True, related_name="childs")
name = models.CharField(max_length=45)
taskType = models.ForeignKey("TaskType", null=True)
priority = models.IntegerField()
startDate = models.DateTimeField()
endDate = models.DateTimeField()
status = models.ForeignKey("ProgressionStatus")
assignedUser = models.ForeignKey("Asset", related_name="tasksAssigned")
asset = models.ForeignKey("Asset", related_name="tasksSubject")
viewers = models.ManyToManyField("Asset", blank=True, related_name="followedTasks")
step = models.ForeignKey("Step", blank=True, null=True, related_name="tasks")
approvalWorkflow = models.ForeignKey("ApprovalWorkflow")
linkedTasks = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="linkedTo")
requires = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="depends")
objects = TaskModelManager()
The number of query is fine and the database query time is fine too, for exemple if I query 700 objects of my model i have 35 query and the average query time is 100~200ms but the total request time is approximately 8 seconds.
silk times
I've run some profiling and it pointed out that more than 80% of the time spent was on the prefetch_related_objects call.
profiling
I'm using Django==1.8.5 and djangorestframework==3.4.6
I'm open to any way to optimize this.
Thanks in advance for your help.
Edit with select_related:
I've tried the improvement proposed by Alasdair
class TaskModelManager(models.Manager):
def get_queryset(self):
return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).select_related("parent", "status", "approvalWorkflow", "step").prefetch_related("takes", "takes__flags", "assignedUser", "assignedUser__flags", "asset", "asset__flags", "viewers", "requires", "linkedTasks", "activities")
The new result is still 8 seconds for the request with 32 queries and 150ms of query time.
Edit :
It seems that a ticket was opened on Django issue tracker 4 years ago and is still open : https: //code.djangoproject.com/ticket/20577
I ran into the same problem.
Following the issue you linked i found that you can improve the prefetch_related performance using Prefetch object and to_attr argument.
From the commit that introduces the Prefetch object:
When a Prefetch instance specifies a to_attr argument, the result is
stored in a list rather than a QuerySet. This has the fortunate
consequence of being significantly faster. The preformance improvement
is due to the fact that we save the costly creation of a QuerySet
instance.
So i significantly improved my code (from about 7 seconds to 0.88 seconds) simply by calling:
Foo.objects.filter(...).prefetch_related(Prefetch('bars', to_attr='bars_list'))
instead of
Foo.objects.filter(...).prefetch_related('bars')
Try using select_related for foreign keys like parent and ApprovalWorkflow instead of prefetch_related.
When you use select_related, Django will fetch the models using a join, unlike prefetch_related which causes an extra query. You might find that this improves performance.
If the DB is 150ms but your request is 8 seconds, it's not your query (in itself, at least). A few possible issues:
1) Your HTML or template is too complex, spending too much time in generating the response. Or consider template caching.
2) All those objects are complex and you load too many fields, so while the query is fast, sending it and processing all those objects in Python is slow. Explore using only(), defer() and values() or value_list() to only load what you need.
Optimization is hard and we'd need more details to give you a better idea. I'd suggest installing Django Debug Toolbar (Django app) or Opbeat (3rd party utility), they can help you detect where your time is spent and then you can optimize accordingly.
I've made a module which parses xml file and updates or creates data in django db (pgsql).
When the data import/update is done I try to update some meta data of my objects.
I use django-mptt for tree structures and my meta-data updater is for creating such structures between my objects.
It's really really slow it takes about 1 second to populate parent with data from other foreignkey.
How do I optimise this?
for index, place in enumerate(Place.objects.filter(type=Place.TOWN, town_id_equal=True)):
place.parent = place.second_order_division
place.save()
print index
if index % 5000 == 0:
transaction.commit()
transaction.commit()
transaction.set_autocommit(False)
for index, place in enumerate(Place.objects.filter(type=Place.TOWN, town_id_equal=False,
parent__isnull=True)):
place.parent = Place.objects.get(town_id=place.town_id_extra)
place.save()
print index
if index % 5000 == 0:
transaction.commit()
transaction.commit()
class Place(MPTTModel):
first_order_division = models.ForeignKey("self", null=True, blank=True, verbose_name=u"Województwo",
related_name="voivodeships")
second_order_division = models.ForeignKey("self", null=True, blank=True, verbose_name=u"Powiat",
related_name="counties")
parent = TreeForeignKey('self', null=True, blank=True, related_name='children')
Edit:
I updated first function like this:
transaction.set_autocommit(False)
for index, obj in enumerate(Place.objects.filter(type=Place.COUNTY)):
data = Place.objects.filter(second_order_division=obj, type=Place.TOWN, town_id_equal=True)
data.update(parent=obj)
print index
transaction.commit()
Instead of using loop you should do bulk updates like
for first transaction you can replace your transaction with this Django query:
Place.objects.filter(type=Place.TOWN, town_id_equal=True).update(parent=F('second_order_division'))
For second transaction we can not apply bulk update because of again query on Place model.
for this you should do something to save hitting 'Place.objects.get(town_id=place.town_id_extra)' query each time in loop.
or can take help from this blog
Answering a more general question, one tactic to improve performance of almost any type of system is:
Minimize interaction between the dynamic parts of your system
That's it: minimize interaction through HTTP requests, database queries, etc. In your case, you are doing multiple queries to your database that can be easily reduced to fewer (perhaps one or two).