Django aggregation taking a lot of time - python

I have a model defined as below:
class Image(models.Model):
    # Stages
    STAGE_TRAIN = 'train'
    STAGE_VAL = 'val'
    STAGE_TEST = 'test'
    STAGE_TRASH = 'trash'
    STAGE_CHOICES = (
        (STAGE_TRAIN, 'Train'),
        (STAGE_VAL, 'Validation'),
        (STAGE_TEST, 'Test'),
        (STAGE_TRASH, 'Trash'),
    )
    stage = models.CharField(max_length=5, choices=STAGE_CHOICES, default=STAGE_TRAIN)
    commit = models.ForeignKey(Commit, on_delete=models.CASCADE, related_name="images", related_query_name="image")
In my database I have 170k images, and I am trying to build an endpoint that counts all the images by stage.
Currently I have something like this:
base_query = Image.objects.filter(commit=commit_uuid).only('id', 'stage')
count_query = base_query.aggregate(
    count_train=Count('id', filter=Q(stage='train')),
    count_val=Count('id', filter=Q(stage='val')),
    count_trash=Count('id', filter=Q(stage='trash')),
)
but it takes around 40 seconds. When I look at the SQL request in my shell, it looks OK:
{'sql': 'SELECT COUNT("image"."id") FILTER (WHERE "image"."stage" = \'train\') AS "count_train", COUNT("image"."id") FILTER (WHERE "image"."stage" = \'val\') AS "count_val", COUNT("image"."id") FILTER (WHERE "image"."stage" = \'trash\') AS "count_trash" FROM "image" WHERE "image"."commit_id" = \'333681ff-886a-42d0-b88a-5d38f1e9fe94\'::uuid', 'time': '42.140'}
Another strange thing is that if I change my aggregate call to
count_query = base_query.aggregate(
    count_train=Count('id', filter=Q(stage='train') & Q(commit=commit_uuid)),
    count_val=Count('id', filter=Q(stage='val') & Q(commit=commit_uuid)),
    count_trash=Count('id', filter=Q(stage='trash') & Q(commit=commit_uuid)),
)
the query is twice as fast (still 20 seconds), and when I display the SQL I see that the filter on the commit is done inside the FILTER clause.
So I have two questions:
Can I do something different to improve the speed of the query, or should I store the counts somewhere and update them each time I change an image?
I was expecting the query to filter first on the commit id and then on the stage, but I have the feeling that it's done the other way around.

1) You can add indexes on the fields, either with the index_together option
class Image(models.Model):
    class Meta:
        index_together = [['stage'], ['stage', 'commit']]
or with the indexes option (cf. https://docs.djangoproject.com/en/2.0/ref/models/options/#django.db.models.Options.indexes)
class Image(models.Model):
    class Meta:
        indexes = [models.Index(fields=['stage', 'commit'])]
2) You don't need to look up the id:
base_query = Image.objects.filter(commit=commit_uuid).only('stage')
# count images in stages
count = base_query.aggregate(
    train=Count(1, filter=Q(commit=commit_uuid) & Q(stage='train')),
    val=Count(1, filter=Q(commit=commit_uuid) & Q(stage='val')),
    trash=Count(1, filter=Q(commit=commit_uuid) & Q(stage='trash')),
)

I would try this in your model:
stage = models.CharField(max_length=5, choices=STAGE_CHOICES, default=STAGE_TRAIN, db_index=True)
By adding an index to stage, you should avoid full table scans.
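As a rough illustration of the indexing and counting advice above, here is a minimal sketch that reproduces the idea with the standard-library sqlite3 module, so it runs without a Django project. The table layout, the index name image_commit_stage and the helper count_by_stage are assumptions made for this example; the original query ran against PostgreSQL. It puts the equality-filtered column (commit_id) first in the composite index and counts with one GROUP BY, which in the ORM would probably correspond to something like Image.objects.filter(commit=commit_uuid).values('stage').annotate(n=Count('id')).

```python
import sqlite3

# Hypothetical in-memory table mirroring the question's Image model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (id INTEGER PRIMARY KEY, commit_id TEXT, stage TEXT)"
)
# Composite index with the equality-filtered column (commit_id) first.
conn.execute("CREATE INDEX image_commit_stage ON image (commit_id, stage)")

sample_rows = (
    [("c1", "train")] * 5
    + [("c1", "val")] * 2
    + [("c1", "trash")] * 1
    + [("c2", "train")] * 4
)
conn.executemany("INSERT INTO image (commit_id, stage) VALUES (?, ?)", sample_rows)


def count_by_stage(connection, commit_id):
    """Return {stage: count} for one commit using a single GROUP BY query."""
    cursor = connection.execute(
        "SELECT stage, COUNT(*) FROM image WHERE commit_id = ? GROUP BY stage",
        (commit_id,),
    )
    return dict(cursor.fetchall())


counts = count_by_stage(conn, "c1")
print(counts)
```

Stages with no rows are simply absent from the result, so the caller would need to default them to 0.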

Related

How can i change this query to ORM?

Hi, I have two models like this:
class Sample(models.Model):
    name = models.CharField(max_length=256)
    processid = models.IntegerField(default=0)
class Process(models.Model):
    sample = models.ForeignKey(Sample, blank=False, null=True, on_delete=models.SET_NULL, related_name="process_set")
    end_at = models.DateTimeField(null=True, blank=True)
and I want to join the Sample and Process models, because Sample is related to Process and I want to get the process information along with each sample.
SELECT sample.id, sample.name, process.endstat
FROM sample
INNER JOIN process
ON sample.processid = process.id
AND process.endstat = 1;
(I'm using SQLite)
I used
sample_list = sample_list.filter(process_set__endstat=1)
but it returned
SELECT sample.id, sample.name
FROM sample
INNER JOIN process
ON (sample.id = process.sample_id)
AND process.endstat = 1
This is NOT what I want.
How can I solve the problem?
This should work for you:
Process.objects.filter(end_at=1).values('sample__id','sample__name','end_at')
The .values() method returns only the selected fields.
I'm assuming sample_list = Sample.objects.
When you filter a model, only the fields defined in that model are selected. In your example, those are id, name and processid. If you want to retrieve values from related models as a single record, you need to use values or values_list. To get the desired query you have to do this:
sample_list = sample_list.filter(process_set__endstat=1).values('id', 'name', 'process_set__endstat')
By the way, Django does the JOIN on the foreign key field. So you can't get ON sample.processid = process.id, since processid is not a ForeignKey field.
Reference:
https://docs.djangoproject.com/en/4.0/ref/models/querysets/#values
I found a way to JOIN on a field that is not a foreign key in Django:
sample_list = sample_list.filter(processid__in=Process.objects.filter(endstat=1))
I used the method from
Django-queryset join without foreignkey
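To make the last approach concrete, the sketch below shows the kind of SQL an `__in` lookup against a subquery usually boils down to, run through sqlite3 (the asker's database). The table contents are made up for the example, and the exact SQL Django emits may differ; the point is only that the match happens on sample.processid rather than on the foreign key process.sample_id.

```python
import sqlite3

# Hypothetical data: two processes, only process 10 has endstat = 1.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE process (id INTEGER PRIMARY KEY, endstat INTEGER);
    CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT, processid INTEGER);
    INSERT INTO process (id, endstat) VALUES (10, 1), (11, 0);
    INSERT INTO sample (id, name, processid)
        VALUES (1, 'a', 10), (2, 'b', 11), (3, 'c', 10);
    """
)

# Select samples whose processid points at a finished process,
# without relying on a ForeignKey from sample to process.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sample "
        "WHERE processid IN (SELECT id FROM process WHERE endstat = 1) "
        "ORDER BY id"
    )
]
print(names)
```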

Django query optimization for 3 related tables

I have 4 models:
class Run(models.Model):
    start_time = models.DateTimeField(db_index=True)
    end_time = models.DateTimeField()
    chamber = models.ForeignKey(Chamber, on_delete=models.CASCADE)
    recipe = models.ForeignKey(Recipe, default=None, blank=True, null=True, on_delete=models.CASCADE)
class RunProperty(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE)
    property_name = models.CharField(max_length=50)
    property_value = models.CharField(max_length=500)
class RunValue(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE)
    run_parameter = models.ForeignKey(RunParameter, on_delete=models.CASCADE)
    value = models.FloatField(default=0)
class RunParameter(models.Model):
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    chamber = models.ForeignKey(Chamber, on_delete=models.CASCADE)
    param_name_user_defined = models.BooleanField(default=True)
A Run can have any number of RunProperty objects (usually user-defined properties, which can be custom) and a few predefined RunValue objects (such as Average Voltage, Minimum Voltage, Maximum Voltage) that are numeric values.
The RunParameter is basically just a container of parameter names (Voltage, Current, Frequency, Temperature, Impedance, Oscillation, Variability, etc.; there's a ton of them).
When I build a front end table to show each Run along with all of its "File" RunProperty (where the Run came from) and all of its "Voltage" RunValue, I first query the DB for all Run objects, then do an additional 3 queries for the Min/Max/Avg, and then another query for the File, then I build a dict on the backend to pass to the front to build the table rows:
runs = Run.objects.filter(chamber__in=chambers)
min_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Minimum Voltage")
max_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Maximum Voltage")
avg_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Average Voltage")
run_files = RunProperty.objects.filter(run__in=runs, property_name="File")
This is not such a big problem for customers with ~10 to 30 Run objects in their database, but we have one heavy-usage customer who has 3500 Run instances. Needless to say, it's far, far too slow. I'm doing 5 queries to get all the needed instances, and then I have to loop and put them together into one dict. It takes upwards of 45 seconds to do this for that one customer (and about 8 to 10 seconds for most other customers).
Is there a way that I can query my DB for all Run objects along with all of the Min/Max/Avg Voltage RunValue and the File RunProperty and return, say, a list of dicts, one for each Run along with the other objects?
I think Q queries can be used here, but I'm not quite sure HOW to use them, or if they are applicable for this scenario?
I tried this (but didn't get far):
runs = Run.objects.filter(chamber__in=chambers)
v_query = Q(run_parameter__parameter__parameter_name__icontains="Voltage")
run_values = RunValue.objects.filter(run__in=runs).filter(v_query)
run_files = RunProperty.objects.filter(run__in=runs, property_name="File")
That gets me all the related RunValue objects in 1 query, but it's still 3 queries per request. I need to optimize this much more, if possible.
I am looking for something along the lines of:
runs = Run.objects.filter(chamber__in=chambers)
.annotate(Q(run__runvalue__run_parameter__parameter__parameter_name__icontains="Voltage")
& Q(run__runproperty__property_name__icontains="File"))
I think in very broad terms (not even pseudocode) I would need a query like:
"Get all Runs, and for each Run, get all the RunValue objects related to that Run that contain ["Average", "Maximum", "Minimum"] and also all the RunProperty objects for that Run that contain "File"."
I don't know if it's possible (it sounds like it should be), and I'm not sure whether I should use Q filtering, aggregates or annotation. In broad terms, I need to get all instances of one model, along with all the related objects that point to each instance, in one query, if possible.
Example:
I have table Run with 2 instances:
R1
R2
Each Run instance has an associated RunProperty instance "File" (just a string) for each:
R1_run.dat
R2_run.dat
Each Run instance has many RunValue instances (I am using Voltage as an example, but there are 26 of them):
R1_max_v
R1_min_v
R1_avg_v
R2_max_v
R2_min_v
R2_avg_v
I would need to query the DB such that it returns (list or dict, I can work around either):
[{R1, R1_run.dat, R1_max_v, R1_min_v, R1_avg_v},
{R2, R2_run.dat, R2_max_v, R2_min_v, R2_avg_v}]
Or a 2D array even:
[[R1, R1_run.dat, R1_max_v, R1_min_v, R1_avg_v],
[R2, R2_run.dat, R2_max_v, R2_min_v, R2_avg_v]]
Is this even possible?
From a database perspective, you can get all the data you need using just a single query with a few joins:
-- This assumes that there is a primary key Run.id and
-- foreign keys RunValue.run_id and RunProperty.run_id.
-- IDs or names of min/max/avg run parameters, as well as
-- chamber ids are replaced with *_PARAMETER and CHAMBER_IDS
-- for brevity.
SELECT Run.*,
       RVmin.value AS min_value,
       RVmax.value AS max_value,
       RVavg.value AS avg_value,
       RP.property_value AS file_value
FROM Run
JOIN RunValue RVmin ON Run.id = RVmin.run_id
JOIN RunValue RVmax ON Run.id = RVmax.run_id
JOIN RunValue RVavg ON Run.id = RVavg.run_id
JOIN RunProperty RP ON Run.id = RP.run_id
WHERE
    RVmin.run_parameter_id = MIN_PARAMETER AND
    RVmax.run_parameter_id = MAX_PARAMETER AND
    RVavg.run_parameter_id = AVG_PARAMETER AND
    RP.property_name = 'File' AND
    Run.chamber_id IN (CHAMBER_IDS);
The Django way of building such joins would be something like run.runvalue_set.filter(run_parameter__parameter__parameter_name__icontains='Maximum Voltage'), where run is a Run instance.
See "following relationships backward": https://docs.djangoproject.com/en/2.2/topics/db/queries/#following-relationships-backward
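The single-query idea can also be written with conditional aggregation instead of three self-joins. The sketch below is only an illustration under simplifying assumptions: it uses sqlite3, hypothetical table names (run, run_value, run_property), and stores the parameter name directly on run_value instead of going through RunParameter and Parameter as the real models do. It returns one row per run in the 2D shape the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run (id INTEGER PRIMARY KEY);
    CREATE TABLE run_value (run_id INTEGER, param TEXT, value REAL);
    CREATE TABLE run_property (run_id INTEGER, property_name TEXT, property_value TEXT);
    INSERT INTO run (id) VALUES (1), (2);
    INSERT INTO run_value VALUES
        (1, 'Minimum Voltage', 1.0), (1, 'Maximum Voltage', 9.0), (1, 'Average Voltage', 5.0),
        (2, 'Minimum Voltage', 2.0), (2, 'Maximum Voltage', 8.0), (2, 'Average Voltage', 4.0);
    INSERT INTO run_property VALUES (1, 'File', 'R1_run.dat'), (2, 'File', 'R2_run.dat');
    """
)

# One pass over run_value per run: each CASE picks out a single parameter,
# and MAX collapses the group to that parameter's value.
query = """
    SELECT r.id,
           (SELECT p.property_value FROM run_property p
             WHERE p.run_id = r.id AND p.property_name = 'File') AS file_name,
           MAX(CASE WHEN v.param = 'Minimum Voltage' THEN v.value END) AS min_v,
           MAX(CASE WHEN v.param = 'Maximum Voltage' THEN v.value END) AS max_v,
           MAX(CASE WHEN v.param = 'Average Voltage' THEN v.value END) AS avg_v
      FROM run r
      LEFT JOIN run_value v ON v.run_id = r.id
     GROUP BY r.id
     ORDER BY r.id
"""
table = [list(row) for row in conn.execute(query)]
print(table)
```

In the ORM, a similar result can probably be reached with annotate() and aggregates that take a filter=Q(...) argument, one annotation per parameter.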
You can get this in a single query by using annotate with Min, Max and Avg.
For your problem, you can do this.
Add a related name to the ForeignKey fields.
class RunProperty(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="run_prop_name")
class RunValue(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="run_value_name")
    run_parameter = models.ForeignKey(RunParameter, on_delete=models.CASCADE)
    value = models.FloatField(default=0)
views.py
from django.db.models import Avg, Max, Min
filt = 'run_value_name__value'
query = Run.objects.annotate(run_avg=Avg(filt), run_max=Max(filt), run_min=Min(filt))
You can get all values:
for i in query:
    print(i.run_avg, i.run_max, i.run_min)
-----------Edit------------
Please note that I have added "related_name" to the RunValue model.
Let's assume you have two rows in the Run model:
1) run_1
2) run_2
and 6 entries in the RunValue model:
run = 1, run_parameter = "Avg_value", value = 50
run = 1, run_parameter = "Min_value", value = 25
run = 1, run_parameter = "Max_value", value = 75
run = 2, run_parameter = "Avg_value", value = 28
run = 2, run_parameter = "Max_value", value = 40
run = 2, run_parameter = "Min_value", value = 16
You want a dictionary something like this:
{'run_1': {'Avg_value': 50, 'Min_value': 25, 'Max_value': 75}, 'run_2': {...}}
Do this, and remember to read the documentation for select_related and prefetch_related.
rt = Run.objects.all().prefetch_related('run_value_name')
s = {}  # output dictionary
for i in rt:
    s[i.pk] = {}  # dictionary for this run
    for j in i.run_value_name.all():
        s[i.pk].update({j.run_parameter: j.value})  # update run dictionary
print(s)
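The grouping step of that loop can be tried without a database. The following is a small plain-Python sketch; the (run_id, parameter_name, value) tuples stand in for the prefetched RunValue rows, and the helper name group_by_run is made up for this example.

```python
from collections import defaultdict

# Hypothetical rows standing in for prefetched RunValue objects:
# (run id, parameter name, value).
rows = [
    (1, "Avg_value", 50), (1, "Min_value", 25), (1, "Max_value", 75),
    (2, "Avg_value", 28), (2, "Max_value", 40), (2, "Min_value", 16),
]


def group_by_run(value_rows):
    """Build {'run_<id>': {parameter_name: value}} from flat rows."""
    grouped = defaultdict(dict)
    for run_id, parameter_name, value in value_rows:
        grouped[f"run_{run_id}"][parameter_name] = value
    return dict(grouped)


result = group_by_run(rows)
print(result)
```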
----------Addition-----------
Check the number of database hits with this code (connection.queries is only populated when DEBUG = True).
from django.db import connection, reset_queries
print(len(connection.queries))
reset_queries()

Perform lookup and update within a single Django query

I have two models: MetaModel and RelatedModel. I want to include the result of a RelatedModel lookup within a MetaModel query, and I'd like to do this within a single DB call.
I've tried to define a 'subquery' QuerySet for use in the main query, but that hasn't worked - it's still making two queries to complete the operation.
Note: I can't use a traditional ForeignKey relationship because the profile_id field is not unique. Uniqueness is a combination of profile_id and channel. This is an aggregation table and profile_id is not guaranteed to be unique across multiple third-party channels.
Any suggestions?
Models:
class Channel(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(
        max_length=25,
    )
class MetaModel(models.Model):
    profile_id = models.IntegerField()
    channel = models.ForeignKey(Channel)
    metadata = models.TextField()
class RelatedModel(models.Model):
    related_id = models.IntegerField()
    profile_id = models.IntegerField()
    channel = models.ForeignKey(Channel)
Dummy data
channel = Channel(name="Web site A")
channel.save()
sample_meta = MetaModel(profile_id=1234, channel=channel)
sample_related = RelatedModel(profile_id=1234, related_id=5678, channel=channel)
Query:
# Create a queryset to filter down to the single record we need the `profile_id` for
# I've limited it to the only field we need via a `values_list` operation
related_qs = RelatedModel.objects.filter(
    related_id=5678,
    channel=channel
).values_list("profile_id", flat=True)
# I'm doing an update_or_create as there is other data to store, not included for brevity
obj, created = MetaModel.objects.update_or_create(
    profile_id=related_qs.first(),  # <<< This var is the dynamic part of the query
    channel=channel,
    defaults={"metadata": "Metadata is added to a new or existing record."}
)
Regarding your note on uniqueness, you can use the unique_together option in Django, as described in the documentation.
class MetaModel(models.Model):
    profile_id = models.ForeignKey(RelatedModel)
    channel = models.ForeignKey(Channel)
    metadata = models.TextField()
    class Meta:
        unique_together = ('profile_id', 'channel')
Then you can change your query accordingly, which should solve your problem.
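For the "single DB call" part of the question, one option worth sketching is to let the database resolve the lookup as a subquery inside the UPDATE. The example below demonstrates that statement shape with sqlite3; the table names meta and related are hypothetical stand-ins for the two models. It covers only the update half of update_or_create, so creating a missing row would still need a separate step. In Django, filtering with profile_id__in=related_qs and then calling .update(...) should produce a statement of roughly this form, though that is an assumption about the generated SQL rather than something confirmed by the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE meta (profile_id INTEGER, channel_id INTEGER, metadata TEXT,
                       UNIQUE (profile_id, channel_id));
    CREATE TABLE related (related_id INTEGER, profile_id INTEGER, channel_id INTEGER);
    INSERT INTO meta VALUES (1234, 1, 'old'), (1234, 2, 'other channel');
    INSERT INTO related VALUES (5678, 1234, 1);
    """
)

# Lookup and update in one statement: the subquery finds the profile_id
# for related_id 5678 on channel 1, and the outer UPDATE applies the change.
cursor = conn.execute(
    """
    UPDATE meta
       SET metadata = ?
     WHERE channel_id = ?
       AND profile_id IN (SELECT profile_id FROM related
                           WHERE related_id = ? AND channel_id = ?)
    """,
    ("new", 1, 5678, 1),
)
updated = cursor.rowcount
rows = conn.execute(
    "SELECT profile_id, channel_id, metadata FROM meta ORDER BY channel_id"
).fetchall()
print(updated, rows)
```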

Django ORM for given group by SQL query with aggregation method sum and count

I have the Django model given below:
class ABC(models.Model):
    user = models.ForeignKey(DEF)
    name = models.CharField()
    phone_num = models.CharField()
    date = models.DateTimeField(auto_now=True)
    amount = models.IntegerField()
I want to perform the query below using the Django ORM.
select *, sum(amount), count(date) from ABC group by phone_num;
I tried the code below, but it does not work.
ABC.objects.all().annotate(count = Count("phone_num")).order_by("phone_num")
I'm not sure whether it is possible to grab the data you mentioned above (SELECT *, sum(amount), count(date)) with a simple GROUP BY; that probably needs a JOIN. At least you could try the variant below and then match the results against ABC.objects.all() by phone_num:
ABC.objects.values("phone_num").order_by().annotate(count=Count("date"), amount=Sum("amount"))
Notes:
values('phone_num') - for the GROUP BY phone_num clause.
order_by() - clears any default ordering, which could otherwise affect the grouping; if the model has no default ordering, you can remove it.
P.S.
Also try running the query below:
ABC.objects.all().values("phone_num").annotate(count=Count("date"), amount=Sum("amount"))
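For reference, the values(...).annotate(...) pattern above is meant to produce a plain GROUP BY. The sketch below runs that kind of SQL directly with sqlite3 on made-up rows; the table name abc and the sample data are assumptions for the example, and the SQL Django actually generates may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE abc (id INTEGER PRIMARY KEY, name TEXT, phone_num TEXT,
                      date TEXT, amount INTEGER);
    INSERT INTO abc (name, phone_num, date, amount) VALUES
        ('a', '111', '2020-01-01', 10),
        ('a', '111', '2020-01-02', 15),
        ('b', '222', '2020-01-03', 7);
    """
)

# One row per phone number with the row count and the summed amount.
summary = conn.execute(
    "SELECT phone_num, COUNT(date) AS count, SUM(amount) AS amount "
    "FROM abc GROUP BY phone_num ORDER BY phone_num"
).fetchall()
print(summary)
```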
Update
You could use the following loop to grab the desired data, as there is no pure Django ORM solution:
data = (
    dict(o, data=ABC.objects.filter(phone_num=o['phone_num']).first())
    for o in ABC.objects
    .values("phone_num")
    .order_by()
    .annotate(count=Count("date"), amount=Sum("amount"))
)
# now you can access your data in the following way:
for item in data:
    phone_num = item['phone_num']
    count = item['count']
    amount = item['amount']
    id = item['data'].id
    name = item['data'].name
    # do other stuff...
Note
data is built with a generator expression, so it runs one extra query per phone number as you iterate.

How can I improve this many-to-many Django ORM query and model set?

I have a Django query and some Python code that I'm trying to optimize because 1) it's ugly and it's not as performant as some SQL I could use to write it, and 2) because the hierarchical regrouping of the data looks messy to me.
So,
1. Is it possible to improve this to be a single query?
2. How can I improve my Python code to be more Pythonic?
Background
This is for a photo gallery system. The particular view is attempting to display the thumbnails for all photos in a gallery. Each photo is statically sized several times to avoid dynamic resizing, and I would like to also retrieve the URLs and "Size Type" (e.g. Thumbnail, Medium, Large) of each sizing so that I can Lightbox the alternate sizes without hitting the database again.
Entities
I have 5 models that are of relevance:
class Gallery(models.Model):
    Photos = models.ManyToManyField('Photo', through = 'GalleryPhoto', blank = True, null = True)
class GalleryPhoto(models.Model):
    Gallery = models.ForeignKey('Gallery')
    Photo = models.ForeignKey('Photo')
    Order = models.PositiveIntegerField(default = 1)
class Photo(models.Model):
    GUID = models.CharField(max_length = 32)
class PhotoSize(models.Model):
    Photo = models.ForeignKey('Photo')
    PhotoSizing = models.ForeignKey('PhotoSizing')
    PhotoURL = models.CharField(max_length = 1000)
class PhotoSizing(models.Model):
    SizeName = models.CharField(max_length = 20)
    Width = models.IntegerField(default = 0, null = True, blank = True)
    Height = models.IntegerField(default = 0, null = True, blank = True)
    Type = models.CharField(max_length = 10, null = True, blank = True)
So, the rough idea is that I would like to get all Photos in a Gallery through GalleryPhoto, and for each Photo, I want to get all the PhotoSizes, and I would like to be able to loop through and access all this data through a dictionary.
A rough sketch of the SQL might look like this:
Select PhotoSize.PhotoURL
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Where Gallery.id = 5
Order By GalleryPhoto.Order Asc
I would like to turn this into a list that has a schema like this:
(
photo: {
'guid': 'abcdefg',
'sizes': {
'Thumbnail': 'http://mysite/image1_thumb.jpg',
'Large': 'http://mysite/image1_full.jpg',
more sizes...
}
},
more photos...
)
I currently have the following Python code (it doesn't exactly mimic the schema above, but it'll do for an example).
import operator
gallery_photos = [(photo.Photo_id, photo.Order) for photo in GalleryPhoto.objects.filter(Gallery = gallery)]
photo_list = list(PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(Photo__id__in=[gallery_photo[0] for gallery_photo in gallery_photos]))
photos = {}
for photo in photo_list:
    order = 1
    for gallery_photo in gallery_photos:
        if gallery_photo[0] == photo.Photo.id:
            order = gallery_photo[1]  # this gets the order column value
    guid = photo.Photo.GUID
    if guid not in photos:
        photos[guid] = { 'Photo': photo.Photo, 'Thumbnail': None, 'Sizes': [], 'Order': order }
    photos[guid]['Sizes'].append(photo)
sorted_photos = sorted(photos.values(), key=operator.itemgetter('Order'))
The Actual Question, Part 1
So, my question is first of all whether I can do my many-to-many query better so that I don't have to do the double query for both gallery_photos and photo_list.
The Actual Question, Part 2
I look at this code and I'm not too thrilled with the way it looks. I sure hope there's a better way to group up a hierarchical queryset result by a column name into a dictionary. Is there?
When you have an SQL query that is hard to write using the ORM, you can use PostgreSQL views (I'm not sure about MySQL). In this case you will have:
Raw SQL like:
CREATE VIEW photo_urls AS
Select
    Photo.id, -- pseudo primary key for the Django mapper
    Gallery.id as gallery_id,
    PhotoSize.PhotoURL as photo_url
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Order By GalleryPhoto.Order Asc
Django model like:
class PhotoUrls(models.Model):
    class Meta:
        managed = False
        db_table = 'photo_urls'
    gallery_id = models.IntegerField()
    photo_url = models.CharField(max_length=1000)
ORM queryset like:
PhotoUrls.objects.filter(gallery_id=5)
Hope it helps.
Django has some built-in functions that will clean up the way your code looks. They will result in subqueries, so I guess it depends on performance. https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.values
gallery_photos = GalleryPhoto.objects.filter(Gallery=gallery).values('Photo_id', 'Order')
photo_queryset = PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(
    Photo__id__in=gallery_photos.values_list('Photo_id', flat=True))
Calling list() evaluates the queryset immediately, which might affect performance if you have a lot of data.
Additionally, there should be a rather easy way to get rid of if gallery_photo[0] == photo.Photo.id: it seems this can be resolved with another query that gets gallery_photos for all photos.
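One way to drop that inner comparison loop without another query is to turn the (photo id, order) pairs into a dictionary once and look each photo up by id. This is a plain-Python sketch with made-up ids; it swaps in a dictionary lookup rather than the extra query suggested above, and the names order_by_photo_id and size_photo_ids are hypothetical.

```python
# (Photo_id, Order) pairs as built in the question's gallery_photos list.
gallery_photos = [(101, 2), (102, 1), (103, 3)]

# Build the lookup once instead of scanning the list for every photo size.
order_by_photo_id = dict(gallery_photos)

# Photo ids as they might appear on the PhotoSize rows (999 has no order).
size_photo_ids = [101, 101, 102, 103, 999]

# Fall back to 1, the same default the original loop uses.
orders = [order_by_photo_id.get(photo_id, 1) for photo_id in size_photo_ids]
print(orders)
```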
You can retrieve all the data with a single query and get a list of dictionaries. Then you can work with that list directly or build a new dictionary from it to form your final structure. You can use reverse relations when filtering and when selecting specific columns from a table. So:
Let x be your selected Gallery...
GalleryPhoto.objects.filter(Gallery=x).values('Order', 'Photo__GUID', 'Photo__photosize__PhotoURL', 'Photo__photosize__PhotoSizing__SizeName', 'Photo__photosize__PhotoSizing__Width', 'Photo__photosize__PhotoSizing__Height', 'Photo__photosize__PhotoSizing__Type')
Using Photo__ will create an inner join to the Photo table, while Photo__photosize__ will create an inner join to PhotoSize (via the reverse relation) and Photo__photosize__PhotoSizing__ will inner join to PhotoSizing.
You get a list of dictionaries:
[{'Order':...., 'Photo__GUID': ..., 'Photo__photosize__PhotoURL':....., 'Photo__photosize__PhotoSizing__SizeName':...., ...}, {'Order':...., 'Photo__GUID': ..., 'Photo__photosize__PhotoURL':....., 'Photo__photosize__PhotoSizing__SizeName':...., ...},....]
You can select the rows that you need and get all values as a list of dictionaries. Then you can write a loop or an iterator that goes through this list and creates a new dictionary that groups your data.
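The grouping loop mentioned at the end could look roughly like this. It is a plain-Python sketch on hand-written rows; the short keys (GUID, SizeName, PhotoURL) are simplifications chosen for readability, since values() would return the full lookup paths as keys, and group_photos is a hypothetical helper name.

```python
def group_photos(rows):
    """Group flat rows (one per photo size) into one dict per photo, sorted by order."""
    photos = {}
    for row in rows:
        photo = photos.setdefault(
            row["GUID"], {"guid": row["GUID"], "order": row["Order"], "sizes": {}}
        )
        photo["sizes"][row["SizeName"]] = row["PhotoURL"]
    return sorted(photos.values(), key=lambda photo: photo["order"])


# Hypothetical flat rows, one per (photo, size) combination.
rows = [
    {"Order": 2, "GUID": "bbb", "SizeName": "Thumbnail", "PhotoURL": "http://mysite/image2_thumb.jpg"},
    {"Order": 1, "GUID": "aaa", "SizeName": "Thumbnail", "PhotoURL": "http://mysite/image1_thumb.jpg"},
    {"Order": 1, "GUID": "aaa", "SizeName": "Large", "PhotoURL": "http://mysite/image1_full.jpg"},
]
gallery = group_photos(rows)
print(gallery)
```

The result is a list with one entry per photo, each carrying a sizes dictionary keyed by size name, which is close to the schema sketched in the question.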
