I have 4 models:
class Run(models.Model):
    start_time = models.DateTimeField(db_index=True)
    end_time = models.DateTimeField()
    chamber = models.ForeignKey(Chamber, on_delete=models.CASCADE)
    recipe = models.ForeignKey(Recipe, default=None, blank=True, null=True, on_delete=models.CASCADE)

class RunProperty(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE)
    property_name = models.CharField(max_length=50)
    property_value = models.CharField(max_length=500)

class RunValue(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE)
    run_parameter = models.ForeignKey(RunParameter, on_delete=models.CASCADE)
    value = models.FloatField(default=0)

class RunParameter(models.Model):
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    chamber = models.ForeignKey(Chamber, on_delete=models.CASCADE)
    param_name_user_defined = models.BooleanField(default=True)
A Run can have any number of RunProperty (usually user defined properties, can be custom), and a few predefined RunValue (such as Average Voltage, Minimum Voltage, Maximum Voltage) that are numeric values.
The RunParameter model is basically just a container of parameter names (Voltage, Current, Frequency, Temperature, Impedance, Oscillation, Variability, etc.; there are a ton of them).
When I build a front-end table to show each Run along with all of its "File" RunProperty (where the Run came from) and all of its "Voltage" RunValue, I first query the DB for all Run objects, then do 3 additional queries for the Min/Max/Avg, then another query for the File, and then I build a dict on the backend to pass to the front end to build the table rows:
runs = Run.objects.filter(chamber__in=chambers)
min_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Minimum Voltage")
max_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Maximum Voltage")
avg_v_run_values = RunValue.objects.filter(run__in=runs, run_parameter__parameter__parameter_name__icontains="Average Voltage")
run_files = RunProperty.objects.filter(run__in=runs, property_name="File")
This is not such a big problem for customers with ~10 to 30 Run objects in their database, but we have one heavy-usage customer with 3500 Run instances. Needless to say, it's far, far too slow. I'm doing 5 queries to get all the needed instances, and then I have to loop and put them together into one dict. It takes upwards of 45 seconds to do this for that one customer (and about 8 or 10 for most other customers).
Is there a way that I can query my DB for all Run objects along with all of the Min/Max/Avg Voltage RunValue and the File RunProperty and return, say, a list of dicts, one for each Run along with the other objects?
I think Q queries can be used here, but I'm not quite sure HOW to use them, or if they are applicable for this scenario?
I tried this (but didn't get far):
runs = Run.objects.filter(chamber__in=chambers)
v_query = Q(run_parameter__parameter__parameter_name__icontains="Voltage")
run_values = RunValue.objects.filter(run__in=runs).filter(v_query)
run_files = RunProperty.objects.filter(run__in=runs, property_name="File")
That gets me all the related RunValue objects in one query, but it's still 3 queries in total. I need to optimize this much more, if possible.
I am looking for something along the lines of:
runs = Run.objects.filter(chamber__in=chambers)
.annotate(Q(run__runvalue__run_parameter__parameter__parameter_name__icontains="Voltage")
& Q(run__runproperty__property_name__icontains="File"))
I think in very broad terms (not even pseudocode) I would need a query like:
"Get all Runs, and for each Run, get all the RunValue objects related to that Run that contain ["Average", "Maximum", "Minimum"] and also all the RunProperty objects for that Run that contain "File".
I don't know if it's possible (sounds like it should be), and I'm not sure whether I should use Q filtering, aggregates or annotation. In broad terms, I need to get all instances of one model, along with all of their related objects, in one query, if possible.
Example:
I have table Run with 2 instances:
R1
R2
Each Run instance has an associated RunProperty instance "File" (just a string) for each:
R1_run.dat
R2_run.dat
Each Run instance has many RunValue instances (I am using Voltage as an example, but there are 26 of them):
R1_max_v
R1_min_v
R1_avg_v
R2_max_v
R2_min_v
R2_avg_v
I would need to query the DB such that it returns (list or dict, I can work around either):
[{R1, R1_run.dat, R1_max_v, R1_min_v, R1_avg_v},
{R2, R2_run.dat, R2_max_v, R2_min_v, R2_avg_v}]
Or a 2D array even:
[[R1, R1_run.dat, R1_max_v, R1_min_v, R1_avg_v],
[R2, R2_run.dat, R2_max_v, R2_min_v, R2_avg_v]]
Is this even possible?
From a database perspective, you can get all the data you need using just a single query with a few joins:
-- This assumes that there is a primary key Run.id and
-- foreign keys RunValue.run_id and RunProperty.run_id.
-- IDs or names of min/max/avg run parameters, as well as
-- chamber ids are replaced with *_PARAMETER and CHAMBER_IDS
-- for brevity.
SELECT Run.*,
RVmin.value AS min_value,
RVmax.value AS max_value,
RVavg.value AS avg_value,
RP.value AS file_value
FROM Run
JOIN RunValue RVmin ON Run.id = RVmin.run_id
JOIN RunValue RVmax ON Run.id = RVmax.run_id
JOIN RunValue RVavg ON Run.id = RVavg.run_id
JOIN RunProperty RP ON Run.id = RP.run_id
WHERE
RVmin.run_parameter = MIN_PARAMETER AND
RVmax.run_parameter = MAX_PARAMETER AND
RVavg.run_parameter = AVG_PARAMETER AND
RP.property_name = 'File' AND
Run.chamber IN (CHAMBER_IDS);
The Django way of building such joins would be something like run.runvalue_set.filter(run_parameter__parameter__parameter_name__icontains='Maximum Voltage').
See "following relationships backward": https://docs.djangoproject.com/en/2.2/topics/db/queries/#following-relationships-backward
You can get this in one query by using annotate with Min, Max and Avg.
For your problem, you can do the following.
First, add a related_name to the ForeignKey fields:
class RunProperty(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="run_prop_name")

class RunValue(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="run_value_name")
    run_parameter = models.ForeignKey(RunParameter, on_delete=models.CASCADE)
    value = models.FloatField(default=0)
views.py
from django.db.models import Avg, Max, Min

filt = 'run_value_name__value'
query = Run.objects.annotate(run_avg=Avg(filt), run_max=Max(filt), run_min=Min(filt))
You can get all values:
for i in query:
    print(i.run_avg, i.run_max, i.run_min)
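Note that the annotation above aggregates over every RunValue of a Run, not just the voltage ones. If the aggregates should only cover the voltage parameters, the aggregate functions accept a filter argument; this is a sketch, with the lookup path assuming the parameter naming from the question:

from django.db.models import Avg, Max, Min, Q

volt = Q(run_value_name__run_parameter__parameter__parameter_name__icontains="Voltage")
query = Run.objects.annotate(
    run_avg=Avg('run_value_name__value', filter=volt),
    run_max=Max('run_value_name__value', filter=volt),
    run_min=Min('run_value_name__value', filter=volt),
)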
-----------Edit------------
Please note that I have added a "related_name" in the RunValue model above.
Let's assume you have two rows in the Run model:
1) run_1
2) run_2
and in the RunValue model, 6 entries:
run = 1, run_parameter = "Avg_value", value = 50
run = 1, run_parameter = "Min_value", value = 25
run = 1, run_parameter = "Max_value", value = 75
run = 2, run_parameter = "Avg_value", value = 28
run = 2, run_parameter = "Max_value", value = 40
run = 2, run_parameter = "Min_value", value = 16
You want a dictionary something like this:
{'run_1': {'Avg_value': 50, 'Min_value': 25, 'Max_value': 75}, 'run_2': {...}}
Do this (and remember to read the select_related and prefetch_related documentation):
rt = Run.objects.all().prefetch_related('run_value_name')

s = {}  # output dictionary
for i in rt:
    s[i.pk] = {}  # per-run dictionary
    for j in i.run_value_name.all():
        s[i.pk].update({j.run_parameter: j.value})  # update run dictionary
print(s)
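A possible refinement (a sketch, reusing the related_name above): restrict the prefetch to just the voltage-related RunValue rows with a Prefetch object, so the Python loop only touches the rows the table actually needs:

from django.db.models import Prefetch

rt = Run.objects.prefetch_related(
    Prefetch(
        'run_value_name',
        queryset=RunValue.objects.filter(
            run_parameter__parameter__parameter_name__icontains="Voltage"
        ).select_related('run_parameter__parameter'),
    )
)

With this, i.run_value_name.all() in the loop above iterates over only the prefetched voltage values, still using the same two queries.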
----------Addition-----------
Check the number of database hits made by this code:
from django.db import connection, reset_queries
print(len(connection.queries))
reset_queries()
Related
I have tried to access joined data in my Django template, but nothing works; a little help is deeply appreciated.
class Model1(models.Model):
    project_code = models.ForeignKey('AnotherTable1', on_delete=models.CASCADE)
    shot_code = models.CharField(primary_key=True, max_length=50)
    shot_name = models.CharField(max_length=100)
    sequence_name = models.IntegerField()

class Model2(models.Model):
    vendor_id = models.ForeignKey('AnotherTable2', on_delete=models.CASCADE)
    shot_code = models.ForeignKey(Model1, on_delete=models.CASCADE)
    shot_rate = models.IntegerField()
    shot_bid = models.IntegerField()
I wanted to display the result of:
SELECT * FROM Model1 a, Model2 b
WHERE a.shot_code = b.shot_code AND a.project_code = 'XXX'
and the columns to be accessed in the template are:
1. Shot code
2. Shot name
3. Sequence name
4. Shot rate
5. Shot bid
6. Vendor id
I tried the following method:
1. Using select_related
result: only the values of Model2 are displayed; unable to access Model1's data
error: 'QuerySet' object has no attribute model1
Do you expect this to return one or multiple instances? The best way to do this would still be with select_related, e.g.:
Model2.objects.filter(shot_code__project_code=<your value>).select_related("shot_code")
This gives a queryset with multiple Model2 instances; add .get() at the end if you expect only a single instance.
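For completeness, a sketch of how such a queryset might be consumed in a view and template; the view and template names here are made up:

from django.shortcuts import render

def shot_list(request):
    shots = Model2.objects.filter(shot_code__project_code="XXX").select_related("shot_code")
    return render(request, "shots.html", {"shots": shots})

# In shots.html, each Model2 row exposes the joined Model1 row through the FK:
# {% for s in shots %}
#   {{ s.shot_code.shot_code }} - {{ s.shot_code.shot_name }} - {{ s.shot_code.sequence_name }}
#   {{ s.shot_rate }} {{ s.shot_bid }} {{ s.vendor_id_id }}
# {% endfor %}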
Alternatively, you can add .values() and, instead of operating on two related models, get a dict-like join result (although note that you won't be able to reuse shot_code directly, as it would clash with your foreign key name):
from django.db.models import F

Model2.objects.filter(shot_code__project_code=<your value>).annotate(
    sequence_name=F("shot_code__sequence_name"),
    shot_name=F("shot_code__shot_name"),
    real_shot_code=F("shot_code__shot_code"),
).values(
    "sequence_name", "shot_name", "real_shot_code", "shot_rate", "shot_bid", "vendor_id"
)
And as always, I recommend refraining from naming your ForeignKey vendor_id, since the real id will then live under vendor_id_id and the naming becomes a bit unclear.
You can use an object from a Model1 queryset to query Model2 and get the data; see the example below:
model1obj = Model1.objects.get(project_code = "XXX")
model2obj = Model2.objects.get(shot_code = model1obj)
# now access all the fields using model1obj and model2obj
I have a model defined as below:
class Image(models.Model):
    # Stages
    STAGE_TRAIN = 'train'
    STAGE_VAL = 'val'
    STAGE_TEST = 'test'
    STAGE_TRASH = 'trash'
    STAGE_CHOICES = (
        (STAGE_TRAIN, 'Train'),
        (STAGE_VAL, 'Validation'),
        (STAGE_TEST, 'Test'),
        (STAGE_TRASH, 'Trash'),
    )
    stage = models.CharField(max_length=5, choices=STAGE_CHOICES, default=STAGE_TRAIN)
    commit = models.ForeignKey(Commit, on_delete=models.CASCADE, related_name="images", related_query_name="image")
In my database I have 170k images, and I'm trying to build an endpoint that will count all the images by stage.
Currently I have something like this:
base_query = Image.objects.filter(commit=commit_uuid).only('id', 'stage')
count_query = base_query.aggregate(count_train=Count('id', filter=Q(stage='train')),
count_val=Count('id', filter=Q(stage='val')),
count_trash=Count('id', filter=Q(stage='trash')))
but it takes around 40 seconds, and when I look at the SQL query in my shell I see something that looks OK:
{'sql': 'SELECT COUNT("image"."id") FILTER (WHERE "image"."stage" = \'train\') AS "count_train", COUNT("image"."id") FILTER (WHERE "image"."stage" = \'val\') AS "count_val", COUNT("image"."id") FILTER (WHERE "image"."stage" = \'trash\') AS "count_trash" FROM "image" WHERE "image"."commit_id" = \'333681ff-886a-42d0-b88a-5d38f1e9fe94\'::uuid', 'time': '42.140'}
Another strange thing is that if I change my aggregate call to
count_query = base_query.aggregate(count_train=Count('id', filter=Q(stage='train')&Q(commit=commit_uuid)),
count_val=Count('id', filter=Q(stage='val')&Q(commit=commit_uuid)),
count_trash=Count('id', filter=Q(stage='trash')&Q(commit=commit_uuid)))
When I do that, the query is twice as fast (still 20 seconds), and when I display the SQL I see that the filter on the commit is done inside the FILTER clause.
So I have two questions:
Can I do something different to improve the speed of the query or should I store the count somewhere and change the values each time I change an image ?
I was expecting the query to filter first on the commit id and then on the stage, but I have the feeling that it's done the other way around.
1) You can add the field indices either with the index_together option
class Image(models.Model):
    class Meta:
        index_together = [['stage'], ['stage', 'commit']]
or the indexes option (cf https://docs.djangoproject.com/en/2.0/ref/models/options/#django.db.models.Options.indexes)
class Image(models.Model):
    class Meta:
        indexes = [models.Index(fields=['stage', 'commit'])]
2) You don't need to look up the id:
base_query = Image.objects.filter(commit=commit_uuid).only('stage')
# count images in stages
count = base_query.aggregate(train=Count(1, filter=Q(commit=commit_uuid) & Q(stage='train')),
val=Count(1, filter=Q(commit=commit_uuid) & Q(stage='val')),
trash=Count(1, filter=Q(commit=commit_uuid) & Q(stage='trash')))
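Another option worth measuring (a sketch; same data, one grouped query instead of several FILTER clauses) is values() plus annotate, which produces a single GROUP BY on stage:

from django.db.models import Count

counts = (
    Image.objects.filter(commit=commit_uuid)
    .values('stage')
    .annotate(n=Count('id'))
)
# -> [{'stage': 'train', 'n': ...}, {'stage': 'val', 'n': ...}, ...]

Whether this beats the FILTER variant depends on your data and indexes, so it's worth comparing both with EXPLAIN.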
I would try this in your model:
stage = models.CharField(max_length=5, choices=STAGE_CHOICES, default=STAGE_TRAIN, db_index=True)
By adding an index to stage, you should avoid full table scans.
Models.py:
class ScoringModel(models.Model):
    title = models.CharField(max_length=64)

class PredictedScore(models.Model):
    job = models.ForeignKey('Job')
    candidate = models.ForeignKey('Candidate')
    model_used = models.ForeignKey('ScoringModel')
    score = models.FloatField()
    created_at = models.DateField(auto_now_add=True)
    modified_at = models.DateTimeField(auto_now=True)
serializers.py:
class MatchingJobsSerializer(serializers.ModelSerializer):
    job_title = serializers.CharField(source='job.title', read_only=True)

    class Meta:
        model = PredictedScore
        fields = ('job', 'job_title', 'score', 'model_used', 'candidate')
To fetch the top 3 jobs, I tried the following code:
import heapq

queryset = PredictedScore.objects.filter(candidate=candidate)
jobs_serializer = MatchingJobsSerializer(queryset, many=True)
jobs = jobs_serializer.data
top_3_jobs = heapq.nlargest(3, jobs, key=lambda item: item['score'])
It's giving me the top 3 jobs for the whole set, which contains all the models.
I want to fetch the jobs with top 3 scores for a given candidate for each model used.
So, it should return the top 3 matching jobs with each ML model for the given candidate.
I followed this answer: https://stackoverflow.com/a/2076665/2256258 . It gives the latest entry of cake for each bakery, but I need the top 3.
I read about annotations in the Django ORM but couldn't get far with this issue. I want to use DRF serializers for this operation. This is a read-only operation.
I am using Postgres as database.
What should be the Django ORM query to perform this operation?
Make the database do the work. You don't need annotations either as you want the objects, not the values or manipulated values.
To get a set of all scores for a candidate (not split by model_used) you would do:
queryset = candidate.predictedscore_set.order_by('-score')[:3]
jobs_serializer = MatchingJobsSerializer(queryset, many=True)
jobs = jobs_serializer.data
What you're proposing isn't particularly well suited to the Django ORM, annoyingly - I think you may need to make separate queries for each model_used. A nicer solution (untested for this example) is to hook Q queries together, as per this answer.
The example there uses tags, but I think the idea holds:
from django.db.models import Q

# let's get a distinct list of the models used
all_models_used = PredictedScore.objects.values_list('model_used', flat=True).distinct()

q_objects = Q()  # create an empty Q object to start with
for m in all_models_used:
    q_objects |= Q(model_used=m)  # 'or' the Q objects together
    # (note: a Q object can't be sliced, so limiting to 3 per model needs a per-group approach)

queryset = PredictedScore.objects.filter(q_objects)
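Since the question mentions Postgres, one way to get the top 3 per model_used in a single query is a window function. This is a sketch, not the approach above; note that filtering on the window annotation inside the same queryset only works on Django 4.2+, so on older versions the final cut has to happen in Python:

from django.db.models import F, Window
from django.db.models.functions import Rank

ranked = PredictedScore.objects.filter(candidate=candidate).annotate(
    rank=Window(
        expression=Rank(),
        partition_by=[F('model_used')],
        order_by=F('score').desc(),
    )
)

top_jobs = ranked.filter(rank__lte=3)              # Django 4.2+ only
# top_jobs = [p for p in ranked if p.rank <= 3]    # fallback for older versions

jobs = MatchingJobsSerializer(top_jobs, many=True).data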
I have 2 models (sett, data_parsed), and data_parsed has a foreign key to sett.
class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.ForeignKey(mapt, related_name='sett_block')
    username = models.ForeignKey(mapt, related_name='sett_username')
    ts = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

class data_parsed(models.Model):
    setid = models.ForeignKey(sett, related_name='data_parsed_setid', primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    time = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)
The data_parsed model should have the same number of rows, but there is a possibility that they get out of "sync".
To avoid this, I basically do these two steps:
Check if sett.objects.all().count() == data_parsed.objects.all().count()
This works great for a fast check, and it takes literally seconds in 1 million rows.
If they are not the same, I check all of the sett model's pks, excluding the ones already found in data_parsed:
sett.objects.select_related().exclude(
    setid__in=data_parsed.objects.all().values_list('setid', flat=True)).iterator()
Basically what this does is select all the objects in sett, excluding all the setids already in data_parsed. This method "works", but it takes around 4 hours for 1 million rows.
Is there a faster way to do this?
Finding setts without data_parsed using the reverse relation:
sett.objects.filter(data_parsed_setid__isnull=True)
If I am getting it right, you are trying to keep a list of processed objects in another model by setting a foreign key.
You have only one data_parsed object for every sett object, so a many-to-one relationship is not needed. You could use a one-to-one relationship instead and then check which objects have that field empty (see the sketch below).
With a foreign key you could try to filter using the reverse query, but that is at the object level, so I doubt that works.
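A minimal sketch of that one-to-one idea, keeping the existing field names (the related_name 'parsed' is made up):

class data_parsed(models.Model):
    setid = models.OneToOneField(sett, related_name='parsed', primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    time = models.IntegerField()

# sett rows that have no parsed counterpart yet:
unparsed = sett.objects.filter(parsed__isnull=True)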
I have a Django query and some Python code that I'm trying to optimize, because 1) it's ugly and not as performant as the equivalent SQL I could write by hand, and 2) the hierarchical regrouping of the data looks messy to me.
So,
1. Is it possible to improve this to be a single query?
2. How can I improve my Python code to be more Pythonic?
Background
This is for a photo gallery system. The particular view is attempting to display the thumbnails for all photos in a gallery. Each photo is statically sized several times to avoid dynamic resizing, and I would like to also retrieve the URLs and "Size Type" (e.g. Thumbnail, Medium, Large) of each sizing so that I can Lightbox the alternate sizes without hitting the database again.
Entities
I have 5 models that are of relevance:
class Gallery(models.Model):
    Photos = models.ManyToManyField('Photo', through='GalleryPhoto', blank=True, null=True)

class GalleryPhoto(models.Model):
    Gallery = models.ForeignKey('Gallery')
    Photo = models.ForeignKey('Photo')
    Order = models.PositiveIntegerField(default=1)

class Photo(models.Model):
    GUID = models.CharField(max_length=32)

class PhotoSize(models.Model):
    Photo = models.ForeignKey('Photo')
    PhotoSizing = models.ForeignKey('PhotoSizing')
    PhotoURL = models.CharField(max_length=1000)

class PhotoSizing(models.Model):
    SizeName = models.CharField(max_length=20)
    Width = models.IntegerField(default=0, null=True, blank=True)
    Height = models.IntegerField(default=0, null=True, blank=True)
    Type = models.CharField(max_length=10, null=True, blank=True)
So, the rough idea is that I would like to get all Photos in a Gallery through GalleryPhoto, and for each Photo, I want to get all the PhotoSizes, and I would like to be able to loop through and access all this data through a dictionary.
A rough sketch of the SQL might look like this:
Select PhotoSize.PhotoURL
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Where Gallery.id = 5
Order By GalleryPhoto.Order Asc
I would like to turn this into a list that has a schema like this:
(
photo: {
'guid': 'abcdefg',
'sizes': {
'Thumbnail': 'http://mysite/image1_thumb.jpg',
'Large': 'http://mysite/image1_full.jpg',
more sizes...
}
},
more photos...
)
I currently have the following Python code (it doesn't exactly mimic the schema above, but it'll do for an example).
import operator

gallery_photos = [(photo.Photo_id, photo.Order) for photo in GalleryPhoto.objects.filter(Gallery=gallery)]
photo_list = list(PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(
    Photo__id__in=[gallery_photo[0] for gallery_photo in gallery_photos]))

photos = {}
for photo in photo_list:
    order = 1
    for gallery_photo in gallery_photos:
        if gallery_photo[0] == photo.Photo.id:
            order = gallery_photo[1]  # this gets the Order column value
    guid = photo.Photo.GUID
    if guid not in photos:
        photos[guid] = {'Photo': photo.Photo, 'Thumbnail': None, 'Sizes': [], 'Order': order}
    photos[guid]['Sizes'].append(photo)

sorted_photos = sorted(photos.values(), key=operator.itemgetter('Order'))
The Actual Question, Part 1
So, my question is first of all whether I can do my many-to-many query better so that I don't have to do the double query for both gallery_photos and photo_list.
The Actual Question, Part 2
I look at this code and I'm not too thrilled with the way it looks. I sure hope there's a better way to group up a hierarchical queryset result by a column name into a dictionary. Is there?
When you have a SQL query that is hard to write using the ORM, you can use PostgreSQL views (not sure about MySQL). In this case you will have:
Raw SQL like:
CREATE VIEW photo_urls AS
Select
photo.id, --pseudo primary key for django mapper
Gallery.id as gallery_id,
PhotoSize.PhotoURL as photo_url
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Order By GalleryPhoto.Order Asc
Django model like:
class PhotoUrls(models.Model):
    class Meta:
        managed = False
        db_table = 'photo_urls'

    gallery_id = models.IntegerField()
    photo_url = models.CharField(max_length=1000)
ORM Queryset like:
PhotoUrls.objects.filter(gallery_id=5)
Hope it will help.
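If you prefer to keep the view under migration control rather than creating it by hand, a sketch using RunSQL; the app label and dependency below are hypothetical, the SQL is essentially the CREATE VIEW statement from above, and the table names must match what actually exists in your database:

from django.db import migrations

CREATE_VIEW_SQL = """
CREATE VIEW photo_urls AS
SELECT Photo.id, Gallery.id AS gallery_id, PhotoSize.PhotoURL AS photo_url
FROM PhotoSize
INNER JOIN Photo ON Photo.id = PhotoSize.Photo_id
INNER JOIN GalleryPhoto ON GalleryPhoto.Photo_id = Photo.id
INNER JOIN Gallery ON Gallery.id = GalleryPhoto.Gallery_id;
"""

class Migration(migrations.Migration):
    dependencies = [('gallery', '0001_initial')]  # hypothetical

    operations = [
        migrations.RunSQL(sql=CREATE_VIEW_SQL, reverse_sql="DROP VIEW photo_urls;"),
    ]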
Django has some built in functions that will clean up the way your code looks. It will result in subqueries, so I guess it depends on performance. https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.values
gallery_photos = GalleryPhoto.objects.filter(Gallery=gallery).values('Photo_id', 'Order')
photo_queryset = PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(
    Photo__id__in=gallery_photos.values_list('Photo_id', flat=True))
Calling list() will instantly evaluate the queryset; this might affect performance if you have a lot of data.
Additionally, there should be a rather easy way to get rid of "if gallery_photo[0] == photo.Photo.id". This seems like it can be easily resolved with another query, getting gallery_photos for all photos (see the sketch below).
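To address Part 2 (the regrouping), here is one sketch that builds the nested dict in a single pass over photo_queryset, using a Photo_id -> Order mapping instead of the inner loop; the variable names are mine, not from the question:

order_by_photo = dict(gallery_photos.values_list('Photo_id', 'Order'))

photos = {}
for size in photo_queryset:
    guid = size.Photo.GUID
    entry = photos.setdefault(guid, {
        'Photo': size.Photo,
        'Order': order_by_photo.get(size.Photo_id, 1),
        'Sizes': {},
    })
    entry['Sizes'][size.PhotoSizing.SizeName] = size.PhotoURL

sorted_photos = sorted(photos.values(), key=lambda p: p['Order'])

Since Photo and PhotoSizing are select_related, this stays at two queries, and the 'Sizes' dict maps each size name straight to its URL, which matches the schema sketched in the question.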
You can retrieve all data with a single query, and get a list of data dictionaries. Then you can manage this dictionary or create a new one to form your final dictionary... You can use reverse relations in filtering and selecting specific rows from a table... So:
Let x be your selected Gallery...
GalleryPhoto.objects.filter(Gallery=x).values('Order', 'Photo__GUID', 'Photo__photosize__PhotoURL', 'Photo__photosize__PhotoSizing__SizeName', 'Photo__photosize__PhotoSizing__Width', 'Photo__photosize__PhotoSizing__Height', 'Photo__photosize__PhotoSizing__Type')
Using Photo__ will create an inner join to the Photo table, while Photo__photosize__ will join to PhotoSize (via the reverse relation, whose default query name is the lowercased model name) and Photo__photosize__PhotoSizing__ will join to PhotoSizing...
You get a list of dictionaries, keyed by the lookup paths you passed to values():
[{'Order': ..., 'Photo__GUID': ..., 'Photo__photosize__PhotoURL': ..., 'Photo__photosize__PhotoSizing__SizeName': ..., 'Photo__photosize__PhotoSizing__Width': ..., 'Photo__photosize__PhotoSizing__Height': ..., 'Photo__photosize__PhotoSizing__Type': ...}, ...]
You can select the rows that you need and get all values as a list of dictionaries... Then you can write a loop or iterator to go through this list and build a new dictionary that groups your data...