Django & Postgres - percentile (median) and group by - python

I need to calculate period medians per seller ID (see the simplified model below). The problem is that I am unable to construct the ORM query.
Model
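from django.contrib.postgres.fields import ArrayField, JSONField
from django.db import models

class MyModel(models.Model):
    period = models.IntegerField(null=True, default=None)
    seller_ids = ArrayField(models.IntegerField(), default=list)
    aux = JSONField(default=dict)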
Query
queryset = (
    MyModel.objects.filter(period=25)
    .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
    .values("seller_id")
    .annotate(
        duration=Cast(KeyTextTransform("duration", "aux"), IntegerField()),
        median=Func(
            F("duration"),
            function="percentile_cont",
            template="%(function)s(0.5) WITHIN GROUP (ORDER BY %(expressions)s)",
        ),
    )
    .values("median", "seller_id")
)
ArrayField aggregation (seller_id) source
I think what I need is something along these lines:
select t.*, p_25, p_75
from t join
     (select district,
             percentile_cont(0.25) within group (order by sales) as p_25,
             percentile_cont(0.75) within group (order by sales) as p_75
      from t
      group by district
     ) td
     on t.district = td.district
above example source
Python 3.7.5, Django 2.2.8, Postgres 11.1

You can create a Median subclass of Aggregate, as Ryan Murphy did (https://gist.github.com/rdmurphy/3f73c7b1826cacee34f6c2a855b12e2e). Median then works just like Avg:
from django.db.models import Aggregate, FloatField

class Median(Aggregate):
    function = 'PERCENTILE_CONT'
    name = 'median'
    output_field = FloatField()
    template = '%(function)s(0.5) WITHIN GROUP (ORDER BY %(expressions)s)'
Then, to find the median of a field, use
my_model_aggregate = MyModel.objects.all().aggregate(Median('period'))
The result is then available as my_model_aggregate['period__median'].

Here's what did the trick.
from django.contrib.postgres.fields.jsonb import KeyTextTransform
from django.db.models import F, Func, IntegerField
from django.db.models.aggregates import Aggregate
from django.db.models.functions import Cast

queryset = (
    MyModel.objects.filter(period=25)
    .annotate(duration=Cast(KeyTextTransform("duration", "aux"), IntegerField()))
    .filter(duration__isnull=False)
    .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
    .values("seller_id")  # group by
    .annotate(
        median=Aggregate(
            F("duration"),
            function="percentile_cont",
            template="%(function)s(0.5) WITHIN GROUP (ORDER BY %(expressions)s)",
        ),
    )
)
Notice that the median annotation uses Aggregate, not Func as in the question. An Aggregate applied after values() is what makes Django emit a GROUP BY on seller_id.
Also, the order of the annotate() and filter() clauses, as well as the order of the annotate() and values() clauses, matters a lot!
By the way, the resulting SQL contains no nested select and no join.
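
To make the grouping explicit, here is a rough plain-Python sketch of what the query above computes. It is not ORM code: the function name and the sample rows are made up for illustration, and it assumes that percentile_cont(0.5) matches statistics.median (both average the two middle values when the count is even).

```python
from collections import defaultdict
from statistics import median


def median_duration_per_seller(rows, period):
    """Group durations by every seller id in seller_ids and take the median."""
    durations = defaultdict(list)
    for row in rows:
        duration = row["aux"].get("duration")
        # Mirrors filter(period=...) and filter(duration__isnull=False).
        if row["period"] != period or duration is None:
            continue
        # Mirrors unnest(seller_ids): one output row per seller id.
        for seller_id in row["seller_ids"]:
            durations[seller_id].append(int(duration))
    return {seller_id: median(values) for seller_id, values in durations.items()}


# Hypothetical sample data shaped like MyModel rows.
rows = [
    {"period": 25, "seller_ids": [1, 2], "aux": {"duration": "10"}},
    {"period": 25, "seller_ids": [1], "aux": {"duration": "20"}},
    {"period": 25, "seller_ids": [2], "aux": {}},
    {"period": 24, "seller_ids": [1], "aux": {"duration": "99"}},
]
result = median_duration_per_seller(rows, period=25)
```

A row with several seller ids contributes its duration to each of those sellers, which is why unnest has to happen before the group by.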

Related

Month on month values in django query

I have an annotation like this, which displays the month-wise count of a field:
bar = Foo.objects.annotate(
    item_count=Count('item')
).order_by('-item_month', '-item_year')
and this produces output like this:
[image: html render]
I would like to show the change in item_count when compared with the previous month item_count for each month (except the first month). How could I achieve this using annotations or do I need to use pandas?
Thanks
Edit:
In SQL this becomes easy with the LAG function, along these lines:
SELECT item_month, item_year, COUNT(item),
       LAG(COUNT(item)) OVER (ORDER BY item_month, item_year)
FROM Foo
GROUP BY item_month, item_year
(PS: item_month and item_year are date fields)
Does the Django ORM have something similar to LAG in SQL?
For this type of query you need to use window functions in the Django ORM.
For LAG, see
https://docs.djangoproject.com/en/4.0/ref/models/database-functions/#lag
A working ORM query looks like this:
# models.py
class Review(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='review_user', db_index=True)
    review_text = models.TextField(max_length=5000)
    rating = models.SmallIntegerField(
        validators=[
            MaxValueValidator(10),
            MinValueValidator(1),
        ],
    )
    date_added = models.DateTimeField(db_index=True)
    review_id = models.AutoField(primary_key=True, db_index=True)
This is just a dummy table to show the use of Lag and Window in Django, because the Django docs do not include an example for the Lag function.
from django.db.models import Count, F, Window
from django.db.models.functions import ExtractYear, Lag

print(
    Review.objects.filter()
    .annotate(num_likes=Count('likereview_review'))
    .annotate(item_count_lag=Window(
        expression=Lag(expression=F('num_likes')),
        order_by=ExtractYear('date_added').asc(),
    ))
    .order_by('-num_likes').distinct().query
)
The generated query looks like this:
SELECT DISTINCT `temp_view_review`.`user_id`, `temp_view_review`.`review_text`, `temp_view_review`.`rating`, `temp_view_review`.`date_added`, `temp_view_review`.`review_id`, COUNT(`temp_view_likereview`.`id`) AS `num_likes`, LAG(COUNT(`temp_view_likereview`.`id`), 1) OVER (ORDER BY EXTRACT(YEAR FROM `temp_view_review`.`date_added`) ASC) AS `item_count_lag` FROM `temp_view_review` LEFT OUTER JOIN `temp_view_likereview` ON (`temp_view_review`.`review_id` = `temp_view_likereview`.`review_id`) GROUP BY `temp_view_review`.`review_id` ORDER BY `num_likes` DESC
If you don't want to order by the extracted year of the date, you can order by an F expression instead:
print(
    Review.objects.filter()
    .annotate(num_likes=Count('likereview_review'))
    .annotate(item_count_lag=Window(
        expression=Lag(expression=F('num_likes')),
        order_by=[F('date_added')],
    ))
    .order_by('-num_likes').distinct().query
)
The query for this:
SELECT DISTINCT `temp_view_review`.`user_id`, `temp_view_review`.`review_text`, `temp_view_review`.`rating`, `temp_view_review`.`date_added`, `temp_view_review`.`review_id`, COUNT(`temp_view_likereview`.`id`) AS `num_likes`, LAG(COUNT(`temp_view_likereview`.`id`), 1) OVER (ORDER BY `temp_view_review`.`date_added`) AS `item_count_lag` FROM `temp_view_review` LEFT OUTER JOIN `temp_view_likereview` ON (`temp_view_review`.`review_id` = `temp_view_likereview`.`review_id`) GROUP BY `temp_view_review`.`review_id` ORDER BY `num_likes` DESC
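
For readers unfamiliar with LAG, here is a small plain-Python sketch of the month-on-month change the question asks for. It only illustrates the semantics and is not ORM code. The function name and the sample counts are hypothetical, and it assumes the rows should be ordered chronologically (year first, then month).

```python
def month_over_month(counts):
    """counts: list of (year, month, item_count) tuples, in any order."""
    ordered = sorted(counts)  # chronological: by year, then by month
    rows = []
    previous = None  # LAG yields NULL for the first row of the window
    for year, month, count in ordered:
        change = None if previous is None else count - previous
        rows.append({
            "year": year,
            "month": month,
            "item_count": count,
            "previous": previous,
            "change": change,
        })
        previous = count
    return rows


# Hypothetical monthly counts.
rows = month_over_month([(2022, 2, 7), (2021, 12, 4), (2022, 1, 10)])
```

The first month has no predecessor, so its change stays None, which matches the "except the first month" requirement in the question.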

Is Nested aggregate queries possible with Django queryset

I want to calculate the monthly profit from the following models using Django queryset methods. The tricky point is that the Order table has a FreightSellOverride field, which overrides the sum of FreightSell over the order's OrderItem rows. An order may contain multiple order items, so I have to calculate the profit per order first and then the profit per month, because any order-level FreightSellOverride value has to be taken into account.
Below is my attempt using annotate, but I could not work out how to arrive at this SQL. Does Django allow this kind of nested aggregate query?
select sales_month
      ,sum(sumSellPrice-sumNetPrice-sumFreighNet+coalesce(FreightSellOverride,sumFreightSell)) as profit
from
(
    select CAST(DATE_FORMAT(b.CreateDate, '%Y-%m-01 00:00:00') AS DATETIME) AS `sales_month`,
           a.order_id,b.FreightSellOverride
          ,sum(SellPrice) as sumSellPrice,sum(NetPrice) as sumNetPrice
          ,sum(FreightNet) as sumFreighNet,sum(FreightSell) as sumFreightSell
    from OrderItem a
    inner join Order b
    on a.order_id=b.id
    group by 1,2,3
) c
group by sales_month
I tried this:
result = (
    OrderItem.objects
    .annotate(sales_month=TruncMonth('order__CreateDate'))
    .values('sales_month', 'order', 'order__FreightSellOverride')
    .annotate(
        sumSellPrice=Sum('SellPrice'),
        sumNetPrice=Sum('NetPrice'),
        sumFreighNet=Sum('FreightNet'),
        sumFreightSell=Sum('FreightSell'),
    )
    .values('sales_month')
    .annotate(profit=Sum(F('sumSellPrice')-F('sumNetPrice')-F('sumFreighNet')+Coalesce('order__FreightSellOverride','sumFreightSell')))
)
but I get this error:
Exception Type: FieldError
Exception Value:
Cannot compute Sum('<CombinedExpression: F(sumSellPrice) - F(sumNetPrice) - F(sumFreighNet) + Coalesce(F(ProjectId__FreightSellOverride), F(sumFreightSell))>'): '<CombinedExpression: F(sumSellPrice) - F(sumNetPrice) - F(sumFreighNet) + Coalesce(F(ProjectId__FreightSellOverride), F(sumFreightSell))>' is an aggregate
from django.db import models
from django.db.models import F, Count, Sum
from django.db.models.functions import TruncMonth, Coalesce

class Order(models.Model):
    CreateDate = models.DateTimeField(verbose_name="Create Date")
    FreightSellOverride = models.FloatField()

class OrderItem(models.Model):
    SellPrice = models.DecimalField(max_digits=10,decimal_places=2)
    FreightSell = models.DecimalField(max_digits=10,decimal_places=2)
    NetPrice = models.DecimalField(max_digits=10,decimal_places=2)
    FreightNet = models.DecimalField(max_digits=10,decimal_places=2)
    order = models.ForeignKey(Order,on_delete=models.DO_NOTHING,related_name="Item")
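
The two-level calculation described in the question can be sketched in plain Python as follows. This is not an ORM solution, only an illustration of the intended arithmetic (per-order sums first, then monthly totals with the override applied). The function name, the tuple layouts, and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def monthly_profit(orders, items):
    """orders: {order_id: (sales_month, freight_sell_override or None)}
    items: list of (order_id, sell_price, net_price, freight_net, freight_sell)
    """
    # Inner level: sum the item columns per order (the subquery in the SQL).
    per_order = defaultdict(lambda: [0, 0, 0, 0])
    for order_id, sell, net, freight_net, freight_sell in items:
        sums = per_order[order_id]
        sums[0] += sell
        sums[1] += net
        sums[2] += freight_net
        sums[3] += freight_sell

    # Outer level: apply the override per order, then sum per month.
    profit = defaultdict(int)
    for order_id, (sell, net, freight_net, freight_sell) in per_order.items():
        month, override = orders[order_id]
        # coalesce(FreightSellOverride, sumFreightSell)
        freight = freight_sell if override is None else override
        profit[month] += sell - net - freight_net + freight
    return dict(profit)


# Hypothetical data: order 2 carries an order-level freight override of 50.
orders = {1: ("2023-01", None), 2: ("2023-01", 50), 3: ("2023-02", None)}
items = [
    (1, 100, 60, 5, 10),
    (1, 200, 120, 5, 10),
    (2, 300, 200, 20, 30),
    (3, 80, 50, 4, 6),
]
result = monthly_profit(orders, items)
```

The override has to be applied once per order, after the item sums, which is why a single flat aggregation over OrderItem rows would give the wrong total for orders with several items.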

Django create subquery with values for last n days

I am using Django 3.1 with Postgres, and this is my abridged model:
class PlayerSeasonReport:
    player = models.ForeignKey(Player)
    competition_season = models.ForeignKey(CompetitionSeason)

class PlayerPrice:
    player_season_report = models.ForeignKey(PlayerSeasonReport)
    price = models.IntegerField()
    date = models.DateTimeField()
    # unique on (price, date)
I'm querying on the PlayerSeasonReport to get aggregate information about all players, in particular I would like the prices for the last n records (so the last price, the 7th-to-last price, etc.)
I currently get the PlayerSeasonReport queryset and annotate it like this:
base_query = PlayerSeasonReport.objects.filter(competition_season_id=id)

# This works fine
last_value = base_query.filter(
    pk=OuterRef('pk'),
).order_by(
    'pk',
    '-player_prices__date'
).distinct('pk').annotate(
    value=F('player_prices__price')
)

# Pull the value from a week ago
# This produces a value but is logically incorrect
# I am interested in the 7th-to-last value, not really from a week ago from day of query
week_ago = datetime.datetime.now() - datetime.timedelta(7)
value_7d_ago = base_query.filter(
    pk=OuterRef('pk'),
    player_prices__date__gte=week_ago,
).order_by(
    'pk',
    'player_prices__date'
).distinct('pk').annotate(
    value=F('player_prices__price')
)

return base_query.annotate(
    value=Subquery(
        last_value.values('value'),
        output_field=FloatField()
    ),
    # Same for value_7d_ago
    # ...
    # Many other annotations
)
Getting the most recent value works fine, but getting the last n values doesn't. I shouldn't be using datetime concepts in my logic, since what I'm really interested in is in the n-to-last values.
I've tried annotating the max date, then filtering based on this annotation, and also somehow slicing the subquery, but I can't seem to get any of it right.
It's worth noting that a price may not exist (there may be no record for n values in the past), in which case it should be null (the annotation based on datetime works)
How can I annotate the price values for the last n days?
Sorted:
base_query = PlayerSeasonReport.objects.filter(id=id)
# ...other manipulations on base query

prices = PlayerPrice.objects.filter(
    player_season_report=OuterRef('pk')
).order_by('-date')

return base_query.annotate(
    price=Subquery(
        prices.values('price')[:1],
        output_field=FloatField()
    ),
    prev_day_price=Subquery(
        prices.values('price')[1:2],
        output_field=FloatField()
    ),
    # ...
)
Explanation:
We query on the child model (PlayerPrice) and join on the pk of the PlayerSeasonReport.
prices.values('price')[i:j] where j = i + 1 allows us to get the value we desire without evaluating the QuerySet (which is indispensable in a Subquery).
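
As a rough illustration of the [i:i + 1] slicing idea, here is a plain-Python sketch (not ORM code). The helper name and the sample prices are hypothetical. It mirrors the behaviour described above: an empty slice gives None, just as a subquery with no rows gives a NULL annotation.

```python
def nth_latest_price(prices, n):
    """prices: list of (date, price); n=0 is the latest, n=6 the 7th-to-last."""
    # Equivalent of order_by('-date').
    ordered = sorted(prices, key=lambda item: item[0], reverse=True)
    # Same shape as prices.values('price')[n:n + 1]: at most one row.
    window = ordered[n:n + 1]
    return window[0][1] if window else None


# Hypothetical price history for one player season report.
prices = [("2021-01-03", 30), ("2021-01-01", 10), ("2021-01-02", 20)]
latest = nth_latest_price(prices, 0)
previous = nth_latest_price(prices, 1)
missing = nth_latest_price(prices, 6)
```

Slicing by position rather than filtering by datetime is what makes this "n-to-last record" logic instead of "n days ago" logic.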

how to use django annotate with foreign key

Consider simple Django models
class Journey(models.Model):
    vrn = models.CharField(max_length=200)  # Vehicle Reg No
    kilo = models.FloatField()

class J_user(models.Model):
    jdi = models.ForeignKey(Journey, related_name="Journey_User", on_delete=models.DO_NOTHING)
    uid = models.IntegerField()
It's easy to annotate within a single table, for example to sum the total kilometers driven for each vehicle (vrn is the vehicle's registration number):
Journey.objects.values('vrn').annotate(Total_kilo=Sum('kilo'))
Now I want to write a query that returns how many kilometers each user has traveled in each car.
Data of the Journey table:
[image: Journey table]
Data of the J_user table:
[image: J_user table]
Then the result should be:
[image: expected result]
Thanks for your help.
This is your query:
(
    Journey.objects
    .order_by()  # important: clears any ordering so sort fields are not added to the GROUP BY
    .values('vrn', 'Journey_User__uid')  # reverse lookup uses the related_name
    .annotate(Total_kilo=Sum('kilo'))
)
Fields listed in values() are included in the GROUP BY clause. Sample:
print(
    Material
    .objects
    .values("uf_id", "uf__mp__id")
    .annotate(Sum("total_social_per_c"))
    .query
)
Result:
SELECT "material_material"."uf_id",
       "ufs_uf"."mp_id",
       Sum("material_material"."total_social_per_c") AS "total_social_per_c__sum"
FROM "material_material"
     INNER JOIN "ufs_uf"
     ON ( "material_material"."uf_id" = "ufs_uf"."id" )
GROUP BY "material_material"."uf_id",
         "ufs_uf"."mp_id"
According to your models it should be:
J_user.objects.values('uid', vrn=F('jdi__vrn')).annotate(kilo=Sum('jdi__kilo'))
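
To show what grouping by both the user and the vehicle produces, here is a plain-Python sketch of the same aggregation. It is not ORM code. The function name and the sample rows are hypothetical, since the original tables were not preserved.

```python
from collections import defaultdict


def kilometers_per_user_and_vehicle(journeys, j_users):
    """journeys: {journey_id: (vrn, kilo)}; j_users: list of (journey_id, uid)."""
    totals = defaultdict(float)
    for journey_id, uid in j_users:
        vrn, kilo = journeys[journey_id]  # follow the jdi foreign key
        totals[(uid, vrn)] += kilo  # GROUP BY uid, vrn with SUM(kilo)
    return dict(totals)


# Hypothetical data: user 100 drove AB-123 twice and CD-456 once.
journeys = {1: ("AB-123", 10.0), 2: ("AB-123", 5.5), 3: ("CD-456", 7.0)}
j_users = [(1, 100), (2, 100), (3, 100), (1, 200)]
totals = kilometers_per_user_and_vehicle(journeys, j_users)
```

Each J_user row contributes the kilometers of its journey to exactly one (uid, vrn) pair, which corresponds to listing both fields in values() before annotate().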

Django ORM: sort by aggregate of filter of related table

Here's a subset of my model:
class Case(models.Model):
    ...  # primary key is named "id"

class Employee(models.Model):
    ...  # primary key is named "id"

class Report(models.Model):
    case = ForeignKey(Case, null=True)
    employee = ForeignKey(Employee)
    date = DateField()
Given a particular employee, I want to produce a list of all cases, ordered by when the employee has most recently reported on it. Those cases for which no report exists should be sorted last. Cases on the same date (including NULL) should be sorted by further criteria.
Can I express this in the Django ORM api? If so, how?
In pseudo-SQL, I think I want
Select Case.*
From Case some-kind-of-join Report
Where report.employee_id = the_given_employee_id
Group by Case.id
Order by Max(Report.date) Desc /* Report-less cases last */, Case.id /* etc. */
Do I need to introduce a many-to-many relation from Case to Employee through Report to do this in Django ORM?
Every relationship in a Django model has a reverse relationship that can be queried easily (including when ordering), so you can do something like:
Case.objects.all().order_by('-report__date', 'another_field', 'a third field')
but this won't get you any information about a single particular employee. You could do this:
Case.objects.filter(report__employee__pk=5).order_by('-report__date', 'another_field', 'a third field')
but this won't return any Case objects that aren't edited by your particular employee.
Unfortunately, the ORM could not express subqueries natively when this was written (Django 1.11 later added Subquery and OuterRef), so you have to write a custom annotation with extra() to perform the subquery (i.e. the latest report date per case for one particular employee). This is untested and assumes the tables are named app_case and app_report, but it's the general idea:
Case.objects.all().extra(
    select={
        # Correlated subquery: latest report date by this employee for each case
        "employee_last_edit": """
            SELECT MAX(app_report.date)
            FROM app_report
            WHERE app_report.case_id = app_case.id
              AND app_report.employee_id = %s
        """
    },
    select_params=[employee.id],
).order_by('-employee_last_edit', 'something_else')
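
The ordering the question asks for can be spelled out in plain Python as a sanity check for whatever query you end up with. This is only a sketch of the intended semantics, not ORM code. The function name and the sample data are hypothetical, and the tie-break on case id stands in for the "further criteria" mentioned in the question.

```python
from datetime import date


def order_cases(case_ids, reports, employee_id):
    """reports: list of (case_id, employee_id, date).

    Returns case ids ordered by the employee's most recent report
    (newest first), with report-less cases last and ties broken by id.
    """
    latest = {}
    for case_id, emp_id, reported_on in reports:
        if emp_id != employee_id or case_id is None:
            continue
        if case_id not in latest or reported_on > latest[case_id]:
            latest[case_id] = reported_on  # Max(Report.date) per case

    def sort_key(case_id):
        last = latest.get(case_id)
        # False sorts before True, so cases with a report come first;
        # the negated ordinal gives descending date order.
        ordinal = -last.toordinal() if last is not None else 0
        return (last is None, ordinal, case_id)

    return sorted(case_ids, key=sort_key)


# Hypothetical data: employee 7 never reported on case 3.
reports = [
    (1, 7, date(2020, 1, 5)),
    (2, 7, date(2020, 3, 1)),
    (1, 7, date(2020, 2, 1)),
    (3, 8, date(2020, 4, 1)),
    (4, 7, date(2020, 3, 1)),
]
ordered = order_cases([1, 2, 3, 4], reports, employee_id=7)
```

Note that a plain descending sort in SQL does not guarantee that NULL dates come last on every database, so the "report-less cases last" rule needs explicit handling there as well.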
