Problem Overview
Given the models
class Candidate(BaseModel):
    name = models.CharField(max_length=128)

class Status(BaseModel):
    name = models.CharField(max_length=128)

class StatusChange(BaseModel):
    candidate = models.ForeignKey("Candidate", related_name="status_changes")
    status = models.ForeignKey("Status", related_name="status_changes")
    created_at = models.DateTimeField(auto_now_add=True, blank=True)
And SQL Tables:
candidates
+----+--------------+
| id | name |
+----+--------------+
| 1 | Beth |
| 2 | Mark |
| 3 | Mike |
| 4 | Ryan |
+----+--------------+
status
+----+--------------+
| id | name |
+----+--------------+
| 1 | Review |
| 2 | Accepted |
| 3 | Rejected |
+----+--------------+
status_change
+----+--------------+-----------+------------+
| id | candidate_id | status_id | created_at |
+----+--------------+-----------+------------+
| 1 | 1 | 1 | 03-01-2019 |
| 2 | 1 | 2 | 05-01-2019 |
| 4 | 2 | 1 | 01-01-2019 |
| 5 | 3 | 1 | 01-01-2019 |
| 6 | 4 | 3 | 01-01-2019 |
+----+--------------+-----------+------------+
I want to get the total number of candidates with a given status, but only the latest status_change of each candidate should be counted.
In other words, StatusChange is used to track history of status, but only the latest is considered when counting current status of candidates.
SQL Solution
Using SQL, I was able to achieve it using GROUP BY and COUNT.
(SQL untested)
SELECT
    status.id AS status_id
  , status.name AS status_name
  , COUNT(*) AS status_count
FROM
  (
    -- latest status change date per candidate
    SELECT
      candidate_id,
      MAX(created_at) AS latest_status_change
    FROM
      status_change
    GROUP BY candidate_id
  )
  AS last_status_change
INNER JOIN
  status_change
  ON (status_change.candidate_id = last_status_change.candidate_id
      AND status_change.created_at = last_status_change.latest_status_change)
INNER JOIN
  status
  ON (status_change.status_id = status.id)
GROUP BY status.id, status.name
ORDER BY status_count DESC;
last_status_count
+-----------+-------------+--------+
| status_id | status_name | count |
+-----------+-------------+--------+
| 1 | Review | 2 | # <= Does not include instance from candidate 1
| 2 | Accepted | 1 | # because status 2 is latest
| 3 | Rejected | 1 |
+-----------+-------------+--------+
Attempted Django Solution
I need a view to return each status and their corresponding count -
eg [{ status_name: "Review", count: 2 }, ...]
I am not sure how to build this queryset, without pulling all records and aggregating in python.
I figured I need annotate() and possibly Subquery but I haven't been able to stitch it all together.
The closest I got is this, which counts the number of status changes for each status but also counts non-latest changes.
queryset = Status.objects.all().annotate(case_count=Count("status_changes"))
I have found lots of SO questions on aggregating, but I couldn't find a clear answer on aggregating and annotating "latest".
Thanks in advance.
We can perform a query where we first filter the last StatusChanges per Candidate and then count the statuses:
from django.db.models import Count, F, Max

qs = Status.objects.filter(
    status_changes__in=StatusChange.objects.annotate(
        last=Max('candidate__status_changes__created_at')
    ).filter(
        created_at=F('last')
    )
).annotate(
    nlast=Count('status_changes')
)
For the given sample data, this gives us:
>>> [(q.name, q.nlast) for q in qs]
[('Review', 2), ('Accepted', 1), ('Rejected', 1)]
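As a sanity check on these numbers, the same count can be reproduced in plain Python. This is only a sketch over the sample rows from the question: the tuple layout and variable names are illustrative, and the dates are assumed to be in day-month-year order. It is not a replacement for the queryset.

```python
from collections import Counter
from datetime import date

# (id, candidate_id, status_id, created_at) rows from the sample status_change table.
# Assumption: the dates in the question are day-month-year.
status_changes = [
    (1, 1, 1, date(2019, 1, 3)),
    (2, 1, 2, date(2019, 1, 5)),
    (4, 2, 1, date(2019, 1, 1)),
    (5, 3, 1, date(2019, 1, 1)),
    (6, 4, 3, date(2019, 1, 1)),
]
status_names = {1: "Review", 2: "Accepted", 3: "Rejected"}

# Keep only the most recent status change per candidate.
latest_by_candidate = {}
for row in status_changes:
    _, candidate_id, _, created_at = row
    current = latest_by_candidate.get(candidate_id)
    if current is None or created_at > current[3]:
        latest_by_candidate[candidate_id] = row

# Count how many candidates currently sit in each status.
counts = Counter(status_names[row[2]] for row in latest_by_candidate.values())
result = sorted(counts.items(), key=lambda item: -item[1])
print(result)
```

Candidate 1's earlier "Review" row is dropped because a later "Accepted" row exists, which is why "Review" ends up at 2 rather than 3.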
Related
Information
I have two models:
class BookingModel(models.Model):
    [..fields..]

class BookingComponentModel(models.Model):
    STATUS_CHOICES = ['In Progress','Completed','Not Started','Incomplete','Filled','Partially Filled','Cancelled']
    STATUS_CHOICES = [(choice,choice) for choice in STATUS_CHOICES]
    COMPONENT_CHOICES = ['Test','Soak']
    COMPONENT_CHOICES = [(choice,choice) for choice in COMPONENT_CHOICES]
    booking = models.ForeignKey(BookingModel, on_delete=models.CASCADE, null=True, blank=True)
    component_type = models.CharField(max_length=20, choices=COMPONENT_CHOICES)
    status = models.CharField(max_length=50, choices=STATUS_CHOICES, default='Not Started')
    order = models.IntegerField(unique=True)
    [..fields..]
What I want
I want to get the booking component for each booking which has the last value (maximum) in order. It will also need to have a status='In Progress' and component_type='Soak'.
For example for table:
+----+------------+----------------+-------------+-------+
| id | booking_id | component_type | status | order |
+----+------------+----------------+-------------+-------+
| 1 | 1 | Test | Completed | 1 |
+----+------------+----------------+-------------+-------+
| 2 | 1 | Soak | Completed | 2 |
+----+------------+----------------+-------------+-------+
| 3 | 1 | Soak | In Progress | 3 |
+----+------------+----------------+-------------+-------+
| 4 | 2 | Test | Completed | 1 |
+----+------------+----------------+-------------+-------+
| 5 | 2 | Soak | In Progress | 2 |
+----+------------+----------------+-------------+-------+
| 6 | 3 | Test | In Progress | 1 |
+----+------------+----------------+-------------+-------+
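Expected outcome would be ids 3 & 5 (the highest-order component of bookings 1 and 2; booking 3's latest component is a 'Test', so it does not qualify).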
What I've tried
I've tried the following:
BookingComponentModel.objects.values('booking').annotate(max_order=Max('order')).order_by('-booking')
This doesn't include the filtering but returns the max_order for each booking.
I would need the id of the component which has that max_order in order to put this in a sub-query and enable me to filter other conditions (status, component_type)
Thanks
You can make use of a Subquery expression [Django-doc] and work with:
from django.db.models import OuterRef, Subquery

BookingModel.objects.annotate(
    latest_component_id=Subquery(
        BookingComponentModel.objects.filter(
            booking_id=OuterRef('pk'), status='In Progress', component_type='Soak'
        ).values('pk').order_by('-order')[:1]
    )
)
The BookingModel objects that arise from this queryset will have an extra attribute latest_component_id that contains the primary key of the latest BookingComponentModel with status 'In Progress' and component_type 'Soak', or None if the booking has no such component.
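To make the behaviour of that annotation concrete, here is a plain-Python sketch of the same selection over the question's sample rows. The helper name and tuple layout are made up for illustration; only the filter-then-order logic mirrors the subquery.

```python
# (id, booking_id, component_type, status, order) rows from the question's table.
components = [
    (1, 1, "Test", "Completed", 1),
    (2, 1, "Soak", "Completed", 2),
    (3, 1, "Soak", "In Progress", 3),
    (4, 2, "Test", "Completed", 1),
    (5, 2, "Soak", "In Progress", 2),
    (6, 3, "Test", "In Progress", 1),
]


def latest_matching_component_id(booking_id):
    """Hypothetical helper: filter first, then take the highest order."""
    matching = [
        row for row in components
        if row[1] == booking_id and row[3] == "In Progress" and row[2] == "Soak"
    ]
    if not matching:
        return None  # corresponds to a NULL annotation when nothing matches
    return max(matching, key=lambda row: row[4])[0]


latest_component_id = {b: latest_matching_component_id(b) for b in (1, 2, 3)}
print(latest_component_id)
```

Note that the filter is applied before the highest order is picked. If a booking's overall latest component did not match the conditions, an earlier matching component would still be returned, which may or may not be what you want.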
I want to run a window function and then do a GROUP BY aggregation over the result as a subquery, but I couldn't do it with the ORM. It fails with the error "aggregate function calls cannot contain window function calls".
Is there any way to write a query like the SQL below without using .raw()?
SELECT a.col_id, AVG(a.max_count) FROM (
SELECT col_id,
MAX(count) OVER (PARTITION BY part_id ORDER BY part_id) AS max_count
FROM table_one
) a
GROUP BY a.col_id;
Example
table_one
| id | col_id | part_id | count |
| -- | ------ | ------- | ----- |
| 1 | c1 | p1 | 3 |
| 2 | c2 | p1 | 2 |
| 3 | c3 | p2 | 1 |
| 4 | c2 | p2 | 4 |
First I want to get the max count based on part_id
| id | col_id | part_id | count | max_count |
| -- | ------ | ------- | ----- | --------- |
| 1 | c1 | p1 | 3 | 3 |
| 2 | c2 | p1 | 2 | 3 |
| 3 | c3 | p2 | 1 | 4 |
| 4 | c2 | p2 | 4 | 4 |
And finally get the average of max_count grouped by col_id
| col_id | avg(max_count) |
| ------ | -------------- |
| c1 | 3 |
| c2 | 3.5 |
| c3 | 4 |
The models I have now
import uuid

from django.db import models


class Part(models.Model):
    part_id = models.UUIDField(primary_key=True, editable=False, default=uuid.uuid4)
    name = models.CharField(max_length=128)  # max_length is an assumed value; the question omits it

class Col(models.Model):
    part_id = models.UUIDField(primary_key=True, editable=False, default=uuid.uuid4)
    name = models.CharField(max_length=128)  # max_length is an assumed value; the question omits it

class TableOne(models.Model):
    id = models.UUIDField(primary_key=True, editable=False, default=uuid.uuid4)
    col_id = models.ForeignKey(
        Col,
        on_delete=models.CASCADE,
        related_name='table_one_col'
    )
    part_id = models.ForeignKey(
        Part,
        on_delete=models.CASCADE,
        related_name='table_one_part'
    )
    count = models.IntegerField()
I want to do the GROUP BY after the PARTITION BY. This is the query I tried, which raises the error.
from django.db.models import Avg, F, Max, Window

query = TableOne.objects.annotate(
    max_count=Window(
        expression=Max('count'),
        order_by=F('part_id').asc(),
        partition_by=F('part_id')
    )
).values(
    'col_id'
).annotate(
    avg=Avg('max_count')
)
You can use subqueries in Django; you don't need window functions. First, the subquery is a Part queryset annotated with the max count from TableOne:
from django.db.models import Avg, Max, Subquery, OuterRef

# Part's primary key is named part_id in the question's models, so filter on pk.
parts = Part.objects.filter(
    pk=OuterRef('part_id')
).annotate(
    max=Max('table_one_part__count')
)
Then annotate a TableOne queryset with the max count from the subquery, perform values on the column we want to group by (col_id), and then annotate again with the average to generate your desired output:
TableOne.objects.annotate(
    max_count=Subquery(parts.values('max')[:1])
).values(
    'col_id'
).annotate(
    Avg('max_count')
)
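The SQL shape this approach aims for (a per-row correlated MAX, then a GROUP BY over it) can be checked outside Django. The sketch below uses Python's built-in sqlite3 module with the example table; the in-memory database and the aliases t1, t2 and a are assumptions for illustration, and the SQL the ORM actually emits will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_one (
        id INTEGER PRIMARY KEY, col_id TEXT, part_id TEXT, "count" INTEGER
    );
    INSERT INTO table_one VALUES
        (1, 'c1', 'p1', 3),
        (2, 'c2', 'p1', 2),
        (3, 'c3', 'p2', 1),
        (4, 'c2', 'p2', 4);
    """
)

rows = conn.execute(
    """
    SELECT a.col_id, AVG(a.max_count)
    FROM (
        -- per-row max of "count" within the same part_id
        SELECT t1.col_id,
               (SELECT MAX(t2."count")
                FROM table_one AS t2
                WHERE t2.part_id = t1.part_id) AS max_count
        FROM table_one AS t1
    ) AS a
    GROUP BY a.col_id
    ORDER BY a.col_id
    """
).fetchall()
print(rows)
```

The correlated subquery plays the role of the window function, so the outer aggregate never has to wrap a window call.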
This is a follow-up to this SO question: Django Annotated Query to Count Only Latest from Reverse Relationship
Given these models:
class Candidate(BaseModel):
    name = models.CharField(max_length=128)

class Status(BaseModel):
    name = models.CharField(max_length=128)

class StatusChange(BaseModel):
    candidate = models.ForeignKey("Candidate", related_name="status_changes")
    status = models.ForeignKey("Status", related_name="status_changes")
    created_at = models.DateTimeField(auto_now_add=True, blank=True)
Represented by these tables:
candidates
+----+--------------+
| id | name |
+----+--------------+
| 1 | Beth |
| 2 | Mark |
| 3 | Mike |
| 4 | Ryan |
+----+--------------+
status
+----+--------------+
| id | name |
+----+--------------+
| 1 | Review |
| 2 | Accepted |
| 3 | Rejected |
+----+--------------+
status_change
+----+--------------+-----------+------------+
| id | candidate_id | status_id | created_at |
+----+--------------+-----------+------------+
| 1 | 1 | 1 | 03-01-2019 |
| 2 | 1 | 2 | 05-01-2019 |
| 4 | 2 | 1 | 01-01-2019 |
| 5 | 3 | 1 | 01-01-2019 |
| 6 | 4 | 3 | 01-01-2019 |
+----+--------------+-----------+------------+
I wanted to get a count of each status type, but only include the last status for each candidate:
last_status_count
+-----------+-------------+--------+
| status_id | status_name | count |
+-----------+-------------+--------+
| 1 | Review | 2 |
| 2 | Accepted | 1 |
| 3 | Rejected | 1 |
+-----------+-------------+--------+
I was able to achieve this with this answer:
from django.db.models import Count, F, Max

qs = Status.objects.filter(
    status_changes__in=StatusChange.objects.annotate(
        last=Max('candidate__status_changes__created_at')
    ).filter(
        created_at=F('last')
    )
).annotate(
    nlast=Count('status_changes')
)
>>> [(q.name, q.nlast) for q in qs]
[('Review', 2), ('Accepted', 1), ('Rejected', 1)]
The issue, however, is that a status not referenced by any status change is omitted from the result. Instead, I would like it to be counted as zero.
For example, if the status were
+----+--------------+
| id | name |
+----+--------------+
| 1 | Review |
| 2 | Accepted |
| 3 | Rejected |
| 4 | Banned |
+----+--------------+
I would like to get:
+-----------+-------------+--------+
| status_id | status_name | count |
+-----------+-------------+--------+
| 1 | Review | 2 |
| 2 | Accepted | 1 |
| 3 | Rejected | 1 |
| 4 | Banned | 0 |
+-----------+-------------+--------+
>>> [(q.name, q.nlast) for q in qs]
[('Review', 2), ('Accepted', 1), ('Rejected', 1), ('Banned', 0)]
What I tried
I solved this by doing an outer join in SQL, but I am not sure how to achieve that in Django.
I tried creating a queryset with all counts annotated as zero and then merging it, but it did not work:
last_status_changes = Status.objects.filter(
    status_changes__in=StatusChange.objects.annotate(
        last=Max('candidate__status_changes__created_at')
    ).filter(
        created_at=F('last')
    )
).annotate(
    nlast=Count('status_changes')
)
zero_query = (
    Status.objects.all()
    .annotate(nlast=Value(0, output_field=IntegerField()))
    .exclude(pk__in=last_status_changes.values("id"))
)
>>> qs = last_status_changes | zero_query
>>> [(q.name, q.nlast) for q in qs]
[('Review', 3), ('Accepted', 1), ('Rejected', 1)]
# this would double count "Review" and include not only last but others
Any help is appreciated
Thanks
Update 1
I was able to solve this with a raw query using a right join, but it would be great to do this using the ORM.
# Untested as I am using different model names in reality
SQL = """SELECT
Min(status.id) as id
, COUNT(latest_status_change.candidate_id) as nlast
FROM
(
SELECT
candidate_id,
Max(created_at) AS latest_date
FROM
api_status_change
GROUP BY candidate_id
)
AS latest_status_change
INNER JOIN api_candidates ON (latest_status_change.candidate_id = api_candidates.id)
INNER JOIN api_status_change ON
(
api_status_change.candidate_id = api_candidates.id
AND
latest_status_change.latest_date = api_status_change.created_at
)
RIGHT JOIN api_status AS status ON (api_status_change.status_id = `status`.id)
GROUP BY status.name
;
"""
qs = Status.objects.raw(SQL)
>>> [(q.name, q.nlast) for q in qs]
[('Review', 2), ('Accepted', 1), ('Rejected', 1), ('Banned', 0)]
The only problem here is that you are filtering your Status queryset by existing status changes and expecting the complete opposite result. In your case the solution is to get rid of the obsolete filtering:
last_status_changes = Status.objects.annotate(
    nlast=Count('status_changes')
).order_by(
    '-nlast'
)
The other case would be if you really want to filter your changes (by date, for example):
from django.db.models import Case, Count, F, IntegerField, When

changed_status_ids = Status.objects.filter(
    status_changes__created_at__gte='2020-03-03'
).values_list(
    'id',
    flat=True
)
Status.objects.annotate(
    c=Count('status_changes')
).annotate(
    cnt=Case(
        When(
            id__in=changed_status_ids,
            then=F('c')
        ),
        output_field=IntegerField(),
        default=0
    )
).values(
    'cnt',
    'name'
).order_by(
    '-cnt'
)
I solved it with the queryset below:
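from django.db import models

# Status changes that are the most recent one for their candidate.
qs_last_status_changes = StatusChange.objects.annotate(
    _last_change=models.Max("candidate__status_changes__created_at")
).filter(created_at=models.F("_last_change"))

# Sum 1 for every latest status change, 0 otherwise, so unused statuses stay in the result.
qs_status = Status.objects.annotate(
    count=models.Sum(
        models.Case(
            models.When(
                status_changes__in=qs_last_status_changes,
                then=models.Value(1),
            ),
            output_field=models.IntegerField(),
            default=0,
        )
    )
)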
>>> [(k.name, k.count) for k in qs_status]
[('Review', 2), ('Accepted', 1), ('Rejected', 1), ('Banned', 0)]
Thank you Andrey Nelubin for your suggestion
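For comparison, the outer-join idea from the question can be checked against the sample data with Python's built-in sqlite3 module. This is only a sketch: the in-memory tables, the aliases, and the ISO dates (the question's dates read as day-month-year) are assumptions, and it uses a LEFT JOIN starting from status instead of the RIGHT JOIN in the raw query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE status (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE status_change (
        id INTEGER PRIMARY KEY, candidate_id INTEGER, status_id INTEGER, created_at TEXT
    );
    INSERT INTO status VALUES
        (1, 'Review'), (2, 'Accepted'), (3, 'Rejected'), (4, 'Banned');
    INSERT INTO status_change VALUES
        (1, 1, 1, '2019-01-03'),
        (2, 1, 2, '2019-01-05'),
        (4, 2, 1, '2019-01-01'),
        (5, 3, 1, '2019-01-01'),
        (6, 4, 3, '2019-01-01');
    """
)

rows = conn.execute(
    """
    SELECT s.name, COUNT(latest.candidate_id) AS nlast
    FROM status AS s
    LEFT JOIN (
        -- one row per candidate: the status of their most recent change
        SELECT sc.status_id, sc.candidate_id
        FROM status_change AS sc
        JOIN (
            SELECT candidate_id, MAX(created_at) AS latest_date
            FROM status_change
            GROUP BY candidate_id
        ) AS l
          ON l.candidate_id = sc.candidate_id AND l.latest_date = sc.created_at
    ) AS latest ON latest.status_id = s.id
    GROUP BY s.id, s.name
    ORDER BY s.id
    """
).fetchall()
print(rows)
```

COUNT over a column from the joined side skips the NULLs produced for unmatched statuses, which is what yields the zero for "Banned".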
I’m trying to translate a window function from SQL to Pandas. It should only be applied when a match is possible; otherwise a NULL (None) value is inserted.
SQL-Code (example)
SELECT
[Customer].[ID_customer],
[Customer].[cTimestamp],
[TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
SELECT * FROM(
SELECT [ID_req], [ID_customer], [rTimestamp],
RANK() OVER(PARTITION BY ID_customer ORDER BY rTimestamp DESC) as rnk
FROM [table].[Customer_request]
) AS [Q]
WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if exists) to the customer.
table:Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1 | 2014 |
| 2 | 2014 |
| 3 | 2015 |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1 | 1 | 2012 |
| 2 | 1 | 2013 |
| 3 | 1 | 2014 |
| 4 | 2 | 2014 |
+--------+-------------+------------+
Result: table:merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1 | 2014 | 3 |
| 2 | 2014 | 4 |
| 3 | 2015 | None/NULL |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
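Instead of using the RANK() function, you can simply use the query below, which is easier to convert. Note that it takes the highest ID_req per customer, which matches the latest request only if request IDs increase with rTimestamp, as they do in the example.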
SELECT A.ID_Customer,A.cTimeStamp,B.ID_req
FROM Customer A
LEFT JOIN (
SELECT ID_Customer, MAX(ID_req) AS ID_req
FROM Customer_request
GROUP BY ID_Customer
)B
ON A.ID_Customer = B.ID_Customer
Try the query above; if you run into any issues, ask me in the comments.
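On the Pandas side, one possible translation of the original RANK() version is sketched below. It assumes the two tables are already loaded as DataFrames (the variable names customer, customer_request, latest and merged are made up here) and uses groupby(...).rank(method="min", ascending=False) as the stand-in for RANK() OVER (PARTITION BY ... ORDER BY ... DESC).

```python
import pandas as pd

customer = pd.DataFrame(
    {"ID_customer": [1, 2, 3], "cTimestamp": [2014, 2014, 2015]}
)
customer_request = pd.DataFrame(
    {
        "ID_req": [1, 2, 3, 4],
        "ID_customer": [1, 1, 1, 2],
        "rTimestamp": [2012, 2013, 2014, 2014],
    }
)

# RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC)
ranked = customer_request.assign(
    rnk=customer_request.groupby("ID_customer")["rTimestamp"].rank(
        method="min", ascending=False
    )
)

# WHERE rnk = 1
latest = ranked.loc[ranked["rnk"] == 1, ["ID_customer", "ID_req"]]

# LEFT JOIN back onto the customers; customers without requests get NaN.
merged = customer.merge(latest, on="ID_customer", how="left").rename(
    columns={"ID_req": "ID of Latest request"}
)
print(merged)
```

Because customer 3 has no request, the joined column contains a missing value and Pandas stores it as float (3.0, 4.0, NaN). As with RANK() in SQL, ties on rTimestamp would produce more than one row per customer.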
I have a 'master table' that contains just one column of ids from all my other tables. I also have several other tables that contain some of the ids, along with other columns of data. I am trying to iterate through all of the ids for each smaller table, create a new column for the smaller table, check if the id exists on that table and create a binary entry in the master table. (0 if the id doesn't exist, and 1 if the id does exist on the specified table)
That seems pretty confusing, but the application of this is to check if a user exists on the table for a specific date, and keep track of this information day to day.
Right now I am iterating through the dates, and inside each iteration I am iterating through all of the ids to check if they exist for that date. This is likely going to be incredibly slow, and there is probably a better way to do it. My code looks like this:
def main():
    dates = init()
    id_list = getids()
    print(dates)
    for date in reversed(dates):
        cursor.execute("ALTER TABLE " + table + " ADD " + date + " BIT;")
        cnx.commit()
        for ID in id_list:
            (...)
I know that the next step will be to generate a query using each id that looks something like:
SELECT id FROM [date]_table
WHERE EXISTS (SELECT 1 FROM master_table WHERE master_table.id = [date]_table.id)
I've been stuck on this problem for a couple days and so far I cannot come up with a query that gives a useful result.
For an example, if I had three tables for three days...
Monday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1002 | ... |
| 1003 | ... |
| 1004 | ... |
| 1005 | ... |
+------+-----+
Tuesday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1003 | ... |
| 1005 | ... |
+------+-----+
Wednesday:
+------+-----+
| id | ... |
+------+-----+
| 1002 | ... |
| 1004 | ... |
+------+-----+
I'd like to end up with a master table like this:
+------+--------+---------+-----------+
| id | monday | tuesday | wednesday |
+------+--------+---------+-----------+
| 1001 | 1 | 1 | 0 |
| 1002 | 1 | 0 | 1 |
| 1003 | 1 | 1 | 0 |
| 1004 | 1 | 0 | 1 |
| 1005 | 1 | 1 | 0 |
+------+--------+---------+-----------+
Thank you ahead of time for any help with this issue. And since it's sort of a confusing problem, please let me know if there are any additional details I can provide.
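One way to avoid the inner loop over ids is a single set-based UPDATE per day. The sketch below demonstrates the idea with Python's built-in sqlite3 module on the three-day example. The database engine, the monday_table / tuesday_table / wednesday_table names and the INTEGER flag columns are assumptions for illustration; the question's own setup uses BIT columns and a different connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sample data from the example: which ids appear on which day.
day_tables = {
    "monday": [1001, 1002, 1003, 1004, 1005],
    "tuesday": [1001, 1003, 1005],
    "wednesday": [1002, 1004],
}

conn.execute("CREATE TABLE master_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO master_table (id) VALUES (?)", [(i,) for i in range(1001, 1006)]
)
for day, ids in day_tables.items():
    conn.execute(f"CREATE TABLE {day}_table (id INTEGER PRIMARY KEY)")
    conn.executemany(f"INSERT INTO {day}_table (id) VALUES (?)", [(i,) for i in ids])

# One ALTER plus one set-based UPDATE per day, with no per-id loop.
# Table and column names cannot be bound as parameters, so they are taken
# from the trusted dict keys above rather than from user input.
for day in day_tables:
    conn.execute(f"ALTER TABLE master_table ADD COLUMN {day} INTEGER NOT NULL DEFAULT 0")
    conn.execute(
        f"UPDATE master_table SET {day} = 1 "
        f"WHERE EXISTS (SELECT 1 FROM {day}_table AS d WHERE d.id = master_table.id)"
    )
conn.commit()

rows = conn.execute("SELECT * FROM master_table ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Each new column defaults to 0, so only the ids present in that day's table need to be touched, and the database does the matching in one statement.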