How to implement cross join in django for a count annotation - python

I present a simplified version of my problem. I have venues and timeslots and users and bookings, as shown in the model descriptions below. Time slots are universal for all venues, and users can book into a time slot at a venue up until the venue capacity is reached.
class Venue(models.Model):
    name = models.CharField(max_length=200)
    capacity = models.PositiveIntegerField(default=0)
class TimeSlot(models.Model):
    start_time = models.TimeField()
    end_time = models.TimeField()
class Booking(models.Model):
    user = models.ForeignKey(User)
    time_slot = models.ForeignKey(TimeSlot)
    venue = models.ForeignKey(Venue)
Now I would like to get, as efficiently as possible, all combinations of Venues and TimeSlots and annotate each combination with the count of bookings made for it, including the case where the number of bookings is 0.
I have managed to achieve this in raw SQL using a cross join on the Venue and TimeSlot tables, something to the effect of the query below. However, despite exhaustive searching, I have not been able to find a Django equivalent.
SELECT venue.name, timeslot.start_time, timeslot.end_time, count(booking.id)
FROM myapp_venue as venue
CROSS JOIN myapp_timeslot as timeslot
LEFT JOIN myapp_booking as booking
    on booking.time_slot_id = timeslot.id AND booking.venue_id = venue.id
GROUP BY venue.name, timeslot.start_time, timeslot.end_time
I'm also able to annotate a query with the count of bookings for the combinations that have at least one booking, but combinations with 0 bookings get excluded. Example:
qs = Booking.objects.all().values(
    venue=F('venue__name'),
    start_time=F('time_slot__start_time'),
    end_time=F('time_slot__end_time')
).annotate(bookings=Count('id')) \
 .order_by('venue', 'start_time', 'end_time')
How can I achieve the effect of the CROSS JOIN query using the django ORM?

I don't believe Django can do cross joins without dropping down to raw SQL. I can give you two ideas that could point you in the right direction, though:
Combination of queries and Python loops.
venues = Venue.objects.all()
time_slots = TimeSlot.objects.all()
qs = ...  # your custom query above
# Loop through both querysets to create a master list.
venue_time_slots = []
for venue in venues:
    for time_slot in time_slots:
        # Use a list (not a tuple) so the count can be updated in place.
        venue_time_slots.append([venue.name, time_slot.start_time, time_slot.end_time, 0])
# Loop through the master list and compare it to the custom qs to update the count.
for venue_time in venue_time_slots:
    for vt in qs:
        # values() yields dicts, so the fields are looked up by key.
        if venue_time[0] == vt['venue'] and venue_time[1] == vt['start_time']:
            venue_time[3] += vt['bookings']
            break
The harder option, for which I don't have a complete solution, is to use a combination of filter, exclude, and union. I have only used this with 3 tables (two parents with a child link table), whereas you have 4 including User, so I can only provide the logic and not a working example.
# Get all results that exist in table using .filter().
first_query.filter()
# Get all results that do not exist by using .exclude().
# You can use your results from the first query to exclude also, but
# would need to create an interim list.
exclude_ids = [fq_row.id for fq_row in first_query]
second_query.exclude(id__in=exclude_ids)
# Combine both queries
query = first_query.union(second_query)
return query
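The loop idea from the first option can be tightened so that the bookings are scanned only once. The following is a minimal sketch in plain Python, not Django code: the lists venues, time_slots and bookings are hypothetical stand-ins for the three querysets, and count_bookings is a name made up for this illustration. itertools.product plays the role of the CROSS JOIN, and a Counter supplies 0 for combinations without bookings.

```python
from collections import Counter
from itertools import product

# Hypothetical sample data standing in for the three querysets.
venues = ["Hall A", "Hall B"]
time_slots = [("09:00", "10:00"), ("10:00", "11:00")]
# Each booking is reduced to a (venue name, time slot) pair.
bookings = [
    ("Hall A", ("09:00", "10:00")),
    ("Hall A", ("09:00", "10:00")),
    ("Hall B", ("10:00", "11:00")),
]


def count_bookings(venues, time_slots, bookings):
    """Return {(venue, slot): count} for every venue/slot pair, zeros included."""
    counts = Counter(bookings)  # a single pass over the bookings
    # product() yields every venue/slot combination (the cross join);
    # a Counter returns 0 for keys it has never seen.
    return {pair: counts[pair] for pair in product(venues, time_slots)}


result = count_bookings(venues, time_slots, bookings)
for (venue, (start, end)), n in result.items():
    print(venue, start, end, n)
```

With real querysets the same shape should work if the bookings are first reduced to (venue, time slot) keys, for example from the grouped values() query in the question.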

Related

Django ORM - Condition or Filter on LEFT JOIN

I will try to be precise with this as much as possible.
Imagine these two models, whose relation was set up years ago:
class Event(models.Model):
    instance_created_date = models.DateTimeField(auto_now_add=True)
    car = models.ForeignKey(Car, on_delete=models.CASCADE, related_name="car_events")
    ...
    # a lot of normal text fields here, but they don't matter for this problem.
and
class Car(models.Model):
    # a lot of text fields here, but they don't matter for this problem.
    hide_from_company_search = models.BooleanField(default=False)
    images = models.ManyToManyField(Image, through=CarImage)
Let's say I want to query the number of events for a given car:
def get_car_events_qs() -> QuerySet:
    six_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=6))
    cars = Car.objects.prefetch_related(
        'car_events',
    ).filter(
        some_conditions_on_fields=False,
    ).annotate(
        num_car_events=Count(
            'car_events',
            filter=Q(car_events__instance_created_date__gt=six_days_ago), distinct=True)
    )
    return cars
The really tricky part is the performance of the query: Cars has 450,000 entries and Events has 156,850,048. All fields that I am using in the query are indexed. The query takes around 4 minutes to complete, depending on the db load. It took 18 minutes before adding the indices.
The ORM query above results in the following SQL:
SELECT
    "core_car"."id",
    COUNT("analytics_carevent"."id") FILTER (WHERE ("analytics_carevent"."event" = 'view'
        AND "analytics_carevent"."instance_created_date" >= '2022-05-10T07:45:16.672279+00:00'::timestamptz
        AND "analytics_carevent"."instance_created_date" < '2022-05-11T07:45:16.672284+00:00'::timestamptz)) AS "num_cars_view"
FROM
    "core_car"
    LEFT OUTER JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id")
WHERE
    ... some conditions that don't matter
GROUP BY
    "core_car"."id"
I somehow suspect this FILTER to be a problem.
I tried with
.annotate(num_car_events=Count('car_events'))
and moving the car_events__instance_created_date__gt=six_days_ago into the filter:
.filter(some_conditions_on_fields=False, car_events__instance_created_date__gt=six_days_ago)
But of course this would filter out Cars with no Events, which is not what we want - but it is super fast!
I fiddled a bit with it in raw SQL and came to this nice working example, which I would now like to express in the ORM, since we don't really want to use raw SQL. This query takes 2.2s, which is within our acceptable range and far less than the 18 minutes.
SELECT
    "core_car"."id",
    COUNT(DISTINCT "analytics_carevent"."id") AS "num_cars_view"
FROM
    "core_car"
    LEFT JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id" AND "analytics_carevent"."event" = 'view' AND "analytics_carevent"."instance_created_date" > '2022-05-14T00:00:00+02:00'::timestamptz
        AND "analytics_carevent"."instance_created_date" <= '2022-05-15T00:00:00+02:00'::timestamptz)
WHERE (some conditions that don't matter)
GROUP BY "core_car"."id";
My question now is:
How can I make the above query into the ORM?
I need to put the "filter" or conditions onto the left join. If I just use filter() it will just put it into the where clause, which is wrong.
I tried:
two_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=2))
cars = Car.objects.prefetch_related(
    'car_events',
).filter(some_filters,)
and
cars = cars.annotate(events=FilteredRelation('car_events')).filter(car_events__car_id__in=cars.values_list("id", flat=True), car_events__instance_created_date__gt=six_days_ago)
But I don't think this is quite correct. I also need the count of the annotation.
Using Django 4 and latest python release as of this writing. :)
Thanks a lot!
TLDR: Putting a filter or conditions on LEFT JOIN in django, instead of queryset.filter()
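To make the difference between the two placements concrete, here is a small sketch using Python's built-in sqlite3 module rather than the ORM. The car and event tables, their columns and the sample rows are invented for this illustration and are much simpler than the real schema. It only shows why a condition in the ON clause keeps cars with zero matching events while the same condition in WHERE drops them. On the Django side, FilteredRelation with a condition argument is documented as the way to get a condition into the ON clause of a join, and aggregating over its alias may be worth trying, but I have not tested that against this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE car (id INTEGER PRIMARY KEY);
    CREATE TABLE event (id INTEGER PRIMARY KEY, car_id INTEGER, created TEXT);
    INSERT INTO car (id) VALUES (1), (2), (3);
    INSERT INTO event (car_id, created) VALUES
        (1, '2022-05-14'), (1, '2022-05-01'), (2, '2022-05-01');
""")
cutoff = "2022-05-10"  # ISO dates compare correctly as text

# Condition inside ON: unmatched cars survive the LEFT JOIN with a NULL
# event row, so COUNT(event.id) is 0 for them.
on_rows = conn.execute("""
    SELECT car.id, COUNT(event.id)
    FROM car
    LEFT JOIN event ON event.car_id = car.id AND event.created > ?
    GROUP BY car.id
    ORDER BY car.id
""", (cutoff,)).fetchall()

# Same condition in WHERE: the NULL rows fail the comparison and the
# cars without recent events disappear from the result.
where_rows = conn.execute("""
    SELECT car.id, COUNT(event.id)
    FROM car
    LEFT JOIN event ON event.car_id = car.id
    WHERE event.created > ?
    GROUP BY car.id
    ORDER BY car.id
""", (cutoff,)).fetchall()

print(on_rows)     # [(1, 1), (2, 0), (3, 0)]
print(where_rows)  # [(1, 1)]
```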

Django queryset order by latest value in related field

Consider the following Models in Django:
class Item(models.Model):
    name = models.CharField(max_length = 100)
class Item_Price(models.Model):
    created_on = models.DateTimeField(default = timezone.now)
    item = models.ForeignKey('Item', related_name = 'prices')
    price = models.DecimalField(decimal_places = 2, max_digits = 15)
The price of an item can vary throughout time so I want to keep a price history.
My goal is to have a single query using the Django ORM to get a list of Items with their latest prices and sort the results by this price in ascending order.
What would be the best way to achieve this?
You can use a Subquery to obtain the latest Item_Price object and sort on these:
from django.db.models import OuterRef, Subquery
last_price = Item_Price.objects.filter(
    item_id=OuterRef('pk')
).order_by('-created_on').values('price')[:1]
Item.objects.annotate(
    last_price=Subquery(last_price)
).order_by('last_price')
For each Item, we thus obtain the latest Item_Price and use this in the annotation.
That being said, the above modelling is perhaps not ideal, since it will require a lot of complex queries. django-simple-history [readthedocs.io] does this differently by creating an extra model and saving historical records. It also has a manager that allows one to query for historical states. This perhaps makes working with historical data simpler.
You could prefetch them in order to do the nested ordering inline like the following:
from django.db.models import Prefetch
prefetched_prices = Prefetch("prices", queryset=Item_Price.objects.order_by("price"))
for i in Item.objects.prefetch_related(prefetched_prices): i.name, i.prices.all()

Django ORM Subqueries

I'm trying to figure out how to perform the following SQL query with the Django ORM:
SELECT main.A, main.B, main.C
FROM main,
    (SELECT main.A, MAX(main.B) AS B
    FROM main
    GROUP BY main.A) subq
WHERE main.A = subq.A
AND main.B = subq.B
The last two lines are necessary because they recover the column C value when B is at a maximum in the group by. Without them, I would have A and the corresponding Max B but not the C value when B is at its max. I have searched extensively but cannot find an example that can construct this query using the Django ORM. Most examples use Django's Subquery class and show how to match the sub-queryset up with one column (so doing main.A = subq.A). But how do I match 2+ columns?
Edit:
Here is the model class:
class Tweets(models.Model):
    tweet_id = models.AutoField(primary_key=True)
    tweet_date = models.DateTimeField(blank=True)
    candidate = models.CharField(max_length=100)
    district = models.IntegerField(blank=True)
    username = models.CharField(max_length=256)
    likes = models.IntegerField(blank=True)
    tweet_text = models.CharField(max_length=560)
I'd like to group by "candidate" and "district", then find the tweet with the most likes. But I'd also like to know the "username" and "tweet_text" associated with that tweet that had the most likes.
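As a reference point for what the ORM query has to reproduce, here is a sketch of the target SQL run through Python's built-in sqlite3 module. The table is a cut-down, hypothetical copy of the Tweets model and the rows are made up. The derived table is joined on both grouping columns plus the maximum, which is the "match 2+ columns" part. In the ORM, one idea (untested here) is a Subquery that filters on several OuterRef fields at once, orders by likes descending and is sliced to one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        tweet_id INTEGER PRIMARY KEY,
        candidate TEXT,
        district INTEGER,
        username TEXT,
        likes INTEGER,
        tweet_text TEXT
    );
    INSERT INTO tweets (candidate, district, username, likes, tweet_text) VALUES
        ('Smith', 1, 'alice', 10, 'first'),
        ('Smith', 1, 'bob',   25, 'second'),
        ('Smith', 2, 'carol',  5, 'third'),
        ('Jones', 1, 'dave',   7, 'fourth');
""")

# Join each tweet to the per-group maximum on BOTH grouping columns and
# on the like count, so the other columns of the winning row come along.
top_tweets = conn.execute("""
    SELECT t.candidate, t.district, t.username, t.likes, t.tweet_text
    FROM tweets AS t
    JOIN (
        SELECT candidate, district, MAX(likes) AS max_likes
        FROM tweets
        GROUP BY candidate, district
    ) AS best
      ON t.candidate = best.candidate
     AND t.district = best.district
     AND t.likes = best.max_likes
    ORDER BY t.candidate, t.district
""").fetchall()

for row in top_tweets:
    print(row)
```

Note that if two tweets in the same group tie on likes, this form returns both of them.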

Django: How can I add an aggregated field to a queryset based on data from the row and data from another Model?

I have a Django App with the following models:
CURRENCY_CHOICES = (('USD', 'US Dollars'), ('EUR', 'Euro'))
class ExchangeRate(models.Model):
    currency = models.CharField(max_length=3, default='USD', choices=CURRENCY_CHOICES)
    rate = models.FloatField()
    exchange_date = models.DateField()
class Donation(models.Model):
    donation_date = models.DateField()
    donor = models.CharField(max_length=250)
    amount = models.FloatField()
    currency = models.CharField(max_length=3, default='USD', choices=CURRENCY_CHOICES)
I also have a form I use to filter donations based on some criteria:
class DonationFilterForm(forms.Form):
    min_amount = forms.FloatField(required=False)
    max_amount = forms.FloatField(required=False)
The min_amount and max_amount fields will always represent values in US Dollars.
I need to be able to filter a queryset based on min_amount and max_amount, but for that all the amounts must be in USD. To convert the donation amount to USD I need to multiply by the ExchangeRate of the donation currency and date.
The only way I found of doing this so far is by iterating the dict(queryset) and adding a new value called usd_amount, but that may offer very poor performance in the future.
Reading Django documentation, it seems the same thing can be done using aggregation, but so far I haven't been able to create the right logic that would give me same result.
I knew I had to use annotate to solve this, but I didn't know exactly how because it involved getting data from an unrelated Model.
Upon further investigation I found what I needed in the Django Documentation. I needed to use the Subquery and the OuterRef expressions to get the values from the outer queryset so I could filter the inner queryset.
The final solution looks like this:
from django.db.models import F, OuterRef, Subquery
# Prepare the filter with dynamic fields using OuterRef
# (donation_date is the date field on the outer Donation queryset)
rates = ExchangeRate.objects.filter(exchange_date=OuterRef('donation_date'), currency='EUR')
# Get the exchange rate for every donation made in Euros
qs = Donation.objects.filter(currency='EUR').annotate(exchange_rate=Subquery(rates.values('rate')[:1]))
# Get the equivalent amount in USD
qs = qs.annotate(usd_amount=F('amount') * F('exchange_rate'))
So, finally, I could filter the resulting queryset like so:
final_qs = qs.filter(usd_amount__gte=min_amount, usd_amount__lte=max_amount)

In Django ORM: Select record from each group with maximal value of a given attribute

Say I have three models as follows representing the prices of goods sold at several retail locations of the same company:
class Store(models.Model):
    name = models.CharField(max_length=256)
    address = models.TextField()
class Product(models.Model):
    name = models.CharField(max_length=256)
    description = models.TextField()
class Price(models.Model):
    store = models.ForeignKey(Store)
    product = models.ForeignKey(Product)
    effective_date = models.DateField()
    value = models.FloatField()
When a price is set, it is set on a store-and-product-specific basis. I.e. the same item can have different prices in different stores. And each of these prices has an effective date. For a given store and a given product, the currently-effective price is the one with the latest effective_date.
What's the best way to write the query that will return the currently-effective price of all items in all stores?
If I were using Pandas, I would get myself a dataframe with columns ['store', 'product', 'effective_date', 'price'] and I would run
dataframe\
    .sort_values(by=['store', 'product', 'effective_date'], ascending=[True, True, False])\
    .groupby(['store', 'product'])['price'].first()
But there has to be some way of doing this directly on the database level. Thoughts?
If your DBMS is PostgreSQL you can use distinct combined with order_by this way :
Price.objects.order_by('store','product','-effective_date').distinct('store','product')
It will give you all the latest prices for all product/store combinations.
There are tricks about distinct, have a look at the docs here : https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.distinct
Without Postgres' added power (which you should really use) there is a more complicated solution to this (based on ryanpitts' idea), which requires two db hits:
latest_set = (
    Price.objects
    .values('store_id', 'product_id')  # important to have values before annotate ...
    .annotate(max_date=Max('effective_date')).order_by()
)
# ... to annotate for the grouping that results from values
# Build a query that reverse-engineers the Price records that contributed to
# 'latest_set'. (Relying on the fact that there are not 2 Prices
# for the same product-store with an identical date)
q_statement = Q(product_id=-1)  # sth. that results in empty qs
for latest_dict in latest_set:
    q_statement |= (
        Q(product_id=latest_dict['product_id']) &
        Q(store_id=latest_dict['store_id']) &
        Q(effective_date=latest_dict['max_date'])
    )
Price.objects.filter(q_statement)
If you are using PostgreSQL, you could use order_by and distinct to get the current effective prices for all the products in all the stores as follows:
prices = (
    Price.objects.order_by('store', 'product', '-effective_date')
    .distinct('store', 'product')
)
Now, this is quite analogous to what you have there for Pandas.
Do note that using field names in distinct only works in PostgreSQL. Once you have sorted the prices based on store, product and decreasing order of effective date, distinct('store', 'product') will retain only the first entry for each store-product pair and that will be your current entry with recent price.
Not PostgreSQL database:
If you are not using PostgreSQL, you could do it with two queries:
First, we get latest effective date for all the store-product groups:
latest_effective_dates = (
    Price.objects.values('store_id', 'product_id')
    .annotate(led=Max('effective_date')).values('led')
)
Once we have these dates we can get the prices for them:
prices = Price.objects.filter(effective_date__in=latest_effective_dates)
Disclaimer: This assumes that the latest effective_date of one store-product group never coincides with an older effective_date of another group; otherwise the older price is returned as well.
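For comparison, the "sort, then keep the first row per store-product pair" logic that distinct('store', 'product') relies on can be sketched in plain Python. This is only an illustration of the idea, not a replacement for doing it in the database: rows is a hypothetical list of tuples standing in for Price records, and current_prices is a name made up here.

```python
from datetime import date

# Hypothetical rows standing in for Price records:
# (store, product, effective_date, value)
rows = [
    ("north", "apple", date(2016, 1, 1), 1.00),
    ("north", "apple", date(2016, 3, 1), 1.20),
    ("south", "apple", date(2016, 2, 1), 0.90),
    ("north", "pear", date(2016, 1, 15), 2.00),
]


def current_prices(rows):
    """Return {(store, product): (effective_date, value)} for the latest date."""
    # Newest dates first, then a stable sort by (store, product): within
    # each pair the rows stay ordered from newest to oldest.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    ordered.sort(key=lambda r: (r[0], r[1]))
    latest = {}
    for store, product, effective_date, value in ordered:
        # setdefault keeps only the first (newest) row seen for each pair.
        latest.setdefault((store, product), (effective_date, value))
    return latest


for key, (effective_date, value) in current_prices(rows).items():
    print(key, effective_date, value)
```

Unlike the effective_date__in approach, this matches on the store-product pair itself, so dates shared between groups do not cause extra rows.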
