Django queryset - Add HAVING constraint after annotate(F())

I ran into a seemingly ordinary situation: adding HAVING to a query.
I read here and here, but it did not help me.
I need to add HAVING to my query.
Models:
class Package(models.Model):
    status = models.IntegerField()
class Product(models.Model):
    title = models.CharField(max_length=10)
    packages = models.ManyToManyField(Package)
Products:
|id|title|
| - | - |
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
Packages:
|id|status|
| - | - |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
Product_Packages:
|product_id|package_id|
| - | - |
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
| 3 | 2 |
| 2 | 3 |
| 4 | 3 |
| 4 | 4 |
visual
pack_1 (A, B) status OK
pack_2 (B, C) status not ok
pack_3 (B, D) status OK
pack_4 (D) status not ok
My task is to select those products whose latest package has status = 1.
The expected result is: A, B
My query looks like this:
SELECT prod.title, max(pack.id)
FROM "product" as prod
INNER JOIN "product_packages" as p_p ON (prod.id = p_p.product_id)
INNER JOIN "package" as pack ON (pack.id = p_p.package_id)
GROUP BY prod.title
HAVING pack.status = 1
It returns exactly what I need:
|title|max(pack.id)|
| - | - |
| A | 1 |
| B | 3 |
But my ORM query does not work correctly.
I tried this:
p = Product.objects.values('id').annotate(pack_id = Max('packages')).annotate(my_status = F('packages__status')).filter(my_status=1).values('id', 'pack_id')
p.query
SELECT "product"."id", MAX("product_packages"."package_id") AS "pack_id"
FROM "product" LEFT OUTER JOIN "product_packages" ON ("product"."id" = "product_packages"."product_id") LEFT OUTER JOIN "package" ON ("product_packages"."package_id" = "package"."id")
WHERE "package"."status" = 1
GROUP BY "product"."id"
Please help me write the correct ORM query.

What does the query look like when you remove
.values('id', 'pack_id')
at the end?
If I remember correctly, then:
p = Product.objects.values('id').annotate(pack_id = Max('packages')).annotate(my_status = F('packages__status')).filter(my_status=1)
and
p = Product.objects.annotate(pack_id = Max('packages')).annotate(my_status = F('packages__status')).filter(my_status=1).values('id')
will result in different queries.

Haki Benita has an excellent site aimed at database gurus, about how to make the most of Django.
You can take a look at this post:
https://hakibenita.com/django-group-by-sql#how-to-use-having
Django has a very specific way of adding the HAVING clause: call values() first to single out the column you want to group by, then annotate the Max (or whatever aggregate you want), and then filter on that annotation. A filter on an aggregate annotation is what ends up in HAVING.
Also, this annotation looks like it won't work: annotate(my_status=F('packages__status')). It tries to squeeze multiple statuses into a single annotation.
You might want to try a subquery to annotate the way you want.
e.g.
Product.objects.annotate(
    latest_pack_id=Subquery(
        Package.objects.order_by('-pk').filter(status=1).values('pk')[:1]
    )
).filter(
    packages__in=F('latest_pack_id')
)
Or something along those lines, I haven't tested this out
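For comparison, the "latest package has status 1" condition can also be written as a plain GROUP BY / HAVING in SQL by comparing two aggregates. The following is only a sketch using Python's built-in sqlite3 module and the sample data from the question. The unquoted table names and the conditional-aggregate trick are assumptions made for illustration, not something taken from the answers here, and it is raw SQL rather than ORM code.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE package (id INTEGER PRIMARY KEY, status INTEGER);
CREATE TABLE product_packages (product_id INTEGER, package_id INTEGER);
INSERT INTO product VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO package VALUES (1, 1), (2, 2), (3, 1), (4, 2);
INSERT INTO product_packages VALUES
    (1, 1), (2, 1), (2, 2), (3, 2), (2, 3), (4, 3), (4, 4);
""")

# Keep a product only when its newest package id equals its newest
# package id among the status = 1 packages (NULL when there is none).
having_rows = conn.execute("""
SELECT prod.title, MAX(pack.id) AS latest_pack_id
FROM product AS prod
JOIN product_packages AS p_p ON prod.id = p_p.product_id
JOIN package AS pack ON pack.id = p_p.package_id
GROUP BY prod.title
HAVING MAX(pack.id) = MAX(CASE WHEN pack.status = 1 THEN pack.id END)
ORDER BY prod.title
""").fetchall()

print(having_rows)  # [('A', 1), ('B', 3)]
```

Both sides of the HAVING comparison are aggregates, so the query does not depend on which row of the group the database happens to pick for a bare pack.status.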

I think you can try it like this, with a subquery:
from django.db.models import OuterRef, Subquery
sub_query = Package.objects.filter(product=OuterRef('pk')).order_by('-pk')
products = Product.objects.annotate(latest_package_status=Subquery(sub_query.values('status')[:1])).filter(latest_package_status=1)
Here I first prepare the subquery by filtering the Package model on the Product's primary key and ordering by the Package's primary key, newest first. Then I take the status of the latest package from the subquery (the [:1] slice keeps it a lazy, single-row subquery; [0] would try to evaluate it immediately), annotate it onto the Product queryset, and keep only the products where that status is 1.
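The logic of that subquery can be checked outside Django. The sketch below uses Python's built-in sqlite3 module with the sample data from the question and a hand-written correlated subquery. The table names are assumed, and the SQL is an approximation of the idea rather than the exact statement the ORM would generate.

```python
import sqlite3


def products_with_latest_status(conn, status):
    """Return titles of products whose newest package has the given status."""
    rows = conn.execute("""
        SELECT prod.title
        FROM product AS prod
        WHERE (
            -- status of the newest package linked to this product
            SELECT pack.status
            FROM package AS pack
            JOIN product_packages AS p_p ON pack.id = p_p.package_id
            WHERE p_p.product_id = prod.id
            ORDER BY pack.id DESC
            LIMIT 1
        ) = ?
        ORDER BY prod.title
    """, (status,)).fetchall()
    return [title for (title,) in rows]


db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE package (id INTEGER PRIMARY KEY, status INTEGER);
CREATE TABLE product_packages (product_id INTEGER, package_id INTEGER);
INSERT INTO product VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO package VALUES (1, 1), (2, 2), (3, 1), (4, 2);
INSERT INTO product_packages VALUES
    (1, 1), (2, 1), (2, 2), (3, 2), (2, 3), (4, 3), (4, 4);
""")

latest_ok = products_with_latest_status(db, 1)
print(latest_ok)  # ['A', 'B']
```

Product D drops out because its newest package (id 4) has status 2, even though it also has a status 1 package.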

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a specific ID. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B, applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far I haven't been able to achieve the described result. I would appreciate any tips or solutions.

You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
    select
        tmp1.id
        ,tmp1.value_one / tmp2.value_one as value_one
        ,'C' as type
        ,tmp1.value_two / tmp2.value_two as value_two
    from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
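To check the arithmetic without a Spark session, the same split, join on id, and divide logic can be sketched in plain Python. This is only an illustration of what the SQL above computes, not PySpark code. The function name combine_a_over_b is made up for the example.

```python
# Sample rows from the question, one dict per table row.
rows = [
    {"id": 1, "value_one": 2, "type": "A", "value_two": 1},
    {"id": 1, "value_one": 4, "type": "B", "value_two": 1},
    {"id": 2, "value_one": 3, "type": "A", "value_two": 2},
    {"id": 2, "value_one": 1, "type": "B", "value_two": 3},
]


def combine_a_over_b(rows):
    """Join the type-A and type-B rows on id and emit A/B as type C."""
    a_rows = {r["id"]: r for r in rows if r["type"] == "A"}
    b_rows = {r["id"]: r for r in rows if r["type"] == "B"}
    combined = []
    # Inner join: only ids that have both an A row and a B row.
    for key in sorted(a_rows.keys() & b_rows.keys()):
        a, b = a_rows[key], b_rows[key]
        combined.append({
            "id": key,
            "value_one": a["value_one"] / b["value_one"],
            "type": "C",
            "value_two": a["value_two"] / b["value_two"],
        })
    return combined


combined = combine_a_over_b(rows)
for row in combined:
    print(row)
```

For id 2 the second value is 2/3, which the expected table in the question shows truncated as 0.66.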

Annotating a count of a superset of fields with Django

So the setup here is that I have a Post table that contains a bunch of posts. Some of these rows are different versions of the same post, which are grouped together by post_version_group_id, so it looks something like:
pk | title | post_version_group_id
1 | a | 123
2 | b | 789
3 | c | 123
4 | d | 123
so there are two "groups" of posts, and 4 posts in total. Now each post has a foreign key pointing to a PostDownloads table that looks like
post | user_downloaded
1 | user1
2 | user2
3 | user3
4 | user4
what I'd like to be able to do is annotate my Post queryset so that it looks like:
pk | title | post_version_group_id | download_count
1 | a | 123 | 3
2 | b | 789 | 1
3 | c | 123 | 3
4 | d | 123 | 3
i.e have all the posts with the same post_version_group_id have the same count (being the sum of downloads across the different versions).
At the moment, I'm currently doing:
Post.objects.all().annotate(download_count=models.Count("downloads__user_downloaded", distinct=True))
which doesn't quite work, it annotates a download_count which looks like:
pk | title | post_version_group_id | download_count
1 | a | 123 | 1
2 | b | 789 | 1
3 | c | 123 | 1
4 | d | 123 | 1
since downloads__user_downloaded seems to be limited to the rows of the downloads table that are linked to the post row currently being annotated. That makes sense, really, but it is working against me in this particular case.
One thing I've also tried is
Post.objects.all().values("post_version_group_id").annotate(download_count=Count("downloads__user_downloaded", distinct=True))
which kind of works, but the .values() call turns the queryset of post instances into a queryset of dicts, and I need it to stay a queryset of post instances.
The actual models look something like:
class Post:
    title = models.CharField()
    post_version_group_id = models.UUIDField()
class PostDownloads:
    post = models.ForeignKey(Post)
    user_downloaded = models.ForeignKey(User)

So, I ended up figuring this out and thought I'd post the answer for anybody else who gets stuck in the same rut. The key here was using a Subquery, but not just any Subquery: a custom one that returns a count of rows rather than the default Subquery type that returns a single row of data.
First step is defining this custom subquery type:
class SubqueryCount(models.Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = models.IntegerField()
Then building the subquery:
downloads_subquery = (
    PostDownloads.objects
    .filter(
        post__post_version_group_id=models.OuterRef("post_version_group_id")
    )
    # distinct() with field names is only supported on PostgreSQL
    .distinct("user_downloaded")
)
which filters based on that grouping version id I had.
And finally, executing the subquery in the annotation:
Post.objects.annotate(download_count=SubqueryCount(downloads_subquery))
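The effect of that annotation can be reproduced in plain SQL. Below is a small sketch using Python's built-in sqlite3 module with the sample data from the question. The table and column names are assumptions, integer group ids stand in for the UUIDs, and COUNT(DISTINCT ...) replaces the PostgreSQL-style distinct, so this is not the SQL that SubqueryCount itself emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    id INTEGER PRIMARY KEY, title TEXT, post_version_group_id INTEGER
);
CREATE TABLE post_downloads (post_id INTEGER, user_downloaded TEXT);
INSERT INTO post VALUES (1, 'a', 123), (2, 'b', 789), (3, 'c', 123), (4, 'd', 123);
INSERT INTO post_downloads VALUES
    (1, 'user1'), (2, 'user2'), (3, 'user3'), (4, 'user4');
""")

# For every post, count the distinct users who downloaded ANY post
# that shares its post_version_group_id (a correlated count subquery).
annotated = conn.execute("""
SELECT p.id, p.title, p.post_version_group_id,
       (SELECT COUNT(DISTINCT d.user_downloaded)
        FROM post_downloads AS d
        JOIN post AS p2 ON p2.id = d.post_id
        WHERE p2.post_version_group_id = p.post_version_group_id
       ) AS download_count
FROM post AS p
ORDER BY p.id
""").fetchall()

for row in annotated:
    print(row)
```

Every row of the result is still one post, which is the property the .values() approach in the question loses.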

Django - How to combine 2 queryset and filter to get same element in both queryset?

I have a model:
class LocationItem(models.Model):
location = models.ForeignKey(Location, on_delete=models.CASCADE)
item = models.ForeignKey(Item, on_delete=models.CASCADE)
stock_qty = models.IntegerField(null=True)
Example: I have some data like this:
------------------------------
| ID | Item | Location | Qty |
------------------------------
| 1 | 1 | 1 | 10 |
------------------------------
| 2 | 2 | 1 | 5 |
------------------------------
| 3 | 1 | 2 | 2 |
------------------------------
| 4 | 3 | 1 | 4 |
------------------------------
| 5 | 3 | 2 | 20 |
------------------------------
I have 2 querysets to get the items of each location:
location_1 = LocationItem.objects.filter(location_id=1)
location_2 = LocationItem.objects.filter(location_id=2)
Now I want to combine the 2 querysets above into 1 and keep only the items that appear in both locations. For the example above the result is [Item 1, Item 3], because items 1 and 3 belong to both location 1 and location 2.

You can combine Django querysets using the following expression:
location_1 = LocationItem.objects.filter(location_id=1)
location_2 = LocationItem.objects.filter(location_id=2)
location = location_1 | location_2
The combine expression above works on querysets of the same model. Note that | is an OR, so it returns rows from either location.
To keep only the items that occur more than once, try this one:
from django.db.models import Count
dupes = LocationItem.objects.values('item__id').annotate(Count('id')).order_by().filter(id__count__gt=1)
LocationItem.objects.filter(item__in=[i['item__id'] for i in dupes]).distinct('item__id')
Maybe the solution above helps.

If you want both conditions to be true, then you need the AND operator (&)
from django.db.models import Q
Q(location_1) & Q(location_2)

Try this:
location1_items_pk = LocationItem.objects.filter(
    location_id=1
).values_list("item_id", flat=True)
result = LocationItem.objects.filter(
    item_id__in=location1_items_pk, location_id=2
)

You can do this by chaining the filters. The result of a filter is a queryset, so after the first filter the result will be [Item1, Item2, Item3], and the second filter is then applied to that queryset, which leaves [Item1, Item3]. For example:
Item.objects.filter(locationitem__location=1).filter(locationitem__location=2)
P.S. Not tested. Hope this works.
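The "present in both locations" condition amounts to joining the table to itself, once per location. Here is a sketch with Python's built-in sqlite3 module and the sample rows from the question. The table name location_item is an assumption, and the SQL is a hand-written illustration rather than what the ORM produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_item (
    id INTEGER PRIMARY KEY, item_id INTEGER,
    location_id INTEGER, stock_qty INTEGER
);
INSERT INTO location_item VALUES
    (1, 1, 1, 10), (2, 2, 1, 5), (3, 1, 2, 2), (4, 3, 1, 4), (5, 3, 2, 20);
""")

# Self-join: l1 must be a row at location 1 and l2 a row at location 2
# for the same item, so only items stocked at both locations survive.
rows = conn.execute("""
SELECT DISTINCT l1.item_id
FROM location_item AS l1
JOIN location_item AS l2 ON l1.item_id = l2.item_id
WHERE l1.location_id = 1 AND l2.location_id = 2
ORDER BY l1.item_id
""").fetchall()

items_in_both = [item_id for (item_id,) in rows]
print(items_in_both)  # [1, 3]
```

Item 2 is excluded because it only has a row at location 1.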

Using F() expressions with lookup of position from a list to update objects

I have 4 BaseReward objects that have a default ordering of (rank, id) in the Meta class. My aim is to update the objects such that I preserve their relative rankings, but give them unique ranks starting from 1.
Originally:
| id | rank |
|----|------|
| 1 | 3 |
| 2 | 2 |
| 3 | 2 |
| 4 | 1 |
after calling rerank_rewards_to_have_unique_ranks() should become
| id | rank |
|----|------|
| 1 | 4 |
| 2 | 2 |
| 3 | 3 |
| 4 | 1 |
I am trying to use an F() expression with a .index() lookup on a list, but Django won't accept it, as F() expressions support only a fixed set of operators: https://docs.djangoproject.com/en/1.8/topics/db/queries/#filters-can-reference-fields-on-the-model
Is there another way of achieving the same thing in an optimized way, without loading the objects from the database?
models.py
class BaseReward(models.Model):
    class Meta:
        ordering = ('rank', 'id')
# BaseReward.objects.all() gets the objects ordered by 'rank' as in the Meta class, and then by id if two objects have same rank
helper.py
def rerank_rewards_to_have_unique_ranks():
    qs = BaseReward.objects.all()  # this would give the rewards of that category ordered by [rank, id]
    id_list_in_order_of_rank = list(qs.values_list('id', flat=True))  # get the ordered list of ids sequenced in order of ranks
    # now I want to update the ranks of the objects, such that rank = the desired rank
    BaseReward.objects.all().update(rank=id_list_in_order_of_rank.index(F('id'))+1)
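The target mapping from id to new rank can be worked out in plain Python from (id, rank) pairs alone, which is what values_list would give without building model instances. The sketch below covers only that computation on the sample data. The function name unique_ranks is hypothetical, and no database update is performed.

```python
# (id, rank) pairs from the "Originally" table in the question.
rewards = [(1, 3), (2, 2), (3, 2), (4, 1)]


def unique_ranks(rewards):
    """Map each id to a unique rank, keeping the (rank, id) ordering."""
    ordered = sorted(rewards, key=lambda pair: (pair[1], pair[0]))
    # The position in the ordered list, starting at 1, is the new rank.
    return {
        reward_id: position
        for position, (reward_id, _old_rank) in enumerate(ordered, start=1)
    }


new_ranks = unique_ranks(rewards)
print(new_ranks)  # {4: 1, 2: 2, 3: 3, 1: 4}
```

A mapping like this could then be turned into a single conditional update, for instance with Django's Case and When expressions, instead of calling .index() on an F() object. That step is not shown here.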

Django Q objects and m2m queries

I'm totally flummoxed by this behavior. I clearly don't understand Q objects like I thought I did or I'm doing something massively stupid and obvious. Here's what I'm running into. accepted_attendee_* are all m2m relationships to OfficialProfile. In the django shell, for ease of demonstration.
>>> profile = OfficialProfile.objects.get(user__username='testofficial3')
>>> r = SchedulerEvent.objects.filter(accepted_attendee_referee=profile)
>>> l = SchedulerEvent.objects.filter(accepted_attendee_linesman=profile)
>>> o = SchedulerEvent.objects.filter(accepted_attendee_official=profile)
>>> r
[<SchedulerEvent: andrew>]
>>> l
[]
>>> o
[]
This is all as expected. Now though, if I combine them with a Q object, things get weird.
>>> qevents = SchedulerEvent.objects.filter(Q(accepted_attendee_referee=profile)|Q(accepted_attendee_official=profile)|Q(accepted_attendee_linesman=profile))
>>> qevents
[<SchedulerEvent: andrew>, <SchedulerEvent: andrew>]
Two objects are returned, both with the same PK - two duplicate objects. Should be only one, based on the individual queries. But once again, when I do this:
>>> r|l|o
[<SchedulerEvent: andrew>, <SchedulerEvent: andrew>]
What is it about this OR query that returns two objects when there should, I believe quite clearly, be only one?
EDIT
So I looked at the query that was produced and it seems like the "answer" has nothing to do with Q objects or the OR'ing at all; rather, it's the way the ORM joins the table. Here's the SQL and the results it generates, absent the OR:
mysql> SELECT `scheduler_schedulerevent`.`id`, `scheduler_schedulerevent`.`user_id`, `scheduler_schedulerevent`.`title`, `scheduler_schedulerevent`.`description`, `scheduler_schedulerevent`.`start`, `scheduler_schedulerevent`.`end`, `scheduler_schedulerevent`.`location_id`, `scheduler_schedulerevent`.`age_level_id`, `scheduler_schedulerevent`.`skill_level_id`, `scheduler_schedulerevent`.`officiating_system_id`, `scheduler_schedulerevent`.`auto_schedule`, `scheduler_schedulerevent`.`is_scheduled`
FROM `scheduler_schedulerevent`
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_referee`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_referee`.`schedulerevent_id` )
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_linesman`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_linesman`.`schedulerevent_id` )
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_official`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_official`.`schedulerevent_id` );
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
| id | user_id | title | description | start | end | location_id | age_level_id | skill_level_id | officiating_system_id | auto_schedule | is_scheduled |
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
| 1 | 1 | Test Event | | 2015-04-09 02:00:00 | 2015-04-09 02:30:00 | 161 | 1 | 1 | 3 | 0 | 0 |
| 2 | 1 | Test | | 2015-04-07 20:00:00 | 2015-04-07 21:00:00 | 161 | 1 | 1 | 3 | 1 | 0 |
| 3 | 1 | Test Auto | | 2015-04-07 20:00:00 | 2015-04-07 20:30:00 | 161 | 1 | 1 | 2 | 0 | 0 |
| 4 | 1 | Test Official | | 2015-04-16 19:00:00 | 2015-04-16 20:30:00 | 161 | 1 | 1 | 3 | 0 | 1 |
| 4 | 1 | Test Official | | 2015-04-16 19:00:00 | 2015-04-16 20:30:00 | 161 | 1 | 1 | 3 | 0 | 1 |
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
and then clearly, when you add an OR, it satisfies two query conditions based on the results of the join. So while adding distinct to the query seems unnecessary, it is quite necessary.

If you are getting two objects and they are the same, it is because that object satisfies at least two of the querysets.
In other words, <SchedulerEvent: andrew> satisfies at least two of the querysets r, l, o.
If you don't want duplicate objects, use the .distinct() function.
SchedulerEvent.objects.filter(
    Q(accepted_attendee_referee=profile)
    | Q(accepted_attendee_official=profile)
    | Q(accepted_attendee_linesman=profile)
).distinct()
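The edit in the question points at the joins as the source of the duplicates. A minimal sketch with Python's built-in sqlite3 module shows how that can happen and why DISTINCT removes it. The table names and the membership rows are hypothetical, since the question does not show the contents of the m2m tables: one event has profile 7 as referee and two other profiles as linesmen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE event_referee (event_id INTEGER, profile_id INTEGER);
CREATE TABLE event_linesman (event_id INTEGER, profile_id INTEGER);
INSERT INTO event VALUES (1, 'andrew');
INSERT INTO event_referee VALUES (1, 7);
INSERT INTO event_linesman VALUES (1, 8), (1, 9);
""")

# The two linesman rows multiply the joined rows for event 1, and the
# referee condition is true on both of them.
query = """
SELECT {select} e.id, e.title
FROM event AS e
LEFT OUTER JOIN event_referee AS r ON e.id = r.event_id
LEFT OUTER JOIN event_linesman AS l ON e.id = l.event_id
WHERE r.profile_id = 7 OR l.profile_id = 7
"""

with_duplicates = conn.execute(query.format(select="")).fetchall()
deduplicated = conn.execute(query.format(select="DISTINCT")).fetchall()

print(with_duplicates)  # [(1, 'andrew'), (1, 'andrew')]
print(deduplicated)     # [(1, 'andrew')]
```

In this sketch the profile matches only one of the three relations, yet the event still comes back twice, which is consistent with the individual r, l, o results shown in the question.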

No, you aren't being thick. Just accept that the underlying system (Django's ORM) and Q objects will match two rows that are the same object, so only one object is actually matched. Q objects are very powerful when used correctly, for example for searching in a database. In this example you do not need Q objects; a simple filter would work fine.
Have you tried how a simple .filter(...) behaves in your case?
In the end, you are not trying to understand why Q objects return that queryset; you are trying to get a certain result, and distinct() works fine :)
