Django Q objects and m2m queries

I'm totally flummoxed by this behavior. I clearly don't understand Q objects like I thought I did, or I'm doing something massively stupid and obvious. Here's what I'm running into: accepted_attendee_* are all m2m relationships to OfficialProfile. In the Django shell, for ease of demonstration:
>>> profile = OfficialProfile.objects.get(user__username='testofficial3')
>>> r = SchedulerEvent.objects.filter(accepted_attendee_referee=profile)
>>> l = SchedulerEvent.objects.filter(accepted_attendee_linesman=profile)
>>> o = SchedulerEvent.objects.filter(accepted_attendee_official=profile)
>>> r
[<SchedulerEvent: andrew>]
>>> l
[]
>>> o
[]
This is all as expected. Now though, if I combine these with Q objects, things get weird.
>>> qevents = SchedulerEvent.objects.filter(Q(accepted_attendee_referee=profile)|Q(accepted_attendee_official=profile)|Q(accepted_attendee_linesman=profile))
>>> qevents
[<SchedulerEvent: andrew>, <SchedulerEvent: andrew>]
Two objects are returned, both with the same PK: two duplicate objects. There should be only one, based on the individual queries. The same thing happens when I OR the individual querysets:
>>> r|l|o
[<SchedulerEvent: andrew>, <SchedulerEvent: andrew>]
What is it about this OR query that returns two objects when there should, I believe quite clearly, be only one?
EDIT
So I looked at the query that was produced and it seems like the "answer" has nothing to do with Q objects or the OR'ing at all; rather, it's the way the ORM joins the table. Here's the SQL and the results it generates, absent the OR:
mysql> SELECT `scheduler_schedulerevent`.`id`, `scheduler_schedulerevent`.`user_id`, `scheduler_schedulerevent`.`title`, `scheduler_schedulerevent`.`description`, `scheduler_schedulerevent`.`start`, `scheduler_schedulerevent`.`end`, `scheduler_schedulerevent`.`location_id`, `scheduler_schedulerevent`.`age_level_id`, `scheduler_schedulerevent`.`skill_level_id`, `scheduler_schedulerevent`.`officiating_system_id`, `scheduler_schedulerevent`.`auto_schedule`, `scheduler_schedulerevent`.`is_scheduled`
FROM `scheduler_schedulerevent`
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_referee`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_referee`.`schedulerevent_id` )
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_linesman`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_linesman`.`schedulerevent_id` )
LEFT OUTER JOIN `scheduler_schedulerevent_accepted_attendee_official`
ON ( `scheduler_schedulerevent`.`id` = `scheduler_schedulerevent_accepted_attendee_official`.`schedulerevent_id` );
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
| id | user_id | title         | description | start               | end                 | location_id | age_level_id | skill_level_id | officiating_system_id | auto_schedule | is_scheduled |
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
|  1 |       1 | Test Event    |             | 2015-04-09 02:00:00 | 2015-04-09 02:30:00 |         161 |            1 |              1 |                     3 |             0 |            0 |
|  2 |       1 | Test          |             | 2015-04-07 20:00:00 | 2015-04-07 21:00:00 |         161 |            1 |              1 |                     3 |             1 |            0 |
|  3 |       1 | Test Auto     |             | 2015-04-07 20:00:00 | 2015-04-07 20:30:00 |         161 |            1 |              1 |                     2 |             0 |            0 |
|  4 |       1 | Test Official |             | 2015-04-16 19:00:00 | 2015-04-16 20:30:00 |         161 |            1 |              1 |                     3 |             0 |            1 |
|  4 |       1 | Test Official |             | 2015-04-16 19:00:00 | 2015-04-16 20:30:00 |         161 |            1 |              1 |                     3 |             0 |            1 |
+----+---------+---------------+-------------+---------------------+---------------------+-------------+--------------+----------------+-----------------------+---------------+--------------+
Then clearly, when you add the OR, the duplicated join row satisfies two of the query conditions. So while adding distinct() to the query seems like it should be unnecessary, it is in fact required.

If you are getting two identical objects, it is because that object satisfies at least two of the querysets.
In other words, <SchedulerEvent: andrew> matches at least two of the querysets r, l, and o.
If you don't want duplicate objects, use the .distinct() function:
SchedulerEvent.objects.filter(
    Q(accepted_attendee_referee=profile) |
    Q(accepted_attendee_official=profile) |
    Q(accepted_attendee_linesman=profile)
).distinct()

No, you aren't being thick; the underlying system (Django's ORM) joins the tables in a way that can match the same object twice, so the two results are really one object. Q objects are very powerful when used correctly, such as for building complex search queries. In this example you may not need Q objects at all; a simple filter could work fine.
Have you tried how a simple .filter(...) behaves in your case?
In the end you are not trying to understand why the Q objects return that queryset; you are trying to get a certain result, and distinct() works fine :)
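For completeness, a minimal sketch (reusing the r, l, and o querysets defined in the question) showing that the same de-duplication applies when ORing querysets directly:
>>> (r | l | o).distinct()  # OR the querysets, then drop the duplicate row
[<SchedulerEvent: andrew>]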

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1  | 2         | A    | 1         |
| 1  | 4         | B    | 1         |
| 2  | 3         | A    | 2         |
| 2  | 1         | B    | 3         |
I know that there are only A and B types for a specific id. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B; it should be applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1  | 0.5       | C    | 1         |
| 2  | 3         | C    | 0.66      |
I am new to PySpark, and so far I haven't been able to achieve the described result; I would appreciate any tips/solutions.
You can consider dividing the original dataframe into two parts according to type, and then using a SQL statement to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
    tmp1.id
    ,tmp1.value_one / tmp2.value_one as value_one
    ,'C' as type
    ,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
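If you prefer to stay in the DataFrame API, here is a hedged sketch of the same self-join without temp views (the intermediate column names a_one, a_two, b_one, b_two are made up for illustration):
from pyspark.sql import functions as F

a = df.filter(F.col('type') == 'A').select(
    'id',
    F.col('value_one').alias('a_one'),
    F.col('value_two').alias('a_two'))
b = df.filter(F.col('type') == 'B').select(
    'id',
    F.col('value_one').alias('b_one'),
    F.col('value_two').alias('b_two'))

# Join the A and B rows per id, then compute A/B for both value columns.
result_df = a.join(b, 'id').select(
    'id',
    (F.col('a_one') / F.col('b_one')).alias('value_one'),
    F.lit('C').alias('type'),
    (F.col('a_two') / F.col('b_two')).alias('value_two'))
result_df.show(truncate=False)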

Django queryset - Add HAVING constraint after annotate(F())

I had a seemingly normal situation: adding HAVING to a query.
I read here and here, but it did not help me.
I need to add HAVING to my query.
MODELS:
class Package(models.Model):
    status = models.IntegerField()

class Product(models.Model):
    title = models.CharField(max_length=10)
    packages = models.ManyToManyField(Package)
Products:
|id|title|
| - | - |
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
Packages:
|id|status|
| - | - |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
Product_Packages:
|product_id|package_id|
| - | - |
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
| 3 | 2 |
| 2 | 3 |
| 4 | 3 |
| 4 | 4 |
Visually:
pack_1 (A, B) status OK
pack_2 (B, C) status not ok
pack_3 (B, D) status OK
pack_4 (D) status not ok
My task is to select those products whose latest package has status = 1.
Expected result is : A, B
My raw SQL query is like this:
SELECT prod.title, max(pack.id)
FROM "product" as prod
INNER JOIN "product_packages" as p_p ON (prod.id = p_p.product_id)
INNER JOIN "package" as pack ON (pack.id = p_p.package_id)
GROUP BY prod.title
HAVING pack.status = 1
It returns exactly what I need:
|title|max(pack.id)|
| - | - |
| A | 1 |
| B | 3 |
BUT my ORM query does not work correctly.
I tried this:
p = (Product.objects.values('id')
     .annotate(pack_id=Max('packages'))
     .annotate(my_status=F('packages__status'))
     .filter(my_status=1)
     .values('id', 'pack_id'))
p.query
SELECT "product"."id", MAX("product_packages"."package_id") AS "pack_id"
FROM "product" LEFT OUTER JOIN "product_packages" ON ("product"."id" = "product_packages"."product_id") LEFT OUTER JOIN "package" ON ("product_packages"."package_id" = "package"."id")
WHERE "package"."status" = 1
GROUP BY "product"."id"
Please help me construct the correct ORM query.
What does the query look like when you remove
.values('id', 'pack_id')
at the end?
If I remember correctly,
p = Product.objects.values('id').annotate(pack_id=Max('packages')).annotate(my_status=F('packages__status')).filter(my_status=1)
and
p = Product.objects.annotate(pack_id=Max('packages')).annotate(my_status=F('packages__status')).filter(my_status=1).values('id')
will result in different queries.
Haki Benita has an excellent site for database gurus on how to make the most of Django.
You can take a look at this post:
https://hakibenita.com/django-group-by-sql#how-to-use-having
Django has a very specific way of adding the "HAVING" operator: your queryset needs to be structured so that a values() call singles out the column you want to group by, followed by an annotate() with Max (or whatever aggregate you want), and then a filter() on that annotation.
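As a hedged illustration of that shape (the pack_id__gte=1 condition is just a placeholder, not from the question), filtering on the aggregate after values() and annotate() is what pushes the condition into HAVING:
from django.db.models import Max

(Product.objects
    .values('id')                       # GROUP BY product.id
    .annotate(pack_id=Max('packages'))  # MAX(package_id) per group
    .filter(pack_id__gte=1))            # rendered as HAVING MAX(...) >= 1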
Also, this annotation seems like it won't work: annotate(my_status=F('packages__status')) tries to collapse multiple statuses into a single annotation.
You might want to try a subquery to annotate the way you want.
e.g.
Product.objects.annotate(
    latest_pack_id=Subquery(
        Package.objects.order_by('-pk').filter(status=1).values('pk')[:1]
    )
).filter(
    packages__in=F('latest_pack_id')
)
Or something along those lines, I haven't tested this out
I think you can try like this with a subquery:
from django.db.models import OuterRef, Subquery

sub_query = Package.objects.filter(product=OuterRef('pk')).order_by('-pk')
products = Product.objects.annotate(
    latest_package_status=Subquery(sub_query.values('status')[:1])
).filter(latest_package_status=1)
Here, first I prepare the subquery by filtering the Package model on the Product's primary key and ordering by the Package's primary key descending. Then I take the status of the latest package via the subquery, annotate it onto the Product queryset, and filter on status 1.
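Either way, it can help to print the generated SQL to confirm where the status condition lands; a tiny check, reusing the products queryset above:
print(products.query)  # shows the compiled SQL, including the correlated subquery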

How to query collection of objects while grouping by a property

So I am trying to write a query that will select a collection of objects that are distinct on a certain property.
Action:
+----+-----------+-----+
| id | timestamp | ... |
+----+-----------+-----+
| 10 | 16:04     | ... |
| 11 | 16:06     | ... |
| 12 | 16:08     | ... |
| 13 | 16:09     | ... |
| 14 | 16:10     | ... |
+----+-----------+-----+
FooVersion:
+----+--------+-----------+---------+
| id | foo_id | action_id | foo_zab |
+----+--------+-----------+---------+
| 1  | 1      | 10        | xx      |
| 2  | 2      | 11        | yy      |
| 3  | 3      | 12        | zz      |
| 4  | 3      | 13        | zy      |
| 5  | 3      | 14        | zx      |
+----+--------+-----------+---------+
Foo:
+----+-----+
| id | zab |
+----+-----+
| 1  | xx  |
| 2  | yy  |
| 3  | zx  |
+----+-----+
A scene is made up of a collection of foos. I am trying to track the changes in each particular foo over time. Therefore, each time a change is made to a foo, the action that caused that change is recorded and a copy of some of the foo's properties is stored in the foo_versions table.
What I am looking for is the "state of the foos at a particular action". So, while action #14 only specifically links to one foo, the state of the scene at action #14 actually contains 3 foos, the versions of which are foo_version #1, #2, and #5.
I need to construct a query that will say "for a specified action, give me the representation of the scene"
For action #10, the scene would be [<foo_version #1>]
For action #12, the scene would be [<foo_version #1>, <foo_version #2>, <foo_version #3>]
This is where it gets tricky. For action #14, the representation of the scene is [<foo_version #1>, <foo_version #2>, <foo_version #5>]. Foo versions #3, #4, and #5 all refer to the same foo. So, foo versions #3 and #4 are overwritten by #5.
I am using this SQLAlchemy query:
stmt = db.session.query(Action).filter(Action.timestamp <= action.timestamp).subquery()
action_alias = aliased(Action, stmt)
foo_versions = db.session.query(FooVersion) \
    .join(Action) \
    .join(action_alias, FooVersion.action) \
    .filter(Action.frame_id == frame.id) \
    .all()
The result I am getting is
[<foo_version #1>, <foo_version #2>, <foo_version #3>, <foo_version #4>, <foo_version #5>]
but I need to get rid of foo_version #3 and foo_version #4, since they have been overwritten by foo_version #5.
I am not familiar with Python, but here is the SQL query, if I understood correctly what you want:
SELECT a.id AS action_id, timestamp,
       v.id AS version, foo_id, foo_zab
FROM action a
JOIN foo_version v
  ON ( v.action_id IN
       (
         SELECT MAX(inner_v.action_id)
         FROM foo_version inner_v
         WHERE a.id >= inner_v.action_id
         GROUP BY inner_v.foo_id
       )
     )
JOIN foo ON ( foo.id = v.foo_id )
ORDER BY a.id, version
It will give you each action row repeated with the data of each version, and here is a SQL fiddle demo.
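Since the question uses SQLAlchemy, here is a hedged, untested sketch of the same idea in the ORM. It assumes FooVersion has action_id and foo_id columns as in the tables above, and that action is the Action instance from the question; on SQLAlchemy 1.4+ you may need latest_ids.scalar_subquery() inside in_():
from sqlalchemy import func

# Per foo, the id of the newest action at or before the requested one.
latest_ids = db.session.query(func.max(FooVersion.action_id)) \
    .filter(FooVersion.action_id <= action.id) \
    .group_by(FooVersion.foo_id)

# Keep only the versions created by those latest actions.
foo_versions = db.session.query(FooVersion) \
    .filter(FooVersion.action_id.in_(latest_ids)) \
    .all()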

Pandas - applying groupings and counts to multiple columns in order to generate/change a dataframe

I'm sure what I'm trying to do is fairly simple for those with better knowledge of PD, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date       | Value |
+---------+------------+-------+
| 1       | 01/01/2016 | a     |
+---------+------------+-------+
| 2       | 01/01/2016 | b     |
+---------+------------+-------+
| 3       | 01/01/2016 | c     |
+---------+------------+-------+
...etc, into:
+------------+---------------+---------+---------+---------+
| Date       | # of triggers | count a | count b | count c |
+------------+---------------+---------+---------+---------+
| 01/01/2016 | 3             | 1       | 1       | 1       |
+------------+---------------+---------+---------+---------+
| 02/01/2016 | 5             | 2       | 1       | 2       |
+------------+---------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False can also be helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
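A hedged alternative using pd.crosstab, assuming the same Trigger/Date/Value frame from the question is already loaded in df:
import pandas as pd

out = pd.crosstab(df['Date'], df['Value'])  # one count column per Value
out['# of triggers'] = out.sum(axis=1)      # total triggers per date
print(out)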

SQL Alchemy group_by in Query

I am trying to select a grouped average.
a1_avg = session.query(func.avg(Table_A.a1_value).label('a1_avg'))\
    .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))\
    .group_by(Table_A.a1_group)
I have tried a few different iterations of this query and this is as close as I can get to what I need. I am fairly certain the group_by is creating the issue, but I am unsure how to correctly implement the query using SQLA. The table structure and expected output:
TABLE A
A1_ID | A1_VALUE | A1_DATE    | A1_LOC | A1_GROUP
    1 |        5 | 2011-10-05 |      5 |        6
    2 |       15 | 2011-10-14 |      5 |        6
    3 |        2 | 2011-10-21 |      6 |        7
    4 |       20 | 2011-11-15 |      4 |        8
    5 |        6 | 2011-10-27 |      6 |        7
EXPECTED OUTPUT
A1_LOC | A1_GROUP | A1_AVG
     5 |        6 |     10
     6 |        7 |      4
I would guess that you are just missing the group identifier (a1_group) in the result. Also (given that I understand your model correctly), you need to add a group by clause for the a1_loc column as well:
edit-1: updated the query due to question specification
a1_avg = (session.query(Table_A.a1_loc, Table_A.a1_group,
                        func.avg(Table_A.a1_value).label('a1_avg'))
          .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))
          # .filter(Table_A.a1_id == '12')  # note: you do NOT NEED this
          .group_by(Table_A.a1_loc)         # note: you NEED this
          .group_by(Table_A.a1_group))
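A brief usage sketch: iterating the query yields one (a1_loc, a1_group, a1_avg) tuple per group:
for loc, group, avg_value in a1_avg:
    print(loc, group, avg_value)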
