How does set_group_by work in Django?

I was writing the following query:
claim_query = ClaimBillSum.objects.filter(claim__lob__in=lobObj)\
    .annotate(claim_count=Count("claim__claim_id", distinct=True))\
    .annotate(claim_bill_sum=Sum("bill_sum"))\
    .values("claim__body_part", "claim_count", "claim_bill_sum")\
    .order_by("claim__body_part")
When I checked the queryset's query property, the SQL was grouped by every column of the tables involved in the query, not only the ones selected in values(), although I only wanted to group by claim__body_part.
While searching for a way to change the GROUP BY clause, I found the query.set_group_by() function, which, when applied, fixed the query the way I wanted:
claim_query.query.set_group_by()
SELECT
"CLAIM"."body_part",
COUNT(DISTINCT "claim_bill_sum"."claim_id") AS "claim_count",
SUM("claim_bill_sum"."bill_sum") AS "claim_bill_sum"
FROM
"claim_bill_sum"
INNER JOIN "CLAIM" ON
("claim_bill_sum"."claim_id" = "CLAIM"."claim_id")
WHERE
"CLAIM"."lob_id" IN (SELECT U0."lob_id" FROM "LOB" U0 WHERE U0."client_id" = 1)
GROUP BY
"CLAIM"."body_part"
ORDER BY
"CLAIM"."body_part" ASC
But I couldn't find any information in the Django documentation or anywhere else that describes how this function works. Why does the default GROUP BY include all properties, and how does .set_group_by() end up selecting exactly the property I wanted?
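To see why the shape of the GROUP BY clause matters here, the following minimal sketch runs both groupings with Python's built-in sqlite3 module. The table layout and rows are hypothetical stand-ins for the CLAIM and claim_bill_sum tables above, and the sketch uses neither Django nor set_group_by() itself. It only illustrates that grouping by every column yields one row per bill, while grouping by body_part alone yields one row per body part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, body_part TEXT);
CREATE TABLE claim_bill_sum (id INTEGER PRIMARY KEY, claim_id INTEGER, bill_sum REAL);
-- Hypothetical sample data: two 'arm' claims and one 'leg' claim.
INSERT INTO claim VALUES (1, 'arm'), (2, 'arm'), (3, 'leg');
INSERT INTO claim_bill_sum VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0), (4, 3, 7.0);
""")

# Grouping by the single column of interest: one row per body part.
by_body_part = conn.execute("""
SELECT c.body_part, COUNT(DISTINCT b.claim_id), SUM(b.bill_sum)
FROM claim_bill_sum b JOIN claim c ON b.claim_id = c.claim_id
GROUP BY c.body_part
ORDER BY c.body_part
""").fetchall()

# Grouping by every column of the joined tables: one row per bill.
by_every_column = conn.execute("""
SELECT c.body_part, COUNT(DISTINCT b.claim_id), SUM(b.bill_sum)
FROM claim_bill_sum b JOIN claim c ON b.claim_id = c.claim_id
GROUP BY b.id, b.claim_id, b.bill_sum, c.claim_id, c.body_part
ORDER BY c.body_part, b.id
""").fetchall()

print(by_body_part)
print(len(by_every_column))
```

In Django, the documented way to control the grouping is to call values("claim__body_part") before the annotate() calls; the aggregation topic guide covers this under the order of annotate() and values() clauses.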

Related

Django: aggregate returns a wrong result after using annotate

When aggregating a queryset, I noticed that if I use an annotation before, I get a wrong result. I can't understand why.
The Code
from django.db.models import QuerySet, Max, F, ExpressionWrapper, DecimalField, Sum
from orders.models import OrderOperation
class OrderOperationQuerySet(QuerySet):
    def last_only(self) -> QuerySet:
        return self \
            .annotate(last_oo_pk=Max('order__orderoperation__pk')) \
            .filter(pk=F('last_oo_pk'))
    @staticmethod
    def _hist_price(orderable_field):
        return ExpressionWrapper(
            F(f'{orderable_field}__hist_unit_price') * F(f'{orderable_field}__quantity'),
            output_field=DecimalField())
    def ordered_articles_data(self):
        return self.aggregate(
            sum_ordered_articles_amounts=Sum(self._hist_price('orderedarticle')))
The Test
qs1 = OrderOperation.objects.filter(order__pk=31655)
qs2 = OrderOperation.objects.filter(order__pk=31655).last_only()
assert qs1.count() == qs2.count() == 1 and qs1[0] == qs2[0] # shows that both querysets contain the same object
qs1.ordered_articles_data()
> {'sum_ordered_articles_amounts': Decimal('3.72')} # expected result
qs2.ordered_articles_data()
> {'sum_ordered_articles_amounts': Decimal('3.01')} # wrong result
How is it possible that this last_only annotation method can make the aggregation result different (and wrong)?
The "funny" thing is that it seems to happen only when the order contains articles that have the same hist_price.
Side note
I can confirm that the SQL created by Django ORM is probably wrong, because when I force execution of last_only() and then I call aggregation in a second query, it works as expected.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#combining-multiple-aggregations could be an explanation?
SQL Queries
(note that these are the actual queries but the code above has been slightly simplified, which explains the presence below of COALESCE and "deleted" IS NULL.)
-- qs1.ordered_articles_data()
SELECT
COALESCE(
SUM(
("orders_orderedarticle"."hist_unit_price" * "orders_orderedarticle"."quantity")
),
0) AS "sum_ordered_articles_amounts"
FROM "orders_orderoperation"
LEFT OUTER JOIN "orders_orderedarticle"
ON ("orders_orderoperation"."id" = "orders_orderedarticle"."order_operation_id")
WHERE ("orders_orderoperation"."order_id" = 31655 AND "orders_orderoperation"."deleted" IS NULL)
-- qs2.ordered_articles_data()
SELECT COALESCE(SUM(("__col1" * "__col2")), 0)
FROM (
SELECT
"orders_orderoperation"."id" AS Col1,
MAX(T3."id") AS "last_oo_pk",
"orders_orderedarticle"."hist_unit_price" AS "__col1",
"orders_orderedarticle"."quantity" AS "__col2"
FROM "orders_orderoperation" INNER JOIN "orders_order"
ON ("orders_orderoperation"."order_id" = "orders_order"."id")
LEFT OUTER JOIN "orders_orderoperation" T3
ON ("orders_order"."id" = T3."order_id")
LEFT OUTER JOIN "orders_orderedarticle"
ON ("orders_orderoperation"."id" = "orders_orderedarticle"."order_operation_id")
WHERE ("orders_orderoperation"."order_id" = 31655 AND "orders_orderoperation"."deleted" IS NULL)
GROUP BY
"orders_orderoperation"."id",
"orders_orderedarticle"."hist_unit_price",
"orders_orderedarticle"."quantity"
HAVING "orders_orderoperation"."id" = (MAX(T3."id"))
) subquery

When you use an aggregate function in an annotation, the database has to GROUP BY all selected fields that are not inside the function, and you can see that in the subquery:
GROUP BY
"orders_orderoperation"."id",
"orders_orderedarticle"."hist_unit_price",
"orders_orderedarticle"."quantity"
HAVING "orders_orderoperation"."id" = (MAX(T3."id"))
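As a result, articles with the same hist_unit_price and quantity fall into the same group and are collapsed into a single row, which is then kept or dropped as a whole by the HAVING condition on the max id. So, based on your screenshot, only one of the chocolate or cafe rows contributes to the sum.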
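The collapse can be reproduced outside Django. The sketch below uses Python's built-in sqlite3 module with a simplified, hypothetical schema (no app prefix, no orders table) and made-up prices in cents, chosen so the totals mirror the 3.72 versus 3.01 reported in the question. It is an illustration of the grouping effect, not the exact SQL Django generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orderoperation (id INTEGER PRIMARY KEY, order_id INTEGER);
CREATE TABLE orderedarticle (
    id INTEGER PRIMARY KEY,
    order_operation_id INTEGER,
    hist_unit_price INTEGER,  -- price in cents (hypothetical values)
    quantity INTEGER
);
INSERT INTO orderoperation VALUES (1, 31655);
-- Two articles share the same price and quantity.
INSERT INTO orderedarticle VALUES (1, 1, 71, 1), (2, 1, 71, 1), (3, 1, 230, 1);
""")

# Plain aggregation: every article row is summed.
plain_sum = conn.execute("""
SELECT SUM(a.hist_unit_price * a.quantity)
FROM orderoperation o
LEFT JOIN orderedarticle a ON o.id = a.order_operation_id
WHERE o.order_id = 31655
""").fetchone()[0]

# Aggregation over a grouped subquery: identical (price, quantity) pairs
# land in one group, so they reach the outer SUM only once.
grouped_sum = conn.execute("""
SELECT SUM(price * quantity) FROM (
    SELECT o.id, MAX(t3.id) AS last_oo_pk,
           a.hist_unit_price AS price, a.quantity AS quantity
    FROM orderoperation o
    LEFT JOIN orderoperation t3 ON o.order_id = t3.order_id
    LEFT JOIN orderedarticle a ON o.id = a.order_operation_id
    WHERE o.order_id = 31655
    GROUP BY o.id, a.hist_unit_price, a.quantity
    HAVING o.id = MAX(t3.id)
)
""").fetchone()[0]

print(plain_sum, grouped_sum)
```

With this data the plain query sums all three articles, while the grouped subquery hands only two rows to the outer SUM.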

Splitting the work into subqueries with smaller joins avoids the problems that come with joining several child tables at once: an unnecessarily large Cartesian product of independent sets, or a GROUP BY clause that is hard to control because several parts of the query contribute to it.
Solution: a subquery is used to get the primary keys of the last order operations.
The outer query is kept simple, without added joins or groups, so that it is not distorted by a possible aggregation on children.
def last_only(self) -> QuerySet:
    max_ids = (self.values('order').order_by()
               .annotate(last_oo_pk=Max('order__orderoperation__pk'))
               .values('last_oo_pk')
               )
    return self.filter(pk__in=max_ids)
test
ret = (OrderOperationQuerySet(OrderOperation).filter(order__in=[some_order])
       .last_only().ordered_articles_data())
Executed SQL (simplified by removing the app name prefix orders_ and the double quotes):
SELECT CAST(SUM((orderedarticle.hist_unit_price * orderedarticle.quantity))
AS NUMERIC) AS sum_ordered_articles_amounts
FROM orderoperation
LEFT OUTER JOIN orderedarticle ON (orderoperation.id = orderedarticle.order_operation_id)
WHERE (
orderoperation.order_id IN (31655) AND
orderoperation.id IN (
SELECT MAX(U2.id) AS last_oo_pk
FROM orderoperation U0
INNER JOIN order U1 ON (U0.order_id = U1.id)
LEFT OUTER JOIN orderoperation U2 ON (U1.id = U2.order_id)
WHERE U0.order_id IN (31655)
GROUP BY U0.order_id
)
)
The original invalid SQL could be fixed by adding "orders_orderedarticle".id to GROUP BY, but only if last_only() and ordered_articles_data() are used together. That is not a readable solution.
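As a quick sanity check of the subquery approach, here is a minimal sketch using Python's built-in sqlite3 module. The schema is a simplified, hypothetical version of the tables above (the orders table is left out and prices are stored in cents), and the rows are invented for illustration: two operations on one order, of which only the last should be counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orderoperation (id INTEGER PRIMARY KEY, order_id INTEGER);
CREATE TABLE orderedarticle (
    id INTEGER PRIMARY KEY,
    order_operation_id INTEGER,
    hist_unit_price INTEGER,  -- price in cents (hypothetical values)
    quantity INTEGER
);
-- Two operations on the same order; only the last one (id 2) should count.
INSERT INTO orderoperation VALUES (1, 31655), (2, 31655);
INSERT INTO orderedarticle VALUES
    (1, 1, 500, 1),
    (2, 2, 71, 1), (3, 2, 71, 1), (4, 2, 230, 1);
""")

# The ids of the last operations come from a separate subquery, so the
# outer query needs no GROUP BY and keeps every article row.
total = conn.execute("""
SELECT SUM(a.hist_unit_price * a.quantity)
FROM orderoperation o
LEFT JOIN orderedarticle a ON o.id = a.order_operation_id
WHERE o.order_id IN (31655)
  AND o.id IN (
      SELECT MAX(u2.id)
      FROM orderoperation u0
      LEFT JOIN orderoperation u2 ON u0.order_id = u2.order_id
      WHERE u0.order_id IN (31655)
      GROUP BY u0.order_id
  )
""").fetchone()[0]

print(total)
```

Both articles priced at 71 are included here, because the grouping happens only inside the subquery that selects the operation ids.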

Order by sometimes does not work in my query

I defined two classes:
class OrderEntryVacancyRenew(OrderEntry):
    ...
    vacancy_id = db.Column(db.Integer, db.ForeignKey('vacancy.id'), nullable=False)
    vacancy = db.relationship('Vacancy')
    remaining = db.Column(db.SmallInteger)
class Vacancy(db.Model):
    id = db.Column(db.Integer, autoincrement=True, primary_key=True)
    renew_at = db.Column(TZDateTime, index=True)
Then I defined the method to refresh OrderEntryVacancyRenew.remaining and Vacancy.renew_at.
def renew_vacancy():
    filters = [
        OrderEntryVacancyRenew.remaining,
        Vacancy.status == 0,
        or_(
            Vacancy.renew_at <= (utcnow() - INTERVAL),
            Vacancy.renew_at.is_(None))
    ]
    renew_vacancies = OrderEntryVacancyRenew.query.options(
        load_only('remaining', 'vacancy_id')
    ).order_by(
        OrderEntryVacancyRenew.id
    ).from_self().group_by(
        OrderEntryVacancyRenew.vacancy_id
    ).join(
        OrderEntryVacancyRenew.vacancy
    ).options(
        contains_eager(OrderEntryVacancyRenew.vacancy).load_only('renew_at')
    ).filter(*filters)
    for entry in renew_vacancies:
        entry.vacancy.renew_at = utcnow()
        entry.remaining -= 1
    db.session.commit()
I wrote the unit test to check renew_vacancy
vacancy1 = Vacancy(id=10000)
vacancy2 = Vacancy(id=10001)
db.session.add_all([vacancy1, vacancy2])
vacancy_renew1 = OrderEntryVacancyRenew(
vacancy_id=vacancy1.id,
remaining=24)
# make sure vacancy_renew1.id < vacancy_renew2.id
db.session.add(vacancy_renew1)
db.session.commit()
vacancy_renew2 = OrderEntryVacancyRenew(
vacancy_id=vacancy1.id,
remaining=8)
vacancy_renew3 = OrderEntryVacancyRenew(
vacancy_id=vacancy2.id,
remaining=42)
db.session.add_all((vacancy_renew2, vacancy_renew3))
db.session.commit()
renew_vacancy()
self.assertEqual(
(vacancy_renew1.remaining, vacancy_renew2.remaining), (23, 8))
renew_vacancies is ordered by the OrderEntryVacancyRenew id and grouped by the Vacancy id, so I expect it to select vacancy_renew1 and vacancy_renew3.
I used the following command to run the unit test 100 times:
for i in `seq 1 100`; do python test.py; done
In some rare situations, it filters vacancy_renew2 instead of vacancy_renew1.
Why does it happen that sometimes order by does not work as expected?
I tried printing vacancy_renew1.id and vacancy_renew2.id after renew_vacancy.
...
db.session.commit()
renew_vacancy()
print vacancy_renew1.id
print vacancy_renew2.id
self.assertEqual(
(vacancy_renew1.remaining, vacancy_renew2.remaining), (23, 8))
...

Why does it happen that sometimes ORDER BY does not work as expected?
Given standard SQL, the results of your query are indeterminate, so it is not very valuable to know why it works most of the time and fails rarely. There are two things that make the results vary:
1) Generally you should not rely on the order of rows of a subquery in enclosing queries, even if you apply an ordering. Some database implementations may give additional guarantees, but on others the optimizer may deem the ORDER BY unnecessary. MySQL 5.7 and up does exactly that: it removes the subquery entirely.
2) Databases such as SQLite and MySQL [1] that allow selecting non-aggregate items that are neither named in the GROUP BY clause nor functionally dependent on it usually leave it unspecified from which row in the group the values are taken:
SQLite:
... Otherwise, it is evaluated against a single arbitrarily chosen row from within the group. If there is more than one non-aggregate expression in the result-set, then all such expressions are evaluated for the same row.
MySQL:
... In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause.
Trying out the query on SQLite failed the assertion on this machine, while on MySQL it passed. This is probably due to implementation of selecting the row from within the group, but you should not rely on such details.
What you seem to be after is a greatest-n-per-group query, or top-1. Not knowing which database you are using, here's a somewhat generic way to do just that using an EXISTS subquery expression:
renew_alias = db.aliased(OrderEntryVacancyRenew)
renew_vacancies = db.session.query(OrderEntryVacancyRenew).\
    join(OrderEntryVacancyRenew.vacancy).\
    options(
        load_only('remaining'),
        contains_eager(OrderEntryVacancyRenew.vacancy).load_only('renew_at')).\
    filter(db.not_(db.exists().where(
        db.and_(renew_alias.vacancy_id == OrderEntryVacancyRenew.vacancy_id,
                renew_alias.id < OrderEntryVacancyRenew.id)))).\
    filter(*filters)
This query passes the assertion on both SQLite and MySQL. Alternatively you could replace the EXISTS subquery expression with a LEFT JOIN and IS NULL predicate.
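For readers who want to see the two formulations side by side in plain SQL, here is a minimal sketch using Python's built-in sqlite3 module. The single renew table is a hypothetical, stripped-down stand-in for OrderEntryVacancyRenew, loaded with the three rows from the unit test in the question. It picks the row with the lowest id per vacancy, first with NOT EXISTS and then with LEFT JOIN plus IS NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE renew (id INTEGER PRIMARY KEY, vacancy_id INTEGER, remaining INTEGER);
INSERT INTO renew VALUES (1, 10000, 24), (2, 10000, 8), (3, 10001, 42);
""")

# Keep a row only if no row with a lower id exists for the same vacancy.
not_exists_rows = conn.execute("""
SELECT r.id, r.vacancy_id, r.remaining
FROM renew r
WHERE NOT EXISTS (
    SELECT 1 FROM renew older
    WHERE older.vacancy_id = r.vacancy_id AND older.id < r.id
)
ORDER BY r.id
""").fetchall()

# Same result via an anti-join: rows without an "older" partner survive.
left_join_rows = conn.execute("""
SELECT r.id, r.vacancy_id, r.remaining
FROM renew r
LEFT JOIN renew older
    ON older.vacancy_id = r.vacancy_id AND older.id < r.id
WHERE older.id IS NULL
ORDER BY r.id
""").fetchall()

print(not_exists_rows)
print(left_join_rows)
```

Neither form depends on the ordering of a subquery or on which row the database picks from a group, so the result is deterministic.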
P.s. I suppose you're using some flavour of MySQL and following advice such as this. You should read the commentary on that one as well, since there are many people rightly pointing out the pitfalls. It does not work, at least for MySQL 5.7 and up.
1: Controllable with the ONLY_FULL_GROUP_BY SQL mode setting, enabled by default in MySQL 5.7.5 and up.

Filter query by linked object key in SQLAlchemy

Judging by the title this would be the exact same question, but I can't see how any of the answers are applicable to my use case:
I have two classes and a relationship between them:
treatment_association = Table('tr_association', Base.metadata,
    Column('chronic_treatments_id', Integer, ForeignKey('chronic_treatments.code')),
    Column('animals_id', Integer, ForeignKey('animals.id'))
)
class ChronicTreatment(Base):
    __tablename__ = "chronic_treatments"
    code = Column(String, primary_key=True)
class Animal(Base):
    __tablename__ = "animals"
    treatment = relationship("ChronicTreatment", secondary=treatment_association, backref="animals")
I would like to be able to select only the animals which have undergone a treatment which has the code "X". I tried quite a few approaches.
This one fails with an AttributeError:
sql_query = session.query(Animal.treatment).filter(Animal.treatment.code == "chrFlu")
for item in sql_query:
pass
mystring = str(session.query(Animal))
And this one happily returns a list of unfiltered animals:
sql_query = session.query(Animal.treatment).filter(ChronicTreatment.code == "chrFlu")
for item in sql_query:
pass
mystring = str(session.query(Animal))
The closest thing to the example from the aforementioned thread I could put together:
subq = session.query(Animal.id).subquery()
sql_query = session.query(ChronicTreatment).join((subq, subq.c.treatment_id=="chrFlu"))
for item in sql_query:
pass
mystring = str(session.query(Animal))
mydf = pd.read_sql_query(mystring,engine)
Also fails with an AttributeError.
Can you help me sort this out?

First, there are two issues with table definitions:
1) In the treatment_association you have Integer column pointing to chronic_treatments.code while the code is String column.
I think it's just better to have an integer id in the chronic_treatments, so you don't duplicate the string code in another table and also have a chance to add more fields to chronic_treatments later.
Update: not exactly correct, you still can add more fields, but it will be more complex to change your 'code' if you decide to rename it.
2) In the Animal model you have a relation named treatment. This is confusing because you have many-to-many relation, it should be plural - treatments.
After fixing the above two, it should be clearer why your queries did not work.
This one (I replaced treatment with treatments):
sql_query = session.query(Animal.treatments).filter(
    Animal.treatments.code == "chrFlu")
Animal.treatments represents a many-to-many relationship, not a SQLAlchemy model, so it has no code attribute and you can't pass it to the query or filter on it this way.
The next one can't work for the same reason (you pass Animal.treatments into the query).
The last one is closer: you actually need a join to get your results.
I think it is easier to understand the query as SQL (and you anyway need to know SQL to be able to use sqlalchemy):
animals = session.query(Animal).from_statement(text(
"""
select distinct animals.* from animals
left join tr_association assoc on assoc.animals_id = animals.id
left join chronic_treatments on chronic_treatments.id = assoc.chronic_treatments_id
where chronic_treatments.code = :code
""")
).params(code='chrFlu')
It will select animals and join chronic_treatments through the tr_association and filter the result by code.
Having this it is easy to rewrite it using SQL-less syntax:
sql_query = session.query(Animal).join(Animal.treatments).filter(
ChronicTreatment.code == "chrFlu")
That will return what you want: a list of animals that have a related chronic treatment with the given code.

PostgreSQL WHERE EXISTS

I'm having trouble wrapping my head around the right way to use EXISTS (and whether there is a right way to use EXISTS for this particular case, or if I'm misunderstanding it).
I'm working against the Rigor schema (defined for SQLAlchemy here: https://github.com/blindsightcorp/rigor/blob/master/lib/types.py ).
The short of it is I have three tables I care about: "percept", "annotation", and "annotation_property". annotation_properties have an annotation_id, annotations have a percept_id.
I want to find all of the percepts that have annotations with a specific annotation_property (FOO=BAR).
Percepts may have many annotations that have a specific property, so it seems like an EXISTS should make things faster.
The (relatively slow) option is:
SELECT DISTINCT(percept.*) FROM percept, annotation, annotation_property
WHERE percept.id = annotation.percept_id AND
annotation_property.annotation_id = annotation.id AND
annotation_property.name = 'FOO' AND annotation_property.value = 'BAR';
How would I use EXISTS to optimize this?
It feels like the first step is something like:
SELECT percept.* FROM percept WHERE id IN (SELECT percept_id FROM
annotation, annotation_property WHERE
annotation.id = annotation_property.annotation_id AND
annotation_property.name = 'FOO' AND annotation_property.value = 'BAR');
But I don't see where to go from here....

To begin with, use ANSI JOIN syntax to distinguish your join conditions from your filter conditions. The result is easier to read, and it better displays the structure of your data:
SELECT DISTINCT(percept.*)
FROM
percept
JOIN annotation ON percept.id = annotation.percept_id
JOIN annotation_property ON annotation_property.annotation_id = annotation.id
WHERE
annotation_property.name = 'FOO'
AND annotation_property.value = 'BAR'
;
It would probably be an improvement to do as you said, and use distinct on the primary key column instead of on a whole percept row at a time, but that still likely involves computing a large result set and then merging it down. It is an alternative to an exists() condition, not a supplement to one.
Employing an EXISTS condition in the WHERE clause might look like this:
SELECT *
FROM percept p
WHERE EXISTS (
SELECT *
FROM
annotation ann
JOIN annotation_property anp
ON anp.annotation_id = ann.id
WHERE
anp.name = 'FOO'
AND anp.value = 'BAR'
AND ann.percept_id = p.id
)
;

The problem with your original query (apart from the implicit join syntax), is that you are bringing together lots of rows from the joins. Then you are aggregating to remove duplicates.
You can eliminate the duplication removal by just selecting from one table:
SELECT p.*
FROM percept p
WHERE EXISTS (SELECT 1
FROM annotation a JOIN
annotation_property ap
ON ap.annotation_id = a.id AND
ap.name = 'FOO' AND ap.value = 'BAR'
WHERE p.id = a.percept_id
) ;
This assumes that the rows in percept do not have duplicates, but that seems like a reasonable assumption.
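The difference between the two approaches is easy to observe on a toy dataset. The following sketch uses Python's built-in sqlite3 module rather than PostgreSQL, with a hypothetical cut-down version of the three tables and invented rows. It shows that the plain join returns a percept once per matching annotation, while the EXISTS form returns each percept at most once, so no DISTINCT is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE percept (id INTEGER PRIMARY KEY);
CREATE TABLE annotation (id INTEGER PRIMARY KEY, percept_id INTEGER);
CREATE TABLE annotation_property (
    id INTEGER PRIMARY KEY, annotation_id INTEGER, name TEXT, value TEXT);
INSERT INTO percept VALUES (1), (2), (3);
-- Percept 1 has two matching annotations, percept 2 has one that does
-- not match, percept 3 has no annotations at all.
INSERT INTO annotation VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO annotation_property VALUES
    (100, 10, 'FOO', 'BAR'), (101, 11, 'FOO', 'BAR'), (102, 20, 'FOO', 'BAZ');
""")

join_sql = """
SELECT p.id FROM percept p
JOIN annotation a ON p.id = a.percept_id
JOIN annotation_property ap ON ap.annotation_id = a.id
WHERE ap.name = 'FOO' AND ap.value = 'BAR'
"""
exists_sql = """
SELECT p.id FROM percept p
WHERE EXISTS (
    SELECT 1 FROM annotation a
    JOIN annotation_property ap ON ap.annotation_id = a.id
    WHERE a.percept_id = p.id AND ap.name = 'FOO' AND ap.value = 'BAR'
)
"""
join_ids = [row[0] for row in conn.execute(join_sql)]
exists_ids = [row[0] for row in conn.execute(exists_sql)]

print(sorted(join_ids), exists_ids)
```

Whether EXISTS is actually faster on a real PostgreSQL database depends on the data and the planner, so it is worth checking with EXPLAIN ANALYZE.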

'Don't care' for a column in SQLite queries?

I've got a SQLite query, which depends on 2 variables, gender and hand. Each of these can have 3 values, 2 which actually mean something (so male/female and left/right) and the third is 'all'. If a variable has a value of 'all' then I don't care what the particular value of that column is.
Is it possible to achieve this functionality with a single query, and just changing the variable? I've had a look for a wildcard or don't care operator but haven't been able to find any except for % which doesn't work in this situation.
Obviously I could make a bunch of if statements and have different queries to use for each case but that's not very elegant.
Code:
select_sql = """ SELECT * FROM table
WHERE (gender = ? AND hand = ?)
"""
cursor.execute(select_sql, (gender_var, hand_var))
I.e. this query works if gender_var = 'male' and hand_var = 'left', but not if gender_var or hand_var = 'all'

You can indeed do this with a single query. Simply compare each variable to 'all' in your query.
select_sql = """ SELECT * FROM table
WHERE ((? = 'all' OR gender = ?) AND (? = 'all' OR hand = ?))
"""
cursor.execute(select_sql, (gender_var, gender_var, hand_var, hand_var))
Basically, when gender_var or hand_var is 'all', the first part of each OR expression is always true, so that branch of the AND is always true and matches all records, i.e., it is a no-op in the query.
It might be better to build a query dynamically in Python, however, that has just the fields you actually need to test. It might be noticeably faster, but you'd have to benchmark that to be sure.
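One way the dynamic variant could look is sketched below. The build_query helper, the FILTER_COLUMNS whitelist and the people table are hypothetical names introduced for this illustration (the question's placeholder name table is a reserved word in SQLite). Column names are taken from a fixed whitelist because they cannot be bound as parameters, while the values are still passed as ? placeholders.

```python
import sqlite3

# Whitelist of filterable columns; column names cannot be bound as parameters.
FILTER_COLUMNS = ("gender", "hand")

def build_query(**criteria):
    """Return (sql, params), skipping every criterion whose value is 'all'."""
    clauses, params = [], []
    for column in FILTER_COLUMNS:
        value = criteria.get(column, "all")
        if value != "all":
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT * FROM people"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT, gender TEXT, hand TEXT);
INSERT INTO people VALUES
    ('ann', 'female', 'left'), ('bob', 'male', 'left'), ('cy', 'male', 'right');
""")

# gender is 'all', so only the hand column ends up in the WHERE clause.
sql, params = build_query(gender="all", hand="left")
left_handed = conn.execute(sql, params).fetchall()

print(sql)
print(left_handed)
```

When every criterion is 'all', the helper returns the bare SELECT without a WHERE clause.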
