PostgreSQL WHERE EXISTS - python

I'm having trouble wrapping my head around the right way to use EXISTS (and whether there is a right way to use EXISTS for this particular case, or if I'm misunderstanding it).
I'm working against the Rigor schema (defined for SQLAlchemy here: https://github.com/blindsightcorp/rigor/blob/master/lib/types.py ).
The short of it is I have three tables I care about: "percept", "annotation", and "annotation_property". annotation_properties have an annotation_id, annotations have a percept_id.
I want to find all of the percepts that have annotations with a specific annotation_property (FOO=BAR).
Percepts may have many annotations that have a specific property, so it seems like an EXISTS should make things faster.
The (relatively slow) option is:
SELECT DISTINCT(percept.*) FROM percept, annotation, annotation_property
WHERE percept.id = annotation.percept_id AND
annotation_property.annotation_id = annotation.id AND
annotation_property.name = 'FOO' AND annotation_property.value = 'BAR';
How would I use EXISTS to optimize this?
It feels like the first step is something like:
SELECT percept.* FROM percept WHERE id IN (SELECT percept_id FROM
annotation, annotation_property WHERE
annotation.id = annotation_property.annotation_id AND
annotation_property.name = 'FOO' AND annotation_property.value = 'BAR');
But I don't see where to go from here....

To begin with, use ANSI JOIN syntax to distinguish your join conditions from your filter conditions. The result is easier to read, and it better displays the structure of your data:
SELECT DISTINCT(percept.*)
FROM
percept
JOIN annotation ON percept.id = annotation.percept_id
JOIN annotation_property ON annotation_property.annotation_id = annotation.id
WHERE
annotation_property.name = 'FOO'
AND annotation_property.value = 'BAR'
;
It would probably be an improvement to do as you said, and use distinct on the primary key column instead of on a whole percept row at a time, but that still likely involves computing a large result set and then merging it down. It is an alternative to an exists() condition, not a supplement to one.
Employing an EXISTS condition in the WHERE clause might look like this:
SELECT *
FROM percept p
WHERE EXISTS (
SELECT *
FROM
annotation ann
JOIN annotation_property anp
ON anp.annotation_id = ann.id
WHERE
anp.name = 'FOO'
AND anp.value = 'BAR'
AND ann.percept_id = p.id
)
;

The problem with your original query (apart from the implicit join syntax) is that the joins bring together lots of rows, which you then have to aggregate to remove duplicates.
You can eliminate the duplication removal by just selecting from one table:
SELECT p.*
FROM percept p
WHERE EXISTS (SELECT 1
FROM annotation a JOIN
annotation_property ap
ON ap.annotation_id = a.id AND
ap.name = 'FOO' AND ap.value = 'BAR'
WHERE p.id = a.percept_id
) ;
This assumes that the rows in percept do not have duplicates, but that seems like a reasonable assumption.
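To check that the DISTINCT join and the EXISTS form return the same percepts, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, with a simplified three-table schema (the `label` column and all rows are made up for illustration; the real Rigor schema has more columns). It only checks the semantics; query plans and timings on PostgreSQL will differ.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE percept (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    percept_id INTEGER REFERENCES percept(id)
);
CREATE TABLE annotation_property (
    id INTEGER PRIMARY KEY,
    annotation_id INTEGER REFERENCES annotation(id),
    name TEXT,
    value TEXT
);
-- Hypothetical data: percept 1 has two matching annotations,
-- percept 2 has none, percept 3 has one.
INSERT INTO percept VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO annotation VALUES (10, 1), (11, 1), (20, 2), (30, 3);
INSERT INTO annotation_property VALUES
    (100, 10, 'FOO', 'BAR'),
    (101, 11, 'FOO', 'BAR'),
    (102, 20, 'FOO', 'BAZ'),
    (103, 30, 'FOO', 'BAR');
""")

# Join form: percept 1 appears twice before DISTINCT removes the duplicate.
join_sql = """
SELECT DISTINCT p.id
FROM percept p
JOIN annotation a ON a.percept_id = p.id
JOIN annotation_property ap ON ap.annotation_id = a.id
WHERE ap.name = ? AND ap.value = ?
ORDER BY p.id
"""

# EXISTS form: each percept is emitted at most once, so no DISTINCT is needed.
exists_sql = """
SELECT p.id
FROM percept p
WHERE EXISTS (
    SELECT 1
    FROM annotation a
    JOIN annotation_property ap ON ap.annotation_id = a.id
    WHERE a.percept_id = p.id AND ap.name = ? AND ap.value = ?
)
ORDER BY p.id
"""

join_ids = [row[0] for row in conn.execute(join_sql, ("FOO", "BAR"))]
exists_ids = [row[0] for row in conn.execute(exists_sql, ("FOO", "BAR"))]
print(join_ids, exists_ids)
```

Both lists contain percepts 1 and 3 for this data. Whether EXISTS is actually faster on a real database depends on the planner and the indexes, so it is worth comparing the plans with EXPLAIN ANALYZE.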

Related

How does set_group_by work in Django?

I was writing the following query:
claim_query = ClaimBillSum.objects.filter(claim__lob__in = lobObj)\
.annotate(claim_count = Count("claim__claim_id", distinct=True))\
.annotate(claim_bill_sum = Sum("bill_sum"))\
.values("claim__body_part", "claim_count", "claim_bill_sum")\
.order_by("claim__body_part")
When I checked the query property, it was grouped by all properties of the tables related in this query, not only the ones selected in the values() function, when I only wanted to group by claim__body_part.
As I searched for a way to change the group by instruction, I found the query.set_group_by() function, that when applied, fixed the query in the way I wanted:
claim_query.query.set_group_by()
SELECT
"CLAIM"."body_part",
COUNT(DISTINCT "claim_bill_sum"."claim_id") AS "claim_count",
SUM("claim_bill_sum"."bill_sum") AS "claim_bill_sum"
FROM
"claim_bill_sum"
INNER JOIN "CLAIM" ON
("claim_bill_sum"."claim_id" = "CLAIM"."claim_id")
WHERE
"CLAIM"."lob_id" IN (SELECT U0."lob_id" FROM "LOB" U0 WHERE U0."client_id" = 1)
GROUP BY
"CLAIM"."body_part"
ORDER BY
"CLAIM"."body_part" ASC
But I couldn't find any information in the Django documentation or anywhere else that describes how this function works. Why does the default GROUP BY select all properties, and how does .set_group_by() work so that it selects exactly the property I wanted?

Create SQL command with a query parameter that checks for NULL but also for other values

I am trying to write a dynamic SQL command using Python / Postgres. In my where clause I want to use a query parameter (which is user defined) that has to look for NULL values within a code column (varchar), but in other cases also for specific numbers.
If I have to check for a certain value I use this:
cursor.execute("""
select
z.code as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
where
z.code = %(own_id)s;
""", {
'own_id': entry
})
However if I have to check for NULL values the SQL and more specifically the WHERE clause would have to be
select
z.code as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
WHERE
z.code IS NULL;
The query parameter as used in the first statement does not work for the second since it can only replace the value on the right side of the equation, but not the operator. I have read here (https://realpython.com/prevent-python-sql-injection/#using-query-parameters-in-sql) that table names can also be substituted using SQL identifiers provided by psycopg2, but could not find out how to replace a whole WHERE clause or at least the operator.
Unfortunately I cannot change the NULL values in the code column (e.g. using a default value) since these NULL values are created through the JOIN operation.
My only option at the moment would be to have different SQL queries based on the input value, but since the SQL query is quite long (I shortened it for this question) and I have many similar queries it would result in a lot of similar code...So how can I make this WHERE clause dynamic?
EDIT:
In addition to the answer marked as correct, I want to add my own solution, which is not so elegant, but might be helpful in more complicated scenarios, as the NULL fields are replaced with the 'default' value of COALESCE:
create view orsc as
select
coalesce(z.code, 'default') as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id;
SELECT
orsc.code as code2
from
orsc
WHERE
orsc.code = 'default'; -- or any other value
EDIT2: See comments of marked answer why a view is probably not necessary at all.
EDIT3: This question is not helpful since it asks only for checking for NULL values. Besides this an IF statement as shown in the answer would substantially increase my whole codebase (each query is quite long and is used often in slightly adapted ways).
Consider COALESCE to give NULL a default value. The example below assumes z.code is a varchar or text. If it is an integer/numeric, change 'default' to a number value (e.g., 9999).
sql = """SELECT
z.code as code
FROM
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
WHERE
COALESCE(z.code, 'default') = %(own_id)s;
"""
cursor.execute(sql, {'own_id': entry})
cursor.execute(sql, {'own_id': 'default'}) # RETURNS NULLs IN z.code
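I don't know the Python syntax well, but this is the idea: choose the condition in code, and pass the value as a query parameter only when it is needed.
if entry is None:
    # NULL must be tested with IS NULL and takes no parameter.
    condition = "z.code IS NULL"
    params = {}
else:
    condition = "z.code = %(own_id)s"
    params = {'own_id': entry}
cursor.execute("""
select
z.code as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
where
""" + condition + ";", params)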
If I understand you correctly, you want to be able to send in NULL as well as other values and have the correct result returned. That is, the problem is that the comparison returns nothing if the value sent in is NULL. This would solve the problem, perhaps with a small performance decrease.
cursor.execute("""
select
z.code as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
where
z.code is not distinct from %(own_id)s;
""", {
'own_id': entry
})
Best regards,
Bjarni
As in the solution I had linked in the comments, there's really no way around using an if block. You can try this:
pre_query = """
select
z.code as code
from
s_order as o
LEFT JOIN
s_code as z ON o.id = z.id
where
"""
# Default case: compare against the supplied value.
zcode = "z.code = %s"
args = (entry,)
if entry is None:
    # NULL needs IS NULL and takes no parameter.
    zcode = "z.code IS NULL"
    args = ()
end_query = ";"  # other conditions
full_query = pre_query + zcode + end_query
cursor.execute(full_query, args)
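If many queries need the same switch, the if block can live in one small helper so it is not repeated in every query. The following is only a sketch: `build_code_query` and `BASE_QUERY` are hypothetical names, and the placeholder style (`%s`) assumes psycopg2. Only fixed SQL fragments are formatted into the string; the user value always travels as a parameter.

```python
from typing import Any, Optional, Tuple

# Query template; {condition} is filled with one of two fixed fragments below.
BASE_QUERY = """
select
    z.code as code
from
    s_order as o
    LEFT JOIN s_code as z ON o.id = z.id
where
    {condition};
"""


def build_code_query(entry: Optional[Any]) -> Tuple[str, tuple]:
    """Return (sql, params) matching z.code against entry, or IS NULL when entry is None."""
    if entry is None:
        return BASE_QUERY.format(condition="z.code IS NULL"), ()
    return BASE_QUERY.format(condition="z.code = %s"), (entry,)


sql_null, params_null = build_code_query(None)
sql_value, params_value = build_code_query("42")
print(sql_null, params_null)
print(sql_value, params_value)
```

With a psycopg2 cursor the call would then be `cursor.execute(*build_code_query(entry))`.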
You can use z.code IS NOT DISTINCT FROM %(own_id)s, which is like = but treats NULLs like normal values (i.e. NULL is equal to NULL and not equal to any non-null value).
See Postgres comparison operators.
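The behaviour of a null-safe comparison can be tried without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module, where the null-safe equality operator is spelled IS; on PostgreSQL the same WHERE clause would read z.code IS NOT DISTINCT FROM %(own_id)s. The table contents and the `find_orders` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE s_order (id INTEGER PRIMARY KEY);
CREATE TABLE s_code (id INTEGER PRIMARY KEY, code TEXT);
-- Hypothetical rows: order 3 has no s_code row, so the LEFT JOIN yields NULL for it.
INSERT INTO s_order VALUES (1), (2), (3);
INSERT INTO s_code VALUES (1, '42'), (2, '77');
""")

# SQLite's IS compares like = but also matches NULL against NULL.
sql = """
SELECT o.id, z.code
FROM s_order AS o
LEFT JOIN s_code AS z ON o.id = z.id
WHERE z.code IS :own_id
ORDER BY o.id
"""


def find_orders(entry):
    """Run the same statement for a concrete code or for None (NULL)."""
    return conn.execute(sql, {"own_id": entry}).fetchall()


matches_value = find_orders("42")
matches_null = find_orders(None)
print(matches_value, matches_null)
```

The same statement and the same parameter name serve both cases, so no branching is needed in the calling code.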

Django: aggregate returns a wrong result after using annotate

When aggregating a queryset, I noticed that if I use an annotation before, I get a wrong result. I can't understand why.
The Code
from django.db.models import QuerySet, Max, F, ExpressionWrapper, DecimalField, Sum
from orders.models import OrderOperation


class OrderOperationQuerySet(QuerySet):
    def last_only(self) -> QuerySet:
        return self \
            .annotate(last_oo_pk=Max('order__orderoperation__pk')) \
            .filter(pk=F('last_oo_pk'))

    @staticmethod
    def _hist_price(orderable_field):
        return ExpressionWrapper(
            F(f'{orderable_field}__hist_unit_price') * F(f'{orderable_field}__quantity'),
            output_field=DecimalField())

    def ordered_articles_data(self):
        return self.aggregate(
            sum_ordered_articles_amounts=Sum(self._hist_price('orderedarticle')))
The Test
qs1 = OrderOperation.objects.filter(order__pk=31655)
qs2 = OrderOperation.objects.filter(order__pk=31655).last_only()
assert qs1.count() == qs2.count() == 1 and qs1[0] == qs2[0] # shows that both querysets contains the same object
qs1.ordered_articles_data()
> {'sum_ordered_articles_amounts': Decimal('3.72')} # expected result
qs2.ordered_articles_data()
> {'sum_ordered_articles_amounts': Decimal('3.01')} # wrong result
How is it possible that this last_only annotation method can make the aggregation result different (and wrong)?
The "funny" thing is that it seems to happen only when the order contains articles that have the same hist_price.
Side note
I can confirm that the SQL created by Django ORM is probably wrong, because when I force execution of last_only() and then I call aggregation in a second query, it works as expected.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#combining-multiple-aggregations could be an explanation?
SQL Queries
(note that these are the actual queries but the code above has been slightly simplified, which explains the presence below of COALESCE and "deleted" IS NULL.)
-- qs1.ordered_articles_data()
SELECT
COALESCE(
SUM(
("orders_orderedarticle"."hist_unit_price" * "orders_orderedarticle"."quantity")
),
0) AS "sum_ordered_articles_amounts"
FROM "orders_orderoperation"
LEFT OUTER JOIN "orders_orderedarticle"
ON ("orders_orderoperation"."id" = "orders_orderedarticle"."order_operation_id")
WHERE ("orders_orderoperation"."order_id" = 31655 AND "orders_orderoperation"."deleted" IS NULL)
-- qs2.ordered_articles_data()
SELECT COALESCE(SUM(("__col1" * "__col2")), 0)
FROM (
SELECT
"orders_orderoperation"."id" AS Col1,
MAX(T3."id") AS "last_oo_pk",
"orders_orderedarticle"."hist_unit_price" AS "__col1",
"orders_orderedarticle"."quantity" AS "__col2"
FROM "orders_orderoperation" INNER JOIN "orders_order"
ON ("orders_orderoperation"."order_id" = "orders_order"."id")
LEFT OUTER JOIN "orders_orderoperation" T3
ON ("orders_order"."id" = T3."order_id")
LEFT OUTER JOIN "orders_orderedarticle"
ON ("orders_orderoperation"."id" = "orders_orderedarticle"."order_operation_id")
WHERE ("orders_orderoperation"."order_id" = 31655 AND "orders_orderoperation"."deleted" IS NULL)
GROUP BY
"orders_orderoperation"."id",
"orders_orderedarticle"."hist_unit_price",
"orders_orderedarticle"."quantity"
HAVING "orders_orderoperation"."id" = (MAX(T3."id"))
) subquery
When you use an aggregate function in SQL, every selected column that is not inside the aggregate must appear in the GROUP BY clause, and you can see exactly that inside the subquery:
GROUP BY
"orders_orderoperation"."id",
"orders_orderedarticle"."hist_unit_price",
"orders_orderedarticle"."quantity"
HAVING "orders_orderoperation"."id" = (MAX(T3."id"))
As a result, articles with the same hist_unit_price and quantity fall into the same group and are collapsed into a single row, which is then filtered by max id. So, based on your screenshot, one of the chocolate or cafe rows is dropped from the sum.
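The collapsing effect can be reproduced on a tiny data set. This sketch uses Python's built-in sqlite3 module with simplified table names and made-up rows (prices are stored as integer cents to keep the arithmetic exact, and the join through the order table is shortened to a self-join on order_id). It is not the exact SQL Django generated, only the same GROUP BY shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orderoperation (id INTEGER PRIMARY KEY, order_id INTEGER);
CREATE TABLE orderedarticle (
    id INTEGER PRIMARY KEY,
    order_operation_id INTEGER,
    hist_unit_price INTEGER,  -- price in cents (assumption for exact sums)
    quantity INTEGER
);
INSERT INTO orderoperation VALUES (1, 1);
-- Hypothetical articles: the first two share the same price and quantity.
INSERT INTO orderedarticle VALUES (1, 1, 150, 1), (2, 1, 150, 1), (3, 1, 200, 1);
""")

# Plain aggregation: every article row contributes to the sum.
plain_sum = conn.execute("""
SELECT COALESCE(SUM(oa.hist_unit_price * oa.quantity), 0)
FROM orderoperation oo
LEFT JOIN orderedarticle oa ON oa.order_operation_id = oo.id
WHERE oo.order_id = 1
""").fetchone()[0]

# Annotate-then-aggregate shape: the inner GROUP BY on price and quantity
# merges the two identical articles into one row before the outer SUM runs.
grouped_sum = conn.execute("""
SELECT COALESCE(SUM(price * qty), 0)
FROM (
    SELECT oo.id, MAX(t3.id) AS last_oo_pk,
           oa.hist_unit_price AS price, oa.quantity AS qty
    FROM orderoperation oo
    LEFT JOIN orderoperation t3 ON t3.order_id = oo.order_id
    LEFT JOIN orderedarticle oa ON oa.order_operation_id = oo.id
    WHERE oo.order_id = 1
    GROUP BY oo.id, oa.hist_unit_price, oa.quantity
    HAVING oo.id = MAX(t3.id)
) AS sub
""").fetchone()[0]

print(plain_sum, grouped_sum)
```

The plain query sums all three articles, while the grouped form counts the two identical articles only once, which matches the "only when articles have the same price" symptom described in the question.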
Splitting the work into subqueries with smaller joins prevents this kind of problem. Joining several child relations in one query can produce an unnecessarily huge Cartesian product of independent sets, and it makes the GROUP BY clause of the resulting SQL hard to control because several parts of the query contribute to it.
Solution: a subquery is used to get the primary keys of the last order operations.
The outer query is kept simple, without added joins or groups, so it is not distorted by a possible aggregation on children.
def last_only(self) -> QuerySet:
    max_ids = (self.values('order').order_by()
               .annotate(last_oo_pk=Max('order__orderoperation__pk'))
               .values('last_oo_pk')
               )
    return self.filter(pk__in=max_ids)
test
ret = (OrderOperationQuerySet(OrderOperation).filter(order__in=[some_order])
.last_only().ordered_articles_data())
Executed SQL (simplified by removing the app name prefix orders_ and the double quotes "):
SELECT CAST(SUM((orderedarticle.hist_unit_price * orderedarticle.quantity))
AS NUMERIC) AS sum_ordered_articles_amounts
FROM orderoperation
LEFT OUTER JOIN orderedarticle ON (orderoperation.id = orderedarticle.order_operation_id)
WHERE (
orderoperation.order_id IN (31655) AND
orderoperation.id IN (
SELECT MAX(U2.id) AS last_oo_pk
FROM orderoperation U0
INNER JOIN order U1 ON (U0.order_id = U1.id)
LEFT OUTER JOIN orderoperation U2 ON (U1.id = U2.order_id)
WHERE U0.order_id IN (31655)
GROUP BY U0.order_id
)
)
The original invalid SQL could be fixed by adding "orders_orderedarticle"."id" to GROUP BY, but only if last_only() and ordered_articles_data() are used together. That is not a readable or maintainable approach.

Filter query by linked object key in SQLAlchemy

Judging by the title this would be the exact same question, but I can't see how any of the answers are applicable to my use case:
I have two classes and a relationship between them:
treatment_association = Table('tr_association', Base.metadata,
    Column('chronic_treatments_id', Integer, ForeignKey('chronic_treatments.code')),
    Column('animals_id', Integer, ForeignKey('animals.id'))
)


class ChronicTreatment(Base):
    __tablename__ = "chronic_treatments"
    code = Column(String, primary_key=True)


class Animal(Base):
    __tablename__ = "animals"
    treatment = relationship("ChronicTreatment", secondary=treatment_association, backref="animals")
I would like to be able to select only the animals which have undergone a treatment which has the code "X". I tried quite a few approaches.
This one fails with an AttributeError:
sql_query = session.query(Animal.treatment).filter(Animal.treatment.code == "chrFlu")
for item in sql_query:
    pass
mystring = str(session.query(Animal))
And this one happily returns a list of unfiltered animals:
sql_query = session.query(Animal.treatment).filter(ChronicTreatment.code == "chrFlu")
for item in sql_query:
    pass
mystring = str(session.query(Animal))
The closest thing to the example from the aforementioned thread I could put together:
subq = session.query(Animal.id).subquery()
sql_query = session.query(ChronicTreatment).join((subq, subq.c.treatment_id=="chrFlu"))
for item in sql_query:
    pass
mystring = str(session.query(Animal))
mydf = pd.read_sql_query(mystring,engine)
Also fails with an AttributeError.
Can you help me sort this out?
First, there are two issues with table definitions:
1) In treatment_association you have an Integer column pointing to chronic_treatments.code, while code is a String column.
I think it's better to have an integer id in chronic_treatments, so you don't duplicate the string code in another table and can also add more fields to chronic_treatments later.
Update: that is not exactly correct. You can still add more fields, but it will be more complex to change your 'code' if you decide to rename it.
2) In the Animal model you have a relation named treatment. This is confusing because it is a many-to-many relation, so the name should be plural: treatments.
After fixing the above two, it should be clearer why your queries did not work.
This one (I replaced treatment with treatments):
sql_query = session.query(Animal.treatments).filter(
Animal.treatments.code == "chrFlu")
Animal.treatments represents a many-to-many relationship; it is not a SQLAlchemy model, so you can't pass it to the query or use it in a filter this way.
The next one can't work for the same reason (you pass Animal.treatments into the query).
The last one is closer; you actually need a join to get your results.
I think it is easier to understand the query as SQL (and you need to know SQL anyway to be able to use SQLAlchemy):
animals = session.query(Animal).from_statement(text(
"""
select distinct animals.* from animals
left join tr_association assoc on assoc.animals_id = animals.id
left join chronic_treatments on chronic_treatments.id = assoc.chronic_treatments_id
where chronic_treatments.code = :code
""")
).params(code='chrFlu')
It will select animals and join chronic_treatments through the tr_association and filter the result by code.
Having this it is easy to rewrite it using SQL-less syntax:
sql_query = session.query(Animal).join(Animal.treatments).filter(
ChronicTreatment.code == "chrFlu")
That will return what you want - a list of animals who have related chronic treatment with given code.
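To see what a join through the association table returns, the following sketch runs that kind of SQL directly with Python's built-in sqlite3 module, so it needs no SQLAlchemy installation. It assumes the schema after the fixes suggested above (an integer id on chronic_treatments); the `name` column, the rows and the `animals_with_treatment` helper are made up for illustration, and the SQL that SQLAlchemy actually emits may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chronic_treatments (id INTEGER PRIMARY KEY, code TEXT UNIQUE);
CREATE TABLE animals (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tr_association (
    chronic_treatments_id INTEGER REFERENCES chronic_treatments(id),
    animals_id INTEGER REFERENCES animals(id)
);
-- Hypothetical data: animal 1 has both treatments, animal 2 only chrOther,
-- animal 3 has no treatment at all.
INSERT INTO chronic_treatments VALUES (1, 'chrFlu'), (2, 'chrOther');
INSERT INTO animals VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO tr_association VALUES (1, 1), (2, 1), (2, 2);
""")

# Inner joins are enough here: the WHERE on chronic_treatments.code would
# discard the NULL rows that a LEFT JOIN adds for untreated animals anyway.
sql = """
SELECT DISTINCT animals.id
FROM animals
JOIN tr_association AS assoc ON assoc.animals_id = animals.id
JOIN chronic_treatments ON chronic_treatments.id = assoc.chronic_treatments_id
WHERE chronic_treatments.code = :code
ORDER BY animals.id
"""


def animals_with_treatment(code):
    """Return the ids of animals linked to a treatment with the given code."""
    return [row[0] for row in conn.execute(sql, {"code": code})]


flu_ids = animals_with_treatment("chrFlu")
other_ids = animals_with_treatment("chrOther")
print(flu_ids, other_ids)
```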

'Don't care' for a column in SQLite queries?

I've got a SQLite query, which depends on 2 variables, gender and hand. Each of these can have 3 values, 2 which actually mean something (so male/female and left/right) and the third is 'all'. If a variable has a value of 'all' then I don't care what the particular value of that column is.
Is it possible to achieve this functionality with a single query, and just changing the variable? I've had a look for a wildcard or don't care operator but haven't been able to find any except for % which doesn't work in this situation.
Obviously I could make a bunch of if statements and have different queries to use for each case but that's not very elegant.
Code:
select_sql = """ SELECT * FROM table
WHERE (gender = ? AND hand = ?)
"""
cursor.execute(select_sql, (gender_var, hand_var))
I.e. this query works if gender_var = 'male' and hand_var = 'left', but not if gender_var or hand_var = 'all'.
You can indeed do this with a single query. Simply compare each variable to 'all' in your query.
select_sql = """ SELECT * FROM table
WHERE ((? = 'all' OR gender = ?) AND (? = 'all' OR hand = ?))
"""
cursor.execute(select_sql, (gender_var, gender_var, hand_var, hand_var))
Basically, when gender_var or hand_var is 'all', the first part of each OR expression is always true, so that branch of the AND is always true and matches all records, i.e., it is a no-op in the query.
It might be better to build a query dynamically in Python, however, that has just the fields you actually need to test. It might be noticeably faster, but you'd have to benchmark that to be sure.
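A minimal sketch of the dynamic approach, using Python's built-in sqlite3 module: the WHERE clause is assembled from only the filters that are not 'all'. The `people` table, its rows and the `select_people` helper are hypothetical names chosen for the example. Column names come from a fixed list in the code and the values are still bound as parameters, so no user input is formatted into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, gender TEXT, hand TEXT);
-- Hypothetical rows, ids 1 to 4 in insertion order.
INSERT INTO people (gender, hand) VALUES
    ('male', 'left'), ('male', 'right'), ('female', 'left'), ('female', 'right');
""")


def select_people(gender_var, hand_var):
    """Return ids matching the filters; a value of 'all' skips that column."""
    conditions = []
    params = []
    # Column names are fixed here, never taken from user input.
    for column, value in (("gender", gender_var), ("hand", hand_var)):
        if value != "all":
            conditions.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT id FROM people"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


print(select_people("male", "left"))
print(select_people("all", "left"))
print(select_people("all", "all"))
```

Whether this is measurably faster than the single-query version with `? = 'all' OR ...` depends on the data and indexes, so it is worth benchmarking both.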
