Aggregating joined tables in SQLAlchemy - python

I got this aggregation working in the Django ORM; it counts some values and percentages over a big queryset and returns the resulting dictionary.
queryset = Game.objects.prefetch_related(
    "timestamp",
    "fighters",
    "score",
    "coefs",
    "rounds",
    "rounds_view",
    "rounds_view_f",
    "finishes",
    "rounds_time",
    "round_time",
    "time_coef",
    "totals",
).all()
values = queryset.aggregate(
    first_win_cnt=Count("score", filter=Q(score__first_score=5)),
    min_time_avg=Avg("round_time__min_time"),
    # and so on
)  # -> dict
I'm trying to achieve the same using SQLAlchemy, and this is my attempt so far:
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
    .options(
        contains_eager(models.Game.fighters),
        contains_eager(models.Game.timestamp),
        contains_eager(models.Game.coefs),
        contains_eager(models.Game.rounds),
        contains_eager(models.Game.rounds_view),
        contains_eager(models.Game.rounds_view_f),
        contains_eager(models.Game.finishes),
        contains_eager(models.Game.score),
        contains_eager(models.Game.rounds_time),
        contains_eager(models.Game.round_time),
        contains_eager(models.Game.time_coef),
        contains_eager(models.Game.totals),
    )
    .all()
)
And it gives me an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError)
column "stats_fighters.id" must appear in the GROUP BY clause or be
used in an aggregate function LINE 1: SELECT stats_fighters.id AS
stats_fighters_id, stats_fighter...
I don't really understand why stats_fighters.id should be in the GROUP BY, or why I need to use GROUP BY at all. Please help.
This is the SQL that the Django ORM generates:
SELECT
AVG("stats_roundtime"."min_time") AS "min_time_avg",
COUNT("stats_score"."id") FILTER (
WHERE "stats_score"."first_score" = 5) AS "first_win_cnt"
FROM "stats_game" LEFT OUTER JOIN "stats_roundtime" ON ("stats_game"."id" = "stats_roundtime"."game_id")
LEFT OUTER JOIN "stats_score" ON ("stats_game"."id" = "stats_score"."game_id")

GROUP BY is used when rows share the same values and you want to calculate a summary over them; it is often used with SUM, MAX, MIN or AVG.
Since SQLAlchemy generates the final SQL command, you need to know your table structure and work out how to make SQLAlchemy generate the right SQL.
The documentation says there is a group_by method in SQLAlchemy.
Maybe this code helps:
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
    # group by the Game primary key so the selected Game columns are valid
    .group_by(models.Game.id)
    .all()
)

func.count is an aggregate function. If any expression in your SELECT clause uses an aggregate, then every expression in the SELECT must be a constant, an aggregate, or appear in the GROUP BY.
If you try SELECT a, MAX(b), the SQL parser complains that a is neither an aggregate nor in the GROUP BY. In your case, consider adding models.Game to the GROUP BY.
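For instance, a trimmed sketch of the query above with a GROUP BY added (assuming models.Game.id is the primary key; note that every contains_eager() also adds that table's columns to the SELECT, which is exactly why stats_fighters.id was demanded in the GROUP BY):

q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.score)
    # PostgreSQL can infer the remaining Game columns from the primary key.
    .group_by(models.Game.id)
    .all()
)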

Related

SQLAlchemy - column must appear in the GROUP BY clause or be used in an aggregate function

I am using SQLAlchemy with PostgreSQL.
Tables:
Shape[id, name, timestamp, user_id]  # user_id refers to the id column in the user table
User[id, name]
This query:
query1 = self.session.query(Shape.timestamp, Shape.name, User.id,
        extract('day', Shape.timestamp).label('day'),
        extract('hour', Shape.timestamp).label('hour'),
        func.count(Shape.id).label("total"),
    )\
    .join(User, User.id == Shape.user_id)\
    .filter(Shape.timestamp.between(from_datetime, to_datetime))\
    .group_by(Shape.user_id)\
    .group_by('hour')\
    .all()
This works well with sqlite3 + SQLAlchemy, but it is not working with PostgreSQL + SQLAlchemy.
I got this error: (psycopg2.errors.GroupingError) column "Shape.timestamp" must appear in the GROUP BY clause or be used in an aggregate function
I need to group only by the user_id and the hour of the timestamp, where Shape.timestamp is a Python DateTime object.
But the error says to add Shape.timestamp to the group_by as well, and if I add Shape.timestamp to the group_by, then it shows all the records.
If I need to use some function on the other columns, how do I get those columns' actual data? Is there any way to get the column data as-is, without adding it to group_by or applying some function?
How do I solve this?
This is a basic SQL issue: what if, within a group, there are several timestamp values?
You either need to use an aggregate function (COUNT, MIN, MAX, AVG) or list the column in your GROUP BY.
NB: SQLite allows ungrouped columns alongside GROUP BY, in which case each one "is evaluated against a single arbitrarily chosen row from within the group" (SQLite docs, section 2.4).
Try changing line 9 to group by the actual expression instead of the label, and drop (or aggregate) the remaining ungrouped columns, since Shape.timestamp and Shape.name can't stay bare in the SELECT list under PostgreSQL:
query1 = self.session.query(User.id,
        extract('day', Shape.timestamp).label('day'),
        extract('hour', Shape.timestamp).label('hour'),
        func.count(Shape.id).label("total"),
    )\
    .join(User, User.id == Shape.user_id)\
    .filter(Shape.timestamp.between(from_datetime, to_datetime))\
    .group_by(User.id)\
    .group_by(extract('day', Shape.timestamp))\
    .group_by(extract('hour', Shape.timestamp))\
    .all()

Add GROUP BY COUNT(*) to UNION ALL Query in SQLAlchemy - EXCEPT Equivalent

I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this:
union_query = query1.union_all(query2)
What I want to do now is perform a GROUP BY using several attributes and then get only the rows where COUNT(*) equals 1. How can I do this?
I know I can do the GROUP BY like this:
group_query = union_query.group_by(*columns)
But how do I add the COUNT(*) condition?
So the final outcome should be the equivalent of this query:
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would also like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this:
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this would be EXCEPT or EXCEPT ALL, but my database is running on MariaDB 8 and therefore these operations are not supported.
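For reference, on a backend that does support set difference, the legacy Query API exposes it directly:

final_query = query1.except_all(query2)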
For the first query, try the following, where final_query is the query you want to run (func is sqlalchemy.func):
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
For the second query, wrap the result in a subquery and select the distinct column from it:
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()
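The same shape in 2.0-style Core syntax might look like this (a sketch with a hypothetical table t that has columns a and b):

from sqlalchemy import func, select, union_all

# Hypothetical: t is a Table with columns a and b.
u = union_all(
    select(t.c.a, t.c.b).where(t.c.b > 0),
    select(t.c.a, t.c.b).where(t.c.b < 0),
).subquery()

# Keep only the groups that occur exactly once, then take the
# distinct values of one column.
stmt = (
    select(u.c.a)
    .group_by(u.c.a, u.c.b)
    .having(func.count() == 1)
    .distinct()
)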
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries

SQLAlchemy: How to use group_by() correctly (only_full_group_by)?

I'm trying to use the group_by() function of SQLAlchemy with the mysql+mysqlconnector engine:
rows = session.query(MyModel) \
    .order_by(MyModel.published_date.desc()) \
    .group_by(MyModel.category_id) \
    .all()
It works fine with SQLite, but for MySQL I get this error:
[42000][1055] Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column '...' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
I know how to solve it in plain SQL, but I'd like to use the advantages of SQLAlchemy.
What's the proper solution with SQLAlchemy?
Thanks in advance
One way to form the greatest-n-per-group query with well defined behaviour would be to use a LEFT JOIN, looking for MyModel rows per category_id that have no matching row with greater published_date:
my_model_alias = aliased(MyModel)

rows = session.query(MyModel).\
    outerjoin(my_model_alias,
              and_(my_model_alias.category_id == MyModel.category_id,
                   my_model_alias.published_date > MyModel.published_date)).\
    filter(my_model_alias.id == None).\
    all()
This will work in just about any SQL DBMS. In SQLite 3.25.0 and MySQL 8 (and many others) you could use window functions to achieve the same:
sq = session.query(
        MyModel,
        func.row_number()
            .over(partition_by=MyModel.category_id,
                  order_by=MyModel.published_date.desc())
            .label('rn')).\
    subquery()

my_model_alias = aliased(MyModel, sq)

rows = session.query(my_model_alias).\
    filter(sq.c.rn == 1).\
    all()
Of course you could use GROUP BY as well, if you then use the results in a join:
max_pub_dates = session.query(
        MyModel.category_id,
        func.max(MyModel.published_date).label('published_date')).\
    group_by(MyModel.category_id).\
    subquery()

# note: the subquery's columns are accessed through its .c collection
rows = session.query(MyModel).\
    join(max_pub_dates,
         and_(max_pub_dates.c.category_id == MyModel.category_id,
              max_pub_dates.c.published_date == MyModel.published_date)).\
    all()

Get the newest rows for each foreign key ID

I don't want to aggregate any columns. I just want the newest row for each foreign key in a table.
I've tried grouping.
Model.query.order_by(Model.created_at.desc()).group_by(Model.foreign_key_id).all()
# column "model.id" must appear in the GROUP BY clause
And I've tried distinct.
Model.query.order_by(Model.created_at.desc()).distinct(Model.foreign_key_id).all()
# SELECT DISTINCT ON expressions must match initial ORDER BY expressions
This is known as greatest-n-per-group, and for PostgreSQL you can use DISTINCT ON, as in your second example:
SELECT DISTINCT ON (foreign_key_id) * FROM model ORDER BY foreign_key_id, created_at DESC;
In your attempt, you were missing the DISTINCT ON column in your ORDER BY list, so all you had to do was:
Model.query.order_by(Model.foreign_key_id, Model.created_at.desc()).distinct(Model.foreign_key_id)
The solution is to left join an aliased model to itself (with a special join condition), then keep only the rows where the alias has no id, i.e. where no newer row exists:
from sqlalchemy import and_
from sqlalchemy.orm import aliased

model_alias = aliased(Model)  # don't shadow the aliased() function itself

query = Model.query.outerjoin(model_alias, and_(
    model_alias.foreign_key_id == Model.foreign_key_id,
    model_alias.created_at > Model.created_at,
))
query = query.filter(model_alias.id.is_(None))

Django Subquery with Scalar Value

Is it possible to compare subquery results with scalar values using the Django ORM? I'm having a hard time converting this:
SELECT payment_subscription.*
FROM payment_subscription payment_subscription
JOIN payment_recurrent payment_recurrent ON payment_subscription.id = payment_recurrent.subscription_id
WHERE
payment_subscription.status = 1
AND (SELECT expiration_date
FROM payment_transaction payment_transaction
WHERE payment_transaction.company_id = payment_subscription.company_id
AND payment_transaction.status IN ('OK', 'Complete')
ORDER BY payment_transaction.expiration_date DESC, payment_transaction.id DESC
LIMIT 1) <= ?
The main points are:
The comparison of the subquery's scalar result with an arbitrary parameter at the end.
The join between the subquery and the outer query on the company id.
Subscription.objects.annotate(
    max_expiration_date=Max('transaction__expiration_date')
).filter(
    status=1,
    recurrent__isnull=False,  # [inner] join with recurrent
    transaction__status__in=['OK', 'Complete'],
    max_expiration_date__lte=date_value,
)
This produces a different SQL query but obtains the same Subscription objects.
You can (as of Django 1.11) annotate on a subquery, and slice it to ensure you only get the "first" result. You can then filter on that subquery annotations, by comparing to the value you want.
from django.db.models.expressions import Subquery, OuterRef

expiration_date = Transaction.objects.filter(
    company=OuterRef('company'),
    status__in=['OK', 'Complete'],
).order_by('-expiration_date').values('expiration_date')[:1]

Subscription.objects.filter(status=1).annotate(
    expiration_date=Subquery(expiration_date),
).filter(expiration_date__lte=THE_DATE)
However...
Currently that can result in really poor performance: your database will evaluate the subquery twice (once in the where clause, from the filter, and again in the select clause, from the annotation). There is work underway to resolve this, but it's not currently complete.
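Since this answer was written, Django 3.2 added QuerySet.alias(), which binds the subquery for filtering without also selecting it, so it is evaluated only once, in the WHERE clause. A sketch of the same query:

# Django 3.2+: alias() keeps the subquery out of the SELECT clause.
Subscription.objects.filter(status=1).alias(
    expiration_date=Subquery(expiration_date),
).filter(expiration_date__lte=THE_DATE)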
