Add GROUP BY / COUNT(*) to UNION ALL Query in SQLAlchemy - EXCEPT Equivalent - python

I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this,
union_query = query1.union_all(query2)
What I want to do now is to perform a GROUP BY on several attributes and then keep only the rows where COUNT(*) equals 1. How can I do this?
I know I can do the GROUP BY like this,
group_query = union_query.group_by(*columns)
But, how do I add the COUNT(*) condition?
So, the final outcome should be the equivalent of this query,
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this,
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this would be EXCEPT or EXCEPT ALL, but my database is running on MySQL 8, where these operations are not supported (MySQL only gained EXCEPT in 8.0.31).

For the first query, try the following, where final_query is the query you want to run.
from sqlalchemy import func

union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
For the second query, try the following; the distinct column is selected from the grouped result via a subquery.
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()
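Putting both pieces together, here is a minimal self-contained sketch; the session object, the mapped classes ModelA/ModelB, and the key/value column names are illustrative assumptions, not from the original question:
from sqlalchemy import func

# Both sides of the union must select the same columns in the same order.
query1 = session.query(ModelA.key, ModelA.value)
query2 = session.query(ModelB.key, ModelB.value)

sub = query1.union_all(query2).subquery()

# Rows that appear exactly once across the two SELECTs.
unique_rows = (
    session.query(sub.c.key, sub.c.value)
    .group_by(sub.c.key, sub.c.value)
    .having(func.count() == 1)
)

# Distinct values of one column from that result.
distinct_keys = (
    session.query(sub.c.key)
    .group_by(sub.c.key, sub.c.value)
    .having(func.count() == 1)
    .distinct()
)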
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries

Related

Aggregating joined tables in SQLAlchemy

I got this aggregate function working in Django ORM; it computes some counts and percentages over a large queryset and returns the results as a dictionary.
queryset = Game.objects.prefetch_related(
    "timestamp",
    "fighters",
    "score",
    "coefs",
    "rounds",
    "rounds_view",
    "rounds_view_f",
    "finishes",
    "rounds_time",
    "round_time",
    "time_coef",
    "totals",
).all()
values = queryset.aggregate(
    first_win_cnt=Count("score", filter=Q(score__first_score=5)),
    min_time_avg=Avg("round_time__min_time"),
    # and so on
)  # -> dict
I'm trying to achieve the same using SQLAlchemy, and this is my attempt so far:
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
    .options(
        contains_eager(models.Game.fighters),
        contains_eager(models.Game.timestamp),
        contains_eager(models.Game.coefs),
        contains_eager(models.Game.rounds),
        contains_eager(models.Game.rounds_view),
        contains_eager(models.Game.rounds_view_f),
        contains_eager(models.Game.finishes),
        contains_eager(models.Game.score),
        contains_eager(models.Game.rounds_time),
        contains_eager(models.Game.round_time),
        contains_eager(models.Game.time_coef),
        contains_eager(models.Game.totals),
    )
    .all()
)
And it gives me an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError)
column "stats_fighters.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT stats_fighters.id AS stats_fighters_id, stats_fighter...
I don't really understand why stats_fighters.id should be in the GROUP BY, or why I need a GROUP BY at all. Please help.
This is the SQL that the Django ORM generates:
SELECT
    AVG("stats_roundtime"."min_time") AS "min_time_avg",
    COUNT("stats_score"."id") FILTER (
        WHERE "stats_score"."first_score" = 5) AS "first_win_cnt"
FROM "stats_game"
LEFT OUTER JOIN "stats_roundtime" ON ("stats_game"."id" = "stats_roundtime"."game_id")
LEFT OUTER JOIN "stats_score" ON ("stats_game"."id" = "stats_score"."game_id")
GROUP BY is used when rows share the same values and you want to calculate a summary over them; it is often combined with SUM, MAX, MIN, or AVG.
Since SQLAlchemy generates the final SQL command, you need to know your table structure and work out how to make SQLAlchemy produce the right SQL.
The docs say there is a group_by() method in SQLAlchemy.
Maybe this code will help.
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .group_by(models.Game.fighters)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
)
func.count is an aggregate function. If any expression in your SELECT clause uses an aggregate, then every other expression in the SELECT must be a constant, an aggregate, or appear in the GROUP BY.
If you try SELECT a, MAX(b), the SQL parser will complain that a is neither an aggregate nor in the GROUP BY. In your case, consider adding models.Game to the GROUP BY; see the sketch below.
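A hedged sketch of both directions, reusing the names from the question; the attribute names models.Score.id and models.RoundTime.min_time are assumptions inferred from the Django SQL above:
from sqlalchemy import func

# Option 1: keep models.Game in the SELECT and aggregate per game.
per_game = (
    db.query(
        models.Game.id,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .outerjoin(models.Game.score)
    .group_by(models.Game.id)
)

# Option 2: select aggregates only (no GROUP BY needed), which is
# what the Django-generated SQL above actually does.
totals = (
    db.query(
        func.count(models.Score.id)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
        func.avg(models.RoundTime.min_time).label("min_time_avg"),
    )
    .select_from(models.Game)
    .outerjoin(models.Game.score)
    .outerjoin(models.Game.round_time)
)
first_win_cnt, min_time_avg = totals.one()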

Filter by results of select

I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each, so I can't build an in-memory set of the existing package_id values.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
from peewee import fn

q = (A
     .select(fn.COUNT(A.id))
     .where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
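Since the question mentions a LEFT JOIN: here is an anti-join sketch, which often performs better than NOT IN on tables this size. It assumes B.package_id references A.id and is never NULL (NOT IN and an anti-join behave differently when the subquery can return NULLs):
from peewee import fn, JOIN

q = (A
     .select(fn.COUNT(A.id))
     .join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id))
     .where(B.package_id.is_null()))
count = q.scalar()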

SQLAlchemy: How to use group_by() correctly (only_full_group_by)?

I'm trying to use the group_by() function of SQLAlchemy with the mysql+mysqlconnector engine:
rows = session.query(MyModel) \
    .order_by(MyModel.published_date.desc()) \
    .group_by(MyModel.category_id) \
    .all()
It works fine with SQLite, but for MySQL I get this error:
[42000][1055] Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column '...' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
I know how to solve it in plain SQL, but I'd like to use the advantages of SQLAlchemy.
What's the proper solution with SQLAlchemy?
Thanks in advance
One way to form the greatest-n-per-group query with well-defined behaviour is to use a LEFT JOIN, looking for MyModel rows per category_id that have no matching row with a greater published_date:
from sqlalchemy import and_
from sqlalchemy.orm import aliased

my_model_alias = aliased(MyModel)
rows = session.query(MyModel).\
    outerjoin(my_model_alias,
              and_(my_model_alias.category_id == MyModel.category_id,
                   my_model_alias.published_date > MyModel.published_date)).\
    filter(my_model_alias.id == None).\
    all()
This will work in about any SQL DBMS. In SQLite 3.25.0 and MySQL 8 (and many others) you could use window functions to achieve the same:
sq = session.query(
    MyModel,
    func.row_number().
        over(partition_by=MyModel.category_id,
             order_by=MyModel.published_date.desc()).label('rn')).\
    subquery()

my_model_alias = aliased(MyModel, sq)
rows = session.query(my_model_alias).\
    filter(sq.c.rn == 1).\
    all()
Of course you could use GROUP BY as well, if you then use the results in a join:
max_pub_dates = session.query(
    MyModel.category_id,
    func.max(MyModel.published_date).label('published_date')).\
    group_by(MyModel.category_id).\
    subquery()

rows = session.query(MyModel).\
    join(max_pub_dates,
         and_(max_pub_dates.c.category_id == MyModel.category_id,
              max_pub_dates.c.published_date == MyModel.published_date)).\
    all()

Nested SELECT query in Pyspark DataFrames

Suppose I have two DataFrames in Pyspark and I'd like to run a nested SQL-like SELECT query along the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a simple filter by using where, as in
df.where(df.a.isin(my_list))
given that I have built the my_list tuple of values beforehand. How would I perform the nested query in one go instead?
As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join and distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
    ON table1.b = table2.b
WHERE table2.c = '1'
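A hedged PySpark sketch of that join-and-distinct approach, assuming df1 and df2 are the DataFrames for table1 and table2:
from pyspark.sql import functions as F

# Filter table2 first, keep only the join key, then inner-join and dedupe.
filtered = df2.where(F.col("c") == "1").select("b").distinct()
result = df1.join(filtered, on="b", how="inner").distinct()
Newer Spark versions also accept how="leftsemi", which keeps only df1's columns and matches the IN semantics without the extra distinct.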

SELECT * in SQLAlchemy?

Is it possible to do SELECT * in SQLAlchemy?
Specifically, SELECT * WHERE foo=1?
Is no one feeling the ORM love of SQLAlchemy today? The presented answers correctly describe the lower-level interface that SQLAlchemy provides. Just for completeness, this is the more-likely (for me) real-world situation where you have a session instance and a User class that is ORM mapped to the users table.
for user in session.query(User).filter_by(name='jack'):
    print(user)
    # ...
And this does an explicit select on all columns.
The following selection works for me in the core expression language (returning a RowProxy object):
foo_col = sqlalchemy.sql.column('foo')
s = sqlalchemy.sql.select(['*']).where(foo_col == 1)
If you don't list any columns, you get all of them.
query = users.select()
query = query.where(users.c.name == 'jack')
result = conn.execute(query)
for row in result:
    print(row)
Should work.
You can always use raw SQL too:
str_sql = sql.text("YOUR STRING SQL")
# if you have some args:
args = {
    'myarg1': yourarg1,
    'myarg2': yourarg2}
# then call the execute method on your connection
results = conn.execute(str_sql, args).fetchall()
Where Bar is the class mapped to your table and session is your sa session:
bars = session.query(Bar).filter(Bar.foo == 1)
Turns out you can do:
sa.select('*', ...)
I had the same issue: I was trying to get all the columns from a table as a list instead of getting ORM objects back, so that I could convert the list to a pandas dataframe and display it.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to dataframe here is the one liner:
pd.DataFrame.from_records(rows, columns=rows.keys())
For joins, if the columns are not defined manually, only the columns of the target table are returned. To get all columns for a join (User table joined with Group table):
sql = User.select().select_from(
    User.join(Group, User.c.group_id == Group.c.id))
# Add all columns of the Group table to the select
sql = sql.add_columns(*Group.c)
session.connection().execute(sql)
If you're using the ORM, you can build a query using the normal ORM constructs and then execute it directly to get raw column values:
query = session.query(User).filter_by(name='jack')
for cols in session.connection().execute(query):
    print(cols)
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo==1).all()
When a ORM class is passed to the query function, e.g. query(User), the result will be composed of ORM instances. In the majority of cases, this is what the dev wants and will be easiest to deal with--demonstrated by the popularity of the answer above that corresponds to this approach.
In some cases, devs may instead want an iterable sequence of values. In these cases, one can pass the list of desired column objects to query(). This answer shows how to pass the entire list of columns without hardcoding them, while still working with SQLAlchemy at the ORM layer.
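A hedged side-by-side illustration of the two styles, reusing the session and the User mapping assumed in the answers above:
# ORM instances (the common case):
users = session.query(User).filter(User.foo == 1).all()

# Plain tuples of column values, without hardcoding the column list:
rows = session.query(*User.__table__.columns).filter(User.foo == 1).all()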
