sqlalchemy joined alias doesn't have columns from both tables - python

All I want is the count from TableA grouped by a column from TableB, but of course I need the item from TableB each count is associated with. Better explained with code:
TableA and B are Model objects.
I'm trying to follow this syntax as best I can.
Trying to run this query:
sq = session.query(TableA).join(TableB).\
        group_by(TableB.attrB).subquery()
countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)
print session.query(countA, groupB).all()
But it gives me an AttributeError (sq does not have attrB)
I'm new to SA and I find it difficult to learn. (links to recommended educational resources welcome!)

When you make a subquery out of a select statement, the columns that can be accessed from it must be in the columns clause. Take for example a statement like:
select x, y from mytable where z=5
If we wanted to make a subquery out of it and then GROUP BY 'z', this would not be legal SQL:
select * from (select x, y from mytable where z=5) as mysubquery group by mysubquery.z
Because 'z' is not in the columns clause of "mysubquery" (it's also illegal since 'x' and 'y' should be in the GROUP BY as well, but that's a different issue).
SQLAlchemy works the same exact way. When you say query(..).subquery(), or use the alias() function on a core selectable construct, it means you're wrapping your SELECT statement in parenthesis, giving it a (usually generated) name, and giving it a new .c. collection that has only those columns that are in the "columns" clause, just like real SQL.
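For instance, a minimal sketch (assuming the TableA model from the question) showing that only the selected columns end up on .c:
sq = session.query(TableA.attrA).subquery()
print sq.c.keys()   # ['attrA'] -- accessing sq.c.attrB raises AttributeError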
So here you'd need to ensure that TableB, or at least the column you're dealing with externally, is part of the subquery's columns clause. You can also limit the columns clause to just those columns you need:
sq = session.query(TableA.attrA, TableB.attrB).join(TableB).\
        group_by(TableB.attrB).subquery()
countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)
print session.query(countA, groupB).all()
Note that the above query probably only works on MySQL, as in general SQL it's illegal to reference any columns that aren't part of an aggregate function, or part of the GROUP BY, when grouping is used. MySQL has a more relaxed (and sloppy) system in this regard.
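A more portable alternative (a sketch, assuming TableA has an id primary key) is to skip the subquery entirely and aggregate in a single query, which stricter databases also accept:
q = session.query(TableB.attrB, func.count(TableA.id)).\
        select_from(TableA).\
        join(TableB).\
        group_by(TableB.attrB)
print q.all()   # [(attrB_value, count), ...]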
edit: if you want the results to include the zeros (i.e. groups with no rows):
import collections

letter_count = collections.defaultdict(int)
for count, letter in session.query(func.count(MyClass.id), MyClass.attr).group_by(MyClass.attr):
    letter_count[letter] = count

for letter in ["A", "B", "C", "D", "E", ...]:
    print "Letter %s has %d elements" % (letter, letter_count[letter])
note letter_count[someletter] defaults to zero if otherwise not populated.

Related

Aggregating joined tables in SQLAlchemy

I got this aggregate function working in the Django ORM; it counts some values and percentages from a big queryset and returns the resulting dictionary.
queryset = Game.objects.prefetch_related(
    "timestamp",
    "fighters",
    "score",
    "coefs",
    "rounds",
    "rounds_view",
    "rounds_view_f",
    "finishes",
    "rounds_time",
    "round_time",
    "time_coef",
    "totals",
).all()
values = queryset.aggregate(
    first_win_cnt=Count("score", filter=Q(score__first_score=5)),
    min_time_avg=Avg("round_time__min_time"),
    # and so on
)  # -> dict
I'm trying to achieve the same using SQLAlchemy, and this is my attempt so far:
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
    .options(
        contains_eager(models.Game.fighters),
        contains_eager(models.Game.timestamp),
        contains_eager(models.Game.coefs),
        contains_eager(models.Game.rounds),
        contains_eager(models.Game.rounds_view),
        contains_eager(models.Game.rounds_view_f),
        contains_eager(models.Game.finishes),
        contains_eager(models.Game.score),
        contains_eager(models.Game.rounds_time),
        contains_eager(models.Game.round_time),
        contains_eager(models.Game.time_coef),
        contains_eager(models.Game.totals),
    )
    .all()
)
And it gives me an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError)
column "stats_fighters.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT stats_fighters.id AS stats_fighters_id, stats_fighter...
I don't really understand why stats_fighters.id should appear in the GROUP BY, or why I need a GROUP BY at all. Please help.
This is the SQL that the Django ORM generates:
SELECT
AVG("stats_roundtime"."min_time") AS "min_time_avg",
COUNT("stats_score"."id") FILTER (
WHERE "stats_score"."first_score" = 5) AS "first_win_cnt"
FROM "stats_game" LEFT OUTER JOIN "stats_roundtime" ON ("stats_game"."id" = "stats_roundtime"."game_id")
LEFT OUTER JOIN "stats_score" ON ("stats_game"."id" = "stats_score"."game_id")
GROUP BY is used when rows share the same values and you want to calculate a summary over them. It is often used with SUM, MAX, MIN or AVG.
Since SQLAlchemy generates the final SQL command, you need to know your table structure and find out how to make SQLAlchemy generate the right SQL.
The documentation says there is a group_by method in SQLAlchemy.
Maybe this code will help:
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.fighters)
    .filter_by(**fighter_options)
    .join(models.Game.timestamp)
    .group_by(models.Game.fighters)
    .join(
        models.Game.coefs,
        models.Game.rounds,
        models.Game.rounds_view,
        models.Game.rounds_view_f,
        models.Game.finishes,
        models.Game.score,
        models.Game.rounds_time,
        models.Game.round_time,
        models.Game.time_coef,
        models.Game.totals,
    )
)
func.count is an aggregate function. If any expression in your SELECT clause uses an aggregate, then every other expression in the SELECT must be a constant, an aggregate, or appear in the GROUP BY.
If you try SELECT a, MAX(b), the SQL parser will complain that a is neither an aggregate nor in the GROUP BY. In your case, consider adding models.Game to the GROUP BY, as sketched below.
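A hedged sketch of that fix (assuming models.Game.id is the primary key, and showing only the aggregate part of the query):
q = (
    db.query(
        models.Game,
        func.count(models.Score.first_score)
        .filter(models.Score.first_score == 5)
        .label("first_win_cnt"),
    )
    .join(models.Game.score)
    .group_by(models.Game.id)  # the PK makes the other Game columns functionally dependent
    .all()
)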

SQL Alchemy using Or_ looping multiple columns (Pandas Dataframes)

SUMMARY:
How to query against values from different dataframe columns with table.column_name combinations in SQLAlchemy using the or_() statement.
I'm working on a SQLAlchemy project where I pull down valid columns of a dataframe and enter them all into SQLAlchemy's filter. I've successfully got it running where it enters all entries of a column using the head of the column, like this:
qry = qry.filter(or_(*[getattr(Query_Tbl, column_head).like(x)
                       for x in df[column_head].dropna().values]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of the dataframe where the columns are different but still need to be placed together within the same or_() group,
i.e. (the desired result):
(tbl1.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x)
                        for x in df[set_id[0]].dropna().values]),
                     (*[getattr(Query_Tbl, set_id[1]).like(y)
                        for y in df[set_id[1]].dropna().values]),
                     (*[getattr(Query_Tbl, set_id[2]).like(z)
                        for z in df[set_id[2]].dropna().values])))
Where set_id is a list of three strings corresponding to column1, column2 and column3, so that I get the designated results. However, this simply produces
(what I'm actually getting):
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQLAlchemy to get the result I want, or would it be better to find a way of implementing column values with Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the dataframe was formatted, and I was reading column names into groups differently. The pattern below works for anyone who wants to process multiple df columns into the same OR statements.
I apologize for the confusion; if anyone has comments or questions on the subject I will gladly help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQLAlchemy's or_() function can be used with a variable column via Python's built-in getattr() function, you only need to create (column, value) pairs, whereby you can unpack both in a loop.
for group in [group_2, group_3]:
    set_id = list(set(df.columns.values) & set(group))
    if len(set_id) > 1:
        set_tuple = list()
        for column in set_id:
            for value in df[column].dropna().values:
                set_tuple.append((column, value))
        print(set_tuple)
        qry = qry.filter(or_(*[getattr(Query_Tbl, id).like(x) for id, x in set_tuple]))
        df = df.drop(group, axis=1)
If you know which columns need to be grouped in the or_() statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples with the (column, value) pairs you need. Then, within the or_() call, you unpack the columns and values in a loop and assign them accordingly. The code is much easier to read and much more compact. I found this to be a more robust solution than explicitly writing out cases for the group sizes.
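Equivalently, a sketch under the same assumptions (the hypothetical df, set_id and Query_Tbl above): the intermediate pair list can be folded into a single nested comprehension:
conditions = [
    getattr(Query_Tbl, col).like(val)   # tbl.<col> LIKE <val>
    for col in set_id                   # every column in the group...
    for val in df[col].dropna().values  # ...paired with each non-null value
]
qry = qry.filter(or_(*conditions))      # one OR across all (column, value) pairs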

How to unpack result of sub-query into list-type field to result of original query in peewee?

How can I make peewee put the ids of related table rows into an additional list-like field in the resulting query?
I want to make a duplicate-detecting manager for media files. For each file on my PC I have a record in the database with fields like
File name, Size, Path, SHA3-512, Perceptual hash, Tags, Comment, Date added, Date changed, etc...
Depending on the situation I want to use different patterns to be used to consider records in table as duplicates.
In the simplest case I just want to see all records having the same hash, so I do:
subq = Record.select(Record.SHA).group_by(Record.SHA).having(peewee.fn.Count() > 1)
subq = subq.alias('jq')
q = Record.select().join(subq, on=(Record.SHA == subq.c.SHA)).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)
and everything is fine.
But there are lots of cases where I want to use different sets of table columns as grouping patterns. In the worst case I use all of them except id and the "Date added" column, to detect exactly duplicated rows in the database from when I have re-added the same file a few times, which leads to a monster like:
subq = Record.select(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags)\
    .group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags)\
    .having(peewee.fn.Count() > 1)
subq = subq.alias('jq')
q = Record.select().join(subq, on=(
    (Record.SHA == subq.c.SHA) & (Record.Name == subq.c.Name) &
    (Record.Date == subq.c.Date) & (Record.Size == subq.c.Size) &
    (Record.Tags == subq.c.Tags))).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)
and this is not the full list of my fields, just an example.
I have to do the same for other patterns of field sets, i.e. duplicate the field list three times: in the subquery's SELECT clause, in its GROUP BY clause, and then again in the join condition.
I wish I could just group the records with the appropriate pattern and have peewee list the ids of all the members of each group in a new list-like field, like:
q = Record.select(Record, SOME_MAGIC.alias('duplicates'))\
    .group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags)\
    .having(peewee.fn.Count() > 1).SOME_ANOTHER_MAGIC
for r in q:
    process_group_of_records(r)  # r.duplicates == [23, 44, 45, 56, 100], for example
How can I do this? Listing the same parameters three times really feels like I'm doing something wrong.
You can use GROUP_CONCAT (or, for Postgres, array_agg) to group and concatenate a list of ids, filenames, whatever.
So for files with the same hash:
query = (Record
         .select(Record.sha, fn.GROUP_CONCAT(Record.id).alias('id_list'))
         .group_by(Record.sha)
         .having(fn.COUNT(Record.id) > 1))
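As a sketch of consuming the result (assuming integer ids and GROUP_CONCAT's default comma separator):
for row in query:
    dupe_ids = [int(i) for i in row.id_list.split(',')]  # e.g. [23, 44, 45, 56, 100]
    print(row.sha, dupe_ids)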
This is a relational database, so you're dealing all the time, everywhere, with tables consisting of rows and columns. There's no "nesting"; GROUP_CONCAT is about as close as you can get.
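And to avoid listing the same fields three times, one hedged sketch is to keep the grouping fields in a list and unpack it in both places (assuming the field names from the question):
fields = [Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags]
query = (Record
         .select(*fields, fn.GROUP_CONCAT(Record.id).alias('id_list'))
         .group_by(*fields)
         .having(fn.COUNT(Record.id) > 1))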

Using function output in SQLAlchemy join clause

I am trying to translate a fairly short bit of SQL into a SQLAlchemy ORM query. The SQL uses Postgres's generate_series to make a set of dates, and my goal is to build a set of time-series arrays categorized by one of the columns.
The tables (simplified) are very simple:
counts:
-----------------
count (Integer)
day (Date)
placeID (foreign key related to places)
"counts_pkey" PRIMARY KEY (day, placeID)
places:
-----------------
id
name (varchar)
The output I'm after is a time series of counts for each place including null values when counts are not reported for a day. For example, this would correspond to a series over four days:
array_agg | name
-----------------+-------------------
{NULL,0,7,NULL} | A Place
{NULL,1,NULL,2} | Some other place
{5,NULL,3,NULL} | Yet another
I can do this fairly easily by taking a CROSS JOIN on a date range and places and joining that with the counts:
SELECT array_agg(counts.count), places.name
FROM generate_series('2018-11-01', '2018-11-04', interval '1 days') as day
CROSS JOIN places
LEFT OUTER JOIN counts on counts.day = day.day AND counts.PlaceID = places.id
GROUP BY places.name;
What I can't seem to figure out is how to get SQLAlchemy to do this. After a lot of digging, I found an old Google Groups thread which almost works, leading to this:
date_list = select([column('generate_series')])\
    .select_from(func.generate_series(backthen, today, '1 day'))\
    .alias('date_list')

time_series = db.session.query(Place.name, func.array_agg(Count.count))\
    .select_from(date_list)\
    .outerjoin(Count, (Count.day == date_list.c.generate_series) &
                      (Count.placeID == Place.id))\
    .group_by(Place.name)
This creates a sub-select for the time series, but it produces a database error:
There is an entry for table "places", but it cannot be referenced from this part of the query.
So my question is: how would you do this in SQLAlchemy? Also, I'm open to the idea that this is difficult because my approach to the SQL is bone-headed.
The problem is that given the query construct SQLAlchemy produces a query along the lines of
SELECT ...
FROM places,
(...) AS date_list LEFT OUTER JOIN count ON ... AND count."placeID" = places.id
...
There are 2 FROM-list items: places and the join. Items cannot cross-reference each other [1], and hence the error due to places.id in the ON-clause.
SQLAlchemy does not support explicit CROSS JOIN, but on the other hand a CROSS JOIN is equivalent to an INNER JOIN ON (TRUE). You could also omit wrapping the function expression in a subquery and use it as is by giving it an alias:
from sqlalchemy import column, func, true

date_list = func.generate_series(backthen, today, '1 day').alias('gen_day')
time_series = session.query(Place.name, func.array_agg(Count.count))\
    .join(date_list, true())\
    .outerjoin(Count, (Count.day == column('gen_day')) &
                      (Count.placeID == Place.id))\
    .group_by(Place.name)
[1]: Except function-call FROM-items, or using LATERAL.

Why does SQLite3 not yield an error

I am quite new to SQL, but I am trying to bugfix the output of an SQL query. However, this question does not concern the bug itself, but rather why SQLite3 does not yield an error when it should.
I have a query string that looks like:
QueryString = ("SELECT e.event_id, "
"count(e.event_id), "
"e.state, "
"MIN(e.boot_time) AS boot_time, "
"e.time_occurred, "
"COALESCE(e.info, 0) AS info "
"FROM events AS e "
"JOIN leg ON leg.id = e.leg_id "
"GROUP BY e.event_id "
"ORDER BY leg.num_leg DESC, "
"e.event_id ASC;\n"
)
This yields an output with no errors.
What I don't understand is why there is no error: I GROUP BY e.event_id, yet e.state and e.time_occurred are neither wrapped in aggregate functions nor part of the GROUP BY clause.
e.state is a string column. e.time_occurred is an integer column.
I am using the QueryString in Python.
In a misguided attempt to be compatible with MySQL, this is allowed. (The non-aggregated column values come from some random row in the group.)
Since SQLite 3.7.11, using min() or max() guarantees that the values in the non-aggregated columns come from the row that has the minimum/maximum value in the group.
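A minimal sqlite3 sketch of that guarantee (with a hypothetical table t):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (a, b, c);"
                   "INSERT INTO t VALUES (1, 'x', 10), (1, 'y', 20);")
# "b" is a bare column: paired with MAX(c), its value comes from the max row
print(conn.execute("SELECT a, b, MAX(c) FROM t GROUP BY a").fetchall())
# -> [(1, 'y', 20)]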
SQLite and MySQL allow bare columns in an aggregation query. This is explained in the documentation:
In the query above, the "a" column is part of the GROUP BY clause and
so each row of the output contains one of the distinct values for "a".
The "c" column is contained within the sum() aggregate function and so
that output column is the sum of all "c" values in rows that have the
same value for "a". But what is the result of the bare column "b"? The
answer is that the "b" result will be the value for "b" in one of the
input rows that form the aggregate. The problem is that you usually do
not know which input row is used to compute "b", and so in many cases
the value for "b" is undefined.
Your particular query is:
SELECT e.event_id, count(e.event_id), e.state, MIN(e.boot_time) AS boot_time,
       e.time_occurred, COALESCE(e.info, 0) AS info
FROM events AS e
JOIN leg ON leg.id = e.leg_id
GROUP BY e.event_id
ORDER BY leg.num_leg DESC, e.event_id ASC;
If e.event_id is the primary key in events, then this syntax is even supported by the ANSI standard, because event_id is sufficient to uniquely define the other columns in a row in events.
If e.event_id is a PRIMARY KEY or UNIQUE key of the table, then e.time_occurred is called "functionally dependent" and would not even throw an error in other SQL-compliant DBMSs.
However, SQLite has not implemented functional dependency. In the case of SQLite (and MySQL) no error is thrown even for columns that are not functionally dependent on the GROUP BY columns.
SQLite (and MySQL) simply select an arbitrary row from the result set to fill the (in SQLite lingo) "bare column"; see this.
