Using function output in SQLAlchemy join clause - python

I am trying to translate a fairly short bit of SQL into a SQLAlchemy ORM query. The SQL uses Postgres's generate_series to make a set of dates, and my goal is to produce a set of time-series arrays categorized by one of the columns.
The tables (simplified) are very simple:
counts:
-----------------
count (Integer)
day (Date)
placeID (foreign key related to places)
"counts_pkey" PRIMARY KEY (day, placeID)

places:
-----------------
id
name (varchar)
The output I'm after is a time series of counts for each place including null values when counts are not reported for a day. For example, this would correspond to a series over four days:
array_agg | name
-----------------+-------------------
{NULL,0,7,NULL} | A Place
{NULL,1,NULL,2} | Some other place
{5,NULL,3,NULL} | Yet another
I can do this fairly easily by taking a CROSS JOIN on a date range and places and joining that with the counts:
SELECT array_agg(counts.count), places.name
FROM generate_series('2018-11-01', '2018-11-04', interval '1 days') as day
CROSS JOIN places
LEFT OUTER JOIN counts ON counts.day = day.day AND counts."placeID" = places.id
GROUP BY places.name;
What I can't seem to figure out is how to get SQLAlchemy to do this. After a lot of digging, I found an old Google Groups thread with an approach that almost works, leading to this:
date_list = select([column('generate_series')])\
    .select_from(func.generate_series(backthen, today, '1 day'))\
    .alias('date_list')

time_series = db.session.query(Place.name, func.array_agg(Count.count))\
    .select_from(date_list)\
    .outerjoin(Count, (Count.day == date_list.c.generate_series) &
                      (Count.placeID == Place.id))\
    .group_by(Place.name)
This creates a sub-select for the time series, but it produces a database error:
There is an entry for table "places", but it cannot be referenced from this part of the query.
So my question is: how would you do this in SQLAlchemy? Also, I'm open to the idea that this is difficult because my approach to the SQL is bone-headed.

The problem is that, given the query construct, SQLAlchemy produces a query along the lines of
SELECT ...
FROM places,
(...) AS date_list LEFT OUTER JOIN count ON ... AND count."placeID" = places.id
...
There are two FROM-list items: places and the join. FROM-list items cannot cross-reference each other [1], hence the error caused by places.id in the ON clause.
SQLAlchemy does not support an explicit CROSS JOIN, but a CROSS JOIN is equivalent to an INNER JOIN ON (TRUE). You can also skip wrapping the function expression in a subquery and instead use it as is by giving it an alias:
from sqlalchemy import column, func, true

date_list = func.generate_series(backthen, today, '1 day').alias('gen_day')

time_series = session.query(Place.name, func.array_agg(Count.count))\
    .join(date_list, true())\
    .outerjoin(Count, (Count.day == column('gen_day')) &
                      (Count.placeID == Place.id))\
    .group_by(Place.name)
[1]: Except function-call FROM-items, or when using LATERAL.
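To check what SQL the construct actually emits, you can compile it against the PostgreSQL dialect. A minimal sketch, assuming the time_series query built above:
# Hedged sketch: render the query's SQL for inspection; literal_binds inlines
# the bound parameters so the output is copy-pasteable into psql.
from sqlalchemy.dialects import postgresql

print(time_series.statement.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={"literal_binds": True},
))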


SQLAlchemy - column must appear in the GROUP BY clause or be used in an aggregate function

I am using SQLAlchemy with PostgreSQL.
Tables:
Shape[id, name, timestamp, user_id]  # user_id refers to the id column in the User table
User[id, name]
This query:
query1 = self.session.query(Shape.timestamp, Shape.name, User.id,
                            extract('day', Shape.timestamp).label('day'),
                            extract('hour', Shape.timestamp).label('hour'),
                            func.count(Shape.id).label("total"),
                            )\
    .join(User, User.id == Shape.user_id)\
    .filter(Shape.timestamp.between(from_datetime, to_datetime))\
    .group_by(Shape.user_id)\
    .group_by('hour')\
    .all()
This works well with sqlite3 + SQLAlchemy, but not with PostgreSQL + SQLAlchemy.
I get this error: (psycopg2.errors.GroupingError) column "Shape.timestamp" must appear in the GROUP BY clause or be used in an aggregate function
I need to group only by the user_id and the hour of the timestamp, where Shape.timestamp is a Python DateTime object.
But the error says to add Shape.timestamp to the group_by as well.
If I add Shape.timestamp to the group_by, then it shows all the records.
If I need to apply a function to some columns, how do I get the actual data of the other columns? Is there any way to get a column's data as-is without adding it to group_by or wrapping it in some function?
How do I solve this?
This is a basic SQL issue: what if there are several different timestamp values within one group?
You either need to apply an aggregate function (COUNT, MIN, MAX, AVG) to the column or list it in your GROUP BY.
NB: SQLite allows ungrouped columns alongside GROUP BY, in which case each one "is evaluated against a single arbitrarily chosen row from within the group" (SQLite docs, section 2.4).
Try changing the query so that every selected column is either aggregated or listed in group_by:
query1 = self.session.query(User.id,
                            extract('hour', Shape.timestamp).label('hour'),
                            func.count(Shape.id).label("total"),
                            )\
    .join(User, User.id == Shape.user_id)\
    .filter(Shape.timestamp.between(from_datetime, to_datetime))\
    .group_by(User.id)\
    .group_by(extract('hour', Shape.timestamp))\
    .all()

"Rowtime attributes must not be in the input rows of a regular join" despite using interval join, but only with event timestamp

Example code:
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env_settings = (
    EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
)
table_env = StreamTableEnvironment.create(environment_settings=env_settings)

table_env.execute_sql(
    """
    CREATE TABLE table1 (
        id INT,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector.type' = 'filesystem',
        'format.type' = 'csv',
        'connector.path' = '/home/alex/work/test-flink/data1.csv'
    )
    """
)
table_env.execute_sql(
    """
    CREATE TABLE table2 (
        id2 INT,
        ts2 TIMESTAMP(3),
        WATERMARK FOR ts2 AS ts2 - INTERVAL '5' SECOND
    ) WITH (
        'connector.type' = 'filesystem',
        'format.type' = 'csv',
        'connector.path' = '/home/alex/work/test-flink/data2.csv'
    )
    """
)

table1 = table_env.from_path("table1")
table2 = table_env.from_path("table2")

print(table1.join(table2).where("ts = ts2 && id = id2").select("id, ts").to_pandas())
Gives an error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.flink.table.runtime.arrow.ArrowUtils.collectAsPandasDataFrame.
: org.apache.flink.table.api.TableException: Cannot generate a valid execution plan for the given query:

FlinkLogicalLegacySink(name=[collect], fields=[id, ts])
+- FlinkLogicalCalc(select=[id, ts])
   +- FlinkLogicalJoin(condition=[AND(=($2, $5), =($0, $3))], joinType=[inner])
      :- FlinkLogicalCalc(select=[id, ts, CAST(ts) AS ts0])
      :  +- FlinkLogicalWatermarkAssigner(rowtime=[ts], watermark=[-($1, 5000:INTERVAL SECOND)])
      :     +- FlinkLogicalLegacyTableSourceScan(table=[[default_catalog, default_database, table1, source: [CsvTableSource(read fields: id, ts)]]], fields=[id, ts])
      +- FlinkLogicalCalc(select=[id2, ts2, CAST(ts2) AS ts20])
         +- FlinkLogicalWatermarkAssigner(rowtime=[ts2], watermark=[-($1, 5000:INTERVAL SECOND)])
            +- FlinkLogicalLegacyTableSourceScan(table=[[default_catalog, default_database, table2, source: [CsvTableSource(read fields: id2, ts2)]]], fields=[id2, ts2])
Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.
This seems different from other similar questions such as this one because I have followed the instructions in the docs and specified both an equi-join and a time interval join (ts = ts2 && id = id2):
An interval join requires at least one equi-join predicate and a join
condition that bounds the time on both sides. Such a condition can be
defined by two appropriate range predicates (<, <=, >=, >) or a single
equality predicate that compares time attributes of the same type
(i.e., processing time or event time) of both input tables.
For example, the following predicates are valid interval join
conditions:
ltime = rtime
If the problem is that these are not append-only tables, I don't know how to make them so.
Setting the time characteristic doesn't help:
StreamExecutionEnvironment.get_execution_environment().set_stream_time_characteristic(
    TimeCharacteristic.EventTime
)
If I use processing time instead with ts AS PROCTIME() then the query succeeds. But I think I need to use event time and I don't understand why there's this difference.
Joins between two regular tables in SQL are always expressed in the same way using FROM a, b or a JOIN b.
However, Flink provides two types of join operators under the hood for the same syntax. One is an interval join which requires time attributes to relate both tables with each other based on time. And one is the regular SQL join that is implemented in a generic way as you know it from databases.
Interval joins are just a streaming optimization to keep the state size low at runtime and to produce no updates in the result. The regular SQL join operator can in the end produce the same result as an interval join, but at higher maintenance cost.
In order to distinguish between an interval join and a regular join, the optimizer searches the WHERE clause for a predicate that works on time attributes. For an interval join, the output can always contain two rowtime attributes for downstream temporal operations, because both rowtime attributes remain aligned with the underlying watermarking system. This means that, for example, an outer window or another interval join could work with the time attribute again.
However, the implementation of interval joins has some known shortcomings, covered in FLINK-10211. Due to the bad design, we cannot distinguish between an interval join and a regular join at certain locations. Thus, we need to assume that a regular join could be an interval join and cannot automatically cast the time attribute to TIMESTAMP for users. Instead, we currently forbid time attributes in the output of regular joins.
At some point this limitation will hopefully be gone; until then a user has two possibilities:
Don't use a regular join on tables that contain a time attribute. You can project the attribute away with a nested SELECT clause or CAST it before joining.
Cast the time attribute to a regular timestamp using CAST(col AS TIMESTAMP) in the SELECT clause. It will be pushed down into the join operation.
Your exception indicates that you are using a regular join. Interval joins need a range to operate on (even if it is only 1 ms); they do not support plain equality.
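Applied to the question's tables, the second workaround might look like the following. This is a hedged sketch, not a verified fix: it casts the rowtime attributes to plain timestamps in nested SELECTs before the regular join, reusing the table_env and column names from the question.
# Hedged sketch of workaround 2: CAST the rowtime attributes before joining,
# so the regular join no longer sees time attributes in its inputs.
result = table_env.sql_query(
    """
    SELECT t1.id, t1.ts
    FROM (SELECT id, CAST(ts AS TIMESTAMP(3)) AS ts FROM table1) t1
    JOIN (SELECT id2, CAST(ts2 AS TIMESTAMP(3)) AS ts2 FROM table2) t2
    ON t1.ts = t2.ts2 AND t1.id = t2.id2
    """
)
print(result.to_pandas())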

Loop SQL Query Over Dictionary

I have a SQL query that uses the first and last days of the calendar months to generate a subset of data for a given month. I have been trying to figure out how to loop it over a number of months and store all results in one dataframe. I have two lists (one of first days, another of last days), two tuples (same), and a dictionary (first days as keys, last days as values) with all these dates, and I am struggling badly.
I can loop and get all the data if I use only one list or tuple. If I try to use two, it simply does not work. Is there a way to do what I am trying to do?
fd = ['2018-05-01', '2018-06-01', '2018-07-01']
ld = ['2018-05-31', '2018-06-30', '2018-07-31']
my_dict = dict(zip(fd, ld))
data_check = pd.DataFrame()
fd_d = ','.join(my_dict.keys())
ed_d = ','.join(['%%(%s)s' % x for x in my_dict])
query = """
SELECT count(distinct ids), first_date, last_date from table1
where first_date=%s and last_date =%s
group by 2,3
"""
for x in my_dict:
    df = pd.read_sql(query % (fd_d, ed_d), my_dict)
    data_check = data_check.append(df)
In general, please heed three best practices:
Avoid the quadratic copying of calling DataFrame.append in a loop. Instead, build a list of data frames to be concatenated once outside the loop.
Use parameterization, which pandas' read_sql supports, rather than string formatting. This avoids the need to format values and punctuate them with quotes.
Stop using the modulo operator, %, for string formatting, as it is de-emphasised (though not officially deprecated). Use the superior str.format instead.
Specifically, for your needs, iterate elementwise over the two lists using zip, without layering them into a dictionary:
query= """SELECT count(distinct ids), first_date, last_date
FROM table1
WHERE first_date = %s and last_date = %s
GROUP BY 2, 3"""
df_list = []
for f, l in zip(fd, ld):
df = pd.read_sql(query, conn, params=[f, l])
df_list.append(df)
final_df = pd.concat(df_list)
Alternatively, avoid the loop and the parameters altogether by aggregating on the first and last days of every month in the table:
query= """SELECT count(distinct ids), first_date, last_date
FROM table1
WHERE DATE_PART(d, first_date) = 1
AND last_date = LAST_DAY(first_date)
GROUP BY 2, 3
ORDER BY 2, 3"""
final_df = pd.read_sql(query, conn)
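Note that LAST_DAY exists in MySQL and Oracle but not in PostgreSQL. If the backend is PostgreSQL, a hedged equivalent computes the month boundaries with date_trunc (same hypothetical table1 and conn as above):
# Hedged PostgreSQL variant: derive the first and last day of each month
# with date_trunc instead of LAST_DAY.
pg_query = """SELECT count(distinct ids), first_date, last_date
              FROM table1
              WHERE first_date = date_trunc('month', first_date)::date
                AND last_date = (date_trunc('month', first_date)
                                 + interval '1 month' - interval '1 day')::date
              GROUP BY 2, 3
              ORDER BY 2, 3"""

final_df = pd.read_sql(pg_query, conn)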

Returning ranked search results using gin index with sqlalchemy

I have a GIN index set up for full text search. I would like to get a list of records that match a search query, ordered by rank (how well the record matched the search query). For the result, I only need the record and its columns, I do not need the actual rank value that was used for ordering.
I have the following query, which runs fine and returns the expected results from my postgresql db.
SELECT *, ts_rank('{0.1,0.1,0.1,0.1}', users.textsearchable_index_col, to_tsquery('smit:* | ji:*')) AS rank
FROM users
WHERE users.authentication_method != 2
  AND users.textsearchable_index_col @@ to_tsquery('smit:* | ji:*')
ORDER BY rank DESC;
I would like to perform this query using SQLAlchemy (SA). I understand that ts_rank does not come ready to use in SA. I have tried a number of things, such as
proxy = self.db_session.query(User, text(
        """ts_rank('{0.1,0.1,0.1,0.1}', users.textsearchable_index_col, to_tsquery(:search_str1)) as rank""")). \
    filter(User.authentication_method != 2,
           text("""users.textsearchable_index_col @@ to_tsquery(:search_str2)""")). \
    params(search_str1=search, search_str2=search). \
    order_by("rank")
and I have also read about using a column property, although I'm not sure if or how I would use that in a solution.
I would appreciate a nudge in the right direction.
You can use SQL functions in your queries via SQLAlchemy's func:
from sqlalchemy.sql.expression import desc, func

(db.session.query(User,
                  func.ts_rank('{0.1,0.1,0.1,0.1}',
                               User.textsearchable_index_col,
                               func.to_tsquery('smit:* | ji:*')).label('rank'))
    .filter(User.authentication_method != 2)
    .filter(User.textsearchable_index_col.op('@@')(func.to_tsquery('smit:* | ji:*')))
    .order_by(desc('rank'))
    ).all()
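As a side note, SQLAlchemy's match() operator renders a PostgreSQL @@ full-text comparison and can replace the raw .op('@@') call. Whether it wraps the right-hand side in to_tsquery or plainto_tsquery depends on the SQLAlchemy version, so treat this as a sketch:
# Hedged alternative: match() compiles to `col @@ to_tsquery(...)` on
# PostgreSQL in SQLAlchemy 1.x (plainto_tsquery in 2.0).
results = (db.session.query(User)
           .filter(User.authentication_method != 2)
           .filter(User.textsearchable_index_col.match('smit:* | ji:*'))
           .all())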

sqlalchemy joined alias doesn't have columns from both tables

All I want is the count from TableA grouped by a column from TableB, but of course I need the item from TableB that each count is associated with. Better explained with code:
TableA and TableB are Model objects.
I'm trying to follow this syntax as best I can.
Trying to run this query:
sq = session.query(TableA).join(TableB).\
    group_by(TableB.attrB).subquery()

countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)

print(session.query(countA, groupB).all())
But it gives me an AttributeError (sq does not have attrB)
I'm new to SA and I find it difficult to learn. (links to recommended educational resources welcome!)
When you make a subquery out of a select statement, the columns that can be accessed from it must be in the columns clause. Take for example a statement like:
select x, y from mytable where z=5
If we wanted to make a subquery of that and then GROUP BY 'z', this would not be legal SQL:
select * from (select x, y from mytable where z=5) as mysubquery group by mysubquery.z
Because 'z' is not in the columns clause of "mysubquery" (it's also illegal since 'x' and 'y' would need to be in the GROUP BY as well, but that's a separate issue).
SQLAlchemy works the same exact way. When you say query(..).subquery(), or use the alias() function on a core selectable construct, it means you're wrapping your SELECT statement in parenthesis, giving it a (usually generated) name, and giving it a new .c. collection that has only those columns that are in the "columns" clause, just like real SQL.
So here you'd need to ensure that TableB, or at least the column you're dealing with externally, is available. You can also limit the columns clause to just the columns you need:
sq = session.query(TableA.attrA, TableB.attrB).join(TableB).\
    group_by(TableB.attrB).subquery()

countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)

print(session.query(countA, groupB).all())
Note that the above query probably only works on MySQL, as in general SQL it's illegal to reference any columns that aren't part of an aggregate function, or part of the GROUP BY, when grouping is used. MySQL has a more relaxed (and sloppy) system in this regard.
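If you only need the counts per attrB, a portable rewrite groups directly in the outer query and avoids both FIRST() and the loose grouping. A sketch assuming the same TableA/TableB models:
# Hedged sketch: standards-compliant grouping without a subquery;
# every selected column is either grouped or aggregated.
q = (session.query(TableB.attrB, func.count(TableA.attrA))
     .select_from(TableA)
     .join(TableB)
     .group_by(TableB.attrB))
print(q.all())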
edit: if you want rows for every letter, including those with zero counts:
import collections

letter_count = collections.defaultdict(int)
for count, letter in session.query(func.count(MyClass.id), MyClass.attr).group_by(MyClass.attr):
    letter_count[letter] = count

for letter in ["A", "B", "C", "D", "E", ...]:
    print("Letter %s has %d elements" % (letter, letter_count[letter]))
Note that letter_count[someletter] defaults to zero if not otherwise populated.
