SQLAlchemy counting subquery result - python

I have an SQL query which I want to convert to use the ORM but I cannot get the ORM to count the results from the subquery.
So my working SQL is:
select FOO, BAR, TOTALCOUNT
from (
    select FOO, BAR,
           COUNT(BAR) OVER (PARTITION BY FOO) AS TOTALCOUNT
    from (
        SELECT DISTINCT [FOO], [BAR]
        FROM [database].[dbo].[table]
    ) m
) m
WHERE TOTALCOUNT > 10
I have tried to create the equivalent code using the ORM, but my final result has just 1's for the final count. The code I have tried is below:
subs = session.query(table.FOO, table.BAR).filter(
    table.date > datetime.now() - timedelta(days=10),
).distinct().subquery()
result = pd.read_sql(
    session.query(
        subs.c.FOO,
        subs.c.BAR,
        func.count(subs.c.BAR).label('TOTALCOUNT'),
    ).group_by(subs.c.FOO, subs.c.BAR).statement,
    session.bind,
)
I have also tried to do it in one query with:
result = pd.read_sql(
    session.query(
        table.FOO,
        table.BAR,
        func.count(table.BAR).label("TOTALCOUNT"),
    ).filter(
        table.date > datetime.now() - timedelta(days=30),
    ).group_by(table.FOO).order_by(table.FOO).distinct().statement,
    session.bind,
)
But that is counting the rows before applying the distinct operator, so the count is incorrect. I would really appreciate it if someone could assist me or tell me where I am going wrong; I have googled all morning and can't seem to find an answer.

Ahh, I'm an idiot; I should pay more attention to what I am doing. I added the alias and then removed an additional column I was grouping by. However, should anyone else ever struggle with something similar, here is the working code.
subs = session.query(table.FOO, table.BAR).filter(
    table.date > datetime.now() - timedelta(days=10),
).distinct().subquery().alias('subs')
result = pd.read_sql(
    session.query(
        subs.c.FOO,
        func.count(subs.c.BAR).label('TOTALCOUNT'),
    ).group_by(subs.c.FOO).statement,
    session.bind,
)
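Incidentally, the PARTITION BY form of the original SQL can be expressed directly in SQLAlchemy with over(), which keeps BAR in the output without grouping it away. A minimal sketch using Core against an in-memory SQLite table (table and column names are illustrative, not the asker's real schema):

```python
from sqlalchemy import (Column, MetaData, String, Table,
                        create_engine, func, select)

engine = create_engine("sqlite://")
meta = MetaData()
t = Table("t", meta, Column("foo", String), Column("bar", String))
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(t.insert(), [
        {"foo": "a", "bar": "x"}, {"foo": "a", "bar": "x"},  # duplicate row
        {"foo": "a", "bar": "y"}, {"foo": "b", "bar": "z"},
    ])

    # distinct first, then count per FOO with a window function
    distinct_rows = select(t.c.foo, t.c.bar).distinct().subquery()
    counted = select(
        distinct_rows.c.foo,
        distinct_rows.c.bar,
        func.count(distinct_rows.c.bar)
            .over(partition_by=distinct_rows.c.foo)
            .label("TOTALCOUNT"),
    ).subquery()
    rows = conn.execute(
        select(counted).where(counted.c.TOTALCOUNT > 1)).fetchall()
    # 'a' has two distinct (foo, bar) pairs, so both of its rows survive
```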

Related

Confusing SQLAlchemy conversion of simple subquery

I've been wrestling with what should be a simple conversion of a straightforward SQL query into an SQLAlchemy expression, and I just cannot get the subquery to line up the way I intend. This is a single-table query of a "Comments" table; I want to find which users have made the most first comments:
SELECT user_id, count(*) AS count
FROM comments c
WHERE c.date = (SELECT MIN(c2.date)
                FROM comments c2
                WHERE c2.post_id = c.post_id)
GROUP BY user_id
ORDER BY count DESC
LIMIT 20;
I don't know how to write the subquery so that it refers to the outer query, and if I did, I wouldn't know how to assemble this into the outer query itself. (Using MySQL, which shouldn't matter.)
Well, after giving up for a while and then looking back at it, I came up with something that works. I'm sure there's a better way, but:
c2 = aliased(Comment)
firstdate = select([func.min(c2.date)]).\
    where(c2.post_id == Comment.post_id).\
    as_scalar()  # or scalar_subquery(), in SQLA 1.4
users = session.query(
        Comment.user_id, func.count('*').label('count')).\
    filter(Comment.date == firstdate).\
    group_by(Comment.user_id).\
    order_by(desc('count')).\
    limit(20)
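For comparison, here is the same shape in 1.4-style code with scalar_subquery(), as a self-contained Core sketch against an in-memory SQLite database (the Comments schema below is invented for illustration):

```python
from sqlalchemy import (Column, Integer, MetaData, Table, create_engine,
                        desc, func, select)

engine = create_engine("sqlite://")
meta = MetaData()
comments = Table("comments", meta,
                 Column("user_id", Integer),
                 Column("post_id", Integer),
                 Column("date", Integer))
meta.create_all(engine)

c2 = comments.alias("c2")
# correlated scalar subquery: earliest comment date for the outer row's post
firstdate = (select(func.min(c2.c.date))
             .where(c2.c.post_id == comments.c.post_id)
             .scalar_subquery())
q = (select(comments.c.user_id, func.count().label("count"))
     .where(comments.c.date == firstdate)
     .group_by(comments.c.user_id)
     .order_by(desc("count"))
     .limit(20))

with engine.begin() as conn:
    conn.execute(comments.insert(), [
        {"user_id": 1, "post_id": 10, "date": 1},
        {"user_id": 2, "post_id": 10, "date": 2},
        {"user_id": 1, "post_id": 20, "date": 5},
        {"user_id": 2, "post_id": 30, "date": 3},
    ])
    rows = conn.execute(q).fetchall()
    # user 1 made the first comment on posts 10 and 20; user 2 on post 30
```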

Django SQL Windows Function with Group By

I have this MariaDB Query
SELECT
    DAYOFWEEK(date) AS `week_day`,
    SUM(revenue)/SUM(SUM(revenue)) OVER () AS `rev_share`
FROM orders
GROUP BY DAYOFWEEK(completed)
It will result in a table which shows the revenue share.
My goal is to achieve the same with the help of Django's ORM layer.
I tried the following by using RawSQL:
share = Orders.objects.values(week_day=ExtractIsoWeekDay('date')) \
.annotate(revenue_share=RawSQL('SUM(revenue)/SUM(SUM(revenue)) over ()'))
This results in a single value without a group by. The query which is executed:
SELECT
WEEKDAY(`orders`.`date`) + 1 AS `week_day`,
(SUM(revenue)/SUM(SUM(revenue)) over ()) AS `revenue_share`
FROM `orders`
And also this by using the Window function:
share = Orders.objects.values(week_day=ExtractIsoWeekDay('date')) \
.annotate(revenue_share=Sum('revenue')/Window(Sum('revenue')))
Which results into the following query
SELECT
WEEKDAY(`order`.`date`) + 1 AS `week_day`,
(SUM(`order`.`revenue`) / SUM(`order`.`revenue`) OVER ()) AS `rev_share`
FROM `order` GROUP BY WEEKDAY(`order`.`date`) + 1 ORDER BY NULL
But the data is completely wrong. It looks like the Window is not using the whole table.
Thanks for your help in advance.
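The thread ends without an answer. One workaround, and this is my sketch rather than anything from the thread, is to let the ORM return only the per-weekday sums (a plain values('week_day').annotate(Sum('revenue')) query) and derive the share in Python, which sidesteps the window-plus-GROUP-BY interaction entirely:

```python
# stand-in for the rows a values('week_day').annotate(total=Sum('revenue'))
# query would return; the numbers are invented
grouped = [
    {"week_day": 1, "total": 150.0},
    {"week_day": 2, "total": 25.0},
    {"week_day": 3, "total": 25.0},
]

# one pass over the (at most 7) grouped rows computes every share
overall = sum(row["total"] for row in grouped)
rev_share = {row["week_day"]: row["total"] / overall for row in grouped}
# weekday 1 carries 150/200 = 75% of revenue
```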

Fast complex SQL queries on PostgreSQL database with Python

I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially: each record of the 50M needs to be fetched separately, then another query (in the example below) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total), and the complex query (that returns a calculated value; example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
    FROM (
        SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
        FROM tt_tweets tw1
        WHERE 1=1
          AND tw1.in_reply_to_id > 0
          AND tw1.is_retweet = False
          AND tw1.did_resolve = False
          AND NOT EXISTS (SELECT * FROM tweets nx
                          WHERE nx.id = tw1.in_reply_to_id)
          AND NOT EXISTS (SELECT * FROM tt_tweets nx
                          WHERE nx.id = tw1.in_reply_to_id)
        UNION ALL
        SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
        FROM tweets tw2
        WHERE 1=1
          AND tw2.in_reply_to_id > 0
          AND tw2.is_retweet = False
          AND tw2.did_resolve = False
          AND NOT EXISTS (SELECT * FROM tweets nx
                          WHERE nx.id = tw2.in_reply_to_id)
          AND NOT EXISTS (SELECT * FROM tt_tweets nx
                          WHERE nx.id = tw2.in_reply_to_id)
        -- ORDER BY tw2.created_at DESC
    ) twx
    LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min'::interval
    # -- AND tw.created_at >= now() - '72 hour'::interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets,))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup is None:
            break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])  # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
you only select fields from the er table
so, the two LEFT JOINed tables could be omitted
if ex and ef do contain multiple matching rows, the result set could be larger than just the rows selected from er, resulting in duplicated er records
there is a GROUP BY present, but no aggregates are in the select list
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it: create what's known as a summary table. This summary table, residing in the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view (link).
I took #wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and total run time dropped to minutes/~1 hour.
To recap:
Database records are no longer fetched in the sequence mentioned earlier
Calculations happen inside the database via a PostgreSQL function
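To make the advice concrete: the row-by-row loop can collapse into a single INSERT ... SELECT so the calculated values never leave the database. A runnable illustration with invented data (SQLite is used here only so the snippet is self-contained; the real system is PostgreSQL, and the redundant GROUP BY from the original query is dropped):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_one   (gg TEXT, x INTEGER, d REAL);
    CREATE TABLE table_two   (x INTEGER, b REAL);
    CREATE TABLE table_three (x INTEGER, f REAL);
    CREATE TABLE results     (gg TEXT, calculated_field REAL);
    INSERT INTO table_one   VALUES ('123', 1, 2.0), ('456', 2, 2.0);
    INSERT INTO table_two   VALUES (1, 3.0), (2, 5.0);
    INSERT INTO table_three VALUES (1, 1.0), (2, 3.0);
""")

# one set-based statement computes and stores every value; no Python loop
conn.execute("""
    INSERT INTO results (gg, calculated_field)
    SELECT CR.gg, (AX.b + BF.f) / CR.d
    FROM table_one CR
    LEFT JOIN table_two   AX ON AX.x = CR.x
    LEFT JOIN table_three BF ON BF.x = CR.x
""")
rows = conn.execute(
    "SELECT gg, calculated_field FROM results ORDER BY gg").fetchall()
# e.g. '123' -> (3.0 + 1.0) / 2.0 = 2.0
```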

peewee select() return SQL query, not the actual data

I'm trying to sum up the values in two columns and truncate my date fields by day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day,
       SUM(fremont_bridge_nb) AS Sum_NB,
       SUM(fremont_bridge_sb) AS Sum_SB
FROM bike_count
GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
Bike_Count.select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
                  fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
                  fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB'))
          .group_by('Day').order_by('Day')
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS
Day, SUM("t1"."fremont_bridge_nb") AS Sum_NB,
SUM("t1"."fremont_bridge_sb") AS Sum_SB FROM "bike_count" AS t1 ORDER
BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group by / order by, Peewee will try to parameterize it as a value. This is to avoid SQL injection haxx.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(day)
         .order_by(day))
Or, if you prefer:
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(SQL('Day'))
         .order_by(SQL('Day')))
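The parameterization behaviour that makes a bare 'Day' string useless can be demonstrated with plain DB-API code: the bound value reaches the database as a constant, so ORDER BY ? sorts every row by the same literal and has no effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bike_count (day TEXT, total INTEGER)")
conn.executemany("INSERT INTO bike_count VALUES (?, ?)",
                 [("b", 2), ("a", 1), ("c", 3)])

# 'day' arrives as a bound value, so this is effectively ORDER BY 'day'
# (a constant) and imposes no ordering at all
param_order = conn.execute(
    "SELECT day FROM bike_count ORDER BY ?", ("day",)).fetchall()

# naming the column in the SQL text itself actually sorts
real_order = conn.execute(
    "SELECT day FROM bike_count ORDER BY day").fetchall()
```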

SQLAlchemy subquery in from clause without join

I need a little help.
I have the following query and I'm curious about how to represent it in terms of sqlalchemy.orm. Currently I'm executing it with session.execute. It's not critical for me, but I'm just curious. The thing that I actually don't know is how to put a subquery in the FROM clause (nested view) without doing any join.
select g_o.group_
from (
    select distinct regexp_split_to_table(g.group_name, E',') group_
    from (
        select array_to_string(groups, ',') group_name
        from company
        where status = 'active'
          and array_to_string(groups, ',') like :term
        limit :limit
    ) g
) g_o
where g_o.group_ like :term
order by 1
limit :limit
I need this subquery because of a speed issue: without the limit in the innermost query, the regexp_split_to_table function starts to parse all the data and applies the limit only after that. But my table is huge and I cannot afford that.
If something is not very clear, please ask; I'll do my best)
I presume this is PostgreSQL.
To create a subquery, use the subquery() method. The resulting object can be used as if it were a Table object. Here's how your query would look in SQLAlchemy:
subq1 = session.query(
    func.array_to_string(Company.groups, ',').label('group_name')
).filter(
    (Company.status == 'active') &
    (func.array_to_string(Company.groups, ',').like(term))
).limit(limit).subquery()
subq2 = session.query(
    func.regexp_split_to_table(subq1.c.group_name, ',')
        .distinct()
        .label('group')
).subquery()
q = session.query(subq2.c.group).\
    filter(subq2.c.group.like(term)).\
    order_by(subq2.c.group).\
    limit(limit)
However, you could avoid one subquery by using the unnest function instead of converting the array to a string with array_to_string and then splitting it with regexp_split_to_table:
subq = session.query(
    func.unnest(Company.groups).label('group')
).filter(
    (Company.status == 'active') &
    (func.array_to_string(Company.groups, ',').like(term))
).limit(limit).subquery()
q = session.query(subq.c.group.distinct()).\
    filter(subq.c.group.like(term)).\
    order_by(subq.c.group).\
    limit(limit)
