Django SQL Window Function with Group By - python

I have this MariaDB query:
SELECT
    DAYOFWEEK(date) AS `week_day`,
    SUM(revenue)/SUM(SUM(revenue)) OVER () AS `rev_share`
FROM orders
GROUP BY DAYOFWEEK(completed)
It results in a table that shows the revenue share per weekday.
My goal is to achieve the same with the help of Django's ORM layer.
I tried the following using RawSQL:
share = Orders.objects.values(week_day=ExtractIsoWeekDay('date')) \
.annotate(revenue_share=RawSQL('SUM(revenue)/SUM(SUM(revenue)) over ()'))
This results in a single value without a GROUP BY. The query that is executed:
SELECT
WEEKDAY(`orders`.`date`) + 1 AS `week_day`,
(SUM(revenue)/SUM(SUM(revenue)) over ()) AS `revenue_share`
FROM `orders`
I also tried this using the Window function:
share = Orders.objects.values(week_day=ExtractIsoWeekDay('date')) \
.annotate(revenue_share=Sum('revenue')/Window(Sum('revenue')))
Which results in the following query:
SELECT
WEEKDAY(`order`.`date`) + 1 AS `week_day`,
(SUM(`order`.`revenue`) / SUM(`order`.`revenue`) OVER ()) AS `rev_share`
FROM `order` GROUP BY WEEKDAY(`order`.`date`) + 1 ORDER BY NULL
But the data is completely wrong. It looks like the window function is not operating over the whole table.
Thanks for your help in advance.
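(For reference, one possible workaround, not taken from the original post: compute the grand total with a separate aggregate() call and divide the per-weekday sums by it. This is a hedged sketch that assumes the Orders model from the question with date and revenue fields; it uses two queries instead of one window function.)

from django.db.models import ExpressionWrapper, FloatField, Sum, Value
from django.db.models.functions import ExtractIsoWeekDay

# Hedged sketch, not the asker's code: fetch the grand total first,
# then compute each weekday's share of it.
total = Orders.objects.aggregate(total=Sum('revenue'))['total'] or 1  # avoid division by zero

share = (
    Orders.objects
    .values(week_day=ExtractIsoWeekDay('date'))
    .annotate(
        revenue_share=ExpressionWrapper(
            Sum('revenue') / Value(float(total)),
            output_field=FloatField(),
        )
    )
)

The trade-off is an extra round trip to the database, but the GROUP BY and the grand total no longer interact, which avoids the window-over-aggregate problem shown above.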

Related

cx_Oracle DatabaseError: ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

I am using Python 3.6.8.
Recently, code I have been using for several months in a Jupyter notebook began raising the above error. The code uses cx_Oracle to connect and run several queries. I have never gotten this error before when running these queries, and they generally completed within a minute.
The code creates a SQL query string in the following form (here I use *** to mean the query is actually much longer, but I have stripped it to the basics):
def function(d1, d2):
    conn = cx_Oracle.connect(dsn="connection")
    sqlstr = f"""select *** from table
                 where date between {d1} and {d2}
              """
    query1 = pd.read_sql(f"""
        select * from
        ({sqlstr})
        """, conn)
    # *several similar queries*
    conn.close()
    return [(query1, query2, ...)]
Based on the testing I have done, I think the error likely comes from the string formatting I am doing, but I am not sure why it has become an issue now, seemingly at random, when the code has been working for quite a while. If I remove the date formatting, the SQL runs perfectly fine, and very quickly, elsewhere; that is why the formatting seems to be the issue.*
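(Side note, not part of the original question: interpolating the dates with an f-string also bypasses bind variables. A hedged sketch of the parameterized form follows; the table and column names are hypothetical stand-ins for the elided parts of the real query.)

import cx_Oracle
import pandas as pd

# Hedged sketch with invented names (some_table / order_date stand in for the
# elided parts of the question's query); dates are passed as bind variables.
def fetch_range(d1, d2):
    conn = cx_Oracle.connect(dsn="connection")
    try:
        return pd.read_sql(
            "select * from some_table where order_date between :d1 and :d2",
            conn,
            params={"d1": d1, "d2": d2},
        )
    finally:
        conn.close()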
EDIT:
Even after editing out the string formatting I was still having issues, but I was able to isolate it down to one single query, which is not able to run as a direct SQL query either, so it is likely not a Python/string issue but a DB issue (my apologies for the misplaced assurance).
The query is:
select column1,column2,sum(column3)
from
(select several variables,Case When ... as column2
from table1 inner join table2 inner join table3
where several conditions
UNION ALL
select several variables, Case When ... as column2
from table1 inner join table2 inner join table3
where several conditions)
group by column1, column2
having (ABS(sum(column3))>=10000 and column1 in ('a','b','c'))
or (column1 not in ('a','b','c'))
order by column1, sum(column3) desc
I assume there must have been some changes on the DB end that made running this somewhat hefty query fail with the above error? Isolating it further, it looks like it is potentially related to grouping by the CASE WHEN variable, column2.
As others have commented, this error has to do with TEMP tablespace sizing and the workload on the DB at the time. Also, when you get TEMP "blowouts", it is possible that a particular SQL execution becomes the "victim" of, rather than the cause of, high TEMP usage.
Sadly, there is no way to predict in advance how much TEMP tablespace a particular workload will require, so getting to the right setting is a process of incrementally increasing TEMP as per your best estimations; TEMP needs to be big enough to handle peak workload conditions.
Having said that, you can use the Active Session History (requires the Diagnostics Pack) to find the high TEMP-consuming SQL and possibly tune it to use less TEMP. Alternatively, you can use/instrument v$sort_usage to find which SQL statements are using the most TEMP. The following query can be used to examine the current TEMP tablespace usage:
with sort as
(
SELECT username, sqladdr, sqlhash, sql_id, tablespace, session_addr
, sum(blocks) sum_blocks
FROM v$sort_usage
GROUP BY username, sqladdr, sqlhash, sql_id, tablespace, session_addr
)
, temp as
(
SELECT A.tablespace_name tablespace, D.mb_total,
SUM (A.used_blocks * D.block_size) / 1024 / 1024 mb_used,
D.mb_total - SUM (A.used_blocks * D.block_size) / 1024 / 1024 mb_free
FROM v$sort_segment A,
(
SELECT B.name, C.block_size, SUM (C.bytes) / 1024 / 1024 mb_total
FROM v$tablespace B, v$tempfile C
WHERE B.ts#= C.ts#
GROUP BY B.name, C.block_size
) D
WHERE A.tablespace_name = D.name
GROUP by A.tablespace_name, D.mb_total
)
SELECT to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') sample_time
, sess.sql_id
, CASE WHEN elapsed_time > 2*86399*1000000
THEN '2 ' || to_char(to_date(round((elapsed_time-(2*86399*1000000))/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
WHEN elapsed_time > 86399*1000000
THEN '1 ' || to_char(to_date(round((elapsed_time-(86399*1000000))/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
WHEN elapsed_time <= 86399*1000000
THEN to_char(to_date(round(elapsed_time/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
END as time_per_execution
, sum_blocks*dt.block_size/1024/1024 usage_mb, sort.tablespace
, temp.mb_used, temp.mb_free, temp.mb_total
, sort.username, sess.sid, sess.serial#
, p.spid, sess.osuser, sess.module, sess.machine, p.program
, vst.sql_text
FROM sort,
v$sqlarea vst,
v$session sess,
v$process p,
dba_tablespaces dt
, temp
WHERE sess.sql_id = vst.sql_id (+)
AND sort.session_addr = sess.saddr (+)
AND sess.paddr = p.addr (+)
AND sort.tablespace = dt.tablespace_name (+)
AND sort.tablespace = temp.tablespace
order by 4 desc
;
When I use the term "instrument", I mean that you can periodically persist the results of running this query so that you can look back later and see what was running when you got a TEMP blowout.
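(A minimal sketch of that kind of instrumentation, not part of the original answer; it assumes the monitoring query above is stored in a variable named temp_usage_sql and appends each sample to a CSV file.)

import time
import cx_Oracle

# Hypothetical sketch: run the TEMP-usage query every few minutes and keep the
# rows, so you can see afterwards what was running when a blowout occurred.
def sample_temp_usage(dsn, temp_usage_sql, out_path="temp_usage_history.csv",
                      interval_seconds=300):
    while True:
        conn = cx_Oracle.connect(dsn=dsn)
        try:
            rows = conn.cursor().execute(temp_usage_sql).fetchall()
        finally:
            conn.close()
        with open(out_path, "a") as f:
            for row in rows:
                f.write(",".join(str(col) for col in row) + "\n")
        time.sleep(interval_seconds)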
In my case, some SELECT statements were executing for a very long time, around 10 minutes, whereas general SQL usually completes in 5 seconds.
I found that the SQL uses SELECT COLUMN_A||COLUMN_B FROM TABLE_C ..., i.e. it uses || to concatenate columns and compares the result against another table, which also uses ||. This may cause Oracle Database to perform a full table scan and use a lot of memory, so ORA-01652 occurs.
After changing || to a normal column-by-column comparison, SELECT COLUMN_A, COLUMN_B FROM TABLE_C ..., it executes normally, without error.

How to use extra function to aggregate a table separately django?

I tried the example below, but I did not succeed.
Thank you very much for your attention.
aulas = Aula.objects.extra(
    select={
        'reposicao': 'SELECT * FROM app_reposicao WHERE app_cfc_reposicao.id = app_aula.id_tipo'
    })
This fails with:
subquery must return only one column
LINE 1: SELECT (SELECT * FROM app_reposicao WHERE app_reposi...
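(The thread contains no answer; the error means the correlated subquery may only return a single column. A hedged sketch of two ways to express this follows; Reposicao and the descricao field are hypothetical names used only for illustration.)

from django.db.models import OuterRef, Subquery

# Option 1: keep extra(), but select a single column in the subquery.
aulas = Aula.objects.extra(select={
    'reposicao': 'SELECT id FROM app_reposicao '
                 'WHERE app_reposicao.id = app_aula.id_tipo'
})

# Option 2: the same idea with Subquery/OuterRef (Reposicao and descricao
# are invented names for illustration).
aulas = Aula.objects.annotate(
    reposicao=Subquery(
        Reposicao.objects.filter(id=OuterRef('id_tipo')).values('descricao')[:1]
    )
)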

Fast complex SQL queries on PostgreSQL database with Python

I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially: each of the 50M records needs to be fetched separately, then another query (in the example below) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total), and the complex query (that returns a calculated value; example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
        FROM (
            SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
            FROM tt_tweets tw1
            WHERE 1=1
            AND tw1.in_reply_to_id > 0
            AND tw1.is_retweet = False
            AND tw1.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                             WHERE nx.id = tw1.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                             WHERE nx.id = tw1.in_reply_to_id)
            UNION ALL
            SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
            FROM tweets tw2
            WHERE 1=1
            AND tw2.in_reply_to_id > 0
            AND tw2.is_retweet = False
            AND tw2.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                             WHERE nx.id = tw2.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                             WHERE nx.id = tw2.in_reply_to_id)
            -- ORDER BY tw2.created_at DESC
        ) twx
        LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup == None: break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])            # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
you only select fields from the er table,
so the two LEFT JOINed tables could be omitted;
if ex and ef do contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records;
there is a GROUP BY present, but no aggregates in the select list.
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it: create what's known as a summary table. This summary table, residing in the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view.
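(A hedged sketch of that idea with psycopg2, not from the thread; the view name and connection string are invented, and the SELECT reuses the query from the question.)

import psycopg2

# Hypothetical sketch: build the joined dataset once inside PostgreSQL as a
# materialized view, then read from it instead of re-joining per record.
with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS summary_calc AS
        SELECT CR.gg, (AX.b + BF.f) / CR.d AS calculated_field
        FROM table_one CR
        LEFT JOIN table_two AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x
    """)
    cur.execute("REFRESH MATERIALIZED VIEW summary_calc")
    cur.execute("SELECT gg, calculated_field FROM summary_calc WHERE gg = %s",
                ('123',))
    rows = cur.fetchall()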
I took #wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and total run time dropped to somewhere between minutes and roughly an hour.
To recap:
Database records are no longer fetched in the sequence mentioned earlier.
Calculations happen inside the database via a PostgreSQL function.
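(A hedged illustration of what that set-based approach can look like from Python, not the poster's actual code; the results table, the calc_value function and the column names are invented.)

import psycopg2

# Hypothetical sketch: one INSERT ... SELECT computes and stores the value for
# every record inside PostgreSQL, replacing the per-record Python loop.
with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO results (gg, calculated_field)
        SELECT CR.gg, calc_value(CR.x)  -- calc_value: the server-side function
        FROM table_one CR
    """)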

Unable to run a standardSQL query to bigquery in python

I am trying to query a table in BigQuery via a Python script. However, I have written the query as a standard SQL query, and for this I need to start my query with '#standardsql'. However, when I do this it then comments out the rest of my query. I have tried to write the query using multiple lines, but it does not allow me to do this either. Has anybody dealt with a problem like this and found a solution? Below is my first attempt, where the query becomes commented out.
client = bigquery.Client('dataworks-356fa')
query = ("#standardsql SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'")
dataset = client.dataset('FirebaseArchive')
table = dataset.table('test2')
tbl = dataset.table('Count_BB_Serial_weekly')
job = client.run_async_query(str(uuid.uuid4()), query)
job.destination = tbl
job.write_disposition= 'WRITE_TRUNCATE'
job.begin()
When I try to write the query like this, Python does not read anything on the second line as part of the query.
query = ("#standardsql
SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'")
The query I'm running selects values that have been produced within the last week. If there is a variation of this that would not require standard SQL, I would be willing to switch my other queries as well, but I have not been able to figure out how to do that. I would prefer for this to be the last resort, though. Thank you for the help!
If you want to flag that you'll be using Standard SQL inside the query itself, you can build it like:
query = """#standardSQL
SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'
"""
Another option is to set the use_legacy_sql property of the created job to False, something like:
job = client.run_async_query(job_name, query)
job.use_legacy_sql = False # -->this also makes the API use Standard SQL
job.begin()
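(For reference, run_async_query belongs to an older release of the google-cloud-bigquery library; in current releases the rough equivalent, sketched here and not part of the original answer, is a QueryJobConfig passed to client.query().)

from google.cloud import bigquery

# Hedged sketch for newer versions of the google-cloud-bigquery client.
client = bigquery.Client('dataworks-356fa')
job_config = bigquery.QueryJobConfig(
    use_legacy_sql=False,  # use Standard SQL
    destination='dataworks-356fa.FirebaseArchive.Count_BB_Serial_weekly',
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.query(query, job_config=job_config)
job.result()  # wait for the job to finish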

peewee select() return SQL query, not the actual data

I'm trying to sum up the values in two columns and truncate my date field to the day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day,
       SUM(fremont_bridge_nb) AS Sum_NB,
       SUM(fremont_bridge_sb) AS Sum_SB
FROM bike_count
GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
Bike_Count.select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
                  fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
                  fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB')) \
          .group_by('Day').order_by('Day')
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS Day,
SUM("t1"."fremont_bridge_nb") AS Sum_NB,
SUM("t1"."fremont_bridge_sb") AS Sum_SB
FROM "bike_count" AS t1 ORDER BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group by / order by, Peewee will try to parameterize it as a value. This is to avoid SQL injection haxx.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(day)
         .order_by(day))
Or, if you prefer:
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(SQL('Day'))
         .order_by(SQL('Day')))
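(One small addition, not in the original answer: printing the query object only shows the generated SQL; iterating it actually executes it. For example:)

# Iterating the query (or calling .dicts()) executes it and yields rows,
# rather than the SQL string you see when printing the query object.
for row in query.dicts():
    print(row)  # e.g. {'Day': ..., 'Sum_NB': ..., 'Sum_SB': ...}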
