The task is to group datetime values into per-minute points (GROUP BY minute) using SQLAlchemy. I have a custom SQL query:
SELECT COUNT(*) AS point_value, MAX(time) as time
FROM `Downloads`
LEFT JOIN Mirror ON Downloads.mirror = Mirror.id
WHERE Mirror.domain_name = 'localhost.local'
AND `time` BETWEEN '2012-06-30 00:29:00' AND '2012-07-01 00:29:00'
GROUP BY DAYOFYEAR( time ) , ( 60 * HOUR( time ) + MINUTE(time ))
ORDER BY time ASC
It works great, but now I have to do it in SQLAlchemy. This is what I've got for now (grouping by year is just an example):
rows = (DBSession.query(func.count(Download.id), func.max(Download.time)).
        filter(Download.time >= fromInterval).
        filter(Download.time <= untilInterval).
        join(Mirror, Download.mirror == Mirror.id).
        group_by(func.year(Download.time)).
        order_by(Download.time)
        )
It gives me this SQL:
SELECT count("Downloads".id) AS count_1, max("Downloads".time) AS max_1
FROM "Downloads" JOIN "Mirror" ON "Downloads".mirror = "Mirror".id
WHERE "Downloads".time >= :time_1 AND "Downloads".time <= :time_2
GROUP BY year("Downloads".time)
ORDER BY "Downloads".time
As you can see, it is lacking only the correct grouping:
GROUP BY DAYOFYEAR( time ) , ( 60 * HOUR( time ) + MINUTE(time ))
Does SQLAlchemy have some function to group by minute?
You can use any SQL-side function from SA by means of Functions, which you already use for the YEAR part. I think in your case you just need to add (not tested):
from sqlalchemy.sql import func
...
# add another group_by to your existing query:
rows = ...
    .group_by(func.year(Download.time),
              60 * func.HOUR(Download.time) + func.MINUTE(Download.time))
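Putting it together, here is an untested sketch of the full query mirroring the raw SQL at the top (DBSession, Download, Mirror, fromInterval, and untilInterval are the names from the question; func.dayofyear, func.hour, and func.minute are generic func calls that render as the corresponding MySQL functions):
from sqlalchemy.sql import func

# Untested sketch: count downloads per minute, mirroring the raw SQL above.
rows = (
    DBSession.query(func.count(Download.id), func.max(Download.time))
    .join(Mirror, Download.mirror == Mirror.id)
    .filter(Mirror.domain_name == 'localhost.local')
    .filter(Download.time >= fromInterval)
    .filter(Download.time <= untilInterval)
    .group_by(
        func.dayofyear(Download.time),
        60 * func.hour(Download.time) + func.minute(Download.time),
    )
    .order_by(Download.time)
)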
Related
I am pulling data from an online database using SQL/PostgreSQL queries and converting it into a Python dataframe using Pandas. I want to be able to change the dates in the SQL queries from one place in my Python script, instead of having to manually go through every query and change the dates one by one, as there are many queries and many lines in each one.
This is what I have to begin with for example:
random_query = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('2022-03-01')
and date_trunc('day',a.created_at) <= date('2022-03-31')
group by 1,2,3
"""
Then I will read the data into Pandas as follows:
df_random_query = pd.read_sql(random_query, conn)
The connection above is to the database; the issue is not there, so I am excluding that portion of the code here.
What I have attempted is the following:
start_date = '2022-03-01'
end_date = '2022-03-31'
I have set the above 2 dates as variables and then below I have tried to use them in the SQL query as follows:
attempted_solution = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date(
""" + start_date + """)
and date_trunc('day',a.created_at) <= date(
""" + end_date + """)
group by 1,2,3
"""
This does run, but it gives me a dataframe with no data in it, i.e. no numbers. I am not sure what I am doing wrong. Any assistance will really help.
Try dropping the date function and using f-string formatting, keeping quotes around the interpolated value:
my_query = f"... where date_trunc('day', a.created_at) >= '{start_date}'"
I was able to work it out as follows:
start_date = '2022-03-01'
end_date = '2022-03-31'
random_query = f"""
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('{start_date}')
and date_trunc('day',a.created_at) <= date('{end_date}')
group by 1,2,3
"""
It was amusing to see that all I needed to do was put '{start_date}' and '{end_date}' in quotes as well. I noticed this simply by printing the query the script was producing. The key thing here is knowing how to troubleshoot.
Another option is to use .format() at the end of the query string, e.g. .format(start_date='2022-03-01', end_date='2022-03-31'), as sketched below.
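For reference, a sketch of that .format() variant (same hypothetical table and column names as above):
random_query = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('{start_date}')
and date_trunc('day',a.created_at) <= date('{end_date}')
group by 1,2,3
""".format(start_date='2022-03-01', end_date='2022-03-31')

df_random_query = pd.read_sql(random_query, conn)
Note that any string interpolation is vulnerable to SQL injection if the dates ever come from user input; pd.read_sql also accepts a params argument for proper bound parameters.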
I want to query the max date in a table and use this as a parameter in a WHERE clause in another query. I am doing this:
query = (""" select
cast(max(order_date) as date)
from
tablename
""")
cursor.execute(query)
d = cursor.fetchone()
as output: [(datetime.date(2021, 9, 8),)]
Then I want to use this output as a parameter in another query:
query3=("""select * from anothertable
where order_date = d::date limit 10""")
cursor.execute(query3)
as output: column "d" does not exist
I tried cast(d as date) and d::date, but nothing works. I also tried datetime.date(d), with no success either.
What am I doing wrong here?
There is no reason to select the date then use it in another query. That requires 2 round trips to the server. Do it in a single query. This has the advantage of removing all client side processing of that date.
select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
);
I am not exactly sure how this translates into your obfuscation layer but, from what I see, I believe it would be something like:
query = (""" select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
) """)
cursor.execute(query)
Heed the warning by @OneCricketeer. You may need a cast on anothertable's order_date as well: where cast(order_date as date) = ( select ... ).
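If you do need the value client-side as well, bind it as a query parameter instead of naming the Python variable inside the SQL. A sketch, assuming a psycopg2-style driver with %s placeholders:
cursor.execute("select cast(max(order_date) as date) from tablename")
d = cursor.fetchone()[0]  # e.g. datetime.date(2021, 9, 8)

# The date is passed separately via the placeholder, never spliced into the SQL string.
cursor.execute("select * from anothertable where order_date = %s limit 10", (d,))
rows = cursor.fetchall()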
I am trying to write the following PostgreSQL query in SQLAlchemy:
SELECT DISTINCT user_id
FROM
(SELECT *, (amount * usd_rate) as usd_amount
FROM transactions AS t1
LEFT JOIN LATERAL (
SELECT rate as usd_rate
FROM fx_rates fx
WHERE (fx.ccy = t1.currency) AND (t1.created_date > fx.ts)
ORDER BY fx.ts DESC
LIMIT 1
) t2 On true) AS complete_table
WHERE type = 'CARD_PAYMENT' AND usd_amount > 10
So far, I have the lateral join by using a subquery in the following way:
lateral_query = session.query(fx_rates.rate.label('usd_rate')).\
    filter(fx_rates.ccy == transactions.currency,
           transactions.created_date > fx_rates.ts).\
    order_by(desc(fx_rates.ts)).\
    limit(1).\
    subquery('rates_lateral').\
    lateral('rates')

task2_query = session.query(transactions).\
    outerjoin(lateral_query, true()).\
    filter(transactions.type == 'CARD_PAYMENT')

print(task2_query)
This produces:
SELECT transactions.currency AS transactions_currency, transactions.amount AS transactions_amount, transactions.state AS transactions_state, transactions.created_date AS transactions_created_date, transactions.merchant_category AS transactions_merchant_category, transactions.merchant_country AS transactions_merchant_country, transactions.entry_method AS transactions_entry_method, transactions.user_id AS transactions_user_id, transactions.type AS transactions_type, transactions.source AS transactions_source, transactions.id AS transactions_id
FROM transactions LEFT OUTER JOIN LATERAL (SELECT fx_rates.rate AS usd_rate
FROM fx_rates
WHERE fx_rates.ccy = transactions.currency AND transactions.created_date > fx_rates.ts ORDER BY fx_rates.ts DESC
LIMIT %(param_1)s) AS rates ON true
WHERE transactions.type = %(type_1)s
This prints the correct lateral query, but so far I don't know how to add the calculated field (amount * usd_rate) so that I can apply the DISTINCT and WHERE statements.
Add the required entity in the Query, give it a label, and use the result as a subquery as you've done in SQL:
task2_query = session.query(
transactions,
(transactions.amount * lateral_query.c.usd_rate).label('usd_amount')).\
outerjoin(lateral_query, true()).\
subquery()
task3_query = session.query(task2_query.c.user_id).\
filter(task2_query.c.type == 'CARD_PAYMENT',
task2_query.c.usd_amount > 10).\
distinct()
On the other hand, wrapping it in a subquery should be unnecessary, since you can use the calculated USD amount in a WHERE predicate in the inner query just as well:
task2_query = session.query(transactions.user_id).\
outerjoin(lateral_query, true()).\
filter(transactions.type == 'CARD_PAYMENT',
transactions.amount * lateral_query.c.usd_rate > 10).\
distinct()
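For completeness, a hedged way to execute the latter variant (names as defined above); the Query returns one-column rows that can be unpacked directly:
# Collect the distinct user ids; each row is a one-element Row object.
user_ids = [user_id for (user_id,) in task2_query.all()]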
I have two dataframes in this problem. I want to add a column to loan_df that aggregates across recharge_df: for each loan given, I want the borrower's mean recharges prior to the date the loan was taken (in this case, 90 days prior). I will then add this new column to loan_df. My code below works but is slow. Any ideas on how to make it super efficient?
def mean_rec_func(msisdn, date, advance_id, window, name):
    """Returns mean recharges within a specified number of days prior to loan being taken

    Keyword Arguments:
    msisdn -- APF_MSISDN for loan (this is like customer ID)
    date -- APF_DATE on which loan taken
    advance_id -- APF_ADVANCE_ID for loan
    window -- number of days to look back (int)
    name -- name of the newly computed stat
    """
    mean_rec = recharge_df.loc[
        (recharge_df['APF_MSISDN'] == msisdn)
        & (recharge_df['APF_DATE'] < date)
        & (recharge_df['APF_DATE'] >= date - datetime.timedelta(days=window))
    ]['APF_AMOUNT'].mean()
    return pd.Series([advance_id, msisdn, mean_rec],
                     index=['APF_ADVANCE_ID', 'APF_MSISDN', name])

# Mean recharge over last 90 days
mean_recharge_90 = loan_df.apply(
    lambda row: mean_rec_func(row['APF_MSISDN'], row['APF_DATE'],
                              row['APF_ADVANCE_ID'],
                              window=90,
                              name="MEAN_RECHARGE_90"),
    axis=1)
EDIT:
Consider an SQL solution, as your logic translates into the following query with a correlated aggregate subquery (which, admittedly, is also an expensive type of query, since the aggregate runs for every outer-query row, similar to the pandas apply loop).
SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_df r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_df l
In pandas, you can use the pandasql module, which runs an in-memory instance of SQLite:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
sql = """SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_df r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_df l"""
output_df = pysqldf(sql)
Below is the expanded version of what runs under the hood of pandasql, interfacing with SQLAlchemy and pandas' import/export calls read_sql and to_sql.
from sqlalchemy import create_engine
# IN-MEMORY DATABASE (NO PATH SPECIFIED)
engine = create_engine('sqlite://')
# EXPORT DATAFRAMES
recharge_df.to_sql("recharge_tbl", con=engine, if_exists='replace')
loan_df.to_sql("loan_tbl", con=engine, if_exists='replace')
sql = """SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_tbl r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_tbl l"""
# IMPORT QUERY RESULT
output_df = pd.read_sql(sql, engine)
# IN-MEMORY DATABASE DESTROYED
engine.dispose()
I'm trying to calculate the equivalent of SELECT SUM(...) FROM ... GROUP BY .... Here's a simplified analogy:
Let's say Salesperson objects sell stuff and get a commission on the margin they generate through each Sale:
sp = Salesperson.objects.get(pk=1)
my_sales = Sale.objects.filter(fk_salesperson=sp)
#calculate commission owing to sp
commission = 0
for sale in my_sales:
commission += sp.commission_rate\
* (sale.selling_price - sale.cost_price)
That last loop could be done with something like:
.annotate(commission=(F('selling_price') - F('cost_price'))
          * sp.commission_rate)
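Spelled out with the names used in the question (an untested sketch; Sale, fk_salesperson, selling_price, and cost_price come from the snippets above):
from django.db.models import F

# Annotate each of one salesperson's sales with its commission, computed
# database-side; sp.commission_rate is a plain Python value here.
my_sales = Sale.objects.filter(fk_salesperson=sp).annotate(
    commission=(F('selling_price') - F('cost_price')) * sp.commission_rate
)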
But can I then further aggregate the query over all Salesperson objects? I.e. I want to know every salesperson's commission (roughly SELECT SUM((sale_price - cost_price) * commission_rate) FROM Sales GROUP BY Salesperson). I could do something like the below, but I'm trying to do it with the ORM:
commissions = []
salespeople = Salesperson.objects.all()
for sp in salespeople:
    data = Sale.objects.filter(fk_salesperson=sp)\
        .annotate(salesperson=F('fk_salesperson__email'))\
        .annotate(commission=(F('selling_price') - F('cost_price'))
                  * sp.commission_rate)
    commissions.append(data)
Is there a way to do this with a single query (making the reporting db server do the work) rather than doing it on my application server?
The Sum() aggregate function is available in django.db.models and you can use related fields in an F expression.
from django.db.models import F, Sum
Sale.objects.values('fk_salesperson__id').annotate(commission=Sum(
    (F('selling_price') - F('cost_price')) * F('fk_salesperson__commission_rate')
))
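A hedged usage sketch with the question's field names (fk_salesperson, selling_price, cost_price, commission_rate); if the price and rate columns have mixed types, Django may require an explicit output_field:
from django.db.models import DecimalField, F, Sum

# Sketch, untested against the question's models: group sales by salesperson
# and sum the per-sale commission in the database.
commissions = Sale.objects.values('fk_salesperson__id').annotate(
    commission=Sum(
        (F('selling_price') - F('cost_price')) * F('fk_salesperson__commission_rate'),
        output_field=DecimalField(),
    )
)

for row in commissions:
    print(row['fk_salesperson__id'], row['commission'])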