I have two dataframes in this problem. I want to add a column to loan_df that aggregates across recharge_df: for each loan given, I want the borrower's mean recharges in a window prior to the date the loan was taken (in this case 90 days). I will then add this new column to loan_df. My code below works but is slow. Any ideas on how to make it much more efficient?
import datetime
import pandas as pd

def mean_rec_func(msisdn, date, advance_id, window, name):
    """Returns mean recharges within a specified number of days prior to the loan being taken.

    Keyword Arguments:
    msisdn -- APF_MSISDN for the loan (this is like a customer ID)
    date -- APF_DATE on which the loan was taken
    advance_id -- APF_ADVANCE_ID for the loan
    window -- number of days to look back (int)
    name -- name of the newly computed stat
    """
    mean_rec = recharge_df.loc[
        (recharge_df['APF_MSISDN'] == msisdn)
        & (recharge_df['APF_DATE'] < date)
        & (recharge_df['APF_DATE'] >= date - datetime.timedelta(days=window)),
        'APF_AMOUNT'
    ].mean()
    return pd.Series([advance_id, msisdn, mean_rec],
                     index=['APF_ADVANCE_ID', 'APF_MSISDN', name])
# Mean recharge over the last 90 days
mean_recharge_90 = loan_df.apply(
    lambda row: mean_rec_func(row['APF_MSISDN'], row['APF_DATE'],
                              row['APF_ADVANCE_ID'],
                              window=90,
                              name="MEAN_RECHARGE_90"),
    axis=1)
EDIT:
Consider an SQL solution, since your logic translates into the following query with a correlated aggregate subquery (which, admittedly, is also an expensive type of query, as the aggregate runs for every outer-query row, much like the pandas apply loop).
SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_df r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_df l
In pandas, you can use the pandasql module, which runs an in-memory instance of SQLite:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
sql = """SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_df r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_df l"""
output_df = pysqldf(sql)
Below is the expanded version of what runs under the hood of pandasql, interfacing with SQLAlchemy and pandas' import/export calls, read_sql and to_sql.
from sqlalchemy import create_engine
# IN-MEMORY DATABASE (NO PATH SPECIFIED)
engine = create_engine('sqlite://')
# EXPORT DATAFRAMES
recharge_df.to_sql("recharge_tbl", con=engine, if_exists='replace')
loan_df.to_sql("loan_tbl", con=engine, if_exists='replace')
sql = """SELECT l.*,
(SELECT AVG([APF_AMOUNT]) FROM recharge_tbl r
WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
AND r.[APF_DATE] < l.[APF_DATE]
AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_tbl l"""
# IMPORT QUERY RESULT
output_df = pd.read_sql(sql, engine)
# IN-MEMORY DATABASE DESTROYED
engine.dispose()
I have a function in my main Python file that is called by main() and executes a SQL MERGE (upsert) statement using pyodbc; the statement itself is built by a function in a different file. Concretely, the SQL statement walks a source table of transaction details by distinct transaction datetimes and merges the customers into a separate target table. The function that executes the statement and the function that returns the completed SQL statement are attached below.
When I run my Python script, it doesn't work as expected and inserts only around 70 rows (sometimes 69, 71, or 72) into the target customer table. However, when I use an identical SQL statement and execute it in the Microsoft SQL Server Management Studio console (attached below), it works fine and inserts 4302 rows (as expected).
I'm not sure what's wrong. I would really appreciate any help!
SQL Statement Executor in Python main file:
def stage_to_dim(connection, cursor, now):
    log(f"Filling {cfg.dim_customer} and {cfg.dim_product}")
    try:
        cursor.execute(sql_statements.stage_to_dim_statement(now))
        connection.commit()
    except Exception as e:
        log(f"Error in stage_to_dim: {e}")
        sys.exit(1)
    log("Stage2Dimensions complete.")
SQL Statement formulator in Python:
def stage_to_dim_statement(now):
    return f"""
    DECLARE @dates TABLE (id INT IDENTITY(1,1), date DATETIME);
    INSERT INTO @dates (date)
    SELECT DISTINCT TransactionDateTime FROM {cfg.stage_table} ORDER BY TransactionDateTime;

    DECLARE @i INT;
    DECLARE @cnt INT;
    DECLARE @date DATETIME;
    SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;

    WHILE @i < @cnt
    BEGIN
        SET @i = @i + 1
        SET @date = (SELECT date FROM @dates WHERE id = @i)

        MERGE {cfg.dim_customer} AS Target
        USING (SELECT * FROM {cfg.stage_table} WHERE TransactionDateTime = @date) AS Source
        ON Target.CustomerCodeNK = Source.CustomerID
        WHEN MATCHED THEN
            UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
                       Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '{now}'), Target.LoadSource = '{cfg.ingest_file_path}'
        WHEN NOT MATCHED THEN
            INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
            VALUES (Source.CustomerID, Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME, '{now}'), '{cfg.ingest_file_path}');
    END
    """
SQL Statement from MS SQL Server Console:
DECLARE @dates TABLE (id INT IDENTITY(1,1), date DATETIME);
INSERT INTO @dates (date)
SELECT DISTINCT TransactionDateTime FROM dbo.STG_CustomerTransactions ORDER BY TransactionDateTime;

DECLARE @i INT;
DECLARE @cnt INT;
DECLARE @date DATETIME;
SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;

WHILE @i < @cnt
BEGIN
    SET @i = @i + 1
    SET @date = (SELECT date FROM @dates WHERE id = @i)

    MERGE dbo.DIM_CustomerDup AS Target
    USING (SELECT * FROM dbo.STG_CustomerTransactions WHERE TransactionDateTime = @date) AS Source
    ON Target.CustomerCodeNK = Source.CustomerID
    WHEN MATCHED THEN
        UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
                   Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '6/30/2022 11:53:05'), Target.LoadSource = '../csv/cleaned_original_data.csv'
    WHEN NOT MATCHED THEN
        INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
        VALUES (Source.CustomerID, Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME, '6/30/2022 11:53:05'), '../csv/cleaned_original_data.csv');
END
If you think carefully about what your final result ends up being, you are actually just taking the latest row (by date) for each customer, so you can filter the source with a standard row-number approach instead of looping over dates.
Exactly why the Python code doesn't work properly is unclear, but the query below should work better. You are also splicing values straight into the SQL string, which is SQL injection; that is dangerous and can also cause correctness problems (a parameterized sketch follows the query below).
You should also always use an unambiguous date format.
MERGE dbo.DIM_CustomerDup AS t
USING (
    SELECT *
    FROM (
        SELECT *,
               rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TransactionDateTime DESC)
        FROM dbo.STG_CustomerTransactions s
    ) AS s
    WHERE s.rn = 1
) AS s
ON t.CustomerCodeNK = s.CustomerID
WHEN MATCHED THEN
    UPDATE SET
        AquiredDate = s.AcquisitionDate,
        AquiredSource = s.AcquisitionSource,
        ZipCode = s.Zipcode,
        LoadDate = SYSDATETIME(),
        LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
    INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
    VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, SYSDATETIME(), '../csv/cleaned_original_data.csv');
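On the injection point, here is a hedged sketch (not the original code) of how the statement formulator could keep only the table names as f-string interpolations from the existing cfg module, while LoadDate and LoadSource are supplied as pyodbc parameters; the ROW_NUMBER filter mirrors the query above:
def stage_to_dim_statement():
    # Only identifiers (table names from cfg) are interpolated into the SQL text;
    # the values are bound at execute time via pyodbc's '?' placeholders.
    return f"""
    MERGE {cfg.dim_customer} AS t
    USING (
        SELECT *
        FROM (
            SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID
                                              ORDER BY s.TransactionDateTime DESC)
            FROM {cfg.stage_table} s
        ) AS s
        WHERE s.rn = 1
    ) AS s
    ON t.CustomerCodeNK = s.CustomerID
    WHEN MATCHED THEN
        UPDATE SET AquiredDate = s.AcquisitionDate, AquiredSource = s.AcquisitionSource,
                   ZipCode = s.Zipcode, LoadDate = ?, LoadSource = ?
    WHEN NOT MATCHED THEN
        INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
        VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, ?, ?);
    """

# The executor then passes the values alongside the SQL:
# cursor.execute(stage_to_dim_statement(), now, cfg.ingest_file_path, now, cfg.ingest_file_path)
# connection.commit()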
I am pulling data from an online database using SQL/PostgreSQL queries and converting the results into Python dataframes using pandas. I want to be able to change the dates in the SQL queries from one point in my Python script, instead of manually going through every query and changing it one by one, as there are many queries and many lines in each one.
This is what I have to begin with for example:
random_query = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('2022-03-01')
and date_trunc('day',a.created_at) <= date('2022-03-31')
group by 1,2,3
"""
Then I will read the data into Pandas as follows:
df_random_query = pd.read_sql(random_query, conn)
The connection above is to the database - the issue is not there so I am excluding that portion of code here.
What I have attempted is the following:
start_date = '2022-03-01'
end_date = '2022-03-31'
I have set the above 2 dates as variables and then below I have tried to use them in the SQL query as follows:
attempted_solution = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date(
""" + start_date + """)
and date_trunc('day',a.created_at) <= date(
""" + end_date + """)
group by 1,2,3
"""
This does run but it gives me a dataframe with no data in it - i.e. no numbers. I am not sure what I am doing wrong - any assistance will really help.
Try dropping the date function and using f-string formatting (note the quotes around the placeholder):
my_query = f"... where date_trunc('day', a.created_at) >= '{start_date}'"
I was able to work it out as follows:
start_date = '2022-03-01'
end_date = '2022-03-31'
random_query = f"""
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('{start_date}')
and date_trunc('day',a.created_at) <= date('{end_date}')
group by 1,2,3
"""
It was amusing to see that all I needed to do was put '{start_date}' and '{end_date}' in quotes as well. I noticed this simply by printing the query the script was producing; the key thing here is knowing how to troubleshoot.
Another option is to use .format() at the end of a plain (non-f) query string, e.g. .format(start_date='2022-03-01', end_date='2022-03-31').
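For what it's worth, a safer option than splicing the values into the string yourself is to let the driver bind them. A small sketch, assuming the same conn object and that the driver is psycopg2 (which uses %(name)s placeholders that pd.read_sql passes through via params):
import pandas as pd

start_date = '2022-03-01'
end_date = '2022-03-31'

random_query = """
select *
from table_A as a
where date_trunc('day', a.created_at) >= date(%(start_date)s)
and date_trunc('day', a.created_at) <= date(%(end_date)s)
group by 1, 2, 3
"""

# The driver quotes and substitutes the values safely; no manual quoting needed.
df_random_query = pd.read_sql(random_query, conn,
                              params={"start_date": start_date, "end_date": end_date})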
I have a task to run 8 identical queries (one query per country) and return data from a MySQL database. The reason I can't run one query for all countries is that each country needs different column names. Also, the results need to be updated daily with a dynamic date range (the last 7 days). Yes, I could run all countries in one query and do the column naming and everything with pandas, but I thought the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists of all the countries, their respective dimensions, and date-range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much longer than the same query run directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem or how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
coon = mysql.connector.connect(
host="the host goes here",
user="the user goes here",
passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = coon.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr
            on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl
            on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv
            on sv.ID = svl.scorecardVariable_ID
        where sv.country='{c}'
        #and sv.subType ="asc"
        and sv.subType != 'fsc'
        and sr.created >= '2020-01-01'
        and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
        group by ca.id, sv.subType"""
    #Check of the sql query:
    print('query is done', time.time() - start_time)
    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #Check of assigning sql:
    print('sql is assigned', time.time() - start_time)
    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print('df has been assigned', time.time() - start_time)
    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv",
              index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time() - start_time)
    #Close the session
    start_time = time.time()
    cursor.close()
    #Check the session closing:
    print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. I did that thinking I might be hitting some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running the countries separately takes almost the same time for each, and it still takes too long.
My tests show that the loop always lags at the querying step. Every other step takes less than a second, but query time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite the query so that there was no subquery at all. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially: each of the 50M records needs to be fetched separately, then another query (see the example below) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total), and the complex query (which returns a calculated value, example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records one by one, calculate the value of a field via a second query, and then save the result?
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: The SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over records, calling the DBMS repeatedly for every record. Instead, let the DBMS process large chunks (preferably all) of the data and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
        FROM (
            SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
            FROM tt_tweets tw1
            WHERE 1=1
            AND tw1.in_reply_to_id > 0
            AND tw1.is_retweet = False
            AND tw1.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            UNION ALL
            SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
            FROM tweets tw2
            WHERE 1=1
            AND tw2.in_reply_to_id > 0
            AND tw2.is_retweet = False
            AND tw2.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            -- ORDER BY tw2.created_at DESC
        ) twx
        LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup == None: break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])            # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
you only select fields from the er table,
so the two LEFT JOINed tables could be omitted;
if ex and ef do contain multiple matching rows, the result set could be larger than just the rows selected from er, resulting in duplicated er records;
there is a GROUP BY present, but no aggregates are in the select list.
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it, meaning create what's known as a summary table. This summary table, residing in the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view.
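A minimal sketch of the materialized-view route, reusing the obfuscated table and column names from the question; the view name calc_summary is made up, and conn is assumed to be an existing psycopg2 connection:
# Build the joined dataset once inside the database, then index and reuse it.
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS calc_summary AS
        SELECT CR.gg, (AX.b + BF.f) / CR.d AS calculated_field
        FROM table_one CR
        LEFT JOIN table_two AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x;
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS calc_summary_gg_idx ON calc_summary (gg);")
    cur.execute("REFRESH MATERIALIZED VIEW calc_summary;")  # re-run whenever the base tables change

# Later reads become a plain indexed lookup instead of a per-record join:
# cur.execute("SELECT calculated_field FROM calc_summary WHERE gg = %s;", ('123',))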
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and the total run time dropped to minutes/~an hour.
To recap:
Database records are no longer fetched one by one in the sequence mentioned earlier.
Calculations happen inside the database via a PostgreSQL function.
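For illustration only (the actual function isn't shown in the thread), a rough sketch of what such a PostgreSQL function could look like, reusing the obfuscated names from the query above; the function name calc_value and the results_table target are made up:
create_fn = """
CREATE OR REPLACE FUNCTION calc_value(p_gg text)
RETURNS numeric
LANGUAGE sql STABLE AS $$
    SELECT (AX.b + BF.f) / CR.d
    FROM table_one CR
    LEFT JOIN table_two AX ON AX.x = CR.x
    LEFT JOIN table_three BF ON BF.x = CR.x
    WHERE CR.gg = p_gg;
$$;
"""
with conn, conn.cursor() as cur:
    cur.execute(create_fn)
    # A single round trip computes and stores every value inside the database:
    cur.execute("INSERT INTO results_table (gg, calculated_field) "
                "SELECT gg, calc_value(gg) FROM table_one;")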
The task is to group datetime values (using SQLAlchemy) into per-minute points (group by minute).
I have a custom SQL query:
SELECT COUNT(*) AS point_value, MAX(time) as time
FROM `Downloads`
LEFT JOIN Mirror ON Downloads.mirror = Mirror.id
WHERE Mirror.domain_name = 'localhost.local'
AND `time` BETWEEN '2012-06-30 00:29:00' AND '2012-07-01 00:29:00'
GROUP BY DAYOFYEAR( time ) , ( 60 * HOUR( time ) + MINUTE(time ))
ORDER BY time ASC
It works great, but now I have to do it in SQLAlchemy. This is what I've got for now (grouping by year is just an example):
rows = (DBSession.query(func.count(Download.id), func.max(Download.time)).
filter(Download.time >= fromInterval).
filter(Download.time <= untilInterval).
join(Mirror,Download.mirror==Mirror.id).
group_by(func.year(Download.time)).
order_by(Download.time)
)
It gives me this SQL:
SELECT count("Downloads".id) AS count_1, max("Downloads".time) AS max_1
FROM "Downloads" JOIN "Mirror" ON "Downloads".mirror = "Mirror".id
WHERE "Downloads".time >= :time_1 AND "Downloads".time <= :time_2
GROUP BY year("Downloads".time)
ORDER BY "Downloads".time
As you can see, it is lacking only the correct grouping:
GROUP BY DAYOFYEAR( time ) , ( 60 * HOUR( time ) + MINUTE(time ))
Does SQLAlchemy have some function to group by minute?
You can use any SQL-side function from SA by means of Functions, which you already use for the YEAR part. I think in your case you just need to add (not tested):
from sqlalchemy.sql import func
...
# add another group_by to your existing query:
rows = ...
group_by(func.year(Download.time),
60 * func.HOUR(Download.time) + func.MINUTE(Download.time)
)
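For completeness, an untested sketch of the full query with the MySQL grouping from the original SQL replicated through generic func calls (these render as DAYOFYEAR, HOUR and MINUTE), assuming the same models and interval variables used above:
from sqlalchemy.sql import func

rows = (DBSession.query(func.count(Download.id), func.max(Download.time)).
        join(Mirror, Download.mirror == Mirror.id).
        filter(Mirror.domain_name == 'localhost.local').
        filter(Download.time >= fromInterval).
        filter(Download.time <= untilInterval).
        group_by(func.dayofyear(Download.time),
                 60 * func.hour(Download.time) + func.minute(Download.time)).
        order_by(Download.time)
        )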