How to write Excel-like formulas in Python

I developed a model in Excel and am now trying to rewrite it entirely in Python. But for the calculations, I cannot think of any way other than looping over all the cells, which makes the processing time too long. I want to know if there is another way to write the code and thereby reduce the time needed to process all the data.
peak_start = pd.to_timedelta('08:00:00')
peak_end = pd.to_timedelta('20:00:00')

for i in df_open_position.columns[3:10]:
    df_open_position[i] = df_open_position['Index'][:8760].apply(
        lambda x: pf_targets.loc[(pf_targets['Target Startdatum'] <= df_open_position['Liefertag'][x])
                                 & (pf_targets['Target Enddatum'] >= df_open_position['Liefertag'][x])
                                 & (pf_targets['Produkt'].str.contains('Base')), 'Menge'].sum()
        + pf_targets.loc[(df_open_position['Liefertag'][x].weekday() >= 5)
                         & (peak_start <= df_open_position['Zeit'][x] <= peak_end)
                         & (pf_targets['Produkt'].str.contains('Peak')), 'Menge'].sum()
        + pf_swing_targets['Menge [MW]'][x]
        - pf_deals.loc[(pf_deals['Lieferbeginn'] <= df_open_position['Liefertag'][x])
                       & (pf_deals['Lieferende'] >= df_open_position['Liefertag'][x])
                       & (pf_deals[' Kaufrichtung'] == "Kauf")
                       & (pf_deals[' Datum'] <= df_open_position['Liefertag'][x])
                       & (pf_deals[' Produkt'].str.contains('Base')), ' Menge'].sum()
        - pf_deals.loc[(df_open_position['Liefertag'][x].weekday() >= 5)
                       & (peak_start <= df_open_position['Zeit'][x] <= peak_end)
                       & (pf_deals[' Produkt'].str.contains('Peak'))
                       & (pf_deals['Lieferbeginn'] <= df_open_position['Liefertag'][x])
                       & (pf_deals['Lieferende'] >= df_open_position['Liefertag'][x])
                       & (pf_deals[' Kaufrichtung'] == "Kauf")
                       & (pf_deals[' Datum'] <= df_open_position['Liefertag'][x]), ' Menge'].sum()
        + pf_deals.loc[(pf_deals['Lieferbeginn'] <= df_open_position['Liefertag'][x])
                       & (pf_deals['Lieferende'] >= df_open_position['Liefertag'][x])
                       & (pf_deals[' Kaufrichtung'] == "Verkauf")
                       & (pf_deals[' Datum'] <= df_open_position['Liefertag'][x])
                       & (pf_deals[' Produkt'].str.contains('Base')), ' Menge'].sum()
        + pf_deals.loc[(df_open_position['Liefertag'][x].weekday() >= 5)
                       & (peak_start <= df_open_position['Zeit'][x] <= peak_end)
                       & (pf_deals[' Produkt'].str.contains('Peak'))
                       & (pf_deals['Lieferbeginn'] <= df_open_position['Liefertag'][x])
                       & (pf_deals['Lieferende'] >= df_open_position['Liefertag'][x])
                       & (pf_deals[' Kaufrichtung'] == "Verkauf")
                       & (pf_deals[' Datum'] <= df_open_position['Liefertag'][x]), ' Menge'].sum())
Here is the formula I'm using in my Excel file:
=IF(E$1="";"";
SUMIFS(Targets!$G:$G;Targets!$I:$I;"<="&$A3;Targets!$J:$J;">="&$A3;Targets!$F:$F;"*"&"Base")
+SUMIFS(Targets!$G:$G;Targets!$I:$I;"<="&$A3;Targets!$J:$J;">="&$A3;Targets!$F:$F;"TS")
+IF(AND(WORKDAY($A3-1;1)=$A3;$D3="peak");SUMIFS(Targets!$G:$G;Targets!$I:$I;"<="&$A3;Targets!$J:$J;">="&$A3;Targets!$F:$F;"*"&"Peak");)
-SUMIFS(Deals!$K:$K;Deals!$R:$R;"<="&$A3;Deals!$S:$S;">="&$A3;Deals!$J:$J;"Kauf";Deals!$I:$I;"<="&'PF 22'!E$1;Deals!$G:$G;"*"&"Base")
-IF(AND(WORKDAY($A3-1;1)=$A3;$D3="peak");SUMIFS(Deals!$K:$K;Deals!$R:$R;"<="&$A3;Deals!$S:$S;">="&$A3;Deals!$J:$J;"Kauf";Deals!$I:$I;"<="&'PF 22'!E$1;Deals!$G:$G;"*"&"Peak");)
+ SUMIFS(Deals!$K:$K;Deals!$R:$R;"<="&$A3;Deals!$S:$S;">="&$A3;Deals!$J:$J;"Verkauf";Deals!$I:$I;"<="&'PF 22'!E$1;Deals!$G:$G;"*"&"Base")
+IF(AND(WORKDAY($A3-1;1)=$A3;$D3="peak");SUMIFS(Deals!$K:$K;Deals!$R:$R;"<="&$A3;Deals!$S:$S;">="&$A3;Deals!$J:$J;"Verkauf";Deals!$I:$I;"<="&'PF 22'!E$1;Deals!$G:$G;"*"&"Peak");)
+$Z3)
I hope that with all the information provided, someone can help me.
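Without a full sample of the data it is hard to give a drop-in answer, but here is a minimal sketch of the general technique that avoids the per-row .apply: cross-join the positions with the matching targets, keep only the rows where the delivery day falls inside the target interval, and aggregate once with groupby. The column names and values below are simplified, hypothetical stand-ins for the real frames:

```python
import pandas as pd

# Toy stand-ins for df_open_position and pf_targets (names simplified).
positions = pd.DataFrame({
    "Liefertag": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
})
targets = pd.DataFrame({
    "Start": pd.to_datetime(["2021-12-30", "2022-01-02"]),
    "Ende": pd.to_datetime(["2022-01-02", "2022-01-10"]),
    "Produkt": ["Base A", "Base B"],
    "Menge": [10.0, 5.0],
})

base = targets[targets["Produkt"].str.contains("Base")]

# Cross-join every position with every Base target, keep rows where the
# delivery day falls inside the target interval, then sum per position.
merged = positions.reset_index().merge(base, how="cross")
inside = merged[(merged["Start"] <= merged["Liefertag"])
                & (merged["Ende"] >= merged["Liefertag"])]
base_sum = inside.groupby("index")["Menge"].sum().reindex(positions.index, fill_value=0.0)
```

Each of the six pf_targets/pf_deals terms in the loop could be computed this way as its own vectorized Series and then added or subtracted at the end. Note the cross merge can get large; `merge(how="cross")` needs pandas 1.2 or newer.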

Related

Vectorized Solution to Iterrows

I have 2 dataframes : prediction_df and purchase_info_df. prediction_df contains customer id and prediction date. purchase_info_df contains customer id, purchase amount and purchase date. The dataframes are provided below for a single customer.
customer_id = [1, 1, 1]
prediction_date = ["2022-12-30", "2022-11-30", "2022-10-30"]
purchase_date = ["2022-11-12", "2022-12-01", "2022-09-03"]
purchase_amount = [500, 300, 100]
prediction_df = pd.DataFrame({"id":customer_id, "prediction_date":prediction_date})
purchase_info_df = pd.DataFrame({"id":customer_id,"purchase_date": purchase_date, "purchase_amount": purchase_amount})
prediction_df["prediction_date"] = pd.to_datetime(prediction_df["prediction_date"])
purchase_info_df["purchase_date"] = pd.to_datetime(purchase_info_df["purchase_date"])
My aim is to create features on the prediction_date, such as total purchase, mean purchase amount, purchase amount in the last month, etc. I can do this with the following code, which uses iterrows. This is way too slow when I have over 100,000 customers. I am looking for a way to vectorize the operations in the code below so that it runs faster.
res = []
for idx, rw in tqdm_notebook(prediction_df.iterrows(), total=prediction_df.shape[0]):
    dep_dat = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date)]
    dep_sum = dep_dat.purchase_amount.sum()
    dep_mean = dep_dat.purchase_amount.mean()
    dep_std = dep_dat.purchase_amount.std()
    dep_count = dep_dat.purchase_amount.count()
    last_15_days = rw.prediction_date - relativedelta(days=15)
    last_30_days = rw.prediction_date - relativedelta(days=30)
    last_45_days = rw.prediction_date - relativedelta(days=45)
    last_60_days = rw.prediction_date - relativedelta(days=60)
    last_15_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_15_days)].purchase_amount.sum()
    last_30_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_30_days)].purchase_amount.sum()
    last_45_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_45_days)].purchase_amount.sum()
    last_60_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_60_days)].purchase_amount.sum()
    last_15_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_15_days)].purchase_amount.count()
    last_30_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_30_days)].purchase_amount.count()
    last_45_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_45_days)].purchase_amount.count()
    last_60_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_60_days)].purchase_amount.count()
    res.append([rw.id,
                rw.prediction_date,
                dep_sum,
                dep_mean,
                dep_count,
                last_15_days_dep_amount,
                last_30_days_dep_amount,
                last_45_days_dep_amount,
                last_60_days_dep_amount,
                last_15_days_dep_count,
                last_30_days_dep_count,
                last_45_days_dep_count,
                last_60_days_dep_count])
output = pd.DataFrame(res, columns=["id",
                                    "prediction_date",
                                    "amount_sum",
                                    "amount_mean",
                                    "purchase_count",
                                    "last_15_days_dep_amount",
                                    "last_30_days_dep_amount",
                                    "last_45_days_dep_amount",
                                    "last_60_days_dep_amount",
                                    "last_15_days_dep_count",
                                    "last_30_days_dep_count",
                                    "last_45_days_dep_count",
                                    "last_60_days_dep_count"])
Try this:
# Merge prediction and purchase info for each customer, keeping only rows where
# purchase_date <= prediction_date.
# Depending on how big the two frames are, your computer may run out of memory.
df = (
    prediction_df.merge(purchase_info_df, on="id")
    .query("purchase_date <= prediction_date")
)
cols = ["id", "prediction_date"]

# For each customer on each prediction date, calculate some stats
stat0 = df.groupby(cols)["purchase_amount"].agg(["sum", "mean", "count"])

# Now calculate the stats within some time windows
stats = {}
for t in pd.to_timedelta([15, 30, 45, 60], unit="d"):
    stats[f"last_{t.days}_days"] = (
        df[df["purchase_date"] >= df["prediction_date"] - t]
        .groupby(cols)["purchase_amount"]
        .agg(["sum", "count"])
    )

# Combine the individual stats for the final result
result = (
    pd.concat([stat0, *stats.values()], keys=["all", *stats.keys()], axis=1)
    .fillna(0)
)

How can I filter by the dates I want with df.pipe in Python?

def get_campaign_ids():
    return df.pipe(alx.date_filter, alx.Dates(TODAY).this_week)['campaign id'].dropna().unique()

TODAY = dt.datetime.today()
start = TODAY - timedelta(days=TODAY.weekday())
end = TODAY - timedelta(days=TODAY.weekday()) + timedelta(days=6)
filtered_dates = (df['Date'] > start) & (df['Date'] <= end)
How can I replace this_week with the week I define by start:end? I have never worked with df.pipe before.
Thank you
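df.pipe(func, *args) simply calls func(df, *args), so one option is to write your own filter function that takes an explicit start and end instead of alx.Dates(TODAY).this_week. Since alx is internal to the question, the date_filter below is a hypothetical reimplementation that mirrors the boolean mask above:

```python
import pandas as pd

def date_filter(df, start, end):
    # Same mask as in the question, wrapped as a pipe-able function.
    return df[(df["Date"] > start) & (df["Date"] <= end)]

# Hypothetical sample data to demonstrate the pipe call.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"]),
    "campaign id": [1, 2, 2],
})

start = pd.Timestamp("2024-01-01")
end = pd.Timestamp("2024-01-05")

# Extra positional arguments to .pipe are forwarded to the function.
campaign_ids = df.pipe(date_filter, start, end)["campaign id"].dropna().unique()
```

Any start/end pair works here, so the week boundaries computed with timedelta in the question can be passed straight in.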

What's wrong with my conditions? Using np.where to flag my pandas dataframes

The function I am using keeps applying the red filter condition where it should not.
Here is the function I am using:
tolerance = 5

def rag(data):
    red_filter = ((data.SHIPMENT_MOT_x == 'VESSEL') &
                  ((data.latedeliverydate + pd.to_timedelta(tolerance, unit='D')) < data.m6p)) | \
                 ((data.SHIPMENT_MOT_x == 'AIR') & (data.latedeliverydate < data.m6p))
    green_filter = (data.SHIPMENT_MOT_x == 'VESSEL') & \
                   (data.M6_proposed == data.m6p) & \
                   ((data.latedeliverydate + pd.to_timedelta(tolerance, unit='D')) >= data.m6p) | \
                   ((data.SHIPMENT_MOT_x == 'AIR') & (data.latedeliverydate >= data.m6p))
    amber_filter = (data.SHIPMENT_MOT_x == 'VESSEL') & \
                   (data.M6_proposed != data.m6p) & \
                   ((data.latedeliverydate + pd.to_timedelta(tolerance, unit='D')) >= data.m6p) | \
                   ((data.SHIPMENT_MOT_x == 'AIR') & (data.latedeliverydate >= data.m6p))
    data['RAG'] = np.where(green_filter, 'G', np.where(amber_filter, 'A', np.where(red_filter, 'R', '')))
Here is the solution, if you are interested.
np.where is useful, but I would not recommend it when there are multiple conditions:
# Presumably applied row-wise, e.g. data.apply(pmm_rag, axis=1)
def pmm_rag(data):
    if ((data.MOT == 'VESSEL') & ((data.m0p + pd.to_timedelta(tolerance, unit='D')) < data.m6p)) | ((data.SHIPMENT_MOT_x == 'AIR') & (data.m0p < data.m6p)):
        return 'R'
    elif (data.MOT == 'VESSEL') & (data.M6_proposed == data.m6p) & ((data.m0p + pd.to_timedelta(tolerance, unit='D')) >= data.m6p) | ((data.MOT == 'AIR') & (data.m0p >= data.m6p)):
        return 'G'
    elif (data.MOT == 'VESSEL') & (data.M6_proposed != data.m6p) & ((data.m0p + pd.to_timedelta(tolerance, unit='D')) >= data.m6p) | ((data.MOT == 'AIR') & (data.m0p >= data.m6p)):
        return 'A'
    else:
        return ''
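A vectorized alternative to both nested np.where and a row-wise apply is np.select, which takes the conditions in priority order and picks the first match per row; that makes it easier to see (and control) which flag wins when a row satisfies more than one filter. A sketch with hypothetical toy data, giving the red filter highest priority (reorder the lists to change that):

```python
import numpy as np
import pandas as pd

tolerance = 5

# Hypothetical toy data with the columns used in the question.
data = pd.DataFrame({
    "SHIPMENT_MOT_x": ["VESSEL", "AIR", "VESSEL"],
    "latedeliverydate": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-01-01"]),
    "m6p": pd.to_datetime(["2023-01-20", "2023-01-05", "2023-01-03"]),
    "M6_proposed": pd.to_datetime(["2023-01-03", "2023-01-05", "2023-01-03"]),
})

slack = data["latedeliverydate"] + pd.to_timedelta(tolerance, unit="D")

red = ((data["SHIPMENT_MOT_x"] == "VESSEL") & (slack < data["m6p"])) | \
      ((data["SHIPMENT_MOT_x"] == "AIR") & (data["latedeliverydate"] < data["m6p"]))
green = ((data["SHIPMENT_MOT_x"] == "VESSEL") & (data["M6_proposed"] == data["m6p"]) & (slack >= data["m6p"])) | \
        ((data["SHIPMENT_MOT_x"] == "AIR") & (data["latedeliverydate"] >= data["m6p"]))
amber = ((data["SHIPMENT_MOT_x"] == "VESSEL") & (data["M6_proposed"] != data["m6p"]) & (slack >= data["m6p"])) | \
        ((data["SHIPMENT_MOT_x"] == "AIR") & (data["latedeliverydate"] >= data["m6p"]))

# np.select evaluates the conditions in order, so red wins over green here.
data["RAG"] = np.select([red, green, amber], ["R", "G", "A"], default="")
```

Unlike the per-row function, this stays fully vectorized, so it scales to large frames without apply overhead.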

Efficient way to loop through Pandas dataframe rows

I am creating a population model featuring education.
I start with an initial picture of the population that gives the number of people in each age group (0 to 95) and at each level of education (0 - no education, to 6 - university).
This picture is treated as a column of a dataframe that will iteratively be populated for each new year as a forecast.
Populating it requires assumptions such as the mortality rate of each age group, and the enrollment and success rates of each education level, and so on.
The way I solved the problem is by adding a new column and iterating through the rows, using the value for age-1 from the previous year to compute the new value (e.g. the number of males aged 5 is the number of males aged 4 in the previous year, less the ones that died).
The problem with this solution is that iterating through pandas dataframe rows using for loops and .loc is very inefficient, and it takes a lot of time to compute the forecast:
def add_year_temp(pop_table, time,
                  old_year, new_year,
                  enrollment_rate_primary,
                  success_rate_primary,
                  enrollment_rate_1st_cycle,
                  success_rate_1st_cycle,
                  enrollment_rate_2nd_cycle,
                  success_rate_2nd_cycle,
                  enrollment_rate_3rd_cycle,
                  success_rate_3rd_cycle,
                  enrollment_rate_university,
                  success_rate_university,
                  mortality_rate_0_1,
                  mortality_rate_2_14,
                  mortality_rate_15_64,
                  mortality_rate_65,
                  mortality_mf_ratio,
                  enrollment_mf_ratio,
                  success_mf_ratio):
    temp_table = pop_table
    temp_table['year_ts'] = pd.to_datetime(temp_table[time])
    temp_table['lag'] = temp_table.groupby(['sex', 'schooling'])[old_year].shift(+1)
    temp_table = temp_table.fillna(0)
    for age in temp_table['age'].unique():
        for sex in temp_table['sex'].unique():
            mortality_mf_ratio_temp = 1
            enrollment_mf_ratio_temp = 1
            success_mf_ratio_temp = 1
            if sex == 'F':
                mortality_mf_ratio_temp = mortality_mf_ratio
                enrollment_mf_ratio_temp = enrollment_mf_ratio
                success_mf_ratio_temp = success_mf_ratio
            if age <= 1:
                for schooling in [0]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)]['lag']) \
                        * (1 - mortality_rate_0_1 * mortality_mf_ratio_temp)
            elif 1 < age <= 5:
                for schooling in [0]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)]['lag']) \
                        * (1 - mortality_rate_2_14 * mortality_mf_ratio_temp)
A lot of lines later, you can see how, for example, I define the people who finish high school and enter university:
            elif 15 < age <= 17:
                for schooling in [0, 1, 2, 3, 4]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age - 1)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)][old_year]) \
                        * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
            elif age == 18:
                for schooling in [0, 1, 2, 3, 4, 5]:
                    if schooling == 0:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)]['lag']) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 1:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == (age - 1))
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 2:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == (age - 1))
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 3:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == (age - 1))
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 4:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == (age - 1))
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp) \
                            * (1 - enrollment_rate_3rd_cycle * enrollment_mf_ratio_temp
                               * success_rate_3rd_cycle * success_mf_ratio_temp)
                    elif schooling == 5:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == (age - 1))
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling - 1)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp) \
                            * (enrollment_rate_3rd_cycle * enrollment_mf_ratio_temp
                               * success_rate_3rd_cycle * success_mf_ratio_temp)
And this continues for all age groups
As I said, it does work, but this is neither elegant nor fast...
Without a verifiable example (https://stackoverflow.com/help/mcve) it is hard to test, but you can either use apply:
temp_table['mortality_mf_ratio'] = temp_table.apply(lambda row: some_function_per_row(row), axis=1)
Or you could use np.where (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html):
temp_table['mortality_mf_ratio'] = np.where(temp_table['sex'] == 'F', 1, 0)
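Going a step further than the per-row ratio: most of the loop above multiplies the lagged population by a survival factor that depends only on the age bracket and sex, and that can be done in one vectorized pass with pd.cut. A sketch on a hypothetical simplified table (the enrollment terms for the transition ages would be handled the same way with extra boolean masks):

```python
import numpy as np
import pandas as pd

# Hypothetical simplified stand-in for temp_table.
temp_table = pd.DataFrame({
    "age": [0, 3, 20, 70],
    "sex": ["M", "F", "M", "F"],
    "lag": [100.0, 200.0, 300.0, 400.0],
})

mortality_rate_0_1, mortality_rate_2_14 = 0.01, 0.002
mortality_rate_15_64, mortality_rate_65 = 0.005, 0.05
mortality_mf_ratio = 0.9

# Map every age to the index of its mortality bracket in one step.
bracket = pd.cut(temp_table["age"], bins=[-1, 1, 14, 64, 200], labels=False)
rates = np.array([mortality_rate_0_1, mortality_rate_2_14,
                  mortality_rate_15_64, mortality_rate_65])
mortality = rates[bracket]

# Females get the mortality ratio applied, males a factor of 1.
mf = np.where(temp_table["sex"] == "F", mortality_mf_ratio, 1.0)
temp_table["lag"] = temp_table["lag"] * (1 - mortality * mf)
```

This replaces the whole age/sex double loop for the mortality part with four array operations, regardless of how many rows the table has.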

Concise way updating values based on column values

Background: I have a DataFrame whose values I need to update using some very specific conditions. The original implementation I inherited used a lot of nested if statements wrapped in a for loop, obfuscating what was going on. With readability primarily in mind, I rewrote it into this:
# Other Widgets
df.loc[(
    (df.product == 0) &
    (df.prod_type == 'OtherWidget') &
    (df.region == 'US')
), 'product'] = 5

# Supplier X - All clients
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'X')
), 'product'] = 6

# Supplier Y - Client A
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'A')
), 'product'] = 1

# Supplier Y - Client B
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'B')
), 'product'] = 3

# Supplier Y - Client C
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'C')
), 'product'] = 4
Problem: This works well, and makes the conditions clear (in my opinion), but I'm not entirely happy because it's taking up a lot of space. Is there any way to improve this from a readability/conciseness perspective?
Per EdChum's recommendation, I created a mask for the conditions. The code below goes a bit overboard in terms of masking, but it gives the general sense.
prod_0 = (df.product == 0)
ptype_OW = (df.prod_type == 'OtherWidget')
rgn_UKUS = (df.region.isin(['UK', 'US']))
rgn_US = (df.region == 'US')
supp_X = (df.supplier == 'X')
supp_Y = (df.supplier == 'Y')
clnt_A = (df.client == 'A')
clnt_B = (df.client == 'B')
clnt_C = (df.client == 'C')

df.loc[(prod_0 & ptype_OW & rgn_US), 'product'] = 5
df.loc[(prod_0 & rgn_UKUS & supp_X), 'product'] = 6
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_A), 'product'] = 1
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_B), 'product'] = 3
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_C), 'product'] = 4
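If the goal is mainly conciseness, np.select collapses the five updates into a single condition/choice table. Because it picks the first matching condition per row, it preserves the "first write wins" behaviour that the sequential df.loc updates get from re-checking product == 0. The data below is a hypothetical toy frame with the columns the question uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "product": [0, 0, 0, 0],
    "prod_type": ["OtherWidget", "Gadget", "Gadget", "Gadget"],
    "region": ["US", "UK", "US", "UK"],
    "supplier": ["Z", "X", "Y", "Y"],
    "client": ["A", "A", "A", "B"],
})

prod_0 = df["product"] == 0
rgn_ukus = df["region"].isin(["UK", "US"])

conditions = [
    prod_0 & (df["prod_type"] == "OtherWidget") & (df["region"] == "US"),
    prod_0 & rgn_ukus & (df["supplier"] == "X"),
    prod_0 & rgn_ukus & (df["supplier"] == "Y") & (df["client"] == "A"),
    prod_0 & rgn_ukus & (df["supplier"] == "Y") & (df["client"] == "B"),
    prod_0 & rgn_ukus & (df["supplier"] == "Y") & (df["client"] == "C"),
]
choices = [5, 6, 1, 3, 4]

# First matching condition wins; rows matching nothing keep their value.
df["product"] = np.select(conditions, choices, default=df["product"])
```

Each condition/choice pair sits on one line next to its value, which keeps the mapping readable as it grows.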
