Vectorized Solution to Iterrows - python

I have 2 dataframes: prediction_df and purchase_info_df. prediction_df contains a customer id and a prediction date. purchase_info_df contains a customer id, a purchase amount, and a purchase date. The dataframes are given below for a single customer.
import pandas as pd

customer_id = [1, 1, 1]
prediction_date = ["2022-12-30", "2022-11-30", "2022-10-30"]
purchase_date = ["2022-11-12", "2022-12-01", "2022-09-03"]
purchase_amount = [500, 300, 100]

prediction_df = pd.DataFrame({"id": customer_id, "prediction_date": prediction_date})
purchase_info_df = pd.DataFrame({"id": customer_id, "purchase_date": purchase_date, "purchase_amount": purchase_amount})

prediction_df["prediction_date"] = pd.to_datetime(prediction_df["prediction_date"])
purchase_info_df["purchase_date"] = pd.to_datetime(purchase_info_df["purchase_date"])
My aim is to create features as of the prediction_date, such as total purchase amount, mean purchase amount, purchase amount in the last month, etc. I can do this with the following code, which uses iterrows, but it is far too slow when I have over 100,000 customers. I am looking for a way to vectorize the operations in the code below so that it runs faster.
from dateutil.relativedelta import relativedelta
from tqdm import tqdm_notebook

res = []
for idx, rw in tqdm_notebook(prediction_df.iterrows(), total=prediction_df.shape[0]):
    dep_dat = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date)]
    dep_sum = dep_dat.purchase_amount.sum()
    dep_mean = dep_dat.purchase_amount.mean()
    dep_std = dep_dat.purchase_amount.std()
    dep_count = dep_dat.purchase_amount.count()
    last_15_days = rw.prediction_date - relativedelta(days=15)
    last_30_days = rw.prediction_date - relativedelta(days=30)
    last_45_days = rw.prediction_date - relativedelta(days=45)
    last_60_days = rw.prediction_date - relativedelta(days=60)
    last_15_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_15_days)].purchase_amount.sum()
    last_30_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_30_days)].purchase_amount.sum()
    last_45_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_45_days)].purchase_amount.sum()
    last_60_days_dep_amount = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_60_days)].purchase_amount.sum()
    last_15_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_15_days)].purchase_amount.count()
    last_30_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_30_days)].purchase_amount.count()
    last_45_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_45_days)].purchase_amount.count()
    last_60_days_dep_count = purchase_info_df[(purchase_info_df.id == rw.id) & (purchase_info_df.purchase_date <= rw.prediction_date) & (purchase_info_df.purchase_date >= last_60_days)].purchase_amount.count()
    res.append([rw.id,
                rw.prediction_date,
                dep_sum,
                dep_mean,
                dep_count,
                last_15_days_dep_amount,
                last_30_days_dep_amount,
                last_45_days_dep_amount,
                last_60_days_dep_amount,
                last_15_days_dep_count,
                last_30_days_dep_count,
                last_45_days_dep_count,
                last_60_days_dep_count])

output = pd.DataFrame(res, columns=["id",
                                    "prediction_date",
                                    "amount_sum",
                                    "amount_mean",
                                    "purchase_count",
                                    "last_15_days_dep_amount",
                                    "last_30_days_dep_amount",
                                    "last_45_days_dep_amount",
                                    "last_60_days_dep_amount",
                                    "last_15_days_dep_count",
                                    "last_30_days_dep_count",
                                    "last_45_days_dep_count",
                                    "last_60_days_dep_count"])

Try this:
# Merge prediction and purchase info for each customer, keeping only rows where
# purchase_date <= prediction_date.
# Depending on how big the two frames are, your computer may run out of memory.
df = (
    prediction_df.merge(purchase_info_df, on="id")
    .query("purchase_date <= prediction_date")
)

cols = ["id", "prediction_date"]

# For each customer on each prediction date, calculate some overall stats
stat0 = df.groupby(cols)["purchase_amount"].agg(["sum", "mean", "count"])

# Now calculate the stats within some time windows
stats = {}
for t in pd.to_timedelta([15, 30, 45, 60], unit="d"):
    stats[f"last_{t.days}_days"] = (
        df[df["purchase_date"] >= df["prediction_date"] - t]
        .groupby(cols)["purchase_amount"]
        .agg(["sum", "count"])
    )

# Combine the individual stats for the final result
result = (
    pd.concat([stat0, *stats.values()], keys=["all", *stats.keys()], axis=1)
    .fillna(0)
)
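The result carries two-level column names such as ("all", "mean") or ("last_30_days", "sum"). If you want flat column names like your original output, a small follow-up sketch:

# Flatten the (window, statistic) MultiIndex columns, e.g.
# ("last_30_days", "sum") -> "last_30_days_sum", then turn
# id and prediction_date back into regular columns.
result.columns = ["_".join(pair) for pair in result.columns]
result = result.reset_index()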

Related

Python Excel Calculations

I am not getting correct calculations for 3 columns I am trying to write to a data sheet for specific dates and times in time-series data. I want to calculate the difference between the closing price at various times and the closing price at the reference time, but for some reason I can't get correct output for the calculations.
This is the output from this code.
import pandas as pd
import os
import numpy as np
from openpyxl import Workbook

# Read the data into a Pandas DataFrame
directory_path = "C:/Users/bean/Desktop/_L"
os.chdir(directory_path)
book = Workbook()
book.remove(book.active)  # remove the first sheet
for file in os.listdir(directory_path):
    if file.endswith(".csv"):
        file_path = os.path.join(directory_path, file)
        df = pd.read_csv(file_path)
        # Create a new DataFrame for each file
        df_diff = df[['Date', 'CET', 'NA', 'UTC', 'Name', 'BOLLBU', 'BOLLBM', 'BOLLBL',
                      'VWAP', 'VWAPSD1U', 'VWAPSD1L', 'VWAPSD2U', 'VWAPSD2L', 'ATR', 'ATRMA']]
        df['Date'] = pd.to_datetime(df['Date'])
        df['CET'] = pd.to_datetime(df['Date'])
        df['UTC'] = pd.to_datetime(df['Date'])
        df['NA'] = pd.to_datetime(df['Date'])
        df_diff['Date'] = pd.to_datetime(df['Date'])
        df_diff['CET'] = pd.to_datetime(df['CET'])
        df_diff['UTC'] = pd.to_datetime(df['UTC'])
        df_diff['NA'] = pd.to_datetime(df['NA'])
        df_diff['Open'] = df['Open']
        df_diff['High'] = df['High']
        df_diff['Low'] = df['Low']
        df_diff['Close'] = df['Close']
        # Calculate the differences and add them as new columns
        df_diff['Open Diff'] = (df['Open'].shift(-1) - df['Open']) / df['Open'] * 100
        df_diff['High Diff'] = (df['High'].shift(-1) - df['High']) / df['High'] * 100
        df_diff['Low Diff'] = (df['Low'].shift(-1) - df['Low']) / df['Low'] * 100
        df_diff['Close Diff'] = (df['Close'].shift(-1) - df['Close']) / df['Close'] * 100
        # Slice out the rows for each time of day
        df_1635 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 35)].sort_values(by='Date', ascending=False)
        df_1625 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 25)].sort_values(by='Date', ascending=False)
        df_1620 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 20)].sort_values(by='Date', ascending=False)
        df_1615 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 15)].sort_values(by='Date', ascending=False)
        df_1610 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 10)].sort_values(by='Date', ascending=False)
        df_1605 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 5)].sort_values(by='Date', ascending=False)
        df_1600 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 0)].sort_values(by='Date', ascending=False)
        df_1545 = df[(df['Date'].dt.hour == 15) & (df['Date'].dt.minute == 45)].sort_values(by='Date', ascending=False)
        df_1530 = df[(df['Date'].dt.hour == 15) & (df['Date'].dt.minute == 30)].sort_values(by='Date', ascending=False)
        df_1500 = df[(df['Date'].dt.hour == 15) & (df['Date'].dt.minute == 0)].sort_values(by='Date', ascending=False)
        df_1445 = df[(df['Date'].dt.hour == 14) & (df['Date'].dt.minute == 45)].sort_values(by='Date', ascending=False)
        df_1430 = df[(df['Date'].dt.hour == 14) & (df['Date'].dt.minute == 30)].sort_values(by='Date', ascending=False)
        df_1400 = df[(df['Date'].dt.hour == 14) & (df['Date'].dt.minute == 0)].sort_values(by='Date', ascending=False)
        df_1330 = df[(df['Date'].dt.hour == 13) & (df['Date'].dt.minute == 30)].sort_values(by='Date', ascending=False)
        df_1300 = df[(df['Date'].dt.hour == 13) & (df['Date'].dt.minute == 0)].sort_values(by='Date', ascending=False)
        df_1230 = df[(df['Date'].dt.hour == 12) & (df['Date'].dt.minute == 30)].sort_values(by='Date', ascending=False)
        df_0800 = df[(df['Date'].dt.hour == 8) & (df['Date'].dt.minute == 0)].sort_values(by='Date', ascending=False)
        # Calculate difference between Close price of df_1635 and other DataFrames
        df_diff_1635_1625 = df_1635['Close'] - df_1625['Close']
        df_diff_1635_1620 = df_1635['Close'].subtract(df_1620['Close'])
        df_diff_1635_1615 = df_1635['Close'].subtract(df_1615['Close'])
        df_diff_1635_1610 = df_1635['Close'].subtract(df_1610['Close'])
        df_diff_1635_1605 = df_1635['Close'].subtract(df_1605['Close'])
        df_diff_1635_1600 = df_1635['Close'].subtract(df_1600['Close'])
        df_diff_1635_1545 = df_1635['Close'].subtract(df_1545['Close'])
        df_diff_1635_1530 = df_1635['Close'].subtract(df_1530['Close'])
        df_diff_1635_1500 = df_1635['Close'].subtract(df_1500['Close'])
        df_diff_1635_1445 = df_1635['Close'].subtract(df_1445['Close'])
        df_diff_1635_1430 = df_1635['Close'].subtract(df_1430['Close'])
        df_diff_1635_1400 = df_1635['Close'].subtract(df_1400['Close'])
        df_diff_1635_1330 = df_1635['Close'].subtract(df_1330['Close'])
        df_diff_1635_1300 = df_1635['Close'].subtract(df_1300['Close'])
        df_diff_1635_1230 = df_1635['Close'].subtract(df_1230['Close'])
        df_diff_1635_0800 = df_1635['Close'].subtract(df_0800['Close'])
        print(df_diff_1635_1625)
        # Add Difference, Percent_Diff, and U/D columns to each DataFrame
        df_1635['Difference'] = df_1635['Close'].subtract(df_1635['Close'].shift())
        df_1635['Percent_Diff'] = (df_1635['Difference'] / df_1635['Close']) * 100
        df_1635['U/D'] = np.where(df_1635['Difference'] > 0, 'U', 'D')
        df_1625['Difference'] = df_diff_1635_1625
        df_1625['Percent_Diff'] = (df_diff_1635_1625 / df_1635['Close']) * 100
        df_1625['U/D'] = np.where(df_1625['Percent_Diff'] > 0, 'U', 'D')
        print(df_1625.dtypes)
        df_1620['Difference'] = df_diff_1635_1620
        df_1620['Percent_Diff'] = (df_diff_1635_1620 / df_1635['Close']) * 100
        df_1620['U/D'] = np.where(df_1620['Percent_Diff'] > 0, 'U', 'D')
        df_1615['Difference'] = df_diff_1635_1615
        df_1615['Percent_Diff'] = (df_diff_1635_1615 / df_1635['Close']) * 100
        df_1615['U/D'] = np.where(df_1615['Percent_Diff'] > 0, 'U', 'D')
        df_1610['Difference'] = df_diff_1635_1610
        df_1610['Percent_Diff'] = (df_diff_1635_1610 / df_1635['Close']) * 100
        df_1610['U/D'] = np.where(df_1610['Percent_Diff'] > 0, 'U', 'D')
        df_1605['Difference'] = df_diff_1635_1605
        df_1605['Percent_Diff'] = (df_diff_1635_1605 / df_1635['Close']) * 100
        df_1605['U/D'] = np.where(df_1605['Percent_Diff'] > 0, 'U', 'D')
        df_1600['Difference'] = df_diff_1635_1600
        df_1600['Percent_Diff'] = (df_diff_1635_1600 / df_1635['Close']) * 100
        df_1600['U/D'] = np.where(df_1600['Percent_Diff'] > 0, 'U', 'D')
        df_1545['Difference'] = df_diff_1635_1545
        df_1545['Percent_Diff'] = (df_diff_1635_1545 / df_1635['Close']) * 100
        df_1545['U/D'] = np.where(df_1545['Percent_Diff'] > 0, 'U', 'D')
        df_1530['Percent_Diff'] = (df_diff_1635_1530 / df_1635['Close']) * 100  # note: no 'Difference' is set for df_1530
        df_1530['U/D'] = np.where(df_1530['Percent_Diff'] > 0, 'U', 'D')
        df_1500['Difference'] = df_diff_1635_1500
        df_1500['Percent_Diff'] = (df_diff_1635_1500 / df_1635['Close']) * 100
        df_1500['U/D'] = np.where(df_1500['Percent_Diff'] > 0, 'U', 'D')
        df_1445['Difference'] = df_diff_1635_1445
        df_1445['Percent_Diff'] = (df_diff_1635_1445 / df_1635['Close']) * 100
        df_1445['U/D'] = np.where(df_1445['Percent_Diff'] > 0, 'U', 'D')
        df_1430['Difference'] = df_diff_1635_1430
        df_1430['Percent_Diff'] = (df_diff_1635_1430 / df_1635['Close']) * 100
        df_1430['U/D'] = np.where(df_1430['Percent_Diff'] > 0, 'U', 'D')
        df_1400['Difference'] = df_diff_1635_1400
        df_1400['Percent_Diff'] = (df_diff_1635_1400 / df_1635['Close']) * 100
        df_1400['U/D'] = np.where(df_1400['Percent_Diff'] > 0, 'U', 'D')
        df_1330['Difference'] = df_diff_1635_1330
        df_1330['Percent_Diff'] = (df_diff_1635_1330 / df_1635['Close']) * 100
        df_1330['U/D'] = np.where(df_1330['Percent_Diff'] > 0, 'U', 'D')
        df_1300['Difference'] = df_diff_1635_1300
        df_1300['Percent_Diff'] = (df_diff_1635_1300 / df_1635['Close']) * 100
        df_1300['U/D'] = np.where(df_1300['Percent_Diff'] > 0, 'U', 'D')
        df_1230['Difference'] = df_diff_1635_1230
        df_1230['Percent_Diff'] = (df_diff_1635_1230 / df_1635['Close']) * 100
        df_1230['U/D'] = np.where(df_1230['Percent_Diff'] > 0, 'U', 'D')
        df_0800['Difference'] = df_diff_1635_0800
        df_0800['Percent_Diff'] = (df_diff_1635_0800 / df_1635['Close']) * 100
        df_0800['U/D'] = np.where(df_0800['Percent_Diff'] > 0, 'U', 'D')
        df_25 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 25)].sort_values(by='Date', ascending=False)
        df_35 = df[(df['Date'].dt.hour == 16) & (df['Date'].dt.minute == 35)].sort_values(by='Date', ascending=False)
        # Concat all results for each time into this sheet.
        df_35 = df_35[['Date', 'CET', 'NA', 'UTC', 'Name', 'Open', 'Open Diff', 'High', 'High Diff', 'Low', 'Low Diff',
                       'Close', 'Close Diff', 'BOLLBU', 'BOLLBM', 'BOLLBL', 'VWAP', 'VWAPSD1U', 'VWAPSD1L', 'VWAPSD2U',
                       'VWAPSD2L', 'ATR', 'ATRMA']]
        df_diff = df_diff.sort_values(by='Date', ascending=False)
        df_diff = df_diff[['Date', 'CET', 'NA', 'UTC', 'Name', 'Open', 'Open Diff', 'High', 'High Diff', 'Low', 'Low Diff',
                           'Close', 'Close Diff', 'BOLLBU', 'BOLLBM', 'BOLLBL', 'VWAP', 'VWAPSD1U', 'VWAPSD1L', 'VWAPSD2U',
                           'VWAPSD2L', 'ATR', 'ATRMA']]
        writer = pd.ExcelWriter(f'{file.split(".")[0]}.xlsx', engine='openpyxl')
        df_diff.to_excel(writer, sheet_name='df_diff', index=False, startrow=0)
        df_35.to_excel(writer, sheet_name='Sheet_35min', index=False)
        dataframes = [df_1625, df_1620, df_1615, df_1610, df_1605, df_1600, df_1545,
                      df_1530, df_1500, df_1445, df_1430, df_1400, df_1330, df_1300, df_1230, df_0800]
        for i, df in enumerate(dataframes):
            df.to_excel(writer, sheet_name=f"df_{i}", index=False)
        writer.save()
Essentially, the calculations under the for loop and for df_35 are not coming out correctly. How am I doing the operations wrong? The Date column is datetime, and I am selecting rows at those specific time values, so I don't understand why it doesn't work. I tried various calculation methods; here are a few that were still wrong.
Neither of these works:
df_diff_1635_1625 = df_1635['Close'] - df_1625['Close']
df_diff_1635_1620 = df_1635['Close'].subtract(df_1620['Close'])
All my columns are mostly float64, including the Close ones, except Date, which is datetime. When I check and print the calculation, I get NaN values, so it's clearly not processing it.
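A likely explanation for the NaN values, sketched below: subtracting two Series aligns them by index label, and after filtering, df_1635 and df_1625 keep their original, disjoint row labels, so no labels match and every result is NaN. Two possible ways around it, assuming each frame has one row per trading day and is sorted the same way:

# 1) Subtract positionally, ignoring the row labels (requires equal lengths).
diff = df_1635['Close'].values - df_1625['Close'].values

# 2) Align on the calendar date instead of the original row label.
close_1635 = df_1635.set_index(df_1635['Date'].dt.date)['Close']
close_1625 = df_1625.set_index(df_1625['Date'].dt.date)['Close']
diff_by_day = close_1635 - close_1625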

Expanding Records Based On Date Range Pandas

I am attempting to expand the records in a data frame between two dates. Given an input file with a single entry for each record, I want to expand it based on the given dates.
Here is an example of the input:
Here is an example of the desired expanded output:
Based on some other examples and documentation online, what I attempted was to expand the data frame on a 6-month frequency to get two records for each year, then correct the dates based on the birthday of each record, using a counter to determine the split for before and after the birthday.
df_expand['DATE'] = [pd.date_range(s, e, freq='6M')
                     for s, e in zip(pd.to_datetime(df_expand['Exposure Start']),
                                     pd.to_datetime(df_expand['Exposure Stop']))]
df_expand = df_expand.explode('DATE').drop(['Exposure Start', 'Exposure Stop'], axis=1)
df_merged['counter'] = range(len(df_merged))
df_merged['start end'] = np.where(df_merged['counter'] % 2 != 0, 1, 0)
df_merged['DoB Year'] = df_merged['DoB'].dt.year
df_merged['DoB Month'] = df_merged['DoB'].dt.month
df_merged['DoB Day'] = df_merged['DoB'].dt.day
df_merged.loc[df_merged['start end'] == 0, 'Exposure Start'] = '1/1/' + df_merged['Calendar Year'].astype(str)
df_merged.loc[df_merged['start end'] == 1, 'Exposure Start'] = df_merged['DoB Month'].astype(str) + '/' + (df_merged['DoB Day'].astype(int) + 1).astype(str) + '/' + df_merged['Calendar Year'].astype(str)
df_merged.loc[df_merged['start end'] == 0, 'Exposure Stop'] = df_merged['DoB Month'].astype(str) + '/' + df_merged['DoB Day'].astype(str) + '/' + df_merged['Calendar Year'].astype(str)
df_merged.loc[df_merged['start end'] == 1, 'Exposure Stop'] = '12/31/' + df_merged['Calendar Year'].astype(str)
This solution is clearly not elegant, and while it worked originally for my proof of concept, it is now running into issues with edge cases involving rules for the Exposure Start.
Study years are split into 2 separate periods, around the record's birthday.
The initial exposure begins 1/1 of the study year (or, the date that the record enters the study, whichever comes later) and goes through the day before the birthday (or non-death exit date, if that comes sooner).
The 2nd period goes from the birthday to the end of the calendar year (or non-death exit date, if that comes sooner). Where a death is observed, exposure is continued through the next birthday.
An iterative solution is probably better suited, but this was the documentation and guidance I received.
df_merged = pd.read_excel("inputdatawithtestcase.xlsx")
df_merged['DATE'] = [pd.date_range(s, e, freq='6M')
                     for s, e in zip(pd.to_datetime(df_merged['Exposure Start']),
                                     pd.to_datetime(df_merged['Exposure Stop']))]
df_merged = df_merged.explode('DATE')
df_merged['counter'] = range(len(df_merged))
df_merged['start end'] = np.where(df_merged['counter'] % 2 != 0, 1, 0)
df_merged['DoB Year'] = df_merged['DoB'].dt.year
df_merged['DoB Month'] = df_merged['DoB'].dt.month
df_merged['DoB Day'] = df_merged['DoB'].dt.day
df_merged = df_merged.reset_index()
df_merged["Exposure Start month"] = df_merged["Exposure Start"].dt.month
df_merged["Exposure Start day"] = df_merged["Exposure Start"].dt.day
df_merged["new_perfect_year"] = df_merged["DATE"].dt.year
df_merged["start end"].loc[3]

Last_column = []
second_last_column = []
for a in range(len(df_merged)):
    if a >= 1:
        if df_merged["DoB Year"].loc[a] != match_date:
            count = 0
    if (df_merged["Exposure Start day"].loc[a] == 1) & (df_merged["Exposure Start month"].loc[a] == 1):
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Record ID'].loc[a]) + '/1/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/16/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
        else:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Exposure Start day'].loc[a]) + '/1/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/16/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
    elif count == 0:
        date = str(df_merged['Exposure Start month'].loc[a]) + "/" + str(df_merged['Exposure Start day'].loc[a]) + "/" + str(df_merged['new_perfect_year'].loc[a])
        Last_column.append(date)
        count = count + 1
    else:
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Record ID'].loc[a]) + '/16/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/1/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
        else:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Record ID'].loc[a]) + '/16/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/1/' + str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
    match_date = df_merged["DoB Year"].loc[a]

for a in range(len(df_merged)):
    if (df_merged["Exposure Start day"].loc[a] == 1) & (df_merged["Exposure Start month"].loc[a] == 1):
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Record ID'].loc[a]) + '/15/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = '12' + '/31/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
        else:
            if df_merged["start end"].loc[a] == 0:
                date = str(df_merged['Exposure Start day'].loc[a]) + '/15/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = '12' + '/31/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
    else:
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a] == 0:
                date = '12' + '/31/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/15/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
        else:
            if df_merged["start end"].loc[a] == 0:
                date = '12' + '/31/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/15/' + str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
    match_date = df_merged["DoB Year"].loc[a]

last = pd.DataFrame(Last_column, columns=["Last column"])
last_2 = pd.DataFrame(second_last_column, columns=["Second Last column"])
final_df = pd.concat([df_merged, last], axis=1)
final_df = pd.concat([final_df, last_2], axis=1)
final_df
final_df = final_df[["Record ID", "DoB", "Exposure Start", "Last column", "Second Last column"]]
final_df.to_csv("name_final_this_first.csv")
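For reference, here is a minimal sketch of the two-periods-per-study-year rule described above, assuming hypothetical datetime columns DoB, Exposure Start, and Exposure Stop on each row; the death rule is deliberately left out:

import pandas as pd

def split_exposure(row):
    # Split [Exposure Start, Exposure Stop] into per-year periods,
    # breaking each calendar year around the record's birthday.
    periods = []
    for year in range(row['Exposure Start'].year, row['Exposure Stop'].year + 1):
        bday = row['DoB'].replace(year=year)  # caveat: raises for Feb 29 DoBs in non-leap years
        # Period 1: Jan 1 (or study entry, if later) through the day before the birthday (or exit)
        start1 = max(pd.Timestamp(year, 1, 1), row['Exposure Start'])
        stop1 = min(bday - pd.Timedelta(days=1), row['Exposure Stop'])
        if start1 <= stop1:
            periods.append((start1, stop1))
        # Period 2: the birthday through Dec 31 (or exit, if sooner)
        start2 = max(bday, row['Exposure Start'])
        stop2 = min(pd.Timestamp(year, 12, 31), row['Exposure Stop'])
        if start2 <= stop2:
            periods.append((start2, stop2))
    return periods

Applying this per row with df.apply and exploding the result gives exact period boundaries instead of the 6-month date_range approximation.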

Update a function in Python where the first two columns don't exist

I have created a function which checks three columns and applies the conditions I have mentioned in the function. I have set the first column (col0) as None. This is what my columns look like:
rule_id  col0  col1  col2
50378    2     0     0
50402    12    9     6
52879    0     4     3
Here the 'rule_id' column is the index.
This is my code:
for i, j in dframe.groupby('tx_id'):
    df1 = pd.DataFrame(j)
    df = df1.pivot_table(index='rule_id', columns=['date'], values='rid_fc',
                         aggfunc=np.sum, fill_value=0)
    coeff = df.T
    # compute the coefficients
    for name, s in coeff.items():
        top = 100  # start at 100
        r = []
        for i, v in enumerate(s):
            if v == 0:  # reset to 100 on a 0 value
                top = 100
            else:
                top = top / 2  # else halve the previous value
            r.append(top)
        coeff.loc[:, name] = r  # set the whole column in one operation
    # transpose back to have a companion dataframe for df
    coeff = coeff.T

    def build_comp(col1, col2, i, col0=None):
        conditions = [(df[col1] == 0) & (df[col2] == 0),
                      (df[col1] == df[col2]),
                      (df[col1] != 0) & (df[col2] != 0) & (df[col1] > df[col2]),
                      (df[col1] != 0) & (df[col2] != 0) & (df[col1] < df[col2]),
                      (df[col1] != 0) & (df[col2] == 0)]
        choices = [np.nan,
                   coeff[col1],
                   df[col2] / df[col1] * coeff[col1],
                   df[col2] / df[col1] * coeff[col1],
                   100]
        condition = [(df[col2] != 0), (df[col2] == 0)]
        choice = [100, np.nan]
        if col0 is not None:
            conditions.insert(1, (df[col1] != 0) & (df[col2] == 0) & (df[col0] != 0))
            choices.insert(1, 25)
            condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
            choice.insert(0, 25)
        if col0 is None:
            condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
            choice.insert(0, 25)
        df['comp{}'.format(i)] = np.select(conditions, choices, default=np.nan)
        df['comp{}'.format(i + 1)] = np.select(condition, choice)

    col_ref = None
    col_prev = df.columns[0]
    for i, col in enumerate(df.columns[1:], 1):
        build_comp(col_prev, col, i, col_ref)
        col_ref = col_prev
        col_prev = col
    if len(df.columns) == 1:
        df['comp1'] = [100] * len(df)
'df' is the dataframe which has these columns. There are multiple conditions involved in this function, as you can see. I want to add one more, for the case where both col0 and col1 are None, but I don't know how. I tried adding a condition inside if col0 is None: like:
if col1 is None:
    conditions.insert(0, (df[col2] != 0))
    choices.insert(0, 100)
But it's not working. Suppose I have only one column (col2) and both col0 and col1 are not there; then the result should be like this, as per my condition:
rule_id  col2  comp1
50378    2     100
51183    3     100
But the comp column is not getting created. If you could help me achieve that, I'd greatly appreciate it.
Current code (Edit): after using the code @Joël suggested, I made the alterations. This is the code:
def build_comp(col2, i, col0=None, col1=None):
    conditions = [(df[col1] == df[col2]) & (df[col1] != 0) & (df[col2] != 0),
                  (df[col1] != 0) & (df[col2] != 0) & (df[col1] > df[col2]),
                  (df[col1] != 0) & (df[col2] != 0) & (df[col1] < df[col2]),
                  (df[col1] != 0) & (df[col2] == 0)]
    choices = [50, df[col2] / df[col1] * 50, df[col2] / df[col1] * 25, 100]
    condition = [(df[col2] != 0), (df[col2] == 0)]
    choice = [100, np.nan]
    if col0 is not None:
        conditions.insert(1, (df[col1] != 0) & (df[col2] == 0) & (df[col0] != 0))
        choices.insert(1, 25)
        condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
        choice.insert(0, 25)
    else:
        condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
        choice.insert(0, 25)
    if col1 is None:
        conditions.insert(0, (df[col2] != 0))
        choices.insert(0, 100)
        conditions.insert(0, (df[col2] == 0))
        choices.insert(0, np.nan)
    df['comp{}'.format(i)] = np.select(conditions, choices, default=np.nan)
    df['comp{}'.format(i + 1)] = np.select(condition, choice)

col_ref = None
col_prev = df.columns[0]
for i, col in enumerate(df.columns[1:], 1):
    build_comp(col, i, col_ref, col_prev)
    col_ref = col_prev
    col_prev = col
When I run this code, I am still not getting the comp column. This is what I am getting:
rule_id  col2
50378    2
51183    3
But I should get this as per my logic:
rule_id  col2  comp1
50378    2     100
51183    3     100
I know there is something wrong with the for loop and the col_prev logic, but I don't know what.
Edit: For more simplification, this is what my df looks like:
This is what my df looks like after applying my code:
But now suppose only one timestamp column is present, such as this:
Then I want the result to be this:
date     2018-12-11 13:41:51  comp1
rule_id
51183    1                    100
52368    1                    100
When df has a single column, the for loop gets skipped (i.e. the code in the loop does not get executed).
In order to add a column for the case where df has a single column, add the following code at the end:
if len(df.columns) == 1:
    df['comp1'] = [100] * len(df)
This assumes that rule_id is the row label. If not, then compare with 2 instead of 1.
Your condition testing whether col1 is None is exactly the same as for col0; it is about setting a default value for col1 so that it does not have to be provided. Your code should therefore be something like this:
def build_comp(col2, i, col0=None, col1=None):  # <== changing here
    if col1 is not None:  # we can compare  <== EDITED HERE
        conditions = [(df[col1] == 0) & (df[col2] == 0),
                      (df[col1] == df[col2]),
                      (df[col1] != 0) & (df[col2] != 0) & (df[col1] > df[col2]),
                      (df[col1] != 0) & (df[col2] != 0) & (df[col1] < df[col2]),
                      (df[col1] != 0) & (df[col2] == 0)]
        choices = [np.nan,
                   50,
                   df[col2] / df[col1] * 50,
                   df[col2] / df[col1] * 25,
                   100]
        condition = [(df[col2] != 0),
                     (df[col2] == 0)]
        choice = [100,
                  np.nan]
        if col0 is not None:
            conditions.insert(1, (df[col1] != 0) & (df[col2] == 0) & (df[col0] != 0))
            choices.insert(1, 50)
            condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
            choice.insert(0, 25)
        else:  # if col0 is None:  # <== use `else` instead of testing the opposite
            condition.insert(0, (df[col2] != 0) & (df[col1] != 0))
            choice.insert(0, 25)
    df['comp{}'.format(i)] = np.select(conditions, choices, default=np.nan)
    df['comp{}'.format(i + 1)] = np.select(condition, choice)
Beware: you use choices and choice for different things, which is not helping you.
Why are you using None? IMO it's better to use NaN.

making a Pandas while loop faster

I have a loop which runs through a data frame A of 30,000 rows, updates another data frame B, and uses data frame B for further iterations. It's taking too much time and I want to make it faster. Any ideas?
for x in range(0, dataframeA.shape[0]):
    AuthorizationID_Temp = dataframeA["AuthorizationID"].iloc[x]
    Auth_BeginDate = dataframeA["BeginDate"].iloc[x]
    Auth_EndDate = dataframeA["EndDate"].iloc[x]
    BeginDate_Temp = pd.to_datetime(Auth_BeginDate).date()
    ScriptsFlag = dataframeA["ScriptsFlag"].iloc[x]
    Legacy_PlacementID = dataframeA["Legacy_PlacementID"].iloc[x]
    Legacy_AncillaryServicesID = dataframeA["Legacy_AncillaryServicesID"].iloc[x]
    ProviderID_Temp = dataframeA["ProviderID"].iloc[x]
    SRSProcode_Temp = dataframeA["SRSProcode"].iloc[x]
    Rate_Temp = dataframeA["Rate"].iloc[x]
    Scripts2["BeginDate1_SC"] = pd.to_datetime(Scripts2["BeginDate_SC"]).dt.date
    Scripts2["EndDate1_SC"] = pd.to_datetime(Scripts2["EndDate_SC"]).dt.date
    # BeginDate_Temp = BeginDate_Temp.date()
    # EndDate_Temp = EndDate_Temp.date()
    Scripts_New_Modified1 = Scripts2.loc[
        ((Scripts2["ScriptsFlag_SC"].isin(["N", "M"])) & (Scripts2["AuthorizationID_SC"] == AuthorizationID_Temp))
        & ((Scripts2["ProviderID_SC"] == ProviderID_Temp) & (Scripts2["SRSProcode_SC"] == SRSProcode_Temp)),
        :,
    ]
    Scripts_New_Modified = Scripts_New_Modified1.loc[
        (Scripts_New_Modified1["BeginDate1_SC"] == BeginDate_Temp)
        & ((Scripts_New_Modified1["EndDate1_SC"] == EndDate_Temp) & (Scripts_New_Modified1["Rate_SC"] == Rate_Temp)),
        "AuthorizationID_SC",
    ]
    if ScriptsFlag == "M":
        if Legacy_PlacementID is not None:
            InsertA = insertA(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
            dataframeB = dataframeB.append(InsertA)
            print("ScriptsTemp6 shape is {}".format(dataframeB.shape))
        # else:
        #     ScriptsTemp6 = ScriptsTemp5.copy()
        #     print('ScriptsTemp6 shape is {}'.format(ScriptsTemp6.shape))
        if Legacy_AncillaryServicesID is not None:
            InsertB = insertB(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
            dataframeB = dataframeB.append(InsertB)
            print("ScriptsTemp7 shape is {}".format(dataframeB.shape))
    dataframe_New = dataframeB.loc[
        ((dataframeB["ScriptsFlag"] == "N") & (dataframeB["AuthorizationID"] == AuthorizationID_Temp))
        & ((dataframeB["ProviderID"] == ProviderID_Temp) & (dataframeB["SRSProcode"] == SRSProcode_Temp)),
        :,
    ]
    dataframe_New1 = dataframe_New.loc[
        (pd.to_datetime(dataframe_New["BeginDate"]).dt.date == BeginDate_Temp)
        & ((pd.to_datetime(dataframe_New["EndDate"]).dt.date == EndDate_Temp_DO) & (dataframe_New["Rate"] == Rate_Temp)),
        "AuthorizationID",
    ]
    # PLAATN = dataframeA.copy()
    Insert1 = insert1(dataframe_New1, BeginDate_Temp, AuthorizationID_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
    if Insert1.shape[0] > 0:
        dataframeB = dataframeB.append(Insert1.iloc[0])
    # else:
    #     ScriptsTemp8 = ScriptsTemp7
    print("ScriptsTemp8 shape is {}".format(dataframeB.shape))
    dataframe_modified1 = dataframeB.loc[
        ((dataframeB["ScriptsFlag"] == "M") & (dataframeB["AuthorizationID"] == AuthorizationID_Temp))
        & ((dataframeB["ProviderID"] == ProviderID_Temp) & (dataframeB["SRSProcode"] == SRSProcode_Temp)),
        :,
    ]
    dataframe_modified = dataframe_modified1.loc[
        (dataframe_modified1["BeginDate"] == BeginDate_Temp)
        & ((dataframe_modified1["EndDate"] == EndDate_Temp_DO) & (dataframe_modified1["Rate"] == Rate_Temp)),
        "AuthorizationID",
    ]
    Insert2 = insert2(
        dataframe_modified,
        Scripts_New_Modified,
        AuthorizationID_Temp,
        BeginDate_Temp,
        EndDate_Temp,
        Units_Temp,
        EndDate_Temp_DO,
    )
    if Insert2.shape[0] > 0:
        dataframeB = dataframeB.append(Insert2.iloc[0])
dataframeA has 30,000 rows.
dataframeB gets new rows inserted from dataframeA on every iteration (30,000 iterations).
The updated dataframeB is used in the middle of each iteration for the filtering conditions.
insertA and insertB are two functions which do additional filtering.
It takes too much time to run for 30,000 rows, so please provide suggestions for making the loop faster in terms of execution time.
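One pattern that usually helps, sketched below under the assumption that the mid-loop filters can be adapted: DataFrame.append copies every existing row on each call (quadratic overall, and it was removed entirely in pandas 2.0), so collect the inserted frames in a plain list and concatenate once at the end. If a filter really must see rows added earlier in the loop, it can scan the accumulated list as well.

import pandas as pd

pieces = [dataframeB]
for x in range(dataframeA.shape[0]):
    # ... same per-row logic as above, but instead of
    # dataframeB = dataframeB.append(InsertA), collect the pieces:
    # pieces.append(InsertA)
    # pieces.append(InsertB)
    pass
dataframeB = pd.concat(pieces, ignore_index=True)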

Concise way of updating values based on column values

Background: I have a DataFrame whose values I need to update using some very specific conditions. The original implementation I inherited used a lot of nested if statements wrapped in a for loop, obfuscating what was going on. With readability primarily in mind, I rewrote it into this:
# Other Widgets
df.loc[(
    (df.product == 0) &
    (df.prod_type == 'OtherWidget') &
    (df.region == 'US')
), 'product'] = 5

# Supplier X - All clients
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'X')
), 'product'] = 6

# Supplier Y - Client A
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'A')
), 'product'] = 1

# Supplier Y - Client B
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'B')
), 'product'] = 3

# Supplier Y - Client C
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'C')
), 'product'] = 4
Problem: This works well and makes the conditions clear (in my opinion), but I'm not entirely happy because it takes up a lot of space. Is there any way to improve this from a readability/conciseness perspective?
Per EdChum's recommendation, I created a mask for the conditions. The code below goes a bit overboard in terms of masking, but it gives the general sense.
prod_0 = (df.product == 0)
ptype_OW = (df.prod_type == 'OtherWidget')
rgn_UKUS = (df.region.isin(['UK', 'US']))
rgn_US = (df.region == 'US')
supp_X = (df.supplier == 'X')
supp_Y = (df.supplier == 'Y')
clnt_A = (df.client == 'A')
clnt_B = (df.client == 'B')
clnt_C = (df.client == 'C')
df.loc[(prod_0 & ptype_OW & rgn_US), 'product'] = 5
df.loc[(prod_0 & rgn_UKUS & supp_X), 'product'] = 6
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_A), 'product'] = 1
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_B), 'product'] = 3
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_C), 'product'] = 4
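Another option, for what it's worth: because every rule guards on product == 0 and writes to the same column, the whole cascade collapses into a single np.select, where the first matching condition wins, which reproduces the sequential behaviour above. A sketch assuming the same columns:

import numpy as np

prod_0 = df['product'].eq(0)
in_ukus = df['region'].isin(['UK', 'US'])

conditions = [
    prod_0 & df['prod_type'].eq('OtherWidget') & df['region'].eq('US'),
    prod_0 & in_ukus & df['supplier'].eq('X'),
    prod_0 & in_ukus & df['supplier'].eq('Y') & df['client'].eq('A'),
    prod_0 & in_ukus & df['supplier'].eq('Y') & df['client'].eq('B'),
    prod_0 & in_ukus & df['supplier'].eq('Y') & df['client'].eq('C'),
]
choices = [5, 6, 1, 3, 4]

# Rows that match no rule keep their existing product value.
df['product'] = np.select(conditions, choices, default=df['product'])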
