I'm using pandas for high-performance calculations. The function below takes 1 loop, best of 5: 7.24 s per loop for 50,000 rows.
I have to scale it to 1 million rows.
How can I vectorise the function and apply it to all rows so that overall performance improves?
from datetime import datetime

def weightedFlowAmt(startDate, endDate, tradeDate, tradeAmt):
    startInDays = datetime.strptime(startDate, "%Y-%m-%d")
    endInDays = datetime.strptime(endDate, "%Y-%m-%d")
    tradeInDays = datetime.strptime(tradeDate, "%Y-%m-%d")
    differenceTradeAndEnd = abs((endInDays - tradeInDays).days)
    differenceStartAndEnd = abs((endInDays - startInDays).days)
    weighted_FlowAmt = (tradeAmt * differenceTradeAndEnd) / differenceStartAndEnd
    return weighted_FlowAmt
mutatedCashFlow['flow'] = mutatedCashFlow.apply(
    lambda row: weightedFlowAmt(row['startDate'], row['EndDate'],
                                row['tradeDate'], row['tradeAmount']),
    axis=1)
I think you can remove the apply and use vectorized operations:
mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])
diffTradeAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['startDate']).dt.days.abs()
mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount'] * diffTradeAndEnd) / diffStartAndEnd
Alternative:
mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])
diffTradeAndEnd = mutatedCashFlow['EndDate'].sub(mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd = mutatedCashFlow['EndDate'].sub(mutatedCashFlow['startDate']).dt.days.abs()
mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount'].mul(diffTradeAndEnd)
                                                         .div(diffStartAndEnd))
print(mutatedCashFlow)
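To sanity-check the speedup at the target scale, here is a rough, hedged benchmark sketch; the synthetic data and column names below are assumptions made to mirror the question, and the dates are generated directly as datetimes so the pd.to_datetime step is skipped:
import time
import numpy as np
import pandas as pd

# build a synthetic 1,000,000-row frame with the question's column names
n = 1_000_000
rng = np.random.default_rng(0)
start = pd.Timestamp('2020-01-01') + pd.to_timedelta(rng.integers(0, 365, n), unit='D')
mutatedCashFlow = pd.DataFrame({
    'startDate': start,
    'EndDate': start + pd.to_timedelta(rng.integers(30, 400, n), unit='D'),
    'tradeDate': start + pd.to_timedelta(rng.integers(0, 30, n), unit='D'),
    'tradeAmount': rng.random(n) * 1000,
})

# time only the vectorised computation (the apply version would take minutes at this size)
t0 = time.time()
diffTradeAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['startDate']).dt.days.abs()
mutatedCashFlow['flow'] = mutatedCashFlow['tradeAmount'] * diffTradeAndEnd / diffStartAndEnd
print(f"vectorised, 1M rows: {time.time() - t0:.2f}s")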
Related
I have a dataframe on which I want to run a function that checks whether the current value is a relative maximum, i.e. whether the previous 'n' values are lower than the current value.
Given a dataframe 'df_data':
temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=['high'])
First I create a function that checks those conditions:
def get_max(high, rolling_max, prev, post):
    if (high > prev) & (high > post) & (high > rolling_max):
        return 1
    else:
        return 0
df_data['rolling_max'] = df_data.high.rolling(n).max().shift()
Then I apply the previous condition row-wise:
df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],
                                                   df_data['high'].shift(1), df_data['high'].shift(-1)),
                                 axis=1)
The problem is that I always get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This comes from applying the boolean condition in the 'get_max' function to a Series rather than to scalars.
I would love to have a vectorized solution, not one that uses loops.
Try:
df_data['ismax'] = ((df_data['high'].gt(df_data['high'].rolling(n).max().shift()))
                    & (df_data['high'].gt(df_data['high'].shift(1)))
                    & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
The error is occurring because you are sending the entire Series (the whole column) to your get_max function rather than operating row-wise. Creating new columns for the shifted "prev" and "post" values and then using df.apply(func, axis=1) normally would work fine here.
As you have hinted, though, that solution is quite inefficient, and looping through every row becomes much slower as your dataframe grows.
On my computer, the code below gives:
LIST_MULTIPLIER = 1: vectorised code 0.29 s, row-wise code 0.38 s
LIST_MULTIPLIER = 100: vectorised code 0.31 s, row-wise code 13.27 s
In general, therefore, it is best to avoid df.apply(..., axis=1), as you can almost always find a faster solution using vectorised logical operators.
import pandas as pd
from datetime import datetime
LIST_MULTIPLIER = 100
ITERATIONS = 100
def get_dataframe():
    temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
                 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
                 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
    df = pd.DataFrame(temp_list)
    df.columns = ['high']
    return df
df_original = get_dataframe()
t1 = datetime.now()
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    mask_prev = df['high'] > df['high_prev']
    mask_post = df['high'] > df['high_post']
    mask_rolling = df['high'] > df['rolling_max']
    mask_max = mask_prev & mask_post & mask_rolling
    df['ismax'] = 0
    df.loc[mask_max, 'ismax'] = 1
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()
t3 = datetime.now()
def get_max_rowwise(row):
    if ((row.high > row.high_prev) &
            (row.high > row.high_post) &
            (row.high > row.rolling_max)):
        return 1
    else:
        return 0
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    df['ismax'] = df.apply(get_max_rowwise, axis=1)
t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()
Could someone point out what I did wrong with the following dask implementation, since it doesn't seem to use multiple cores?
[ Updated with reproducible code]
The code that uses dask :
import time
import dask
import numpy as np
import pandas as pd

bookingID = np.arange(1, 10000)
book_data = pd.DataFrame(np.random.rand(1000))
def calculate_feature_stats(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row
calculate_feature_stats = dask.delayed(calculate_feature_stats)
rows = []
for bookid in bookingID.tolist():
    row = calculate_feature_stats(bookid)
    rows.append(row)
start = time.time()
rows = dask.persist(*rows)
end = time.time()
print(end - start)  # Execution time = 16 s on my machine
The normal implementation without dask:
bookingID = np.arange(1,10000)
book_data = pd.DataFrame(np.random.rand(1000))
def calculate_feature_stats_normal(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row
rows = []
start = time.time()
for bookid in bookingID.tolist():
    row = calculate_feature_stats_normal(bookid)
    rows.append(row)
end = time.time()
print(end - start)  # Execution time = 4 s on my machine
So the version without dask is actually faster. How is that possible?
Answer
Extended comment. You should consider that dask adds roughly 1 ms of overhead per task (see the docs), so if your computation is shorter than that, dask isn't worth the trouble.
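To make that overhead concrete, here is a rough sketch (numbers will vary by machine) that times 10,000 essentially empty delayed tasks; it is only meant to illustrate where the 16 s in the question is likely going:
import time
import dask

@dask.delayed
def tiny_task(x):
    # does essentially no work, so the elapsed time is almost pure scheduler overhead
    return x

start = time.time()
dask.compute(*[tiny_task(i) for i in range(10_000)])
# at roughly 1 ms per task, expect on the order of 10 s of overhead here
print(f"{time.time() - start:.1f}s for 10,000 trivial tasks")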
Coming to your specific question, I can think of two possible real-world scenarios:
1. A big dataframe with a bookingID column and a value column
2. A different file for every bookingID
In the second case you can start from this answer, while for the first case you can proceed as follows:
import dask.dataframe as dd
import numpy as np
import pandas as pd
# create dummy df
df = []
for i in range(10_000):
    df.append(pd.DataFrame({"id": i,
                            "value": np.random.rand(1000)}))
df = pd.concat(df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.to_parquet("df.parq")
Pandas
%%time
df = pd.read_parquet("df.parq")
out = df.groupby("id").agg({"value": ["min", "max", "std", "mean"]})
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 1.65 s, sys: 316 ms, total: 1.96 s
Wall time: 1.08 s
Dask
%%time
df = dd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":["min", "max", "std", "mean"]}).compute()
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 4.94 s, sys: 427 ms, total: 5.36 s
Wall time: 3.94 s
Final thoughts
In this situation dask only starts to make sense if the dataframe doesn't fit in memory.
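If you also want to confirm whether work is actually being spread across cores, one option (not required for the groupby above, and purely illustrative) is to start a local distributed client and watch its dashboard:
from dask.distributed import Client

# a local cluster starts one worker process per core by default;
# the dashboard link lets you watch task activity live
client = Client()
print(client.dashboard_link)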
I have a script similar to this:
import random
import pandas as pd
FA = []
FB = []
Value = []
df = pd.DataFrame()
df_save = pd.DataFrame(index=['min','max'])
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
numbers = list(range(24)) # FA.unique()
# mix = pairwise combination of days and numbers, i.e. 0Monday, 0Tuesday, ... 1Monday, 1Tuesday, ...
# (I don't know how to build this combination, by the way)
def Calculus():
    global min, max
    min = df['Value'][boolean].min()
    max = df['Value'][boolean].max()
for i in range(1000):
    FA.append(random.randrange(0, 23, 1))
    FB.append(random.choice(days))
    Value.append(random.random())
df['FA'] = FA
df['FB'] = FB
df['FAB'] = df['FA'].astype(str) + df['FB'].astype(str)
df['Value'] = Value
mix_factor = df['FA'].astype(str) + df['FB'].astype(str)
for i in numbers:
    boolean = df['FA'] == i
    Calculus()
    df_save[str(i)] = [min, max]

for i in days:
    boolean = df['FB'] == i
    Calculus()
    df_save[str(i)] = [min, max]

for i in mix_factor.unique():
    boolean = df['FAB'] == i
    Calculus()
    df_save[str(i)] = [min, max]
My question is: is there another way to do the same thing more efficiently? My real data (df in this case) is a csv with millions of rows, and these three loops are taking forever.
Maybe using 'apply', but I have never worked with it before.
Any insight will be much appreciated, thanks.
You could put all three loops into one, depending on what your exact code is. Does Calculus() take a parameter? If not, merging the loops would let you call Calculus() fewer times.
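For illustration, here is a hedged sketch of how the three loops could be replaced entirely by groupby aggregations; the column names follow the question, the reshape into df_save's min/max layout is just one possible choice, and the group keys come out as their original types rather than strings:
# Sketch only: min/max per FA, per FB and per FAB in one pass each,
# then laid out like df_save (index ['min', 'max'], one column per key).
by_fa = df.groupby('FA')['Value'].agg(['min', 'max'])
by_fb = df.groupby('FB')['Value'].agg(['min', 'max'])
by_fab = df.groupby('FAB')['Value'].agg(['min', 'max'])

df_save = pd.concat([by_fa, by_fb, by_fab]).T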
I have the dataframe below. bet_pl and co_pl keep track of the daily changes in the two balances. I have updated co_balance based on co_pl and its cumsum.
init_balance = D('100.0')
co_thresh = D('1.05') * init_balance
def get_pl_co(row):
    if row['eod_bet_balance'] > co_thresh:
        diff = row['eod_bet_balance'] - co_thresh
        return diff
    else:
        return Decimal('0.0')

df_odds_winloss['eod_bet_balance'] = df_odds_winloss['bet_pl'].cumsum() + init_balance
df_odds_winloss['sod_bet_balance'] = df_odds_winloss['eod_bet_balance'].shift(1).fillna(init_balance)
df_odds_winloss['co_pl'] = df_odds_winloss.apply(get_pl_co, axis=1)
df_odds_winloss['co_balance'] = df_odds_winloss['co_pl'].cumsum()
# trying this
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['eod_bet_balance'] - df_odds_winloss['co_pl']
Now I want eod_bet_balance to be reduced by co_pl, since it is a transfer between the two balances, but I am not getting the right eod (end of day) balances.
Can anyone give a hint?
UPDATED: The eod_balances reflect the change in bet_pl but not the subsequent change in co_pl.
FINAL UPDATE:
initial_balance = D('100.0')
df = pd.DataFrame({ 'SP': res_df['SP'], 'winloss': bin_seq_l}, columns=['SP', 'winloss'])
df['bet_pl'] = df.apply(get_pl_lvl, axis=1)
df['interim_balance'] = df['bet_pl'].cumsum() + initial_balance
df['co_pl'] = (df['interim_balance'] - co_thresh).clip(lower=0)
df['co_balance'] = df['co_pl'].cumsum()
df['post_co_balance'] = df['interim_balance'] - df['co_pl']
bf_r = D('0.05')
df['post_co_deduct_balance'] = df['post_co_balance'] - (df['post_co_balance'] * bf_r)
df['sod_bet_balance'] = df['post_co_deduct_balance'].shift(1).fillna(initial_balance)
First, you don't need to apply a custom function to get co_pl; it can be done like so:
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip(lower=0)
As for updating the other column, if I understand correctly you want something like this:
df['eod_bet_balance'] = df['eod_bet_balance'].clip(upper=co_thresh)
or, equivalently...
df['eod_bet_balance'] -= df['co_pl']
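A tiny worked example of that logic, with made-up numbers and co_thresh assumed to be 105, just to show the transfer:
# Toy frame: only the middle row exceeds the threshold, so only it transfers.
import pandas as pd

toy = pd.DataFrame({'eod_bet_balance': [100.0, 110.0, 103.0]})
co_thresh = 105.0

toy['co_pl'] = (toy['eod_bet_balance'] - co_thresh).clip(lower=0)   # 0.0, 5.0, 0.0
toy['eod_bet_balance'] -= toy['co_pl']                              # 100.0, 105.0, 103.0
print(toy)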
I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer, How to speed up Pandas multilevel dataframe shift by group?, I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each multi-index group to NaN, and now I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
But I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use similar code to set the last N entries of each group to NaN, but obviously I am missing some important indexing knowledge because I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
import numpy as np
import pandas as pd

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    # N > 0 blanks the first N rows of the group, N < 0 blanks the last |N| rows
    if N > 0:
        grp.iloc[:N, grp.columns.get_loc(col)] = value
    else:
        grp.iloc[N:, grp.columns.get_loc(col)] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    # N > 0 blanks the first N rows of the group, N < 0 blanks the last |N| rows
    if N > 0:
        grp.iloc[:N, grp.columns.get_loc(col)] = value
    else:
        grp.iloc[N:, grp.columns.get_loc(col)] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)
shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
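As a possible further step (an untested sketch, not part of the original answer), the groupby.apply itself could be replaced by a cumcount-based mask applied right after the shift, which blanks the trailing rows of each group without calling a Python function per group:
# Sketch: cumcount(ascending=False) numbers rows from the end of each group,
# so values < N mark the last N rows; assign NaN through .loc in one shot.
N = abs(shiftBy)  # number of trailing rows per group to blank out
tail_mask = df.groupby(level=0).cumcount(ascending=False) < N
df.loc[tail_mask, 'tmpShift'] = np.nan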