Pandas apply a change which affects 2 columns at the same time - python

I have the dataframe below. bet_pl and co_pl keep track of the daily changes in the 2 balances. I have updated co_balance based on co_pl and its cumsum.
init_balance = D('100.0')
co_thresh = D('1.05') * init_balance

def get_pl_co(row):
    if row['eod_bet_balance'] > co_thresh:
        return row['eod_bet_balance'] - co_thresh
    else:
        return D('0.0')

df_odds_winloss['eod_bet_balance'] = df_odds_winloss['bet_pl'].cumsum() + init_balance
df_odds_winloss['sod_bet_balance'] = df_odds_winloss['eod_bet_balance'].shift(1).fillna(init_balance)
df_odds_winloss['co_pl'] = df_odds_winloss.apply(get_pl_co, axis=1)
df_odds_winloss['co_balance'] = df_odds_winloss['co_pl'].cumsum()

# trying this
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['eod_bet_balance'] - df_odds_winloss['co_pl']
Now I want eod_bet_balance to be reduced by co_pl, since it is a transfer between the two balances, but I am not getting the right eod (end of day) balances.
Can anyone give a hint?
UPDATED: The eod balances reflect the change in bet_pl but not the subsequent change in co_pl.
FINAL UPDATE:
initial_balance = D('100.0')
df = pd.DataFrame({'SP': res_df['SP'], 'winloss': bin_seq_l}, columns=['SP', 'winloss'])
df['bet_pl'] = df.apply(get_pl_lvl, axis=1)
df['interim_balance'] = df['bet_pl'].cumsum() + initial_balance
df['co_pl'] = (df['interim_balance'] - co_thresh).clip(lower=0)
df['co_balance'] = df['co_pl'].cumsum()
df['post_co_balance'] = df['interim_balance'] - df['co_pl']
bf_r = D('0.05')
df['post_co_deduct_balance'] = df['post_co_balance'] - (df['post_co_balance'] * bf_r)
df['sod_bet_balance'] = df['post_co_deduct_balance'].shift(1).fillna(initial_balance)

First, you don't need to apply a custom function to get co_pl; it can be done like so:
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip(lower=0)
As for updating the other column, if I understand correctly you want something like this:
df['eod_bet_balance'] = df['eod_bet_balance'].clip(upper=co_thresh)
or, equivalently...
df['eod_bet_balance'] -= df['co_pl']
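A minimal, self-contained sketch of the clip-based approach, using plain floats and made-up daily P&L numbers in place of the question's Decimal data:

```python
import pandas as pd

# Hypothetical daily P&L figures; the threshold logic follows the question.
init_balance = 100.0
co_thresh = 1.05 * init_balance  # 105.0

df = pd.DataFrame({'bet_pl': [3.0, 4.0, -2.0, 6.0]})
df['eod_bet_balance'] = df['bet_pl'].cumsum() + init_balance     # 103, 107, 105, 111
# Anything above the threshold is transferred out...
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip(lower=0)  # 0, 2, 0, 6
# ...which caps the betting balance at the threshold.
df['eod_bet_balance'] = df['eod_bet_balance'] - df['co_pl']      # 103, 105, 105, 105
print(df)
```

Subtracting the clipped excess leaves exactly the same result as clipping the balance at the threshold directly.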

Related

How to parallelize for loops in Pyspark?

I am trying to convert some Pandas code to Pyspark, which will run on an EMR cluster. This is my first time working with Pyspark, and I am not sure what the optimal way to code the objective is. The job is trying to achieve the following:
There is a base dataframe with schema like so:
institution_id, user_id, st_date
For every unique institution_id, get all users
For every user for the institution_id, take all unique st_dates in sorted order, get the difference between pairs of consecutive st_dates and output a dictionary
Here is what the code looks like as of now:
from collections import defaultdict

import numpy as np
import pandas as pd

def process_user(current_user, inst_cycles):
    current_user_dates = np.sort(current_user.st_date.unique())
    if current_user_dates.size > 1:
        prev_date = pd.to_datetime(current_user_dates[0]).date()
        for current_datetime in current_user_dates[1:]:
            current_date = pd.to_datetime(current_datetime).date()
            month = current_date.month
            delta = current_date - prev_date
            cycle_days = delta.days
            inst_cycles[month][cycle_days] += 1
            prev_date = current_date
    return inst_cycles

def get_inst_monthly_distribution(current_inst):
    inst_cycles = defaultdict(lambda: defaultdict(int))
    inst_user_ids = current_inst.select('user_id').distinct().collect()
    for _, user_id in enumerate(inst_user_ids):
        user_id_str = user_id[0]
        current_user = current_inst.filter(current_inst.user_id == user_id_str)
        inst_cycles = process_user(current_user, inst_cycles)
    return inst_cycles

def get_monthly_distributions(inst_ids, df):
    cycles = {}
    for _, inst_id_str in enumerate(inst_ids.keys()):
        current_inst = df.filter(df.inst_id == inst_id_str)
        cycles[inst_id_str] = get_inst_monthly_distribution(current_inst)
    return cycles

def execute():
    df = load_data()  # df is a Spark dataframe
    inst_names = get_inst_names(df)
    monthly_distributions = get_monthly_distributions(inst_names, df)
I think this code is not taking advantage of the parallelism of Spark, and can be coded in a much better way without the for loops. Is that correct?
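For reference, the per-user date-difference logic can be expressed without explicit Python loops. Here is a pandas sketch of that idea on a few hypothetical rows of the (institution_id, user_id, st_date) schema; the same shape of computation maps onto Spark's `Window` with `lag` and `datediff`, which would let Spark parallelize across groups:

```python
import pandas as pd

# Hypothetical sample of the (institution_id, user_id, st_date) schema.
df = pd.DataFrame({
    'institution_id': [1, 1, 1, 1],
    'user_id': ['a', 'a', 'a', 'b'],
    'st_date': pd.to_datetime(['2020-01-01', '2020-01-11', '2020-02-05', '2020-03-01']),
})

# Sort, then diff consecutive dates within each (institution, user) group.
df = df.sort_values(['institution_id', 'user_id', 'st_date'])
df['cycle_days'] = df.groupby(['institution_id', 'user_id'])['st_date'].diff().dt.days
df['month'] = df['st_date'].dt.month

# Count (month, cycle_days) pairs, mirroring the nested defaultdict.
counts = df.dropna(subset=['cycle_days']).groupby(['month', 'cycle_days']).size()
print(counts)
```

Each group's first row has no predecessor and drops out, just as the loop skips users with a single date.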

How to optimize this pandas iterable

I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:
def disambiguate(arg):
    arg['length'] = (arg.end - arg.begin).abs()
    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
    data = []
    out = pd.DataFrame()
    for row in df.itertuples():
        test = df[df['note_id'] == row.note_id].copy()
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric),
                                           test.end.apply(pd.to_numeric),
                                           closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        # filter out overlapping rows via hierarchy
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
    out = pd.concat(data, axis=0)
    # randomly reindex to keep a random row when dropping remaining duplicates:
    # https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])
This works fine, except for the fact that the dataframes I am iterating over have well over 100K rows each, so this is taking forever to complete. I timed the various methods using %prun in Jupyter, and the method that seems to eat up processing time was series.py:3719(apply) ... NB: I tried using modin.pandas, but that was causing more problems (I kept getting an error about Interval needing a value where left was less than right, which I couldn't figure out; I may file a GitHub issue there).
I am looking for a way to optimize this, such as using vectorization, but honestly, I don't have the slightest clue how to convert this to a vectorized form.
Here is a sample of my data:
begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
I think I know what the issue was: I did my filtering on note_id incorrectly, and was thus iterating over the entire dataframe.
It should have been:
cases = set(df['note_id'].tolist())
for case in cases:
    test = df[df['note_id'] == case].copy()
    for row in test.itertuples():
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)
For testing on one note, before I stopped iterating over the entire, non-filtered dataframe, it was taking over 16 minutes. Now, it's at 28 seconds!
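The overlap test at the heart of the loop can be seen in isolation with a few rows shaped like the sample data; this is a minimal sketch of what `IntervalIndex.overlaps` returns for one span:

```python
import pandas as pd

# A few (begin, end) pairs shaped like the sample data above.
test = pd.DataFrame({'begin': [0, 10, 25, 28], 'end': [9, 14, 37, 37]})
iix = pd.IntervalIndex.from_arrays(test['begin'], test['end'], closed='neither')

# Which rows' open intervals overlap the span (25, 37)?
span_range = pd.Interval(25, 37, closed='neither')
overlapping = test[iix.overlaps(span_range)]
print(overlapping)  # rows (25, 37) and (28, 37)
```

Only the intervals sharing interior points with the span survive, which is exactly the `fx` subset the hierarchy rules are then applied to.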

Pandas - Local store a variable (multiplier)

Another question :) I need to know how to store a variable for reference at the beginning of my script. In this case, I need to store an FX conversion rate which I want to be able to adjust at the beginning of the script. I also want to store a directory on my PC which will hold the outputs of the script and is prone to changing each month.
For reference, I have created the following example.
import pandas as pd

FX_rate = {'AUD': [0.71442], 'NZD': [0.68476]}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'

df = {'AU_SALES': [1000, 2500, 750, 6800, 1000], 'NZ_SALES': [500, 2200, 430, 100, 6670]}
df1 = pd.DataFrame(df)

# ISSUE HERE - convert sales using the FX_rate dictionary
df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']
df_converted = df_USDAUD.append(df_USDNZD)

# Save output in folder, using Dir directory
df_converted.to_csv(Dir)
If I were to run this script I would get an error telling me that the number of values in df['AU_SALES'] (5) and the number of values in FX_rate['AUD'] (1) do not match.
I do not know what your expected output should be, but I think this should work. Use multiply with the star operator *:
import pandas as pd
FX_rate = {'AUD':[0.71442],'NZD':[0.68476]}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
# use multiply with the star operator
df_USDAUD = df1['AU_SALES'].multiply(*FX_rate['AUD'])
df_USDNZD = df1['NZ_SALES'].multiply(*FX_rate['NZD'])
df_converted = df_USDAUD.append(df_USDNZD)
print(df_converted)
0 714.4200
1 1786.0500
2 535.8150
3 4858.0560
4 714.4200
0 342.3800
1 1506.4720
2 294.4468
3 68.4760
4 4567.3492
dtype: float64
Or you can create a function
# create a function
def myFun(df, aud, nzd, Dir):
    df_USDAUD = df['AU_SALES'] * aud
    df_USDNZD = df['NZ_SALES'] * nzd
    df_converted = df_USDAUD.append(df_USDNZD)
    df_converted.to_csv(Dir)
    return df_converted
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
myFun(df1, 0.71442, 0.68476, Dir)
or just do not store the numbers in a list inside the dict: FX_rate = {'AUD':0.71442,'NZD':0.68476}
FX_rate = {'AUD':0.71442,'NZD':0.68476}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
# scalar rates, so plain multiplication works
df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']
df_converted = df_USDAUD.append(df_USDNZD)
print(df_converted)
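Note that Series.append was removed in pandas 2.0; on current pandas the scalar-dict version runs with pd.concat instead. A sketch with the same sample numbers:

```python
import pandas as pd

# Scalar rates, as in the last variant above.
FX_rate = {'AUD': 0.71442, 'NZD': 0.68476}
df1 = pd.DataFrame({'AU_SALES': [1000, 2500, 750, 6800, 1000],
                    'NZ_SALES': [500, 2200, 430, 100, 6670]})

df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']
# pd.concat replaces the removed Series.append.
df_converted = pd.concat([df_USDAUD, df_USDNZD])
print(df_converted)
```

The output is the same ten-row Series shown earlier.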

Time efficiency by eliminating three for loops

I have a script similar to this:
import random
import pandas as pd
FA = []
FB = []
Value = []
df = pd.DataFrame()
df_save = pd.DataFrame(index=['min','max'])
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
numbers = list(range(24)) # FA.unique()
mix = '(pairwise combination of days and numbers, i.e. 0Monday,0Tuesday,...1Monday,1Tuesday,....)' 'I dont know how to do this combination btw'
def Calculus():
    global min, max
    min = df['Value'][boolean].min()
    max = df['Value'][boolean].max()

for i in range(1000):
    FA.append(random.randrange(0, 23, 1))
    FB.append(random.choice(days))
    Value.append(random.random())

df['FA'] = FA
df['FB'] = FB
df['FAB'] = df['FA'].astype(str) + df['FB'].astype(str)
df['Value'] = Value
mix_factor = df['FA'].astype(str) + df['FB'].astype(str)

for i in numbers:
    boolean = df['FA'] == i
    Calculus()
    df_save[str(i)] = [min, max]

for i in days:
    boolean = df['FB'] == i
    Calculus()
    df_save[str(i)] = [min, max]

for i in mix_factor.unique():
    boolean = df['FAB'] == i
    Calculus()
    df_save[str(i)] = [min, max]
My question is: is there another way to do the same thing more efficiently? My real data (df in this case) is a csv with millions of rows, and these three loops are taking forever.
Maybe using 'apply', but I have never worked with it before.
Any insight will be much appreciated, thanks.
You could put all three loops into one, depending on what your exact code is. Does Calculus() take any parameters? If not, merging the loops would let you call Calculus() fewer times.
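Beyond merging the loops, each loop can be replaced entirely by a groupby aggregation, which computes min and max for every level of a factor in a single vectorized pass. A sketch on a tiny hand-made frame:

```python
import pandas as pd

# Tiny stand-in for the random df built above.
df = pd.DataFrame({
    'FA': [0, 0, 1, 1],
    'FB': ['Monday', 'Tuesday', 'Monday', 'Monday'],
    'Value': [0.2, 0.8, 0.5, 0.1],
})

# One groupby per factor replaces each for loop.
by_fa = df.groupby('FA')['Value'].agg(['min', 'max'])
by_fb = df.groupby('FB')['Value'].agg(['min', 'max'])
# Grouping on both columns covers the pairwise "mix" factor directly.
by_mix = df.groupby(['FA', 'FB'])['Value'].agg(['min', 'max'])
print(by_fa)
```

On millions of rows this avoids re-scanning the frame once per factor level, which is where the three loops spend their time.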

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can confirm that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use similar code to set the last N entries of each group to NaN, but obviously I am missing some important indexing knowledge, as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
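For what it's worth, the tail rows can also be masked without groupby.apply at all: groupby.cumcount(ascending=False) numbers each row from the end of its group, giving a boolean mask over the last N rows. A sketch on a small hand-made frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Two groups of three rows each, indexed by category.
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b'],
    'colB': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}).set_index('category')

N = 1  # a forward shift by 1 invalidates the last row of each group
df['tmpShift'] = df['colB'].shift(-N)

# Position from the end within each group: 0 = last row, 1 = second to last, ...
pos_from_end = df.groupby(level=0).cumcount(ascending=False)
# Mask the last N rows per group, removing values that leaked across groups.
df.loc[(pos_from_end < N).values, 'tmpShift'] = np.nan
print(df)
```

This keeps the shift global (one vectorized call) and replaces the per-group apply with a single boolean assignment.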
