Pandas: vectorization of Conditional Cumulative Sum - python

I'm trying to vectorize a for loop in pandas to improve performance. I have a dataset consisting of users, products, the date of each service, and the number of days supplied. Given the following subset of data:
from datetime import datetime

import pandas as pd

testdf = pd.DataFrame(data={"USERID": ["A"] * 6,
                            "PRODUCTID": [1] * 6,
                            "SERVICEDATE": [datetime(2016, 1, 1), datetime(2016, 2, 5),
                                            datetime(2016, 2, 28), datetime(2016, 3, 25),
                                            datetime(2016, 4, 30), datetime(2016, 5, 30)],
                            "DAYSSUPPLY": [30] * 6})
testdf = testdf.set_index(["USERID", "PRODUCTID"])
testdf["datediff"] = testdf["SERVICEDATE"].diff()
testdf.loc[testdf["datediff"].notnull(), "datediff"] = testdf.loc[
    testdf["datediff"].notnull(), "datediff"].apply(lambda x: x.days)
testdf["datediff"] = testdf["datediff"].fillna(0)
testdf["datediff"] = pd.to_numeric(testdf["datediff"])
testdf["over_under"] = testdf["DAYSSUPPLY"].shift() - testdf["datediff"]
I would like to get the following result:
DAYSSUPPLY SERVICEDATE datediff over_under desired
USERID PRODUCTID
A 1 30 2016-01-01 0 NaN 0
1 30 2016-02-05 35 -5.0 0
1 30 2016-02-28 23 7.0 7
1 30 2016-03-25 26 4.0 11
1 30 2016-04-30 36 -6.0 5
1 30 2016-05-30 30 0.0 5
Essentially, I want my desired column to be the running sum of over_under, except that negative values only subtract from whatever positive balance has already accumulated: desired should never drop below 0. A quick and dirty loop over a [user, product] group looks something like this:
running_total = 0
desired_loop = []
for row in testdf.itertuples():
    over_under = row[4]
    # skip first row
    if pd.isnull(over_under):
        desired_loop.append(0)
        continue
    running_total += over_under
    running_total = max(running_total, 0)
    desired_loop.append(running_total)
testdf["desired_loop"] = desired_loop
testdf["desired_loop"]
USERID PRODUCTID
A 1 0.0
1 0.0
1 7.0
1 11.0
1 5.0
1 5.0
I'm still new to vectorization and pandas in general. I've been able to vectorize every other calculation in this df, but with this special case of a cumulative sum I just can't figure out how to go about it.
Thanks!

I had a similar problem and solved it using a somewhat unconventional iteration.
testdf["desired"] = testdf["over_under"].cumsum()
current = np.argmax( testdf["desired"] < 0 )
while current != 0:
testdf.loc[current:,"desired"] += testdf["desired"][current] # adjust the cumsum going forward
# the previous statement also implicitly sets
# testdf.loc[current, "desired"] = 0
current = np.argmax( testdf["desired"][current:] < 0 )
In essence, you are finding all the "events" where the running total would go negative and readjusting the cumsum from that point on. All of the manipulation and test operations are vectorized, so if your desired column doesn't dip below zero too often, this should be pretty fast.
It's definitely a hack but it got the job done for me.
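For reference, on the sample frame from the question this should reproduce the desired column shown above (a quick check, assuming the snippet above has just been run):
print(testdf["desired"].tolist())
# expected: [0.0, 0.0, 7.0, 11.0, 5.0, 5.0]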

Related

How can I forward fill a dataframe column where the limit of rows filled is based on the value of a cell in another column?

So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather, I would recommend filtering to the rows where length is greater than 0, then iterating through those rows and writing each row's NM value into the following length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
1. Get the rows containing change information, making sure to keep the index:
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
2. Iterate through them; I prefer to_dict for performance reasons on large datasets (it is a habit).
3. Set the NM values of the next length rows to the NM value of the current row.
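For the sample frame above, that records list contains a single entry, since only row 2 has a non-zero length:
records = df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
print(records)
# [{'index': 2, 'NM': 1.0, 'length': 2}]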
You can first group the dataframe before filling, with each group starting at a row where length is non-zero. The only issue is that, for the first group in your example, limit would be 0, which causes an error, so we make sure it is at least 1 with max. This might cause unexpected results if there are NaN values before the first non-zero value in length, but from the given data it's not clear whether that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(
        method="ffill",
        limit=max(f.length.iloc[0], 1))
).values
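The .values at the end is needed because the grouped apply returns a Series carrying the group keys in its index; dropping that index lets the filled values be assigned back positionally. A quick check on the sample frame (assuming the snippet above has been run):
print(df["NM"].tolist())
# expected: [0.0, 0.0, 1.0, 1.0, 1.0, nan, 0.0]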

Pandas pct_change with moving average

I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
because roc(3, avg(1, 2)) = (3 - 1.5)/1.5 = 1, and the same calculation gives 1.8 for the last row. Using pct_change with the periods parameter just compares each value against the entry n rows earlier, so it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?
Here is one way to do it, using rolling and shift:
df['avg'] = df['data'].rolling(2).mean()
df['poc'] = (df['data'] - df['avg'].shift(1)) / df['avg'].shift(1)
df = df.drop(columns='avg')
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8
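The helper column can also be skipped; the same calculation written as a single expression (just a compact variant of the approach above) would be:
df['poc'] = df['data'] / df['data'].rolling(2).mean().shift() - 1
For a different window size, change the argument to rolling accordingly.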

Creating flexible, iterative field name in Python function or loop

I am creating a DataFrame with the code below:
import statistics

import pandas as pd

df1 = pd.DataFrame({'segment': ['abc','abc','abc','abc','abc','xyz','xyz','xyz','xyz','xyz','xyz','xyz'],
                    'prod_a_clients': [5,0,12,25,0,2,5,24,0,1,21,7],
                    'prod_b_clients': [15,6,0,12,8,0,17,0,2,23,15,0]})
abc_seg= df1[(df1['segment']=='abc')]
xyz_seg= df1[(df1['segment']=='xyz')]
seg_prod= df1[(df1['segment']=='abc') & (df1['prod_a_clients']>0)]
abc_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod= df1[(df1['segment']=='abc') & (df1['prod_b_clients']>0)]
abc_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
seg_prod= df1[(df1['segment']=='xyz') & (df1['prod_a_clients']>0)]
xyz_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod= df1[(df1['segment']=='xyz') & (df1['prod_b_clients']>0)]
xyz_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
segs_combined= [abc_seg,xyz_seg]
df2= pd.concat(segs_combined, ignore_index=True)
print(df2)
As you can see from the result, I need to calculate a mean for every product and segment combination I have. I'm going to be doing this for hundreds of products and segments. I have tried many different ways of doing this with a loop or a function and have gotten close with something like the following:
def prod_seg(sg, prd):
    seg_prod = df1[(df1['segment']==sg) & (df1[prd+'_clients']>0)]
    prod_name = prd+'_clients'
    col_name = prd+'_average'
    df_name = sg+'_seg'
    df_name+"['"+prd+'_average'+"']" = statistics.mean(seg_prod[prod_name])
    return
The issue is that I need to create a unique column for every iteration and the way I am doing it above is obviously not working.
Is there any way I can recreate what I did above in a loop or function?
You could use groupby in order to calculate the mean per group. Also, replace the 0 with nan and it gets skipped by the mean calculation. The script then looks like:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz',
                                'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})
df1.set_index("segment", inplace=True, drop=True)
df1[df1 == 0] = np.nan
mean_values = dict()
for seg_key, seg_df in df1.groupby(level=0):
    mean_value = seg_df.mean(numeric_only=True)
    mean_values[seg_key] = mean_value
results = pd.DataFrame.from_dict(mean_values)
print(results)
The result is:
abc xyz
prod_a_clients 14.00 10.00
prod_b_clients 10.25 14.25
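As an aside, the same per-segment table can be produced without the explicit loop over groups, using the same replace-zeros-then-group idea (a minimal sketch, assuming df1 is indexed by segment as above):
results = df1.replace(0, np.nan).groupby(level=0).mean().T
print(results)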
Instead of using a loop, you can derive the same result by first using where on the 0s in the clients columns (which replaces 0s with NaN), then grouping by the "segment" column and transforming with the "mean" method.
The point of where is that the mean method skips NaN values by default, so by converting 0s to NaN we make sure 0s are not considered in the mean.
transform('mean') broadcasts the mean (which is an aggregate value) back to align with the original DataFrame, so every row gets its group's mean value.
clients = ['prod_a_clients', 'prod_b_clients']
out = (df1.join(df1[['segment']]
               .join(df1[clients].where(df1[clients]>0))
               .groupby('segment').transform('mean')
               .add_suffix('_mean')))
Output:
segment prod_a_clients prod_b_clients prod_a_clients_mean prod_b_clients_mean
0 abc 5 15 14.0 10.25
1 abc 0 6 14.0 10.25
2 abc 12 0 14.0 10.25
3 abc 25 12 14.0 10.25
4 abc 0 8 14.0 10.25
5 xyz 2 0 10.0 14.25
6 xyz 5 17 10.0 14.25
7 xyz 24 0 10.0 14.25
8 xyz 0 2 10.0 14.25
9 xyz 1 23 10.0 14.25
10 xyz 21 15 10.0 14.25
11 xyz 7 0 10.0 14.25

Pandas - New column based on the value of another column N rows back, when N is stored in a column

I have a pandas dataframe with example data:
idx price lookback
0 5
1 7 1
2 4 2
3 3 1
4 7 3
5 6 1
Lookback can be positive or negative but I want to take the absolute value of it for how many rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious way, but this did not work:
df['sbc'] = df['price'].shift(df['lookback'].abs() + 1)
I then tried using a lambda; this did not work either, but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x+1)]))
df['sbc'] = sbc(df['price'], df['lookback'].abs())
I also tried a loop (which was extremely slow, but worked), though I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) <= i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several provided solutions worked great (thank you!), but all needed some small tweaks to deal with my potential for negative numbers and with the fact that it is a lookback + 1, not - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, as improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
    return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
This solution loops over the values of the column lookback and calculates, for each one, the index of the wanted value in the column price, which I store in an array.
The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan
df['lb_price'] = new
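Run against the example frame from the question, this loop should produce the same lb_price values as the vectorized version above (a quick check):
print(df['lb_price'].tolist())
# expected: [nan, nan, nan, 7.0, 5.0, 3.0]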

How to get average of increasing values using Pandas?

I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data, value):  # not complete but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a table of the percentage change between the number at each index and the number in the row before it.
fillna(0) replaces the NaN that pct_change() creates in position 0 of the table with 0.
gt(0) returns a True/False table depending on whether the value at each index is greater than 0.
current output of this function
In[1]: avergeIncreace(df,'A')
Out[1]:
0    False
1     True
2    False
3     True
Name: A, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas, there should be a way to return an array of all the indexes that are True and then use a for loop to go through the original data table, but I believe pandas should have a way to do this without a for loop.
What I think the for-loop way would look like (it is still missing the code to return only the indexes that are True, instead of every index):
avergeIncreace(df,'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values  # returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer/times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
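To see what that one-liner is doing, the intermediate steps can be spelled out like this (a sketch of the same idea on the example table):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

deltas = df.diff()                    # row-to-row differences; the first row becomes NaN
increases = deltas.mask(deltas <= 0)  # keep only the positive deltas, everything else becomes NaN
print(increases.mean().fillna(0))     # mean() skips NaN; columns with no increases fall back to 0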
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If there are only zero values, there is no increase
        return 0
    else:
        return np.sum(a) / count

print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
