Rolling sum on a dynamic window - python

I am new to python and the last time I coded was in the mid-80's so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I seek the Lookbacksum which is the rolling sum of Data as specified by the Lookback column.
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3

You can create a custom function for use with df.apply, eg:
def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
loc = values.index.get_loc(row.name)
lb = lookback.loc[row.name]
return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases but as long as your indices align and are unique - it should fulfil what you're trying to do.

here is one with a list comprehension which stores the index and value of the column df['Lookback'] and the gets the slice by reversing the values and slicing according to the column value:
df['LookbackSum'] = [sum(df.loc[:e,'Data'][::-1].to_numpy()[:i+1])
for e,i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3

An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
# Concatenate with sliced to cover all rolls
p = np.full((a.shape[0],a.shape[1]-1),fill_value)
a_ext = np.concatenate((p,a,p),axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = a.shape[1]
return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3

Related

How to remove consecutive pairs of opposite numbers from Pandas Dataframe?

How can i remove consecutive pairs of equal numbers with opposite signs from a Pandas dataframe?
Assuming i have this input dataframe
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
'idx': range(len(incremental_changes)),
'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
Can groupby consecutive equal numbers and transform
import itertools
def remove_duplicates(s):
''' Generates booleans that indicate when a pair of ints with
opposite signs are found.
'''
iter_ = iter(s)
for (a,b) in itertools.zip_longest(iter_, iter_):
if b is None:
yield False
else:
yield a+b == 0
yield a+b == 0
>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
['incremental_changes'] \
.transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
Just do rolling, then we filter the multiple combine
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500)
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
df = pd.DataFrame(index=range(5000))
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
import numpy_ext as npe
# for i in range(len(df):
# if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A,B):
r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A)>1 else A[0]
return r
# much simpler example, sum of previous values
def g(A):
return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
A
B
AB_combo
A_running
0
1
5
1
0
1
2
5
-29.6667
1
2
2
5
-59
3
3
2
5
-88.3333
5
4
2
5
-117.667
7
5
2
5
-147
9
6
2
5
-176.333
11
7
2
5
-205.667
13
8
2
5
-235
15
9
2
5
-264.333
17

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One of possible solutions is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index generate a sequence of indices from n-2
to n+1 and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
[ range(i-2, i+2) for i in ind ] for item in sublist])]
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
idx2drop += range(i - 2, i + 2)
print(df.drop(idx_drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12

How to calculate amounts that row values greater than a specific value in pandas?

How to calculate amounts that row values greater than a specific value in pandas?
For example, I have a Pandas DataFrame dff. I want to count row values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient way. How to be more efficient?
dff['count'] = 0
for m in range(len(dff)):
og = 0
for i in dff.columns:
if dff[i][m] > 0:
og += 1
dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame, that is True wherever a value is greater than your threshold (in this case 0), and then use sum along the first axis.
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64

Divide part of a dataframe by another while keeping columns that are not being divided

I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the sample_name col:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904

Categories

Resources