Rolling sum of the next N elements, including the current element - python

Good Morning,
I have the following dataframe:
a = [1,2,3,4,5,6]
b = pd.DataFrame({'a': a})
I would like to create a column that sums the next "n" rows of column "a", including the present value of a; I tried:
n = 2
b["r"] = pd.rolling_sum(b.a, n) + a
print(b)
a r
0 1 NaN
1 2 5.0
2 3 8.0
3 4 11.0
4 5 14.0
5 6 17.0
It would be delightful to have:
a r
0 1 1 + 2 + 3 = 6
1 2 2 + 3 + 4 = 9
2 3 3 + 4 + 5 = 12
3 4 4 + 5 + 6 = 15
4 5 5 + 6 + 0 = 11
5 6 6 + 0 + 0 = 6

pandas >= 1.1
Pandas now supports "forward-looking window operations", see here.
From 1.1, you can use FixedForwardWindowIndexer
idx = pd.api.indexers.FixedForwardWindowIndexer
b['a'].rolling(window=idx(window_size=3), min_periods=1).sum()
0 6.0
1 9.0
2 12.0
3 15.0
4 11.0
5 6.0
Name: a, dtype: float64
Note that this is still (at the time of writing) very buggy for datetime rolling operations - use with caution.
pandas <= 1.0.X
Without builtin support, you can get your output by first reversing your data, using rolling_sum with min_periods=1, and reverse again.
b.a[::-1].rolling(3, min_periods=1).sum()[::-1]
0 6.0
1 9.0
2 12.0
3 15.0
4 11.0
5 6.0
Name: a, dtype: float64

Related

Find the time difference between consecutive rows of two columns for a given value in third column

Lets say we want to compute the variable D in the dataframe below based on time values in variable B and C.
Here, second row of D is C2 - B1, the difference is 4 minutes and
third row = C3 - B2= 4 minutes,.. and so on.
There is no reference value for first row of D so its NA.
Issue:
We also want a NA value for the first row when the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
.dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Makes the following possible
>>> df['D'] = (df.groupby('A')
.apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
.reset_index(drop=True))
You can always drop these new columns later.

How to apply function vertically in df

I want to add column values vertically from top to down
def add(x,y):
return x,y
df = pd.DataFrame({'A':[1,2,3,4,5]})
df['add'] = df.apply(lambda row : add(row['A'], axis = 1)
I tried using apply but its not working
Desired output is basically adding A column values 1+2, 2+3:
A add
0 1 1
1 2 3
2 3 5
3 4 7
4 5 9
You can apply rolling.sum on a moving window of size 2:
df.A.rolling(2, min_periods=1).sum()
0 1.0
1 3.0
2 5.0
3 7.0
4 9.0
Name: A, dtype: float64
Try this instead:
>>> df['add'] = (df + df.shift()).fillna(df)['A']
>>> df
A add
0 1 1.0
1 2 3.0
2 3 5.0
3 4 7.0
4 5 9.0
>>>

Continuous update of Aggregation of last 5 data sets in python

I need to add a new feature that aggregates the last 5 data. When it adds 6th data, then it should forget the first data and consider only the last 5 data sets as shown below. Here is the dummy data frame, new_feature is the expected output.
id feature new_feature
1 a a
2 b a+b
3 c a+b+c
4 d a+b+c+d
5 e a+b+c+d+e
6 f b+c+d+e+f
7 g c+d+e+f+g
Use Series.rolling with min_periods=1 parameter and sum:
df = pd.DataFrame({'feature':[1,2,4,5,6,2,3,4,5]})
df['new_feature'] = df['feature'].rolling(5, min_periods=1).sum()
print (df)
feature new_feature
0 1 1.0
1 2 3.0
2 4 7.0
3 5 12.0
4 6 18.0
5 2 19.0
6 3 20.0
7 4 20.0
8 5 20.0

Finding mean of specific column and keep all rows that have specific mean values

I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
what i want is this:
1) find mean of every 'id'.
2) give the number of ids (length) which has mean >= 3.
3) give back all rows of dataframe (where mean of any id >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.0
1 D 1 4.0
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform for means by all groups with same size like original DataFrame, so possible filter by boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)

(Speed Calculate) How Make Simple way?

I have huge dataframe,, hundred thousand row and column.
My data like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
I want to calculate the speed using this equation:
I used code:
df = pd.read_csv("data.csv")
def v_2(i):
return (df.ix[x,(5+3*(i-1))]-df.ix[x,(2+3*(i-1))])**2 + (df.ix[x,(6+3*(i-1))]-df.ix[x,(3+3*(i-1))])**2
def v(i):
if (df.ix[x,(4+3*(i-1))]-df.ix[x,(1+3*(i-1))]) ==0:
return 0
else:
if (df.ix[x,(4+3*(i-1))]-df.ix[x,(1+3*(i-1))]) <0:
return 0
else:
return math.sqrt(v_2(i)) / (df.ix[x,(4+3*(i-1))]-df.ix[x,(1+3*(i-1))])
for i in range(1,int((len(df.columns)-1)/3)):
v_result = list()
for x in range(len(df.index)):
v_2(i)
v(i)
v_result.append(v(i))
df_result[i]=v_result
my expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
but this code takes huge time,
would you mind to give another idea more simple and fast process or using multiprocessing module.
thank you
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
first reshape the data from the wide format to a long format so that there are only 3 columns, T, X, Y. The column suffixes, i.e. _1, _2, etc are split out into a new index.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df = df.stack()
this produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next calculate the del_X^2, del_Y^2 & del_t (I hope the usage of prefix del is unambiguous). This is easier done using these two utility functions to avoid repetition.
def f(x):
return x.shift(-1) - x
def f2(x):
return f(x)**2
update: description of functions
The first function calculates F(W,n) = W(n+1) - W(n), for all n, where n is the index of the array W. The second function squares its argument. These functions are composed to calculate the distance squared. See the documentation for pd.Series.shift for more information & examples.
using lower-case column names for the del prefix above and the suffix 2 to mean squared:
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
df['t'] = df.groupby(level=0).Y.transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
produces the following which is similar to your output, but transposed.
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
you can filter out the last row (where the computed columns t, x2 & y2 are null), fill the np.nan in v with with 0, transpose, rename the columns & reset index to get at your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
I suggest you use Apache Spark if you want a real speed.
You can do that by passing your function to Spark as described here in this documentation:
Passing function to Spark

Categories

Resources