Using groupby, shift and rolling in Pandas


I am trying to calculate rolling averages within groups. For each row I want the rolling average of the rows above it, so I thought the easiest way would be to use shift() and then rolling(). The problem is that shift() pulls in data from the previous group, which makes the first row of groups 2 and 3 incorrect: column 'ma' should be NaN in the 4th and 7th rows. How can I achieve this?
import pandas as pd
df = pd.DataFrame(
{"Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
"Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5]
})
df = df.sort_values(['Group'])
df.reset_index(inplace=True)
df['ma'] = df.groupby('Group', as_index=False)['Value'].shift(1).rolling(3, min_periods=1).mean()
print(df)
I get this:
index Group Value ma
0 0 1 2.5 NaN
1 3 1 9.1 2.50
2 6 1 4.9 5.80
3 1 2 2.9 5.80
4 4 2 5.7 6.00
5 7 2 3.1 4.30
6 2 3 1.6 4.30
7 5 3 8.2 3.65
8 8 3 7.5 4.90
I tried answers from a couple of similar questions, but nothing seems to work.

If I understand the question correctly, then the solution you require can be achieved in 2 steps using the following:
df['sa'] = df.groupby('Group', as_index=False)['Value'].transform(lambda x: x.shift(1))
df['ma'] = df.groupby('Group', as_index=False)['sa'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I got the below output, where 'ma' is the desired column
index Group Value sa ma
0 0 1 2.5 NaN NaN
1 3 1 9.1 2.5 2.5
2 6 1 4.9 9.1 5.8
3 1 2 2.9 NaN NaN
4 4 2 5.7 2.9 2.9
5 7 2 3.1 5.7 4.3
6 2 3 1.6 NaN NaN
7 5 3 8.2 1.6 1.6
8 8 3 7.5 8.2 4.9
Edit: Example with one groupby
def shift_ma(x):
    return x.shift(1).rolling(3, min_periods=1).mean()
df['ma'] = df.groupby('Group')['Value'].transform(shift_ma)
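As a self-contained check of the single-groupby idea, here is a minimal sketch that rebuilds the question's data and runs both the shift and the rolling mean inside each group. The `window` parameter and the use of `transform` (rather than `apply`) are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5],
})
# A stable sort keeps the original row order inside each group.
df = df.sort_values("Group", kind="mergesort").reset_index(drop=True)

def shift_ma(x, window=3):
    # Shift inside the group first, so the window only sees earlier rows
    # of the same group, then average over up to `window` of them.
    return x.shift(1).rolling(window, min_periods=1).mean()

# transform returns a result aligned with the original index,
# so it can be assigned straight back as a column.
df["ma"] = df.groupby("Group")["Value"].transform(shift_ma)

# The first row of every group has no earlier row, so its average is NaN.
first_rows = df.groupby("Group").head(1)
print(first_rows["ma"].isna().all())
```

Because both steps happen per group, no value from one group can leak into the window of the next.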

Related

Pandas dataframe range check using between and rolling

For each row n, I need to check whether rows n+1 to n+3 fall within the range from (value of row n) - 0.5 to (value of row n) + 0.5, and then AND (&) the results for the three rows.
A result
0 1.1 1 # 1.2 1.3 and 1.5 are in range of 0.6 to 1.6, ( 1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in range of 0.7 to 1.7, but not 2, hence ( 1 & 0 & 0)
2 1.3 0 # 1.5 and 1 are in range of 0.8 to 1.8, but not 2 ( 1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
df = pd.DataFrame( {
'A': [1.1,1.2,1.3,1.9,2,1,2.5,1.8,4,4.2,4.5,3.9]
} )
I have done some research on the site, but couldn't find the exact syntax. I tried using the rolling function to take 3 rows and the between function to check the range, and then AND the results. Could you please help?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
getting error :
AttributeError: 'Rolling' object has no attribute 'between'

You can also achieve the result without rolling() while still using .between(), as follows:
df['result'] = (
(df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5))
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
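The three shifted comparisons above follow one pattern, so they can be generated in a loop. The sketch below is one possible generalization; the helper name `all_next_within` and its `n_ahead` and `tol` parameters are hypothetical, and the sample data uses 1.5 in row 3, as in the question's table and this answer's printed result (the DataFrame constructor in the question has 1.9 there).

```python
from functools import reduce
from operator import and_

import pandas as pd

def all_next_within(s, n_ahead=3, tol=0.5):
    # For each row, test whether each of the next `n_ahead` values lies in
    # [value - tol, value + tol]. A missing look-ahead row (near the end of
    # the Series) compares as False, so those rows get 0.
    checks = [s.shift(-k).between(s - tol, s + tol)
              for k in range(1, n_ahead + 1)]
    return reduce(and_, checks).astype(int)

df = pd.DataFrame({"A": [1.1, 1.2, 1.3, 1.5, 2, 1, 2.5, 1.8, 4, 4.2, 4.5, 3.9]})
df["result"] = all_next_within(df["A"])
print(df)
```

Changing `n_ahead` or `tol` adjusts the look-ahead distance and the tolerance without editing the expression itself.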

Rolling windows tend to be quite slow in pandas. One quick solution can be to generate a dataframe with the values of the windows per row:
df_temp = pd.concat([df['A'].shift(i) for i in range(-1, 2)], axis=1)
df_temp
A A A
0 1.2 1.1 NaN
1 1.3 1.2 1.1
2 1.9 1.3 1.2
3 2.0 1.9 1.3
4 1.0 2.0 1.9
5 2.5 1.0 2.0
6 1.8 2.5 1.0
7 4.0 1.8 2.5
8 4.2 4.0 1.8
9 4.5 4.2 4.0
10 3.9 4.5 4.2
11 NaN 3.9 4.5
Then you can check per row if the value is in the desired range:
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
A result
0 1.1 0
1 1.2 1
2 1.3 0
3 1.9 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 0
9 4.2 1
10 4.5 0
11 3.9 0

In pandas, how to assign the result of a groupby aggregate to the next group in the original df?

Using pandas, I would like to use groupby with an aggregate function, e.g. mean, and then put the results back into the original dataframe, but in the next group rather than in the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
data = {'Group': ['A','A','B','B','B','B', 'C','C', 'D','D'],
'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]
}
df = pd.DataFrame(data, columns = ['Group','Value'])
print (df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I would like to get this, where each group has the mean value of the previous group:
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but this is without the shift to the next group
df.groupby('Group')['Value'].transform('mean')

Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
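One detail worth checking in your own data: groupby sorts the group keys by default, so "previous group" here means the previous key in sorted order. If the groups should instead follow the order in which they appear in the frame, passing sort=False is one option. The data and the `PrevMean` column name below are made up for illustration.

```python
import pandas as pd

# Hypothetical data where the groups do not appear in alphabetical order.
df = pd.DataFrame({
    "Group": ["B", "B", "A", "A", "C"],
    "Value": [9.0, 9.4, 1.1, 1.3, 6.2],
})

# sort=False keeps groups in order of first appearance (B, A, C) instead of
# sorting the keys, so shift() gives each group the mean of the group that
# precedes it in the frame.
prev_mean = df.groupby("Group", sort=False)["Value"].mean().shift()

# Broadcast the per-group value back onto every row of that group.
df["PrevMean"] = df["Group"].map(prev_mean)
print(df)
```

With the default sort=True the same data would be ordered A, B, C, and group B would receive the mean of A instead of NaN.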

You can calculate aggregated GroupBy.mean of each group value and use pd.Series.shift and take advantage of pandas index alignment.
df.set_index('Group').assign(value = df.groupby('Group')['Value'].mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3

How to remove consecutive bad data points in Pandas

I have a Pandas dataframe that looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Dummy_Var": [1]*12,
"B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
"C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
B C Dummy_Var
0 6.0 4.1 1
1 143.3 23.2 1
2 143.3 23.2 1
3 143.3 23.2 1
4 3.0 4.3 1
5 4.0 2.5 1
6 93.9 7.8 1
7 93.9 7.8 1
8 93.9 2.0 1
9 2.0 7.0 1
10 2.0 7.0 1
11 7.0 7.0 1
Whenever the same numbers show up consecutively three times or more in a row, that data should be replaced with NAN. So the result should be:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
I have written a function that does that:
def non_sense_remover(df, examined_columns, allowed_repeating):
    def count_each_group(grp, column):
        grp['Count'] = grp[column].count()
        return grp
    for col in examined_columns:
        sel = df.groupby((df[col] != df[col].shift(1)).cumsum()).apply(count_each_group, column=col)["Count"] > allowed_repeating
        df.loc[sel, col] = np.nan
    return df
df = non_sense_remover(df, ["B", "C"], 2)
However, my real dataframe has 2M rows and 18 columns! It is very very slow to run this function on 2M rows. Is there a more efficient way to do this? Am I missing something? Thanks in advance.

Constructing a boolean mask in this situation will be far more efficient than a solution based on apply(), particularly for large datasets. Here is an approach:
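cols = df[['B', 'C']]
# True at the middle row of three identical consecutive values
mask = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
# Widen the mask to the rows directly before and after each middle row
full_mask = mask | mask.shift(1, fill_value=False) | mask.shift(-1, fill_value=False)
df[['B', 'C']] = cols.mask(full_mask)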
Edit:
For a more general approach, replacing sequences of length N with NaN, you could do something like this:
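from functools import reduce
from operator import or_, and_

def replace_sequential_duplicates_with_nan(df, columns, N):
    cols = df[columns]
    # True where a row ends a run of N identical consecutive values (N >= 2)
    mask = reduce(and_, [cols.shift(i) == cols.shift(i + 1)
                         for i in range(N - 1)])
    # Spread each hit back over all N rows of the run
    full_mask = reduce(or_, [mask.shift(-i, fill_value=False)
                             for i in range(N)])
    df[columns] = cols.mask(full_mask)
    return df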

We can use groupby + mask:
m=df[['B','C']]
df[['B','C']]=m.mask(m.apply(lambda x : x.groupby(x.diff().ne(0).cumsum()).transform('count'))>2)
df
Out[1245]:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
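The same run-length idea can be written as a small reusable function. The following is a sketch under the assumption that a "run" is a stretch of identical consecutive values; the function name `mask_long_runs` and the `max_repeats` parameter are hypothetical. It uses transform('size') on a run identifier, so no Python-level function is called per group.

```python
import pandas as pd

def mask_long_runs(df, columns, max_repeats=2):
    # Work on a copy so the caller's frame is left untouched.
    out = df.copy()
    for col in columns:
        s = out[col]
        # A new run starts wherever the value differs from the previous row.
        run_id = s.ne(s.shift()).cumsum()
        # Length of the run that each row belongs to.
        run_len = s.groupby(run_id).transform("size")
        # Blank out every row of a run longer than `max_repeats`.
        out[col] = s.mask(run_len > max_repeats)
    return out

df = pd.DataFrame({
    "Dummy_Var": [1] * 12,
    "B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
    "C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7],
})
cleaned = mask_long_runs(df, ["B", "C"])
print(cleaned)
```

Only the listed columns are examined, so a constant column such as Dummy_Var is not wiped out.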

From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link I referenced goes into much more detail about why this is and how to solve it

Right way to update the data in a table?

I need to add three columns to a pandas dataframe, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following
import pandas
import numpy
def add_tests(add_df):
    new_tests = """
    (a+b)/n
    (a*b)/n
    ((a+b)/n)**-1
    """.split()
    rows = add_df.shape[0]
    cols = len(new_tests)
    U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
    add_df = pandas.concat([df, U], axis=1)
    for i, row in add_df.iterrows():
        # 1) good calculation:
        add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i])/ add_df['n'].loc[i]
        # 2) good calculation (Both ways):
        add_df['(a*b)/n'].loc[i] = (row['a'] * row['b'])/ row['n']
        # 3) bad calculation
        add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
        pass
    return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously step 3 does not work properly ...
How to do it the right way?

Fun with eval
define tuples of temporary column names with formulas
create a \n separated string of formulas to pass to eval
use dictionary to make formulas into column names
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
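If eval feels too indirect, plain column arithmetic gives the same three columns. This is a minimal sketch that rebuilds the question's data; it is an alternative to the eval approach above, not a restatement of it.

```python
import pandas as pd

df = pd.DataFrame({
    "n": [3, 2, 3, 3, 2, 1, 9, 5, 18, 10],
    "a": [1.2, 2.8, 2.3, 1.7, 6.9, 3.9, 2.3, 8.5, 6.7, 5.6],
    "b": [1.4, 3.8, 2.0, 5.7, 4.9, 19.0, 8.3, 3.1, 7.0, 6.4],
})

# Whole-column arithmetic: each expression is evaluated for every row at once,
# so no iterrows() loop or chained .loc assignment is needed.
df["(a+b)/n"] = (df["a"] + df["b"]) / df["n"]
df["(a*b)/n"] = (df["a"] * df["b"]) / df["n"]
# The third column reuses the first one, which already exists at this point.
df["((a+b)/n)**-1"] = df["(a+b)/n"] ** -1
print(df)
```

In the question's loop, `row` is a copy produced by iterrows(), so writing to add_df does not update it. That is why step 3 read a stale value there, and it cannot happen when the columns are assigned one after another as above.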

pandas Series: calculate means between neighbours

I have pandas Series and want to compute means between elements that are neighbours.
For example [1 2 3 4 5 6 7 8 9] would give the result [1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5]

Try using rolling for this:
S = pd.Series(range(1,10))
S1 = S.rolling(2).mean().dropna()
Output:
1 1.5
2 2.5
3 3.5
4 4.5
5 5.5
6 6.5
7 7.5
8 8.5
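An equivalent way to get the neighbour means, sketched here as an alternative, is to add the Series to a shifted copy of itself. The variable name `S2` is just for this illustration.

```python
import pandas as pd

S = pd.Series(range(1, 10))

# Average each element with its successor. The last element has no successor,
# so its result is NaN and is dropped.
S2 = ((S + S.shift(-1)) / 2).dropna()
print(S2)
```

The values are the same as with rolling(2).mean(), but the labels differ: here each mean is labelled by the left element of the pair (index 0 to 7), whereas the rolling version labels it by the right element (index 1 to 8).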
