I have a pandas DataFrame and I want to calculate, on a rolling basis, the average of all the values: across all the columns, for all the observations in the rolling window.
I have a solution with loops, but it feels very inefficient. Note that I can have NaNs in my data, so calculating the sum and dividing by the shape of the window would not be safe (I want a nanmean).
Any better approach?
Setup
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])
df[df>5] = np.nan # EDIT: add nans
My Attempt
n_roll = 2
df_stacked = df.values
roll_avg = {}
for idx in range(n_roll, len(df_stacked)+1):
    roll_avg[idx-1] = np.nanmean(df_stacked[idx - n_roll:idx, :].flatten())
roll_avg = pd.Series(roll_avg)
roll_avg.index = df.index[n_roll-1:]
roll_avg = roll_avg.reindex(df.index)
Desired Result
roll_avg
Out[33]:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
Thanks!
Here's one NumPy solution using sliding windows from scikit-image's view_as_windows -
from skimage.util.shape import view_as_windows
# Setup o/p array
out = np.full(len(df),np.nan)
# Get sliding windows of length n_roll along axis=0
w = view_as_windows(df.values,(n_roll,1))[...,0]
# Assign nan-ignored mean values computed along last 2 axes into o/p
out[n_roll-1:] = np.nanmean(w, (1,2))
Memory efficiency with views -
In [62]: np.shares_memory(df,w)
Out[62]: True
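If you prefer to stay within NumPy itself, NumPy >= 1.20 provides sliding_window_view, which gives the same kind of view-based windows (a sketch, reusing df and n_roll from the setup above):
from numpy.lib.stride_tricks import sliding_window_view
# Sliding windows of n_roll consecutive rows; shape (len(df)-n_roll+1, n_cols, n_roll)
w2 = sliding_window_view(df.to_numpy(), n_roll, axis=0)
# NaN-ignored mean over the last two axes, assigned into the output array
out2 = np.full(len(df), np.nan)
out2[n_roll-1:] = np.nanmean(w2, axis=(1, 2))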
To get the same result in the presence of NaNs, you can column_stack all the df.shift(i).values for i in range(n_roll), take nanmean along axis=1, and then replace the first n_roll-1 values with NaN afterwards:
roll_avg = pd.Series(np.nanmean(np.column_stack([df.shift(i).values for i in range(n_roll)]),1))
roll_avg[:n_roll-1] = np.nan
With the NaN-containing input from the setup, you get the expected result:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
dtype: float64
Using the answer referenced in the comment, one can do:
wsize = n_roll
cols = df.shape[1]
out = df.stack(dropna=False).rolling(window=wsize * cols, min_periods=1).mean().reset_index(-1, drop=True).sort_index()
out = out.groupby(out.index).last()
out.iloc[:n_roll-1] = np.nan
In my case it was important to specify dropna=False in stack, otherwise the length of the rolling window would not be correct.
But I am looking forward to other approaches as this does not feel very elegant/efficient.
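A pandas-only alternative (a sketch, not taken from the linked answer): the nanmean over a window is just the rolling sum of the non-NaN values divided by the rolling count of non-NaN values, both of which can be computed per row and then rolled:
n_roll = 2
# Per-row sum of non-NaN values (DataFrame.sum skips NaN by default)
row_sums = df.sum(axis=1)
# Per-row count of non-NaN values
row_counts = df.notna().sum(axis=1)
# nanmean over the window = rolling sum of values / rolling count of values;
# the first n_roll-1 entries are NaN automatically, and an all-NaN window gives 0/0 -> NaN
roll_avg = row_sums.rolling(n_roll).sum() / row_counts.rolling(n_roll).sum()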
My question is very similar to this one, except I would like to round to the closest value instead of always rounding up, so cut() doesn't seem to work.
import pandas as pd
import numpy as np
df = pd.Series([11,16,21, 125])
rounding_logic = pd.Series([15, 20, 100])
labels = rounding_logic.tolist()
rounding_logic = pd.concat([pd.Series([-np.inf]), rounding_logic])  # add -infinity as the leftmost edge (Series.append was removed in pandas 2.0)
pd.cut(df, rounding_logic, labels=labels).fillna(rounding_logic.iloc[-1])
The result is [15,20,100,100], but I'd like [15,15,20,100], since 16 is closest to 15 and 21 closest to 20.
You can try pandas.merge_asof with direction=nearest
out = pd.merge_asof(df.rename('1'), rounding_logic.rename('2'),
                    left_on='1',
                    right_on='2',
                    direction='nearest')
print(out)
1 2
0 11 15
1 16 15
2 21 20
3 125 100
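Note that merge_asof requires both inputs to be sorted on the merge keys (they already are here). If df were not sorted, you could sort first and restore the original order afterwards, along these lines (a sketch; 'val' and 'nearest' are just illustrative names):
left = df.sort_values().rename('val').reset_index()
right = rounding_logic.sort_values().rename('nearest')
out = (pd.merge_asof(left, right, left_on='val', right_on='nearest',
                     direction='nearest')
         .set_index('index')
         .sort_index())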
Get the absolute difference between each value and the candidates, and take the value from rounding_logic with the minimum difference:
>>> rounding_logic.reset_index(drop=True, inplace=True)
>>> df.apply(lambda x: rounding_logic[rounding_logic.sub(x).abs().idxmin()])
0 15.0
1 15.0
2 20.0
3 100.0
dtype: float64
PS: You need to reset the index in rounding_logic because you have a duplicate index after adding -inf to the start of the series.
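The apply above works row by row; for larger inputs the same idea can be broadcast in NumPy to avoid the Python-level loop (a sketch):
vals = rounding_logic.to_numpy()   # candidate values (the -inf entry can never be nearest)
idx = np.abs(df.to_numpy()[:, None] - vals[None, :]).argmin(axis=1)
nearest = pd.Series(vals[idx], index=df.index)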
I am working in pandas and want to implement an algorithm that requires I assess a modified centered median on a window, omitting the middle value. So, for instance, the unmodified version might be:
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5,center=True).median()
print(med)
and I would like the result for med[3] to be 3.5 (the median of 1., 2., 5., 6.) rather than 4.5, which is the ordinary windowed median. Is there an economical way to do this?
Try:
import numpy as np
import pandas as pd
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5).apply(lambda x: np.median(np.concatenate([x[0:2],x[3:5]]))).shift(-2)
print(med)
With output:
0 NaN
1 NaN
2 2.75
3 3.50
4 5.25
5 6.50
6 NaN
7 NaN
And more generally:
rolling_size = 5
ser.rolling(rolling_size).apply(lambda x: np.median(np.concatenate([x[0:rolling_size//2], x[rolling_size//2+1:rolling_size]]))).shift(-(rolling_size//2))
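A slightly more direct way to write this generalization (a sketch, assuming rolling_size is odd): let center=True handle the alignment and drop the middle element with np.delete:
rolling_size = 5  # assumed odd, so there is a single middle element
med = ser.rolling(rolling_size, center=True).apply(
    lambda x: np.median(np.delete(x, rolling_size // 2)),
    raw=True,  # pass plain ndarrays to the lambda
)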
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
def median(series, window=2):
    df = pd.DataFrame(series[window:].reset_index(drop=True))
    df[1] = series[:-window]
    df = df.apply(lambda x: x.mean(), axis=1)
    df.index += window - 1
    return df
median(ser)
I think this is simpler.
I work with large data sheets, in which I am trying to correlate all of the columns. I achieve this using:
df = df.rolling(5).corr(pairwise = True)
This produces data like this:
477
s1 -0.240339 0.932141 1.000000 0.577741 0.718307 -0.518748 0.772099
s2 0.534848 0.626280 0.577741 1.000000 0.645064 -0.455503 0.447589
s3 0.384720 0.907782 0.718307 0.645064 1.000000 -0.831378 0.406054
s4 -0.347547 -0.651557 -0.518748 -0.455503 -0.831378 1.000000 -0.569301
s5 -0.315022 0.576705 0.772099 0.447589 0.406054 -0.569301 1.000000
for each row contained in the data set. 477 in this case being the row number or index, and s1 - s5 being the column titles.
The goal is to find when the sensors are highly correlated with each other. I want to achieve this by (a) calculating the correlation using a rolling window of 5 rows, as in the code above, and (b) for each row produced, i.e. i = 0 to i = 500 for a 500-row Excel sheet, summing the table that dataframe.rolling(5).corr() produces for each value of i, i.e. producing one value per unit time, as in the graph included at the bottom. I am new to Stack Overflow, so please let me know if there's more information I can provide.
Example code + data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
dfn = df.rolling(5).corr(pairwise = True)
MATLAB code which accomplishes what I want:
% move through the data and get a correlation for 5 data points
for i=1:ns-4
    C(:,:,i)=corrcoef(X(i:i+4,:));
    cact(i)=sum(C(:,:,i),'all')-nv; % subtracting nv removes the diagonals that are = 1 and don't change
end
For the original data, the following is the graph I am trying to produce in Python, where the x axis is time:
[Correlation Graph]
Sum the entire table in both directions, and subtract the diagonal of 1's, which comes from the sensors being correlated with themselves.
Using your dfn, row four is
>>> dfn.loc[4]
col1 col2 col3
col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
You can sum the complete table using NumPy's ndarray.sum() on the underlying data:
>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103
Then, assuming the correlation table is square, you just need to subtract the number of columns/sensors. If there isn't already a variable for that, you can use the shape of the underlying NumPy array.
>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029
Without using the NumPy array, you could sum the correlation table twice before subtracting:
>>> four = dfn.loc[4]
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029
Get the NumPy array of the whole rolling correlation result and reshape it to get a separate correlation matrix for each original row:
n_sensors = 3
v = dfn.to_numpy() # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])
[[ 1. -0.14697697 -0.22705934]
[-0.14697697 1. 0.43521648]
[-0.22705934 0.43521648 1. ]]
Sum across the last two dimensions and subtract the number of sensors
result = v.sum((1,2)) - n_sensors
print(result)
[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]
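For comparison, a direct NumPy translation of the MATLAB loop from the question (a sketch) gives the same values as result[4:] above:
X = df.to_numpy()
ns, nv = X.shape
cact = np.empty(ns - 4)
for i in range(ns - 4):
    C = np.corrcoef(X[i:i + 5, :], rowvar=False)  # 5-row window, columns as variables
    cact[i] = C.sum() - nv                        # remove the diagonal of ones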
There is probably a way to do that in Pandas but I'd have to work on that to figure it out. Maybe someone will answer with an all Pandas solution.
The rolling correlation DataFrame has a MultiIndex:
>>> dfn.index
MultiIndex([(0, 'col1'),
(0, 'col2'),
(0, 'col3'),
(1, 'col1'),
(1, 'col2'),
(1, 'col3'),
(2, 'col1'),
(2, 'col2'),
(2, 'col3'),
...
With a quick review of the MultiIndex docs and a search for "pandas multi index sum on level 0 site:stackoverflow.com", I came up with this: group by level 0 and sum, then sum again along the columns.
>>> four_five = dfn.loc[[4,5]]
>>> four_five
col1 col2 col3
4 col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
5 col1 1.000000 0.191238 -0.644203
col2 0.191238 1.000000 0.579545
col3 -0.644203 0.579545 1.000000
>>> four_five.groupby(level=0).sum()
col1 col2 col3
4 0.625964 1.288240 1.208157
5 0.547035 1.770783 0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4 3.12236
5 3.25316
dtype: float64
Then, for the complete DataFrame:
>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0 -3.000000
1 -3.000000
2 -3.000000
3 -3.000000
4 0.122360
5 0.253160
6 -2.407632
7 -1.937020
8 -2.280236
9 -2.578865
dtype: float64
Reading a few more of the answers from that search (I should have looked at the DataFrame.sum docs more closely), the same thing can be written with sum(level=0). Note that the level argument of sum has been deprecated in newer pandas versions, so the groupby form above is the more future-proof spelling:
>>> dfn.sum(level=0).sum(1) - n_sensors
0 -3.000000
1 -3.000000
2 -3.000000
3 -3.000000
4 0.122360
5 0.253160
6 -2.407632
7 -1.937020
8 -2.280236
9 -2.578865
dtype: float64
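Putting it together, the whole pipeline from the raw df to one summed correlation value per row (the quantity plotted against time in the question) would then be, as a sketch:
n_sensors = df.shape[1]
dfn = df.rolling(5).corr(pairwise=True)
corr_strength = dfn.groupby(level=0).sum().sum(axis=1) - n_sensors
corr_strength.plot()  # rows with incomplete windows come out as -n_sensors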
I have a multi-index hierarchy set up as follows:
import numpy as np
import pandas as pd
sectors = ['A','B','C','D']
ports = ['pf','bm']
dates = list(range(1, 11)) * 2
wts, pchg = zip(*np.random.randn(20,2))
df = pd.DataFrame(dict(dates=dates, port=sorted(ports*10),
                       sector=np.random.choice(sectors,20),
                       wts=wts, pchg=pchg))
df = df.set_index(['port','sector','dates'])
df = df.unstack('port')
df = df.fillna(0)
I'd like to group by dates and port, and sum pchg * wts.
I've been through the docs but I'm struggling to figure this out.
Any help greatly appreciated. Thanks
You indeed do not need to stack the data back to get what you want; the product method can do the multiplication. Step by step:
Starting from this dataframe:
In [50]: df.head()
Out[50]:
pchg wts
port bm pf bm pf
sector dates
A 1 0.138996 0.451688 0.763287 -1.863401
3 1.081863 0.000000 0.956807 0.000000
4 0.207065 0.000000 -0.663175 0.000000
5 0.258293 -0.868822 0.109336 -0.784900
6 -1.016700 0.900241 -0.054077 -1.253191
We can first do the pchg * wts part with the product method, multiplying over axis 1, but only within each value of the second column level (the port level):
In [51]: df.product(axis=1, level=1).head()
Out[51]:
port bm pf
sector dates
A 1 0.106094 -0.841675
3 1.035134 0.000000
4 -0.137320 0.000000
5 0.028241 0.681938
6 0.054980 -1.128174
And then we can just group by dates (and no grouping by port needed anymore) and take the sum:
In [52]: df.product(axis=1, level=1).groupby(level='dates').sum()
Out[52]:
port bm pf
dates
1 0.106094 -0.841675
2 0.024968 1.357746
3 1.035134 1.776464
4 -0.137320 0.392312
5 0.028241 0.681938
6 0.054980 -1.128174
7 0.140183 -0.338828
8 1.296028 -1.526065
9 -0.213989 0.469104
10 0.058369 -0.006564
This gives the same output as
df.stack('port').groupby(level=[1,2]).apply(lambda x: (x['wts']*x["pchg"]).sum()).unstack('port')
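On recent pandas versions, where the level argument of reductions such as product is deprecated, the same multiplication can be written by selecting the two top-level column blocks directly (a sketch):
# df['pchg'] and df['wts'] are each (sector, dates) x port blocks,
# so elementwise multiplication lines the ports up automatically
weighted = df['pchg'] * df['wts']
result = weighted.groupby(level='dates').sum()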
I have a Pandas data frame 'df' in which I'd like to perform some scalings column by column.
In column 'a', I need the maximum number to be 1, the minimum number to be 0, and all other to be spread accordingly.
In column 'b', however, I need the minimum number to be 1, the maximum number to be 0, and all other to be spread accordingly.
Is there a Pandas function to perform these two operations? If not, numpy would certainly do.
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
This is how you can do it using scikit-learn and its preprocessing module. scikit-learn has many preprocessing functions for scaling and centering data.
In [0]: from sklearn.preprocessing import MinMaxScaler
In [1]: df = pd.DataFrame({'A':[14,90,90,96,91],
                           'B':[103,107,110,114,114]}).astype(float)
In [2]: df
Out[2]:
      A      B
0  14.0  103.0
1  90.0  107.0
2  90.0  110.0
3  96.0  114.0
4  91.0  114.0
In [3]: scaler = MinMaxScaler()
In [4]: df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
In [5]: df_scaled
Out[5]:
A B
0 0.000000 0.000000
1 0.926829 0.363636
2 0.926829 0.636364
3 1.000000 1.000000
4 0.939024 1.000000
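The question also asks for the second column to run from 1 down to 0; with the scaled frame above that is just the complement of the scaled column:
In [6]: df_scaled['B'] = 1 - df_scaled['B']   # min -> 1, max -> 0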
You could subtract the min, then divide by the new max (beware 0/0). Note that after subtracting the min, the new max is the original max - min.
In [11]: df
Out[11]:
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
In [12]: df -= df.min() # equivalent to df = df - df.min()
In [13]: df /= df.max() # equivalent to df = df / df.max()
In [14]: df
Out[14]:
a b
A 0.000000 0.000000
B 0.926829 0.363636
C 0.926829 0.636364
D 1.000000 1.000000
E 0.939024 1.000000
To switch the order of a column (from 1 to 0 rather than 0 to 1):
In [15]: df['b'] = 1 - df['b']
An alternative method is to negate the b column first (df['b'] = -df['b']).
In case you want to scale only one column in the dataframe, you can do the following:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Col1_scaled'] = scaler.fit_transform(df['Col1'].values.reshape(-1,1))
This is not very elegant, but the following works for this two-column case:
#Create dataframe
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
#Apply operates on each row or column with the lambda function
#axis = 0 -> act on columns, axis = 1 act on rows
#x is a variable for the whole row or column
#This line will scale minimum = 0 and maximum = 1 for each column
df2 = df.apply(lambda x:(x.astype(float) - min(x))/(max(x)-min(x)), axis = 0)
#Want to now invert the order on column 'B'
#Use apply function again, reverse numbers in column, select column 'B' only and
#reassign to column 'B' of original dataframe
df2['B'] = df2.apply(lambda x: 1-x, axis = 1)['B']
If I find a more elegant way (for example, using the column index, (0 or 1) mod 2 - 1, to select the sign in the apply operation so it can be done with just one apply command), I'll let you know.
I think Acumenus' comment on this answer should be mentioned explicitly as an answer, as it is a one-liner.
>>> import pandas as pd
>>> from sklearn.preprocessing import minmax_scale
>>> df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
>>> minmax_scale(df)
array([[0. , 0. ],
[0.92682927, 0.36363636],
[0.92682927, 0.63636364],
[1. , 1. ],
[0.93902439, 1. ]])
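minmax_scale returns a bare NumPy array; if you want to keep the index and column labels, you can wrap the result back into a DataFrame (a sketch):
>>> df_scaled = pd.DataFrame(minmax_scale(df), index=df.index, columns=df.columns)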
Given a data frame:
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
Scale to mean 0 and variance 1:
df.apply(lambda x: (x - np.mean(x)) / np.std(x), axis=0)
Scale to the range 0 to 1 (minimum 0, maximum 1):
df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)), axis=0)
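For reference, the same scalings can be written directly on the frame without apply (a sketch; note that np.std defaults to ddof=0 while DataFrame.std defaults to ddof=1, so ddof=0 reproduces the apply version above):
standardized = (df - df.mean()) / df.std(ddof=0)
minmax = (df - df.min()) / (df.max() - df.min())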