I'm trying to calculate a rolling average on some incomplete data. I want to average the values in column 2 over windows of width 1.0 in the values of column 1 (miles). I've tried .rolling(), but (from my limited understanding) it only creates windows based on the index, not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
])
averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index, 0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas's BaseIndexer is quite handy, although it takes a bit of head-scratching to get right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
import numpy as np
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # note: newer pandas versions may also pass a `step` keyword here,
        # in which case add `step=None` to the signature
        if min_periods is None:
            min_periods = 0
        if closed is None:
            closed = 'left'
        w = (-self.width / 2, self.width / 2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what the DataFrame.rolling documentation specifies.
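If the side arguments look mysterious, here is a tiny illustration (my aside, not part of the original answer) of how they decide whether a value equal to a window endpoint is included:
import numpy as np
vals = np.array([4.5, 4.6, 4.8, 5.5, 5.6])
# side='left' returns the first position where 4.6 could be inserted, so an element
# equal to the bound stays inside a window starting there; side='right' points just
# past any equal elements, excluding them.
np.searchsorted(vals, 4.6, side='left')   # -> 1
np.searchsorted(vals, 4.6, side='right')  # -> 2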
Application:
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
], columns='a b'.split())
df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
np.random.uniform(0, 1000, size=(1_000_000, 2)),
columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering whether one could go faster when the rolling involves large windows. It turns out I should have measured Pandas's rolling mean and sum performance first... (premature optimization, anyone?); see the end of this update for why.
Anyway, the idea was to do a cumsum just once, then take the difference of the elements indexed by each window's endpoints:
# both functions below work on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
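As a quick sanity check (my addition, reusing the 1-million-row df and the RangeWindow class from above), the cumsum shortcut should agree with the centered, closed='both' indexer-based rolling mean:
# the cumsum shortcut and the BaseIndexer-based rolling mean should coincide
ref = df.b.rolling(RangeWindow(df.a, width=100.0),
                   min_periods=1, center=True, closed='both').mean()
fast = fast_rolling_mean(df.a.values, df.b.values, width=100.0)
assert np.allclose(ref.to_numpy(), fast)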
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one): its timings don't increase with larger windows, which is why I said I should have checked first.
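If you want to check that on your own machine (my addition; exact numbers will vary), time the indexer-based rolling mean at a few widths and watch it stay roughly flat:
import timeit
# time the BaseIndexer-based rolling mean at several window widths
for width in (1.0, 10.0, 100.0):
    t = timeit.timeit(
        lambda: df.b.rolling(RangeWindow(df.a, width=width), min_periods=1).mean(),
        number=5,
    )
    print(f"width={width}: {t / 5 * 1000:.0f} ms per call")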
df.rolling and Series.rolling do allow value-based windows if the index is of type DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
# cast miles to a TimedeltaIndex: 1 mile -> 1 second (numeric input is interpreted as nanoseconds)
df = df.set_index(pd.TimedeltaIndex(df[0] * 1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging the pandas datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to timedelta seconds, and the average isn't centered (center isn't implemented for datetimelike and offset-based windows).
Overall: if you value performance and can live with a non-centered mean, this is a great way to go, with a comment or two in the code.
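If you go that route, a small wrapper with those comments baked in might look like this (a sketch: the helper name and the to_timedelta-based variant are mine, not from the original answer, and the value column must be sorted):
import pandas as pd

def rolling_mean_by_value(frame, value_col, data_col, width):
    # Trailing (non-centered) rolling mean of data_col over a window of `width`
    # in value_col. Hack: the value column is cast to a timedelta (1 unit -> 1 second)
    # so pandas' offset-based rolling can be used. Assumes frame is sorted by value_col.
    tmp = frame.set_index(pd.to_timedelta(frame[value_col], unit="s"))
    return tmp[data_col].rolling(f"{width}s").mean().to_numpy()

df["rolling_mean"] = rolling_mean_by_value(df, 0, 1, 1)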
Related
I would like to perform a rolling average, but with a window that only has a finite 'vision' in x. I would like something similar to what I have below, but with a window range based on the x value rather than the positional index.
While doing this within pandas is preferred, numpy/scipy equivalents are also OK.
import numpy as np
import pandas as pd
x_val = [1,2,4,8,16,32,64,128,256,512]
y_val = [x+np.random.random()*200 for x in x_val]
df = pd.DataFrame(data={'x':x_val,'y':y_val})
df.set_index('x', inplace=True)
df.plot()
df.rolling(1, win_type='gaussian').mean(std=2).plot()
So I would expect the first 5 values to be averaged together because they are within 10 x-units of each other, but the last values to be unchanged.
According to the pandas documentation on rolling:
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
Therefore, you may need to fake a rolling operation with various window sizes, like this:
test_df = pd.DataFrame({'x':np.linspace(1,10,10),'y':np.linspace(1,10,10)})
test_df['win_locs'] = np.linspace(1,10,10).astype('object')
for ind in range(10):
    test_df.at[ind, 'win_locs'] = np.random.randint(0, 10, np.random.randint(5)).tolist()
# rolling operation with various window sizes
def worker(idx_list):
    x_slice = test_df.loc[idx_list, 'x']
    return np.sum(x_slice)
test_df['rolling'] = test_df['win_locs'].apply(worker)
As you can see, test_df is
x y win_locs rolling
0 1.0 1.0 [5, 2] 9.0
1 2.0 2.0 [4, 8, 7, 1] 24.0
2 3.0 3.0 [] 0.0
3 4.0 4.0 [9] 10.0
4 5.0 5.0 [6, 2, 9] 20.0
5 6.0 6.0 [] 0.0
6 7.0 7.0 [5, 7, 9] 24.0
7 8.0 8.0 [] 0.0
8 9.0 9.0 [] 0.0
9 10.0 10.0 [9, 4, 7, 1] 25.0
where the rolling operation is achieved with apply method.
However, this approach is significantly slower than the native rolling, for example,
test_df = pd.DataFrame({'x':np.linspace(1,10,10),'y':np.linspace(1,10,10)})
test_df['win_locs'] = np.linspace(1,10,10).astype('object')
for ind in range(10): test_df.at[ind,'win_locs'] = np.arange(ind-1,ind+1).tolist() if ind >= 1 else []
using the approach above
%%timeit
# rolling operation with various window sizes
def worker(idx_list):
    x_slice = test_df.loc[idx_list, 'x']
    return np.sum(x_slice)
test_df['rolling_apply'] = test_df['win_locs'].apply(worker)
the result is
41.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
while using native rolling is ~50x faster
%%timeit
test_df['rolling_native'] = test_df['x'].rolling(window=2).sum()
863 µs ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The key question remains: what do you want to achieve with the rolling mean?
Mathematically, a clean way is:
1. interpolate to the finest dx of the x-data
2. perform the rolling mean
3. take out the data points you want (but be careful: this step is itself a kind of averaging!)
Here is the code for the interpolation (a sketch of steps 2 and 3 follows after it):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
x_val = [1,2,4,8,16,32,64,128,256,512]
y_val = [x+np.random.random()*200 for x in x_val]
df = pd.DataFrame(data={'x':x_val,'y':y_val})
df.set_index('x', inplace=True)
#df.plot()
df.rolling(5, win_type='gaussian').mean(std=200).plot()
#---- Interpolation -----------------------------------
f1 = interp1d(x_val, y_val)
f2 = interp1d(x_val, y_val, kind='cubic')
dx = np.diff(x_val).min() # get the smallest dx in the x-data set
xnew = np.arange(x_val[0], x_val[-1]+dx, step=dx)
ynew1 = f1(xnew)
ynew2 = f2(xnew)
#---- plot ---------------------------------------------
fig = plt.figure(figsize=(15,5))
plt.plot(x_val, y_val, '-o', label='data', alpha=0.5)
plt.plot(xnew, ynew1, '|', ms = 15, c='r', label='linear', zorder=1)
#plt.plot(xnew, ynew2, label='cubic')
plt.legend(loc='best')  # call legend before saving so it appears in the saved file
plt.savefig('curve.png')
plt.show()
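For completeness, here is a minimal sketch of steps 2 and 3 on the interpolated grid (my addition; the 10-x-unit window width is an assumption for illustration, and it reuses xnew, ynew1 and dx from above):
# step 2: rolling mean on the uniform grid (window expressed in grid points)
df_interp = pd.DataFrame({'y': ynew1}, index=xnew)
win = int(round(10 / dx))  # a 10-x-unit window converted to a number of grid points
df_interp['y_smooth'] = df_interp['y'].rolling(win, center=True, min_periods=1).mean()
# step 3: take out the points at the original x positions again
smoothed_at_x = df_interp['y_smooth'].reindex(x_val, method='nearest')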
Hopefully, someone will come up with a faster solution.
Meanwhile, you can use the DataFrame.iterrows() for doing that:
for idx, row in df.iterrows():
    df.loc[idx, 'avg'] = df.loc[idx-10:idx, 'y'].mean()
Output:
y avg
x
1 26.540168 26.540168
2 28.255431 27.397799
4 114.941475 56.579025
8 156.347716 81.521197
16 168.563203 162.455459
32 36.054945 36.054945
64 179.384703 179.384703
128 225.098994 225.098994
256 340.718363 340.718363
512 551.927011 551.927011
I would like to build a data frame from an existing one, where each value in a row depends on the previous one. I have an initial value v0 as a starting point. Let me give an example:
In [126]:import pandas as pd
In [127]: df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
In [128]: df_result = df.copy()
In [129]: v0 = 10
In [130]: for i in range(1, len(df.index)):
...: df_result.iloc[i, 0] = df.iloc[i, 0]*df_result.iloc[i-1, 0]
...:
In [131]: df_result
Out[131]:
0
0 1.000
1 1.100
2 1.320
3 1.716
In [132]:
My question is about the for loop. How can I write this more efficiently?
I believe you need to first numpy.insert the value v0 at the first position and then call numpy.cumprod:
import numpy as np
import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
# drop the first ratio, put v0 in front, then take the cumulative product
df['n'] = np.cumprod(np.insert(df['r'].values[1:], 0, v0))
print(df)
r n
0 1.0 10.00
1 1.1 11.00
2 1.2 13.20
3 1.3 17.16
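As an aside (not part of the original answer), an equivalent pandas-only variant is to overwrite the first ratio with the starting value and call Series.cumprod:
# same result with Series.cumprod: replace the first ratio by the starting value
s = df['r'].copy()
s.iloc[0] = v0
df['n2'] = s.cumprod()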
Say I have the following dataframe
import pandas as pd
df = pd.DataFrame({ 'distance':[2.0, 3.0, 1.0, 4.0],
'velocity':[10.0, 20.0, 5.0, 40.0] })
gives the dataframe
distance velocity
0 2.0 10.0
1 3.0 20.0
2 1.0 5.0
3 4.0 40.0
How can I calculate the average of the velocity column over the rolling sum of the distance column? With the example above, create a rolling sum over the last N rows in order to get a minimum cumulative distance of 5, and then calculate the average velocity over those rows.
My target output would then be like this:
distance velocity rv
0 2.0 10.0 NaN
1 3.0 20.0 15.0
2 1.0 5.0 11.7
3 4.0 40.0 22.5
where
15.0 = (10+20)/2 (2 because 3 + 2 >= 5)
11.7 = (10 + 20 + 5)/3 (3 because 1 + 3 + 2 >= 5)
22.5 = (5 + 40)/2 (2 because 4 + 1 >= 5)
Update: in Pandas-speak, my code should find the index of the reverse cumulative distance sum back from my current record (such that it is 5 or greater), and then use that index to calculate the start of the moving average.
Not a particularly pandasy solution, but it sounds like you want to do something like
df['rv'] = np.nan
for i in range(len(df)):
    j = i
    s = 0
    # walk backwards until the cumulative distance reaches 5
    while j >= 0 and s < 5:
        s += df['distance'].loc[j]
        j -= 1
    if s >= 5:
        df.loc[i, 'rv'] = df['velocity'][j+1:i+1].mean()
Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:
l = len(df)
a = np.full(l, np.nan)  # NaN wherever the threshold can't be reached
d = df['distance'].values
v = df['velocity'].values
for i in range(l):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()
df['rv'] = a
Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second solution can be decorated with @numba.jit out of the box with immediate benefits. Benchmarking all three solutions on
pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})
I get the following results:
Method Benchmark
-----------------------------------------------
Original data frame based 4.65 s ± 325 ms
Pure numpy array based 80.8 ms ± 9.95 ms
Jitted numpy array based 766 µs ± 52 µs
Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with
import numba

@numba.jit
def numba_example():
    l = len(df)
    a = np.full(l, np.nan)
    d = df['distance'].values
    v = df['velocity'].values
    for i in range(l):
        j = i
        s = 0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = 0.0  # initialise before accumulating
            for k in range(j+1, i+1):
                a[i] += v[k]
            a[i] /= (i-j)
    df['rv'] = a
then the benchmark reduces to 158 µs ± 8.41 µs.
Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
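For instance, one way to realize that idea (my sketch, assuming non-negative distances) is a two-pointer pass that trims the window from the left instead of re-summing it for every row:
import numpy as np

def rolling_velocity_two_pointer(d, v, min_dist=5):
    # mean of v over the shortest trailing window whose distance sum reaches min_dist
    n = len(d)
    out = np.full(n, np.nan)
    j = -1    # the current window is rows j+1 .. i
    s = 0.0   # distance sum of the current window
    for i in range(n):
        s += d[i]
        # drop rows from the left while the window still reaches min_dist without them
        while j + 1 < i and s - d[j + 1] >= min_dist:
            j += 1
            s -= d[j]
        if s >= min_dist:
            out[i] = v[j + 1:i + 1].mean()
    return out

df['rv'] = rolling_velocity_two_pointer(df['distance'].values, df['velocity'].values)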
How about
df.rolling(window=3, min_periods=2).mean()
distance velocity
0 NaN NaN
1 2.500000 15.000000
2 2.000000 11.666667
3 2.666667 21.666667
To attach it to the original frame:
df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()
It looks like something's a little off with the window shape.
Let's say I have a pandas.DataFrame that looks as follows:
c1 | c2
-------
1 | 5
2 | 6
3 | 7
4 | 8
.....
1 | 7
and I'm looking to map a function (DataFrame.corr) over it, but I would like it to take n rows at a time. The result should be a series of correlation values, either shorter than the original DataFrame or with a few values that didn't get a full n rows of data.
Is there a way to do this and how? I've been looking through the DataFrame and Map, Apply, Filter documentation but it doesn't seem to have an obvious or clean solution.
With pandas 0.20, using rolling with corr produces a multi-indexed dataframe. You can slice afterwards to get what you're looking for.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])
c1 c2
0 0 2
1 7 3
2 8 7
3 0 6
4 8 6
5 0 2
6 0 4
7 9 7
8 3 2
9 4 3
rolling + corr... pandas 0.20.x
df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c1, dtype: float64
rolling + corr... pandas 0.19.x or prior
Prior to 0.20, rolling + corr produced a pd.Panel
df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c2, dtype: float64
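As an aside (not from the original answer): on recent pandas versions you can also skip the MultiIndex slicing entirely and call rolling corr on one column against the other, which should yield the same values (with NaN for the first incomplete windows):
# pairwise rolling correlation, one column against the other
df['c1'].rolling(5).corr(df['c2'])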
numpy + as_strided
However, I wasn't satisfied with the approaches above. Below is a specialized function that takes an n-by-2 dataframe and returns a series of the rolling correlations. DISCLAIMER: this uses some advanced techniques and should really only be used if you know what it does. Meaning, if you need a detailed breakdown of how it works... then it probably isn't for you.
from numpy.lib.stride_tricks import as_strided as strided

def rolling_correlation(a, w):
    n, m = a.shape[0], 2
    s1, s2 = a.strides
    # strided view of shape (2, w, n - w + 1): one length-w window per output position
    b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
    b_mb = b - b.mean(1, keepdims=True)
    b_ss = (b_mb ** 2).sum(1) ** .5
    return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])

def rolling_correlation_df(df, w):
    a = df.values
    return pd.Series(rolling_correlation(a, w), df.index[w-1:])

rolling_correlation_df(df, 5)
rolling_correlation_df(df, 5)
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
dtype: float64
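One extra precaution worth mentioning (my aside, not part of the original answer): as_strided views share memory with the source array, so you may want to make them read-only to guard against accidental writes:
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(6.0)
view = as_strided(x, shape=(4, 3), strides=(x.strides[0], x.strides[0]), writeable=False)
# view[0, 0] = 99.0  # would raise ValueError: assignment destination is read-only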
Timing
small data
%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])
%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop
I have a data frame with the following columns: {'day','measurement'}
And there might be several measurements in a day (or no measurements at all)
For example:
day | measurement
1 | 20.1
1 | 20.9
3 | 19.2
4 | 20.0
4 | 20.2
and an array of coefficients:
coef={-1:0.2, 0:0.6, 1:0.2}
My goal is to resample the data and average it using the coefficients (missing days should be left out).
This is the code I wrote to calculate that (looping over every day d in the range):
window = [-1, 0, 1]
for d in range(df['day'].min(), df['day'].max() + 1):
    vals = [coef[i] * df['measurement'][df['day'] == d - i].mean()
            for i in window if df['measurement'][df['day'] == d - i].shape[0] > 0]
    weights = [coef[i] for i in window if df['measurement'][df['day'] == d - i].shape[0] > 0]
    df.loc[df['day'] == d, 'resampled_measurement'] = sum(vals) / sum(weights)
For the example above, the output should be:
day   resampled_measurement
1     20.500
2     19.850
3     19.425
4     19.875
The problem is that the code runs forever, and I'm pretty sure that there's a better way to resample with coefficients.
Any advice would be highly appreciated !
Here's a possible solution to what you're looking for:
# This is your data
In [2]: data = pd.DataFrame({
...: 'day': [1, 1, 3, 4, 4],
...: 'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
...: })
# Pre-compute every day's average, filling the gaps
In [3]: measurement = data.groupby('day')['measurement'].mean()
In [4]: measurement = measurement.reindex(np.arange(data.day.min(), data.day.max() + 1))  # pd.np is gone in newer pandas; use numpy (imported as np) directly
In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2})
# Create a matrix with the time-shifted measurements
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key in coef.index})  # Series.iteritems() is gone in newer pandas; iterate the index instead
In [7]: matrix
Out[7]:
-1 0 1
day
1 NaN 20.5 NaN
2 19.2 NaN 20.5
3 20.1 19.2 NaN
4 NaN 20.1 19.2
# Take a weighted average of the matrix
In [8]: (matrix * coef).sum(axis=1) / (matrix.notnull() * coef).sum(axis=1)
Out[8]:
day
1 20.500
2 19.850
3 19.425
4 19.875
dtype: float64
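If you then want those per-day values attached back onto the original per-measurement rows, a small follow-up (my addition, not part of the original answer) is to map them by day:
# attach the weighted daily averages back onto the original rows
resampled = (matrix * coef).sum(axis=1) / (matrix.notnull() * coef).sum(axis=1)
data['resampled_measurement'] = data['day'].map(resampled)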