Sliding windows in numpy with varying window size - python

I am generating data with a timestamp (counting up). I then want to separate the array into windows based on the timestamp and calculate the mean of the data in each window. My new array then has a new "timestamp" and the calculated mean data.
My code works as intended, but I believe there is a more numpy-like way. I think the while loop can be removed, with np.where (or similar) checking the whole array, since it is already sorted as well.
Thanks for your help.
# generating test data: first row timestamps (always counting up), second row random data
import numpy as np

data = np.array([np.cumsum(np.random.randint(100, size=20)), np.random.randint(1, 5, size=20)])
print(data)

window_size = 200
overlap = 100
i, l_lim, u_lim = 0, 0, window_size
timestamps = []
window_mean = []

while u_lim < data[0, -1]:
    window_mean.append(np.mean(data[1, np.where((data[0, :] > l_lim) & (data[0, :] <= u_lim))]))
    timestamps.append(i)
    l_lim = u_lim - overlap
    u_lim = l_lim + window_size
    i += 1

print(np.array([timestamps, window_mean]))

While I may have reduced the number of lines of code, I do not think I have really improved it much. The main difference is the method of iteration, which is used directly to define the window boundaries; otherwise, I could not see any way to improve on your code. Here is my attempt for what it is worth:
Code:
import numpy as np

np.random.seed(5)
data = np.array([np.cumsum(np.random.randint(100, size=20)), np.random.randint(1, 5, size=20)])
print("Data:", data)

window_size = 200
overlap = 100

for i in range((max(data[0]) // (window_size - overlap)) + 1):
    result = np.mean(data[1, np.where((data[0] > i*(window_size-overlap)) & (data[0] <= (i*(window_size-overlap)) + window_size))])
    print(f"{i}: {result:.2f}")
Output:
Data: [[ 99 177 238 254 327 335 397 424 454 534 541 617 632 685 765 792 836 913 988 1053]
[ 4 3 1 3 2 3 3 2 2 3 2 2 3 2 3 4 1 3 2 3]]
0: 3.50
1: 2.33
2: 2.40
3: 2.40
4: 2.25
5: 2.40
6: 2.80
7: 2.67
8: 2.00
9: 2.67
10: 3.00
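For completeness, a loop-free variant is also possible (this sketch is an addition, not part of the answer above): since the timestamps are sorted, np.searchsorted can find every window's index range at once, and a prefix sum of the data row turns each window mean into a difference of two cumulative sums. Empty windows come out as NaN, just as np.mean of an empty selection does above.

import numpy as np

np.random.seed(5)
data = np.array([np.cumsum(np.random.randint(100, size=20)), np.random.randint(1, 5, size=20)])

window_size = 200
overlap = 100
step = window_size - overlap

starts = step * np.arange(data[0].max() // step + 1)       # lower limit of every window

# first index with timestamp > lower limit / > upper limit (timestamps are sorted)
ix0 = np.searchsorted(data[0], starts, side='right')
ix1 = np.searchsorted(data[0], starts + window_size, side='right')

# prefix sum of the data row: the sum over indices [ix0, ix1) is a difference of two entries
csum = np.concatenate(([0], np.cumsum(data[1])))
window_mean = (csum[ix1] - csum[ix0]) / (ix1 - ix0)         # NaN where a window is empty

print(np.array([np.arange(len(starts)), window_mean]))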

Related

Rolling average with window size an interval of column values

I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
])

averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
     0   1  rollingAve
0  4.5  10        10.0
1  4.6  11        10.0
2  4.8   9        10.0
3  5.5   6         6.0
4  5.6   6         6.0
5  8.1  10        11.5
6  8.2  13        11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
import numpy as np
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what the DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0    10.0
1    10.0
2    10.0
3     6.0
4     6.0
5    11.5
6    11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering if one could go faster in the case where the rolling involves large windows. It turns out I should have measured Pandas's rolling mean and sum performance first... (Premature optimization, anyone?) See the end for why.
Anyway, the idea was to do a cumsum just once, then take the difference of elements dereferenced by the windows endpoints:
# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and Series.rolling do allow value-based windows if the index is of type DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
     0   1  rolling_mean
0  4.5  10     10.000000
1  4.6  11     10.500000
2  4.8   9     10.000000
3  5.5   6      8.666667
4  5.6   6      7.000000
5  8.1  10     10.000000
6  8.2  13     11.500000
Advantages
This is a three-line solution that should have great performance, leveraging the pandas datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetimelike and offset based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.

Python numpy vectorization for heat dispersion

I'm supposed to write a code to represent heat dispersion using the finite difference formula given below.
๐‘ข(๐‘ก)๐‘–๐‘—=(๐‘ข(๐‘กโˆ’1)[๐‘–+1,๐‘—] + ๐‘ข(๐‘กโˆ’1) [๐‘–โˆ’1,๐‘—] +๐‘ข(๐‘กโˆ’1)[๐‘–,๐‘—+1] + ๐‘ข(๐‘กโˆ’1)[๐‘–,๐‘—โˆ’1])/4
The formula is supposed to produce the result only for a time step of 1. So, if an array like this was given:
100 100 100 100 100
100 0 0 0 100
100 0 0 0 100
100 0 0 0 100
100 100 100 100 100
The resulting array at time step 1 would be:
100 100 100 100 100
100 50 25 50 100
100 25 0 25 100
100 50 25 50 100
100 100 100 100 100
I know the representation using for loops would be as follows, where the array would have a minimum of 2 rows and 2 columns as a precondition:
h = np.copy(u)
for i in range(1, h.shape[0]-1):
    for j in range(1, h.shape[1]-1):
        num = u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1]
        h[i][j] = num/4
But I cannot figure out how to vectorize the code to represent heat dispersion. I am supposed to use numpy arrays and vectorization, am not allowed to use for loops of any kind, and I think I am supposed to rely on slicing, but I cannot figure out how to write it. So far I have started with:

r, c = h.shape
if (c == 2 or r == 2):
    return h

I'm fairly sure that if rows == 2 or columns == 2 the array should be returned as is, but correct me if I'm wrong. Any help would be greatly appreciated. Thank you!
Try:
h[1:-1,1:-1] = (h[2:,1:-1] + h[:-2,1:-1] + h[1:-1,2:] + h[1:-1,:-2]) / 4
This solution uses slicing, where:
1:-1 refers to indices 1, 2, ..., LAST - 1
2: refers to indices 2, 3, ..., LAST
:-2 refers to indices 0, 1, ..., LAST - 2
During each iteration only the inner elements (indices 1..LAST-1) are updated
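As a quick check (a sketch added here, not part of the original answer), applying the one-liner to the 5x5 example grid from the question reproduces the expected time-step-1 result. The right-hand side is evaluated completely before the assignment, so reading and writing h in the same statement is safe for a single step:

import numpy as np

u = np.full((5, 5), 100.0)
u[1:-1, 1:-1] = 0.0   # 100 on the border, 0 in the interior

h = np.copy(u)
h[1:-1, 1:-1] = (h[2:, 1:-1] + h[:-2, 1:-1] + h[1:-1, 2:] + h[1:-1, :-2]) / 4
print(h)
# [[100. 100. 100. 100. 100.]
#  [100.  50.  25.  50. 100.]
#  [100.  25.   0.  25. 100.]
#  [100.  50.  25.  50. 100.]
#  [100. 100. 100. 100. 100.]]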

A column in dataframe is automatically converted to float type [closed]

I have a dataframe like this
and when I apply a function on it like this
median = Top15['% Renewable'].median(axis=0)

def func(Top15):
    if (Top15['% Renewable'] >= median):
        Top15['HighRenew'] = 1
    else:
        Top15['HighRenew'] = 0
    return Top15

Top15.apply(func, axis=1)
The Rank column gets converted to float and I don't know why.
First, I cannot reproduce your problem.
I think it is better to compare against the median to get a boolean mask and convert it to int with astype, so that True becomes 1 and False becomes 0:

Top15['Rank'] = (Top15['% Renewable'] >= Top15['% Renewable'].median(axis=0)).astype(int)

The main reason to avoid apply (if possible) is the looping it does under the hood.
Sample:
Top15 = pd.DataFrame({'% Renewable':[10,23,56,78,90],
                      'Rank':[10,20,30,4,50]})
print (Top15)
#Top15 = pd.concat([Top15] * 1000, ignore_index=True)

   % Renewable  Rank
0           10    10
1           23    20
2           56    30
3           78     4
4           90    50
median = Top15['% Renewable'].median(axis=0)

def func(x):
    if (x['% Renewable'] >= median):
        x['HighRenew'] = 1
    else:
        x['HighRenew'] = 0
    return x

Top15 = Top15.apply(func, axis=1)
Top15['Rank2'] = (Top15['% Renewable'] >= Top15['% Renewable'].median(axis=0)).astype(int)
print (Top15)
   % Renewable  Rank  HighRenew  Rank2
0           10    10          0      0
1           23    20          0      0
2           56    30          1      1
3           78     4          1      1
4           90    50          1      1
Timings:
Top15 = pd.DataFrame({'% Renewable':[10,23,56,78,90],
                      'Rank':[10,20,30,4,50]})
print (Top15)
Top15 = pd.concat([Top15] * 1000, ignore_index=True)
In [49]: %timeit Top15.apply(func,axis=1)
1 loop, best of 3: 595 ms per loop
In [50]: %timeit (Top15['% Renewable'] >= Top15['% Renewable'].median(axis=0)).astype(int)
The slowest run took 5.19 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 346 µs per loop
You can use astype:
Top15['Rank'] = Top15.Rank.astype(int)
Or:
Top15['Rank'] = Top15.Rank.astype(object)
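As for why the Rank column turns into float in the first place: apply with axis=1 hands each row to the function as a Series, and a Series can only hold a single dtype, so an integer column that shares a row with float values gets upcast when the rows are reassembled. A minimal sketch of this behaviour (an added example with made-up data, assuming % Renewable is a float column as in the original dataframe):

import pandas as pd

Top15 = pd.DataFrame({'% Renewable': [10.5, 23.2, 56.1],   # float column
                      'Rank': [1, 2, 3]})                  # int column
print(Top15.dtypes)
# % Renewable    float64
# Rank             int64

out = Top15.apply(lambda row: row, axis=1)                 # identity function, applied row by row
print(out.dtypes)
# % Renewable    float64
# Rank           float64   <- upcast, because each row was passed as a float64 Series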

Python Pandas: calculate rolling mean (moving average) over variable number of rows

Say I have the following dataframe
import pandas as pd
df = pd.DataFrame({'distance': [2.0, 3.0, 1.0, 4.0],
                   'velocity': [10.0, 20.0, 5.0, 40.0]})
gives the dataframe
   distance  velocity
0       2.0      10.0
1       3.0      20.0
2       1.0       5.0
3       4.0      40.0
How can I calculate the average of the velocity column over the rolling sum of the distance column? With the example above, create a rolling sum over the last N rows in order to get a minimum cumulative distance of 5, and then calculate the average velocity over those rows.
My target output would then be like this:
   distance  velocity    rv
0       2.0      10.0   NaN
1       3.0      20.0  15.0
2       1.0       5.0  11.7
3       4.0      40.0  22.5
where
15.0 = (10+20)/2 (2 because 3 + 2 >= 5)
11.7 = (10 + 20 + 5)/3 (3 because 1 + 3 + 2 >= 5)
22.5 = (5 + 40)/2 (2 because 4 + 1 >= 5)
Update: in Pandas-speak, my code should find the index of the reverse cumulative distance sum back from my current record (such that it is 5 or greater), and then use that index to calculate the start of the moving average.
Not a particularly pandasy solution, but it sounds like you want to do something like
df['rv'] = np.nan
for i in range(len(df)):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += df['distance'].loc[j]
        j -= 1
    if s >= 5:
        df['rv'].loc[i] = df['velocity'][j+1:i+1].mean()
Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:
l = len(df)
a = np.full(l, np.nan)   # rows where the cumulative distance never reaches 5 stay NaN
d = df['distance'].values
v = df['velocity'].values

for i in range(l):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()

df['rv'] = a
Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second solution can be decorated with @numba.jit out of the box with immediate benefits. Benchmarking all three solutions on
pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})
I get the following results:
Method                        Benchmark
-----------------------------------------------
Original data frame based     4.65 s ± 325 ms
Pure numpy array based        80.8 ms ± 9.95 ms
Jitted numpy array based      766 µs ± 52 µs
Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with
import numba

@numba.jit
def numba_example():
    l = len(df)
    a = np.empty(l)
    d = df['distance'].values
    v = df['velocity'].values
    for i in range(l):
        j = i
        s = 0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = 0   # np.empty does not zero the array, so reset before accumulating
            for k in range(j+1, i+1):
                a[i] += v[k]
            a[i] /= (i-j)
    df['rv'] = a
then the benchmark reduces to 158 µs ± 8.41 µs.
Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
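Along those lines, here is a loop-free sketch (an addition, not part of the original answer) that exploits the fact that the distances are non-negative, so their cumulative sum is non-decreasing: np.searchsorted then finds, for each row, the latest start index whose cumulative distance back to the current row is at least 5, and a prefix sum of velocity turns each window mean into a difference of two entries.

import numpy as np
import pandas as pd

df = pd.DataFrame({'distance': [2.0, 3.0, 1.0, 4.0],
                   'velocity': [10.0, 20.0, 5.0, 40.0]})

d = df['distance'].values
v = df['velocity'].values

cd = np.concatenate(([0.0], np.cumsum(d)))   # cd[k] = total distance of rows 0..k-1
cv = np.concatenate(([0.0], np.cumsum(v)))   # cv[k] = total velocity of rows 0..k-1

# For row i, find the largest start index k with cd[i+1] - cd[k] >= 5,
# i.e. the smallest window ending at i whose distance sum reaches 5.
end = cd[1:]                                  # cd[i+1] for each row i
start = np.searchsorted(cd, end - 5, side='right') - 1

rv = np.full(len(df), np.nan)
ok = start >= 0                               # rows where a sum of 5 can be reached at all
rows = np.arange(len(df))
rv[ok] = (cv[rows[ok] + 1] - cv[start[ok]]) / (rows[ok] + 1 - start[ok])

df['rv'] = rv
print(df)
#    distance  velocity         rv
# 0       2.0      10.0        NaN
# 1       3.0      20.0  15.000000
# 2       1.0       5.0  11.666667
# 3       4.0      40.0  22.500000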
How about
df.rolling(window=3, min_periods=2).mean()
   distance   velocity
0       NaN        NaN
1  2.500000  15.000000
2  2.000000  11.666667
3  2.666667  21.666667
To combine them
df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()
It looks like something's a little off with the window shape.

Classifying Data in a New Column

I have following df:
Column 1
1
2435
3345
104
505
6005
10000
80000
100000
4000000
4440
520
...
This structure is not the best for plotting a histogram, which is the main purpose. Bins don't really solve the problem either, at least from what I've tested so far. That's why I'd like to create my own bins in a new column:
I basically want to assign every value within a certain range in column 1 a bucket in column 2, so that it looks like this:
Column 1 Column2
1 < 10000
2435 < 10000
3345 < 10000
104 < 10000
505 < 10000
6005 < 10000
10000 < 50000
80000 < 150000
100000 < 150000
4000000 < 250000
4440 < 10000
520 < 10000
...
Once I get there, creating a plot will be much easier.
Thanks!
There is a pandas equivalent to this: cut (there is a section describing it in the docs). cut returns the open-closed interval each value falls into:
In [29]:
df['bin'] = pd.cut(df['Column 1'], bins = [0,10000, 50000, 150000, 25000000])
df
Out[29]:
    Column 1                 bin
0          1          (0, 10000]
1       2435          (0, 10000]
2       3345          (0, 10000]
3        104          (0, 10000]
4        505          (0, 10000]
5       6005          (0, 10000]
6      10000          (0, 10000]
7      80000     (50000, 150000]
8     100000     (50000, 150000]
9    4000000  (150000, 25000000]
10      4440          (0, 10000]
11       520          (0, 10000]
The dtype of the column is a Category and can be used for filtering, counting, plotting etc.
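For example, to get the histogram the question is after, you could count the values per bin and plot the counts as a bar chart (a small added sketch, assuming matplotlib is available; not part of the original answer):

import matplotlib.pyplot as plt

counts = df['bin'].value_counts(sort=False)   # sort=False keeps the bins in interval order
counts.plot(kind='bar')
plt.show()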
numpy.histogram takes a bins parameter, which can be an array of bin edges, and returns an array of the counts within those bins. So, if you run

import numpy as np
counts, _ = np.histogram(df['Column 1'].values, [10000, 50000, 150000, 250000])

you will have the counts for the bins you want. From here, you can do whatever you want, including plotting the number of counts within each bin:
plot(counts)
