Python Pandas: calculate rolling mean (moving average) over variable number of rows

Say I have the following dataframe
import pandas as pd
df = pd.DataFrame({'distance': [2.0, 3.0, 1.0, 4.0],
                   'velocity': [10.0, 20.0, 5.0, 40.0]})
which gives the dataframe
distance velocity
0 2.0 10.0
1 3.0 20.0
2 1.0 5.0
3 4.0 40.0
How can I calculate the average of the velocity column over the rolling sum of the distance column? With the example above, create a rolling sum over the last N rows in order to get a minimum cumulative distance of 5, and then calculate the average velocity over those rows.
My target output would then be like this:
distance velocity rv
0 2.0 10.0 NaN
1 3.0 20.0 15.0
2 1.0 5.0 11.7
3 4.0 40.0 22.5
where
15.0 = (10+20)/2 (2 because 3 + 2 >= 5)
11.7 = (10 + 20 + 5)/3 (3 because 1 + 3 + 2 >= 5)
22.5 = (5 + 40)/2 (2 because 4 + 1 >= 5)
Update: in Pandas-speak, my code should find the index of the reverse cumulative distance sum back from my current record (such that it is 5 or greater), and then use that index to calculate the start of the moving average.
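One way to express that reverse-cumulative-sum idea without an explicit Python loop (a sketch added here for illustration, not part of the original post; it assumes non-negative distances so the cumulative sum is monotonic) is to searchsorted into the cumulative distance for each window's start index and then take differences of a cumulative velocity sum:
import numpy as np
import pandas as pd

df = pd.DataFrame({'distance': [2.0, 3.0, 1.0, 4.0],
                   'velocity': [10.0, 20.0, 5.0, 40.0]})

# cumulative sums with a leading zero so window sums become simple differences
dist_cum = np.concatenate(([0.0], df['distance'].cumsum().to_numpy()))
vel_cum = np.concatenate(([0.0], df['velocity'].cumsum().to_numpy()))

# for row i, the window starts at the largest k with dist_cum[k] <= dist_cum[i+1] - 5
start = np.searchsorted(dist_cum, dist_cum[1:] - 5.0, side='right') - 1

rv = np.full(len(df), np.nan)
valid = start >= 0  # rows whose trailing distance never reaches 5 stay NaN
idx = np.arange(len(df))
rv[valid] = (vel_cum[idx[valid] + 1] - vel_cum[start[valid]]) / (idx[valid] + 1 - start[valid])
df['rv'] = rv
# rv -> [NaN, 15.0, 11.666..., 22.5]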

Not a particularly pandasy solution, but it sounds like you want to do something like
import numpy as np

df['rv'] = np.nan
for i in range(len(df)):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += df['distance'].loc[j]
        j -= 1
    if s >= 5:
        # assign via df.loc to avoid chained-assignment issues
        df.loc[i, 'rv'] = df['velocity'][j+1:i+1].mean()
Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:
l = len(df)
a = np.full(l, np.nan)  # rows that never reach a distance of 5 stay NaN
d = df['distance'].values
v = df['velocity'].values
for i in range(l):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()
df['rv'] = a
Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second solution can be decorated with @numba.jit out of the box with immediate benefits. Benchmarking all three solutions on
pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})
I get the following results:
Method                       Benchmark
-----------------------------------------------
Original data frame based    4.65 s ± 325 ms
Pure numpy array based       80.8 ms ± 9.95 ms
Jitted numpy array based     766 µs ± 52 µs
Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with
@numba.jit
def numba_example():
    l = len(df)
    a = np.full(l, np.nan)
    d = df['distance'].values
    v = df['velocity'].values
    for i in range(l):
        j = i
        s = 0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = 0.0  # initialize before accumulating the manual mean
            for k in range(j+1, i+1):
                a[i] += v[k]
            a[i] /= (i-j)
    df['rv'] = a
then the benchmark reduces to 158 µs ± 8.41 µs.
Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
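A minimal sketch of that tail-trimming idea (added here for illustration; it assumes non-negative distances and the same threshold of 5):
import numpy as np

def rolling_velocity_mean(d, v, min_dist=5.0):
    """Two-pointer sketch: keep a running window sum and shrink it from the
    tail instead of re-summing the window for every row."""
    n = len(d)
    out = np.full(n, np.nan)
    start = 0          # first row currently inside the window
    dist_sum = 0.0     # distance accumulated over rows start..i
    vel_sum = 0.0      # velocity accumulated over rows start..i
    for i in range(n):
        dist_sum += d[i]
        vel_sum += v[i]
        # drop rows from the tail while the window would still reach min_dist
        while dist_sum - d[start] >= min_dist:
            dist_sum -= d[start]
            vel_sum -= v[start]
            start += 1
        if dist_sum >= min_dist:
            out[i] = vel_sum / (i - start + 1)
    return out

df['rv'] = rolling_velocity_mean(df['distance'].values, df['velocity'].values)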

How about
df.rolling(window=3, min_periods=2).mean()
distance velocity
0 NaN NaN
1 2.500000 15.000000
2 2.000000 11.666667
3 2.666667 21.666667
To combine them
df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()
It looks like something's a little off with the window shape.

Related

Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)

I have a dataframe with one column containing numbers (quantity). Every row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that calculates a cumulative sum of the quantity column, but whenever the cumulative sum becomes greater than 0, the next row should start counting the cumulative sum from 0 again.
I solved this problem using iterrows(), but I read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution below:
import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        if previous_outcome > 0:
            previous_outcome = 0
        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
Is there faster (more optimized way) to calculate this?
I'm also not sure whether the "if index == 0" block is the best solution and whether this can be solved in a more elegant way. Without this block there is an error, since in the first row there is no "previous row" for the calculation.
Iterating over DataFrame rows is very slow and should be avoided. Working with whole chunks of data at once is the way to go with pandas.
For your case, treating your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:
import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y
print(df)
Out :
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
If you still need more speed, I suggest having a look at numba, as per jezrael's answer.
Edit - Performance test
I got curious about performance, so I put together the module below with all 3 approaches.
I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np
def pditerrows(df):
"""Iterate over DataFrame using `iterrows`"""
for index, row in df.iterrows():
if index == 0:
df.loc[index, 'outcome'] = df.loc[index, 'quantity']
else:
previous_outcome = df.loc[index-1, 'outcome']
if previous_outcome > 0:
previous_outcome = 0
df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
return df
def nparray(df):
"""Convert DataFrame column to `numpy` arrays."""
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
total += xi
y[i] = total
total = total if total < 0 else 0
df['outcome'] = y
return df
#njit
def f(x, lim):
result = np.empty(len(x))
result[0] = x[0]
for i, j in enumerate(x[1:], 1):
previous_outcome = result[i-1]
if previous_outcome > lim:
previous_outcome = 0
result[i] = previous_outcome + x[i]
return result
def numbaloop(df):
"""Convert DataFrame to `numpy` arrays and loop using `numba`.
See [https://stackoverflow.com/a/69750009/5069105]
"""
df['outcome'] = f(df.quantity.to_numpy(), 0)
return df
def create_df(size):
"""Create a DataFrame filed with -1's and 15's, with 90% of
the entries equal to -1 and 10% equal to 15, randomly
placed in the array.
"""
df = pd.DataFrame(
np.random.choice(
(-1, 15),
size=size,
p=[0.9, 0.1]
),
columns=['quantity'])
return df
# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running for a somewhat small array, size = 20_000, leads to:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.
And for larger arrays, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Making numba again 25+ times faster than numpy arrays for this example, and confirming that you should avoid using iterrows() at all costs for anything more than a couple of hundred rows.
I think numba is the best option when working with loops, if performance is important:
@njit
def f(x, lim):
    result = np.empty(len(x), dtype=np.int64)  # np.int is deprecated; use a concrete dtype
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
quantity outcome outcome1
0 -1 -1.0 -1
1 -1 -2.0 -2
2 -1 -3.0 -3
3 -1 -4.0 -4
4 15 11.0 11
5 -1 -1.0 -1
6 -1 -2.0 -2
7 -1 -3.0 -3
8 -1 -4.0 -4
9 5 1.0 1
10 -1 -1.0 -1
11 15 14.0 14
12 -1 -1.0 -1
13 -1 -2.0 -2
14 -1 -3.0 -3

Pandas - New column based on the value of another column N rows back, when N is stored in a column

I have a pandas dataframe with example data:
idx price lookback
0 5
1 7 1
2 4 2
3 3 1
4 7 3
5 6 1
Lookback can be positive or negative but I want to take the absolute value of it for how many rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious way, this did not work:
df['sbc'] = df['price'].shift(dataframe['lb'].abs() + 1)
I then tried using a lambda, this did not work but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x+1)]))
df['sbc'] = sbc(df['price'], df['lb'].abs())
I also tried a loop (which was extremely slow, but worked) but I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) < i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several provided solutions worked great (thank you!), but all needed some small tweaks to deal with my potential for negative numbers and the fact that it is lookback + 1, not - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop, which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, as improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
        else:
            return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
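Putting the fastest of these together with the streak-roc formula from the question, a combined sketch (added for illustration; the helper name is made up, and it assumes a default 0..n-1 integer index with 'price' and 'lookback' columns) might look like:
import numpy as np
import pandas as pd

def add_lb_price_and_roc(df):
    """Sketch: vectorized lookback lookup plus the streak-roc column
    described in the question. Assumes a 0..n-1 RangeIndex."""
    lookup_idxs = df.index.values - (np.abs(df['lookback'].values) + 1)
    valid = lookup_idxs >= 0  # NaN lookbacks also fail this test and stay NaN
    df['lb_price'] = np.nan
    df.loc[valid, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid].astype(int)]
    df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
    return df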
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, which I store in an array.
The rule is that the lookback value has to be a number and that the wanted index is not smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values

for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan

df['lb_price'] = new

Rolling average with window size an interval of column values

I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np

df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
])

averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what the DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering if one could go faster for the case when the rolling involves large windows. Turns out I should have measured Pandas's rolling mean and sum performance first... (Premature optimization, anyone?) See at the end why.
Anyway, the idea was to do a cumsum just once, then take the difference of elements dereferenced by the windows endpoints:
# both below working on numpy arrays:

def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and Series.rolling do allow value-based windows if the index is of type DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging pandas datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetimelike and offset based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
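As a rough illustration of that "comment or two", a small wrapper around the same trick might look like this (a sketch only; the helper name and signature are made up here, and it assumes the frame is sorted by the distance column):
import pandas as pd

def interval_rolling_mean(df, value_col, by_col, window="1s"):
    """Trailing mean of value_col over an interval of by_col.

    Hypothetical wrapper around the TimedeltaIndex trick above: by_col
    (e.g. miles) is scaled so one unit becomes one second, letting the
    time-based rolling window do the interval logic. The window is
    trailing, not centered, and df must be sorted by by_col.
    """
    indexed = df.set_index(pd.TimedeltaIndex(df[by_col] * 1e9))  # 1 unit -> 1 s
    rolled = indexed[value_col].rolling(window).mean()
    return rolled.reset_index(drop=True)

# usage with the question's unnamed columns: miles in column 0, values in column 1
# df['rolling_mean'] = interval_rolling_mean(df, 1, 0, window="1s")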

How do you apply a function on a dataframe column using data from previous rows?

I have a Dataframe which has three columns: nums with some values to work with, b which is always either 1 or 0 and the result column which is currently zero everywhere except in the first row (because we must have an initial value to work with).
The dataframe looks like this:
nums b result
0 20.0 1 20.0
1 22.0 0 0
2 30.0 1 0
3 29.1 1 0
4 20.0 0 0
...
The Problem
I'd like to go over each row in the dataframe starting with the second row, do some calculation and store the result in the result column. Since I'm working with large files, I need a way to make this operation fast so that's why I want something like apply.
The calculation I want to do is to take the value in nums and in result from the previous row, and if in the current row the b column is 0, then I want (for example) to add the num and the result from that previous row. If b in that row is 1, I'd like to subtract them, for example.
What have I tried?
I tried using apply but I couldn't access the previous row, and sadly it seems that even if I do manage to access the previous row, the dataframe won't update the result column until the end.
I also tried using a loop like the one below, but it's too slow for the large files I'm working with:
for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
some_calc_func looks like this (just a general example):
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17
Please answer with respect to some_calc_func
If you want to keep the function some_calc_func and not use another library, you should not try to access each element at each iteration. Instead, you can zip the column nums with the column b shifted by one row (since you access nums from the previous row) and keep prev_res in memory at each iteration. Also, append to a list instead of to the dataframe, and assign the list to the column after the loop.
prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]              # initialize the list of results

# loop with zip to get both values at the same time;
# use loc to start b at the second row but not nums
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add to the list of results
    l_res.append(prev_res)

# assign to the column
df['result'] = l_res

print(df)  # same result as with your method
nums b result
0 20.0 1 20.0
1 22.0 0 37.0
2 30.0 1 407.0
3 29.1 1 6105.0
4 20.0 0 46.1
Now with a dataframe df of 5000 rows, I got:
%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and with your original solution, it was ~750x slower
%%timeit
for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
# 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT with another library called numba, if the function some_calc_func can easily be used with the numba decorator:
from numba import jit

# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results and initialize
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop on the length of arr_b
    for i in range(len(arr_b)):
        # do the calculation and set the value in the result array
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i+1] = prev_res
    return arr_res
Finally, call it like:
df['result'] = with_numba(df.loc[0, 'result'],
                          df['nums'].to_numpy(),
                          df.loc[1:, 'b'].to_numpy())
And with a timeit, I get another ~9x speedup over my method with zip, and the speedup could increase with the size:
%timeit df['result'] = with_numba(df.loc[0, 'result'],
                                  df['nums'].to_numpy(),
                                  df.loc[1:, 'b'].to_numpy())
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note using Numba might be problematic depending on your actual some_calc_func
IIUC:
>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
).fillna(df.result).cumsum()
>>> df
nums b result
0 20.0 1 20.0
1 22.0 0 42.0
2 30.0 1 12.0
3 29.1 1 -17.1
4 20.0 0 2.9
Explanation:
# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1 1
2 -1
3 -1
4 1
Name: b, dtype: int64
# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0 NaN
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0 20.0
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0 20.0
1 42.0
2 12.0
3 -17.1
4 2.9
dtype: float64
According to your requirement in the comments, I cannot think of a way without loops:
c1, c2 = 2, 1
l = [df.loc[0, 'result']]  # store the first result in a list

# then loop over the series (df.b * df.nums)
for i, val in (df.b * df.nums).items():  # iteritems() is deprecated in newer pandas
    if i:                  # except for 0th index
        if val == 0:       # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])  # append the last result
        else:              # otherwise apply the rule
            t = l[-1] * c2 + val * c1
            l.append(t)
>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]
>>> df['result'] = l
nums b result
0 20.0 1 20.0
1 22.0 0 20.0
2 30.0 1 80.0 # [ 20 * 1 + 30 * 2]
3 29.1 1 138.2 # [ 80 * 1 + 29.1 * 2]
4 20.0 0 138.2
Seems fast enough, did not test for large sample.
You have a function f(...) to apply, but you cannot apply it directly because you need to keep a memory of the previous row. You can do this either with a closure or a class. Below is a class implementation:
import pandas as pd

class Func():
    def __init__(self, value):
        self._prev = value
        self._init = True

    def __call__(self, x):
        if self._init:
            res = self._prev
            self._init = False
        elif x.b == 0:
            res = x.nums - self._prev
        else:
            res = x.nums + self._prev
        self._prev = res
        return res

# df = pd.read_clipboard()
f = Func(20)
df['result'] = df.apply(f, axis=1)
You can replace the __call__ with whatever you want in some_calc_func body.
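For instance, a variant that plugs the question's some_calc_func into the same pattern (a sketch only; the class name and the extra _prev_num attribute are additions for illustration) could look like:
class CalcFunc():
    """Stateful callable: remembers the previous row's result and nums so
    that some_calc_func from the question can be applied row by row."""
    def __init__(self, first_result):
        self._prev_result = first_result
        self._prev_num = None
        self._init = True

    def __call__(self, row):
        if self._init:
            res = self._prev_result
            self._init = False
        else:
            res = some_calc_func(self._prev_result, self._prev_num, row.b)
        self._prev_result = res
        self._prev_num = row.nums
        return res

df['result'] = df.apply(CalcFunc(20), axis=1)
# expected: 20.0, 37.0, 407.0, 6105.0, 46.1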
I realize this is what @Prodipta's answer was getting at, but this approach uses the global keyword instead to remember the previous result at each iteration of apply:
prev_result = 20

def my_calc(row):
    global prev_result
    i = int(row.name)  # the index of the current row
    if i == 0:
        return prev_result
    elif row['b'] == 1:
        out = prev_result * df.loc[i-1, 'nums'] / 2  # loc to get prev_num
    else:
        out = df.loc[i-1, 'nums'] + 17
    prev_result = out
    return out

df['result'] = df.apply(my_calc, axis=1)
Result for your example data:
nums b result
0 20.0 1 20.0
1 22.0 0 37.0
2 30.0 1 407.0
3 29.1 1 6105.0
4 20.0 0 46.1
And here's a speed test a la @Ben T's answer - not the best but not the worst?
In[0]
df = pd.DataFrame({'nums': np.random.randint(0, 100, 5000),
                   'b': np.random.choice([0, 1], 5000)})
prev_result = 20

%%timeit
df['result'] = df.apply(my_calc, axis=1)

Out[0]
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Re-using your loop and some_calc_func
I am using your loop and have reduced it to a bare minimum, as below:
for i in range(1, len(df)):
    df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])
and some_calc_func is implemented as below:
def some_calc_func(bval, prev_result, curr_num):
    if bval == 0:
        return prev_result + curr_num
    else:
        return prev_result - curr_num
The result is as below
nums b result
0 20.0 1 20.0
1 22.0 0 42.0
2 30.0 1 12.0
3 29.1 1 -17.1
4 20.0 0 2.9

avoid repetitive operations with Pandas

Derived from another question, here
I have a 2-million-row DataFrame, something similar to this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp': [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
})
I need to run calculations with the values on each row and append the results as new columns, something similar to the question in this link.
I know that there are a lot of combinations of speed, temp, and temp2 that are repeated; if I drop_duplicates, the resulting DataFrame is only 50k rows long, which takes significantly less time to process, using an apply function like this:
def dafunc(row):
    # k1 and k2 are constants (see the edit below: k1 = 0.5, k2 = 1)
    row['r1'] = row['speed'] * row['temp'] * k1
    row['r2'] = row['speed'] * row['temp2'] * k2
    return row

nodup_df = final_df.drop_duplicates(['speed', 'temp', 'temp2'])
nodup_df = nodup_df.apply(dafunc, axis=1)
The above code is a super simplified version of what I actually do.
So far I'm trying to use a dictionary where I store the results, with a string formed from the combinations as the key; if the dictionary already has those results, I get them instead of doing the calculations again.
Is there a more efficient way to do this using Pandas' vectorized operations?
EDIT:
In the end, the resulting DataFrame should look like this:
#assuming k1 = 0.5, k2 = 1
resulting_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp': [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
    'r1': [22.5, 16, 3.5, 16, 3.5, 16],
    'r2': [10, 8, 7, 8, 7, 8],
})
Well, if you can access the columns from a numpy array based on the column index, it would be a lot faster, i.e.
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
If you want to create multiple columns at once, you can use a for loop for that, and the speed will be similar:
k = [0.5, 1]
for i in range(1, 3):
    final_df['r'+str(i)] = final_df.values[:,0]*final_df.values[:,i]*k[i-1]
If you drop duplicates it will be much faster.
Output:
speed temp temp2 ts r1 r2
0 5 9 2 0 22.5 10.0
1 4 8 2 1 16.0 8.0
2 1 7 7 2 3.5 7.0
3 4 8 2 3 16.0 8.0
4 1 7 7 4 3.5 7.0
5 4 8 2 5 16.0 8.0
For a small dataframe:
%%timeit
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
1000 loops, best of 3: 708 µs per loop
For a large dataframe:
%%timeit
ndf = pd.concat([final_df]*10000)
ndf['r1'] = ndf.values[:,0]*ndf.values[:,1]*k1
ndf['r2'] = ndf.values[:,0]*ndf.values[:,2]*k2
1 loop, best of 3: 6.19 ms per loop
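If the duplicate-heavy structure matters, one way to combine drop_duplicates with the vectorized computation and map the results back onto the full frame is a merge. A sketch (added for illustration, assuming k1 = 0.5 and k2 = 1 as in the question's edit):
import pandas as pd

k1, k2 = 0.5, 1

# compute once per unique (speed, temp, temp2) combination...
nodup = final_df.drop_duplicates(['speed', 'temp', 'temp2']).copy()
nodup['r1'] = nodup['speed'] * nodup['temp'] * k1
nodup['r2'] = nodup['speed'] * nodup['temp2'] * k2

# ...then broadcast the results back onto the full frame with a left merge
result = final_df.merge(
    nodup[['speed', 'temp', 'temp2', 'r1', 'r2']],
    on=['speed', 'temp', 'temp2'],
    how='left',
)
# result matches the resulting_df shown in the question's edit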
