I need to calculate a metric over a sliding window of a dataframe. If the metric needed just one column, I'd use rolling, but somehow that does not work with two or more columns.
Below is how I calculate the metric using a regular loop.
import numpy as np
import pandas as pd

def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)

def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window + 1):
            slice_idx = df_.index[i: i + window - 1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data=result, index=df_.index)
df = pd.DataFrame(data=(np.random.rand(1000, 2) * 10).round(2), columns=['y_true', 'y_pred'])
%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
This takes close to a second for just 1000 rows.
Please suggest a faster, vectorized way to calculate such a metric over a sliding window.
In this specific case:
You can calculate the squared error beforehand and then use .rolling().mean():
df['sq_error'] = (df['y_true'] - df['y_pred'])**2
%time df['sq_error'].rolling(6).mean().dropna()
Please note that in your example the actual window size is 6 (print the slice length): the slice df_.index[i: i + window - 1] contains only window - 1 labels. That's why I set the window to 6 in my snippet.
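A quick sanity check of that off-by-one, using the same indexing as the loop above:
print(len(df.index[0:0 + 7 - 1]))  # prints 6, not 7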
You can even write it like this:
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()
In general:
In case you cannot reduce it to a single column, as of pandas 1.3.0 you can use the method='table' parameter to apply the function to the entire DataFrame. This, however, has the following requirements:
This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.
Therefore, your computation could be something like this:
WIN_LEN = 6

def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1]) ** 2)

df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()
Because it uses numba, this is also relatively fast.
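For a rough feel of the relative speeds you could time the three approaches on the same frame; the numbers will vary by machine, and the first numba call also pays a one-off JIT compilation cost:
%time rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean()
%time df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True)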
I have a groupby question that I can't solve. It is probably simple, but I can't get it to work nicely. I am trying to compute some statistics on a variable with pandas groupby chained with the very handy agg function. I would like to add to the list below a calculation of the number of values above a given threshold.
df = df.groupby(['scenario','Name','year','month'])["Value"].agg([np.min,np.max,np.mean,np.std])
Usually, I compute the number of values above a given threshold as shown below, but I can't find a way to add this to the aggregation function. Do you know how I could do that?
df = df[df > 0].groupby(['scenario','Name','year','month']).count()
Your answer works. Alternatively, you could keep it all on one line, without creating a separate function, by using a lambda instead.
df = df.groupby(["scenario", "Name", "year", "month"])["Value"].agg([np.min, np.max, np.mean, np.std, lambda x: ((x > 0)*1).sum()])
The logic here: (x > 0) returns a True/False boolean Series; * 1 turns each bool into an integer (True = 1, False = 0); .sum() adds up all the 1s and 0s within the group, and since the True values count as 1, the sum is the number of values greater than 0.
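A tiny standalone demo of that counting trick on a made-up Series (assuming pandas is imported as pd):
s = pd.Series([-1.5, 0.2, 3.0, 0.0])
print(((s > 0) * 1).sum())  # 2, because two values are greater than 0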
Running a quick test on the time taken, your solution is faster, but I thought I would give an alternative solution anyway.
I found a solution by creating a function and passing it to the agg function.
def counta(x):
    m = np.count_nonzero(x > 10)
    return m

df = df.groupby(['scenario','Name','year','month'])["Value"].agg([np.min,np.max,np.mean,np.std,counta])
I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: ...) in Python.
The custom function I have created works on its own, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First, the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function:
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis=0)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending the results to sd under the column ['s/n'], but the only output produced is NaN.
I have also tried a couple of structural changes, like using applymap() instead of apply():
sd['s/n'] = df.applymap(lambda x: snr_pd(x), na_action='ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
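A hedged sketch of how that could look (select_dtypes is only needed if df has non-numeric columns; snr_pd and df are the names from the question):
numeric = df.select_dtypes(include='number')
snr_per_column = np.apply_along_axis(snr_pd, axis=0, arr=numeric.to_numpy())
print(snr_per_column)  # one signal-to-noise value per column, as a plain numpy array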
I have a dataframe where the first row is the initial condition.
df = pd.DataFrame({"Year": np.arange(4),
"Pop": [0.4] + [np.nan]* 3})
and a function f(x,r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.
I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively. I.e., df.Pop[i] = f(df.Pop[i-1], r=2)
df = pd.DataFrame({"Year": np.arange(4),
"Pop": [0.4, 0.48, 4992, 0.49999872]})
Question: Is it possible to do this in a vectorized way?
I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
I have also tried this, but all nan places are filled with 0.48.
df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])
It is IMPOSSIBLE to do this in a vectorized way.
By definition, vectorization makes use of parallel processing to reduce execution time. But the desired values in your question must be computed in sequential order, not in parallel. See this answer for detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.
def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

# execute
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
Result:
print(df)
   Year       Pop
0     0  0.400000
1     1  0.480000
2     2  0.499200
3     3  0.499999
It is completely OK to stop here for small-sized data. If the function is going to be performed a lot of times, however, you can consider optimizing the generator with numba.
pip install numba or conda install numba in the console first
import numba
Add the decorator @numba.njit in front of the generator (a minimal sketch follows).
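A minimal sketch of the decorated generator, assuming numba is installed (numba supports generator functions in nopython mode):
import numba

@numba.njit
def gen_numba(x_init, n, R=2.0):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

df.loc[1:, "Pop"] = list(gen_numba(df.at[0, "Pop"], len(df) - 1))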
Change the number of np.nans to 10^6 and check out the difference in execution time yourself. An improvement from 468ms to 217ms was achieved on my Core-i5 8250U 64bit laptop.
I'm trying to calculate Welles Wilder's type of moving average in a pandas dataframe (also called a cumulative moving average).
The method to calculate the Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set as the mean for the 'n' position.
For the following values, use the previous mean weighted by (n-1) and the current value of the series weighted by 1, and divide the total by 'n'.
My question is: how to implement this in a vectorized way?
I tried to do it by iterating over the dataframe (which, from what I read, isn't recommended because it is slow). It works and the values are correct, but I get a warning
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np

# Building a random sample:
datas = pd.date_range('2020-01-01', '2020-01-31')
np.random.seed(693)
A = np.random.randint(40, 60, size=(31, 1))
df = pd.DataFrame(A, index=datas, columns=['A'])

period = 12  # Main parameter
initial_mean = A[0:period].mean()  # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean
for x in range(period, size):
    df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period)  # Equation for the following values.
print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
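Rearranging the Wilder update shows why alpha=1.0/period reproduces exactly that recursion:
((n-1)*B[i-1] + A[i]) / n = (1 - 1/n)*B[i-1] + (1/n)*A[i]
which is the quoted ewm formula with alpha = 1/n, i.e. alpha=1.0/period.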
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
)
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with floating-point numbers), so this is equivalent to your calculation using a loop.
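A quick way to confirm that, using the column names defined above:
print((df['B'] - df['C']).abs().max())  # a tiny number, i.e. pure floating-point rounding noise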
You might want to consider skipping the direct average of the first period items and simply applying ewm() from the start, which will treat the first row as the previous average in the first calculation. The results will be slightly different, but once you've gone through a couple of periods those initial values will hardly influence the results.
That would be a much simpler calculation:
df['D'] = df['A'].ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
I want to avoid apply() and instead vectorize my data processing.
I have a function that buckets data based on a few "if" and "else" conditions. How do I pass data to this function?
def my_function(id):
    if 0 <= id <= 30000:
        cal_score = 5
    else:
        cal_score = 0
    return cal_score
apply() works; it loops through every row.
But apply() is slow on a huge set of data (my scenario).
df['final_score'] = df.apply(lambda x: my_function(x['id']), axis=1)
Passing a numpy array does not work!!
df['final_score'] = my_function(df['id'].values)
ERROR: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
It doesn't like the entire array being passed, since the "if" statement in my function errors out when given more than one element.
I want to update my final_score column based on ID values but by passing an entire array.
How do I design or address this?
Use Series.between to create your condition, multiply the resultant mask by 5.
df['final_score'] = df['id'].between(0, 30000, inclusive=True) * 5
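A toy check of the mask-times-5 idea (inclusive=True is the older spelling; recent pandas versions use inclusive='both'):
toy = pd.DataFrame({'id': [-5, 0, 15000, 30000, 40000]})
toy['final_score'] = toy['id'].between(0, 30000) * 5
print(toy)  # rows with id between 0 and 30000 (inclusive) get 5, the others 0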
It's easy:
Convert Series to numpy array via '.values'
n_a = df['id'].values
Vectorize your function
vfunc = np.vectorize(my_function)
Calculate the result array using vectorized function:
res_array = vfunc(n_a)
df['final_score'] = res_array
Check https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html for more details
Vectorized calculations over a pd.Series converted to a numpy array can be 10x faster than using internal pandas calculations.