With numba, axis=0 is an acceptable parameter for np.sum(), but not for np.diff(). Why is this happening? I'm working with 2D arrays, so I need to specify the axis.
import numpy as np
from numba import jit

@jit(nopython=True)
def jitsum(y):
    return np.sum(y, axis=0)

@jit(nopython=True)
def jitdiff(y):  # this one will cause an error
    return np.diff(y, axis=0)
Error: np_diff_impl() got an unexpected keyword argument 'axis'
A workaround in 2D is:
@jit(nopython=True)
def jitdiff(y):
    return np.diff(y.T).T
np.diff on a 2D array with n=1, axis=1 is just
a[:, 1:] - a[:, :-1]
For axis=0:
a[1:, :] - a[:-1, :]
I suspect that the lines above will compile just fine with numba.
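A minimal sketch to check this (assuming numba is installed; the names jitdiff0/jitdiff1 are mine):
import numpy as np
from numba import jit

@jit(nopython=True)
def jitdiff0(a):
    # 1st order difference along axis=0, via slicing
    return a[1:, :] - a[:-1, :]

@jit(nopython=True)
def jitdiff1(a):
    # 1st order difference along axis=1, via slicing
    return a[:, 1:] - a[:, :-1]

y = np.arange(12.0).reshape(3, 4)
print(np.array_equal(jitdiff0(y), np.diff(y, axis=0)))  # True
print(np.array_equal(jitdiff1(y), np.diff(y, axis=1)))  # True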
import numpy as np

def sum(y):  # note: this definition shadows the built-in sum()
    a = np.sum(y, axis=0)
    b = np.sum(y, axis=1)
    print("Sum along the rows (axis=0):", a)
    print("Sum along the columns (axis=1):", b)

def diff_order1(y):
    a = np.diff(y, axis=0, n=1)
    b = np.diff(y, axis=1, n=1)  # n=1 indicates 1st order difference
    print("1st order difference along the rows (axis=0):", a)
    print("1st order difference along the columns (axis=1):", b)

def diff_order2(y):
    a = np.diff(y, axis=0, n=2)
    b = np.diff(y, axis=1, n=2)  # n=2 indicates 2nd order difference
    print("2nd order difference along the rows (axis=0):", a)
    print("2nd order difference along the columns (axis=1):", b)
This function is another way to solve the problem, calling np.diff twice for the 2nd order difference:
def diff_order2_v2(y):
    a = np.diff(np.diff(y, axis=0), axis=0)
    b = np.diff(np.diff(y, axis=1), axis=1)
    print("2nd order difference along the rows (axis=0):", a)
    print("2nd order difference along the columns (axis=1):", b)
Try running this code; it defines functions wrapping np.sum and np.diff for the 1st and 2nd order differences, with a usage example below.
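For example, a quick run on a small 2D array (the values below are just an illustration; the functions are the ones defined above):
import numpy as np

y = np.array([[1, 2, 4],
              [7, 11, 16],
              [22, 29, 37]])

sum(y)            # calls the sum() defined above, not the built-in
diff_order1(y)
diff_order2(y)
diff_order2_v2(y)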
I need to calculate a metric using a sliding window over a DataFrame. If the metric needed just one column, I'd use rolling, but somehow it does not work with two or more columns.
Below is how I calculate the metric using a regular loop.
def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)

def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window + 1):
            slice_idx = df_.index[i: i + window - 1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data=result, index=df_.index)
df = pd.DataFrame(data=(np.random.rand(1000, 2)*10).round(2), columns = ['y_true', 'y_pred'] )
%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
This takes close to a second for just 1000 rows.
Please suggest faster vectorized way to calculate such metric over sliding window.
In this specific case:
You can calculate the squared error beforehand and then use .rolling().mean():
df['sq_error'] = (df['y_true'] - df['y_pred'])**2
%time df['sq_error'].rolling(6).mean().dropna()
Please note that in your example the actual window size is 6 (print the slice length); that's why I set it to 6 in my snippet.
You can even write it like this:
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()
In general:
In case you cannot reduce it to a single column, as of pandas 1.3.0 you can use the method='table' parameter to apply the function to the entire DataFrame. This, however, has the following requirements:
This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.
Therefore, your computation could be something like this:
WIN_LEN = 6

def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1])**2)

df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()
Because it uses numba, this is also relatively fast.
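Another route, staying in plain numpy (assuming numpy >= 1.20): sliding_window_view builds a view of every window at once, so the whole metric becomes one vectorized expression. A sketch:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

WIN_LEN = 6

df = pd.DataFrame((np.random.rand(1000, 2) * 10).round(2),
                  columns=['y_true', 'y_pred'])

# one row per window, shape (len(df) - WIN_LEN + 1, WIN_LEN)
sq_err = (df['y_true'] - df['y_pred']).to_numpy() ** 2
windows = sliding_window_view(sq_err, WIN_LEN)
mse = windows.mean(axis=1)

# pad the first WIN_LEN - 1 positions, which have no full window yet
result = pd.Series(np.concatenate([np.full(WIN_LEN - 1, np.nan), mse]),
                   index=df.index)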
My question pertains to array iteration but is a bit more complicated. You see, I have an array with a shape of (4, 50). What I want to do is find the mean of each row. I will show a simple example of what I mean:
A = np.array([[10,5,3],[12,6,6],[9,8,7],[20,3,4]])
When this code is run, you get an array with a shape of (4, 3). What I want is for the mean of each row to be found and returned.
The returned result should be an array ([[6], [8], [8], [9]]) with the same number of rows and, naturally, a single column.
Please explain the code and thought process behind it. Thank you very much.
Use the numpy.mean function. The parameter axis=1 means that the row-wise mean will be calculated, and keepdims=True means that the original array dimensions are kept.
import numpy as np
A = np.array([[10,5,3],[12,6,6],[9,8,7],[20,3,4]])
B = np.mean(A, axis=1, keepdims=True)
print(B)
# Output:
# [[6.]
# [8.]
# [8.]
# [9.]]
Use np.mean in a list comprehension and reshape into a new array:
import numpy as np

A = np.array([[10,5,3],[12,6,6],[9,8,7],[20,3,4]])
# Use .reshape() to get 4 rows by 1 column.
new_A = np.array([np.mean(row) for row in A]).reshape(-1, 1)
Output:
array([[6.], [8.], [8.], [9.]])
I want to avoid apply() and instead vectorize my data processing.
I have a function that buckets data based on few "if" and "else" conditions. How do I pass data to this function?
def my_function(id):
    if 0 <= id <= 30000:
        cal_score = 5
    else:
        cal_score = 0
    return cal_score
apply() works; it loops through every row.
But apply() is slow on a huge dataset (my scenario):
df['final_score'] = df.apply(lambda x : my_function(x['id']), axis = 1)
Passing a numpy array does not work!!
df['final_score'] = my_function(df['id'].values)
ERROR: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
It doesn't like the entire array being passed, since the "if" comparison in my function is ambiguous for more than one element.
I want to update my final_score column based on the id values, but by passing the entire array.
How do I design or address this?
Use Series.between to create your condition, then multiply the resulting boolean mask by 5.
df['final_score'] = df['id'].between(0, 30000, inclusive=True) * 5
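If the real bucketing has more branches than the single cutoff above, np.select generalizes the same masking idea; a sketch with made-up extra cutoffs and scores:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [-5, 100, 25000, 45000]})

conditions = [
    df['id'].between(0, 30000),   # the original bucket
    df['id'] > 30000,             # hypothetical extra bucket
]
scores = [5, 2]                   # 2 is a made-up score

# first matching condition wins; default covers everything else
df['final_score'] = np.select(conditions, scores, default=0)
print(df)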
It's easy:
Convert the input Series to a numpy array via .values (note that the input is the id column, not final_score):
n_a = df['id'].values
Vectorize your function:
vfunc = np.vectorize(my_function)
Calculate the result array using the vectorized function and assign it back:
res_array = vfunc(n_a)
df['final_score'] = res_array
Check https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html for more details
Calculations over a pd.Series converted to a numpy array can be around 10x faster than row-wise pandas apply(). Note, though, that np.vectorize is essentially a Python loop under the hood; it is a convenience, not true vectorization.
I need to resample some data with numpy's weighted average function, and it just doesn't work.
This is my test-case:
import datetime
import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007, 1, 1, 0, 0),
            datetime.datetime(2007, 1, 1, 0, 1),
            datetime.datetime(2007, 1, 1, 0, 5),
            datetime.datetime(2007, 1, 1, 0, 8),
            datetime.datetime(2007, 1, 1, 0, 10)]
df = pd.DataFrame([2, 3, 1, 7, 4], index=time_vec)
A normal resampling without weights works fine (using a lambda function as the how parameter, as suggested here: Pandas resampling using numpy percentile? Thanks!):
df.resample('5min',how = lambda x: np.average(x[0]))
But if I try to use some weights, it always returns a TypeError: Axis must be specified when shapes of a and weights differ:
df.resample('5min',how = lambda x: np.average(x[0],weights = [1,2,3,4,5]))
I tried this with many different numbers of weights, but it did not get better:
for i in range(20):
    try:
        print(range(i))
        print(df.resample('5min', how=lambda x: np.average(x[0], weights=range(i))))
        print(i)
        break
    except TypeError:
        print(i, 'typeError')
I'd be glad about any suggestions.
The short answer here is that the weights in your lambda need to be created dynamically based on the length of the series that is being averaged. In addition, you need to be careful about the types of objects that you're manipulating.
The code that I got to compute what I think you're trying to do is as follows:
df.resample('5min', how=lambda x: np.average(x, weights=1+np.arange(len(x))))
There are two differences compared with the line that was giving you problems:
x[0] is now just x. The x object in the lambda is a pd.Series, and so x[0] gives just the first value in the series. This was working without raising an exception in the first example (without the weights) because np.average(c) just returns c when c is a scalar. But I think it was actually computing incorrect averages even in that case, because each of the sampled subsets was just returning its first value as the "average".
The weights are created dynamically based on the length of data in the Series being resampled. You need to do this because the x in your lambda might be a Series of different length for each time interval being computed.
The way I figured this out was through some simple type debugging, by replacing the lambda with a proper function definition:
def avg(x):
    print(type(x), x.shape, type(x[0]))
    return np.average(x, weights=np.arange(1, 1 + len(x)))

df.resample('5Min', how=avg)
This let me have a look at what was happening with the x variable. Hope that helps!
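For readers on current pandas: the how= argument has since been removed, and the equivalent call goes through .apply() on the resampler. A self-contained version of the fix above:
import datetime
import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007, 1, 1, 0, m) for m in (0, 1, 5, 8, 10)]
df = pd.DataFrame([2, 3, 1, 7, 4], index=time_vec)

# same dynamic weighting as above, written against the modern resample API
print(df.resample('5min').apply(
    lambda x: np.average(x, weights=1 + np.arange(len(x)))))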
I want to compute the integral image. For example:
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6)])
b = a.cumsum(axis=0)
This generates another array, b. Can I execute the cumsum in place? If not, are there any other methods to do that?
You have to pass the argument out:
np.cumsum(a, axis=1, out=a)
Note: your array is actually 2-D, so you can use axis=0 to accumulate down each column or axis=1 to accumulate across each row.
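A quick, self-contained check that the accumulation really happens in place (using the question's axis=0; axis=1 works the same way):
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6)])
out = np.cumsum(a, axis=0, out=a)
print(out is a)  # True: no new array was allocated
print(a)
# [[1 2 3]
#  [5 7 9]]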
Try this using numpy directly, numpy.cumsum(a):
import numpy as np

a = np.array([(1, 2, 3)])
b = np.cumsum(a)  # without an axis argument, cumsum flattens the input
print(b)
# [1 3 6]