I want to avoid apply() and instead vectorize my data processing.
I have a function that buckets data based on a few "if" and "else" conditions. How do I pass data to this function?
def my_function(id):
    if 0 <= id <= 30000:
        cal_score = 5
    else:
        cal_score = 0
    return cal_score
apply() works, but it loops through every row.
However, apply() is slow on a huge data set, which is my scenario.
df['final_score'] = df.apply(lambda x : my_function(x['id']), axis = 1)
Passing a numpy array does not work:
df['final_score'] = my_function(df['id'].values)
ERROR: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
The function fails because the "if" statement receives the entire array, and the truth value of an array with more than one element is ambiguous.
I want to update my final_score column based on the id values, but by passing an entire array.
How do I design or address this?
Use Series.between (inclusive on both ends by default) to create the condition, then multiply the resulting boolean mask by 5.
df['final_score'] = df['id'].between(0, 30000) * 5
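As a minimal sketch of this idea on made-up data (the sample ids below are assumptions, not taken from the question), the mask-times-constant form and an equivalent np.where form produce the same column:

```python
import numpy as np
import pandas as pd

# Made-up ids covering both sides of each boundary
df = pd.DataFrame({"id": [-5, 0, 15000, 30000, 30001]})

# Boolean mask: True where 0 <= id <= 30000 (both ends inclusive)
in_range = df["id"].between(0, 30000)

# True * 5 == 5 and False * 5 == 0, so the mask becomes the score
df["final_score"] = in_range * 5

# Equivalent spelling that names both outcomes explicitly
df["final_score_where"] = np.where(in_range, 5, 0)
```

The np.where form is easier to adapt when the two outcomes are not simply a constant and zero.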
It's easy:
Convert the id Series to a numpy array via .values
n_a = df['id'].values
Vectorize your function
vfunc = np.vectorize(my_function)
Calculate the result array using vectorized function:
res_array = vfunc(n_a)
df['final_score'] = res_array
Check https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html for more details
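Vectorized calculations over a pd.Series converted to a numpy array can be 10x faster than using internal pandas calculations. Note, however, that the numpy documentation describes np.vectorize as provided primarily for convenience, not for performance: its implementation is essentially a for loop.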
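If the function grows to several "if"/"elif" branches, one option that stays fully array-based is np.select. This is a different technique from np.vectorize, and the second bucket and the sample ids below are hypothetical, added only to show the shape of the call:

```python
import numpy as np
import pandas as pd

# Made-up ids; only the 0..30000 bucket comes from the question
df = pd.DataFrame({"id": [-5, 0, 15000, 30000, 30001, 70000]})
ids = df["id"].to_numpy()

# One boolean array per branch, checked in order like if / elif
conditions = [
    (ids >= 0) & (ids <= 30000),
    (ids > 30000) & (ids <= 60000),  # hypothetical second bucket
]
scores = [5, 3]  # the value 3 is an assumption for illustration

# Rows matching no condition receive the default, like the final else
df["final_score"] = np.select(conditions, scores, default=0)
```

Each condition is evaluated once over the whole array, so no Python-level loop over rows is involved.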
Related
I need to calculate a metric using a sliding window over a dataframe. If the metric needed just one column, I'd use rolling, but somehow it does not work with 2+ columns.
Below is how I calculate the metric using a regular loop.
def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)
def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window + 1):
            slice_idx = df_.index[i: i+window-1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data = result, index = df_.index)
df = pd.DataFrame(data=(np.random.rand(1000, 2)*10).round(2), columns = ['y_true', 'y_pred'] )
%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
This takes close to a second for just 1000 rows.
Please suggest a faster, vectorized way to calculate such a metric over a sliding window.
In this specific case:
You can calculate the squared error beforehand and then use Rolling.mean():
df['sq_error'] = (df['y_true'] - df['y_pred'])**2
%time df['sq_error'].rolling(6).mean().dropna()
Please note that in your example the actual window size is 6 (print the slice length), that's why I set it to 6 in my snippet.
You can even write it like this:
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()
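To check that the single-column reduction matches a window-by-window computation, here is a small sketch that compares rolling(6).mean() with an explicit loop over 6-row slices. The 50-row frame and the seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame((rng.random((50, 2)) * 10).round(2), columns=["y_true", "y_pred"])
window = 6

# Vectorized: square the error once, then take a rolling mean
sq_error = (df["y_true"] - df["y_pred"]) ** 2
rolling_mse = sq_error.rolling(window).mean().dropna()

# Reference: recompute the mean squared error for every 6-row slice
loop_mse = [
    np.mean((df["y_true"].iloc[i:i + window] - df["y_pred"].iloc[i:i + window]) ** 2)
    for i in range(len(df) - window + 1)
]

matches = np.allclose(rolling_mse.to_numpy(), loop_mse)
```

Note that rolling() labels each value with the last row of its window, whereas the loop in the question stores it at the first row, so the values agree but the index positions differ.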
In general:
In case you cannot reduce it to a single column, as of pandas 1.3.0 you can use the method='table' parameter of rolling to apply the function to the entire DataFrame. This, however, has the following requirements:
This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.
Therefore, your computation could be something like this:
WIN_LEN = 6
def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1])**2)
df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()
Because it uses numba, this is also relatively fast.
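If numba is not available, one alternative (not part of the answer above) is numpy's sliding_window_view, which exists in numpy 1.20 and later. The sketch below assumes the same two-column layout and a made-up 50-row frame:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

WIN_LEN = 6
rng = np.random.default_rng(0)
df = pd.DataFrame((rng.random((50, 2)) * 10).round(2), columns=["y_true", "y_pred"])

# Shape (n_rows - WIN_LEN + 1, n_cols, WIN_LEN): one 2 x 6 block per window,
# created as a read-only view without copying the data
windows = sliding_window_view(df.to_numpy(), WIN_LEN, axis=0)

# Mean squared error of every window in a single array expression
mse = ((windows[:, 0, :] - windows[:, 1, :]) ** 2).mean(axis=1)

# Label each value with the last row of its window, as rolling() does
rolling_mse = pd.Series(mse, index=df.index[WIN_LEN - 1:])
```

Because the windows are a view, this works for any metric that can be written as an array reduction over the last axis, at the cost of materializing the intermediate difference array.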
I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: ...) in Python.
The custom function works on its own, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First is the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis =0,)
Currently I believe this takes the columns from df, passes them to snr_pd() and appends the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple of structural changes, like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I understand even less.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
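A minimal sketch of that call follows. Since the original data is not shown, everything here is an assumption: the wavenumber grid, the random intensities, the column names, and a reworked helper (called snr here) that takes the intensity column and the wavenumber array as separate arguments. A likely cause of the NaN values is index alignment, because df.apply(..., axis=0) returns a Series indexed by column names, and assigning it to sd['s/n'] aligns on sd's row index. The applymap error is consistent with the function receiving one scalar at a time, so the masks select nothing.

```python
import numpy as np
import pandas as pd

def snr(intensity, wavenumber, signal=(1650, 1750), noise=(1750, 1850)):
    """Hypothetical helper: max signal-band intensity over noise-band std."""
    signal_mask = (wavenumber >= signal[0]) & (wavenumber < signal[1])
    noise_mask = (wavenumber >= noise[0]) & (wavenumber < noise[1])
    return intensity[signal_mask].max() / intensity[noise_mask].std()

# Assumed layout: one row per wavenumber, one column per spectrum
wavenumber = np.arange(1600, 1900, 10.0)
rng = np.random.default_rng(0)
spectra = pd.DataFrame(rng.random((wavenumber.size, 3)) + 1.0, columns=["a", "b", "c"])

# axis=0 hands each column to snr; extra positional args are passed through
snr_values = np.apply_along_axis(snr, 0, spectra.to_numpy(), wavenumber)

# Re-attach the column labels, since apply_along_axis returns a bare array
result = pd.Series(snr_values, index=spectra.columns, name="s/n")
```

Building the result as its own Series keyed by column name avoids the alignment problem when it is later combined with another frame.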
I have approximately 100 numpy arrays, each with shape (100, 40000, 4). I want to concatenate these arrays along the first axis (axis=0) into one big array efficiently.
Approach 1
I used np.concatenate as shown below-
def concatenate(all_data):
    for index, data in enumerate(all_data):
        if index == 0:
            arr = data.copy()
        else:
            arr = np.concatenate((arr, data), axis=0)
    return arr
Approach 2
I created a Panel in pandas and then used pd.concat as shown below-
def concatenate(all_data):
    for index, data in enumerate(all_data):
        if index == 0:
            pn = pd.Panel(data)
        else:
            pn = pd.concat([pn, pd.Panel(data)])
    return pn # numpy array can be acquired from pn.values
The second approach seems faster than the first one. However, it shows a deprecation warning when creating pd.Panel.
I want to know if there is a better way to concatenate large 3-dimensional arrays in Python.
Calling np.concatenate() repeatedly is an anti-pattern. Instead, try this:
np.concatenate(all_data)
Simple, fast.
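The reason is that every np.concatenate call allocates a new array and copies all of its inputs, so the loop re-copies the growing result on each iteration. A small sketch with stand-in arrays (the shapes below are assumptions, scaled down from the question) shows that the single call gives the same result:

```python
import numpy as np

# Small stand-ins for the (100, 40000, 4) arrays in the question
all_data = [np.full((2, 3, 4), i, dtype=float) for i in range(5)]

# One call: the output is allocated once and each input is copied once
big = np.concatenate(all_data, axis=0)

# The loop from Approach 1: each pass copies everything accumulated so far
arr = all_data[0].copy()
for data in all_data[1:]:
    arr = np.concatenate((arr, data), axis=0)
```

np.concatenate accepts any sequence of arrays, so a list of 100 arrays can be passed directly.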
I have a numpy array with shape (10, 10), for example.
Now I want to apply np.exp() to this array, but only to the elements that satisfy a condition. For example, I want to apply np.exp to all the elements that are not 0 or 1. Is there a way to do that without a for loop that iterates over each element of the array?
This is achievable with basic numpy operations. Here is one way to do it:
A = np.random.randint(0,5,size=(10,10)).astype(float) # data
goods = (A!=0) & (A!=1) # 10 x 10 boolean array
A[goods] = np.exp(A[goods]) # boolean indexing
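Another way to express the same update, shown here only as a sketch, uses the where= and out= arguments that numpy ufuncs accept. The seed and the copies below are assumptions added so that both variants can be compared:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(10, 10)).astype(float)  # data
original = A.copy()
goods = (A != 0) & (A != 1)                          # 10 x 10 boolean array

# Boolean indexing, applied to a copy for comparison
B = original.copy()
B[goods] = np.exp(B[goods])

# In-place ufunc call: entries where the mask is False keep their old value
np.exp(A, out=A, where=goods)
```

The out= form avoids the temporary array created by A[goods], which can matter for very large inputs.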
I have a 2-dimensional numpy array and need to apply a mathematical formula only to the values that match certain criteria. This can be done with a for loop and if conditions, but I think numpy's where() would be faster.
My code so far is this, but it doesn't work:
cond2 = np.where((SPN >= -alpha) & (SPN <= 0))
SPN[cond2] = -1*math.cos((SPN[cond2]*math.pi)/(2*alpha))
The values in the original array need to be replaced with the corresponding values after applying the formula.
Any ideas on how to make this work? I'm working with big arrays, so I need an efficient way of doing it.
Thanks
Try this:
cond2 = (SPN >= -alpha) & (SPN <= 0)
SPN[cond2] = -np.cos(SPN[cond2]*np.pi/(2*alpha))
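The original attempt fails because math.cos accepts only a single number, whereas np.cos works element by element on arrays. Here is a small sketch with assumed values for alpha and SPN, showing the masked in-place update next to the three-argument form of np.where, which builds a new array instead:

```python
import numpy as np

alpha = 1.0                                 # assumed value for illustration
SPN = np.array([[-2.0, -0.5], [0.0, 1.0]])  # assumed sample data

# Boolean mask of the entries with -alpha <= value <= 0
cond2 = (SPN >= -alpha) & (SPN <= 0)

# In-place update through the mask (done on a copy to keep SPN intact here)
updated = SPN.copy()
updated[cond2] = -np.cos(updated[cond2] * np.pi / (2 * alpha))

# Same result as a new array: take the formula where cond2 holds, else SPN
updated_where = np.where(cond2, -np.cos(SPN * np.pi / (2 * alpha)), SPN)
```

The masked assignment evaluates the formula only on the selected entries, while the np.where form evaluates it on the whole array and then picks values, so the former is usually preferable when few entries match.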