I have a huge dataframe, and I want to apply a custom function to several columns and put the result in a new column, but I have run into a problem.
The following is my function to calculate the distance between two rows.
def calcDist(p, q):
    diff = p - q
    square_diff = diff ** 2
    sum_square_diff = square_diff.sum()
    return sum_square_diff ** 0.5
One of the parameters of the function is constant (a series of 0s and 1s); the second parameter is the data from the selected columns of the dataframe (also something like a series of 0s and 1s).
I have tried the following codes.
cols = ['a','b','c']
new = [0,1,1]
df.columns = ['aa','a','b','c','dd','ee']
df['dist'] = df.loc[:, cols].apply(lambda x: calcDist(x, new))
But I get NaN in the 'dist' column.
I've already tried a for loop to solve this problem, but it works too slowly.
house_chosen['dist'] = 0
for i in range(len(house_chosen)):
    cols_chosen = house_chosen.loc[:, addition_list]
    series_chosen = cols_chosen.iloc[i, :]
    house_chosen.iloc[i, 42] = calcDist(new_house_addition, series_chosen)
So is there any way to solve the problem with the apply function?
thx
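(For reference, a minimal sketch of what a working apply call might look like, assuming new is meant to line up positionally with cols: converting it to a plain NumPy array avoids index alignment, and axis=1 makes apply pass rows rather than columns. The data here is made up.)

import numpy as np
import pandas as pd

# Hypothetical data standing in for the real dataframe
df = pd.DataFrame(np.random.randint(0, 2, size=(5, 4)), columns=['aa', 'a', 'b', 'c'])
cols = ['a', 'b', 'c']
new = np.array([0, 1, 1])  # plain array: no index alignment against the row

# axis=1 passes each row (restricted to cols) to calcDist as defined above
df['dist'] = df[cols].apply(lambda x: calcDist(x.to_numpy(), new), axis=1)

# Fully vectorized alternative, usually much faster on a huge frame
df['dist'] = np.sqrt(((df[cols].to_numpy() - new) ** 2).sum(axis=1))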
I have the following dataframe df, in which I highlighted in green the cells with values of interest (screenshot omitted), and I would like to obtain, for each column (therefore considering the whole dataframe), the following statistics: the number of occurrences of values less than or equal to 0.5 (the green cells in the dataframe), NaN values excluded, and their percentage within the column, so that I can use, say, 50% as a benchmark.
For this I tried value_counts, like (df['A'].value_counts()/df['A'].count())*100, but it returns a partial result, not in the form I want, and only for one column at a time; I was also thinking about using a filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted either.
The goal/output will be a dataframe (screenshot omitted) displaying just the columns that "beat" the benchmark (recall: at least half, i.e. 50%, of their values <= 0.5).
E.g. in column A the count would be 2 and the percentage 2/3 * 100 ≈ 66%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50% (the same goes for columns X, Y and Z). On the other hand, column C, where 2/8 * 100 = 25%, won't beat the benchmark and is therefore not considered in the output.
Is there a suitable way to achieve this, in your opinion? Apologies in advance if this is somewhat of a duplicate question, but I found no other questions able to help me out, and thanks to any saviour.
I believe I have understood your request in the code below.
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyway, the first part of the code below is just set-up and can be ignored, as you already have your data.
Basically, I have created a quick function for you that returns the percentage of values that are under a threshold you can define.
This function is called in a loop over all the columns of your dataframe, and if the percentage exceeds the output threshold (which you can also define), the column is kept for the final output.
import pandas as pd
import numpy as np
import random
import datetime

### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    peak = [round(random.uniform(0, 1), 1) for i in range(length)] + [0] * (10 - length)
    random.shuffle(peak)
    return peak

df = pd.DataFrame(
    {
        'A': rand_num_list(3),
        'B': rand_num_list(5),
        'C': rand_num_list(7),
        'D': rand_num_list(2),
        'E': rand_num_list(6),
        'F': rand_num_list(4)
    },
    index=date_list
)

df = df.replace({0: np.nan})
##############

print(df)
def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col] <= threshold]) / len(thresh_df[thresh_col].dropna())

output_dict = {'cols': []}
col_threshold = 0.5
output_threshold = 0.5

for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

df_output = df.loc[:, output_dict.get('cols')]

print(df_output)
Hope this achieves your goal!
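(As a side note, the same result can be had without an explicit loop; a vectorized sketch, reusing df, col_threshold and output_threshold from above:)

# Fraction of values <= col_threshold per column; NaNs count in neither numerator nor denominator
pct = df.le(col_threshold).sum() / df.notna().sum()

# Keep only the columns that reach the output threshold
df_output = df.loc[:, pct >= output_threshold]
print(df_output)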
I need to calculate a metric using a sliding window over a dataframe. If the metric needed just one column, I'd use rolling, but somehow it does not work with two or more columns.
Below is how I calculate the metric using a regular loop.
def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)

def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window + 1):
            slice_idx = df_.index[i: i + window - 1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data=result, index=df_.index)

df = pd.DataFrame(data=(np.random.rand(1000, 2) * 10).round(2), columns=['y_true', 'y_pred'])
%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
This takes close to a second for just 1000 rows.
Please suggest faster vectorized way to calculate such metric over sliding window.
In this specific case:
You can calculate the squared error beforehand and then use .rolling(...).mean():
df['sq_error'] = (df['y_true'] - df['y_pred'])**2
%time df['sq_error'].rolling(6).mean().dropna()
Please note that in your example the actual window size is 6 (print the slice length), that's why I set it to 6 in my snippet.
You can even write it like this:
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()
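As a quick sanity check (a sketch reusing df, rolling_metric and mean_squared_error from the question), the loop and the rolling version agree once you account for the alignment: the loop stores each result at the window start and skips the final window, while rolling stores it at the window end.

loop_res = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
roll_res = df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean()

# Drop the NaN/None padding and realign before comparing
assert np.allclose(loop_res.dropna().to_numpy(), roll_res.dropna().to_numpy()[:-1])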
In general:
In case you cannot reduce the computation to a single column: as of pandas 1.3.0 you can use the method='table' parameter to apply the function to the entire DataFrame. This, however, has the following requirements:
This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.
Therefore, your computation could be something like this:
WIN_LEN = 6

def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1]) ** 2)

df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()
Because it uses numba, this is also relatively fast.
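If installing numba is not an option, a NumPy-only sketch based on sliding_window_view (available since NumPy 1.20) covers this specific metric as well; note that, like the loop in the question, it aligns each result with the window start:

from numpy.lib.stride_tricks import sliding_window_view

diff = df['y_true'].to_numpy() - df['y_pred'].to_numpy()
windows = sliding_window_view(diff, WIN_LEN)   # shape: (n - WIN_LEN + 1, WIN_LEN)
mse = (windows ** 2).mean(axis=1)              # one value per window
result = pd.Series(mse, index=df.index[:len(mse)])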
I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: ...) in Python.
The custom function works on its own, but when put into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First, the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function:
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis=0)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending the results to sd under the column 's/n', but the only answer produced is NaN.
I have also tried a couple of structural changes, like using applymap() instead of apply():
sd['s/n'] = df.applymap(lambda x: snr_pd(x), na_action='ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
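For instance, one way to collect one SNR value per column into a labelled Series (a sketch, assuming every column of df is numeric and Zhangfit_output is defined as in the question):

# One SNR per column, labelled with the column names
snr = pd.Series(np.apply_along_axis(snr_pd, 0, df.to_numpy()), index=df.columns)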
New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows, with a column called 'Distance', and I want to calculate a new column (TotalCost) with apply and a lambda function that I have created. Below is a snippet of the function:
def TotalCost(Distance, m, c):
    return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding an example of input and desired output. Input:
df = {'Distance': [1, 2, 3]}
If we assume m and c both equal 10, then the output of applying the function should be
df['TotalCost'] = [20, 30, 40]
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
Your lambda in apply should process only one row at a time. By the way, apply returns only the calculated column, not the whole dataframe:
def TotalCost(Distance, m, c):
    return m * Distance + c

df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
apply basically passes one row at a time to your lambda function and collects the returned values; it then builds a new Series (or dataframe) from those values instead of altering the original dataframe.
Have a look at this link; it should help you gain more insight:
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd

def star(x, m, c):
    return x * m + c

vals = [(1, 2, 4),
        (3, 4, 5),
        (5, 6, 6)]

df = pd.DataFrame(vals, columns=('one', 'two', 'three'))
res = df.apply(star, axis=0, args=[2, 3])
Initial DataFrame
one two three
0 1 2 4
1 3 4 5
2 5 6 6
After applying the function you should get this stored in res
one two three
0 5 7 11
1 9 11 13
2 13 15 15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = @m * Distance + @c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))
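(A quick sketch of the @ syntax with the made-up numbers from the question; @ tells eval to look the name up among the local variables:)

import pandas as pd

m, c = 10, 10
df = pd.DataFrame({'Distance': [1, 2, 3]})
df.eval('TotalCost = @m * Distance + @c', inplace=True)
print(df)
#    Distance  TotalCost
# 0         1         20
# 1         2         30
# 2         3         40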
Here I have a function that computes a percentile column based on another column in the dataframe: for each row, the function builds a mini-dataframe containing only that row and the 20 before it, counts how many of those values fall below the current value, and assigns the resulting percentile to the current row.
A respondent to a previous question suggested I repost this as a more specific question about apply.
grid = np.random.rand(40, 1)
full = pd.DataFrame(grid, columns=['value'])

def percentile(x, df):
    if int(x.name) < 20:
        pass
    else:
        df_temp = df.loc[int(x.name) - 20:int(x.name)]
        bucketted = [b for b in df_temp.value if b < df_temp.loc[int(x.name), 'value']]
        return len(bucketted) / 0.2

full['percentile'] = full.apply(percentile, axis=1, args=(full,))
Out of intellectual curiosity, since this works: does anyone have a neater/faster way to approach the problem?
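(For what it's worth, a rolling-based sketch that should be equivalent for a single value column: each window holds the current row plus the 20 before it, and we count how many window values sit strictly below the current one.)

full['percentile'] = full['value'].rolling(21).apply(
    lambda w: (w < w[-1]).sum() / 0.2,  # w[-1] is the current row's value
    raw=True,
)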