Here I have a function that compute a percentile column based on 2 other columns in the dataframe:
for each row, the function recreate a mini df with only the last 20 rows, compute the absolute difference for each of them, and then assign a percentile to the current row.
I was suggested by a respondent to a previous question to repost that more specific question regarding apply
grid = np.random.rand(40,2)
full = pd.DataFrame(grid, columns=['value'])
def percentile(x, df):
if int(x.name)<20:
pass
else:
df_temp = df.loc[(int(x.name)-20):int(x.name),'value']
bucketted = [b for b in df_temp.value if b < df_temp.loc[int(x.name), 'value']]
return len(bucketted)/0.2
full['percentile'] = full.apply(percentile, axis=1, args=(full,))
for intellectual curiosity - since this works - if anyone has a neater /faster way to approach the problem.
Related
I have the following dataframe df, in which I highlighted in green the cells with values of interest:
enter image description here
and I would like to obtain for each columns (therefore by considering the whole dataframe) the following statistics: the occurrence of a value less or equal to 0.5 (green cells in the dataframe) -Nan values are not to be included- and its percentage in the considered columns in order to use say 50% as benchmark.
For the point asked I tried with value_count like (df['A'].value_counts()/df['A'].count())*100, but this returns the partial result not the way I would and only for specific columns; I was also thinking about using filter or lamba function like df.loc[lambda x: x <= 0.5] but cleary that is not the result I wanted.
The goal/output will be a dataframe as shown below in which are displayed just the columns that "beat" the benchmark (recall: at least (half) 50% of their values <= 0.5).
enter image description here
e.g. in column A the count would be 2 and the percentage: 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage: 4/8 * 100 = 50%. (The same goes for columns X, Y and Z). On the other hand in column C where 2/8 *100 = 25% won't beat the benchmark and therefore not considered in the output.
Is there a suitable way to achieve this IYHO? Apologies in advance if this was a kinda duplicated question but I found no other questions able to help me out, and thx to any saviour.
I believe I have understood your ask in the below code...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyways the first part of the code below is just set up so can be ignored as you already have your data set up.
Basically I have created a quick function for you that will return the percentage of values that are under a threshold that you can define.
This function is called in a loop of all the columns within your dataframe and if this percentage is more than the output threshold (again you can define it) it will keep it for actually outputting.
import pandas as pd
import numpy as np
import random
import datetime
### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]
def rand_num_list(length):
peak = [round(random.uniform(0,1),1) for i in range(length)] + [0] * (10-length)
random.shuffle(peak)
return peak
df = pd.DataFrame(
{
'A':rand_num_list(3),
'B':rand_num_list(5),
'C':rand_num_list(7),
'D':rand_num_list(2),
'E':rand_num_list(6),
'F':rand_num_list(4)
},
index=date_list
)
df = df.replace({0:np.nan})
##############
print(df)
def less_than_threshold(thresh_df, thresh_col, threshold):
if len(thresh_df[thresh_col].dropna()) == 0:
return 0
return len(thresh_df.loc[thresh_df[thresh_col]<=threshold]) / len(thresh_df[thresh_col].dropna())
output_dict = {'cols':[]}
col_threshold = 0.5
output_threshold = 0.5
for col in df.columns:
if less_than_threshold(df, col, col_threshold) >= output_threshold:
output_dict['cols'].append(col)
df_output = df.loc[:,output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
I'm trying to calculate Welles Wilder's type of moving average in a panda dataframe (also called cumulative moving average).
The method to calculate the Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set as the mean for the 'n' position.
For the following values use the previous mean weighed by (n-1) and the current value of the series weighed by 1 and divide all by 'n'.
My question is: how to implement this in a vectorized way?
I tried to do it iterating over the dataframe (what a I read isn't recommend because is slow). It works, the values are correct, but I get an error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np
#Building Random sample:
datas = pd.date_range('2020-01-01','2020-01-31')
np.random.seed(693)
A = np.random.randint(40,60, size=(31,1))
df = pd.DataFrame(A,index = datas, columns = ['A'])
period = 12 # Main parameter
initial_mean = A[0:period].mean() # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean
for x in range(period, size):
df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period) # Equation for the following values.
print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
pd.Series(
data=[df['A'].iloc[:period].mean()],
index=[df['A'].index[period-1]],
).append(
df['A'].iloc[period:]
)
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.Series(
data=[df['A'].iloc[:period].mean()],
index=[df['A'].index[period-1]],
).append(
df['A'].iloc[period:]
).ewm(
alpha=1.0 / period,
adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
You might want to consider skipping the direct average between the first period items and simply start applying ewm() from the start, which will assume the first row is the previous average in the first calculation. The results will be slightly different but once you've gone through a couple of periods then those initial values will hardly influence the results.
That would be a way more simple calculation:
df['D'] = df['A'].ewm(
alpha=1.0 / period,
adjust=False,
).mean()
I have a huge dataframe, and I want to use several columns to apply a custom function, and put the result in a new column. But I have met a problem.
The following is my function to calculate the distance between two rows.
def calcDist(p, q):
diff = p - q
square_diff = diff ** 2
sum_square_diff = square_diff.sum()
return sum_square_diff ** 0.5
One of the parameters in the function is constant(a series with 0 and 1), the second parameter of the function is the data in the dataframe which in the selected columns(somthing like a series with 0 and 1).
I have tried the following codes.
cols = ['a','b','c']
new = [0,1,1]
df.columns = ['aa','a','b','c','dd','ee']
df['dist'] = df.loc[:,cols].apply(lamda x: calcdist(x, new))
But I get NaN in the 'dist' column.
I 've already tried for loop to solve this problem. But it works to slow.
house_chosen['dist'] = 0
for i in range(len(house_chosen)):
cols_chosen = house_chosen.loc[:, addition_list]
series_chosen = cols_chosen.iloc[i, :]
house_chosen.iloc[i, 42] = calcDist(new_house_addition, series_chosen)
So is there any way to solve the problem with apply function?
thx
I'm trying to understand how to go around this in python pandas. My objective is to fill column "RESULT" with the initial investment and apply the profit on top of the previous result.
So if I would use an excel spreadsheet I would do this:
Ask what's the initial_investment (in this example $350)
Compute the first row as profit/100*initial_investment + initial_investment
the 2nd and forth will be the same with the exception that "initial_investment" is in the raw above.
my initial python code is this
import pandas as pd
df = pd.DataFrame({"DATE":[2009,2010,2011,2012,2013,2014,2015,2016],"PROFIT":[10,4,5,7,-10,5,-5,3],"RESULT":[350,350,350,350,350,350,350,350]})
print df
You can use the cumulative product function cumprod():
df['RESULT'] = ((df.PROFIT + 100) / 100.).cumprod() * 350
First you transform df.PROFIT into a proportion of the previous value. Then cumprod() multiplies each row by the previous rows. You can then just multiply this by whatever your initial value is.
Suppose I have a DataFrame with columns person_id and mean_act, where every row is a numerical value for a specific person. I want to calculate the zscore for all the values at a person level. That is, I want a new column mean_act_person_zscore that is computed as the zscore of mean_act using the mean and std of the zscores for that person only (and not the whole dataset).
My first approach is something like this:
person_ids = df['person_id'].unique()
for pid in person_ids:
person_df = df[df['person_id'] == pid]
person_df = (person_df['mean_act'] - person_df['mean_act'].mean())/person_df['mean_act'].std()
At every iteration, it computes the right zscore output series, but the problem is that since the selection is by reference, not by value, the original df ends up without having the mean_act_person_zscore column.
Thoughts as to how to do this?
Should be straight forward:
df['mean_act_person_zscore'] = df.groupby('person_id').mean_act.transform(lambda x: (x - x.mean()) / x.std())