Python - Take weighted average inside Pandas groupby while ignoring NaN

I need to group a Pandas dataframe by date, and then take a weighted average of given values. Here's how it's currently done using the margin value as an example (and it works perfectly until there are NaN values):
df = orders.copy()
# Create new columns as required
df['margin_WA'] = df['net_margin'].astype(float) # original data as str or Decimal
def group_wa():
    return lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])

agg_func = {
    'margin_WA': group_wa(),  # agg_func includes WAs for other elements
}
result = df.groupby('order_date').agg(agg_func)
result['margin_WA'] = result['margin_WA'].astype(str)
In the case where 'net_margin' fields contain NaN values, the WA is set to NaN. I can't seem to dropna() or filter by pd.notnull when creating new columns, and I don't know where to create a masked array to avoid passing NaN to the group_wa function (as suggested here). How do I ignore NaN in this case?

I think a simple solution is to drop the missing values before you groupby/aggregate like:
result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)
In this case, no indices containing missing values are passed to your group_wa function.
Edit
Another approach is to move the dropna into your aggregating function like:
def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])
agg_func = {'margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
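A self-contained sketch of this second approach, using made-up data (column names follow the question):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the question's `orders` frame
df = pd.DataFrame({
    'order_date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02'],
    'margin_WA': [0.10, np.nan, 0.30, 0.20],
    'order_amount': [100, 50, 300, 10],
})

def group_wa(series):
    # Drop NaNs first, so np.average never sees them and the
    # weights are looked up only for the surviving rows
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

result = df.groupby('order_date').agg({'margin_WA': group_wa})
# 2020-01-01 -> (0.10*100 + 0.30*300) / 400 = 0.25; the NaN row is ignored
```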

Related

How do I fill NaN values with different random numbers on Python?

I want to replace the missing values in a column of people's ages (which also contains numerical values, not only NaN) but everything I've tried so far either doesn't work how I want it to or doesn't work at all.
I wish to apply a random variable generator which follows a normal distribution using the mean and standard deviation obtained with that column.
I have tried the following:
Replacing with numpy, replaces NaN values but with the same number for all of them
df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
Fillna with pandas, also replaces NaN values but with the same number for all of them
df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
Applying a function on the dataframe with pandas, replaces NaN values but also changes all existing numerical values (I only wish to fill the NaN values)
df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
Any ideas would be appreciated. Thanks in advance.
Series.fillna can accept a Series, so generate a random array of size len(df_travel):
rng = np.random.default_rng(0)
mu = df_travel['Age'].mean()
sd = df_travel['Age'].std()
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)), index=df_travel.index)
df_travel['Age'] = df_travel['Age'].fillna(filler)  # fillna aligns on the index
I would go about it the following way:
# compute mean and std of `Age`
age_mean = df['Age'].mean()
age_std = df['Age'].std()
# number of NaN in `Age` column
num_na = df['Age'].isna().sum()
# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)
# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals
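A runnable version of the steps above, on a tiny made-up frame (a fixed seed keeps it reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'Age': [20.0, np.nan, 30.0, np.nan, 40.0]})

age_mean = df['Age'].mean()
age_std = df['Age'].std()
num_na = df['Age'].isna().sum()

# one draw from N(age_mean, age_std**2) per missing value
rand_vals = age_mean + age_std * np.random.randn(num_na)
# boolean-mask assignment touches only the NaN rows
df.loc[df['Age'].isna(), 'Age'] = rand_vals
```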

How to use .apply(lambda x: function) over all the columns of a dataframe

I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: ...) in Python.
The custom function I have created works individually, but when put into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
first is the custom function -
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis=0)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple structure changes like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
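For instance, with a toy frame and a simpler column-wise function (names and data are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

def peak_to_noise(col):
    # Each column arrives as a plain 1-D numpy array
    return np.max(col) / np.std(col)

result = np.apply_along_axis(peak_to_noise, axis=0, arr=df)
# result is a numpy array with one value per column of df
```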

Replace NaN values from one column with different length into other column with additional condition

I am working with the Titanic data set. This set has 891 rows. At the moment I am focused on the column 'Age'.
import pandas as pd
import numpy as np
import os
titanic_df = pd.read_csv('titanic_data.csv')
titanic_df['Age']
Column 'Age' has 177 NaN values, so I want to replace them with values from my sample. I already made a sample for this column; you can see the code below.
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(177)
So the next step should be replacing the NaN values in titanic_df['Age'] with values from age_sample. In order to do this I tried these lines of code.
titanic_df ['Age']=age_sample
titanic_df ['Age'].isna()=age_sample
But obviously I made some mistakes here. So can anybody help me replace values from the sample (177 rows) into the original data set (891 rows), replacing only the NaN values?
A two-line solution:
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(177))
If number of NaN values is not known:
nan_len = len(df['Age'][df['Age'].isna()])
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(nan_len))
You need to select the subframe you want to update using loc, and hand the sample over as a plain array (age_sample keeps the index of the non-null rows, so pandas would otherwise try to align those indices with the NaN rows and assign nothing):
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.to_numpy()
I will divide my answer to two parts. Solution you are looking for and solution that makes it more robust.
Solution you are looking for
We have to find the number of missing values first, then generate that many samples, and then assign them. This ensures the sample size matches the number of missing values.
...
age_na_size = titanic_df['Age'].isna().sum()
# generate a sample of that size
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(age_na_size)
# feed that to the missing values (as a plain array, so indices don't align)
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.to_numpy()
Solutions to make it robust
Find the group mean age and replace missing values accordingly. For example, group by gender, cabin, or other features that make sense, and use each group's median age as the replacement.
Use k-Nearest Neighbours as the age replacer. See scikit-learn's
KNNImputer
Use bins of age instead of actual ages. This way you can first create a classifier to predict the age bin, then use that as your imputer.
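The group-median idea from the first bullet can be sketched with pandas alone (the 'Sex' grouping and the data are just an illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Age': [20.0, np.nan, 40.0, 28.0, np.nan],
})

# transform('median') broadcasts each group's median back to its rows,
# so fillna only touches the NaN positions
df['Age'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('median'))
```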

Define a function in python to make the NaN values zero with variable number of arguments

I am trying to define a function which fills zeros into all the NaN values of the given columns of the input dataframe.
The function takes one fixed argument (the dataframe) and a variable number of arguments for column names in the dataframe.
I am using below code -
def makeNa(df, *colnames):
    for col in colnames:
        return df.fillna({col: 0}, inplace=True)
The function works when I pass only one column name, e.g. makeNa(df_test, 'USA'): it makes all the NaN values zero in column 'USA'.
But the function doesn't work when passing more than one column name, e.g. makeNa(df_test, 'USA', 'Japan'): it doesn't zero the NaN values in column 'Japan'.
You don't let the loop end; you return on the first iteration (and since fillna(..., inplace=True) returns None, the caller gets None as well). You could do something like:
def makeNa(df, *colnames):
    df[list(colnames)] = df[list(colnames)].fillna(0)
    return df
def makeNa(df, *colnames):
    for col in colnames:
        df.fillna({col: 0}, inplace=True)
    return df
You are returning inside the for loop; instead, return after it.
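Putting the corrected loop version together with a quick usage check (df_test here is made-up data):

```python
import numpy as np
import pandas as pd

def makeNa(df, *colnames):
    # Fill NaNs with 0 in each named column; return only after the loop ends
    for col in colnames:
        df.fillna({col: 0}, inplace=True)
    return df

df_test = pd.DataFrame({'USA': [1.0, np.nan], 'Japan': [np.nan, 2.0]})
makeNa(df_test, 'USA', 'Japan')
# both columns are now NaN-free
```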

Adding groupby transform result to an existing pandas DataFrame with each row representing a group

TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
df = pandas.DataFrame({"Amount": [numpy.nan, 0, numpy.nan, 0, 0, 100, 200, 50, 0,
                                  numpy.nan, numpy.nan, 100, 200, 100, 0],
                       "Id": [0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
                       "Date": pandas.to_datetime(["2011-11-02", "NA", "2011-11-03", "2011-11-04",
                                                   "2011-11-05", "NA", "2011-11-04", "2011-11-04",
                                                   "2011-11-06", "2011-11-06", "2011-11-06", "2011-11-06",
                                                   "2011-11-08", "2011-11-08", "2011-11-08"], errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works - but let's say I want a special std, which ignores 0, and regards each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only not grouped anymore... I guess I can calculate all of these into columns of the original DataFrame (in this case df) before I take the representing row of the group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory space since many rows corresponding to the same group would get the same value, and I'd also have to group df twice - once for calculating these statistics, and a second time to get the representing row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
You can also use named aggregation to get each name as a column (passing a dictionary to a SeriesGroupBy's agg was deprecated and later removed):
g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
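Reproducing the question's setup (without the Date column, which doesn't matter here) shows that agg comes back with one value per Id, ready to attach to the per-group frame:

```python
import numpy
import pandas

df = pandas.DataFrame({
    "Amount": [numpy.nan, 0, numpy.nan, 0, 0, 100, 200, 50, 0,
               numpy.nan, numpy.nan, 100, 200, 100, 0],
    "Id": [0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})
g = df.groupby("Id")

def get_unique_std(group):
    vals = group.unique()      # ndarray; NaN fails the > 0 test below
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0

f = g.first()
f["std"] = g.Amount.agg(get_unique_std)  # index-aligned on Id, unlike transform
```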
