I created a dataframe using groupby and pd.cut to calculate the mean, std, and number of elements inside each bin. I used agg(); this is the command:
df_bin = df.groupby(pd.cut(df.In_X, ranges, include_lowest=True)).agg(['mean', 'std', 'size'])
df_bin looks like this:
                       X                      Y
                    mean  std  size       mean  std  size
In_X
(10.424, 10.43]   10.425  NaN     1   0.003786  NaN     1
(10.43, 10.435]     10.4  NaN     0        NaN  NaN     0
I want to create an array with the values of the mean for the first header X. If I didn't have the two header levels, I would use something like:
mean=np.array(df_bin['mean'])
But how do I do that with the two header levels?
This documentation would serve you well: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
To answer your question, if you just want a particular column:
mean = np.array(df_bin['X', 'mean'])
But if you wanted to slice to the second level:
mean = np.array(df_bin.loc[:, (slice(None), 'mean')])
Or:
mean = np.array(df_bin.loc[:, pd.IndexSlice[:, 'mean']])
We can also do:
df_bin.stack(level=0)['mean'].values
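For completeness, here is a minimal sketch of the same selection using DataFrame.xs, which takes a cross-section on the second column level (it assumes the df_bin built above):
import numpy as np
# slice the 'mean' label out of the second column level for every first-level group (X, Y)
means = df_bin.xs('mean', axis=1, level=1)
# just the X mean as a NumPy array
mean_x = np.array(means['X'])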
Related
I have a dataframe like the one below
id,status,amount,qty
1,pass,123,4500
1,pass,156,3210
1,fail,687,2137
1,fail,456,1236
2,pass,216,324
2,pass,678,241
2,nan,637,213
2,pass,213,543
df = pd.read_clipboard(sep=',')
I would like to do the below
a) Groupby id and compute the pass percentage for each id
b) Groupby id and compute the average amount for each id
So, I tried the below
df['amt_avg'] = df.groupby('id')['amount'].mean()
df['pass_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
df['fail_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
but this doesn't work.
I am having trouble getting the pass percentage.
In my real data I have a lot of columns like status for which I have to find the % distribution of a specific value (e.g. pass).
I expect my output to look like the below
id,pass_pct,fail_pct,amt_avg
1,50,50,2770.75
2,75,0,330.25
Use crosstab with missing values replaced by the string 'nan', remove the 'nan' column, and then add the new column amt_avg with DataFrame.join:
s = df.groupby('id')['qty'].mean()
df = (pd.crosstab(df['id'], df['status'].fillna('nan'), normalize=0)
.drop('nan', axis=1)
.mul(100)
.join(s.rename('amt_avg')))
print(df)
     fail  pass  amt_avg
id
1    50.0  50.0  2770.75
2     0.0  75.0   330.25
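If you only need the share of one particular value (e.g. pass) across many status-like columns, a plain groupby works too. This is just a rough sketch starting from the original frame and using the column names from the sample data (qty is averaged to match the expected output):
# percentage of 'pass' per id (missing statuses still count toward the denominator)
pass_pct = df.groupby('id')['status'].apply(lambda s: s.eq('pass').mean() * 100)
# average qty per id
amt_avg = df.groupby('id')['qty'].mean()
out = pd.concat([pass_pct.rename('pass_pct'), amt_avg.rename('amt_avg')], axis=1)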
I want to replace the missing values in a column containing people's ages (the column also holds numerical values, not only NaNs), but everything I've tried so far either doesn't work how I want it to or doesn't work at all.
I wish to fill them with random values drawn from a normal distribution, using the mean and standard deviation computed from that column.
I have tried the following:
Replacing with numpy: replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
Fillna with pandas: also replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
Applying a function on the dataframe with pandas: replaces NaN values, but also changes all existing numerical values (I only wish to fill the NaN values)
df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
Any ideas would be appreciated. Thanks in advance.
Series.fillna can accept a Series, so generate a random array of size len(df_travel):
rng = np.random.default_rng(0)
mu = df_travel['Age'].mean()
sd = df_travel['Age'].std()
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)))
df_travel['Age'] = df_travel['Age'].fillna(filler)
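Note that fillna aligns the filler Series on the index, so this assumes df_travel has a default RangeIndex; if it doesn't, build the filler with index=df_travel.index.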
I would go about it the following way:
# compute mean and std of `Age`
age_mean = df['Age'].mean()
age_std = df['Age'].std()
# number of NaN in `Age` column
num_na = df['Age'].isna().sum()
# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)
# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals
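If you need the fill to be reproducible, seed NumPy first (for example np.random.seed(0)) before generating rand_vals.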
The values seem to get replaced, but the moment I print the data it still shows the NaN values.
for col in data.columns:
    for each in range(len(data[col])):
        if math.isnan(data[col][each]) == True:
            data.replace(data[col][each], statistics.mean(data[col]))
data
dataset: https://docs.google.com/spreadsheets/d/1AVTVmUVs9lSe7I9EXoPs0gNaSIo9KM2PrXxwVWeqtME/edit?usp=sharing
It looks like what you are trying to do is replace NaN values with the mean of each column, which has been treated here.
Regarding your problem: the function replace(a, b) replaces all the values in your dataframe that are equal to a with b, and it returns a new dataframe rather than modifying the data in place (unless you pass inplace=True), so the result of your call is simply discarded.
Moreover, the function statistics.mean will return NaN if there is a NaN in your list, so you should use numpy.nanmean() instead.
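For reference, a minimal sketch of the idiomatic fix, assuming data is the dataframe from the question: fill each column's NaNs with that column's own mean in one call (pandas' mean skips NaNs by default), and rebind the result since fillna is not in place:
# compute per-column means of the numeric columns and fill NaNs with them
data = data.fillna(data.mean(numeric_only=True))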
I have 2 columns in a dataframe and I am trying to enter a condition based on whether the second one is NaN and the first one has a value, unsuccessfully using:
if np.isfinite(train_bk['Product_Category_1']) and np.isnan(train_bk['Product_Category_2'])
and
if not (train_bk['Product_Category_2']).isnull() and (train_bk['Product_Category_3']).isnull()
I would use eval:
df = df.eval('ind = ((pc1 == pc1) & (pc2 != pc2)) * 2 + ((pc1 == pc1) & (pc2 == pc2)) * 3')
df = df.replace({'ind': {0: 1}})
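A more explicit alternative is a boolean mask instead of eval; this sketch assumes the original column names from the question:
# rows where the first category is present and the second is missing
mask = train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].isna()
subset = train_bk[mask]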
I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-05-22 105.24 NaN 6507.58
2013-05-23 104.63 NaN 6393.86
2013-05-26 104.62 NaN 6521.54
2013-05-27 104.62 NaN 6609.31
2013-05-28 104.54 87.79 6640.24
2013-05-29 103.91 86.88 6577.39
2013-05-30 103.43 87.66 6516.55
2013-06-02 103.56 87.55 6559.43
I would like to compute the first non-NaN value in each column.
As Locate first and last non NaN values in a Pandas DataFrame points out, first_valid_index can be used. Unfortunately, it returns the first row where at least one element is not NaN and does not work per-column.
You should use apply, which applies a function to each column (the default) or each row:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first valid index for each column.
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)
The built-in function DataFrame.groupby().column.first() returns the first non-null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't need the first value per group, you can add a dummy column of 1s so the whole frame forms a single group, then get the first non-null value using the groupby and first functions.
from pandas import DataFrame
df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
By compute I assume you mean access?
The simplest way to do this is with the pd.Series.first_valid_index() method probably inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one-line solution is on a per-column basis, i.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.