I created a dataframe using groupby and pd.cut to calculate the mean, std, and number of elements inside each bin. I used agg(); this is the command:
df_bin = df.groupby(pd.cut(df.In_X, ranges, include_lowest=True)).agg(['mean', 'std', 'size'])
df_bin looks like this:
                       X                      Y
                    mean  std  size       mean  std  size
In_X
(10.424, 10.43]   10.425  NaN     1   0.003786  NaN     1
(10.43, 10.435]     10.4  NaN     0        NaN  NaN     0
I want to create an array with the values of the mean for the first header X. If I didn't have the two header levels, I would use something like:
mean=np.array(df_bin['mean'])
But how do I do that with the two header levels?
This documentation would serve you well: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
To answer your question, if you just want a particular column:
mean = np.array(df_bin['X', 'mean'])
But if you wanted to slice to the second level:
mean = np.array(df_bin.loc[:, (slice(None), 'mean')])
Or:
mean = np.array(df_bin.loc[:, pd.IndexSlice[:, 'mean']])
We can also do:
df_bin.stack(level=0)['mean'].values
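For completeness, here is a minimal sketch of the same selection using DataFrame.xs, which takes a cross-section on the second column level (it assumes the df_bin built above):
import numpy as np
# slice the 'mean' label out of the second column level for every first-level group (X, Y)
means = df_bin.xs('mean', axis=1, level=1)
# just the X mean as a NumPy array
mean_x = np.array(means['X'])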
Related
I have a dataframe like the one below
id,status,amount,qty
1,pass,123,4500
1,pass,156,3210
1,fail,687,2137
1,fail,456,1236
2,pass,216,324
2,pass,678,241
2,nan,637,213
2,pass,213,543
df = pd.read_clipboard(sep=',')
I would like to do the below
a) Groupby id and compute the pass percentage for each id
b) Groupby id and compute the average amount for each id
So, I tried the below
df['amt_avg'] = df.groupby('id')['amount'].mean()
df['pass_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
df['fail_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
but this doesn't work.
I am having trouble getting the pass percentage.
In my real data I have a lot of columns like status for which I have to find the % distribution of a specific value (e.g. pass).
I expect my output to look like the below
id,pass_pct,fail_pct,amt_avg
1,50,50,2770.75
2,75,0,330.25
Use crosstab with missing values replaced by the string 'nan', remove the 'nan' column, and then add the new column amt_avg with DataFrame.join:
s = df.groupby('id')['qty'].mean()
df = (pd.crosstab(df['id'], df['status'].fillna('nan'), normalize=0)
.drop('nan', axis=1)
.mul(100)
.join(s.rename('amt_avg')))
print(df)
     fail  pass  amt_avg
id
1    50.0  50.0  2770.75
2     0.0  75.0   330.25
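If you only need the share of one particular value (e.g. pass) across many status-like columns, a plain groupby works too. This is just a rough sketch starting from the original frame and using the column names from the sample data (qty is averaged to match the expected output):
# percentage of 'pass' per id (missing statuses still count toward the denominator)
pass_pct = df.groupby('id')['status'].apply(lambda s: s.eq('pass').mean() * 100)
# average qty per id
amt_avg = df.groupby('id')['qty'].mean()
out = pd.concat([pass_pct.rename('pass_pct'), amt_avg.rename('amt_avg')], axis=1)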
I want to replace the missing values in a column containing people's ages (the column also holds numerical values, not only NaNs), but everything I've tried so far either doesn't work how I want it to or doesn't work at all.
I wish to fill them with random values drawn from a normal distribution, using the mean and standard deviation computed from that column.
I have tried the following:
Replacing with numpy: replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
Fillna with pandas: also replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
Applying a function on the dataframe with pandas: replaces NaN values, but also changes all existing numerical values (I only wish to fill the NaN values)
df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
Any ideas would be appreciated. Thanks in advance.
Series.fillna can accept a Series, so generate a random array of size len(df_travel):
rng = np.random.default_rng(0)
mu = df_travel['Age'].mean()
sd = df_travel['Age'].std()
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)))
df_travel['Age'] = df_travel['Age'].fillna(filler)
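Note that fillna aligns the filler Series on the index, so this assumes df_travel has a default RangeIndex; if it doesn't, build the filler with index=df_travel.index.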
I would go about it the following way:
# compute mean and std of `Age`
age_mean = df['Age'].mean()
age_std = df['Age'].std()
# number of NaN in `Age` column
num_na = df['Age'].isna().sum()
# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)
# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals
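If you need the fill to be reproducible, seed NumPy first (for example np.random.seed(0)) before generating rand_vals.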
The values seem to get replaced, but the moment I print the data it still shows the NaN values.
for col in data.columns:
    for each in range(len(data[col])):
        if math.isnan(data[col][each]) == True:
            data.replace(data[col][each], statistics.mean(data[col]))
data
dataset: https://docs.google.com/spreadsheets/d/1AVTVmUVs9lSe7I9EXoPs0gNaSIo9KM2PrXxwVWeqtME/edit?usp=sharing
It looks like what you are trying to do is replace NaN values with the mean of each column, which has been treated here.
Regarding your problem: the function replace(a, b) replaces all the values in your dataframe that are equal to a with b, and it returns a new dataframe rather than modifying the data in place (unless you pass inplace=True), so the result of your call is simply discarded.
Moreover, the function statistics.mean will return NaN if there is a NaN in your list, so you should use numpy.nanmean() instead.
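For reference, a minimal sketch of the idiomatic fix, assuming data is the dataframe from the question: fill each column's NaNs with that column's own mean in one call (pandas' mean skips NaNs by default), and rebind the result since fillna is not in place:
# compute per-column means of the numeric columns and fill NaNs with them
data = data.fillna(data.mean(numeric_only=True))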
I have 2 columns in a dataframe and I am trying to enter a condition based on whether the second one is NaN and the first one has a value, unsuccessfully using:
if np.isfinite(train_bk['Product_Category_1']) and np.isnan(train_bk['Product_Category_2'])
and
if not (train_bk['Product_Category_2']).isnull() and (train_bk['Product_Category_3']).isnull()
I would use eval:
df = df.eval('ind = ((pc1 == pc1) & (pc2 != pc2)) * 2 + ((pc1 == pc1) & (pc2 == pc2)) * 3')
df = df.replace({'ind': {0: 1}})
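A more explicit alternative is a boolean mask instead of eval; this sketch assumes the original column names from the question:
# rows where the first category is present and the second is missing
mask = train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].isna()
subset = train_bk[mask]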
I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-05-22 105.24 NaN 6507.58
2013-05-23 104.63 NaN 6393.86
2013-05-26 104.62 NaN 6521.54
2013-05-27 104.62 NaN 6609.31
2013-05-28 104.54 87.79 6640.24
2013-05-29 103.91 86.88 6577.39
2013-05-30 103.43 87.66 6516.55
2013-06-02 103.56 87.55 6559.43
I would like to compute the first non-NaN value in each column.
As Locate first and last non NaN values in a Pandas DataFrame points out, first_valid_index can be used. Unfortunately, it returns the first row where at least one element is not NaN and does not work per-column.
You should use apply, which applies a function to each column (the default) or each row:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first valid index for each column.
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)
The built-in function DataFrame.groupby().column.first() returns the first non-null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't need the first value per group, you can add a dummy column of 1s so the whole frame forms a single group, then get the first non-null value using the groupby and first functions.
from pandas import DataFrame
df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
By compute I assume you mean access?
The simplest way to do this is with the pd.Series.first_valid_index() method probably inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one-line solution is on a per-column basis, i.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.