I recently started working with Pandas and I'm currently trying to impute some missing values in my dataset.
I want to impute the missing values based on the median (for numerical entries) and mode (for categorical entries). However, I do not want to calculate the median and mode over the whole dataset, but per-group, based on a GroupBy of my column called "make".
For numerical NA values I did the following:
data = data.fillna(data.groupby("make").transform("median"))
...which works perfectly and replaces all my numerical NA values with the median of their "make".
However, for categorical NA values, I couldn't manage to do the same thing for the mode, i.e. replace all categorical NA values with the mode of their "make".
Does anyone know how to do it?
You can use GroupBy.transform with an if-else lambda: median for numeric columns, mode for categorical ones:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb'),
    'make': list('aaabbb')
})
df.loc[[2,4], 'A'] = np.nan
df.loc[[2,5], 'F'] = np.nan
print (df)
A B C D F make
0 e NaN 7.0 1.0 a a
1 b NaN NaN 3.0 a a
2 NaN 4.0 9.0 5.0 NaN a
3 d 5.0 4.0 NaN b b
4 NaN 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 NaN b
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df = df.fillna(df.groupby('make').transform(f))
print (df)
   A    B    C    D  F make
0  e  4.0  7.0  1.0  a    a
1  b  4.0  8.0  3.0  a    a
2  b  4.0  9.0  5.0  a    a
3  d  5.0  4.0  0.5  b    b
4  d  5.0  2.0  1.0  b    b
5  d  4.0  3.0  0.0  b    b
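If the dtype-checking lambda feels fragile (for example, x.mode().iloc[0] raises on an all-NaN group), an alternative sketch is to split the columns first and fill each set separately. This is not from the answer above; it is the same idea restated with a trimmed-down frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["e", "b", np.nan, "d", np.nan, "d"],
    "B": [np.nan, np.nan, 4, 5, 5, 4],
    "make": list("aaabbb"),
})

# Numeric columns: fill with the per-"make" median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df.groupby("make")[num_cols].transform("median"))

# Remaining (categorical) columns: fill with the per-"make" mode
# (mode() skips NaN; iloc[0] takes the first mode if there are ties)
cat_cols = df.columns.drop(num_cols).drop("make")
df[cat_cols] = df[cat_cols].fillna(
    df.groupby("make")[cat_cols].transform(lambda s: s.mode().iloc[0])
)
```

Splitting the columns also makes it easy to handle an all-NaN group explicitly instead of letting the lambda raise.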
Given two grouped dataframes (df_train & df_test), how do I fill the missing values of df_test using values derived from df_train? For this example, I used median.
df_train=pd.DataFrame({'col_1':['A','B','A','A','C','B','B','A','A','B','C'], 'col_2':[float('NaN'),2,1,3,1,float('NaN'),2,3,2,float('NaN'),1]})
df_test=pd.DataFrame({'col_1':['A','A','A','A','B','C','C','B','B','B','C'], 'col_2':[3,float('NaN'),1,2,2,float('NaN'),float('NaN'),3,2,float('NaN'),float('NaN')]})
# These are the median values derived from df_train which I would like to impute into df_test based on the column col_1.
values_used_in_df_train = df_train.groupby(by='col_1')['col_2'].median()
values_used_in_df_train
col_1
A 2.5
B 2.0
C 1.0
Name: col_2, dtype: float64
# For df_train, I can simply do the following:
df_train.groupby('col_1')['col_2'].transform(lambda x : x.fillna(x.median()))
I tried df_test.groupby('col_1')['col_2'].transform(lambda x : x.fillna(values_used_in_df_train)), which does not work (inside transform, fillna aligns on the row index, not on the group label).
So I want:
df_test
col_1 col_2
0 A 3.0
1 A NaN
2 A 1.0
3 A 2.0
4 B 2.0
5 C NaN
6 C NaN
7 B 3.0
8 B 2.0
9 B NaN
10 C NaN
to become
df_test
col_1 col_2
0 A 3.0
1 A 2.5
2 A 1.0
3 A 2.0
4 B 2.0
5 C 1.0
6 C 1.0
7 B 3.0
8 B 2.0
9 B 2.0
10 C 1.0
Below are just my thoughts; feel free to ignore them, as they may be irrelevant or confusing.
I guess I could use an if-else method to match row by row against the index of values_used_in_df_train, but I am trying to achieve this within groupby.
Try merging df_test and values_used_in_df_train:
df_test = df_test.merge(values_used_in_df_train.reset_index(), on='col_1', how='left', suffixes=('', '_y'))
Finally, fill the missing values using fillna():
df_test['col_2'] = df_test['col_2'].fillna(df_test.pop('col_2_y'))
OR
Another way (if order is not important): concatenate df_test and values_used_in_df_train, then drop the NaNs (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df_test = (pd.concat([df_test, values_used_in_df_train.reset_index()])
           .dropna(subset=['col_2'])
           .reset_index(drop=True))
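A simpler route (a sketch, not part of the answer above): map the train medians onto df_test['col_1'] and feed the result to fillna, which keeps the row count and order intact:

```python
import pandas as pd

df_train = pd.DataFrame({'col_1': ['A','B','A','A','C','B','B','A','A','B','C'],
                         'col_2': [float('NaN'),2,1,3,1,float('NaN'),2,3,2,float('NaN'),1]})
df_test = pd.DataFrame({'col_1': ['A','A','A','A','B','C','C','B','B','B','C'],
                        'col_2': [3,float('NaN'),1,2,2,float('NaN'),float('NaN'),3,2,float('NaN'),float('NaN')]})

# Per-group medians learned from df_train only
values_used_in_df_train = df_train.groupby('col_1')['col_2'].median()

# map() turns each row's group label into its train median;
# fillna() then uses that value only where col_2 is NaN
df_test['col_2'] = df_test['col_2'].fillna(df_test['col_1'].map(values_used_in_df_train))
```

Because map works row by row, this also behaves correctly when df_test contains a group in any order or frequency.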
Without using apply (my dataframe is too big), how can I get the previous non-NaN value of a specific column to use in a calculation?
For example, this dataframe:
df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
A B C
0 A 1 100.0
1 B 2 NaN
2 C 3 NaN
3 D 4 182.0
4 E 5 NaN
I need to calculate the difference between the value in column 'C' at row 3 and the previous non-NaN value, at row 0.
The number of NaN values between two valid values is variable, so .shift() may not be applicable here (I think).
I need something like: df['D'] = df.C - df.C[previous_not_nan] (at row 3 this would be 82).
dropna + diff
df['D'] = df['C'].dropna().diff()
A B C D
0 A 1 100.0 NaN
1 B 2 NaN NaN
2 C 3 NaN NaN
3 D 4 182.0 82.0
4 E 5 NaN NaN
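Why this works: diff runs over the non-NaN values only, and the assignment aligns back on the original index, so rows that were NaN in 'C' stay NaN in 'D'. A self-contained run with the question's frame:

```python
import pandas as pd

df = pd.DataFrame([['A', 1, 100], ['B', 2, None], ['C', 3, None],
                   ['D', 4, 182], ['E', 5, None]], columns=['A', 'B', 'C'])

# dropna keeps only rows 0 and 3; diff computes 182 - 100 at row 3;
# assigning the result back aligns on the index, leaving NaN elsewhere
df['D'] = df['C'].dropna().diff()
```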
I have the following two pandas dataframes:
df1
A B C
0 1 2 1
1 7 3 6
2 3 10 11
df2
A B C
0 2 0 2
1 8 4 7
Where A,B and C are column headings of both dataframes.
I am trying to compare columns of df1 to columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column wise) needs to be replaced with NaN.
So in this example the output should be:
A B C
0 nan 2 nan
1 7 3 6
2 3 nan nan
As a first attempt I tried df1[df1 < df2] = np.nan, but this does not work. I have also tried .where() without success.
Would appreciate some help here, thanks.
If I understand correctly:
df = df1.where(df1.ge(df2.iloc[0]) & df1.lt(df2.iloc[1]))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
A slightly different approach using between (the positional False for the inclusive argument only works in older pandas; newer versions expect inclusive='neither'):
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)
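Putting the first answer together with the sample data gives a runnable sketch; here le is used for an inclusive upper bound, which matches the expected output for this data (an assumption about which boundary behavior is wanted):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 7, 3], 'B': [2, 3, 10], 'C': [1, 6, 11]})
df2 = pd.DataFrame({'A': [2, 8], 'B': [0, 4], 'C': [2, 7]})

# Keep values within [lower, upper] per column; everything else becomes NaN
lower, upper = df2.iloc[0], df2.iloc[1]
out = df1.where(df1.ge(lower) & df1.le(upper))
```

Swap ge/le for gt/lt if one or both bounds should be exclusive.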
I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
There is no error in executing the above code, but while doing df.apply(check), there are a ton of errors.
P.S. I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the parameter thresh:
thresh : int, optional
    Require that many non-NA values.
So for a vectorized solution, subtract N from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the DataFrame, and changing df.column to df[column], because dot notation fails with a dynamic column name held in a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
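With the same sample frame as the earlier answers, the one-liner can be run end to end (a sketch; N = 2 matches the threshold in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})

N = 2
# count() gives the number of non-null values per column;
# keep only columns with more than len(df) - N of them
out = df.loc[:, df.count().gt(len(df.index) - N)]
```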
I'm working with a dataset with ~80 columns, many of which contain NaN. I definitely don't want to manually inspect dtype for each column and impute based on that.
So I wrote a function to impute a column's missing values based on its dtype:
def impute_df(df, col):
    # if col is numeric, impute the mean
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col].fillna(df[col].mean(), inplace=True)
    # otherwise impute the mode
    else:
        df[col].fillna(df[col].mode()[0], inplace=True)
But to use this, I'd have to loop over all columns in my DataFrame, something like:
for col in train_df.columns:
    impute_df(train_df, col)
And I know looping in Pandas is generally slow. Is there a better way of going about this?
Thanks!
I think you need select_dtypes to split numeric and non-numeric columns, and then apply fillna to each filtered set of columns:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, 5, 4, 5, 5, 4],
                   'C': [7, 8, np.nan, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': ['a', 'a', 'b', 'b', 'b', np.nan]})
print (df)
A B C D E F
0 a NaN 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 NaN 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 NaN
cols1 = df.select_dtypes([np.number]).columns
cols2 = df.select_dtypes(exclude = [np.number]).columns
df[cols1] = df[cols1].fillna(df[cols1].mean())
df[cols2] = df[cols2].fillna(df[cols2].mode().iloc[0])
print (df)
A B C D E F
0 a 4.6 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 4.8 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 b
I think you do not need a function here. For example:
df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'A_1': [1, np.nan, 3, 4], 'B': ['A', 'A', np.nan, 'B']})
v = df.select_dtypes(exclude=['object']).columns
t = ~df.columns.isin(v)
df.loc[:, v] = df.loc[:, v].fillna(df.loc[:, v].mean().to_dict())
df.loc[:, t] = df.loc[:, t].fillna(df.loc[:, t].mode().iloc[0].to_dict())
df
Out[1440]:
A A_1 B
0 1.000000 1.000000 A
1 2.666667 2.666667 A
2 3.000000 3.000000 A
3 4.000000 4.000000 B
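The two to_dict calls can also be merged into a single {column -> fill value} mapping passed to one fillna (a sketch of the same idea):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'A_1': [1, np.nan, 3, 4], 'B': ['A', 'A', np.nan, 'B']})

num = df.select_dtypes(include='number')
cat = df.drop(columns=num.columns)

# One dict of fill values: column means for numeric, first mode for the rest
fill = {**num.mean().to_dict(), **cat.mode().iloc[0].to_dict()}
df = df.fillna(fill)
```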