I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
There is no error in executing the above code, but when I do df.apply(check), I get a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
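The per-column NaN counts that produce this mask (again taken before the mask is applied) make the cutoff explicit:
print(df.isnull().sum().tolist())
[0, 4, 3, 0, 1, 0]
Only B (4 missing) and C (3 missing) exceed two NaNs, so gt(2) marks them True and the ~ turns them into the False entries seen in the mask above.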
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract N from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
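To spell out the relationship: dropna(thresh=len(df)-N, axis=1) keeps a column only if it has at least len(df)-N non-NA values, which is the same as having at most N NaNs. A quick sanity check against the counting approach (a sketch, assuming the example df used in the other answers):
N = 2
kept_by_thresh = df.dropna(thresh=len(df)-N, axis=1).columns
kept_by_count = df.columns[df.isnull().sum() <= N]
print(kept_by_thresh.equals(kept_by_count))
True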
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation with a dynamic column name from a variable fails (it tries to select a column literally named column):
df = pd.DataFrame({'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
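For completeness, a minimal sketch of what goes wrong with the original dot notation (column 'E' below is just one of the surviving example columns):
column = 'E'
# Attribute access looks for a column literally named "column"; it does not
# substitute the loop variable, so df.column raises
# AttributeError: 'DataFrame' object has no attribute 'column'
# Bracket indexing resolves the variable, which is why the fixed function uses df[column]:
print(df[column].isnull().sum())
1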
Alternatively, you can use count, which counts non-null values; a column is kept when it has at least len(df.index) - 2 of them, i.e. at most 2 NaNs:
In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
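For reference, the non-null counts behind that expression (computed on the original example frame) are:
print(df.count().tolist())
[6, 2, 3, 6, 5, 6]
With len(df.index) - 2 == 4, only A, D, E and F reach the threshold, so B and C are dropped.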
Related
I have this dataframe and I want to create column e:
df
a    b    c    d
1    2    1    2
NaN  NaN  3    1
NaN  NaN  NaN  5
4    5    0    2
I want to create the new column based on these criteria:
The highest of column a vs. column b.
If there is no value in either column a or column b, then look at column c.
If there is no value in column c, then look at column d.
df
a    b    c    d    e
1    2    1    2    2
NaN  NaN  3    1    3
NaN  NaN  NaN  5    5
4    5    0    2    5
My idea only gets as far as step 2:
def e(x):
    if x['a'] >= x['b']:
        return x['a']
    elif x['a'] <= x['b']:
        return x['b']
    else:
        x['c']
df['e'] = df.apply(e, axis=1)
IIUC, use pandas.DataFrame.bfill:
df["e"] = df.bfill(1)[["a", "b"]].max(1)
print(df)
Output:
a b c d e
0 1 2 1 2 2.0
1 NaN NaN 3 1 3.0
2 NaN NaN NaN 5 5.0
3 4 5 0 2 5.0
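To see why this works: the back-fill along axis 1 pulls c (and then d) into a and b wherever they are missing, before the row-wise max is taken. For this data the intermediate frame looks roughly like this:
print(df.bfill(axis=1)[["a", "b"]])
     a    b
0  1.0  2.0
1  3.0  3.0
2  5.0  5.0
3  4.0  5.0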
You can always use np.where() (using >= in the a comparison so that ties between a and b are also handled):
df['e'] = df['d']
df['e'] = np.where((df['a'].isna()) & (df['b'].isna()) & (df['c'].notnull()), df['c'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['a'] >= df['b']), df['a'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['b'] > df['a']), df['b'], df['e'])
df
First get the maximum of the a and b values and assign it to column a, then back-fill the missing values along the rows and select the first column, which prioritizes c and then d:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 1 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
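To unpack the chained expression: assign temporarily overwrites a with the row-wise max of a and b, the back-fill then pulls c (and, failing that, d) into the rows where that max is still missing, and iloc[:, 0] selects the resulting first column as e. For this data the intermediate max column is:
print(df[['a','b']].max(axis=1).tolist())
[2.0, nan, nan, 5.0]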
If you want to test only the a, b, c, d columns (and the DataFrame may contain other columns as well):
df['e'] = df[['a','b']].max(axis=1).fillna(df.c).fillna(df.d)
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 5 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
If the second row is changed so that d=5, the output is:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 5 3.0 <- changed d=5
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
I recently started working with Pandas and I'm currently trying to impute some missing values in my dataset.
I want to impute the missing values based on the median (for numerical entries) and mode (for categorical entries). However, I do not want to calculate the median and mode over the whole dataset, but per-group, based on a GroupBy of my column called "make".
For numerical NA values I did the following:
data = data.fillna(data.groupby("make").transform("median"))
...which works perfectly and replaces all my numerical NA values with the median of their "make".
However, for categorical NA values, I couldn't manage to do the same thing for the mode, i.e. replace all categorical NA values with the mode of their "make".
Does anyone know how to do it?
You can use GroupBy.transform with an if-else that takes the median for numeric columns and the mode for categorical columns:
df = pd.DataFrame({
'A':list('ebcded'),
'B':[np.nan,np.nan,4,5,5,4],
'C':[7,np.nan,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'F':list('aaabbb'),
'make':list('aaabbb')
})
df.loc[[2,4], 'A'] = np.nan
df.loc[[2,5], 'F'] = np.nan
print (df)
A B C D F make
0 e NaN 7.0 1.0 a a
1 b NaN NaN 3.0 a a
2 NaN 4.0 9.0 5.0 NaN a
3 d 5.0 4.0 NaN b b
4 NaN 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 NaN b
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df = df.fillna(df.groupby('make').transform(f))
print (df)
A B C D F make
0 e 4 7 1 a a
1 b 4 7 3 a a
2 b 4 9 5 a a
3 d 5 4 0 b b
4 d 5 2 1 b b
5 d 4 3 0 b b
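A brief note on why fillna can consume the transformed frame directly: GroupBy.transform returns a result aligned with the original index, one value per row for every non-grouping column, so fillna matches it positionally and only the NaN cells pick up values. A minimal check (sketch):
filler = df.groupby('make').transform(f)
print(filler.index.equals(df.index))
True
print(filler.columns.tolist())
['A', 'B', 'C', 'D', 'F']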
I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe:
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer after the OP's edit
You were close, but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer
Use sort_values with the na_position='last' argument:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
There is no built-in for this in pandas yet; use Series.isna with Series.argsort to get the positions, and change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
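For intuition: isna gives a boolean mask, and argsort puts the positions of the False entries (rows that have a value) before those of the True entries (NaN rows), which is exactly the row order iloc then applies. A minimal sketch, assuming df1 is still the original frame from the question:
mask = df1['a'].isna()
print(mask.tolist())
[False, False, False, True, False, False, True, False]
print(df1.iloc[mask.argsort()]['b'].tolist())
['a', 'b', 'c', 'e', 'f', 'h', 'd', 'g']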
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
.sort_values('tmp')
.drop('tmp', axis=1)
.reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
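One small caveat: sort_values defaults to an unstable quicksort, so if preserving the original order of the non-NaN rows matters it is safer to request a stable sort ('stable' is accepted by recent pandas; 'mergesort' also works):
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp', kind='stable')
          .drop('tmp', axis=1)
          .reset_index(drop=True))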
I'm working with a dataset with ~80 columns, many of which contain NaN. I definitely don't want to manually inspect dtype for each column and impute based on that.
So I wrote a function to impute a column's missing values based on its dtype:
def impute_df(df, col):
    # if col is float, impute mean
    if df[col].dtype == "int64":
        df[col].fillna(df[col].mean(), inplace=True)
    else:
        df[col].fillna(df[col].mode()[0], inplace=True)
But to use this, I'd have to loop over all columns in my DataFrame, something like:
for col in train_df.columns:
    impute_df(train_df, col)
And I know looping in Pandas is generally slow. Is there a better way of going about this?
Thanks!
I think you need select_dtypes to split the numeric and non-numeric columns, and then apply fillna to each filtered set of columns:
df = pd.DataFrame({'A':list('abcdef'),
'B':[np.nan,5,4,5,5,4],
'C':[7,8,np.nan,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':['a','a','b','b','b',np.nan]})
print (df)
A B C D E F
0 a NaN 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 NaN 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 NaN
cols1 = df.select_dtypes([np.number]).columns
cols2 = df.select_dtypes(exclude = [np.number]).columns
df[cols1] = df[cols1].fillna(df[cols1].mean())
df[cols2] = df[cols2].fillna(df[cols2].mode().iloc[0])
print (df)
A B C D E F
0 a 4.6 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 4.8 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 b
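A side note on mode().iloc[0]: mode can return several values when there are ties, so taking the first row picks one value per column. A small sketch:
s = pd.Series(['a', 'a', 'b', 'b', np.nan])
print(s.mode().tolist())
['a', 'b']
print(s.mode().iloc[0])
a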
I think you do not need a function here. For example:
df=pd.DataFrame({'A':[1,np.nan,3,4],'A_1':[1,np.nan,3,4],'B':['A','A',np.nan,'B']})
v=df.select_dtypes(exclude=['object']).columns
t=~df.columns.isin(v)
df.loc[:,v]=df.loc[:,v].fillna(df.loc[:,v].mean().to_dict())
df.loc[:,t]=df.loc[:,t].fillna(df.loc[:,t].mode().iloc[0].to_dict())
df
Out[1440]:
A A_1 B
0 1.000000 1.000000 A
1 2.666667 2.666667 A
2 3.000000 3.000000 A
3 4.000000 4.000000 B
I want to reshape a pandas DataFrame from two columns into one row:
import numpy as np
import pandas as pd
df_a = pd.DataFrame({ 'Type': ['A', 'B', 'C', 'D', 'E'], 'Values':[2,4,7,9,3]})
df_a
Type Values
0 A 2
1 B 4
2 C 7
3 D 9
4 E 3
df_b = df_a.pivot(columns='Type', values='Values')
df_b
Which gives me this:
Type A B C D E
0 2.0 NaN NaN NaN NaN
1 NaN 4.0 NaN NaN NaN
2 NaN NaN 7.0 NaN NaN
3 NaN NaN NaN 9.0 NaN
4 NaN NaN NaN NaN 3.0
Whereas I want it condensed into a single row, like this:
Type A B C D E
0 2.0 4.0 7.0 9.0 3.0
I believe you don't need pivot; it is better to use just the DataFrame constructor:
df_b = pd.DataFrame([df_a['Values'].values], columns=df_a['Type'].values)
print (df_b)
A B C D E
0 2 4 7 9 3
Or set_index with a transpose via T (the rename changes the single row label from 'Values' to 0):
df_b = df_a.set_index('Type').T.rename({'Values':0})
print (df_b)
Type A B C D E
0 2 4 7 9 3
Another way:
df_a['col'] = 0
df_a.set_index(['col','Type'])['Values'].unstack().reset_index().drop('col', axis=1)
Type A B C D E
0 2 4 7 9 3
We can fix your df_b:
df_b.ffill().iloc[[-1],:]
Out[360]:
Type A B C D E
4 2.0 4.0 7.0 9.0 3.0
Or we can do:
df_a.assign(key=[0]*len(df_a)).pivot(columns='Type', values='Values',index='key')
Out[366]:
Type A B C D E
key
0 2 4 7 9 3