Replacing NaN values in a dataframe with pandas - python

I want to create a function that takes a dataframe and replaces NaN with the mode in categorical columns, and replaces NaN in numerical columns with the mean of that column. If there is more than one mode in a categorical column, it should use the first mode.
I have managed to do it with following code:
def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude='float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0,:])
    new_df = [df3, df4]
    df5 = pd.concat(new_df, axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6
But I am sure there is a far easier way to do this?

You can use:
df = pd.DataFrame({
    'A': list('abcdec'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': list('bbcdeb'),
})
df.iloc[[1,3], [1,2,0,4]] = np.nan
print (df)
A B C D E
0 a 4.0 7.0 1 b
1 NaN NaN NaN 3 NaN
2 c 4.0 9.0 5 c
3 NaN NaN NaN 7 NaN
4 e 5.0 2.0 1 e
5 c 4.0 3.0 0 b
The idea is to use DataFrame.select_dtypes for the non-numeric columns with DataFrame.mode, select the first row by position with DataFrame.iloc, then compute the means - non-numeric columns are excluded by default - so it is possible to use Series.append to build one Series with all the replacement values, which is passed to DataFrame.fillna:
modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean()
both = modes.append(means)
print (both)
A c
E b
B 4.25
C 5.25
D 2.83333
dtype: object
df.fillna(both, inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
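Note: Series.append was deprecated in pandas 1.4 and removed in 2.0, and DataFrame.mean on a mixed-dtype frame now requires numeric_only=True instead of skipping non-numeric columns silently. A minimal sketch of the same idea for newer pandas versions, starting from the same df with NaNs as above:
import numpy as np
import pandas as pd

modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean(numeric_only=True)   # numeric_only=True: exclude non-numeric columns explicitly
both = pd.concat([modes, means])     # pd.concat replaces the removed Series.append
df = df.fillna(both)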
Wrapped in a function and passed with DataFrame.pipe:
def exercise4(df):
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean()
    both = modes.append(means)
    df.fillna(both, inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Another idea is to use DataFrame.apply, but the result_type='expand' parameter is necessary, with the dtypes tested by pandas.api.types.is_numeric_dtype:
from pandas.api.types import is_numeric_dtype
f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to function:
from pandas.api.types import is_numeric_dtype
def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)

You can use the _get_numeric_data() method to get the numeric columns (and, by exclusion, the categorical ones):
numerical_col = df._get_numeric_data().columns
At this point you only need one line of code using an apply function that runs through the columns:
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
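Note that _get_numeric_data() is a private method (the leading underscore), so it may change or disappear between pandas versions; a sketch of the same approach using the public select_dtypes instead:
import numpy as np

numerical_col = df.select_dtypes(include=np.number).columns
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)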

Actually you have all the ingredients there already! Some of your steps can be chained, though, making others obsolete.
Looking at these two lines for example:
mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])
You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]). Then, instead of constantly reassigning new (sub)dataframes to variables, altering them and concatenating them, you can make these alterations directly on the dataframe in question. Lastly, exclude='float' might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column, maybe?
My suggestion:
def mean_mode(df):
    # select_dtypes returns a copy, so an inplace fillna on it would be lost;
    # assign the filled sub-frames back instead
    num = df.select_dtypes(np.number)
    cat = df.select_dtypes('category')
    df[num.columns] = num.fillna(num.mean())
    df[cat.columns] = cat.fillna(cat.mode().iloc[0])
    return df
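For example, a small demo (assuming the non-numeric column really carries the category dtype, since select_dtypes('category') only matches that dtype):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': pd.Series(['a', 'b', None, 'b'], dtype='category'),
    'B': [4.0, np.nan, 4.0, 5.0],
})
print(mean_mode(df))
# expected: A filled with its mode 'b', B filled with its mean 4.333333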

You can work as follows:
df = df.apply(lambda x: x.fillna(x.mode()[0]) if x.dtype == 'category' else x.fillna(x.mean()))
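Note this assumes the non-numeric columns really use the category dtype; for plain object columns, a dtype test like the one used earlier may be a safer sketch:
from pandas.api.types import is_numeric_dtype

df = df.apply(lambda x: x.fillna(x.mean()) if is_numeric_dtype(x) else x.fillna(x.mode()[0]))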

Related

Finding the difference between two different data frames by using pd [duplicate]

How can I pick out the difference between two columns of the same name in two dataframes?
I mean, I have dataframe A with a column named X and dataframe B with a column named X. If I do pd.merge(A, B, on=['X']), I'll get the common X values of A and B, but how can I get the "non-common" ones?
If you change the merge type to how='outer' and pass indicator=True, this will add a column that tells you whether the values are left/both/right only:
In [2]:
A = pd.DataFrame({'x':np.arange(5)})
B = pd.DataFrame({'x':np.arange(3,8)})
print(A)
print(B)
x
0 0
1 1
2 2
3 3
4 4
x
0 3
1 4
2 5
3 6
4 7
In [3]:
pd.merge(A,B, how='outer', indicator=True)
Out[3]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
3 3.0 both
4 4.0 both
5 5.0 right_only
6 6.0 right_only
7 7.0 right_only
You can then filter the resultant merged df on the _merge col:
In [4]:
merged = pd.merge(A,B, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
Out[4]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
You can also use isin and negate the mask to find values not in B:
In [5]:
A[~A['x'].isin(B['x'])]
Out[5]:
x
0 0
1 1
2 2
The accepted answer gives a so-called LEFT JOIN IF NULL in SQL terms. If you want all the rows except the matching ones from both DataFrames, not only the left one, you have to add another condition to the filter, since you want to exclude all rows which are in both.
In this case we use DataFrame.merge & DataFrame.query:
df1 = pd.DataFrame({'A':list('abcde')})
df2 = pd.DataFrame({'A':list('cdefgh')})
print(df1, '\n')
print(df2)
A
0 a # <- only df1
1 b # <- only df1
2 c # <- both
3 d # <- both
4 e # <- both
A
0 c # both
1 d # both
2 e # both
3 f # <- only df2
4 g # <- only df2
5 h # <- only df2
df = (
    df1.merge(df2,
              on='A',
              how='outer',
              indicator=True)
       .query('_merge != "both"')
       .drop(columns='_merge')
)
print(df)
A
0 a
1 b
5 f
6 g
7 h
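An equivalent sketch without merge, assuming neither frame contains internal duplicates in A: concatenate both frames and drop every value that occurs more than once, which keeps only the rows unique to one frame:
df = pd.concat([df1, df2]).drop_duplicates(subset='A', keep=False)
print(df)
   A
0  a
1  b
3  f
4  g
5  h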

Computing correlation of a matrix with its transpose

I am trying to compute the correlation of a matrix (here, the rows of a dataframe) with its transpose using apply.
The following is the code:
import pandas as pd
from pprint import pprint
d = {'A': [1,0,3,0], 'B':[2,0,1,0], 'C':[0,0,8,0], 'D':[1,0,0,1]}
df = pd.DataFrame(data=d)
df_T = df.T
corr = df.apply(lambda s: df_T.corrwith(s))
All the columns of the correlation variable contain NaN entries. I'd like to understand why the NaNs occur.
Could someone explain?
I think you need DataFrame.corr:
print (df.corr())
A B C D
A 1.000000 0.492366 0.942809 -0.408248
B 0.492366 1.000000 0.174078 0.301511
C 0.942809 0.174078 1.000000 -0.577350
D -0.408248 0.301511 -0.577350 1.000000
If you need your own solution, the same index and column values are necessary, because corrwith aligns on the index labels:
df = pd.DataFrame(data=d).set_index(df.columns)
print (df)
A B C D
A 1 2 0 1
B 0 0 0 0
C 3 1 8 0
D 0 0 0 1
df_T = df.T
corr = df.apply(lambda s: df_T.corrwith(s))
print (corr)
A B C D
A -0.866025 -0.426401 -0.816497 0.000000
B NaN NaN NaN NaN
C 0.993399 0.489116 0.936586 -0.486664
D -0.471405 -0.522233 -0.333333 0.577350
Row B is still all NaN because that row of df is constant (all zeros): its standard deviation is zero, so any correlation with it is undefined.
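The all-NaN result in the original question also comes from alignment: corrwith matches on index labels, and df has the default RangeIndex 0-3 while df_T is indexed by the column names A-D, so no pair of labels overlaps. A quick sketch (rebuilding the frame) to see the mismatch:
import pandas as pd

d = {'A': [1,0,3,0], 'B': [2,0,1,0], 'C': [0,0,8,0], 'D': [1,0,0,1]}
df = pd.DataFrame(data=d)
print(df.index)    # RangeIndex(start=0, stop=4, step=1)
print(df.T.index)  # Index(['A', 'B', 'C', 'D'], dtype='object')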

Combining columns in a pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan],
    'user_b': ['A', 'B', np.nan, 'D']
})
I would like to create a new column called user that holds, for each row, the single non-missing user value.
What's the best way to do this for many users?
Forward fill the missing values along the rows and then select the last column with iloc:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
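Note that ffill(axis=1) keeps the right-most non-missing value when a row has several; to prefer the left-most one instead, a sketch using back-fill on the same frame:
df['user'] = df[['user_a', 'user_b']].bfill(axis=1).iloc[:, 0]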
Use the .apply method:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
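The list comprehension raises IndexError if a row is entirely NaN (like row 4 in the previous answer's frame); a sketch of a safer variant that falls back to NaN (pd and np imported as above):
df['user'] = df.apply(lambda x: next((i for i in x if pd.notna(i)), np.nan), axis=1)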

python isnull().sum() handle headers

I have a dataset in which I want to count the missing values for each column. If there are missing values, I want to print the header name. I use the following code in order to find the missing values per column
isnull().sum()
If I print the result, everything is OK; if I try to put the result in a list and then handle the headers, I can't!
newList = pd.isnull(myData).sum()
print(newList)
In this case the output is:
Name 5
Surname 0
Age 3
and I want to print only Surname, but I can't find how to return it to a variable.
newList = pd.isnull(myData).sum()
print(newList[0])
This prints 5 (the number of missing values for column 'Name').
Use boolean indexing with Series:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[np.nan,8,9,4,2,3],
                   'D':[1,3,5,np.nan,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 NaN 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 NaN 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
newList = df.isnull().sum()
print (newList)
A 0
B 0
C 1
D 1
E 0
F 0
dtype: int64
#for return NaNs columns
print(newList.index[newList != 0].tolist())
['C', 'D']
#for return non NaNs columns
print(newList.index[newList == 0].tolist())
['A', 'B', 'E', 'F']
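To return a single header to a variable, as asked, index into the filtered Index (in the question's Name/Surname/Age data this would give 'Surname'; in the example frame above the first complete column is 'A'):
complete_cols = newList.index[newList == 0]
name = complete_cols[0]
print(name)
'A'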

pandas: Merge two columns with different names?

I am trying to concatenate two dataframes, one above the other, not side-by-side.
The dataframes contain the same data; however, in the first dataframe one column might have the name "ObjectType" while in the second dataframe the same column might be named "ObjectClass". When I do
df_total = pandas.concat ([df0, df1])
the df_total will have two column names, one "ObjectType" and another "ObjectClass". In each of these two columns half of the values will be NaN, so I have to manually merge the two columns into one, which is a pain.
Can I somehow merge the two columns into one? I would like to have a function that does something like:
df_total = pandas.merge_many_columns(input=["ObjectType", "ObjectClass"], output=["MyObjectClasses"])
which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?
(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)
I think you can rename the columns first to align the data in both DataFrames:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})
#print (df1)
#print (df1)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}
df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
B C MyObjectClasses
0 4 7 1
1 5 8 2
2 6 9 3
3 4 7 1
4 5 8 2
5 6 9 3
EDIT:
Even simpler is update (which works in place):
df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
B C ObjectClass ObjectType
0 4 7 NaN 1.0
1 5 8 NaN 2.0
2 6 9 NaN 3.0
0 4 7 1.0 1.0
1 5 8 2.0 2.0
2 6 9 3.0 3.0
Or fillna, but then you need to drop the original columns:
df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
B C ObjectType
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
B C MyObjectClasses
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
EDIT1:
Timings:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})
#print (df1)
#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop
In [240]: %%timeit
...: df = pd.concat([df0, df1])
...: df['ObjectType'].update(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.18 ms per loop
In [242]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.21 ms per loop
In [243]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.28 ms per loop
You can merge two columns separated by NaNs into one using combine_first:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType':[1,2,3],
...                     'B':[4,5,6],
...                     'C':[7,8,9]})
>>> df1 = pd.DataFrame({'ObjectClass':[1,2,3],
...                     'B':[4,5,6],
...                     'C':[7,8,9]})
>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']
0 1
1 2
2 3
0 1
1 2
2 3
Name: ObjectType, dtype: float64
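To end up with a single renamed column as asked, the same drop step as in the fillna variant above applies (a sketch continuing from the frame above):
>>> df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df = df.drop(['ObjectType', 'ObjectClass'], axis=1)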
