I know how to select all the numeric columns and fillna with the mean, but how can I fillna the numeric columns with the mean and the character columns with the mode?
Use select_dtypes to take the mean of the numeric columns, get the non-numeric columns with difference and take their mode, join the two Series together with append, and finally call fillna.
Notice (thanks @jpp): mode can return multiple values, so to select the first one add iloc[0].
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb')
})
df.loc[[0, 1], 'F'] = np.nan
df.loc[[2, 1], 'A'] = np.nan
print(df)
A B C D F
0 e NaN 7.0 1.0 NaN
1 NaN NaN NaN 3.0 NaN
2 NaN 4.0 9.0 5.0 a
3 d 5.0 4.0 NaN b
4 e 5.0 2.0 1.0 b
5 d 4.0 3.0 0.0 b
a = df.select_dtypes(np.number).mean()
b = df[df.columns.difference(a.index)].mode().iloc[0]
# alternative:
# b = df.select_dtypes(object).mode().iloc[0]
print(df[df.columns.difference(a.index)].mode())
A F
0 d b
1 e NaN
df = df.fillna(a.append(b))
print(df)
A B C D F
0 e 4.5 7.0 1.0 b
1 d 4.5 5.0 3.0 b
2 d 4.0 9.0 5.0 a
3 d 5.0 4.0 2.0 b
4 e 5.0 2.0 1.0 b
5 d 4.0 3.0 0.0 b
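Note: Series.append was removed in pandas 2.0, so on newer versions build the fill values with pd.concat instead. A minimal sketch of the modern equivalent (starting again from the unfilled df above):
import numpy as np
import pandas as pd

# numeric columns -> their mean, everything else -> its first mode
a = df.select_dtypes(np.number).mean()
b = df.select_dtypes(exclude=np.number).mode().iloc[0]

# pd.concat([a, b]) replaces the removed a.append(b)
df = df.fillna(pd.concat([a, b]))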
How could I use the pandas pad function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pad.html) to pad only selected columns in a dataframe?
For example, in the below dataframe (df):
import pandas as pd
import numpy as np
df = pd.DataFrame([[2, np.nan, 0, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 3]],
                  columns=list('ABCD'))
print(df)
A B C D
0 2.0 NaN 0.0 1.0
1 NaN NaN 5.0 NaN
2 NaN 3.0 NaN 3.0
I would like to pad only column A and column D and keep the rest of the columns as they are. The final dataframe should look like:
A B C D
0 2.0 NaN 0.0 1.0
1 2.0 NaN 5.0 1.0
2 2.0 3.0 NaN 3.0
Try column assignment and pad:
df[['A', 'D']] = df[['A', 'D']].pad()
>>> df
A B C D
0 2.0 NaN 0.0 1.0
1 2.0 NaN 5.0 1.0
2 2.0 3.0 NaN 3.0
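Note that pad is just an alias for ffill and is deprecated in recent pandas releases, so a sketch of the same fill in the non-deprecated spelling:
# forward-fill only columns A and D, leaving B and C untouched
df[['A', 'D']] = df[['A', 'D']].ffill()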
I want to mask out the values in a Pandas DataFrame where the index is the same as the column name. For example:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape((4, 3)),
                 index=["a", "b", "c", "d"],
                 columns=["a", "b", "c"])
a b c
a 0 1 2
b 3 4 5
c 6 7 8
d 9 10 11
After masking:
a b c
a NaN 1 2
b 3 NaN 5
c 6 7 NaN
d 9 10 11
Seems simple enough but I'm not sure how to do it in a Pythonic way, without iteration.
Try using pd.DataFrame.apply to apply a condition to each dataframe column that compares the index with the series/column name. The result of the apply is a boolean DataFrame, which you can pass to pd.DataFrame.mask:
a.mask(a.apply(lambda x: x.name == x.index))
Output:
a b c
a NaN 1.0 2.0
b 3.0 NaN 5.0
c 6.0 7.0 NaN
d 9.0 10.0 11.0
Also, inspired by @QuangHoang, you can use np.equal.outer:
a.mask(np.equal.outer(a.index, a.columns))
Output:
a b c
a NaN 1.0 2.0
b 3.0 NaN 5.0
c 6.0 7.0 NaN
d 9.0 10.0 11.0
You can use broadcasting as well:
a.mask(a.index[:, None] == a.columns[None, :])
Or, since multi-dimensional indexing on an Index is deprecated in newer pandas, go through the underlying numpy arrays:
a.mask(a.index.values[:, None] == a.columns.values[None, :])
Output:
a b c
a NaN 1.0 2.0
b 3.0 NaN 5.0
c 6.0 7.0 NaN
d 9.0 10.0 11.0
Using DataFrame.stack and index.get_level_values:
st = a.stack()
m = st.index.get_level_values(0) == st.index.get_level_values(1)
a = st.mask(m).unstack()
a b c
a NaN 1.0 2.0
b 3.0 NaN 5.0
c 6.0 7.0 NaN
d 9.0 10.0 11.0
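All of these approaches are equivalent (note the integer columns become float, because NaN is a float value). If you do this often, the broadcasting version is easy to wrap in a small helper; a minimal sketch, with the name mask_diagonal purely illustrative:
def mask_diagonal(df):
    # Replace cells whose row label equals their column label with NaN
    return df.mask(df.index.values[:, None] == df.columns.values[None, :])

masked = mask_diagonal(a)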
I have a number of similar dataframes where I would like to standardize the NaNs across all of them. For instance, if a NaN exists at df1.loc[0, 'a'], then ALL other dataframes should be set to NaN at the same index location.
I am aware that I could combine the dataframes into one big multi-indexed dataframe, but sometimes I find it easier to work with a group of dataframes of the same structure.
Here is an example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
print(df1)
print(' ')
print(df2)
print(' ')
print(df3)
Output:
a b c
0 0.0 1 2
1 3.0 4 5
2 6.0 7 8
3 NaN 10 11
a b c
0 0 1.0 2
1 3 NaN 5
2 6 7.0 8
3 9 10.0 11
a b c
0 0 1 NaN
1 3 4 5.0
2 6 7 8.0
3 9 10 11.0
However, I would like df1, df2 and df3 to have NaNs in the same locations:
print(df1)
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
Using the answer provided by piRSquared, I was able to extend it for dataframes of different sizes. Here is the function:
def set_nans_over_every_df(df_list):
    # Find the union of index and column values across all dfs
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([col for df in df_list for col in df.columns]))
    # Ensure that every df has the same index and columns
    df_list = [df.reindex(index=complete_index, columns=complete_columns)
               for df in df_list]
    # Find the NaNs in each df and set NaNs in every other df at the same location
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    df_list = [df.mask(mask) for df in df_list]
    return df_list
And an example using different sized dataframes:
df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])
print(df1)
a b c d
0 0.0 1.0 NaN NaN
1 3.0 NaN 5.0 NaN
2 6.0 7.0 8.0 NaN
3 NaN 10.0 11.0 NaN
4 NaN NaN NaN NaN
I'd set up a mask in numpy, then use this mask in the pd.DataFrame.mask method:
# stack the three value arrays into shape (3, 4, 3); a cell is masked
# if it is NaN in any of the frames
mask = np.isnan(np.stack([d.values for d in [df1, df2, df3]])).any(0)
print(df1.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df2.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df3.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
You can create a mask and then apply it to all dataframes:
mask = df1.notnull() & df2.notnull() & df3.notnull()
print(mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
You can also build the mask dynamically with reduce:
import functools
masks = [df1.notnull(),df2.notnull(),df3.notnull()]
mask = functools.reduce(lambda x,y: x & y, masks)
print(mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
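An equivalent sketch that skips the lambda by reducing over the underlying numpy arrays (assuming df1, df2 and df3 share the same shape and labels, as above):
import numpy as np
import pandas as pd

# a cell stays True only if it is non-null in every frame
mask = pd.DataFrame(
    np.logical_and.reduce([df.notnull().values for df in (df1, df2, df3)]),
    index=df1.index, columns=df1.columns)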
print(df1[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df2[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df3[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
Assuming that all your DFs are of the same shape and have the same indexes:
In [196]: df2[df1.isnull()] = df3[df1.isnull()] = np.nan
In [197]: df1[df3.isnull()] = df2[df3.isnull()] = np.nan
In [198]: df1[df2.isnull()] = df3[df2.isnull()] = np.nan
In [199]: df1
Out[199]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [200]: df2
Out[200]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [201]: df3
Out[201]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
One simple method is to add the DataFrames together and multiply the result by 0: because NaN propagates through addition, the resulting zero frame carries a NaN wherever any input had one. Adding this zero frame to each DataFrame individually then spreads the NaNs to all of them.
df_zero = (df1 + df2 + df3) * 0
df1 + df_zero
df2 + df_zero
df3 + df_zero
I am trying to fill None values in a Pandas dataframe with 0s, but only for some subset of the columns.
When I do:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 NaN 7.0
3 NaN 6.0 8.0
a b c
0 1.0 4.0 0.0
1 2.0 5.0 0.0
2 3.0 0.0 7.0
3 0.0 6.0 8.0
It replaces every None with 0. What I want to do is replace the Nones only in columns a and b, but not in c.
What is the best way of doing this?
You can select your desired columns and do it by assignment:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
The resulting output is as expected:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can use a dict to fillna with a different value for each column:
df.fillna({'a':0,'b':0})
Out[829]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Then assign it back:
df=df.fillna({'a':0,'b':0})
df
Out[831]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can avoid making a copy of the object using Wen's solution and inplace=True:
df.fillna({'a':0, 'b':0}, inplace=True)
print(df)
Which yields:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)
This should work without a copy warning:
df[['a', 'b']] = df.loc[:,['a', 'b']].fillna(value=0)
A tempting one-liner is:
df[['a', 'b']].fillna(value=0, inplace=True)
Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, and value=0 tells it to fill NaNs with zero. Be aware, though, that the double-bracket selection returns a copy, so inplace=True modifies that copy and the original df is not reliably updated (recent pandas versions warn about exactly this); the assignment form shown above is safer.
Or something like:
df.loc[df['a'].isnull(), 'a'] = 0
df.loc[df['b'].isnull(), 'b'] = 0
and if there are more columns:
for col in your_list:
    df.loc[df[col].isnull(), col] = 0
For some odd reason this DID NOT work (using pandas 0.25.1):
df[['col1', 'col2']].fillna(value=0, inplace=True)
(It fails because the double-bracket selection returns a copy, so the in-place fill never reaches the original frame.)
Another solution:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, np.nan], 'col2': [1, np.nan, 3], 'col3': [np.nan, 2, 3]})
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 nan 2.00
2 nan 3.00 3.00
Apply the list comprehension to fill the NaN values:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 0.00 2.00
2 0.00 3.00 3.00
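Note that this in-place pattern relies on df[col] returning a view; under pandas' copy-on-write behaviour (slated to be the default in pandas 3.0) it no longer updates df. A sketch of the copy-on-write-safe equivalent:
# assign each filled column back instead of filling in place
for col in subset_cols:
    df[col] = df[col].fillna(0)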
Sometimes this syntax won't work:
df[['col1', 'col2']] = df[['col1', 'col2']].fillna(0)
Use the following instead:
df.loc[:, ['col1', 'col2']] = df.loc[:, ['col1', 'col2']].fillna(0)
I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
This just proves that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
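For completeness, a sketch of the two working spellings, since dropna returns a new frame unless told otherwise, and subset expects column labels (so with string column names, subset=[3] raises a KeyError in recent pandas versions rather than selecting the fourth column):
# keep only rows where C is non-null; assign the result back...
df = df.dropna(subset=['C'])
# ...or modify the frame in place
df.dropna(subset=['C'], inplace=True)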