Pandas .fillna() not filling in values even with inplace argument - python

I have a large dataframe (df) where I want to fill in missing values in certain columns. I tried the following, but the values still come back unfilled:
df[
['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt', 'CMincurred'
, 'CMcvrcnt', 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt', 'OLincurred',
'OLcvrcnt']
].fillna(0, inplace = True)
I'm not sure what I'm missing. I'm on pandas 0.24.2

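Just saw this on GitHub: https://github.com/pandas-dev/pandas/issues/14858
Selecting columns with a list returns a copy, so inplace=True fills that copy and leaves df unchanged. The way to do this is to assign the result back: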
df.loc[:,
['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt', 'CMincurred', 'CMcvrcnt'
, 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt', 'OLincurred', 'OLcvrcnt']
] = df.loc[:,
['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt', 'CMincurred', 'CMcvrcnt'
, 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt', 'OLincurred', 'OLcvrcnt']
].fillna(0)
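
The behaviour behind that issue can be reproduced on a small frame. The sketch below uses made-up data (the column names a, b and c are hypothetical); it shows that an in-place fill on a column subset never reaches the original frame, and that passing a dict to fillna is one alternative to assigning the subset back.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data: fill only 'a' and 'b', leave 'c' alone.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0], "c": [np.nan, 3.0]})
cols = ["a", "b"]

# df[cols] is a new object, so the in-place fill never reaches df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # some pandas versions warn here
    subset = df[cols]
    subset.fillna(0, inplace=True)
still_missing = int(df[cols].isna().sum().sum())

# A dict maps column name -> fill value; unlisted columns are untouched.
filled = df.fillna({col: 0 for col in cols})
print(still_missing)
print(filled)
```

The dict form avoids repeating the long column list on both sides of an assignment, at the cost of building a new frame.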

You can use:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [
np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]], columns=list('ABCD'))
print(df)
df[['A', 'B', 'C', 'D']] = df[['A', 'B', 'C', 'D']].fillna(0)
print(df)
df before:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
df after:
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4

Related

Getting mean of specific column (not dataframe) and using it to replace every NAN value in related column

Suppose I have this dataset and it had 2 NAN values in columns 'alcohol' and 3 NAN values in column 'magnesium'. They do not have NAN values, but suppose they did.
What lines of code might I use to get not only the mean of the appropriate column (alcohol mean for alcohol), but also fill/replace alcohol NAN values with this mean? The same for magnesium.
There are questions on stackoverflow regarding a mean that is a mean of the entire dataframe as opposed to the column in particular.
I know this may be possible with sklearn.impute and sklearn.preprocessing
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
Try this:
df.fillna(df[["alcohol", "magnesium"]].mean())
Example:
df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
    "col4": [1, 2, 3, np.nan, 5, 6]})
df.fillna(df[["alcohol", "magnesium"]].mean())
gives you:
col1 alcohol magnesium col4
0 1.0 1.0 1.0 1.0
1 2.0 2.0 4.0 2.0
2 3.0 3.0 4.0 3.0
3 NaN 3.0 4.0 NaN
4 5.0 3.0 5.0 5.0
5 6.0 6.0 6.0 6.0
df.mean() will give the mean per column, so you can use:
df.fillna(df.mean())
Note that if a column is full of null values, the mean of that column will be null as well.
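
The note about all-null columns can be checked with a short sketch. The frame and the column name "empty" are hypothetical, and the fallback value of 0 is an assumption for illustration, not something the answer recommends.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'empty' has no observed values at all.
df = pd.DataFrame({
    "alcohol": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

means = df.mean()          # the mean of 'empty' is NaN
filled = df.fillna(means)  # so 'empty' stays NaN after the fill

# One possible fallback (an assumption): replace missing means with 0.
filled_with_fallback = df.fillna(means.fillna(0))
print(filled)
print(filled_with_fallback)
```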

How to fillna (impute) by using the mean of the last 3 rows in the same column?

I was trying to impute some missing values with fillna() in pandas, but I don't know how to impute using the mean of the last 3 rows in the same column (not the mean of the entire column). Any help would be greatly appreciated, thanks.
You can fillna with rolling(3).mean(). The shift aligns each window with the row that follows it, so every NaN is filled from the three rows before it. This approach fills everything at once, so consecutive NaN values are filled independently of one another. If you need iterative filling (fill the first NaN, then use that value to compute the fill for the next consecutive NaN), it cannot be done this way.
df = pd.DataFrame({'col1': [np.nan, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
# min_periods=1: fill if the window holds at least one value
df.fillna(df.rolling(3, min_periods=1).mean().shift()) # works for many cols at once
col1
0 NaN # Unfilled because < min_periods
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.nan, 4, 5])
6 5.0 # np.nanmean([np.nan, np.nan, 5])
7 7.0
If one fill value per column is enough, taken from the last three rows of the whole dataframe, you could do:
df.fillna(df.iloc[-3:].mean())
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1':[1, 2, 3, np.nan, 5, 6, 7],
'var2':[np.nan, np.nan, np.nan, np.nan, np.nan, 1, 0]})
var1 var2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN NaN
4 5.0 NaN
5 6.0 1.0
6 7.0 0.0
print(df.fillna(df.iloc[-3:].mean()))
Output:
var1 var2
0 1.0 0.5
1 2.0 0.5
2 3.0 0.5
3 6.0 0.5
4 5.0 0.5
5 6.0 1.0
6 7.0 0.0
Dan's solution is much simpler if the kink is worked out. If not, this loop will accomplish it:
df2 = df1.copy()  # df1 is the dataframe with the missing values
dfrows = df2.shape[0]
dfcols = df2.shape[1]
for row in range(3, dfrows):  # the first three rows lack three predecessors
    for col in range(dfcols):
        if pd.isna(df2.iloc[row, col]):
            # mean of the three rows above; values filled earlier are reused
            df2.iloc[row, col] = df2.iloc[row - 3:row, col].mean()
df2

Pandas groupby to create new dataframe with values as columns

I want to reshape the data by Date in Python as dataframe.
Required:
Is there a pandas function for this?
Create an additional key with cumcount, then pivot (data from jpp's answer):
df.assign(key=df.groupby('Col1').cumcount()).pivot(index='key', columns='Col1', values='Col2')
Out[29]:
Col1 A B C
key
0 1.0 4.0 6.0
1 2.0 5.0 7.0
2 3.0 NaN 8.0
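
For reference, here is a self-contained sketch of the cumcount-then-pivot approach, split into two steps so the intermediate key is visible. It reuses the Col1/Col2 sample data from this thread; the names keyed and res are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# cumcount numbers the rows within each Col1 group: 0, 1, 2, ...
keyed = df.assign(key=df.groupby("Col1").cumcount())

# Each group becomes a column; shorter groups are padded with NaN.
res = keyed.pivot(index="key", columns="Col1", values="Col2")
print(res)
```

Passing index, columns and values by keyword keeps the call readable and works on pandas versions that no longer accept them positionally.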
One way is to use pandas.concat on series derived from unique values in your key column.
Here is a minimal example.
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})
res = pd.concat({k: df.loc[df['Col1']==k, 'Col2'].reset_index(drop=True)
for k in df['Col1'].unique()}, axis=1)
print(res)
A B C
0 1 4.0 6
1 2 5.0 7
2 3 NaN 8

Pandas missing values : fill with the closest non NaN value

Assume I have a pandas series with several consecutive NaNs. I know fillna has several methods to fill missing values (backfill and fill forward), but I want to fill them with the closest non NaN value. Here's an example of what I have:
s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
And an example of what I want:
s = pd.Series([0, 1, 1, 1, 3, 3, 3])
Does anyone know how I could do that?
Thanks!
You could use Series.interpolate with method='nearest' (this method relies on SciPy being installed):
In [11]: s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
In [12]: s.interpolate(method='nearest')
Out[12]:
0 0.0
1 1.0
2 1.0
3 1.0
4 3.0
5 3.0
6 3.0
dtype: float64
In [13]: s = pd.Series([0, 1, np.nan, np.nan, 2, np.nan, np.nan, 3])
In [14]: s.interpolate(method='nearest')
Out[14]:
0 0.0
1 1.0
2 1.0
3 2.0
4 2.0
5 2.0
6 3.0
7 3.0
dtype: float64
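
If SciPy is not available, a similar result can be sketched with ffill and bfill alone. The helper below is hypothetical (fill_nearest is not a pandas function); it measures distance by position rather than by index value, and it assumes ties should go to the earlier value.

```python
import numpy as np
import pandas as pd


def fill_nearest(s):
    """Fill NaNs with the positionally closest non-NaN value (ties: earlier)."""
    pos = np.arange(len(s), dtype=float)
    # Position of each valid entry, NaN where the value is missing.
    valid_pos = pd.Series(np.where(s.notna(), pos, np.nan))
    dist_prev = pos - valid_pos.ffill().to_numpy()  # distance to previous valid
    dist_next = valid_pos.bfill().to_numpy() - pos  # distance to next valid
    # Take the next value when there is no previous one or it is strictly closer.
    use_next = np.isnan(dist_prev) | (dist_next < dist_prev)
    return s.ffill().where(~use_next, s.bfill())


s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
print(fill_nearest(s).tolist())
```

Unlike the interpolate call, this sketch also fills leading and trailing NaNs from the single neighbour they have.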

Replace missing values in all columns except one in pandas dataframe

I have a pandas dataframe with 10 columns and I want to fill missing values for all columns except one (let's say that column is called test). Currently, if I do this:
df.fillna(df.median(), inplace=True)
it replaces NA values in all columns with the median value. How do I exclude specific column(s) without specifying ALL the other columns?
You can use pd.DataFrame.drop to help out:
df.drop(columns='unwanted_column').fillna(df.median())
Or pd.Index.difference:
df.loc[:, df.columns.difference(['unwanted_column'])].fillna(df.median())
Or just select the columns with a boolean mask:
df.loc[:, df.columns != 'unwanted_column']
The argument to difference should be passed as a list-like (edited).
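
Note that these expressions return a new object rather than modifying df. A minimal sketch of writing the result back while leaving one column alone, assuming a small made-up frame whose excluded column is called test as in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'test' is the column that must keep its NaN.
df = pd.DataFrame({
    "A": [np.nan, 5, 2, np.nan, 3],
    "B": [np.nan, 4, 3, 5, np.nan],
    "test": [np.nan, 4, 3, 2, 1],
})

# Every column except 'test', without listing the others by hand.
cols = df.columns.difference(["test"])

# Fill the selection with its own medians and assign it back to df.
df[cols] = df[cols].fillna(df[cols].median())
print(df)
```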
Just select whatever columns you want using pandas' column indexing:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [np.nan, 5, 2, np.nan, 3], 'B': [np.nan, 4, 3, 5, np.nan], 'C': [np.nan, 4, 3, 2, 1]})
>>> df
A B C
0 NaN NaN NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 NaN 5.0 2.0
4 3.0 NaN 1.0
>>> cols = ['A', 'B']
>>> df[cols] = df[cols].fillna(df[cols].median())
>>> df
A B C
0 3.0 4.0 NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 3.0 5.0 2.0
4 3.0 4.0 1.0
