Assume I have a pandas Series with several consecutive NaNs. I know fillna has several methods to fill missing values (backfill and forward fill), but I want to fill them with the closest non-NaN value. Here's an example of what I have:
s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
And an example of what I want:
s = pd.Series([0, 1, 1, 1, 3, 3, 3])
Does anyone know how I could do that?
Thanks!
You could use Series.interpolate with method='nearest':
In [11]: s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
In [12]: s.interpolate(method='nearest')
Out[12]:
0 0.0
1 1.0
2 1.0
3 1.0
4 3.0
5 3.0
6 3.0
dtype: float64
In [13]: s = pd.Series([0, 1, np.nan, np.nan, 2, np.nan, np.nan, 3])
In [14]: s.interpolate(method='nearest')
Out[14]:
0 0.0
1 1.0
2 1.0
3 2.0
4 2.0
5 2.0
6 3.0
7 3.0
dtype: float64
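Note that method='nearest' is delegated to scipy.interpolate, so scipy has to be installed. If it isn't, a rough positional alternative is to choose between ffill and bfill depending on which valid value is closer. This is only a sketch: it measures distance by position rather than by index value, and behaviour at leading/trailing NaNs may differ slightly from interpolate.
import numpy as np
import pandas as pd

def fill_nearest(s):
    # positions of the previous and next non-NaN value for every element
    pos = np.arange(len(s))
    valid = s.notna().to_numpy()
    prev_pos = pd.Series(np.where(valid, pos, np.nan)).ffill()
    next_pos = pd.Series(np.where(valid, pos, np.nan)).bfill()
    # distance to each side; a missing side counts as infinitely far away
    dist_prev = (pd.Series(pos) - prev_pos).fillna(np.inf)
    dist_next = (next_pos - pd.Series(pos)).fillna(np.inf)
    use_prev = (dist_prev <= dist_next).to_numpy()  # ties go to the earlier value
    return s.ffill().where(use_prev, s.bfill())

s = pd.Series([0, 1, np.nan, np.nan, np.nan, np.nan, 3])
print(fill_nearest(s).tolist())  # [0.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0]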
Related
Suppose I have this dataset and it had 2 NaN values in the 'alcohol' column and 3 NaN values in the 'magnesium' column. (They don't actually contain NaN values, but suppose they did.)
What lines of code might I use to get not only the mean of the appropriate column (the alcohol mean for alcohol), but also fill/replace the alcohol NaN values with that mean? The same for magnesium.
There are questions on Stack Overflow about filling with the mean of the entire dataframe, as opposed to the mean of the particular column.
I know this may be possible with sklearn.impute and sklearn.preprocessing.
from sklearn.datasets import load_wine
import pandas as pd

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
Try this:
df.fillna(df[["alcohol", "magnesium"]].mean())
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
    "col4": [1, 2, 3, np.nan, 5, 6]})
df.fillna(df[["alcohol", "magnesium"]].mean())
gives you:
col1 alcohol magnesium col4
0 1.0 1.0 1.0 1.0
1 2.0 2.0 4.0 2.0
2 3.0 3.0 4.0 3.0
3 NaN 3.0 4.0 NaN
4 5.0 3.0 5.0 5.0
5 6.0 6.0 6.0 6.0
df.mean() will give the mean per column, so you can use:
df.fillna(df.mean())
Note that if a column is full of null values, the mean of that column will be null as well.
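Since the question mentions sklearn.impute, here is a minimal sketch of the same per-column mean fill using SimpleImputer (the column names are the ones from the wine dataset loaded above):
from sklearn.impute import SimpleImputer

cols = ["alcohol", "magnesium"]
imputer = SimpleImputer(strategy="mean")    # the mean is computed per column
df[cols] = imputer.fit_transform(df[cols])  # only these columns are filled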
So I have two columns for example A & B and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns of a DataFrame. You can append the B values to the end of the A values, then drop the NaN values with the np.nan != np.nan trick. Here's an example:
import pandas as pd
import numpy as np
d = {
'A': [1,2,3, np.nan, np.nan, np.nan],
'B': [4,5,6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
# NaN != NaN, so x == x is False exactly where x is NaN
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
0    1.0
1    2.0
2    3.0
0    4.0
1    5.0
2    6.0
dtype: float64
Using numpy, it could be something like:
import numpy as np
A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
C = np.hstack([A[A < np.inf], B[B < np.inf]])  # NaN compares False against everything, so the NaNs are dropped
print(C) # [1. 2. 3. 4. 5. 6.]
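Equivalently, np.isnan makes the intent a bit more explicit (a sketch of the same idea):
C = np.concatenate([A[~np.isnan(A)], B[~np.isnan(B)]])
print(C)  # [1. 2. 3. 4. 5. 6.]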
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for the non-NaN values, feel free to use .dropna() on the Series.
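For example, a minimal sketch combining both ideas with the a and b Series defined above:
x = pd.concat([a, b]).dropna().reset_index(drop=True)
print(x.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]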
I have a large dataframe (df) where I want to fill in missing values in certain columns. I tried to do this like the below but they are still coming back not filled in.
df[['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt',
    'CMincurred', 'CMcvrcnt', 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt',
    'OLincurred', 'OLcvrcnt']].fillna(0, inplace=True)
I'm not sure what I'm missing. I'm on pandas 0.24.2.
Just saw this on GitHub: https://github.com/pandas-dev/pandas/issues/14858
The problem is that df[[...]] returns a copy of the data, so fillna(0, inplace=True) fills that copy rather than the original frame. The way to do this is:
df.loc[:, ['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt',
           'CMincurred', 'CMcvrcnt', 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt',
           'OLincurred', 'OLcvrcnt']] = \
    df.loc[:, ['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt',
               'CMincurred', 'CMcvrcnt', 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt',
               'OLincurred', 'OLcvrcnt']].fillna(0)
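If you'd rather not repeat the column list on both sides of the assignment, passing fillna a dict of per-column fill values also works in place (a sketch with the same column names):
cols = ['incurred', 'noncatincrd', 'catincrd', 'clmcnt', 'noncatcnt', 'catcnt', 'cvrcnt',
        'CMincurred', 'CMcvrcnt', 'CLincurred', 'CLcvrcnt', 'PDincurred', 'PDcvrcnt',
        'OLincurred', 'OLcvrcnt']
df.fillna({col: 0 for col in cols}, inplace=True)  # only the listed columns are filled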
You can use:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
df[['A', 'B', 'C', 'D']] = df[['A', 'B', 'C', 'D']].fillna(0)
print(df)
df before:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
df after:
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
So I was trying to impute some missing values with fillna() in pandas, but I don't really know how to impute using the mean of the last 3 rows in the same column (not the mean of the entire column). If anyone can help, it would be greatly appreciated. Thanks!
You can fillna with rolling(3).mean(); shift gets the alignment correct. This approach fills everything at once, so for consecutive NaN values the fill values are computed independently. If you need iterative filling (fill the first NaN, then use that value to compute the fill value for the next consecutive NaN), it cannot be done this way.
df = pd.DataFrame({'col1': [np.NaN, 3, 4, 5, np.NaN, np.NaN, np.NaN, 7]})
df.fillna(df.rolling(3, min_periods=1).mean().shift())  # fill if at least one value; works for many columns at once
col1
0 NaN # Unfilled because < min_periods
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.NaN, 4, 5])
6 5.0 # np.nanmean([np.NaN, np.NaN, 5])
7 7.0
You could do:
df.fillna(df.iloc[-3:].mean())
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1': [1, 2, 3, np.nan, 5, 6, 7],
                   'var2': [np.nan, np.nan, np.nan, np.nan, np.nan, 1, 0]})
var1 var2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN NaN
4 5.0 NaN
5 6.0 1.0
6 7.0 0.0
print(df.fillna(df.iloc[-3:].mean()))
Output:
var1 var2
0 1.0 0.5
1 2.0 0.5
2 3.0 0.5
3 6.0 0.5
4 5.0 0.5
5 6.0 1.0
6 7.0 0.0
The rolling-mean solution above is much simpler if its kink is worked out. If not, this loop will accomplish it:
df2 = df1.fillna('nan')  # temporarily mark the NaNs with a placeholder string for the loop
dfrows = df2.shape[0]
dfcols = df2.shape[1]
for row in range(dfrows):
    for col in range(dfcols):
        if df2.iloc[row, col] == 'nan':
            # mean of the three rows directly above (values filled in earlier iterations are reused)
            df2.iloc[row, col] = (df2.iloc[row-1, col] + df2.iloc[row-2, col] + df2.iloc[row-3, col]) / 3
df2
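A sketch of the same iterative loop without the string placeholder, using pd.isna directly (it assumes df1 is numeric and that the first three rows of each column are not NaN):
import pandas as pd

df2 = df1.copy()
for j in range(df2.shape[1]):
    for i in range(3, len(df2)):
        if pd.isna(df2.iloc[i, j]):
            # mean of the three rows directly above; values filled earlier in the
            # loop are reused, so consecutive NaNs are filled iteratively
            df2.iloc[i, j] = df2.iloc[i - 3:i, j].mean()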
I have a pandas dataframe with 10 columns and I want to fill missing values for all columns except one (let's say that column is called test). Currently, if I do this:
df.fillna(df.median(), inplace=True)
it replaces NA values in all columns with the column medians. How do I exclude specific column(s) without spelling out all the other columns?
You can use pd.DataFrame.drop to help out:
df.drop('unwanted_column', axis=1).fillna(df.median())
Or pd.Index.difference
df.loc[:, df.columns.difference(['unwanted_column'])].fillna(df.median())
Or just select with a boolean mask:
df.loc[:, df.columns != 'unwanted_column'].fillna(df.median())
Note that the argument to difference should be passed as a list or array.
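Another option that keeps every column in the result: pass fillna a dict that simply skips the excluded column (a sketch using the question's test column):
fill_values = {col: df[col].median() for col in df.columns if col != 'test'}
df = df.fillna(fill_values)  # 'test' is left untouched, everything else gets its own median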
Just select whatever columns you want using pandas' column indexing:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [np.nan, 5, 2, np.nan, 3], 'B': [np.nan, 4, 3, 5, np.nan], 'C': [np.nan, 4, 3, 2, 1]})
>>> df
A B C
0 NaN NaN NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 NaN 5.0 2.0
4 3.0 NaN 1.0
>>> cols = ['A', 'B']
>>> df[cols] = df[cols].fillna(df[cols].median())
>>> df
A B C
0 3.0 4.0 NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 3.0 5.0 2.0
4 3.0 4.0 1.0
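If you'd rather name only the column to exclude (here 'C' plays the role of the question's test column), something along these lines works as well:
cols = df.columns.drop('C')
df[cols] = df[cols].fillna(df[cols].median())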