I have a dataset with four columns: ID, Step, col1 and col2.
For each ID, the row whose Step is NaN is the one that holds the values for col1 and col2.
I want to fill the missing col1 and col2 values for each unique ID.
The original data frame looks like:
ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
6500 1 Nan Nan
I want this result:
ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 1.0 6.0
7001 1 1.0 6.0
6500 Nan 12.0 3.0
6500 0 12.0 3.0
6500 1 12.0 3.0
I can't seem to find a good way to do this that does not take too long, as I have a lot of data to process (10 GB).
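If your file is sorted, you can forward fill with fillna and method="ffill" (or with the equivalent ffill() method, since the method argument of fillna is deprecated in recent pandas versions):
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
Note that you will have to apply it per column, as otherwise Step would be (wrongly) propagated as well: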
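from io import StringIO
import pandas as pd
# x is the sample text above; 'Nan' must be declared as an NA marker
y = pd.read_csv(StringIO(x), sep=r'\s+', na_values='Nan')
y['col1'] = y['col1'].ffill()  # same result as fillna(method='ffill')
y['col2'] = y['col2'].ffill()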
gives:
ID Step col1 col2
0 7001 NaN 1.0 6.0
1 7001 0.0 1.0 6.0
2 7001 1.0 1.0 6.0
3 6500 NaN 12.0 3.0
4 6500 0.0 12.0 3.0
5 6500 1.0 12.0 3.0
If your data is not sorted by ID, you can always sort it first (a stable sort keeps the rows of each ID in their original order):
y_sorted = y.sort_values("ID", kind="stable")
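If you would rather not depend on the global row order, a grouped forward fill may be an option. The following is a minimal sketch, assuming that within each ID the row carrying the values comes before the rows to be filled; the names `raw` and `cols` are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question; "Nan" is not a default NA marker in pandas.
raw = """ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
6500 1 Nan Nan
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values="Nan")

# Forward fill inside each ID group only, so values never leak across IDs
# and the Step column is left untouched.
cols = ["col1", "col2"]
df[cols] = df.groupby("ID", sort=False)[cols].ffill()
print(df)
```

The grouped version does extra bookkeeping, so on already sorted data the plain column-wise ffill is likely to be faster.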
First, it doesn't look like those are proper NaN values, so let's convert them:
df.replace('Nan', np.nan, inplace=True)
Then we can ffill() the desired columns:
df[['col1', 'col2']] = df[['col1', 'col2']].ffill()
Output:
ID Step col1 col2
0 7001 NaN 1.0 6.0
1 7001 0 1.0 6.0
2 7001 1 1.0 6.0
3 6500 NaN 12.0 3.0
4 6500 0 12.0 3.0
5 6500 1 12.0 3.0
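The question mentions about 10 GB of data, which may not fit in memory at once. One possible approach, sketched here under the assumption that the file is a whitespace-separated text file like the sample, is to read it in chunks and carry the last filled values over to the next chunk. The function name `ffill_chunks` and the tiny `chunksize` are illustrative choices, not part of the answers above.

```python
import io
import pandas as pd

raw = """ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
6500 1 Nan Nan
"""

def ffill_chunks(source, cols, chunksize):
    """Yield chunks with `cols` forward filled across chunk boundaries."""
    carry = None  # last filled values of the previous chunk
    reader = pd.read_csv(source, sep=r"\s+", na_values="Nan", chunksize=chunksize)
    for chunk in reader:
        filled = chunk[cols].ffill()
        if carry is not None:
            # NaNs at the top of this chunk continue the previous chunk's values.
            filled = filled.fillna(carry)
        chunk[cols] = filled
        carry = filled.iloc[-1]
        yield chunk

# chunksize=2 is deliberately small so that the carry-over is exercised.
result = pd.concat(ffill_chunks(io.StringIO(raw), ["col1", "col2"], chunksize=2))
print(result)
```

For a real file you would probably write each chunk to disk as it is produced instead of concatenating everything in memory.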
Let's take this dataframe as a simple example:
df = pd.DataFrame(dict(Col1=[np.nan,1,1,2,3,8,7], Col2=[1,1,np.nan,np.nan,3,np.nan,4], Col3=[1,1,np.nan,5,1,1,np.nan]))
Col1 Col2 Col3
0 NaN 1.0 1.0
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
5 8.0 NaN 1.0
6 7.0 4.0 NaN
First, I would like to remove leading and trailing rows until neither the first nor the last row contains a NaN.
Intermediate expected output:
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
Then, I would like to replace each remaining NaN by the mean of the nearest non-NaN value above it and the nearest non-NaN value below it.
Final expected output:
Col1 Col2 Col3
0 1.0 1.0 1.0
1 1.0 2.0 3.0
2 2.0 2.0 5.0
3 3.0 3.0 1.0
I know I can get the positions of the NaN values in my dataframe with
df.isna()
but I can't solve my problem. How could I do this?
My approach:
# flag the rows that contain no NaN
s = df.notna().all(axis=1)
# keep only the rows between the first and the last complete row:
new_df = df.loc[s.idxmax():s[::-1].idxmax()]
# average of the nearest valid values above and below:
new_df = (new_df.ffill() + new_df.bfill()) / 2
Output:
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
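Another option would be to use DataFrame.interpolate with round. Linear interpolation followed by rounding gives the same numbers as the neighbour mean on this data, but the two are not equivalent in general: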
nans = df.notna().all(axis=1).cumsum().drop_duplicates()
low, high = nans.idxmin(), nans.idxmax()
df.loc[low+1: high].interpolate().round()
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
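To see how the two answers relate, the following sketch runs both on the sample frame and compares the unrounded results. It assumes at least one row has no NaN (otherwise `idxmax` would simply return the first label); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [np.nan, 1, 1, 2, 3, 8, 7],
    "Col2": [1, 1, np.nan, np.nan, 3, np.nan, 4],
    "Col3": [1, 1, np.nan, 5, 1, 1, np.nan],
})

# Rows without any NaN; keep everything between the first and last of them.
complete = df.notna().all(axis=1)
first, last = complete.idxmax(), complete[::-1].idxmax()
trimmed = df.loc[first:last]

# Mean of the nearest valid value above and below each gap.
neighbour_mean = (trimmed.ffill() + trimmed.bfill()) / 2

# Linear interpolation, before any rounding.
linear = trimmed.interpolate()

print(neighbour_mean["Col2"].tolist())
print(linear["Col2"].tolist())
```

For the two consecutive gaps in Col2, the neighbour mean gives 2.0 twice, while linear interpolation gives roughly 1.67 and 2.33; only the rounding makes the outputs coincide.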
I have a df which looks like the one below. I want to fill NaN with some value, but only where the NaN lies between two values.
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 NaN NaN 43.0 NaN NaN
3 34.0 34 NaN NaN 34.0 34.0 34.0 34.0
4 NaN 34 34.0 NaN 34.0 34.0 34.0 34.0
For example, I don't want to fill the NaN in the first and second rows, because there the NaN doesn't occur between values. But I want to fill the third row at col4 and col5, because these two columns contain NaN between two values (col3 and col6).
How can I do this?
Expected Output:
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 -100 -100 43.0 NaN NaN
3 34.0 34 -100 -100 34.0 34.0 34.0 34.0
4 NaN 34 34.0 -100 34.0 34.0 34.0 34.0
For this problem I can't simply use fillna, because it fills every NaN. Similarly, I can't use ffill or bfill, because they also fill the leading or trailing NaN values. I'm clueless at this stage; any help would be appreciated.
Note: I searched for related questions before asking and did not find a duplicate. If you find one, feel free to mark this as a duplicate.
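I think you need a boolean mask that excludes the leading and trailing missing values of each row. There are two ways to build it: forward fill and back fill the missing values along the rows and check that both results are non-missing, or take the cumulative sum of the non-missing flags from both directions and compare it with > 0: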
m = df.ffill(axis=1).notnull() & df.bfill(axis=1).notnull()
#alternative mask
a = df.notnull()
m = a.cumsum(axis=1).gt(0) & a.iloc[:, ::-1].cumsum(axis=1).gt(0)
df = df.mask(m, df.fillna(-100))
print (df)
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 -100.0 -100.0 43.0 NaN NaN
3 34.0 34 -100.0 -100.0 34.0 34.0 34.0 34.0
4 NaN 34 34.0 -100.0 34.0 34.0 34.0 34.0
Detail:
print (m)
col1 col2 col3 col4 col5 col6 col7 col8
0 False True True True False False False False
1 True True True True True False False False
2 True True True True True True False False
3 True True True True True True True True
4 False True True True True True True True
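The same idea can be wrapped in a small helper. This is only a sketch: `fill_interior` is a hypothetical name, and the frame is built with `dtype=float` so that every column has the same type.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [
        [nan, 12, 12, 4, nan, nan, nan, nan],
        [54, 54, 32, 11, 21, nan, nan, nan],
        [3, 34, 34, nan, nan, 43, nan, nan],
        [34, 34, nan, nan, 34, 34, 34, 34],
        [nan, 34, 34, nan, 34, 34, 34, 34],
    ],
    columns=[f"col{i}" for i in range(1, 9)],
    dtype=float,
)

def fill_interior(frame, value):
    """Fill NaNs that have a valid value on both sides within the same row."""
    has_left = frame.ffill(axis=1).notna()   # something valid at or before this cell
    has_right = frame.bfill(axis=1).notna()  # something valid at or after this cell
    interior_gap = frame.isna() & has_left & has_right
    return frame.mask(interior_gap, value)

out = fill_interior(df, -100)
print(out)
```

Restricting the mask to cells that are actually NaN makes the intent explicit, although masking with `df.fillna(-100)` as in the answer gives the same result.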
How can I merge the rows of dataframe1 into dataframe2?
If one of the corresponding values is NaN, then the value should be copied from the other.
If both are NaN, then NaN.
If neither is NaN, then the first one.
Dataframe1
Dataframe2
Thanks in advance
You can use combine_first:
df
Out:
col1 col2 col3 col4
0 NaN NaN 3.0 4
1 1.0 2.0 NaN 5
df.loc[0].combine_first(df.loc[1])
Out:
col1 1.0
col2 2.0
col3 3.0
col4 4.0
Name: 0, dtype: float64
In the specified format:
df.loc[0].combine_first(df.loc[1]).to_frame('Row1-2').T
Out:
col1 col2 col3 col4
Row1-2 1.0 2.0 3.0 4.0
An alternative:
df.loc[[0]].fillna(df.loc[1])
Out:
col1 col2 col3 col4
0 1.0 2.0 3.0 4
And a cleaner version of filling from @MaxU:
df.bfill().iloc[[0]]
Out:
col1 col2 col3 col4
0 1.0 2.0 3.0 4
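The question talks about two dataframes, and combine_first also works frame to frame. The sketch below uses made-up `df1` and `df2`, since the original Dataframe1 and Dataframe2 are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with matching index and columns.
df1 = pd.DataFrame({"col1": [np.nan, 1.0], "col2": [np.nan, np.nan], "col3": [3.0, 7.0]})
df2 = pd.DataFrame({"col1": [5.0, 9.0], "col2": [np.nan, 2.0], "col3": [np.nan, 8.0]})

# Values of df1 win; df2 is used only where df1 is NaN.
merged = df1.combine_first(df2)
print(merged)
```

Cells that are NaN in both frames stay NaN, which matches the three rules stated in the question.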
I am trying to fill None values in a Pandas dataframe with 0s for only some subset of columns.
When I do:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 NaN 7.0
3 NaN 6.0 8.0
a b c
0 1.0 4.0 0.0
1 2.0 5.0 0.0
2 3.0 0.0 7.0
3 0.0 6.0 8.0
It replaces every None with 0. What I want to do is replace the Nones only in columns a and b, but not in c.
What is the best way of doing this?
You can select your desired columns and do it by assignment:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
The resulting output is as expected:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can pass a dict to fillna to use a different value for each column:
df.fillna({'a':0,'b':0})
Out[829]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Then assign it back:
df=df.fillna({'a':0,'b':0})
df
Out[831]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can avoid making a copy of the object using Wen's solution and inplace=True:
df.fillna({'a':0, 'b':0}, inplace=True)
print(df)
Which yields:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)
This should work without a SettingWithCopyWarning:
df[['a', 'b']] = df.loc[:,['a', 'b']].fillna(value=0)
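It is tempting to do it all in one line:
df[['a', 'b']].fillna(value=0, inplace=True)
Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True is meant to make the change permanent without copying the object. Be aware, though, that df[['a', 'b']] itself returns a new DataFrame, so the fill lands on that temporary copy and df is left unchanged; assign the result back to df[['a', 'b']] instead.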
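A quick way to check whether the chained one-liner takes effect is to count the remaining NaNs before and after. This is a small sketch on the question's data; the warning filter is there only because some pandas versions warn about the chained call.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, np.nan],
    "b": [4, 5, np.nan, 6],
    "c": [np.nan, np.nan, 7, 8],
})

with warnings.catch_warnings():
    # Silence the copy-related warning for this demonstration only.
    warnings.simplefilter("ignore")
    # Fills a temporary copy of the two columns, not df itself.
    df[["a", "b"]].fillna(value=0, inplace=True)

nan_after_chained = int(df[["a", "b"]].isna().sum().sum())

# Assigning the filled selection back does modify df.
df[["a", "b"]] = df[["a", "b"]].fillna(value=0)
nan_after_assign = int(df[["a", "b"]].isna().sum().sum())

print(nan_after_chained, nan_after_assign)
```

The first count stays at 2 because the selection with a list of columns is a copy; only the assignment form brings it down to 0.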
Or something like:
df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0
and if there are more:
for i in your_list:
    df.loc[df[i].isnull(), i] = 0
This did NOT work for me (using pandas 0.25.1), most likely because df[['col1', 'col2']] returns a copy and the fill is applied to that copy:
df[['col1', 'col2']].fillna(value=0, inplace=True)
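Another solution (it relies on df[col] sharing data with df, so newer pandas versions may warn about it or leave df unchanged):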
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Example:
df = pd.DataFrame(data={'col1':[1,2,np.nan,], 'col2':[1,np.nan,3], 'col3':[np.nan,2,3]})
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 nan 2.00
2 nan 3.00 3.00
Apply the list comprehension to fill the values:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 0.00 2.00
2 0.00 3.00 3.00
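A loop-free variant that does not depend on inplace behaviour is to build the fill dictionary from the column list. This is a sketch that combines the dict form of fillna shown in the earlier answers with `dict.fromkeys`; it is not taken from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, np.nan],
    "col2": [1, np.nan, 3],
    "col3": [np.nan, 2, 3],
})

subset_cols = ["col1", "col2"]
# Builds {"col1": 0, "col2": 0}; fillna then applies it column by column.
df = df.fillna(dict.fromkeys(subset_cols, 0))
print(df)
```

Columns that are not listed in `subset_cols`, such as col3 here, keep their NaN values.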
Sometimes this syntax won't work:
df[['col1','col2']] = df[['col1','col2']].fillna(0)
Use the following instead:
df['col1','col2']