fill NaN values with mean based on another column specific value - python

I want to fill the NaN values in column c of my dataframe with the mean, but only for rows whose Category is B, and leave the others untouched.
print (df)
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 NaN
4 A 2 1.0
5 B 2 NaN
6 C 1 3.0
7 C 1 2.0
8 B 1 NaN
What I'm doing at the moment is:
df.c = df.c.fillna(df.c.mean())
But this fills all the NaN values, while I only want to fill rows 3, 5 and 8, whose Category is B.

Combine fillna with slicing assignment
df.loc[df.Category.eq('B'), 'c'] = (df.loc[df.Category.eq('B'), 'c'].
fillna(df.c.mean()))
Out[736]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
Or a direct assignment with 2 masks
Series.eq is the element-wise equality method, equivalent to ==.
df.loc[df.Category.eq('B') & df.c.isna(), 'c'] = df.c.mean()
Out[745]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
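For reference, the two-mask approach can be run end to end as the minimal sketch below. The frame is rebuilt from the question's printout, and the names overall_mean and needs_fill are illustrative rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's printout
df = pd.DataFrame({
    'Category': ['A', 'C', 'A', 'B', 'A', 'B', 'C', 'C', 'B'],
    'b': [1, 1, 1, 2, 2, 2, 1, 1, 1],
    'c': [5.0, np.nan, 4.0, np.nan, 1.0, np.nan, 3.0, 2.0, np.nan],
})

# Mean of the whole column; NaN values are skipped by default
overall_mean = df['c'].mean()

# Rows that are category B and still missing a value in c
needs_fill = df['Category'].eq('B') & df['c'].isna()
df.loc[needs_fill, 'c'] = overall_mean

print(df)
```

Row 1 (category C) keeps its NaN because it fails the first mask.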

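This would be the answer to your question. Each row['c'] is a scalar without a fillna method, so test it with pd.isna instead:
mean_c = df.c.mean()
df.c = df.apply(
lambda row: mean_c if row['Category'] == 'B' and pd.isna(row['c']) else row['c'], axis=1)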

Related

compare two columns row by row and nan duplicate values pandas

I have a df
a b c
0 3 0
1 1 4
2 3 3
4 4 1
I want to compare a and b to c. If a value in the same row is equal to c I want 'nan' in a and/or b.
Like that:
a b c
nan 3 0
1 1 4
2 nan 3
4 4 1
We can use to_numpy with DataFrame.mask for this:
eqs = df.loc[:, :'b'].eq(df['c'].to_numpy()[:, None])
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)
a b c
0 NaN 3.0 0
1 1.0 1.0 4
2 2.0 NaN 3
3 4.0 4.0 1
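A possible alternative, sketched here under the assumption that the frame matches the question's table, is to pass axis=0 to DataFrame.eq so that the Series c is aligned on the row index, which avoids the manual NumPy reshape. The names cols and matches are illustrative.

```python
import pandas as pd

# Frame rebuilt from the question's table
df = pd.DataFrame({'a': [0, 1, 2, 4], 'b': [3, 1, 3, 4], 'c': [0, 4, 3, 1]})

cols = ['a', 'b']
# Compare a and b against the value of c in the same row
matches = df[cols].eq(df['c'], axis=0)
# mask() replaces entries where the condition is True with NaN
df[cols] = df[cols].mask(matches)

print(df)
```

Columns a and b become float because NaN cannot be stored in an integer column, which matches the output shown above.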

How to do forward filling for each group in pandas

I have a dataframe similar to below
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B and C by forward filling, but separately for each group. That is, the forward fill should be applied within each id. How can I do that?
Use GroupBy.ffill to forward fill within each group. If the first value in a group is NaN there is nothing to fill it from, so you can follow up with fillna and finally cast to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
# recent pandas versions do not return the grouping column from GroupBy.ffill, so assign to cols only
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
A B C
0 2 3 4
1 2 4 4
2 3 4 5
3 3 4 5
Or:
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
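As a self-contained sketch of the per-group forward fill, the snippet below rebuilds the frame from the question (not the variant printed in the answer) and fills only columns A, B and C. It assumes a pandas version in which GroupBy.ffill returns just the selected columns.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'A': [2, np.nan, 3, np.nan],
    'B': [3, 4, 4, np.nan],
    'C': [4, np.nan, 5, 5],
    'D': [5, 6, 6, 4],
    'E': [5, 7, 6, 1],
})

cols = ['A', 'B', 'C']
# Forward fill inside each id group; values never cross a group boundary
df[cols] = df.groupby('id')[cols].ffill()

print(df)
```

No fillna(0) is needed for this data because every group starts with a non-missing value in each of the three columns.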

Pandas insert empty row at 0th position

Suppose I have the following data frame
A B
1 2 3 4 5
4 5 6 7 8
I want to check whether df(0,0) is NaN and, if so, insert pd.Series(np.nan) at the 0th position. So in the above case it will be
A B
1 2 3 4 5
4 5 6 7 8
I am able to check the (0,0) element, but how do I insert an empty row at the first position?
Use pd.concat to prepend a DataFrame holding one empty row (DataFrame.append was removed in pandas 2.0):
df1 = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([df1, df], ignore_index=True)
print (df)
A B C D E
0 NaN NaN NaN NaN NaN
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0
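The question also asks for the row to be added only when the first cell is NaN. A minimal sketch of that check follows. The sample frame, with a NaN placed at position (0, 0), is an assumption made for illustration, and empty_row is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the top-left cell is NaN so the condition triggers
df = pd.DataFrame([[np.nan, 2, 3, 4, 5], [4, 5, 6, 7, 8]], columns=list('ABCDE'))

# Only prepend when the cell at row 0, column 0 is missing
if pd.isna(df.iloc[0, 0]):
    empty_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
    df = pd.concat([empty_row, df], ignore_index=True)

print(df)
```

Using iloc for the check keeps it positional, so it works regardless of the index and column labels.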
Alternatively, you can append a row of zeros, shift all rows down by one, and overwrite the first row with 0:
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
df.loc[len(df)] = 0
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
2 0 0 0 0 0
df = df.shift()
df.loc[0] = 0
df
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0

Add new dataframe to existing database but only add if column name matches

I have two dataframes that I am trying to combine, but I'm not getting the result I want using pandas.concat.
I have an existing set of data that I want to add new data to, but only where the column names match.
Let's say df1 is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
and df2 is:
A E D F
7 7 8 8
9 9 0 0
the result I would like to get is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
7 - - 8
9 - - 0
The blank data doesn't have to be - it can be anything.
When I use:
results = pandas.concat([df1, df2], axis=0, join='outer')
it gives me a new dataframe with all of the columns A through F, instead of what I want. Any ideas for how I can accomplish this? Thanks!
You want to use the pd.DataFrame.align method and specify that you want to align with the left argument's indices and that you only care about columns.
d1, d2 = df1.align(df2, join='left', axis=1)
Then you can use pd.concat (or pd.DataFrame.append on pandas versions before 2.0, where it still exists)
pd.concat([d1, d2], ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
Or, on pandas versions before 2.0 (DataFrame.append was removed in 2.0):
d1.append(d2, ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
My preferred way would be to skip the reassignment to names
pd.concat(df1.align(df2, join='left', axis=1), ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
You can find the intersection of the columns of df1 and df2, select those columns from df2, and then concat (or append on pandas versions before 2.0):
pd.concat(
[df1, df2[df1.columns.intersection(df2.columns)]]
)
Or, on pandas versions before 2.0 (DataFrame.append was removed in 2.0),
df1.append(df2[df1.columns.intersection(df2.columns)])
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
You can also use reindex and concat:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[81]:
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
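The reindex approach can be tried end to end with the sketch below, which rebuilds df1 and df2 from the question. The intermediate name df2_matched is illustrative, and ignore_index=True is added here to get a continuous row index rather than the repeated 0 and 1 shown above.

```python
import pandas as pd

# Frames rebuilt from the question
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [1, 3, 5], 'C': [2, 4, 6], 'D': [2, 4, 6]})
df2 = pd.DataFrame({'A': [7, 9], 'E': [7, 9], 'D': [8, 0], 'F': [8, 0]})

# Keep only df1's columns from df2; columns df2 lacks (B, C) become NaN,
# and df2's extra columns (E, F) are dropped
df2_matched = df2.reindex(columns=df1.columns)

# ignore_index=True renumbers the rows 0..4
result = pd.concat([df1, df2_matched], ignore_index=True)

print(result)
```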
Transpose first before merging.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True).T
A B C D
0_x 1.0 1.0 2.0 2.0
1_x 3.0 3.0 4.0 4.0
2 5.0 5.0 6.0 6.0
0_y 7.0 NaN NaN 8.0
1_y 9.0 NaN NaN 0.0
df1.T
0 1 2
A 1 3 5
B 1 3 5
C 2 4 6
D 2 4 6
df2.T
0 1
A 7 9
E 7 9
D 8 0
F 8 0
Now the result can be obtained with a merge with how="left" and we use the indices as the join key by passing left_index=True and right_index=True.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True)
0_x 1_x 2 0_y 1_y
A 1 3 5 7.0 9.0
B 1 3 5 NaN NaN
C 2 4 6 NaN NaN
D 2 4 6 8.0 0.0

Pandas: How can I fill in the n/a with the mean of the previous non-empty value and the next non-empty value

I have some N/A values in my dataframe
df = pd.DataFrame({'A':[1,1,1,3],
'B':[1,1,1,3],
'C':[1,np.nan,3,5],
'D':[2,np.nan, np.nan, 6]})
print(df)
A B C D
0 1 1 1.0 2.0
1 1 1 NaN NaN
2 1 1 3.0 NaN
3 3 3 5.0 6.0
How can I fill in the n/a value with the mean of its previous non-empty value and next non-empty value in its column?
For example, the second value in column C should be filled in with (1+3)/2= 2
Desired Output:
A B C D
0 1 1 1.0 2.0
1 1 1 2.0 4.0
2 1 1 3.0 4.0
3 3 3 5.0 6.0
Thanks!
Use ffill and bfill to replace the NaNs by forward and backward filling, then concat the two results and group by the index, aggregating with mean:
df1 = pd.concat([df.ffill(), df.bfill()]).groupby(level=0).mean()
print (df1)
A B C D
0 1 1 1.0 2.0
1 1 1 2.0 4.0
2 1 1 3.0 4.0
3 3 3 5.0 6.0
Detail:
print (df.ffill())
A B C D
0 1 1 1.0 2.0
1 1 1 1.0 2.0
2 1 1 3.0 2.0
3 3 3 5.0 6.0
print (df.bfill())
A B C D
0 1 1 1.0 2.0
1 1 1 3.0 6.0
2 1 1 3.0 6.0
3 3 3 5.0 6.0
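Because the two filled frames share the same index and columns, the same result can also be sketched by averaging them directly. This is an equivalent shortcut rather than the answer's own code, and the name filled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 3],
                   'B': [1, 1, 1, 3],
                   'C': [1, np.nan, 3, 5],
                   'D': [2, np.nan, np.nan, 6]})

# ffill() carries the previous value forward, bfill() carries the next one back;
# averaging the two gives the midpoint of the neighbouring non-missing values
filled = (df.ffill() + df.bfill()) / 2

print(filled)
```

Cells that were not missing are unchanged, since both fills return the original value there. A NaN at the very start or end of a column stays NaN because only one neighbour exists.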
