Fill empty values in a dataframe based on columns in another dataframe - python

I have a dataframe df1 like this.
I want to fill the NaN values and the zeros in the score column with the corresponding values from another dataframe df2, matched by name.
How could I do this?

Option 1
Short version
df1.score = df1.score.mask(df1.score.eq(0)).fillna(
    df1.name.map(df2.set_index('name').score)
)
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
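A self-contained sketch of Option 1, using the df1 and df2 from the question below:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
                    'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [10, 20, 30]})

# mask() turns the zeros into NaN, then fillna() looks each name up in df2
df1.score = df1.score.mask(df1.score.eq(0)).fillna(
    df1.name.map(df2.set_index('name').score)
)
print(df1.score.tolist())
```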
Option 2
Interesting version using searchsorted. df2 must be sorted by 'name'.
import numpy as np

i = np.where(np.isnan(df1.score.mask(df1.score.values == 0).values))[0]
j = df2.name.values.searchsorted(df1.name.values[i])
df1.score.values[i] = df2.score.values[j]
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0

If df1 and df2 are your dataframes, you can create a mapping and then call pd.Series.replace:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name' : ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
                    'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name' : ['A', 'B', 'C'], 'score' : [10, 20, 30]})
print(df1)
name score
0 A 0.0
1 B 32.0
2 A 0.0
3 C NaN
4 B NaN
5 A 45.0
6 A NaN
7 A NaN
print(df2)
name score
0 A 10
1 B 20
2 C 30
mapping = dict(df2.values)
mask = df1.score.isnull() | (df1.score == 0)
df1.loc[mask, 'score'] = df1.loc[mask, 'name'].replace(mapping)
print(df1)
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
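A close variant swaps replace for map. The difference is worth knowing: map returns NaN for any name missing from the mapping, while replace would leave the name string itself in the score column. A sketch under the same df1/df2:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
                    'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [10, 20, 30]})

# build the name -> score lookup and apply it only to the rows that need it
mapping = dict(zip(df2['name'], df2['score']))
mask = df1['score'].isnull() | df1['score'].eq(0)
df1.loc[mask, 'score'] = df1.loc[mask, 'name'].map(mapping)
```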

Or use merge with fillna:
import pandas as pd
import numpy as np
df1.loc[df1.score==0, 'score'] = np.nan
df1.merge(df2, on='name', how='left').fillna(method='bfill', axis=1)[['name','score_x']]\
   .rename(columns={'score_x':'score'})
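Note that fillna(method=...) is deprecated in recent pandas. A merge-based variant that avoids it, assuming df1 has a default RangeIndex (so the merge result aligns row for row):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
                    'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [10, 20, 30]})

# a left merge keeps df1's row order; suffixes avoid the score/score name clash
lookup = df1.merge(df2, on='name', how='left', suffixes=('', '_df2'))['score_df2']
df1['score'] = df1['score'].replace(0, np.nan).fillna(lookup)
```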

This method changes the order (the result will be sorted by name).
df1.set_index('name').replace(0, np.nan).combine_first(df2.set_index('name')).reset_index()
name score
0 A 10
1 A 10
2 A 45
3 A 10
4 A 10
5 B 32
6 B 20
7 C 30

Related

Division in pandas dataframe

I am trying to divide my data frame with one of its columns:
Here is my data frame:
   A   B   C
0  1  10  10
1  2  20  30
2  3  15  33
Now, I want to divide columns "B" and "C" by column "A"; my desired output would be:
   A   B   C
0  1  10  10
1  2  10  15
2  3   5  11
I tried df/df['A'], but it does not produce the desired output.
Use DataFrame.div:
df[['B','C']] = df[['B','C']].div(df['A'], axis=0)
print (df)
A B C
0 1 10.0 10.0
1 2 10.0 15.0
2 3 5.0 11.0
If you need to divide all columns except A:
cols = df.columns.difference(['A'])
df[cols] = df[cols].div(df['A'], axis=0)
Try this:
import pandas as pd

d = {
    'A': [1, 2, 3],
    'B': [10, 20, 15],
    'C': [10, 30, 33]
}
df = pd.DataFrame(d)
df['B'] = df['B']/df['A']
df['C'] = df['C']/df['A']
print(df)
Output:
A B C
0 1 10.0 10.0
1 2 10.0 15.0
2 3 5.0 11.0

How to use .bfill() with pandas groupby without dropping the grouping variable

I want to use bfill and groupby but have not figured out a way to do so without dropping the grouping variable. I know I can just concatenate back the ID column but there's gotta be another way of doing this.
import pandas as pd
import numpy as np
test = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                     'dd': [0, 0, 0, 0, 0, 0],
                     'nu': np.array([0, 1, np.nan, np.nan, 10, 20])})
In [11]:test.groupby('ID').bfill()
Out[11]:
nu
0 0.0
1 1.0
2 NaN
3 10.0
4 10.0
5 20.0
Desired output
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0
Try df.assign:
>>> test.assign(nu=test.groupby('ID').bfill()['nu'])
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0
Or use df.groupby.apply:
>>> test.groupby('ID').apply(lambda x:x.bfill())
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0
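Another option, sketched below: backfill only the column that has the gaps and assign it back. The grouped Series stays aligned to the original index, so ID and dd are never dropped in the first place.

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                     'dd': [0, 0, 0, 0, 0, 0],
                     'nu': [0, 1, np.nan, np.nan, 10, 20]})

# groupby(...)['nu'].bfill() returns a Series aligned to test's index,
# so assigning it back leaves the other columns untouched
test['nu'] = test.groupby('ID')['nu'].bfill()
```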

Pandas Merge - Bring in identical column values based on keys

I have 3 dataframes like this,
df = pd.DataFrame([[1, 3], [2, 4], [3,6], [4,12], [5,18]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [2, 6], [3,9]], columns=['A', 'C'])
df3 = pd.DataFrame([[4, 15, "hello"], [5, 19, "yes"]], columns=['A', 'C', 'D'])
They look like this,
df
A B
0 1 3
1 2 4
2 3 6
3 4 12
4 5 18
df2
A C
0 1 5
1 2 6
2 3 9
df3
A C D
0 4 15 hello
1 5 19 yes
My merges: first,
f_merge = pd.merge(df, df2, on='A', how='left')
then the second (f_merge with df3):
s_merge = pd.merge(f_merge, df3, on='A', how='left')
I get the output like this,
A B C_x C_y D
0 1 3 5.0 NaN NaN
1 2 4 6.0 NaN NaN
2 3 6 9.0 NaN NaN
3 4 12 NaN 15.0 hello
4 5 18 NaN 19.0 yes
I need it to be like this:
A B C D
0 1 3 5.0 NaN
1 2 4 6.0 NaN
2 3 6 9.0 NaN
3 4 12 15.0 hello
4 5 18 19.0 yes
How can I achieve this output? Any suggestion would be great.
Concat df2 and df3 before merging.
new_df = pd.merge(df, pd.concat([df2, df3], ignore_index=True), on='A')
new_df
Out:
A B C D
0 1 3 5 NaN
1 2 4 6 NaN
2 3 6 9 NaN
3 4 12 15 hello
4 5 18 19 yes
We can also use combine_first:
df.set_index('A',inplace=True)
df2.set_index('A').combine_first(df).combine_first(df3.set_index('A'))
B C D
A
1 3.0 5.0 NaN
2 4.0 6.0 NaN
3 6.0 9.0 NaN
4 12.0 15.0 hello
5 18.0 19.0 yes

Move Null rows to the bottom of the dataframe

I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after OP's edit)
You were close but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer
Use sort_values with the na_position='last' argument:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
4 5.0 e
5 6.0 f
7 8.0 h
2 10.0 c
3 NaN d
6 NaN g
Note that this sorts the non-null rows by value, so the original row order is not preserved.
No built-in method for this exists in pandas yet; use Series.isna with Series.argsort to get the positions and reorder with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or pure pandas solution with helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
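On pandas 1.1+, sort_values also accepts a key function, which makes the helper column unnecessary. The explicit kind='mergesort' (a stable sort) guarantees the rows keep their original relative order within the null and non-null groups:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})

# sort only on "is this NaN?"; mergesort is stable, so the original order
# is preserved within the False group and within the True group
df1 = df1.sort_values('a', key=lambda s: s.isna(), kind='mergesort',
                      ignore_index=True)
```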

Pandas Create New Column Based on Value in Another Column, If False Return Previous Value of New Column

This is a Python pandas problem I've been struggling with for a while. Let's say I have a simple dataframe df where df['a'] = [1,2,3,1,4,6] and df['b'] = [10,20,30,40,50,60]. I would like to create a third column 'c': if df['a'] == 1, then df['c'] = df['b']; otherwise, df['c'] = the previous value of df['c']. I have tried using np.where to make this happen, but the result is not what I was expecting. Any advice?
df = pd.DataFrame()
df['a'] = [1,2,3,1,4,6]
df['b'] = [10,20,30,40,50,60]
df['c'] = np.nan
df['c'] = np.where(df['a'] == 1, df['b'], df['c'].shift(1))
The result is:
a b c
0 1 10 10.0
1 2 20 NaN
2 3 30 NaN
3 1 40 40.0
4 4 50 NaN
5 6 60 NaN
Whereas I would have expected:
a b c
0 1 10 10.0
1 2 20 10.0
2 3 30 10.0
3 1 40 40.0
4 4 50 40.0
5 6 60 40.0
Try forward-filling the NaN values that the np.where call leaves behind:
df['c'] = df['c'].ffill()
Output:
a b c
0 1 10 10.0
1 2 20 10.0
2 3 30 10.0
3 1 40 40.0
4 4 50 40.0
5 6 60 40.0
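The two steps can also be collapsed into one expression with Series.where, which keeps b where the condition holds and inserts NaN elsewhere, ready for ffill. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 4, 6],
                   'b': [10, 20, 30, 40, 50, 60]})

# where() keeps b when a == 1 and puts NaN elsewhere;
# ffill() then carries the last kept value forward
df['c'] = df['b'].where(df['a'].eq(1)).ffill()
```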
