Join DataFrame by Comparing columns - python

I have two dataframe:
df1:
df2:
I want update column 'D' of df1 such that if df2 have a lesser value for column d for same value of 'A' and 'B' column df.d will replace df1.d for that row.
expected output is:
Could someone help me with the python code for this?
Thanks.

df_new=pd.merge(df1,df2, how='left', on=['A','B'],suffixes=('', '_r'))#merge the two frames on A and B and suffix df2['D'] WITH R
df_new['D']=np.where(df_new['D']>df_new['D_r'],df_new['D_r'],df_new['D'])#Use np.where to replace column D with the right value as per condition
df_new.drop('D_r',1, inplace=True)#Drop the D_r column
A B C D
0 x 2 f 1.0
1 x 3 2 1.0
2 y 2 4 3.0
3 y 5 dfs 2.0
4 z 1 sds 5.0

Related

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However what the answer does not explain is if there are other columns in the primary df (such as column B) that should not be involved as an index or with the operation.
Is somebody please able to advise?
I was originally performing a loop which find the value in the other df and deducts it however this takes too long for my code to run with the size of data I am working with.
Idea is specify column(s) for maching and column(s) for subtract, convert all not cols columnsnames to MultiIndex, subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or replace not matched values to original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3

Pandas merge dataframes with shared column, fillna in left with right

I am trying to merge two dataframes and replace the nan in the left df with the right df, I can do it with three lines of code as below, but I want to know if there is a better/shorter way?
# Example data (my actual df is ~500k rows x 11 cols)
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})
# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')
# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])
# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)
The problem confusing merge is that both dataframes have a 'b' column, but the left and right versions have NaNs in mismatched places. You want to avoid getting unwanted multiple 'b' columns 'b_x', 'b_y' from merge in the first place:
slice the non-shared columns 'a','e' from df1
do merge(df2, 'left'), this will pick up 'b' from the right dataframe (since it only exists in the right df)
finally do df1.update(...) , this will update the NaNs in the column 'b' taken from df2 with df1['b']
Solution:
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Note: Because I used merge(..., how='left'), I preserve the row order of the calling dataframe. If my df1 had values of a that were not in order
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 NaN 2
The result would be
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 0.0 2
Which is as expected.
Further...
If you want to be more explicit when there may be more columns involved
df1.update(df1.drop('b', 1).merge(df2, 'left', 'a'))
Even Further...
If you don't want to update the dataframe, we can use combine_first
Quick
df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))
Explicit
df1.combine_first(df1.drop('b', 1).merge(df2, 'left', 'a'))
EVEN FURTHER!...
The 'left' merge may preserve order but NOT the index. This is the ultra conservative approach:
df3 = df1.drop('b', 1).merge(df2, 'left', on='a').set_index(df1.index)
df1.combine_first(df3)
Short version
df1.b.fillna(df1.a.map(df2.set_index('a').b),inplace=True)
df1
Out[173]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Since you mentioned there will be multiple columns
df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Also we can pass to fillna with df
df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Only if the indices are alligned (important note), we can use update:
df1['b'].update(df2['b'])
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Or simply fillna:
df1['b'].fillna(df2['b'], inplace=True)
If you're indices are not alligned, see WenNYoBen's answer or comment underneath.
You can mask the data.
original data:
print(df)
one two three
0 1 1.0 1.0
1 2 NaN 2.0
2 3 3.0 NaN
print(df2)
one two three
0 4 4 4
1 4 2 4
2 4 4 3
See below, mask just fills based on condition.
# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
.mask(df1[['two','three']].isna(),df2[['two','three']])
output:
one two three
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0

Insert values after a pd.DataFrame.query() and keep the original data

I have a df:
df = pd.DataFrame([[1,1],[3,4],[3,4]], columns=["a", 'b'])
a b
0 1 1
1 3 4
2 3 4
I have to filter this df based on a query. The query can be complex, but here I'm using a simple one:
items = [3,4]
df.query("a in #items and b == 4")
a b
1 3 4
2 3 4
Only to these rows I would like to add some values in new columns:
configuration = {'c': 'action', "d": "non-action"}
for k, v in configuration.items():
df[k] = v
The rest of the rows should have an empty value or np.nan. So my end df should look like:
a b c d
0 1 1 np.nan np.nan
1 3 4 action non-action
2 3 4 action non-action
The issue is that to do the query I end up with a copy of a dataframe. And then I have to somehow merged them and replace the modified rows by index. How to do it without replacing in the original df the rows by index with the queried one?
Using combine_first with assign
df.query("a in #items and b == 4").assign(**configuration).combine_first(df)
Out[138]:
a b c d
0 1.0 1.0 NaN NaN
1 3.0 4.0 action non-action
2 3.0 4.0 action non-action

Pandas: assign value depending on another dataframe

I have to dataframes that look like this:
df1: condition
A
A
A
B
B
B
B
df2: condition value
A 1
B 2
I would like to assign to each condition its value, adding a column to df1 in order to obtain:
df1: condition value
A 1
A 1
A 1
B 2
B 2
B 2
B 2
how can I do this? thank you in advance!
Use map by Series created by set_index if need append one column only:
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print (df1)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
Or use merge with left join if df2 have more columns:
df = df1.merge(df2, on='condition', how='left')
print (df)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2

Python, how to fill nulls in a data frame using a dictionary

I have a dataframe df something like this
A B C
1 'x' 15.0
2 'y' NA
3 'z' 25.0
and a dictionary dc something like
dc = {'x':15,'y':35,'z':25}
I want to fill all nulls in column C of the dataframe using values of column B from the dictionary. So that my dataframe will become
A B C
1 'x' 15
2 'y' 35
3 'z' 25
Could anyone help me how to do that please?
thanks,
Manoj
You can use fillna with map:
dc = {'x':15,'y':35,'z':25}
df['C'] = df.C.fillna(df.B.map(dc))
df
# A B C
#0 1 x 15.0
#1 2 y 35.0
#2 3 z 25.0
df['C'] = np.where(df['C'].isnull(), df['B'].apply(lambda x: dc[x]), df['C'])

Categories

Resources