I don't know why this is confusing me so much. I am trying to combine two dataframes, and both share the same index (although as a note, they may not be in the same order).
df1 =
|firstrow   10|
|secondrow  15|

df2 =
|secondrow  115|
|firstrow   1000|

and I want the resulting dataframe to be:

result =
|firstrow   10  1000|
|secondrow  15  115|
I have tried doing this:
df = pd.merge(df1,df2, on="INDEXNAME"), but it throws a KeyError on INDEXNAME
thanks!
I think you can use concat (outer join by default):
df = pd.concat([df1, df2], axis=1)
And if need inner join:
df = pd.concat([df1, df2], axis=1, join='inner')
Or merge (inner join by default) with the left_index and right_index parameters:
df = pd.merge(df1, df2, left_index=True, right_index=True)
Sample:
df1 = pd.DataFrame({'a':[10,15]}, index=['firstrow','secondrow'])
df2 = pd.DataFrame({'b':[115,1000]}, index=['secondrow','firstrow'])
print (df1)
a
firstrow 10
secondrow 15
print (df2)
b
secondrow 115
firstrow 1000
print (pd.concat([df1, df2], axis=1))
a b
secondrow 15 115
firstrow 10 1000
print (pd.merge(df1, df2, left_index=True, right_index=True))
a b
secondrow 15 115
firstrow 10 1000
Related
I have two dataframes and I want to check, per id, whether the values differ between them; if so, I need to print those rows.
example:
df1 = |id |check_column1|
|1|abc|
|1|bcd|
|2|xyz|
|2|mno|
|2|mmm|
df2 =
|id |check_column2|
|1|bcd|
|1|abc|
|2|xyz|
|2|mno|
|2|kkk|
Here the output should be just |2|mmm|kkk|, but I am getting the whole table as output since the indexes are different.
This is what I did:
output = pd.merge(df1,df2, on= ['id'], how='inner')
event4 = output[output.apply(lambda x: x['check_column1'] != x['check_column2'], axis=1)]
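Part of the problem is that merging on id alone pairs every df1 row with every df2 row sharing that id, so the comparison runs over a per-id cartesian product rather than row against row. A quick sketch with the sample data from this question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'check_column1': ['abc', 'bcd', 'xyz', 'mno', 'mmm']})
df2 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'check_column2': ['bcd', 'abc', 'xyz', 'mno', 'kkk']})

output = pd.merge(df1, df2, on=['id'], how='inner')
# id=1 contributes 2*2 pairs and id=2 contributes 3*3 pairs
print(len(output))  # 13, not 5
```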
The idea is to sort the values per id in both dataframes, join on a helper counter built with GroupBy.cumcount, and then filter the rows that don't match:
df1 = df1.sort_values(['id','check_column1'])
df2 = df2.sort_values(['id','check_column2'])
df = pd.merge(df1,df2, left_on= ['id',df1.groupby('id').cumcount()],
right_on= ['id',df2.groupby('id').cumcount()])
output = df[df['check_column1'] != df['check_column2']]
print (output)
id key_1 check_column1 check_column2
2 2 0 mmm kkk
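To see why the helper counter works: GroupBy.cumcount simply numbers the rows within each id group, so after sorting, matching rows in both frames get the same (id, counter) key. A small illustration with df1 from this question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'check_column1': ['abc', 'bcd', 'xyz', 'mno', 'mmm']})

df1 = df1.sort_values(['id', 'check_column1'])
# within-group row number: 0, 1 for id=1 and 0, 1, 2 for id=2
print(df1.groupby('id').cumcount().tolist())  # [0, 1, 0, 1, 2]
```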
You can use np.where, but the rows have to be aligned first: sort both frames per id and reset the index so they can be compared row by row (an inner merge on id alone would pair every df1 row with every df2 row for that id):
df1 = pd.DataFrame({'id':[1,1,2,2,2],'check_column1':['abc','bcd','xyz','mno','mmm']})
df2 = pd.DataFrame({'id':[1,1,2,2,2],'check_column2':['bcd','abc','xyz','mno','kkk']})
df1 = df1.sort_values(['id','check_column1']).reset_index(drop=True)
df2 = df2.sort_values(['id','check_column2']).reset_index(drop=True)
mask = np.where(df1['check_column1'] != df2['check_column2'], True, False)
event4 = df1[mask].assign(check_column2=df2.loc[mask, 'check_column2'])
Output:
   id check_column1 check_column2
2   2           mmm           kkk
This row-by-row comparison only works if both frames are first sorted per id (and re-indexed) so that matching rows line up:
df1 = df1.sort_values(['id','check_column1']).reset_index(drop=True)
df2 = df2.sort_values(['id','check_column2']).reset_index(drop=True)
mask = np.where((df1['id'] != df2['id']) | (df1['check_column1'] != df2['check_column2']), True, False)
output = df2[mask]
I have two dataframes which I am joining like so:
df3 = df1.join(df2.set_index('id'), on='id', how='left')
But I want to replace values for ids that are present in df1 but not in df2 with NaN (a left join just leaves the df1 values as they are). What's the easiest way to accomplish this?
I think you need Series.where with Series.isin:
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
Or numpy.where:
df1['id'] = np.where(df1['id'].isin(df2['id']), df1['id'], np.nan)
Sample:
df1 = pd.DataFrame({
'id':list('abc'),
})
df2 = pd.DataFrame({
'id':list('dmna'),
})
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
print (df1)
id
0 a
1 NaN
2 NaN
Or solution with merge and indicator parameter:
df3 = df1.merge(df2, on='id', how='left', indicator=True)
df3['id'] = df3['id'].mask(df3.pop('_merge').eq('left_only'))
print (df3)
id
0 a
1 NaN
2 NaN
Suppose I have the following 2 DataFrames:
df1, whose index is ['NameID', 'Date']. For example, df1 can be a panel dataset of historical salaries of employees in a company.
df2, whose index is ['NameID']. For example, df2 can be a dataset of employees' birthday and SSN.
What is the most efficient way to join df1 and df2 on 'NameID' as an index on a 1:m basis? DataFrame.join() doesn't allow 1:m join. I know I can first reset_index() for both df1 and df2, and then use DataFrame.merge() to join them on columns, but I think that is not efficient.
Code:
df1 = pd.DataFrame({'NameID':['A','B','C']*3,
'Date':['20180801']*3+['20180802']*3+['20180803']*3,
'Salary':np.random.rand(9)
})
df1 = df1.set_index(['NameID', 'Date'])
df1
NameID Date Salary
A 20180801 0.831064
B 20180801 0.419464
C 20180801 0.239779
A 20180802 0.500048
B 20180802 0.317452
C 20180802 0.188051
A 20180803 0.076196
B 20180803 0.060435
C 20180803 0.297118
df2 = pd.DataFrame({'NameID':['A','B','C'],
'SSN':[999,888,777]
})
df2 = df2.set_index(['NameID'])
df2
NameID SSN
A 999
B 888
C 777
The result I want to get is:
NameID Date Salary SSN
A 20180801 0.831064 999
A 20180802 0.500048 999
A 20180803 0.076196 999
B 20180801 0.419464 888
B 20180802 0.317452 888
B 20180803 0.060435 888
C 20180801 0.239779 777
C 20180802 0.188051 777
C 20180803 0.297118 777
You may want to merge.
df = pd.merge(df1, df2, on='NameID', how='left')
See Michael B's answer, but in addition, you might also want to sort to get your requested output:
pd.merge(df1, df2, on='NameID', how='left').sort_values('SSN', ascending=False)
Answering on behalf of warwick12
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
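As a side note, in recent pandas versions DataFrame.join does handle this 1:m case directly: when the right frame's index name matches a level of the left frame's MultiIndex, the join matches on that level and broadcasts. A sketch assuming the df1/df2 built in the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'NameID': ['A', 'B', 'C'] * 3,
                    'Date': ['20180801'] * 3 + ['20180802'] * 3 + ['20180803'] * 3,
                    'Salary': np.random.rand(9)}).set_index(['NameID', 'Date'])
df2 = pd.DataFrame({'NameID': ['A', 'B', 'C'],
                    'SSN': [999, 888, 777]}).set_index('NameID')

# df2's single index name ('NameID') matches a level of df1's MultiIndex,
# so each SSN is broadcast to all of that employee's rows
df3 = df1.join(df2, how='left')
print(df3)
```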
I have two dataframes, A and B, and I want to get the rows that are in A but not in B (a set difference / left anti-join; the question originally illustrated this with a Venn diagram image).
Dataframe A has columns ['a','b' + others] and B has columns ['a','b' + others]. There are no NaN values. I tried the following:
1.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a'])) | (~dfA['b'].isin(dfm['b']))]
2.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a'])) & (~dfA['b'].isin(dfm['b']))]
3.
dfe = dfA[(~dfA['a'].isin(dfB['a'])) | (~dfA['b'].isin(dfB['b']))]
4.
dfe = dfA[(~dfA['a'].isin(dfB['a'])) & (~dfA['b'].isin(dfB['b']))]
but when I check len(dfm) and len(dfe), they don't sum to len(dfA) (they're off by a few rows). I've tried this on dummy cases and #1 works, so my dataset may have some peculiarity I am unable to reproduce.
What's the right way to do this?
Use an outer merge with the indicator parameter:
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True)
df = df[df['_merge'] == 'left_only']
One-liner:
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True
).query('_merge=="left_only"')
I think it would go something like the examples in: Pandas left outer join multiple dataframes on multiple columns
dfe = pd.merge(dfA, dfB, how='left', on=['a','b'], indicator=True)
dfe[dfe['_merge'] == 'left_only']
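One way to sanity-check the indicator approach (with hypothetical sample data, since the original frames aren't shown): after an inner merge dfm and the left_only filter dfe, the two row counts should partition dfA exactly, provided the (a, b) pairs are unique in both frames. Duplicate key pairs are one common reason the counts in the question don't add up:

```python
import pandas as pd

# hypothetical sample data with unique (a, b) pairs in each frame
dfA = pd.DataFrame({'a': [1, 1, 2, 3], 'b': ['x', 'y', 'x', 'z'], 'v': [10, 20, 30, 40]})
dfB = pd.DataFrame({'a': [1, 2, 9], 'b': ['x', 'x', 'q'], 'w': [1, 2, 3]})

dfm = dfA.merge(dfB, on=['a', 'b'])                        # rows of dfA also in dfB
merged = dfA.merge(dfB, on=['a', 'b'], how='left', indicator=True)
dfe = merged[merged['_merge'] == 'left_only'].drop(columns=['_merge', 'w'])  # anti-join

print(len(dfm), len(dfe), len(dfA))  # 2 2 4
```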
What's the python/pandas way to merge multilevel dataframes on the column "t" under "cell 1" and "cell 2"?
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
columns = [['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
columns = [['cell 2'] * 2, ['t', 'sb']])
Now when I try to merge on "t", the Python REPL errors out:
ddf = pd.merge(df1, df2, on='t', how='outer')
What's a good way to handle this?
pd.merge(df1, df2, left_on=[('cell 1', 't')], right_on=[('cell 2', 't')])
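Assuming the df1/df2 defined in the question, this keeps both top levels intact and inner-joins on the two t columns (a tuple in left_on/right_on addresses a single column of a MultiIndex). A quick check of what it returns:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])

# inner join on the two 't' columns; only t == 2 appears in both frames,
# giving one row with sb=3 on the left and sb=6 on the right
ddf = pd.merge(df1, df2, left_on=[('cell 1', 't')], right_on=[('cell 2', 't')])
print(ddf)
```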
One solution is to drop the top level (i.e. 'cell 1' and 'cell 2') from the dataframes and then merge.
If you want, you can save these columns to reinstate them after the merge.
c1 = df1.columns
c2 = df2.columns
df1.columns = df1.columns.droplevel()
df2.columns = df2.columns.droplevel()
df_merged = df1.merge(df2, on='t', how='outer', suffixes=['_df1', '_df2'])
df1.columns = c1
df2.columns = c2
>>> df_merged
t sb_df1 sb_df2
0 0 1 NaN
1 2 3 6
2 1 NaN 5
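If mutating df1 and df2 in place is a concern, the same merge can be done on droplevel'd copies, leaving the originals untouched. A sketch assuming the df1/df2 from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])

# droplevel returns new frames, so df1/df2 keep their MultiIndex columns
df_merged = (df1.droplevel(0, axis=1)
                .merge(df2.droplevel(0, axis=1),
                       on='t', how='outer', suffixes=['_df1', '_df2']))
print(df_merged)
```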