Let's say df1 looks like:
id x
a 1
b 2
b 3
c 4
and df2 looks like:
id y
b 9
b 8
How do I merge them so that the output is:
id x y
b 2 9
b 3 8
I've tried pd.merge(df1, df2, on='id') but it is giving me:
id x y
b 2 9
b 2 8
b 3 9
b 3 8
which is not what I want.
IIUC (if I understand correctly), number the repeated ids within each group with GroupBy.cumcount and merge on both id and that counter:
new_df = (df1.assign(count=df1.groupby('id').cumcount())
.merge(df2.assign(count=df2.groupby('id').cumcount()),
on=['id', 'count'], how='inner')
.drop(columns='count'))
id x y
0 b 2 9
1 b 3 8
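For completeness, a self-contained version of the above, building both frames from the question:

```python
import pandas as pd

# frames from the question
df1 = pd.DataFrame({'id': ['a', 'b', 'b', 'c'], 'x': [1, 2, 3, 4]})
df2 = pd.DataFrame({'id': ['b', 'b'], 'y': [9, 8]})

# cumcount numbers repeated ids 0, 1, 2, ... within each group,
# so the k-th 'b' in df1 pairs with the k-th 'b' in df2
new_df = (df1.assign(count=df1.groupby('id').cumcount())
             .merge(df2.assign(count=df2.groupby('id').cumcount()),
                    on=['id', 'count'], how='inner')
             .drop(columns='count'))
print(new_df)
```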
df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
I'm trying to merge df1 and df2 on B and keep all of the A, B, C and D columns:
A B C D
0 3.0 w 1 10.0
1 5.0 x 2 20.0
2 2.0 y 3 30.0
3 5.0 z 4 40.0
I've tried df1.merge(df2, how='outer', on='B')
A B C_x C_y D
0 3 w 0 1 10
1 5 x 0 2 20
2 2 y 0 3 30
3 5 z 0 4 40
which is almost what I want, but I need the C from df2 to replace the C from df1. How can I achieve that?
If you don't want C from the left-hand side at all, you could simply drop it before the merge:
df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result = pd.merge(
df1.drop('C', axis=1),
df2,
how='outer',
on='B')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
Edit:
However, if you want to fall back to df1's C in cases where df2 has no match, you can use combine_first():
df1 = pd.DataFrame({'A':[3,5,2,5,6], 'B':['w','x','y','z','q'], 'C':['0','0','0','0','88']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result_2 = pd.merge(df1, df2, how='outer', on='B')
result_2['C'] = result_2['C_y'].combine_first(result_2['C_x'])
result_2.drop(['C_x', 'C_y'], axis=1, inplace=True)
A B D C
0 3 w 10.0 1
1 5 x 20.0 2
2 2 y 30.0 3
3 5 z 40.0 4
4 6 q NaN 88
Here is one way to do it: select only the columns you need from df1 before the merge.
df1[['A','B']].merge(df2,
on='B',
how='outer')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
I got the following data structure for object 1:
dayofweek A B C
Monday 1 2 3
Tuesday 4 5 6
I have the same columns A, B, C for several other objects, say Obj1, Obj2, Obj3.
I want to put all the data in one dataframe with the following MultiIndex column structure:
object Obj1 Obj2 Obj3
dayofweek A B C A B C A B C
Monday 1 2 3 2 1 3 3 2 1
Tuesday 4 5 6 5 4 6 6 5 4
How can I do it easily? I tried .unstack(), but it puts the objects' labels below the A, B, C columns.
Use concat with the keys parameter to build the MultiIndex; here Obj2 and Obj3 are simulated by renaming columns:
df = df.set_index('dayofweek')
df1 = df.rename(columns={'A':'B', 'B':'A'}).sort_index(axis=1)
df2 = df.rename(columns={'A':'C', 'C':'A'}).sort_index(axis=1)
df3 = pd.concat([df, df1, df2], keys=('Obj1','Obj2','Obj3'), axis=1)
print (df3)
Obj1 Obj2 Obj3
A B C A B C A B C
dayofweek
Monday 1 2 3 2 1 3 3 2 1
Tuesday 4 5 6 5 4 6 6 5 4
If there are 3 DataFrames with dayofweek column use:
dfs = [df, df1, df2]
df3 = pd.concat([x.set_index('dayofweek') for x in dfs], keys=('Obj1','Obj2','Obj3'), axis=1)
print (df3)
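A runnable version of the three-DataFrame case, with obj1/obj2/obj3 built from the example values in the question:

```python
import pandas as pd

# one frame per object, each with its own dayofweek column
# (values taken from the example in the question)
obj1 = pd.DataFrame({'dayofweek': ['Monday', 'Tuesday'],
                     'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})
obj2 = pd.DataFrame({'dayofweek': ['Monday', 'Tuesday'],
                     'A': [2, 5], 'B': [1, 4], 'C': [3, 6]})
obj3 = pd.DataFrame({'dayofweek': ['Monday', 'Tuesday'],
                     'A': [3, 6], 'B': [2, 5], 'C': [1, 4]})

# concat side by side; keys becomes the top level of the column MultiIndex
dfs = [obj1, obj2, obj3]
df3 = pd.concat([x.set_index('dayofweek') for x in dfs],
                keys=('Obj1', 'Obj2', 'Obj3'), axis=1)
print(df3)
```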
Try to use merge:
print(obj1.merge(obj2, on='dayofweek').merge(obj3, on='dayofweek'))
result:
dayofweek A_x B_x C_x A_y B_y C_y A B C
0 Monday 1 2 3 2 1 3 3 2 1
1 Tuesday 4 5 6 5 4 6 6 5 4
I want to pick only the rows from df1 where the values of columns A and B match the values of columns A and B in df2. For example, if df1 and df2 are as follows:
df1
A B C
1 2 3
4 5 6
6 7 8
df2
A B D E
1 2 6 8
2 3 7 9
4 5 2 1
the result will be a subset of df1's rows; in this example it will look like:
df1
A B C
1 2 3
4 5 6
Use:
df = pd.merge(df1, df2[["A", "B"]], on=["A", "B"], how="inner")
print(df)
This prints:
A B C
0 1 2 3
1 4 5 6
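A self-contained version of this semi-join; the drop_duplicates here is a defensive addition (not in the answer above) that keeps the merge from duplicating df1 rows if df2 ever repeats an (A, B) pair:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 6], 'B': [2, 5, 7], 'C': [3, 6, 8]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 3, 5],
                    'D': [6, 7, 2], 'E': [8, 9, 1]})

# semi-join: keep df1 rows whose (A, B) pair also appears in df2
df = pd.merge(df1, df2[['A', 'B']].drop_duplicates(),
              on=['A', 'B'], how='inner')
print(df)
```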
How can I merge the following two data frames on columns A and B:
df1
A B C
1 2 3
2 8 2
4 7 9
df2
A B C
5 6 7
2 8 9
The result should contain only the matching rows from both frames:
df3
A B C
2 8 2
2 8 9
You can concatenate them and drop the ones that are not duplicated:
conc = pd.concat([df1, df2])
conc[conc.duplicated(subset=['A', 'B'], keep=False)]
Out:
A B C
1 2 8 2
1 2 8 9
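End to end, with the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 8, 7], 'C': [3, 2, 9]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# stack both frames, then keep every row whose (A, B) pair occurs
# more than once, i.e. appears in both inputs
conc = pd.concat([df1, df2])
both = conc[conc.duplicated(subset=['A', 'B'], keep=False)]
print(both)
```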
If you have duplicates,
df1
Out:
A B C
0 1 2 3
1 2 8 2
2 4 7 9
3 4 7 9
4 2 8 5
df2
Out:
A B C
0 5 6 7
1 2 8 9
3 5 6 4
4 2 8 10
You can keep track of the duplicated ones via boolean arrays (note to_dict('list'); the single-letter orient 'l' was removed in pandas 2.0):
cols = ['A', 'B']
bool1 = df1[cols].isin(df2[cols].to_dict('list')).all(axis=1)
bool2 = df2[cols].isin(df1[cols].to_dict('list')).all(axis=1)
pd.concat([df1[bool1], df2[bool2]])
Out:
A B C
1 2 8 2
4 2 8 5
1 2 8 9
4 2 8 10
Solution with Index.intersection: select the matching rows in both DataFrames with loc, and finally concat them together:
df1.set_index(['A','B'], inplace=True)
df2.set_index(['A','B'], inplace=True)
idx = df1.index.intersection(df2.index)
print (idx)
MultiIndex(levels=[[2], [8]],
labels=[[0], [0]],
names=['A', 'B'],
sortorder=0)
df = pd.concat([df1.loc[idx],df2.loc[idx]]).reset_index()
print (df)
A B C
0 2 8 2
1 2 8 9
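A self-contained version of this approach (newer pandas prints the MultiIndex with codes= rather than labels=, but the logic is unchanged):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 8, 7], 'C': [3, 2, 9]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# index both frames by the join keys, intersect the indexes,
# then pull the matching rows from each side and stack them
df1i = df1.set_index(['A', 'B'])
df2i = df2.set_index(['A', 'B'])
idx = df1i.index.intersection(df2i.index)

df = pd.concat([df1i.loc[idx], df2i.loc[idx]]).reset_index()
print(df)
```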
Here is a less efficient method that preserves duplicates, but it involves two merge/join operations.
# create a merged DataFrame with variables C_x and C_y with the C values
temp = pd.merge(df1, df2, how='inner', on=['A', 'B'])
# join columns A and B to a stacked DataFrame with the Cs on index
temp[['A', 'B']].join(
pd.DataFrame({'C':temp[['C_x', 'C_y']].stack()
.reset_index(level=1, drop=True)})).reset_index(drop=True)
This returns
A B C
0 2 8 2
1 2 8 9
I have a dataframe that looks like this:
In [4]:
import pandas as pd
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[4]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
I just want to group rows which have the same value in column a. The desired output is like this:
df
Out[4]:
a b
0 A 1
2
1 B 5
5
4
2 C 6
EDIT:
Sorry, actually the desired output should look like this:
df
Out[4]:
b
A 1
2
B 5
5
4
C 6
I think you are looking for set_index rather than groupby:
In [11]: df.set_index('a')
Out[11]:
b
a
A 1
A 2
B 5
B 5
B 4
C 6
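set_index still prints the repeated labels, though. If you also want the blank repeats shown in the desired output, one purely cosmetic trick (my addition, not part of the answer above) is to blank out duplicated index labels with Index.where and Index.duplicated; note this turns the index into strings and is for display only:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

out = df.set_index('a')
# cosmetic only: replace a label with '' when it has already
# appeared above, mimicking the desired output in the question
out.index = out.index.where(~out.index.duplicated(), '')
print(out)
```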