Concatenate data frames by column values - python

How can I merge the following two data frames on columns A and B?
df1
A B C
1 2 3
2 8 2
4 7 9
df2
A B C
5 6 7
2 8 9
The result should contain only the rows whose A and B values match in both frames:
df3
A B C
2 8 2
2 8 9

You can concatenate them and keep only the rows whose (A, B) pair is duplicated:
conc = pd.concat([df1, df2])
conc[conc.duplicated(subset=['A', 'B'], keep=False)]
Out:
A B C
1 2 8 2
1 2 8 9
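For reference, here is a self-contained version of the snippet above. The DataFrame constructors are my reconstruction of the tables posted in the question, so treat them as an assumption about the actual data.

```python
import pandas as pd

# Frames rebuilt from the tables in the question (assumed default integer index)
df1 = pd.DataFrame({"A": [1, 2, 4], "B": [2, 8, 7], "C": [3, 2, 9]})
df2 = pd.DataFrame({"A": [5, 2], "B": [6, 8], "C": [7, 9]})

# Stack both frames, then keep every row whose (A, B) pair occurs more than once
conc = pd.concat([df1, df2])
df3 = conc[conc.duplicated(subset=["A", "B"], keep=False)]
print(df3)
```

Both surviving rows keep their original index label (1), because concat does not renumber rows unless you pass ignore_index=True.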
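If the frames themselves contain duplicated (A, B) pairs, the approach above would also keep pairs that occur in only one frame. For example, with: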
df1
Out:
A B C
0 1 2 3
1 2 8 2
2 4 7 9
3 4 7 9
4 2 8 5
df2
Out:
A B C
0 5 6 7
1 2 8 9
3 5 6 4
4 2 8 10
You can keep track of the duplicated ones via boolean arrays:
cols = ['A', 'B']
# orient must be spelled out: the 'l' abbreviation is no longer accepted by recent pandas
bool1 = df1[cols].isin(df2[cols].to_dict('list')).all(axis=1)
bool2 = df2[cols].isin(df1[cols].to_dict('list')).all(axis=1)
pd.concat([df1[bool1], df2[bool2]])
Out:
A B C
1 2 8 2
4 2 8 5
1 2 8 9
4 2 8 10
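One caveat: isin with a dict tests each column on its own, so a row can pass when its A value and its B value come from different rows of the other frame. A sketch that compares the (A, B) pairs jointly, using the same df1 and df2 as above (rebuilt here from the printed tables, which is an assumption), could rely on MultiIndex.isin:

```python
import pandas as pd

# Frames rebuilt from the printed tables above, including df2's gapped index
df1 = pd.DataFrame({"A": [1, 2, 4, 4, 2], "B": [2, 8, 7, 7, 8], "C": [3, 2, 9, 9, 5]})
df2 = pd.DataFrame({"A": [5, 2, 5, 2], "B": [6, 8, 6, 8], "C": [7, 9, 4, 10]},
                   index=[0, 1, 3, 4])

cols = ["A", "B"]
# Turn the key columns into tuples so each pair is compared as a unit
keys1 = pd.MultiIndex.from_frame(df1[cols])
keys2 = pd.MultiIndex.from_frame(df2[cols])

mask1 = keys1.isin(keys2)  # rows of df1 whose pair exists in df2
mask2 = keys2.isin(keys1)  # rows of df2 whose pair exists in df1
result = pd.concat([df1[mask1], df2[mask2]])
print(result)
```

On this data it selects the same four rows as the boolean arrays above; the two approaches only differ when values match across different rows.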

Another solution uses Index.intersection: set A and B as the index, select the common index values in both DataFrames with loc, and finally concat the results:
df1.set_index(['A','B'], inplace=True)
df2.set_index(['A','B'], inplace=True)
idx = df1.index.intersection(df2.index)
print (idx)
MultiIndex(levels=[[2], [8]],
labels=[[0], [0]],
names=['A', 'B'],
sortorder=0)
df = pd.concat([df1.loc[idx],df2.loc[idx]]).reset_index()
print (df)
A B C
0 2 8 2
1 2 8 9
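Note that set_index(..., inplace=True) modifies df1 and df2 themselves. If you need the originals afterwards, a non-mutating sketch of the same idea might look like this (the frames are rebuilt from the question's tables, and the names left and right are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 4], "B": [2, 8, 7], "C": [3, 2, 9]})
df2 = pd.DataFrame({"A": [5, 2], "B": [6, 8], "C": [7, 9]})

# Work on indexed copies so df1 and df2 keep their original columns
left = df1.set_index(["A", "B"])
right = df2.set_index(["A", "B"])

# (A, B) pairs present in both frames
idx = left.index.intersection(right.index)

df = pd.concat([left.loc[idx], right.loc[idx]]).reset_index()
print(df)
```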

Here is a less efficient method that should preserve duplicates, but involves two merges/joins:
# create a merged DataFrame with variables C_x and C_y with the C values
temp = pd.merge(df1, df2, how='inner', on=['A', 'B'])
# join columns A and B to a stacked DataFrame with the Cs on index
temp[['A', 'B']].join(
    pd.DataFrame({'C': temp[['C_x', 'C_y']].stack()
                       .reset_index(level=1, drop=True)})).reset_index(drop=True)
This returns
A B C
0 2 8 2
1 2 8 9

Related

Merge two DataFrames by combining duplicates and concatenating nonduplicates

I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge them so that column 'A' contains the values from both DataFrames, with values that appear in both merged into a single row.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
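You can use combine_first, but both frames have to be aligned on column A first; otherwise the rows are matched by their positional index rather than by the values in A:
df3 = df2.set_index('A').combine_first(df.set_index('A')).reset_index()
Output (rows come back sorted by A):
A B C
0 1 3.0 7
1 2 4.0 6
2 3 NaN 5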
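If the row order of the desired output matters, a left merge starting from df2 is another option. This is only a sketch: it assumes every A value in df also occurs in df2, as in the question. Otherwise how='outer' would be needed, which sorts the keys.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [3, 2, 1], "C": [5, 6, 7]})

# A left merge keeps df2's row order; A values missing from df get NaN in B
merged = df2.merge(df, on="A", how="left")[["A", "B", "C"]]
print(merged)
```

B becomes a float column because of the NaN in the first row.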

selecting rows of one dataframe using multiple columns of another dataframe in python, pandas

I want to pick only the rows from df1 where the values of columns A and B both match the values of columns A and B in some row of df2. For example, if df1 and df2 are as follows:
df1
A B C
1 2 3
4 5 6
6 7 8
df2
A B D E
1 2 6 8
2 3 7 9
4 5 2 1
The result should be a subset of the rows of df1. In this example it would look like:
df1
A B C
1 2 3
4 5 6
Use:
df = pd.merge(df1, df2[["A", "B"]], on=["A", "B"], how="inner")
print(df)
This prints:
A B C
0 1 2 3
1 4 5 6
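One thing to watch for: if df2 contains the same (A, B) pair more than once, the inner merge repeats the matching df1 row once per occurrence. A minimal sketch that guards against this by de-duplicating the keys first (the extra last row in df2 is hypothetical, added only to show the effect):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 4, 6], "B": [2, 5, 7], "C": [3, 6, 8]})
# Same df2 as in the question, plus one hypothetical repeated (1, 2) row
df2 = pd.DataFrame({"A": [1, 2, 4, 1], "B": [2, 3, 5, 2],
                    "D": [6, 7, 2, 0], "E": [8, 9, 1, 0]})

# Keep each (A, B) pair once so df1 rows are not multiplied by the merge
keys = df2[["A", "B"]].drop_duplicates()
subset = df1.merge(keys, on=["A", "B"], how="inner")
print(subset)
```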

How to re-order the columns based on another dataframe with the same columns but different order

I wonder if there is a handy method to order the columns of a dataframe based on another one that has the same columns but in a different order. Or do I have to write a loop to achieve this?
Try this:
df2 = df2[df1.columns]
Demo:
In [1]: df1 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('abcd'))
In [2]: df2 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('badc'))
In [3]: df1
Out[3]:
a b c d
0 8 3 9 6
1 0 6 4 7
2 7 2 0 7
3 0 5 1 8
4 6 2 5 4
In [4]: df2
Out[4]:
b a d c
0 3 8 0 4
1 7 7 4 2
2 2 7 3 8
3 2 4 9 6
4 3 4 7 1
In [5]: df2 = df2[df1.columns]
In [6]: df2
Out[6]:
a b c d
0 8 3 4 0
1 7 7 2 4
2 7 2 8 3
3 4 2 6 9
4 4 3 1 7
Alternative solution (older pandas versions only):
df2 = df2.reindex_axis(df1.columns, axis=1)
Note: reindex_axis has been deprecated since pandas 0.21.0 and is no longer available in current releases. Use reindex instead:
df2 = df2.reindex(df1.columns, axis=1)
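The two approaches differ when df2 lacks one of the columns: plain selection raises a KeyError, while reindex adds the missing column filled with NaN. A small sketch on one-row frames (the values are taken from the first rows of the demo above):

```python
import pandas as pd

df1 = pd.DataFrame([[8, 3, 9, 6]], columns=list("abcd"))
df2 = pd.DataFrame([[3, 8, 0, 4]], columns=list("badc"))

# Column selection: strict, fails if df2 is missing any column of df1
by_selection = df2[df1.columns]

# reindex: lenient, would create NaN columns for names df2 does not have
by_reindex = df2.reindex(columns=df1.columns)

print(by_selection)
```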

how to delete a duplicate column read from excel in pandas

Data in excel:
a b a d
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Code:
df = pd.read_excel(r"sample.xlsx", sheet_name="Sheet1")
df
a b a.1 d
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
How do I delete the column a.1?
When pandas reads the data from Excel, it automatically renames the second a column to a.1.
I tried df.drop("a.1",index=1), but this does not work.
I have a huge Excel file with duplicate column names, and I am interested in only a few of the columns.
You need to pass axis=1 for drop to work:
In [100]:
df.drop('a.1', axis=1)
Out[100]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Or just pass a list of the cols of interest for column selection:
In [102]:
cols = ['a','b','d']
df[cols]
Out[102]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Label-based selection with loc also works (older answers used .ix, which was removed in pandas 1.0):
In [103]:
df.loc[:, cols]
Out[103]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
If you know the name of the column you want to drop:
df = df[[col for col in df.columns if col != 'a.1']]
and if you have several columns you want to drop:
columns_to_drop = ['a.1', 'b.1', ... ]
df = df[[col for col in df.columns if col not in columns_to_drop]]
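More generally, to drop all duplicated columns (this assumes that every column whose name ends in a dot followed by digits is a duplicate renamed by pandas):
df = df.drop(df.filter(regex=r'\.\d+$').columns, axis=1)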
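A plain regex on the suffix would also remove a genuine column that happens to be named something like x.1. A slightly more defensive sketch treats name.N as a duplicate only when name itself is also a column. The frame below is built in memory to mimic what read_excel returned in the question, and dupes and clean are illustrative names:

```python
import re
import pandas as pd

# Frame shaped like the read_excel result: the second "a" was renamed to "a.1"
df = pd.DataFrame({"a": [1, 2], "b": [2, 3], "a.1": [3, 4], "d": [4, 5]})

dupes = []
for col in df.columns:
    # Match "<base>.<digits>" and check that <base> exists as a column too
    m = re.fullmatch(r"(.+)\.\d+", str(col))
    if m and m.group(1) in df.columns:
        dupes.append(col)

clean = df.drop(columns=dupes)
print(clean)
```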

Pandas dataframe left merge without reindexing

I am wondering if there's a more intuitive way to merge dataframes.
In[140]: df1 = pd.DataFrame(data=[[1,2],[3,4],[10,4],[5,6]], columns=['A','B'], index=[1,3,5,7])
In[141]: df1
Out[141]:
A B
1 1 2
3 3 4
5 10 4
7 5 6
In[142]: df2 = pd.DataFrame(data=[[1,5],[3,4],[10,3],[5,2]], columns=['A','C'], index=[0,2,4,6])
In[143]: df2
Out[143]:
A C
0 1 5
2 3 4
4 10 3
6 5 2
My desired merged result should look like this:
A B C
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2
The key is to retain the original index of the left dataframe.
A left merge does not work because it resets the index:
In[150]: pd.merge(df1, df2, how='left', on='A')
Out[150]:
A B C
0 1 2 5
1 3 4 4
2 10 4 3
3 5 6 2
After some trial and error, I figured out the following way, which works, but I wonder if there's a more intuitive way to achieve the same:
In[151]: pd.merge(df1, df2, how='outer', on=['A'], right_index=True)
Out[151]:
A B C
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2
pd.merge(df1, df2, how='outer', on=['A'], right_index=True)
looks a little weird to me. It says let's join two tables on column A and also the index of the right table with nothing on the left table. I wonder why this works.
I would do something like this:
In [27]: df1['index'] = df1.index
In [28]: df2['index'] = df2.index
In [33]: df_merge = pd.merge(df1, df2, how='left', on=['A'])
In [34]: df_merge
Out[34]:
A B index_x C index_y
0 1 2 1 5 0
1 3 4 3 4 2
2 10 4 5 3 4
3 5 6 7 2 6
In [35]: df_merge = df_merge[['A', 'B', 'C', 'index_x']]
In [36]: df_merge
Out[36]:
A B C index_x
0 1 2 5 1
1 3 4 4 3
2 10 4 3 5
3 5 6 2 7
[4 rows x 4 columns]
In [37]: df_merge.set_index(['index_x'])
Out[37]:
A B C
index_x
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2
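The same idea can be written as a single chain without adding helper columns to the original frames. This is a sketch under the assumption that A is unique in df2; if it is not, left rows are repeated and the restored index will contain duplicates.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [10, 4], [5, 6]],
                   columns=["A", "B"], index=[1, 3, 5, 7])
df2 = pd.DataFrame([[1, 5], [3, 4], [10, 3], [5, 2]],
                   columns=["A", "C"], index=[0, 2, 4, 6])

merged = (
    df1.reset_index()                    # unnamed index becomes a column called "index"
       .merge(df2, on="A", how="left")   # df2's own index is ignored by the merge
       .set_index("index")               # restore df1's original labels
)
merged.index.name = None                 # drop the leftover "index" label
print(merged)
```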
