Pandas dataframe left merge without reindexing - python

I am wondering if there's a more intuitive way to merge dataframes.
In[140]: df1 = pd.DataFrame(data=[[1,2],[3,4],[10,4],[5,6]], columns=['A','B'], index=[1,3,5,7])
In[141]: df1
Out[141]:
A B
1 1 2
3 3 4
5 10 4
7 5 6
In[142]: df2 = pd.DataFrame(data=[[1,5],[3,4],[10,3],[5,2]], columns=['A','C'], index=[0,2,4,6])
In[143]: df2
Out[143]:
A C
0 1 5
2 3 4
4 10 3
6 5 2
My desired merged result should look like this:
A B C
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2
The key is to retain the original left dataframe index.
A left merge does not work because it resets the index:
In[150]: pd.merge(df1, df2, how='left', on='A')
Out[150]:
A B C
0 1 2 5
1 3 4 4
2 10 4 3
3 5 6 2
After some trial and error, I figured out the following approach, which works, but I wonder if there's a more intuitive way to achieve the same result.
In[151]: pd.merge(df1, df2, how='outer', on=['A'], right_index=True)
Out[151]:
A B C
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2

pd.merge(df1, df2, how='outer', on=['A'], right_index=True)
looks a little weird to me. It says let's join two tables on column A and also the index of the right table with nothing on the left table. I wonder why this works.
I would do something like this:
In [27]: df1['index'] = df1.index
In [28]: df2['index'] = df2.index
In [33]: df_merge = pd.merge(df1, df2, how='left', on=['A'])
In [34]: df_merge
Out[34]:
A B index_x C index_y
0 1 2 1 5 0
1 3 4 3 4 2
2 10 4 5 3 4
3 5 6 7 2 6
In [35]: df_merge = df_merge[['A', 'B', 'C', 'index_x']]
In [36]: df_merge
Out[36]:
A B C index_x
0 1 2 5 1
1 3 4 4 3
2 10 4 3 5
3 5 6 2 7
[4 rows x 4 columns]
In [37]: df_merge.set_index(['index_x'])
Out[37]:
A B C
index_x
1 1 2 5
3 3 4 4
5 10 4 3
7 5 6 2
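A shorter variant of the same idea, as a minimal sketch: move the left index into a column with reset_index, merge, then restore it with set_index. The column name 'index' is what reset_index produces for an unnamed index; the name `merged` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [10, 4], [5, 6]],
                   columns=['A', 'B'], index=[1, 3, 5, 7])
df2 = pd.DataFrame([[1, 5], [3, 4], [10, 3], [5, 2]],
                   columns=['A', 'C'], index=[0, 2, 4, 6])

merged = (
    df1.reset_index()                    # left index becomes a column named 'index'
       .merge(df2, on='A', how='left')   # ordinary left merge on A
       .set_index('index')               # put the original left index back
       .rename_axis(None)                # drop the leftover index name
)
print(merged)
```

This assumes df1 has no column already called 'index'; if it does, reset_index would raise, and the index would need a different name first.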

Related

Merge two DataFrames by combining duplicates and concatenating nonduplicates

I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge them so that column 'A' contains the values from both DataFrames, with rows that share the same 'A' value combined into one.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
You can use combine_first:
df2 = df2.combine_first(df)
Output:
A B C
0 1 3.0 5
1 2 4.0 6
2 3 NaN 7
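Note that combine_first aligns rows on the index, not on the values in column 'A', so it only pairs up rows that happen to share an index label. As a hedged alternative (not the answer above), a merge keyed on 'A' matches rows by value. The sketch below assumes the row order of df2 should be kept; the name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1], 'C': [5, 6, 7]})

# Left-merge onto df2 so its row order is kept; A values missing from df get NaN in B.
result = df2.merge(df, on='A', how='left')[['A', 'B', 'C']]
print(result)
```

If values of 'A' that appear only in df should also be kept, how='outer' would be the option to try instead.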

difference between two dataframes in Pandas

I am trying to find the difference between two dataframes; the result should contain the rows that match the first dataframe. Since ids 6 and 7 are not present in df2, their count values are left unchanged.
[Image: my two dataframes]
[Image: resulting dataframe]
Use sub with set_index to align the DataFrames on the id column, then reindex to keep only the ids from df1:
df = (df1.set_index('id')
         .sub(df2.set_index('id'), fill_value=0)
         .reindex(df1['id'])
         .astype(int)
         .reset_index())
print(df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Another solution uses merge with a left join, then subtracts with sub, extracting the count_ column with pop:
df = df1.merge(df2, on='id', how='left', suffixes=('','_'))
df['count'] = df['count'].sub(df.pop('count_'), fill_value=0).astype(int)
print (df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Setup:
df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'count':[3,5,6,7,2,9,4]})
print (df1)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 6 9
6 7 4
df2 = pd.DataFrame({'id':[1,2,3,4,5,8,9],
'count':[3,5,6,7,2,4,2]})
print (df2)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 8 4
6 9 2
Use:
temp = pd.merge(df1, df2, how='left', on='id').fillna(0)
temp['count'] = temp['count_x'] - temp['count_y']
temp[['id', 'count']]
id count
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 0.0
5 6 9.0
6 7 4.0
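The merge-based result above comes out as float because of the NaN fill. As one more sketch under the same setup, Series.map can look up each id's count in df2 directly, which avoids the extra columns. The names `lookup` and `out` are illustrative, and the sketch assumes ids are unique in df2.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'count': [3, 5, 6, 7, 2, 9, 4]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 8, 9],
                    'count': [3, 5, 6, 7, 2, 4, 2]})

# Series of df2 counts keyed by id (requires unique ids in df2).
lookup = df2.set_index('id')['count']

out = df1[['id']].copy()
# ids missing from df2 map to NaN, which is treated as 0 before subtracting.
out['count'] = (df1['count'] - df1['id'].map(lookup).fillna(0)).astype(int)
print(out)
```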

How to re-order the columns based on another dataframe with the same columns but different order

I wonder if there is a handy method to order the columns of a dataframe based on another one that has the same columns but with different order. Or, do I have to make a loop to achieve this?
Try this:
df2 = df2[df1.columns]
Demo:
In [1]: df1 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('abcd'))
In [2]: df2 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('badc'))
In [3]: df1
Out[3]:
a b c d
0 8 3 9 6
1 0 6 4 7
2 7 2 0 7
3 0 5 1 8
4 6 2 5 4
In [4]: df2
Out[4]:
b a d c
0 3 8 0 4
1 7 7 4 2
2 2 7 3 8
3 2 4 9 6
4 3 4 7 1
In [5]: df2 = df2[df1.columns]
In [6]: df2
Out[6]:
a b c d
0 8 3 4 0
1 7 7 2 4
2 7 2 8 3
3 4 2 6 9
4 4 3 1 7
Alternative solution:
df2 = df2.reindex_axis(df1.columns, axis=1)
Note: reindex_axis is deprecated since pandas 0.21.0 and was removed in pandas 1.0; use reindex instead:
df2 = df2.reindex(df1.columns, axis=1)
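For a self-contained comparison of the two approaches, here is a small sketch with fixed values (the first two rows of the demo above). The names `by_selection` and `by_reindex` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[8, 3, 9, 6], [0, 6, 4, 7]], columns=list('abcd'))
df2 = pd.DataFrame([[3, 8, 0, 4], [7, 7, 4, 2]], columns=list('badc'))

# Column selection: raises KeyError if df2 lacks one of df1's columns.
by_selection = df2[df1.columns]

# reindex: a column missing from df2 would be added and filled with NaN instead.
by_reindex = df2.reindex(columns=df1.columns)

print(by_selection)
```

When both frames have exactly the same set of columns, the two give the same result; they differ only in how a missing column is handled.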

Concatenate data frames by column values

How can I merge the following two data frames on columns A and B:
df1
A B C
1 2 3
2 8 2
4 7 9
df2
A B C
5 6 7
2 8 9
The result should contain only the matching rows from both:
df3
A B C
2 8 2
2 8 9
You can concatenate them and drop the ones that are not duplicated:
conc = pd.concat([df1, df2])
conc[conc.duplicated(subset=['A', 'B'], keep=False)]
Out:
A B C
1 2 8 2
1 2 8 9
If you have duplicates,
df1
Out:
A B C
0 1 2 3
1 2 8 2
2 4 7 9
3 4 7 9
4 2 8 5
df2
Out:
A B C
0 5 6 7
1 2 8 9
3 5 6 4
4 2 8 10
You can keep track of the duplicated ones via boolean arrays:
cols = ['A', 'B']
bool1 = df1[cols].isin(df2[cols].to_dict('list')).all(axis=1)
bool2 = df2[cols].isin(df1[cols].to_dict('list')).all(axis=1)
pd.concat([df1[bool1], df2[bool2]])
Out:
A B C
1 2 8 2
4 2 8 5
1 2 8 9
4 2 8 10
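One caveat: isin with a dict tests each column independently, so a row could pass when its A value and its B value both occur in the other frame but in different rows. Below is a hedged sketch that compares (A, B) pairs as a unit, using MultiIndex.from_frame (pandas 0.24+) and MultiIndex.isin. The names `keys1`, `keys2`, and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 4, 4, 2],
                    'B': [2, 8, 7, 7, 8],
                    'C': [3, 2, 9, 9, 5]})
df2 = pd.DataFrame({'A': [5, 2, 5, 2],
                    'B': [6, 8, 6, 8],
                    'C': [7, 9, 4, 10]}, index=[0, 1, 3, 4])

cols = ['A', 'B']
# Build one key per row so that A and B are matched together, not separately.
keys1 = pd.MultiIndex.from_frame(df1[cols])
keys2 = pd.MultiIndex.from_frame(df2[cols])

result = pd.concat([df1[keys1.isin(keys2)], df2[keys2.isin(keys1)]])
print(result)
```

On this data the result is the same four rows as above; the difference only shows up when A and B values match across different rows.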
Solution with Index.intersection: select the matching rows in both DataFrames with loc, then concat them together:
df1.set_index(['A','B'], inplace=True)
df2.set_index(['A','B'], inplace=True)
idx = df1.index.intersection(df2.index)
print (idx)
MultiIndex(levels=[[2], [8]],
labels=[[0], [0]],
names=['A', 'B'],
sortorder=0)
df = pd.concat([df1.loc[idx],df2.loc[idx]]).reset_index()
print (df)
A B C
0 2 8 2
1 2 8 9
Here is a less efficient method that should preserve duplicates, but involves two merge/joins:
# create a merged DataFrame with variables C_x and C_y with the C values
temp = pd.merge(df1, df2, how='inner', on=['A', 'B'])
# join columns A and B to a stacked DataFrame with the Cs on index
temp[['A', 'B']].join(
    pd.DataFrame({'C': temp[['C_x', 'C_y']].stack()
                           .reset_index(level=1, drop=True)})).reset_index(drop=True)
This returns
A B C
0 2 8 2
1 2 8 9

how to delete a duplicate column read from excel in pandas

Data in excel:
a b a d
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Code:
df = pd.read_excel(r"sample.xlsx", sheet_name="Sheet1")
df
a b a.1 d
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
How do I delete the column a.1?
When pandas reads the data from Excel, it automatically renames the second a column to a.1.
I tried df.drop("a.1",index=1), but this does not work.
I have a huge Excel file with duplicate column names, and I am interested in only a few of the columns.
You need to pass axis=1 for drop to work:
In [100]:
df.drop('a.1', axis=1)
Out[100]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Or just pass a list of the cols of interest for column selection:
In [102]:
cols = ['a','b','d']
df[cols]
Out[102]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Also works with loc (the older ix indexer has been removed from pandas):
In [103]:
df.loc[:,cols]
Out[103]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
If you know the name of the column you want to drop:
df = df[[col for col in df.columns if col != 'a.1']]
and if you have several columns you want to drop:
columns_to_drop = ['a.1', 'b.1', ... ]
df = df[[col for col in df.columns if col not in columns_to_drop]]
More generally, drop all columns that pandas renamed with a numeric suffix:
df = df.drop(df.filter(regex=r'\.\d').columns, axis=1)
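To try these without an Excel file, here is a small sketch on a frame built by hand with the same column names that read_excel produced in the question. It uses the columns= keyword of drop (available since pandas 0.21) in place of axis=1. The names `dropped_by_name`, `suffixed`, and `dropped_by_pattern` are illustrative.

```python
import pandas as pd

# Stand-in for the frame read from Excel: the second 'a' was renamed to 'a.1'.
df = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]],
                  columns=['a', 'b', 'a.1', 'd'])

# Drop a single known column by name.
dropped_by_name = df.drop(columns=['a.1'])

# Drop every column whose name ends in ".<digits>".
suffixed = df.filter(regex=r'\.\d+$').columns
dropped_by_pattern = df.drop(columns=suffixed)

print(dropped_by_pattern)
```

The pattern-based version would also drop a genuine column whose name happens to end in a dot followed by digits, so it is worth checking `suffixed` before dropping.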
