Join pandas tables, keeping one index - python

What I want to do is join two dataframes on columns and keep the index of one of them (but the index is unrelated to whether I join them or not).
For example, if df1 is the dataframe that has certain timestamps as its index that I would like to keep, then to join with df2 on the 'key' column, my expected code would be
df3 = pd.merge(df1, df2, on='key', left_index=True)
I would then expect df3 to have all rows of df1 and df2 where df3['key'] == df1['key'] == df2['key'] and df3['key'].index == df1['key'].index.
However, this is not the case. In fact, you find that the index of df3 is actually the index of df2. The reverse is true for right_index=True.
I've considered submitting a bug report, but rereading the documentation leads me to believe that (while completely counterintuitive) this may not be incorrect behavior.
What is the proper way to join two tables, keeping one of the indices?
EDIT:
I am doing an inner join on 'key'. That is not the issue. The issue is that I want the resulting rows to have the index of one of the dataframes.
For example, if I have the following sets of data in two dataframes:
df1 = pd.DataFrame(np.arange(4).reshape(2,2))
df2 = pd.DataFrame(np.arange(4).reshape(2,2), columns=[0,2])
df2.index = df2.index.map(lambda x: x + 10)
That is,
>>> df1
   0  1
0  0  1
1  2  3
>>> df2
    0  2
10  0  1
11  2  3
I can run pd.merge(df1, df2, on=0) which (as expected) yields
>>> pd.merge(df1, df2, on=0)
   0  1  2
0  0  1  1
1  2  3  3
Notice, however, that df2 has a different index. In my actual data, this is timestamp data that I want to keep. It isn't used in the joining at all, but it does need to persist. I could just add a column to df2 to keep it around, but that isn't what I want to do. :)
What I would like is to do something like pd.merge(df1, df2, on=0, right_index=True) and receive:
    0  1  2
10  0  1  1
11  2  3  3
However, I actually get the opposite of this:
>>> pd.merge(df1, df2, on=0, right_index=True)
   0  1  2
0  0  1  1
1  2  3  3
while reversing them inexplicably works.
>>> pd.merge(df1, df2, on=0, left_index=True)
    0  1  2
10  0  1  1
11  2  3  3

I think what you're looking for is akin to a FULL OUTER JOIN in SQL, in which case the following should work:
df3 = pd.merge(df1, df2, on='key', how='outer')
As for keeping just one index, that should happen automatically in this case, since the outer join keeps all keys.
Using your example:
In [4]: df1['key'] = df1.index
In [5]: df2['key'] = df2.index
In [6]: df3 = pd.merge(df1, df2, on='key', how='outer')
In [7]: df3
Out[7]:
   0_x   1  key  0_y    2
0    0   1    0  NaN  NaN
1    2   3    1  NaN  NaN
2  NaN NaN   10    0    1
3  NaN NaN   11    2    3
So in this case a new index is created, but could be re-assigned the original values from 'key' if desired.
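As a side note, one way to keep df2's index through the questioner's inner join is to lift that index into a regular column before merging, then restore it afterwards. A minimal sketch using the frames from the question's edit:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2))
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=[0, 2])
df2.index = df2.index.map(lambda x: x + 10)

# Lift df2's index into a column so the merge cannot discard it,
# then restore it as the index of the result.
df3 = pd.merge(df1, df2.reset_index(), on=0).set_index('index')
df3.index.name = None
print(df3)
#     0  1  2
# 10  0  1  1
# 11  2  3  3
```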

Related

Combine (merge/join/concat) two dataframes by mask (leave only first matches) in pandas [python]

I have df1:
  match
0     a
1     a
2     b
And I have df2:
  match  number
0     a       1
1     b       2
2     a       3
3     a       4
I want to combine these two dataframes so that only the first matches remain, like this:
  match_df1 match_df2  number
0         a         a       1
1         a         a       3
2         b         b       2
I've tried different combinations of inner join, merge and pd.concat, but nothing gave me anything close to the desired output. Is there any pythonic way to make it without any loops, just with pandas methods?
Update:
For now came up with this solution. Not sure if it's the most efficient. Your help would be appreciated!
df = pd.merge(df1, df2, on='match').drop_duplicates('number')
for match, count in df1['match'].value_counts().iteritems():
    df = df.drop(index=df[df['match'] == match][count:].index)
In your case, you can do this with groupby and cumcount before the merge. Notice that I do not keep two match columns, since they are identical:
df1['key'] = df1.groupby('match').cumcount()
df2['key'] = df2.groupby('match').cumcount()
out = df1.merge(df2)
Out[418]:
  match  key  number
0     a    0       1
1     a    1       3
2     b    0       2
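The answer's snippet, made self-contained; the frame construction below is an assumption reconstructed from the printed tables in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'match': ['a', 'a', 'b']})
df2 = pd.DataFrame({'match': ['a', 'b', 'a', 'a'],
                    'number': [1, 2, 3, 4]})

# Number the occurrences of each value within its group; the n-th 'a'
# in df1 then only joins with the n-th 'a' in df2.
df1['key'] = df1.groupby('match').cumcount()
df2['key'] = df2.groupby('match').cumcount()
out = df1.merge(df2)  # joins on the common columns 'match' and 'key'
print(out)
#   match  key  number
# 0     a    0       1
# 1     a    1       3
# 2     b    0       2
```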

Pandas Inner Join with axis is one

I was working with an inner join using concat in pandas, with the two DataFrames below:
df1 = pd.DataFrame([['a',1],['b',2]], columns=['letter','number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
pd.concat([df1,df3], join='inner')
The output is below:
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
But after using axis=1 the output is as below:
pd.concat([df1, df3], join='inner', axis=1)
  letter  number letter  number animal
0      a       1      c       3    cat
1      b       2      d       4    dog
Why is it showing the animal column when doing an inner join with axis=1?
In pandas.concat(), the axis argument defines whether to concatenate the dataframes along the index or along the columns:
axis=0  # along the index (default value)
axis=1  # along the columns
When you concatenated df1 and df3 with the default axis=0, pandas stacked the rows, and join='inner' kept only the columns common to both frames, thus the output is
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
But when you used axis=1, pandas placed the frames side by side, aligning on the index; every column from both frames is kept, which is why the output shows animal:
  letter  number letter  number animal
0      a       1      c       3    cat
1      b       2      d       4    dog
EDIT:
You asked: "But inner join only joins the same columns, right? Then why is it showing the 'animal' column?"
With axis=1 the join works on indexes, not columns, and right now both dataframes have the same two index labels, so nothing is dropped.
To illustrate, I have added another row to df3. Suppose df3 is
   0  1     2
0  c  3   cat
1  d  4   dog
2  e  5  bird
Now, if you concat df1 and df3:
pd.concat([df1, df3], join='inner', axis=1)
  letter  number  0  1    2
0      a       1  c  3  cat
1      b       2  d  4  dog
pd.concat([df1, df3], join='outer', axis=1)
  letter  number  0  1     2
0      a     1.0  c  3   cat
1      b     2.0  d  4   dog
2    NaN     NaN  e  5  bird
As you can see, with the inner join only indexes 0 and 1 are in the output, but with the outer join all the indexes are in the output, with NaN fill values.
The default value of axis is 0, so in the first concat call the concatenation happens along the rows. When you set axis=1, the operation is similar to:
df1.merge(df3, how="inner", left_index=True, right_index=True)
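Putting the two behaviors side by side, a minimal runnable sketch (df3 is given the extra 'bird' row from the edit, and named columns are assumed for readability):

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog'], ['e', 5, 'bird']],
                   columns=['letter', 'number', 'animal'])

# axis=0: rows are stacked, so join='inner' intersects the COLUMNS.
rows = pd.concat([df1, df3], join='inner')
# axis=1: frames sit side by side, so join='inner' intersects the INDEX.
cols = pd.concat([df1, df3], join='inner', axis=1)

print(rows.columns.tolist())  # ['letter', 'number']
print(cols.index.tolist())    # [0, 1]
```

Note that 'animal' survives in the axis=1 result, because there the inner join filters index labels, not columns.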

Compare 2 pandas.DataFrames, get differences and print only rows that changed from the first one

I have 2 dataframes which I am comparing with below snippet:
df3 = pandas.concat([df1, df2]).drop_duplicates(keep=False)
It works fine: it compares both, and as output I get the rows that differ between the two of them.
What I would like to achieve is to compare 2 dataframes to get rows that are different but as an output only get/keep rows from the first DataFrame.
Is there an easy way to do this?
I would use ~isin():
df1.set_index(list(df1.columns), inplace=True)
df2.set_index(list(df2.columns), inplace=True)
df1[~df1.index.isin(df2.index)].reset_index()
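A self-contained run of the same ~isin() idea, on small hypothetical frames (the column names x and y are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df2 = pd.DataFrame({'x': [1, 3], 'y': [4, 9]})

# Turn every column into the index so whole rows can be compared at once.
df1.set_index(list(df1.columns), inplace=True)
df2.set_index(list(df2.columns), inplace=True)

# Rows of df1 whose full (x, y) combination does not appear in df2.
res = df1[~df1.index.isin(df2.index)].reset_index()
print(res)
#    x  y
# 0  2  5
# 1  3  6
```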
If you only want the unique rows from the first dataframe, you really want a left join.
df3 = df1.merge(df2.drop_duplicates(), on='your_column_here',
                how='left', indicator=True)
Now you can check the _merge column and filter for left_only:
   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only
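The filtering step looks like this; the frames below are reconstructed to reproduce the printed table, and the merge is on all common columns:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Left join on all common columns; indicator=True adds the _merge column.
df3 = df1.merge(df2.drop_duplicates(), how='left', indicator=True)
# Keep only the rows that came from df1 alone, then drop the helper column.
changed = df3[df3['_merge'] == 'left_only'].drop(columns='_merge')
print(changed)
#    col1  col2
# 3     4    13
# 4     5    14
# 5     3    10
```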
One way is to pre-mark each df's rows with a number (like .assign(mark=1)) and drop the helper column afterwards:
df1 = pd.DataFrame(np.random.randint(-10, 10, 20))  # dummy data
df2 = pd.DataFrame(np.random.randint(-10, 10, 20))  # dummy data
# exclude the helper 'mark' column from the duplicate check; otherwise the
# differing marks would prevent any cross-frame row from ever being dropped
df3 = pd.concat([df1.assign(mark=1), df2.assign(mark=2)]).drop_duplicates(subset=[0], keep=False)
print(df3[df3['mark'].eq(1)].drop(columns='mark'))
Prints:
0
2 -6
3 -8
14 3
16 -3

How to compare two datasets and extract the differences between them in python?

I have two datasets with the same attributes but in some of the rows the information is changed. I want to extract the rows in which the information has been changed.
pandas offers a rich API that you can use to manipulate data however you want, and the merge method is one of its tools: a high-performance, in-memory join operation idiomatically very similar to joins in relational databases like SQL.
df1 = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
print(df1)
   A  B
0  1  4
1  2  5
2  3  6
df2 = pd.DataFrame({'A':[1,10,3],'B':[4,5,6]})
print(df2)
    A  B
0   1  4
1  10  5
2   3  6
df3 = df1.merge(df2.drop_duplicates(),how='right', indicator=True)
print(df3)
    A  B      _merge
0   1  4        both
1   3  6        both
2  10  5  right_only
As you can see, there is a new column named _merge that describes how each row was merged: both means the row exists in both dataframes, while right_only means it exists only in the right dataframe, which in this case is df2.
If you want to get only the rows that changed, you can filter on the _merge column:
df3 = df3[df3['_merge']=='right_only']
    A  B      _merge
2  10  5  right_only
Note: you can merge using a left join by changing the how argument to 'left'. This will grab everything in the left dataframe (df1), and if a row exists in df2 as well then the _merge column will show both. Take a look at the documentation for more details.
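To illustrate that note, the same frames with how='left' (a sketch, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 10, 3], 'B': [4, 5, 6]})

# Same idea, but from df1's perspective: left_only marks rows of df1
# that have no exact match in df2.
df3 = df1.merge(df2.drop_duplicates(), how='left', indicator=True)
diff = df3[df3['_merge'] == 'left_only']
print(diff)
#    A  B     _merge
# 1  2  5  left_only
```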

Consolidating dataframes

I have 3 pandas data frames with matching indices. Some operations have trimmed data frames in different ways (removed rows), so that some indices in one data frame may not exist in the other.
I'd like to consolidate all 3 data frames, so they all contain rows with indices that are present in all 3 of them. How is this achievable?
import pandas as pd
data = pd.DataFrame.from_dict({'a': [1,2,3,4], 'b': [3,4,5,6], 'c': [6,7,8,9]})
a = pd.DataFrame(data['a'])
b = pd.DataFrame(data['b'])
c = pd.DataFrame(data['c'])
a = a[a['a'] <= 3]
b = b[b['b'] >= 4]
# some operation here that removes rows that aren't present in all (intersection of all dataframe's indices)
print(a)
   a
1  2
2  3
print(b)
   b
1  4
2  5
print(c)
   c
1  7
2  8
Update
Sorry, I got carried away and forgot what I wanted to achieve when I wrote the examples. The actual intent was to keep the 3 dataframes separate. Apologies for the misleading example (I corrected it now).
Use merge and pass left_index=True and right_index=True; the default merge type is inner, so only index values that exist in both left and right will be kept.
In [6]:
a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
Out[6]:
   a  b  c
1  2  4  7
2  3  5  8

[2 rows x 3 columns]
To modify the original dataframes so that they now contain only the rows that exist in all of them, you can do this:
In [12]:
merged = a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
merged
Out[12]:
   a  b  c
1  2  4  7
2  3  5  8
In [14]:
a = a.loc[merged.index]
b = b.loc[merged.index]
c = c.loc[merged.index]
In [15]:
print(a)
print(b)
print(c)
   a
1  2
2  3
   b
1  4
2  5
   c
1  7
2  8
So we merge all of them on index values that are present in all of them and then use the index to filter the original dataframes.
Take a look at concat, which can be used for a variety of combination operations. Here you want the join type set to inner (because you want the intersection) and axis set to 1 (combining columns).
In [123]: pd.concat([a,b,c], join='inner', axis=1)
Out[123]:
   a  b  c
1  2  4  7
2  3  5  8
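If you want to keep the three frames separate (as the update asks), you can also intersect the indices directly with Index.intersection, without building a merged or concatenated frame first; a sketch using the question's own setup:

```python
import pandas as pd

data = pd.DataFrame.from_dict({'a': [1, 2, 3, 4], 'b': [3, 4, 5, 6], 'c': [6, 7, 8, 9]})
a = pd.DataFrame(data['a'])
b = pd.DataFrame(data['b'])
c = pd.DataFrame(data['c'])
a = a[a['a'] <= 3]
b = b[b['b'] >= 4]

# Intersect the three indices directly, then trim each frame.
common = a.index.intersection(b.index).intersection(c.index)
a, b, c = a.loc[common], b.loc[common], c.loc[common]
print(list(common))  # [1, 2]
```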
