I was working with an inner join using concat in pandas, with the two DataFrames below:
df1 = pd.DataFrame([['a',1],['b',2]], columns=['letter','number'])
df3 = pd.DataFrame([['c',3,'cat'],['d',4,'dog']],
columns=['letter','number','animal'])
pd.concat([df1,df3], join='inner')
The output is below:
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
But after adding axis=1 the output is as below:
pd.concat([df1,df3], join='inner', axis=1)
  letter  number letter  number animal
0      a       1      c       3    cat
1      b       2      d       4    dog
Why is it showing the animal column when doing an inner join with axis=1?
In pd.concat(), the axis argument defines whether the DataFrames are concatenated along the index or along the columns:
axis=0  # along the index, i.e. stacked vertically (default)
axis=1  # along the columns, i.e. placed side by side
When you concatenated df1 and df3 with the default axis=0, pandas stacked the rows, and join='inner' kept only the columns common to both frames. Thus the output is
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
But when you used axis=1, pandas placed the frames side by side, so join='inner' was applied to the row index instead, and both frames share indexes 0 and 1 completely. That's why the output is
  letter  number letter  number animal
0      a       1      c       3    cat
1      b       2      d       4    dog
EDIT:
You asked: "But an inner join only joins the same columns, right? Then why is it showing the 'animal' column?"
Because with axis=1 the join works on the index, not on the columns, and right now both DataFrames have the same two row indexes (0 and 1), so nothing is dropped.
To illustrate, I have added another row to df3. Let's suppose df3 is
   0  1     2
0  c  3   cat
1  d  4   dog
2  e  5  bird
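For reference, this 3-row df3 can be built as follows (a sketch; the original only shows the printed frame, so the default integer column names are an assumption):
import pandas as pd

# omitting columns= yields the default integer labels 0, 1, 2 seen above
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog'], ['e', 5, 'bird']])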
Now, if you concat df1 and df3:
pd.concat([df1,df3], join='inner', axis=1)
  letter  number  0  1    2
0      a       1  c  3  cat
1      b       2  d  4  dog
pd.concat([df1,df3], join='outer', axis=1)
  letter  number  0  1     2
0      a     1.0  c  3   cat
1      b     2.0  d  4   dog
2    NaN     NaN  e  5  bird
As you can see, the inner join keeps only indexes 0 and 1, which are present in both frames, while the outer join keeps all indexes and fills the gaps with NaN values.
The default value of axis is 0, so in the first concat call the concatenation happens along the rows. When you set axis=1, the operation is similar to
df1.merge(df3, how="inner", left_index=True, right_index=True)
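A quick way to check that similarity, reusing the question's frames (a sketch; note one difference: merge suffixes the overlapping letter/number column names, while concat keeps the duplicate names as-is):
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])

via_concat = pd.concat([df1, df3], join='inner', axis=1)
via_merge = df1.merge(df3, how='inner', left_index=True, right_index=True)

print(via_concat.columns.tolist())  # ['letter', 'number', 'letter', 'number', 'animal']
print(via_merge.columns.tolist())   # ['letter_x', 'number_x', 'letter_y', 'number_y', 'animal']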
Related
I have df1:
  match
0     a
1     a
2     b
And I have df2:
  match  number
0     a       1
1     b       2
2     a       3
3     a       4
I want to combine these two dataframes so that only the first matches remain, like this:
  match_df1 match_df2  number
0         a         a       1
1         a         a       3
2         b         b       2
I've tried different combinations of inner join, merge and pd.concat, but nothing gave me anything close to the desired output. Is there a pythonic way to do it without loops, just with pandas methods?
Update:
For now came up with this solution. Not sure if it's the most efficient. Your help would be appreciated!
df = pd.merge(df1, df2, on='match').drop_duplicates('number')
for match, count in df1['match'].value_counts().items():
    df = df.drop(index=df[df['match'] == match][count:].index)
In your case you can do it with groupby and cumcount before the merge. Notice I do not keep two match columns, since they are the same.
df1['key'] = df1.groupby('match').cumcount()
df2['key'] = df2.groupby('match').cumcount()
out = df1.merge(df2)
Out[418]:
  match  key  number
0     a    0       1
1     a    1       3
2     b    0       2
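If you do not want the helper column in the final output, a small follow-up sketch:
# merge() aligns on both 'match' and 'key', the columns shared by the
# two frames; the helper can be dropped once rows are paired up
out = df1.merge(df2).drop(columns='key')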
I have two DataFrames.
print(df1)
  key  value
0   A      2
1   B      3
2   C      2
3   D      3
print(df2)
  key  value
0   B      3
1   D      1
2   E      1
3   F      3
What I want is an outer merge on key that picks whichever value is not NaN. Which one it chooses if both are present is not that important, though the mean would be a nice touch.
print(df3)
  key  value
0   A      2
1   B      3
3   C      2
4   D      2
5   E      1
6   F      3
I tried:
df3 = df1.merge(df2, on='key', how='outer')
but it generates two new columns (value_x and value_y). I could just do my calculations afterwards, but I'm sure there is an easier solution that I just could not find.
Thanks for your help.
This works for me. With keep='last' the duplicates are dropped in order of entry, so the dupes from df1 are dropped and those from df2 are kept; rows where the key or value happens to be NaN can then be removed with .dropna():
dfs = pd.concat([df1, df2]).drop_duplicates(subset=['key'], keep='last').dropna(how='any')
  key  value
0   A      2
2   C      2
0   B      3
1   D      1
2   E      1
3   F      3
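If you would rather average the values when a key exists in both frames (the "nice touch" mentioned in the question), a groupby sketch along the same lines:
# keys present in both frames get the mean of their two values;
# unique keys keep their single value (the result dtype becomes float)
out = pd.concat([df1, df2]).groupby('key', as_index=False)['value'].mean()
With the sample data this produces exactly the expected output, including D = 2, the mean of 3 and 1.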
I am using Python to merge two dataframes:
join = pd.merge(df1, df2, on=["A", "B"], how="left")
Table 1:
A  B
a  1
b  2
c  3
Table 2:
A  B  Flag   C
a  1     0  20
b  2     1  40
c  3     0  60
a  1     1  80
b  2     0  10
The result that I get after left join is:
A  B  Flag   C
a  1     0  20
a  1     1  80
b  2     1  40
b  2     0  10
c  3     0  60
Here the rows for (a, 1) and (b, 2) appear twice because of table 2. I want to keep just one row per key, based on the Flag column: of the two duplicate rows, keep the one whose Flag value is 1.
So the final expected output is:
A  B  Flag   C
a  1     1  80
b  2     1  40
c  3     0  60
Is there any pythonic way to do it?
# raise preferred lines to the top
df2 = df2.sort_values(by='Flag', ascending=False)
# deduplicate
df2 = df2.drop_duplicates(subset=['A','B'], keep='first')
# merge
pd.merge(df1, df2, on=['A','B'])
   A  B  Flag   C
0  a  1     1  80
1  b  2     1  40
2  c  3     0  60
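An equivalent per-group pick uses idxmax, if you prefer not to sort the whole frame (a sketch assuming Flag is numeric):
# for each (A, B) pair keep the df2 row with the largest Flag,
# then merge the reduced table back onto df1
best = df2.loc[df2.groupby(['A', 'B'])['Flag'].idxmax()]
pd.merge(df1, best, on=['A', 'B'])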
The concept is similar to what you would do in SQL: build a separate table with the selection criteria (in this case the maximum Flag per group), keeping enough columns to match each observation back to the joined table.
join = pd.merge(df1, df2, how="left")
maximums = join.groupby(by='A', as_index=False)['Flag'].max()
join = pd.merge(join, maximums, on=['A', 'Flag'])
Try deduplicating on Flag right after the join (note that pd.merge raises a MergeError if on is combined with left_index/right_index, so those flags cannot be used here):
join = pd.merge(df1, df2, on=["A", "B"], how="left")
join = join.sort_values("Flag", ascending=False).drop_duplicates(["A", "B"]).sort_index()
print(join)
What I want to do is join two dataframes on columns and keep the index of one of them (but the index is unrelated to whether I join them or not).
For example, if df1 is the dataframe that has certain timestamps as its index that I would like to keep, then to join with df2 on the 'key' column, my expected code would be
df3 = pd.merge(df1, df2, on='key', left_index=True)
I would then expect df3 to have all rows of df1 and df2 where df3[key] == df1[key] == df2[key] and df3[key].index == df1[key].index.
However, this is not the case. In fact, you find that the index of df3 is actually the index of df2. The reverse is true for right_index=True.
I've considered submitting a bug report, but rereading the documentation leads me to believe that (while completely counterintuitive) this may not be incorrect behavior.
What is the proper way to join two tables, keeping one of the indices?
EDIT:
I am doing an inner join on 'key'. That is not the issue. The issue is that I want the resulting rows to have the index of one of the dataframes.
For example, if I have the following sets of data in two dataframes:
df1 = pd.DataFrame(np.arange(4).reshape(2,2))
df2 = pd.DataFrame(np.arange(4).reshape(2,2), columns=[0,2])
df2.index = df2.index.map(lambda x: x + 10)
That is,
>>> df1
   0  1
0  0  1
1  2  3
>>> df2
    0  2
10  0  1
11  2  3
I can run pd.merge(df1, df2, on=0) which (as expected) yields
>>> pd.merge(df1, df2, on=0)
   0  1  2
0  0  1  1
1  2  3  3
Notice, however, that df2 has a different index. In my actual data, this is timestamp data that I want to keep. It isn't used in the joining at all, but it does need to persist. I could just add a column to df2 to keep it around, but that isn't what I want to do. :)
What I would like is to do something like pd.merge(df1, df2, on=0, right_index=True) and receive:
    0  1  2
10  0  1  1
11  2  3  3
However, I actually get the opposite of this:
>>> pd.merge(df1, df2, on=0, right_index=True)
   0  1  2
0  0  1  1
1  2  3  3
while reversing them inexplicably works.
>>> pd.merge(df1, df2, on=0, left_index=True)
    0  1  2
10  0  1  1
11  2  3  3
I think what you're looking for is akin to a full outer join in SQL, in which case the following should work:
df3 = pd.merge(df1, df2, on='key', how='outer')
As for keeping just one index, that should be done automatically in this case now that outer join is keeping all keys.
Using your example:
In [4]: df1['key'] = df1.index
In [5]: df2['key'] = df2.index
In [6]: df3 = pd.merge(df1, df2, on='key', how='outer')
In [7]: df3
Out[7]:
   0_x   1  key  0_y    2
0    0   1    0  NaN  NaN
1    2   3    1  NaN  NaN
2  NaN NaN   10    0    1
3  NaN NaN   11    2    3
So in this case a new index is created, but could be re-assigned the original values from 'key' if desired.
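One idiom that gets exactly the behavior asked for is to stash the index you care about in a column, merge on the key, and restore the index afterwards (a sketch using the example frames; the 'index' column name is what reset_index produces by default):
# keep df2's index through a merge on the shared column 0
result = pd.merge(df1, df2.reset_index(), on=0).set_index('index')
result.index.name = None  # optional: clear the leftover index name
This yields the rows joined on column 0 but carrying df2's index (10 and 11).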
I have 3 pandas data frames with matching indices. Some operations have trimmed data frames in different ways (removed rows), so that some indices in one data frame may not exist in the other.
I'd like to consolidate all 3 data frames, so they all contain rows with indices that are present in all 3 of them. How is this achievable?
import pandas as pd
data = pd.DataFrame.from_dict({'a': [1,2,3,4], 'b': [3,4,5,6], 'c': [6,7,8,9]})
a = pd.DataFrame(data['a'])
b = pd.DataFrame(data['b'])
c = pd.DataFrame(data['c'])
a = a[a['a'] <= 3]
b = b[b['b'] >= 4]
# some operation here that removes rows that aren't present in all (intersection of all dataframe's indices)
print(a)
   a
1  2
2  3
print(b)
   b
1  4
2  5
print(c)
   c
1  7
2  8
Update
Sorry, I got carried away and forgot what I wanted to achieve when I wrote the examples. The actual intent was to keep the 3 dataframes separate. Apologies for the misleading example (I corrected it now).
Use merge and pass left_index=True and right_index=True; the default merge type is inner, so only index values that exist in both left and right will be kept.
In [6]:
a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
Out[6]:
   a  b  c
1  2  4  7
2  3  5  8
[2 rows x 3 columns]
To modify the original dataframes so that they only contain the rows that exist in all three, you can do this:
In [12]:
merged = a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
merged
Out[12]:
   a  b  c
1  2  4  7
2  3  5  8
In [14]:
a = a.loc[merged.index]
b = b.loc[merged.index]
c = c.loc[merged.index]
In [15]:
print(a)
print(b)
print(c)
   a
1  2
2  3
   b
1  4
2  5
   c
1  7
2  8
So we merge all of them on index values that are present in all of them and then use the index to filter the original dataframes.
Take a look at concat, which can be used for a variety of combination operations. Here you want the join type set to inner (because you want the intersection), and axis set to 1 (combining columns).
In [123]: pd.concat([a,b,c], join='inner', axis=1)
Out[123]:
   a  b  c
1  2  4  7
2  3  5  8
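Since the update says the three frames should stay separate, you can also compute the shared index directly and filter each frame with it (a sketch equivalent to reusing merged.index above):
# intersect the three indexes, then restrict each frame to the result
common = a.index.intersection(b.index).intersection(c.index)
a, b, c = a.loc[common], b.loc[common], c.loc[common]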