Merging dataframes, dropping a column and setting the index - python

I have two dataframes like this:
import pandas as pd
left = pd.DataFrame({'id1': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'ID2': ['a', 'c', 'd'], 'val2': [4, 5, 6]})
left:
  id1  val1
0   a     1
1   b     2
2   c     3
right:
  ID2  val2
0   a     4
1   c     5
2   d     6
I want to merge these two dataframes, doing an inner merge, drop ID2 and then also use id1 as a new index. My desired output looks like this:
     val1  val2
id1
a       1     4
c       3     5
I currently do this as follows:
res = pd.merge(left, right, left_on='id1', right_on='ID2', how='inner').drop('ID2', axis=1).set_index('id1')
which gives me the desired output.
My question is whether there is already an option that allows me
a) to drop a key column when an inner merge is performed as there will then be two identical columns
and/or
b) to directly set the index to one of the key columns used for the merging process.
Is the way I do it now the way to go or is there anything smarter/a built-in for this already?

One option is to set the key columns as index before joining, this will keep only one key column as index in the result:
left.set_index("id1").join(right.set_index("ID2"), how = "inner")
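A minimal sketch of the join approach, run on the question's sample frames. The rename_axis call is an addition, not part of the answer above: it names the index explicitly, on the assumption that the name left over after joining two differently named indexes should not be relied on.

```python
import pandas as pd

left = pd.DataFrame({'id1': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'ID2': ['a', 'c', 'd'], 'val2': [4, 5, 6]})

# Move both key columns into the index, then join on the index.
# Only one key survives, and it is already the index of the result.
res = (
    left.set_index('id1')
        .join(right.set_index('ID2'), how='inner')
        .rename_axis('id1')  # assumption: set the index name explicitly
)
print(res)
```

This replaces the merge/drop/set_index chain from the question with one join plus two set_index calls.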

You can use merge with the parameters left_index and right_index (how='inner' is omitted because it is the default), but first call set_index on both DataFrames:
res = pd.merge(left.set_index('id1'),
right.set_index('ID2'),
left_index=True,
right_index=True)
print (res)
val1 val2
a 1 4
c 3 5
A solution with concat; the parameter join='inner' is necessary for an inner join:
res = pd.concat([left.set_index('id1'),
right.set_index('ID2')], axis=1, join='inner')
print (res)
val1 val2
a 1 4
c 3 5

Of the three solutions provided, the merge solution ran fastest in my speed comparison:
pd.merge(left.set_index('id1'), right.set_index('ID2'), left_index=True, right_index=True)
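The timing claim can be checked locally with a sketch like the one below. The function names (with_join, with_merge, with_concat) are hypothetical, and the numbers depend on the pandas version and on the size of the frames, so treat any ranking from three-row inputs as indicative only.

```python
import timeit

import pandas as pd

left = pd.DataFrame({'id1': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'ID2': ['a', 'c', 'd'], 'val2': [4, 5, 6]})


def with_join():
    return left.set_index('id1').join(right.set_index('ID2'), how='inner')


def with_merge():
    return pd.merge(left.set_index('id1'), right.set_index('ID2'),
                    left_index=True, right_index=True)


def with_concat():
    return pd.concat([left.set_index('id1'), right.set_index('ID2')],
                     axis=1, join='inner')


RUNS = 200  # assumption: enough repetitions to smooth out noise on tiny frames
timings = {
    func.__name__: timeit.timeit(func, number=RUNS)
    for func in (with_join, with_merge, with_concat)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds / RUNS * 1e3:.3f} ms per call")
```

For a meaningful comparison, rerun it with frames of realistic size.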

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but a carefully curated row order, and I want to merge them while preserving that order. So, for example:
>>> df1
   First
a      1
b      3
and
>>> df2
   Second
c       2
d       4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work, I think:
df1["second"] = df2["Second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
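The reason .values helps is index alignment: assigning a Series matches rows by label, while assigning a plain array matches them by position. A small sketch of the difference, using the same sample frames; the explicit length check is an addition, since positional assignment only makes sense when both frames have the same number of rows.

```python
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

# Assigning a Series aligns on index labels; 'c'/'d' never match 'a'/'b',
# so every value in the new column is NaN.
aligned = df1.assign(Second=df2['Second'])

# Assigning a NumPy array skips alignment and fills the column by position.
if len(df1) != len(df2):
    raise ValueError("positional assignment needs frames of equal length")
positional = df1.assign(Second=df2['Second'].to_numpy())

print(aligned)
print(positional)
```

to_numpy() is used here in place of .values; both return the underlying array for this column.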
ignore_index controls whether the original labels along the concatenation axis are kept. If it is True, those labels are discarded and replaced with 0 to n-1. With axis=1 that applies to the column labels, which is why your result shows the column headers 0 and 1 while the rows are still aligned on the original index.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
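To see both behaviours side by side, here is a short sketch on the question's frames. The variable names relabelled and by_position are made up for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

# axis=1 with ignore_index=True: rows are still aligned by label,
# only the column labels are replaced by 0, 1, ...
relabelled = pd.concat([df1, df2], axis=1, ignore_index=True)

# Dropping the row labels first gives both frames the index 0..n-1,
# so the rows pair up in the order they were supplied.
by_position = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)

print(relabelled)
print(by_position)
```

reset_index(drop=True) does not reorder rows; it only replaces the labels, so the curated order is preserved.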

How to remove duplication of columns names using Pandas Merge function

When we merge two dataframes using the pandas merge function, is it possible to ensure that the key(s) on which the two dataframes are merged do not appear twice in the result? For example, I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column names are different, the values of both columns are the same. In the eventual result, though, I see both the 'isin_code' column and the 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1,df2[['isin','issue_date']],how='left',left_on='isin_code',right_on = 'isin')
Either rename the right-hand key column before the merge so that the names match, and specify only on:
result = pd.merge(
df1,
df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
on='isin_code',
how='left'
)
OR drop the duplicate column after merge:
result = pd.merge(
df1,
df2[['isin', 'issue_date']],
how='left',
left_on='isin_code',
right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
isin_code a
0 1 4
1 2 5
2 3 6
df2:
isin issue_date
0 1 2021-01-02
1 3 2021-03-04
result:
isin_code a issue_date
0 1 4 2021-01-02
1 2 5 NaN
2 3 6 2021-03-04
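If this pattern comes up often, the second option can be wrapped in a small function. The helper below, merge_drop_right_key, is hypothetical (it is not a pandas API) and assumes a left or inner merge; with how='right' or how='outer' the left key can be NaN for unmatched rows, so dropping the right key would lose information.

```python
import pandas as pd


def merge_drop_right_key(left, right, left_on, right_on, how='left'):
    """Hypothetical helper: merge on differently named keys, keep only the left key."""
    merged = pd.merge(left, right, how=how, left_on=left_on, right_on=right_on)
    return merged.drop(columns=right_on)


df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})

result = merge_drop_right_key(df1, df2[['isin', 'issue_date']],
                              left_on='isin_code', right_on='isin')
print(result)
```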

Python pandas merge with OR logic

I've been searching and haven't found an answer to this question: can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL merge using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes use synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1, then creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need, but that still wouldn't put me in a position to merge df1 and df2 on the appropriate rows. I think if there were an OR merge option, I could step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
# Will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'), df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')])
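A different route for the synonym case in the question is to normalize the keys before merging, so that an ordinary merge is enough. The sketch below uses made-up sample values and assumes that every alias in df3 maps to exactly one canonical col1 value; the names alias_to_canonical, canonical and key are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df1 = pd.DataFrame({'col1': ['apple', 'pear', 'kiwi'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'col1': ['apple', 'pyrus', 'plum'], 'y': [10, 20, 30]})
df3 = pd.DataFrame({'col1': ['pear'], 'col1_alias': ['pyrus']})

# Assumption: each alias corresponds to exactly one canonical value.
alias_to_canonical = dict(zip(df3['col1_alias'], df3['col1']))


def canonical(keys):
    """Replace known aliases by their canonical value, leave other keys as-is."""
    return keys.map(lambda k: alias_to_canonical.get(k, k))


left = df1.assign(key=canonical(df1['col1']))
right = df2.assign(key=canonical(df2['col1'])).drop(columns='col1')

# A plain left merge on the normalized key now matches synonyms too.
merged = left.merge(right, on='key', how='left').drop(columns='key')
print(merged)
```

Normalizing both sides means it does not matter which of the two databases uses the alias.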

Subsetting DataFrame based on column names of another DataFrame

I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?
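If you need a true intersection, note that .columns yields an Index object, which supports basic set operations. Older pandas versions allowed the & operator for this:
df2[df1.columns & df2.columns]
In recent versions (pandas 2.0 and later, as far as I know) & on an Index is an element-wise logical operation rather than a set intersection, so prefer Index.intersection:
df2[df1.columns.intersection(df2.columns)]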
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns.intersection(df2.columns)]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.
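A short sketch of the difference, using a df1 that deliberately contains a column ('z') missing from df2. The isin variant is included as an assumption about what maps most directly onto the R idiom names(df2) %in% names(df1); it builds a boolean mask over df2's columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'a': rng.standard_normal(5), 'z': rng.standard_normal(5)})
df2 = pd.DataFrame({'a': rng.standard_normal(5),
                    'b': rng.standard_normal(5),
                    'c': rng.standard_normal(5)})

# Set intersection of the column labels: safe when df1 has extra columns.
common = df2[df1.columns.intersection(df2.columns)]

# Boolean mask over df2's columns, analogous to R's %in%.
masked = df2.loc[:, df2.columns.isin(df1.columns)]

# Direct selection fails because 'z' does not exist in df2.
try:
    df2[df1.columns]
    raised = False
except KeyError:
    raised = True

print(common.columns.tolist(), masked.columns.tolist(), raised)
```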

Two dataframes into one

I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1
id value
a 5
c 9
d 4
f 2

df2
id value
a 3
b 7
c 6
d 8
e 2
f 1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first set id as the index of both DataFrames (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested to see that you could also merge df1 and df2 on id, and then use fillna to replace the NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
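For completeness, a runnable sketch of the combine_first approach on the sample data, next to a third variant that is not from the answer above: concatenating with df1 first and keeping the first occurrence of each id. That variant is offered as an assumption-laden alternative; it relies on ids being unique within each frame.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'],
                    'value': [3, 7, 6, 8, 2, 1]})

# combine_first: df1's values win, gaps are filled from df2 (aligned on id).
via_combine = (
    df1.set_index('id')
       .combine_first(df2.set_index('id'))
       .reset_index()
)

# Alternative sketch: stack df1 on top of df2 and keep the first row per id,
# so df1 takes precedence wherever an id appears in both frames.
via_concat = (
    pd.concat([df1, df2])
      .drop_duplicates(subset='id', keep='first')
      .sort_values('id')
      .reset_index(drop=True)
)

print(via_combine)
print(via_concat)
```

The concat variant never introduces NaN, so the value column keeps its integer dtype.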
