Pandas merge doesn't retain as many rows as I would think - python

Consider the following two data frames
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
Running
df3 = pd.merge(df1, df2, on='a')
Yields
a b c
0 foo 1 3
But why not the following?
a b c
0 foo 1 3
1 bar 2 -
2 baz - 4
What do I need to tell Python to get it to output the unmatched rows as well?

By default, a pandas merge performs an inner join, if you are familiar with database joins. That means it only returns the rows that have a matching entry in both the left and the right dataframe. For you, that is just 'foo'.
You can change that by setting the how argument: if you want all rows from both the left and the right frame, set it to outer; if you want to keep all rows from the left frame, set it to left; and if you want to keep all rows from the right frame, set it to right.

pd.merge(df1, df2, on='a', how='outer') will join on matching keys, with all non-matching keys returned as new rows and NaN filling in the blanks.
See here for an overview of the different types of SQL-style joins, which merge uses as its basis.
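For the frames in the question, the outer merge keeps every key and fills the gaps with NaN (illustrative; note that columns b and c become float once NaN appears):
import pandas as pd
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
df3 = pd.merge(df1, df2, on='a', how='outer')
print(df3)
     a    b    c
0  foo  1.0  3.0
1  bar  2.0  NaN
2  baz  NaN  4.0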

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but a carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant, but what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset the indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat([df1, df2], axis=1, ignore_index=True)
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, you can sidestep the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work, I think:
df1["second"] = df2["Second"].values
It would keep the index from the first dataframe, but since you have index values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
import pandas as pd

df1 = pd.DataFrame({
    'x': ['a', 'b'],
    'First': [1, 3],
})
df1.set_index("x", drop=True, inplace=True)

df2 = pd.DataFrame({
    'x': ['c', 'd'],
    'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data={'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data={'Second': [2, 4]}, index=['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output dataframe keeps the original labels along the concatenation axis. If it is True, the original labels are dropped and replaced with 0 to n-1; with axis=1 this applies to the column labels, which is exactly why the column headers 0 and 1 show up in your result. It does not realign the rows by position.
You can try
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
First Second
0 1 2
1 3 4
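Another option, offered here only as a sketch (it is not from the answers above): overwrite df2's row labels with df1's index via set_axis, then concat. This keeps df1's original labels instead of resetting them to 0..n-1:
# align by position by forcing df2 onto df1's index
out = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
print(out)
   First  Second
a      1       2
b      3       4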

How to remove duplication of columns names using Pandas Merge function

When we merge two dataframes using the pandas merge function, is it possible to ensure the key(s) on which the two dataframes are merged is not repeated twice in the result? For example, I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column/header names are different, the values of both columns are the same. In the eventual result, though, I get both the 'isin_code' column and the 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1, df2[['isin', 'issue_date']], how='left', left_on='isin_code', right_on='isin')
Either rename the column before the merge so the column names are uniform and you can specify only on:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
    on='isin_code',
    how='left'
)
OR drop the duplicate column after merge:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']],
    how='left',
    left_on='isin_code',
    right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
isin_code a
0 1 4
1 2 5
2 3 6
df2:
isin issue_date
0 1 2021-01-02
1 3 2021-03-04
result:
isin_code a issue_date
0 1 4 2021-01-02
1 2 5 NaN
2 3 6 2021-03-04

How to match multiple columns from two dataframes that have different sizes?

A similar solution can be found here, but the asker there had only a single dataframe and needed to match a fixed string value:
result = df.loc[(df['Col1'] =='Team2') & (df['Col2']=='Medium'), 'Col3'].values[0]
However, the problem I encountered with the .loc method is that it requires the two dataframes to be the same size, because it only matches values at the same row position in each dataframe. So if the order of the rows is mixed up in either of the dataframes, it will not work as expected.
Sample of this situation is shown below:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3, 2], 'b': [4, 6, 5]})
Using df1.loc[(df1['a'] == df2['a']) & (df1['b'] == df2['b']), 'Status'] = 'Included' only flags the rows whose positions happen to line up, but I'm looking for every row of df1 that has a match anywhere in df2. I have also looked into methods such as .lookup, but it was deprecated in December 2020 (and it requires similarly sized dataframes as well).
Use DataFrame.merge with the indicator parameter to get a new column with this information; if you then need to change the values, use e.g. numpy.where:
import numpy as np

df = df1.merge(df2, indicator='status', how='left')
df['status'] = np.where(df['status'].eq('both'), 'included', 'not included')
print (df)
a b status
0 1 4 included
1 2 5 included
2 3 6 included
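In the sample above every row of df1 has a counterpart in df2, so to see the 'not included' branch fire, here is the same recipe on slightly altered, made-up data where one row has no match:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3], 'b': [4, 6]})  # row (2, 5) removed
df = df1.merge(df2, indicator='status', how='left')
df['status'] = np.where(df['status'].eq('both'), 'included', 'not included')
print(df)
   a  b        status
0  1  4      included
1  2  5  not included
2  3  6      included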

Python pandas merge with OR logic

I've searched and haven't found an answer to this question: can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL join using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes have the synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need but, that still wouldn't put me in position to merge df1 and df2 matching the appropriate rows. I think if there an OR merge option, I can step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
Will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'),
                df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')])
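For the df1/df2/df3 setup described in the question, here is a sketch of the synonym-lookup approach (the column names col1 and col1_alias and the sample data are assumptions based on the question, not tested code from the answers): first normalize df1's key to its canonical value via df3, then do a plain left merge with df2:
import pandas as pd

# made-up data following the question's description
df1 = pd.DataFrame({'col1': ['x', 'y-alias', 'z']})
df2 = pd.DataFrame({'col1': ['x', 'y', 'w'], 'payload': [10, 20, 30]})
df3 = pd.DataFrame({'col1': ['y'], 'col1_alias': ['y-alias']})  # synonym lookup

# alias -> canonical mapping; keys already canonical pass through unchanged
alias_map = dict(zip(df3['col1_alias'], df3['col1']))
df1['canon'] = df1['col1'].replace(alias_map)

# a single left merge on the canonical key now covers the OR condition
df = df1.merge(df2, left_on='canon', right_on='col1',
               how='left', suffixes=('', '_df2'))
print(df[['col1', 'canon', 'payload']])
      col1 canon  payload
0        x     x     10.0
1  y-alias     y     20.0
2        z     z      NaN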

Merging dataframes, dropping column and set index

I have two dataframes like this:
import pandas as pd
left = pd.DataFrame({'id1': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'ID2': ['a', 'c', 'd'], 'val2': [4, 5, 6]})
id1 val1
0 a 1
1 b 2
2 c 3
ID2 val2
0 a 4
1 c 5
2 d 6
I want to merge these two dataframes, doing an inner merge, drop ID2 and then also use id1 as a new index. My desired output looks like this:
val1 val2
id1
a 1 4
c 3 5
I currently do this as follows:
res = pd.merge(left, right, left_on='id1', right_on='ID2', how='inner').drop('ID2', axis=1).set_index('id1')
which gives me the desired output.
My question is whether there is already an option that allows me
a) to drop a key column when an inner merge is performed as there will then be two identical columns
and/or
b) to directly set the index to one of the key columns used for the merging process.
Is the way I do it now the way to go or is there anything smarter/a built-in for this already?
One option is to set the key columns as the index before joining; this keeps only one key column, as the index, in the result:
left.set_index("id1").join(right.set_index("ID2"), how="inner")
You can use merge with the parameters left_index and right_index (how='inner' is omitted because it is the default value), but first call set_index on both dataframes:
res = pd.merge(left.set_index('id1'),
               right.set_index('ID2'),
               left_index=True,
               right_index=True)
print (res)
val1 val2
a 1 4
c 3 5
A solution with concat is also possible; it is necessary to add the join parameter for an inner join:
res = pd.concat([left.set_index('id1'),
                 right.set_index('ID2')], axis=1, join='inner')
print (res)
val1 val2
a 1 4
c 3 5
Out of the three solutions provided, the merge solution works fastest:
pd.merge(left.set_index('id1'), right.set_index('ID2'), left_index=True, right_index=True)
In a speed comparison of the three, the first answer, i.e. the one using merge, is the fastest.
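If you want to reproduce that comparison yourself, here is a minimal timing sketch (absolute numbers will vary with machine, data size, and pandas version):
import timeit
import pandas as pd

left = pd.DataFrame({'id1': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'ID2': ['a', 'c', 'd'], 'val2': [4, 5, 6]})

candidates = {
    'merge': lambda: pd.merge(left.set_index('id1'), right.set_index('ID2'),
                              left_index=True, right_index=True),
    'join': lambda: left.set_index('id1').join(right.set_index('ID2'), how='inner'),
    'concat': lambda: pd.concat([left.set_index('id1'), right.set_index('ID2')],
                                axis=1, join='inner'),
}
for name, fn in candidates.items():
    print(name, timeit.timeit(fn, number=1000))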
