Pandas merge creates unwanted duplicate entries - python

I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.
import pandas as pd
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
print(pd.merge(df1,df2))
output:
A
0 2
1 2
2 2
3 2
4 3
5 4
6 5
desired/expected output:
A
0 2
1 2
2 3
3 4
4 5
Please let me know how/if I can achieve the desired output using merge, thank you!
EDIT
To clarify why I'm confused about this behavior, if I simply add another column, it doesn't make four 2's but rather there are only two 2's, so I would expect that in my first example it would also have the two 2's. Why does the behavior seem to change, what's pandas doing?
import pandas as pd
df1 = df2 = pd.DataFrame(
{'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)
print(pd.merge(df1,df2))
output:
A B
0 2 red
1 2 orange
2 3 yellow
3 4 green
4 5 blue
However, based on the first example I would expect:
A B
0 2 red
1 2 orange
2 2 red
3 2 orange
4 3 yellow
5 4 green
6 5 blue

import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()
df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)
print(df)
Output:
A
0 2
1 2
2 3
3 4
4 5

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2).drop('index', 1, inplace = True)
The idea is to merge based on the matching indices as well as matching 'A' column values.
Previously, since the way merge works depends on matches, what happened is that the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was matched to both the first and second 2 in df2 as well.
If you try this, you will see what I am talking about.
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2, on = 'A')

did you try df.drop_duplicates() ?
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df=pd.merge(df1,df2)
df_new=df.drop_duplicates()
print df
print df_new
Seems that it gives the results that you want

The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
join_cols = ['A']
merged = pd.merge(df1, df2[df2.duplicated(subset=join_cols, keep='first') == False], on=join_cols)
Note we defined join_cols, ensuring columns being joined and columns duplicates are being removed on match.

I have unfortunately stumbled upon a similar problem which I see is now old.
I solved it by using this function in a different way, applying it to the two original tables, even though there were no duplicates in these. This is an example (I apologize, I am not a professional programmer):
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()
df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()
df=pd.merge(df1,df2)
print('df1:')
print( df1 )
print('df2:')
print( df2 )
print('df:')
print( df )

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless index's, but carefully curated order and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row-indices "a/b", and "c/d" is irrelevent, but what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding a the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
First
Second
a
1
2
b
3
4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index means whether to keep the output dataframe index from original along axis. If it is True, it means don't use original index but start from 0 to n just like what the column header 0, 1 shown in your result.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4

combining two dataframes into one new dataframe in a zig zag/zipper way

I have df1 and df2, i want to create new data frame df3, such that the first record of df3 should be first record from df1, second record of df3 should be first record of df2. and it continues in the similar manner.
I tried many methods with pandas, but didn't get answer.
Is there any ways to achieve it.
You can create a column with incremental id (one with odd numbers and other with even numbers:
import numpy as np
df1['unique_id'] = np.arange(0, df1.shape[0]*2,2)
df2['unique_id'] = np.arange(1, df2.shape[0]*2,2)
and then append them and sort by this column:
df3 = df1.append(df2)
df3 = df3.sort_values(by=['unique_id'])
after which you can drop the column you created:
df3 = df3.drop(columns=['unique_id'])
You could do it this way:
import pandas as pd
df1 = pd.DataFrame({'A':[3,3,4,6], 'B':['a1','b1','c1','d1']})
df2 = pd.DataFrame({'A':[5,4,6,1], 'B':['a2','b2','c2','d2']})
dfff = pd.DataFrame()
for i in range(0,4):
dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
dfff = pd.concat([dfff, dfx])
print(pd.concat([df1, df2]).sort_index(kind='merge'))
Which gives
A B
0 3 a1
0 5 a2
1 3 b1
1 4 b2
2 4 c1
2 6 c2
3 6 d1
3 1 d2

assignment with df.iloc() returns nan

I created a dataframe df = pd.DataFrame({'col':[1,2,3,4,5,6]}) and I would like to take some values and put them in another dataframe df2 = pd.DataFrame({'A':[0,0]})by creating new columns.
I created a new column 'B' df2['B'] = df.iloc[0:2,0] and everything was fine, but then i created another column C df2['C'] = df.iloc[2:4,0] and there were only NaN values. I don't know why and if I print print(df.iloc[2:4]) everything is normal.
full code:
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0]
print(df2)
print('\n',df.iloc[2:4])
output:
A B C
0 0 1 NaN
1 0 2 NaN
col
2 3
3 4
Assignement df2['C'] = df.iloc[2:4,0] does not work as expected, because index is not the same. You can skip this using .values attributes.
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0].values
print(df2)

How to merge many DataFrames by index combining values where columns overlap?

I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about the combine_first and it works, but I can't have this approach because it is thousands of time slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest to concat first your dataframe (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Runing the code with the new version of your question and the input provided by #Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1,df_2,df_3), axis=1)
.ffill(1)
.iloc[:,-1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1,df2,df3)),
on=['id','constraint'],
how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the option how='left' in merge
If you must only merge all dataframes with base:
Based on edit
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
base = base.merge(i,how='left',on=['id','constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
For those who want to simply do a merge, overriding the values (which is my case), can achieve that using this method, which is really similar to Celius Stingher answer.
Documented version is on the original gist.
import pandas as pa
def rmerge(left,right,**kwargs):
# Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
def flatten(lst):
return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )
# Set default for removing overlapping columns in "left" to be true
myargs = {'replace':'left'}
myargs.update(kwargs)
# Remove the replace key from the argument dict to be sent to
# pandas merge command
kwargs = {k:v for k,v in myargs.items() if k is not 'replace'}
if myargs['replace'] is not None:
# Generate a list of overlapping column names not associated with the join
skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
leftcols = set(left.columns)
rightcols = set(right.columns)
dropcols = list((leftcols & rightcols).difference(skipcols))
# Remove the overlapping column names from the appropriate DataFrame
if myargs['replace'].lower() == 'left':
left = left.copy().drop(dropcols,axis=1)
elif myargs['replace'].lower() == 'right':
right = right.copy().drop(dropcols,axis=1)
df = pa.merge(left,right,**kwargs)
return df

Subsetting DataFrame based on column names of another DataFrame

I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?
If you need a true intersection, since .columns yields an Index object which supports basic set operations, you can use &, e.g.
df2[df1.columns & df2.columns]
or equivalently with Index.intersection
df2[df1.columns.intersection(df2.columns)]
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns & df2.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.

Categories

Resources