Pandas: merge dataframes without creating new columns - python

I've got 2 dataframes with identical columns:
df1 = pd.DataFrame([['Abe','1','True'],['Ben','2','True'],['Charlie','3','True']], columns=['Name','Number','Other'])
df2 = pd.DataFrame([['Derek','4','False'],['Ben','5','False'],['Erik','6','False']], columns=['Name','Number','Other'])
which give:
Name Number Other
0 Abe 1 True
1 Ben 2 True
2 Charlie 3 True
and
Name Number Other
0 Derek 4 False
1 Ben 5 False
2 Erik 6 False
I want an output dataframe that is an intersection of the two based on "Name":
output_df =
Name Number Other
0 Ben 2 True
1 Ben 5 False
I've tried a basic pandas merge, but the result is not what I want:
pd.merge(df1,df2,how='inner',on='Name') =
Name Number_x Other_x Number_y Other_y
0 Ben 2 True 5 False
These dataframes are quite large so I'd prefer to use some pandas magic to keep things quick.

You can use concat and then filter with isin, using numpy.intersect1d to get the common names for boolean indexing:
import numpy as np

val = np.intersect1d(df1.Name, df2.Name)
print (val)
['Ben']
df = pd.concat([df1,df2], ignore_index=True)
print (df[df.Name.isin(val)])
Name Number Other
1 Ben 2 True
4 Ben 5 False
Another way to compute val is the intersection of two sets:
val = set(df1.Name).intersection(set(df2.Name))
print (val)
{'Ben'}
Then you can reset the index so it is monotonic:
df = pd.concat([df1,df2])
print (df[df.Name.isin(val)].reset_index(drop=True))
Name Number Other
0 Ben 2 True
1 Ben 5 False
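For completeness, a self-contained version of the above (a minimal sketch, with the imports and example frames included):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['Abe', '1', 'True'], ['Ben', '2', 'True'], ['Charlie', '3', 'True']],
                   columns=['Name', 'Number', 'Other'])
df2 = pd.DataFrame([['Derek', '4', 'False'], ['Ben', '5', 'False'], ['Erik', '6', 'False']],
                   columns=['Name', 'Number', 'Other'])

# Names present in both frames
val = np.intersect1d(df1.Name, df2.Name)

# Stack the frames, keep only rows whose Name is shared, and renumber
out = pd.concat([df1, df2], ignore_index=True)
out = out[out.Name.isin(val)].reset_index(drop=True)
print(out)
#   Name Number  Other
# 0  Ben      2   True
# 1  Ben      5  False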

Related

How can I add a column to a dataframe with a value conditional on another dataframe?

I'm working with two dataframes:
Dataframe1 looks like:
user (index)  apples  bananas
Pete          4       2
Sara          5       10
Kara          4       2
Tom           3       3
Dataframe2 looks like:
index  user
1      Pete
2      Sara
I want to create a new boolean column in dataframe1 that is true if the user is in dataframe 2. So output looks like:
user  apples  bananas  new column
Pete  4       2        True
Sara  5       10       True
Kara  4       2        False
Tom   3       3        False
I tried using a lambda function but didn't get very far.
Here is an easy way of doing that:
df1 = df1.reset_index()
df2['new_column'] = True
df1 = pd.merge(df1, df2, on='user', how='left')
df1['new_column'] = df1['new_column'].fillna(False)
You can leverage the indicator param of df.merge, then use df.replace:
In [598]: x = df1.merge(df2['user'], left_on='user (index)', right_on='user', how='left', indicator='new column').replace({'both': True, 'left_only': False}).drop('user', axis=1)
In [599]: x
Out[599]:
user (index) apples bananas new column
0 Pete 4 2 True
1 Sara 5 10 True
2 Kara 4 2 False
3 Tom 3 3 False
Or, for somewhat better performance, use Series.map instead of df.replace:
In [609]: y = df1.merge(df2['user'], left_on='user (index)', right_on='user', how='left', indicator='new column').drop('user', axis=1)
In [611]: y['new column'] = y['new column'].map({'both': True, 'left_only':False})
In [612]: y
Out[612]:
user (index) apples bananas new column
0 Pete 4 2 True
1 Sara 5 10 True
2 Kara 4 2 False
3 Tom 3 3 False
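If df2 is only needed for the membership test, a simpler alternative skips the merge entirely (a sketch, assuming user is df1's index as in the question):
# True where df1's index value appears in df2['user']
df1['new column'] = df1.index.isin(df2['user'])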

How to copy column to another dataframe but allow duplicates

I have two Pandas dataframes, let's say df1 and df2. df1 contains data about which team members participated in a competition, and in which attempts. df2 contains data about how well each team did in those attempts. I need to add a column to df1 which would copy the data from df2.success to every row according to the team, like this:
df1
attempt team name
0 1 A Alice
1 1 A Bob
2 1 B Charlie
3 1 B Daniel
4 2 A Alice
5 2 A Bob
6 2 B Charlie
7 2 B Eva
df2
attempt team success
0 1 A True
1 1 B False
2 2 A False
3 2 B True
and the result should be
df1
attempt team name success
0 1 A Alice True
1 1 A Bob True
2 1 B Charlie False
3 1 B Daniel False
4 2 A Alice False
5 2 A Bob False
6 2 B Charlie True
7 2 B Eva True
The problem is that the dataframes have different numbers of rows, so I need to duplicate some data, and I keep getting a bunch of errors when trying to do this with loc.
df1['success'] = df2.loc[(df1["attempt"].values, df1["team"].values), ["success"]]['success'].values
df = df1.merge(df2, on=['attempt', 'team'], how='left')
Is that what you were looking for?
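A runnable sketch of that merge, with the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'attempt': [1, 1, 1, 1, 2, 2, 2, 2],
                    'team': list('AABBAABB'),
                    'name': ['Alice', 'Bob', 'Charlie', 'Daniel',
                             'Alice', 'Bob', 'Charlie', 'Eva']})
df2 = pd.DataFrame({'attempt': [1, 1, 2, 2],
                    'team': list('ABAB'),
                    'success': [True, False, False, True]})

# The left merge broadcasts each (attempt, team) success value
# to every matching row of df1, duplicating it as needed
df = df1.merge(df2, on=['attempt', 'team'], how='left')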

Retain only duplicated rows in a pandas dataframe

I have a dataframe with two columns: "Agent" and "Client"
Each row corresponds to an interaction between an Agent and a client.
I want to keep only the rows if a client had interactions with at least 2 agents.
How can I do that?
Worth adding that you can now use df.duplicated():
df = df.loc[df.duplicated(subset='Agent', keep=False)]
Use groupby and transform with value_counts:
df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
Note that, as mentioned here, you might have one agent interacting with the same client multiple times. Such rows would be retained as false positives. If you do not want this, you can add a drop_duplicates call before filtering:
df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
A worked example with a generic dataframe (column B plays the role of Agent):
print(df)
A B
0 1 2
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0 False
1 True
2 True
3 True
4 True
5 True
Name: B, dtype: bool
df = df[mask]
print(df)
A B
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
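Note that the question literally asks for clients who interacted with at least 2 agents; if that is the requirement, counting distinct agents per client is more direct (a sketch, assuming columns named 'Agent' and 'Client'):
# Keep rows whose Client appears with two or more distinct Agents
mask = df.groupby('Client')['Agent'].transform('nunique') >= 2
df = df[mask]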

Matching on basis of a pair of columns in pandas

I have a dataframe df1 with multiple columns, and a df2 with the same set of columns. I want to get the records of df1 which aren't present in df2. I am able to perform this task as below:
df1[~df1['ID'].isin(df2['ID'])]
Now I want to do the same operation, but on the combination of NAME and ID. This means that if the NAME and ID pair from df1 also exists as the same pair in df2, then that whole record should not be part of my result.
How do I accomplish this task using pandas?
I don't think that the currently accepted answer is actually correct. It was my impression that you would like to drop a value pair in df1 if that pair also exists in the other dataframe, independent of the row position that they take in the respective dataframes.
Consider the following dataframes
df1 = pd.DataFrame({'a': list('ABC'), 'b': list('CDF')})
df2 = pd.DataFrame({'a': list('ABAC'), 'b': list('CFFF')})
df1
a b
0 A C
1 B D
2 C F
df2
a b
0 A C
1 B F
2 A F
3 C F
So you would like to drop rows 0 and 2 in df1. However, with the above suggestion you get
df1.isin(df2)
a b
0 True True
1 True False
2 False True
What you can do instead is
compare_cols = ['a','b']
mask = pd.Series(list(zip(*[df1[c] for c in compare_cols]))).isin(list(zip(*[df2[c] for c in compare_cols])))
mask
0 True
1 False
2 True
dtype: bool
That is, you construct a Series of tuples from the columns you would like to compare coming from the first dataframe, and then check whether these tuples exist in the list of tuples obtained in the same way from the respective columns in the second dataframe.
Final step: df1 = df1.loc[~mask.values]
As pointed out by @rvrvrv in the comments, it is best to use mask.values instead of just mask, in case df1 and mask do not have the same index (or use the df1 index when constructing mask).
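An equivalent way to build the same mask is to compare the rows as a MultiIndex, which avoids materializing the tuple lists (a sketch using the compare_cols from above):
# Row-wise membership of (a, b) pairs, independent of the index
mask = df1.set_index(compare_cols).index.isin(df2.set_index(compare_cols).index)
df1 = df1[~mask]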
It's actually pretty easy.
df1[(~df1[['ID', 'Name']].isin(df2[['ID', 'Name']])).any(axis=1)]
You pass the column names that you want to compare as a list. The interesting part is what it outputs.
Let's say df1 equals:
ID Name
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 1 1
And df2 equals:
ID Name
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 1 9
Every (ID, Name) pair between df1 and df2 matches except for row 9. The result of my answer will return:
ID Name
9 1 1
Which is exactly what you want.
In more detail, when you do the mask:
~df1[['ID', 'Name']].isin(df2[['ID', 'Name']])
You get this:
ID Name
0 False False
1 False False
2 False False
3 False False
4 False False
5 False False
6 False False
7 False False
8 False False
9 False True
And we want to select the rows where one of those columns is True. For this, we can add any(axis=1) at the end, which creates:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
And then when you index using this series, it will only select row 9.
isin() would not work here, as it also compares the index.
Let's have a look at a super powerful pandas tool: merge()
If we consider the nice example given by user3820991, we have :
df1 = pd.DataFrame({'a': list('ABC'), 'b': list('CDF')})
df2 = pd.DataFrame({'a': list('ABAC'), 'b': list('CFFF')})
df1
a b
0 A C
1 B D
2 C F
df2
a b
0 A C
1 B F
2 A F
3 C F
The basic merge method of pandas is the 'inner' join. This will give you the equivalent of the isin() method for two columns:
df1.merge(df2[['a','b']], how='inner')
a b
0 A C
1 C F
If you would like the equivalent of not(isin()), just change the merge method to an 'outer' join (a left join would also work, but the outer join shows more possibilities for the example).
This gives you all the rows of both dataframes; we only have to add indicator=True to be able to select the ones we want:
df1.merge(df2[['a','b']], how='outer', indicator=True)
a b _merge
0 A C both
1 B D left_only
2 C F both
3 B F right_only
4 A F right_only
We want the rows that are in df1 but not in df2, so 'left_only'. As a one-liner:
pd.merge(df1, df2, on=['a','b'], how="outer", indicator=True
).query('_merge=="left_only"').drop(columns='_merge')
a b
1 B D
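The 'left' join mentioned above works the same way, since rows present only in df2 never show up, so the indicator column then contains only 'both' and 'left_only' (a sketch):
out = (df1.merge(df2, on=['a', 'b'], how='left', indicator=True)
          .query('_merge == "left_only"')
          .drop(columns='_merge'))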
You can create a new column by concatenating NAME and ID and use this new column the same way you used ID in your question:
df1['temp'] = df1['NAME'].astype(str)+df1['ID'].astype(str)
df2['temp'] = df2['NAME'].astype(str)+df2['ID'].astype(str)
df1[~df1['temp'].isin(df2['temp'])].drop('temp', axis=1)
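One caveat with plain string concatenation: distinct pairs can collide, e.g. NAME='AB', ID='1' and NAME='A', ID='B1' both become 'AB1'. A separator that cannot occur in the data avoids this (a sketch):
# The separator keeps ('AB', '1') and ('A', 'B1') distinct
df1['temp'] = df1['NAME'].astype(str) + '|' + df1['ID'].astype(str)
df2['temp'] = df2['NAME'].astype(str) + '|' + df2['ID'].astype(str)
df1[~df1['temp'].isin(df2['temp'])].drop('temp', axis=1)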

Applying a specific function to replace value of column based on criteria from another column in dataframe

Here is what I am looking to do:
Dataframe before:
name value apply_f
0 SEBASTIEN 9 False
1 JOHN 4 False
2 JENNY inf True
Apply function f (the length of 'name') to column 'value', but only where column 'apply_f' == True.
Dataframe after:
name value apply_f
0 SEBASTIEN 9 False
1 JOHN 4 False
2 JENNY 5 True
Here's what I currently have:
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['SEBASTIEN', 'JOHN', 'JENNY'],
                   "value": [9, 4, np.inf],
                   "apply_f": [False, False, True]})

def f(x):
    return len(x)
df['value'] = df[df['apply_f'] == True]['name'].apply(f)
but result is not what I was expecting:
apply_f name value
0 False SEBASTIEN NaN
1 False JOHN NaN
2 True JENNY 5
The column replaces the initial values with NaN
The reason it overwrites is that the indexing on the left-hand side defaults to the entire dataframe. If you apply the mask on the left-hand side as well, using loc, then the assignment only affects the rows where the condition is met:
In [272]:
df.loc[df['apply_f'] == True, 'value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
Out[272]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 5
loc is used above because, if I assigned with the same boolean-mask semantics on the left-hand side, it may or may not work and raises a SettingWithCopyWarning in recent pandas versions:
In[274]:
df[df['apply_f'] == True]['value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
-c:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Out[274]:
apply_f name value
0 False SEBASTIEN 9.000000
1 False JOHN 4.000000
2 True JENNY inf
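Since f here is just a string length, the masked assignment can also be written without apply, using the vectorized .str.len() (a sketch):
mask = df['apply_f']
df.loc[mask, 'value'] = df.loc[mask, 'name'].str.len()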
For what you are doing it would be more concise and readable to use numpy where:
In [279]:
df['value'] = np.where(df['apply_f']==True, len(df['name']), df['value'])
df
Out[279]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 3
I understand that your example is meant to demonstrate an issue, but you can also use where in certain situations.
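Note, though, that len(df['name']) above is the length of the whole column (3 rows), not the length of each name, which is why JENNY gets 3 instead of 5. For per-row string lengths inside np.where, use the vectorized form (a sketch):
# Per-row name lengths instead of the column's length
df['value'] = np.where(df['apply_f'], df['name'].str.len(), df['value'])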
