How to copy column to another dataframe but allow duplicates - python

I have two Pandas dataframes, say df1 and df2. df1 records which team members participated in a competition in which attempts; df2 records how well each team did in those attempts. I need to add a column to df1 that copies the data from df2.success to every row according to the team, like this:
df1
attempt team name
0 1 A Alice
1 1 A Bob
2 1 B Charlie
3 1 B Daniel
4 2 A Alice
5 2 A Bob
6 2 B Charlie
7 2 B Eva
df2
attempt team success
0 1 A True
1 1 B False
2 2 A False
3 2 B True
and the result should be
df1
attempt team name success
0 1 A Alice True
1 1 A Bob True
2 1 B Charlie False
3 1 B Daniel False
4 2 A Alice False
5 2 A Bob False
6 2 B Charlie True
7 2 B Eva True
The problem is that the dataframes have a different number of rows so I need to duplicate some data and I keep getting bunch of errors when trying to do this with loc.
df1['success'] = df2.loc[(df1["attempt"].values, df1["team"].values), ["success"]]['success'].values

df = df1.merge(df2, on=['attempt', 'team'], how='left')
Is that what you were looking for?
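A runnable sketch with the sample frames from the question; the left join keeps every row of df1 and repeats the matching df2.success value for each team member:

```python
import pandas as pd

df1 = pd.DataFrame({
    'attempt': [1, 1, 1, 1, 2, 2, 2, 2],
    'team':    ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'name':    ['Alice', 'Bob', 'Charlie', 'Daniel',
                'Alice', 'Bob', 'Charlie', 'Eva'],
})
df2 = pd.DataFrame({
    'attempt': [1, 1, 2, 2],
    'team':    ['A', 'B', 'A', 'B'],
    'success': [True, False, False, True],
})

# Each df1 row picks up the single df2 row with the same (attempt, team),
# so df2's success values are duplicated across team members as needed.
out = df1.merge(df2, on=['attempt', 'team'], how='left')
print(out)
```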

Related

How to find common values in groupby groups?

I have a df in this format. My goal is to find users who participate in more than one tournament and ultimately set their 'val' value to the one they first appear with. Initially I thought I needed to groupby 'tour', but then it needs some intersection and I'm not sure how to proceed. Alternatively, I can do pd.crosstab(df.user, df.tour), but I'm not sure how to proceed from there either.
df = pd.DataFrame(data=[['jim', '1', '1', 10], ['john', '1', '1', 12],
                        ['jack', '2', '1', 14], ['jim', '2', '1', 10],
                        ['mel', '3', '2', 20], ['jim', '3', '2', 10],
                        ['mat', '4', '2', 14], ['nick', '4', '2', 20],
                        ['tim', '5', '3', 16], ['john', '5', '3', 10],
                        ['lin', '6', '3', 16], ['mick', '6', '3', 20]],
                  columns=['user', 'game', 'tour', 'val'])
Since df is already sorted by tour, we could use groupby + first:
df['val'] = df.groupby('user')['val'].transform('first')
Output:
user game tour val
0 jim 1 1 10
1 john 1 1 12
2 jack 2 1 14
3 jim 2 1 10
4 mel 3 2 20
5 jim 3 2 10
6 mat 4 2 14
7 nick 4 2 20
8 tim 5 3 16
9 john 5 3 12
10 lin 6 3 16
11 mick 6 3 20
You can groupby on 'user' and filter out groups with only 1 element, and then select the first one, like so:
df.groupby('user').filter(lambda g: len(g) > 1).groupby('user').head(1)
output
user game tour val
0 jim 1 1 10
1 john 1 1 12
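A minimal check of the transform approach against the sample df; 'john' first appears with val 12, so his later row (originally 10) is overwritten:

```python
import pandas as pd

df = pd.DataFrame(data=[['jim', '1', '1', 10], ['john', '1', '1', 12],
                        ['jack', '2', '1', 14], ['jim', '2', '1', 10],
                        ['mel', '3', '2', 20], ['jim', '3', '2', 10],
                        ['mat', '4', '2', 14], ['nick', '4', '2', 20],
                        ['tim', '5', '3', 16], ['john', '5', '3', 10],
                        ['lin', '6', '3', 16], ['mick', '6', '3', 20]],
                  columns=['user', 'game', 'tour', 'val'])

# Every row of a user gets that user's first 'val' in row order,
# which matches first appearance by tour since df is sorted by tour.
df['val'] = df.groupby('user')['val'].transform('first')
print(df.loc[df['user'] == 'john'])
```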

How can I add a column to a dataframe with a value conditional on another dataframe?

I'm working with two dataframes:
Dataframe1 looks like:
user (index)  apples  bananas
Pete          4       2
Sara          5       10
Kara          4       2
Tom           3       3
Dataframe2 looks like:
index  user
1      Pete
2      Sara
I want to create a new boolean column in dataframe1 that is true if the user is in dataframe 2. So output looks like:
user  apples  bananas  new column
Pete  4       2        True
Sara  5       10       True
Kara  4       2        False
Tom   3       3        False
I tried using a lambda function but didn't get very far.
Here is an easy way of doing that.
df = df.reset_index()
df2['new_column'] = True
df = pd.merge(df, df2, on='user', how='left')
df['new_column'] = df['new_column'].fillna(False)
You can leverage the indicator param of df.merge. Then use df.replace:
In [598]: x = df1.merge(df2['user'], left_on='user (index)', right_on='user', how='left', indicator='new column').replace({'both': True, 'left_only': False}).drop(columns='user')
In [599]: x
Out[599]:
user (index) apples bananas new column
0 Pete 4 2 True
1 Sara 5 10 True
2 Kara 4 2 False
3 Tom 3 3 False
Or, for better performance, use Series.map instead of df.replace:
In [609]: y = df1.merge(df2['user'], left_on='user (index)', right_on='user', how='left', indicator='new column').drop(columns='user')
In [611]: y['new column'] = y['new column'].map({'both': True, 'left_only':False})
In [612]: y
Out[612]:
user (index) apples bananas new column
0 Pete 4 2 True
1 Sara 5 10 True
2 Kara 4 2 False
3 Tom 3 3 False
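Since the question only needs membership, a simpler alternative (not one of the answers above) is Index.isin; a minimal sketch, assuming user is df1's index as shown in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'apples': [4, 5, 4, 3],
                    'bananas': [2, 10, 2, 3]},
                   index=pd.Index(['Pete', 'Sara', 'Kara', 'Tom'], name='user'))
df2 = pd.DataFrame({'user': ['Pete', 'Sara']}, index=[1, 2])

# True where the index value appears anywhere in df2['user'].
df1['new column'] = df1.index.isin(df2['user'])
print(df1)
```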

Returning dataframe of multiple rows/columns per one row of input

I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
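For this particular filter, the same frame can be produced without apply at all; a sketch using boolean indexing with isin (keeping the rows of df2 whose Name appears in df1):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Value': range(1, 3)})
df2 = pd.DataFrame({'Name': ['A'] * 3 + ['B'] * 4 + ['C'], 'Value': range(1, 9)})

# Keep every df2 row whose Name occurs somewhere in df1['Name'].
result = df2[df2['Name'].isin(df1['Name'])]
print(result)
```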

Retain only duplicated rows in a pandas dataframe

I have a dataframe with two columns: "Agent" and "Client"
Each row corresponds to an interaction between an Agent and a client.
I want to keep only the rows if a client had interactions with at least 2 agents.
How can I do that?
Worth adding that you can now use df.duplicated():
df = df.loc[df.duplicated(subset='Agent', keep=False)]
Use groupby and transform by value_counts.
df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
Note that you might have one agent interacting with the same client multiple times. This might be retained as a false positive. If you do not want this, you could add a drop_duplicates call before filtering:
df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
print(df)
A B
0 1 2
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0 False
1 True
2 True
3 True
4 True
5 True
Name: B, dtype: bool
df = df[mask]
print(df)
A B
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
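Note that the question as stated asks for clients seen by at least two distinct agents, while duplicated/value_counts count rows. A sketch with hypothetical column names Agent/Client, using transform('nunique') to count distinct agents per client:

```python
import pandas as pd

# Hypothetical interaction log: c1 saw two agents, c2 saw the same agent
# twice, c3 saw one agent once.
df = pd.DataFrame({'Agent':  ['a1', 'a2', 'a1', 'a1', 'a3'],
                   'Client': ['c1', 'c1', 'c2', 'c2', 'c3']})

# Keep only rows of clients who interacted with 2+ distinct agents.
out = df[df.groupby('Client')['Agent'].transform('nunique') >= 2]
print(out)
```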

Pandas: merge dataframes without creating new columns

I've got 2 dataframes with identical columns:
df1 = pd.DataFrame([['Abe','1','True'],['Ben','2','True'],['Charlie','3','True']], columns=['Name','Number','Other'])
df2 = pd.DataFrame([['Derek','4','False'],['Ben','5','False'],['Erik','6','False']], columns=['Name','Number','Other'])
which give:
Name Number Other
0 Abe 1 True
1 Ben 2 True
2 Charlie 3 True
and
Name Number Other
0 Derek 4 False
1 Ben 5 False
2 Erik 6 False
I want an output dataframe that is an intersection of the two based on "Name":
output_df =
Name Number Other
0 Ben 2 True
1 Ben 5 False
I've tried a basic pandas merge, but the result is not what I want:
pd.merge(df1,df2,how='inner',on='Name') =
Name Number_x Other_x Number_y Other_y
0 Ben 2 True 5 False
These dataframes are quite large so I'd prefer to use some pandas magic to keep things quick.
You can use concat, then filter with isin via boolean indexing, computing the common names with numpy.intersect1d:
val = np.intersect1d(df1.Name, df2.Name)
print (val)
['Ben']
df = pd.concat([df1,df2], ignore_index=True)
print (df[df.Name.isin(val)])
Name Number Other
1 Ben 2 True
4 Ben 5 False
Another way to compute val is the intersection of sets:
val = set(df1.Name).intersection(set(df2.Name))
print (val)
{'Ben'}
Then you can reset the index to make it monotonic:
df = pd.concat([df1,df2])
print (df[df.Name.isin(val)].reset_index(drop=True))
Name Number Other
0 Ben 2 True
1 Ben 5 False
