Create new dataframe column based on 2 criteria from another dataframe - python

I have the following dataframe:
df = pd.DataFrame({'Name': ['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
                   'Age': [35,42,63,29,47,51],
                   'Salary_in_1000': [100,93,78,120,64,115],
                   'FT_Team': ['STEELERS','SEAHAWKS','FALCONS','FALCONS','PATRIOTS','STEELERS']})
df output:
   Name     Age  Salary_in_1000   FT_Team
0  JOHN      35             100  STEELERS
1  ALLEN     42              93  SEAHAWKS
2  BOB       63              78   FALCONS
3  NIKI      29             120   FALCONS
4  CHARLIE   47              64  PATRIOTS
5  CHANG     51             115  STEELERS
And my dataframe that I am trying to complete:
df1 = pd.DataFrame({'Name': ['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
                    'Age': [35,42,63,29,47,51]})
df1 output:
   Name     Age
0  JOHN      35
1  ALLEN     42
2  BOB       63
3  NIKI      29
4  CHARLIE   47
5  CHANG     51
I would like to add a new column to df1 that references ['FT_Team'] from df based upon 'Name' and 'Age' in df1.
I believe that the new code should look something like a .map; however, I am completely stumped as to what the arguments would be when matching on multiple columns.
df1['FT_Team'] =
final output:
   Name     Age   FT_Team
0  JOHN      35  STEELERS
1  ALLEN     42  SEAHAWKS
2  BOB       63   FALCONS
3  NIKI      29   FALCONS
4  CHARLIE   47  PATRIOTS
5  CHANG     51  STEELERS
Ultimately, I would like to match the football team from df based upon Name AND Age in df1

Per Quang Hoang:
df1=df1.merge(df[['Name','Age','FT_Team']], on=['Name','Age'], how='left')
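If you prefer the .map route the question asks about, one option is to build a lookup keyed by (Name, Age) tuples and map the same tuples built from df1's columns. A minimal sketch, assuming each (Name, Age) pair is unique in df:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['JOHN', 'ALLEN', 'BOB', 'NIKI', 'CHARLIE', 'CHANG'],
                   'Age': [35, 42, 63, 29, 47, 51],
                   'Salary_in_1000': [100, 93, 78, 120, 64, 115],
                   'FT_Team': ['STEELERS', 'SEAHAWKS', 'FALCONS', 'FALCONS', 'PATRIOTS', 'STEELERS']})
df1 = pd.DataFrame({'Name': ['JOHN', 'ALLEN', 'BOB', 'NIKI', 'CHARLIE', 'CHANG'],
                    'Age': [35, 42, 63, 29, 47, 51]})

# Lookup dict keyed by (Name, Age) tuples
mapping = df.set_index(['Name', 'Age'])['FT_Team'].to_dict()

# Build the same tuples from df1 and map them through the dict
keys = pd.Series(list(zip(df1['Name'], df1['Age'])), index=df1.index)
df1['FT_Team'] = keys.map(mapping)
```

The merge is usually the cleaner choice; the tuple-key map is mainly useful when you want unmatched rows to stay NaN without touching the rest of the frame.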

Related

Pandas: Joining two Dataframes based on two criteria matches

I have the following Dataframe that contains sends and open totals, df_send_open:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other Dataframe contains calls and cancelled totals df_call_cancel:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in Excel, I want to add the additional columns from df_call_cancel to df_send_open; however, I need to do it on the unique combination of BOTH date and user_id, and this is where I'm tripping up.
I have two desired Dataframe outcomes (not sure which to go forward with, so I thought I'd ask for both solutions):
Desired Dataframe 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
Dataframe 1 only joins the call and cancel columns if the combination of date and user_id exists in df_send_open as this is the primary dataframe.
Desired Dataframe 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
Dataframe 2 will do the same as Dataframe 1 but will also add any new date and user combinations in df_call_cancel that aren't in df_send_open (see percy).
Many thanks.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id', 'name']).fillna(0)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id', 'name']).fillna(0)
This should work for your two cases: one left join and one outer join. Including 'name' in the keys keeps it from coming out duplicated as name_x/name_y, and fillna(0) zeroes out the totals that have no match on the other side.
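The left-vs-outer behaviour can be checked on a small pair of frames (a hypothetical subset of the rows above; 'name' is included in the join keys so it is not duplicated as name_x/name_y):

```python
import pandas as pd

df_send_open = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31'],
    'user_id': [35, 47],
    'name': ['sally', 'bob'],
    'send': [50, 100],
    'open': [20, 55],
})
df_call_cancel = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31'],
    'user_id': [47, 21],
    'name': ['bob', 'percy'],
    'call': [150, 54],
    'cancel': [21, 21],
})

keys = ['date', 'user_id', 'name']
# Left join: keeps only df_send_open's rows; sally gets call/cancel of 0
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=keys).fillna(0)
# Outer join: also brings in percy, with send/open of 0
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=keys).fillna(0)
```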

Get the top 2 values for each unique value in another column

I have a DataFrame like this:
student marks term
steve 55 1
jordan 66 2
steve 53 1
alan 74 2
jordan 99 1
steve 81 2
alan 78 1
alan 76 2
jordan 48 1
I would like to return highest two scores for each student
student marks term
steve 81 2
steve 55 1
jordan 99 1
jordan 66 2
alan 78 1
alan 76 2
I have tried
df = df.groupby('student')['marks'].max()
but it returns only one row per student. I would like each student, in the order they first appear, with their top two scores.
You could use groupby + nlargest to find the 2 largest values; then use loc to sort in the order they appear in df:
out = (df.groupby('student')['marks'].nlargest(2)
         .droplevel(1)
         .loc[df['student'].drop_duplicates()]
         .reset_index())
Output:
student marks
0 steve 81
1 steve 55
2 jordan 99
3 jordan 66
4 alan 78
5 alan 76
If you want to keep "terms" as well, you could use the index:
idx = df.groupby('student')['marks'].nlargest(2).index.get_level_values(1)
out = df.loc[idx].set_index('student').loc[df['student'].drop_duplicates()].reset_index()
Output:
student marks term
0 steve 81 2
1 steve 55 1
2 jordan 99 1
3 jordan 66 2
4 alan 78 1
5 alan 76 2
@sammywemmy suggested a better way to derive the second result:
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
                .index.get_level_values(1)]
         .reset_index(drop=True))
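This variant can be checked end to end on the question's sample data (reconstructed here from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['steve', 'jordan', 'steve', 'alan', 'jordan', 'steve', 'alan', 'alan', 'jordan'],
    'marks':   [55, 66, 53, 74, 99, 81, 78, 76, 48],
    'term':    [1, 2, 1, 2, 1, 2, 1, 2, 1],
})

# nlargest on the grouped Series returns a (student, original-row-label)
# MultiIndex; level 1 recovers the full rows, including 'term'
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
                .index.get_level_values(1)]
         .reset_index(drop=True))
```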
You should use:
df = df.groupby(['student', 'term'])['marks'].max()
(with an optional .reset_index() )
Sorting before grouping should suffice, since you need to keep the term column:
df.sort_values('marks').groupby('student', sort=False).tail(2)
student marks term
0 steve 55 1
1 jordan 66 2
7 alan 76 2
6 alan 78 1
5 steve 81 2
4 jordan 99 1
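The sort-then-tail approach can likewise be verified on the sample data (reconstructed here); note it returns the top two in ascending order of marks rather than in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['steve', 'jordan', 'steve', 'alan', 'jordan', 'steve', 'alan', 'alan', 'jordan'],
    'marks':   [55, 66, 53, 74, 99, 81, 78, 76, 48],
    'term':    [1, 2, 1, 2, 1, 2, 1, 2, 1],
})

# After sorting ascending by marks, tail(2) keeps each student's two
# largest values (the last two rows of each group)
top2 = df.sort_values('marks').groupby('student', sort=False).tail(2)
```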

One to many comparison of data in two different DFs

I have two Dataframes
df1:
   fname  lname  age
0  Jack   Lee    45
1  Joy    Kay    34
2  Jeff   Kim    54
3  Josh   Chris  29

df2:
   fname  lname  Position
0  Jack   Ray    25
1  Chris  Kay    34
2  Kim    Xi     34
3  Josh   McC    24
4  David  Lee    56
5  Aron   Dev    41
6  Jack   Lee    45
7  Shane  Gab    43
8  Joy    Kay    34
9  Jack   Lee    45
I want to compare fname and lname from the two dfs and append the matches to a list, since entries from df1 may be repeated multiple times in df2.
(E.g. the data of row 0 in df1 is present in rows 6 and 9 of df2.)
I am not very clear on how to fetch one row from df1 and compare it with all the rows of df2 (one-to-many comparison).
Please assist me with this.
Using pd.merge() with indicator=True will return a clear comparison between the two dataframes based on the columns 'fname' and 'lname':
df = pd.merge(df2,
              df1[['fname','lname']],
              on=['fname','lname'],
              how='left',
              indicator=True)
print(df)
fname lname Position _merge
0 Jack Ray 25 left_only
1 Chris Kay 34 left_only
2 Kim Xi 34 left_only
3 Josh McC 24 left_only
4 David Lee 56 left_only
5 Aron Dev 41 left_only
6 Jack Lee 45 both
7 Shane Gab 43 left_only
8 Joy Kay 34 both
9 Jack Lee 45 both
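If you only need the rows of df2 that also occur in df1, the _merge column can be filtered afterwards. A minimal sketch with small hypothetical frames, assuming the (fname, lname) pairs in df1 are unique (otherwise drop duplicates first, or the left join will multiply rows):

```python
import pandas as pd

df1 = pd.DataFrame({'fname': ['Jack', 'Joy'], 'lname': ['Lee', 'Kay'],
                    'age': [45, 34]})
df2 = pd.DataFrame({'fname': ['Jack', 'Jack', 'Joy'], 'lname': ['Ray', 'Lee', 'Kay'],
                    'Position': [25, 45, 34]})

merged = pd.merge(df2, df1[['fname', 'lname']], on=['fname', 'lname'],
                  how='left', indicator=True)

# Keep only the rows of df2 whose (fname, lname) pair also appears in df1
matches = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```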

How to sort a group in a way that I get the largest number in the first row and smallest in the second and the second largest in the third and so on

So I have a df like this
In [1]:data= {'Group': ['A','A','A','A','A','A','B','B','B','B'],
'Name': [ ' Sheldon Webb',' Traci Dean',' Chad Webster',' Ora Harmon',' Elijah Mendoza',' June Strickland',' Beth Vasquez',' Betty Sutton',' Joel Gill',' Vernon Stone'],
'Performance':[33,64,142,116,122,68,95,127,132,80]}
In [2]:df = pd.DataFrame(data, columns = ['Group', 'Name','Performance'])
Out[1]:
Group Name Performance
0 A Sheldon Webb 33
1 A Traci Dean 64
2 A Chad Webster 142
3 A Ora Harmon 116
4 A Elijah Mendoza 122
5 A June Strickland 68
6 B Beth Vasquez 95
7 B Betty Sutton 127
8 B Joel Gill 132
9 B Vernon Stone 80
I want to sort it in such an alternating way that within a group, say group "A", the first row should have its highest performing person (in this case "Chad Webster") and then in the second row the least performing (which is "Sheldon Webb").
The output I am looking for would look something like this:
Out[2]:
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
You can see the sequence is alternating between the highest and lowest within a group.
Take the sorted order and then apply a quadratic function to it whose root is at half the length of the array (plus a small offset). This way the highest rank is given to the extremal values (the sign of the eps offset determines whether the highest value is ranked above the lowest value). I added a small group at the end to show how it properly handles repeated values and an odd group size.
def extremal_rank(s):
    eps = 10**-4
    y = (pd.Series(np.arange(1, len(s)+1), index=s.sort_values().index)
         - (len(s)+1)/2 + eps)**2
    return y.reindex_like(s)

df['rnk'] = df.groupby('Group')['Performance'].apply(extremal_rank)
df = df.sort_values(['Group', 'rnk'], ascending=[True, False])
Group Name Performance rnk
2 A Chad Webster 142 6.2505
0 A Sheldon Webb 33 6.2495
4 A Elijah Mendoza 122 2.2503
1 A Traci Dean 64 2.2497
3 A Ora Harmon 116 0.2501
5 A June Strickland 68 0.2499
8 B Joel Gill 132 2.2503
9 B Vernon Stone 80 2.2497
7 B Betty Sutton 127 0.2501
6 B Beth Vasquez 95 0.2499
11 C b 110 9.0006
12 C c 68 8.9994
10 C a 110 4.0004
13 C d 68 3.9996
15 C f 70 1.0002
16 C g 70 0.9998
14 C e 70 0.0000
You can avoid groupby if you use sort_values on Performance once ascending and once descending, concat both sorted dataframes, then use sort_index and drop_duplicates to get the expected output:
df_ = (pd.concat([df.sort_values(['Group', 'Performance'], ascending=[True, False])
.reset_index(), #need the original index for later drop_duplicates
df.sort_values(['Group', 'Performance'], ascending=[True, True])
.reset_index()
.set_index(np.arange(len(df))+0.5)], # for later sort_index
axis=0)
.sort_index()
.drop_duplicates('index', keep='first')
.reset_index(drop=True)
[['Group', 'Name', 'Performance']]
)
print(df_)
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
Apply the sorted concatenation of nlargest and nsmallest for each group:
>>> (df.groupby('Group')[df.columns[1:]]
       .apply(lambda x:
              pd.concat([x.nlargest(x.shape[0]//2, 'Performance').reset_index(),
                         x.nsmallest(x.shape[0] - x.shape[0]//2, 'Performance').reset_index()])
              .sort_index()
              .drop('index', axis=1))
       .reset_index().drop('level_1', axis=1))
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
Just another method, using a custom function with np.empty:
def mysort(s):
    arr = s.to_numpy()
    c = np.empty(arr.shape, dtype=arr.dtype)
    idx = arr.shape[0]//2 if not arr.shape[0] % 2 else arr.shape[0]//2 + 1
    c[0::2], c[1::2] = arr[:idx], arr[idx:][::-1]
    return pd.DataFrame(c, columns=s.columns)

print(df.sort_values("Performance", ascending=False).groupby("Group").apply(mysort))
Group Name Performance
Group
A 0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
B 0 B Joel Gill 132
1 B Vernon Stone 80
2 B Betty Sutton 127
3 B Beth Vasquez 95
Let's try detecting the min, max rows with groupby().transform(), then sort:
groups = df.groupby('Group')['Performance']
mins, maxs = groups.transform('min'), groups.transform('max')
(df.assign(temp=df['Performance'].eq(mins) | df['Performance'].eq(maxs))
.sort_values(['Group','temp','Performance'],
ascending=[True, False, False])
.drop('temp', axis=1)
)
Output:
Group Name Performance
2 A Chad Webster 142
0 A Sheldon Webb 33
4 A Elijah Mendoza 122
3 A Ora Harmon 116
5 A June Strickland 68
1 A Traci Dean 64
8 B Joel Gill 132
9 B Vernon Stone 80
7 B Betty Sutton 127
6 B Beth Vasquez 95

Separate street name strings with street number and letter python

I have a column in a pandas dataframe with street names such as
88 SØNDRE VEI 54
89 UTSIKTVEIEN 20B
92 KAARE MOURSUNDS VEG 14 A
94 OKSVALVEIEN 19
96 SLEMDALSVINGEN 33A
97 GAMLESTRØMSVEIEN 59
100 JONAS LIES VEI 68 A
What I want is to get separate columns for the street name, street number and street letter. Is there a way, using pd.apply and join, to split the street names into three columns?
Thanks!
Edit: The "20B" should be split into a value of 20 and a B separately.
IIUC, you can use this regex (as a raw string, so the backslash escapes are taken literally):
df[1].str.extract(r'(\D+)\s+(\d+)\s?(.*)')
Output:
0 1 2
0 SØNDRE VEI 54
1 UTSIKTVEIEN 20 B
2 KAARE MOURSUNDS VEG 14 A
3 OKSVALVEIEN 19
4 SLEMDALSVINGEN 33 A
5 GAMLESTRØMSVEIEN 59
6 JONAS LIES VEI 68 A
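To get readable column labels instead of 0/1/2, the same regex can use named groups; a small sketch on a few of the sample values (the column names street/number/letter are my own choice, not from the question):

```python
import pandas as pd

s = pd.Series(['SØNDRE VEI 54', 'UTSIKTVEIEN 20B', 'KAARE MOURSUNDS VEG 14 A'])

# Named groups become the output columns of str.extract
parts = s.str.extract(r'(?P<street>\D+)\s+(?P<number>\d+)\s?(?P<letter>.*)')
parts['street'] = parts['street'].str.strip()   # trim any stray whitespace
parts['number'] = parts['number'].astype(int)   # street number as an integer
```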
