I have the following DataFrame containing send and open totals, df_send_open:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other DataFrame contains call and cancel totals, df_call_cancel:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in Excel, I want to add the additional columns from df_call_cancel to df_send_open. However, I need to join on the unique combination of BOTH date and user_id, and this is where I'm tripping up.
I have two desired DataFrame outcomes (I'm not sure which to go forward with, so I thought I'd ask for both solutions):
Desired DataFrame 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
DataFrame 1 only joins the call and cancel columns where the combination of date and user_id exists in df_send_open, as this is the primary DataFrame.
Desired DataFrame 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
DataFrame 2 does the same as DataFrame 1, but also adds any date and user_id combinations in df_call_cancel that aren't in df_send_open (see percy).
Many thanks.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id', 'name']).fillna(0)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id', 'name']).fillna(0)
This should work for your two cases: one left join and one outer join. Including name in the join keys keeps it as a single column (otherwise you'd get name_x/name_y), and fillna(0) fills the missing call/cancel (and, for the outer join, send/open) totals with zeros, as in your desired outputs.
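As a self-contained check, here is a minimal sketch of both merges on a trimmed version of the sample data (two rows per frame, kept short for brevity):

```python
import pandas as pd

# Trimmed versions of the sample frames (same columns, fewer rows)
df_send_open = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31'],
    'user_id': [35, 47],
    'name': ['sally', 'bob'],
    'send': [50, 100],
    'open': [20, 55],
})
df_call_cancel = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31'],
    'user_id': [47, 21],
    'name': ['bob', 'percy'],
    'call': [150, 54],
    'cancel': [21, 21],
})

keys = ['date', 'user_id', 'name']
# Case 1: keep only rows whose keys exist in df_send_open
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=keys).fillna(0)
# Case 2: also keep combinations that exist only in df_call_cancel (percy)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=keys).fillna(0)
print(merged_df1)
print(merged_df2)
```

Note that fillna(0) converts the filled columns to float; cast back with astype(int) if you need integers.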
I have a DataFrame like this:
student marks term
steve 55 1
jordan 66 2
steve 53 1
alan 74 2
jordan 99 1
steve 81 2
alan 78 1
alan 76 2
jordan 48 1
I would like to return the highest two scores for each student:
student marks term
steve 81 2
steve 55 1
jordan 99 1
jordan 66 2
alan 78 1
alan 76 2
I have tried
df = df.groupby('student')['marks'].max()
but it returns only one row per student. I would like each student, in the order they are mentioned, with their top two scores.
You could use groupby + nlargest to find the 2 largest values; then use loc to sort in the order they appear in df:
out = (df.groupby('student')['marks'].nlargest(2)
         .droplevel(1)
         .loc[df['student'].drop_duplicates()]
         .reset_index())
Output:
student marks
0 steve 81
1 steve 55
2 jordan 99
3 jordan 66
4 alan 78
5 alan 76
If you want to keep "terms" as well, you could use the index:
idx = df.groupby('student')['marks'].nlargest(2).index.get_level_values(1)
out = df.loc[idx].set_index('student').loc[df['student'].drop_duplicates()].reset_index()
Output:
student marks term
0 steve 81 2
1 steve 55 1
2 jordan 99 1
3 jordan 66 2
4 alan 78 1
5 alan 76 2
@sammywemmy suggested a better way to derive the second result:
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
                .index.get_level_values(1)]
         .reset_index(drop=True))
You should use:
df = df.groupby(['student', 'term'])['marks'].max()
(with an optional .reset_index())
Sorting before grouping should suffice, since you need to keep the term column:
df.sort_values('marks').groupby('student', sort=False).tail(2)
student marks term
0 steve 55 1
1 jordan 66 2
7 alan 76 2
6 alan 78 1
5 steve 81 2
4 jordan 99 1
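Putting the nlargest approach together as a self-contained sketch on the sample data:

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    'student': ['steve', 'jordan', 'steve', 'alan', 'jordan',
                'steve', 'alan', 'alan', 'jordan'],
    'marks':   [55, 66, 53, 74, 99, 81, 78, 76, 48],
    'term':    [1, 2, 1, 2, 1, 2, 1, 2, 1],
})

# nlargest(2) per student; sort=False keeps students in order of appearance.
# The second index level holds the original row labels, so df.loc recovers
# the full rows (including term).
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
                .index.get_level_values(1)]
         .reset_index(drop=True))
print(out)
```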
I have two Dataframes
df1                              df2
  fname  lname  age                fname  lname  Position
0  Jack    Lee   45             0   Jack    Ray        25
1   Joy    Kay   34             1  Chris    Kay        34
2  Jeff    Kim   54             2    Kim     Xi        34
3  Josh  Chris   29             3   Josh    McC        24
                                4  David    Lee        56
                                5   Aron    Dev        41
                                6   Jack    Lee        45
                                7  Shane    Gab        43
                                8    Joy    Kay        34
                                9   Jack    Lee        45
I want to compare fname and lname across the two DataFrames and record the matches, since an entry from df1 can be repeated multiple times in df2 (e.g. Jack Lee from df1 appears in rows 6 and 9 of df2).
I'm not clear on how to fetch one row from df1 and compare it with all the rows of df2 (a one-to-many comparison).
Please assist me with this.
Using pd.merge() with indicator=True will return a clear comparison between the two dataframes based on the columns 'fname' and 'lname':
df = pd.merge(df2,
df1[['fname','lname']],
on=['fname','lname'],
how='left',
indicator=True)
Output of print(df):
fname lname Position _merge
0 Jack Ray 25 left_only
1 Chris Kay 34 left_only
2 Kim Xi 34 left_only
3 Josh McC 24 left_only
4 David Lee 56 left_only
5 Aron Dev 41 left_only
6 Jack Lee 45 both
7 Shane Gab 43 left_only
8 Joy Kay 34 both
9 Jack Lee 45 both
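For completeness, here is a runnable sketch reconstructing the two frames and the indicator merge (values taken from the question):

```python
import pandas as pd

# df1 holds the lookup names, df2 the rows to flag
df1 = pd.DataFrame({'fname': ['Jack', 'Joy', 'Jeff', 'Josh'],
                    'lname': ['Lee', 'Kay', 'Kim', 'Chris'],
                    'age': [45, 34, 54, 29]})
df2 = pd.DataFrame({'fname': ['Jack', 'Chris', 'Kim', 'Josh', 'David',
                              'Aron', 'Jack', 'Shane', 'Joy', 'Jack'],
                    'lname': ['Ray', 'Kay', 'Xi', 'McC', 'Lee',
                              'Dev', 'Lee', 'Gab', 'Kay', 'Lee'],
                    'Position': [25, 34, 34, 24, 56, 41, 45, 43, 34, 45]})

# Rows flagged 'both' appear in both frames; 'left_only' rows exist only in df2
df = pd.merge(df2,
              df1[['fname', 'lname']],
              on=['fname', 'lname'],
              how='left',
              indicator=True)
print(df)
```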
So I have a df like this:
data = {'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Name': [' Sheldon Webb', ' Traci Dean', ' Chad Webster', ' Ora Harmon', ' Elijah Mendoza', ' June Strickland', ' Beth Vasquez', ' Betty Sutton', ' Joel Gill', ' Vernon Stone'],
        'Performance': [33, 64, 142, 116, 122, 68, 95, 127, 132, 80]}
df = pd.DataFrame(data, columns=['Group', 'Name', 'Performance'])
which gives:
Group Name Performance
0 A Sheldon Webb 33
1 A Traci Dean 64
2 A Chad Webster 142
3 A Ora Harmon 116
4 A Elijah Mendoza 122
5 A June Strickland 68
6 B Beth Vasquez 95
7 B Betty Sutton 127
8 B Joel Gill 132
9 B Vernon Stone 80
I want to sort it in such an alternating way that within a group, say group "A", the first row should have its highest performing person (in this case "Chad Webster") and then in the second row the least performing (which is "Sheldon Webb").
The output I am looking for would look something like this:
Out[2]:
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
You can see the sequence is alternating between the highest and lowest within a group.
Take the sorted order and then apply a quadratic function to it whose root is half the length of the array (plus some small offset). This way the highest rank is given to the extremal values (the sign of the eps offset determines whether the highest value is ranked above the lowest value). I added a small group C at the end to show how it properly handles repeated values and an odd group size.
def extremal_rank(s):
    eps = 10**-4
    y = (pd.Series(np.arange(1, len(s)+1), index=s.sort_values().index)
         - (len(s)+1)/2 + eps)**2
    return y.reindex_like(s)

df['rnk'] = df.groupby('Group')['Performance'].apply(extremal_rank)
df = df.sort_values(['Group', 'rnk'], ascending=[True, False])
Group Name Performance rnk
2 A Chad Webster 142 6.2505
0 A Sheldon Webb 33 6.2495
4 A Elijah Mendoza 122 2.2503
1 A Traci Dean 64 2.2497
3 A Ora Harmon 116 0.2501
5 A June Strickland 68 0.2499
8 B Joel Gill 132 2.2503
9 B Vernon Stone 80 2.2497
7 B Betty Sutton 127 0.2501
6 B Beth Vasquez 95 0.2499
11 C b 110 9.0006
12 C c 68 8.9994
10 C a 110 4.0004
13 C d 68 3.9996
15 C f 70 1.0002
16 C g 70 0.9998
14 C e 70 0.0000
You can avoid groupby if you call sort_values on Performance twice, once descending and once ascending, concat both sorted DataFrames, then use sort_index and drop_duplicates to get the expected output:
df_ = (pd.concat([df.sort_values(['Group', 'Performance'], ascending=[True, False])
                    .reset_index(),  # keep the original index for the later drop_duplicates
                  df.sort_values(['Group', 'Performance'], ascending=[True, True])
                    .reset_index()
                    .set_index(np.arange(len(df)) + 0.5)],  # half-offset index for the later sort_index
                 axis=0)
         .sort_index()
         .drop_duplicates('index', keep='first')
         .reset_index(drop=True)
         [['Group', 'Name', 'Performance']]
      )
print(df_)
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
Apply the sorted concatenation of nlargest and nsmallest for each group:
>>> (df.groupby('Group')[df.columns[1:]]
       .apply(lambda x:
              pd.concat([x.nlargest(x.shape[0]//2, 'Performance').reset_index(),
                         x.nsmallest(x.shape[0] - x.shape[0]//2, 'Performance').reset_index()])
              .sort_index()
              .drop(columns='index'))
       .reset_index().drop(columns='level_1'))
Group Name Performance
0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
6 B Joel Gill 132
7 B Vernon Stone 80
8 B Betty Sutton 127
9 B Beth Vasquez 95
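As a cross-check of the answers above, here is a minimal runnable sketch of the interleaving idea (the alternate helper is my own name, and the names are trimmed of their leading spaces for brevity):

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Name': ['Sheldon Webb', 'Traci Dean', 'Chad Webster', 'Ora Harmon',
             'Elijah Mendoza', 'June Strickland', 'Beth Vasquez',
             'Betty Sutton', 'Joel Gill', 'Vernon Stone'],
    'Performance': [33, 64, 142, 116, 122, 68, 95, 127, 132, 80],
})

def alternate(g):
    # Sort the group descending, split it in half, and interleave
    # the top half with the reversed bottom half.
    desc = g.sort_values('Performance', ascending=False)
    half = (len(g) + 1) // 2
    top = desc.iloc[:half].assign(pos=range(half))
    bottom = desc.iloc[half:][::-1].assign(pos=range(len(g) - half))
    return (pd.concat([top, bottom])
              .sort_values('pos', kind='mergesort')  # stable: top rows come first
              .drop(columns='pos'))

out = df.groupby('Group', group_keys=False).apply(alternate).reset_index(drop=True)
print(out)
```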
Just another method, using a custom function with np.empty:
def mysort(s):
    arr = s.to_numpy()
    c = np.empty(arr.shape, dtype=arr.dtype)
    # even group size: split in half; odd: the first part gets the extra row
    idx = arr.shape[0]//2 if not arr.shape[0] % 2 else arr.shape[0]//2 + 1
    c[0::2], c[1::2] = arr[:idx], arr[idx:][::-1]
    return pd.DataFrame(c, columns=s.columns)

print(df.sort_values("Performance", ascending=False).groupby("Group").apply(mysort))
Group Name Performance
Group
A 0 A Chad Webster 142
1 A Sheldon Webb 33
2 A Elijah Mendoza 122
3 A Traci Dean 64
4 A Ora Harmon 116
5 A June Strickland 68
B 0 B Joel Gill 132
1 B Vernon Stone 80
2 B Betty Sutton 127
3 B Beth Vasquez 95
Let's try detecting the min, max rows with groupby().transform(), then sort:
groups = df.groupby('Group')['Performance']
mins, maxs = groups.transform('min'), groups.transform('max')
(df.assign(temp=df['Performance'].eq(mins) | df['Performance'].eq(maxs))
.sort_values(['Group','temp','Performance'],
ascending=[True, False, False])
.drop('temp', axis=1)
)
Output:
Group Name Performance
2 A Chad Webster 142
0 A Sheldon Webb 33
4 A Elijah Mendoza 122
3 A Ora Harmon 116
5 A June Strickland 68
1 A Traci Dean 64
8 B Joel Gill 132
9 B Vernon Stone 80
7 B Betty Sutton 127
6 B Beth Vasquez 95
I have a column in a pandas dataframe with street names such as
88 SØNDRE VEI 54
89 UTSIKTVEIEN 20B
92 KAARE MOURSUNDS VEG 14 A
94 OKSVALVEIEN 19
96 SLEMDALSVINGEN 33A
97 GAMLESTRØMSVEIEN 59
100 JONAS LIES VEI 68 A
What I want is to get separate columns for the street name, street number, and street letter. Is there a way, using pd.apply and join, to split the street names into three columns?
Thanks!
Edit: the 20B should be split into 20 and B separately.
IIUC, you can use str.extract with this regex (as a raw string, to avoid invalid-escape warnings):
df[1].str.extract(r'(\D+)\s+(\d+)\s?(.*)')
Output:
0 1 2
0 SØNDRE VEI 54
1 UTSIKTVEIEN 20 B
2 KAARE MOURSUNDS VEG 14 A
3 OKSVALVEIEN 19
4 SLEMDALSVINGEN 33 A
5 GAMLESTRØMSVEIEN 59
6 JONAS LIES VEI 68 A
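As a runnable variant, the same regex with named capture groups (the street/number/letter column names are my own choice):

```python
import pandas as pd

# A few of the sample addresses
s = pd.Series(['SØNDRE VEI 54', 'UTSIKTVEIEN 20B',
               'KAARE MOURSUNDS VEG 14 A', 'JONAS LIES VEI 68 A'])

# \D+ grabs the street name, \d+ the number, and the rest (if any) the letter;
# named groups become the column names of the result.
parts = s.str.extract(r'(?P<street>\D+)\s+(?P<number>\d+)\s?(?P<letter>.*)')
print(parts)
```

Note that the extracted number comes back as a string; use parts['number'].astype(int) if you need it numeric.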