Calculating expanding mean on 2 columns simultaneously - python

I have a table of 2 players competing against each other:
date plA plB ptsA ptsB
0 01/01/2013 Jeff Tom 78 72
1 15/01/2013 Jeff Tom 52 67
2 01/02/2013 Tom Jeff 91 93
3 15/02/2013 Jeff Tom 83 87
4 01/03/2013 Tom Jeff 65 76
I want to apply an expanding mean such that, for each player, both their ptsA and ptsB values are counted towards their running average (and none are left out). The final output should make it clearer:
date plA plB ptsA ptsB meanA meanB
0 01/01/2013 Jeff Tom 78 72 78 72 # init mean
1 15/01/2013 Jeff Tom 52 67 65 69.5
2 01/02/2013 Tom Jeff 91 93 74.3 76.6 # Tom: (72+67+91)/3, Jeff: (78+52+93)/3
3 15/02/2013 Jeff Tom 83 87 76.5 79.25 # Jeff: (78+52+93+83)/4, Tom: (72+67+91+87)/4
4 01/03/2013 Tom Jeff 65 76 76.4 76.4 # Tom: (72+67+91+87+65)/5, Jeff: (78+52+93+83+76)/5
Now, I started by grouping the data by plA like this:
by_A = players.sort(columns='date').groupby('plA')
players['meanA'] = by_A['ptsA'].apply(pd.expanding_mean)
players['meanB'] = by_A['ptsB'].apply(pd.expanding_mean)
and obviously I need to do the same with groupby('plB'), but then I'm drawing a blank on how to join the two results correctly.
Perhaps pandas offers a built-in or you have a solution for it?
EDIT: Saullo Castro's solution with slightly different data:
date studentA studentB scoreA scoreB meanJeff meanTom meanMaggie
0 2013-01-01 Jeff Tom 78 72 78.000000 72.000000 0.000000
1 2013-01-15 Jeff Maggie 52 67 65.000000 36.000000 33.500000
2 2013-02-01 Tom Jeff 91 93 74.333333 54.333333 22.333333
3 2013-02-15 Jeff Tom 83 87 76.500000 62.500000 16.750000
4 2013-03-01 Tom Jeff 65 76 76.400000 63.000000 13.400000
Maggie's mean should stay 67 all the way.

(Please refer to the fixed solution below.)
One approach is to find all the players' names first:
names = pd.concat((df.plA, df.plB)).unique()
Then create one new column with the expanding mean for each player:
for name in names:
    df['mean'+name] = pd.expanding_mean(df.ptsA*(df.plA==name) + df.ptsB*(df.plB==name))
Resulting in:
date plA plB ptsA ptsB meanJeff meanTom
0 2013-01-01 00:00:00 Jeff Tom 78 72 78.000000 72.000000
1 2013-01-15 00:00:00 Jeff Tom 52 67 65.000000 69.500000
2 2013-02-01 00:00:00 Tom Jeff 91 93 74.333333 76.666667
3 2013-02-15 00:00:00 Jeff Tom 83 87 76.500000 79.250000
4 2013-03-01 00:00:00 Tom Jeff 65 76 76.400000 76.400000
EDIT: Fixed solution:
For more than two names this is how you can build the formula for the expanding mean:
import numpy as np
import pandas as pd

df = pd.read_excel('stack.xlsx', 'tabelle1')
names = pd.concat((df.plA, df.plB)).unique()
for name in names:
    nA = df.plA == name
    nB = df.plB == name
    # running sum of the player's points divided by the running count of games they played
    df['mean'+name] = np.cumsum(df.ptsA*nA + df.ptsB*nB) / np.maximum(1., np.cumsum(1.*np.logical_or(nA, nB)))
Resulting in:
date plA plB ptsA ptsB meanJeff meanTom meanMaggie
0 2013-01-01 00:00:00 Jeff Tom 78 72 78.000000 72.000000 0
1 2013-01-15 00:00:00 Jeff Maggie 52 67 65.000000 72.000000 67
2 2013-02-01 00:00:00 Tom Jeff 91 93 74.333333 81.500000 67
3 2013-02-15 00:00:00 Jeff Tom 83 87 76.500000 83.333333 67
4 2013-03-01 00:00:00 Tom Jeff 65 76 76.400000 78.750000 67
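On newer pandas versions pd.expanding_mean has been removed, so here is a minimal sketch of the same idea using Series.expanding().mean(). The sample frame below is reconstructed from the question's data (with Maggie as the third player); the column names follow the question:
import pandas as pd

# Reconstruction of the sample data from the EDIT above.
df = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-15', '2013-02-01',
                            '2013-02-15', '2013-03-01']),
    'plA':  ['Jeff', 'Jeff', 'Tom', 'Jeff', 'Tom'],
    'plB':  ['Tom', 'Maggie', 'Jeff', 'Tom', 'Jeff'],
    'ptsA': [78, 52, 91, 83, 65],
    'ptsB': [72, 67, 93, 87, 76],
})

names = pd.concat((df.plA, df.plB)).unique()
for name in names:
    nA = df.plA == name
    nB = df.plB == name
    # points scored by `name` in each game, NaN for games they did not play
    pts = df.ptsA.where(nA, df.ptsB.where(nB))
    # expanding().mean() ignores the NaN rows, so the running mean is simply
    # carried forward through games the player sat out
    df['mean' + name] = pts.expanding().mean()
The only difference from the fixed output above is that a player's mean is NaN (rather than 0) before their first game.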

Related

Pandas: Joining two Dataframes based on two criteria matches

Hi, I have the following Dataframe that contains send and open totals, df_send_open:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other Dataframe contains calls and cancelled totals df_call_cancel:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in Excel, I want to add the additional columns from df_call_cancel to df_send_open; however, I need to do it on the unique combination of BOTH date and user_id, and this is where I'm tripping up.
I have two desired Dataframe outcomes (not sure which to go forward with, so I thought I'd ask for both solutions):
Desired Dataframe 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
Dataframe 1 only joins the call and cancel columns if the combination of date and user_id exists in df_send_open as this is the primary dataframe.
Desired Dataframe 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
Dataframe 2 will do the same as Dataframe 1 but will also add any new date and user combinations from df_call_cancel that aren't in df_send_open (see percy).
Many thanks.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id', 'name']).fillna(0)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id', 'name']).fillna(0)
This should work for your two cases: one left join and one outer join. Including 'name' in the join keys keeps a single name column (joining on date and user_id alone would leave you with name_x and name_y), and fillna(0) zero-fills the rows with no match, as in your desired output.
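One small caveat, assuming the merged_df2 built above: after the outer join, the columns that contained NaN before fillna(0) are promoted to float64, so if whole numbers are wanted, as in the desired tables, cast them back:
# fillna(0) leaves the previously-NaN numeric columns as float64;
# cast them back to integers so the output matches the desired tables.
int_cols = ['send', 'open', 'call', 'cancel']
merged_df2[int_cols] = merged_df2[int_cols].astype(int)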

How to add a row to every group with pandas groupby?

I wish to add a new row as the first line within each group. My raw dataframe is:
df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'Max', 'Max', 'Max', 'Max', 'Park', 'Tom', 'Tom', 'Tom', 'Tom', 'Wong'],
    'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
    'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person ('ID'), I wish to insert a duplicate of the first row within each group. The values of 'ID', 'From_num' and 'To_num' in the created row should be the same as in the old first row, but its 'Date' should be the old first row's date plus one day; e.g. for James, the newly created row is 'James', 78, 96, '2020-05-13'. The same applies to the rest of the data, so my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I wrote some loop conditions but they are quite slow. If you have any good ideas, please help. Thanks a lot.
Let's try groupby.apply here. We'll prepend a duplicated first row to each group, like this:
def augment_group(group):
    # duplicate the group's first row and shift its date forward by one day
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return first_row.append(group)

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

(df.groupby('ID', as_index=False, group_keys=False)
   .apply(augment_group)
   .reset_index(drop=True))
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
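A side note: DataFrame.append was removed in pandas 2.0, so on newer versions the same approach can be written with pd.concat (a sketch using the same function and frame names as above; the out variable is just for illustration):
import pandas as pd

def augment_group(group):
    # duplicate the group's first row and shift its date forward by one day
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return pd.concat([first_row, group])

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
out = (df.groupby('ID', as_index=False, group_keys=False)
         .apply(augment_group)
         .reset_index(drop=True))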
Although I agree with @Joran Beasley in the comments that this feels somewhat like an XY problem. Perhaps try clarifying the problem you're trying to solve, instead of asking how to implement what you think is the solution to your issue?

How to extract only rows whose column is notnull and put them in a variable

I wonder how to extract only rows whose column is notnull and put them in a variable.
My code is
data_result = df[df['english'].isnull().sum()==0]
But an error occurred. How do I fix it?
Dataframe:
name math english
0 John 90 nan
1 Ann 85 84
2 Brown 77 nan
3 Eva 92 93
4 Anita 91 90
5 Jimmy 75 69
Result
name math english
1 Ann 85 84
3 Eva 92 93
4 Anita 91 90
5 Jimmy 75 69
Try this:
data_result = df[df['english'].notnull()]
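For context on why the original attempt fails: df['english'].isnull().sum() == 0 evaluates to a single boolean (whether the whole column is free of nulls), not a row-wise mask, so indexing with it raises an error. A minimal sketch, with the sample frame reconstructed from the question:
import numpy as np
import pandas as pd

# Reconstruction of the sample frame shown in the question.
df = pd.DataFrame({
    'name': ['John', 'Ann', 'Brown', 'Eva', 'Anita', 'Jimmy'],
    'math': [90, 85, 77, 92, 91, 75],
    'english': [np.nan, 84, np.nan, 93, 90, 69],
})

# Boolean mask of rows where 'english' is not NaN, used to index the frame.
data_result = df[df['english'].notnull()]
# df.dropna(subset=['english']) is an equivalent spelling.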

Preserving multiindex column structure after performing a groupby summation

I have a three-level multiindex column. At the lowest level, I want to add a subtotal column.
So, in the example here, I would expect a new column zone: 'day', person: 'dave', find: 'subtotal' with value = 49+27+63 = 139. Similarly for all the other combinations of zone and person.
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['day', 'night'], ['dave', 'matt', 'mike'], ['gems', 'rocks', 'paper']])
rows = pd.date_range(start='20191201', periods=5, freq="d")
data = np.random.randint(0, high=100, size=(len(rows), len(cols)))
xf = pd.DataFrame(data, index=rows, columns=cols)
xf.columns.names = ['zone', 'person', 'find']
I can generate the correct subtotal data with xf.groupby(level=[0,1], axis="columns").sum(), but then I lose the find level of the columns; only the zone and person levels remain. I need that third column level, called subtotal, so that I can join the result back with the original xf dataframe. But I cannot figure out a nice pythonic way to add a third level back into the MultiIndex.
You can use sum first and then MultiIndex.from_product with a new level:
df = xf.sum(level=[0,1], axis="columns")
df.columns = pd.MultiIndex.from_product(df.columns.levels + [['subtotal']])
print (df)
day night
dave matt mike dave matt mike
subtotal subtotal subtotal subtotal subtotal subtotal
2019-12-01 85 99 163 210 93 252
2019-12-02 38 113 101 211 110 135
2019-12-03 145 75 122 181 165 176
2019-12-04 220 184 173 179 134 192
2019-12-05 126 77 29 184 178 199
And then join together by concat with DataFrame.sort_index:
df = pd.concat([xf, df], axis=1).sort_index(axis=1)
print (df)
zone day \
person dave matt mike
find gems paper rocks subtotal gems paper rocks subtotal gems paper
2019-12-01 33 96 24 153 34 89 90 213 15 51
2019-12-02 74 48 61 183 94 83 2 179 75 4
2019-12-03 88 85 51 224 65 3 52 120 95 80
2019-12-04 43 28 60 131 43 14 77 134 88 54
2019-12-05 41 72 44 157 63 77 37 177 8 66
zone ... night \
person ... dave matt mike
find ... rocks subtotal gems paper rocks subtotal gems paper rocks
2019-12-01 ... 24 102 19 49 4 72 43 57 92
2019-12-02 ... 90 206 96 55 92 243 75 58 68
2019-12-03 ... 29 182 11 90 85 186 9 20 46
2019-12-04 ... 30 84 25 55 89 169 98 41 85
2019-12-05 ... 73 167 52 90 49 191 51 80 37
zone
person
find subtotal
2019-12-01 192
2019-12-02 201
2019-12-03 75
2019-12-04 224
2019-12-05 168
[5 rows x 24 columns]
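On newer pandas versions DataFrame.sum(level=...) has been removed, so here is a sketch of the same idea that groups the transposed frame by the first two column levels and rebuilds the third level. The xf frame and level names come from the question; the subtot and result names are just illustrative:
# Sum over the 'find' level by grouping the transposed frame on the
# remaining two levels, then transpose back.
subtot = xf.T.groupby(level=['zone', 'person'], sort=False).sum().T
# Re-attach a third level labelled 'subtotal' so the columns line up with xf.
subtot.columns = pd.MultiIndex.from_tuples(
    [(zone, person, 'subtotal') for zone, person in subtot.columns],
    names=xf.columns.names)
result = pd.concat([xf, subtot], axis=1).sort_index(axis=1)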

Conditional summing of columns in pandas

I have the following DataFrame in Pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to add up only those columns which have "HW" in them. Any suggestions on how I can do that?
Note: the number of columns containing HW may differ, so I can't reference them directly.
You could call df.filter(regex='HW') to select the columns whose names contain 'HW' and then apply the sum row-wise via sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64
John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
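If the goal is to keep the row-wise total on the frame, a small follow-up (the 'HW_total' column name is just an illustration):
# Sum only the homework columns for each row and store the result as a new column.
df['HW_total'] = df.filter(regex='HW').sum(axis=1)
# Anchoring the pattern (regex='^HW') matches only columns that start with 'HW'.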
