I can't wrap my head around how to do this: I want to go from this DataFrame:
Date Value
Jan-15 300
Feb-15 302
Mar-15 303
Apr-15 305
May-15 307
Jun-15 307
Jul-15 305
Aug-15 306
Sep-15 308
Oct-15 310
Nov-15 309
Dec-15 312
Jan-16 315
Feb-16 317
Mar-16 315
Apr-16 315
May-16 312
Jun-16 314
Jul-16 312
Aug-16 313
Sep-16 316
Oct-16 316
Nov-16 316
Dec-16 312
To this one by calculating over-the-month and over-the-year change:
Date Value otm oty
Jan-15 300 na na
Feb-15 302 2 na
Mar-15 303 1 na
Apr-15 305 2 na
May-15 307 2 na
Jun-15 307 0 na
Jul-15 305 -2 na
Aug-15 306 1 na
Sep-15 308 2 na
Oct-15 310 2 na
Nov-15 309 -1 na
Dec-15 312 3 na
Jan-16 315 3 15
Feb-16 317 2 15
Mar-16 315 -2 12
Apr-16 315 0 10
May-16 312 -3 5
Jun-16 314 2 7
Jul-16 312 -2 7
Aug-16 313 1 7
Sep-16 316 3 8
Oct-16 316 0 6
Nov-16 316 0 7
Dec-16 312 -4 0
So otm is calculated from the value in the row above, and oty from the value 12 rows above.
I think you need diff, but it only works if no months are missing from the index:
df['otm'] = df.Value.diff()
df['oty'] = df.Value.diff(12)
print (df)
Date Value otm oty
0 Jan-15 300 NaN NaN
1 Feb-15 302 2.0 NaN
2 Mar-15 303 1.0 NaN
3 Apr-15 305 2.0 NaN
4 May-15 307 2.0 NaN
5 Jun-15 307 0.0 NaN
6 Jul-15 305 -2.0 NaN
7 Aug-15 306 1.0 NaN
8 Sep-15 308 2.0 NaN
9 Oct-15 310 2.0 NaN
10 Nov-15 309 -1.0 NaN
11 Dec-15 312 3.0 NaN
12 Jan-16 315 3.0 15.0
13 Feb-16 317 2.0 15.0
14 Mar-16 315 -2.0 12.0
15 Apr-16 315 0.0 10.0
16 May-16 312 -3.0 5.0
17 Jun-16 314 2.0 7.0
18 Jul-16 312 -2.0 7.0
19 Aug-16 313 1.0 7.0
20 Sep-16 316 3.0 8.0
21 Oct-16 316 0.0 6.0
22 Nov-16 316 0.0 7.0
23 Dec-16 312 -4.0 0.0
If some months are missing, it is a bit more complicated:
convert Date with to_datetime + to_period
set_index + reindex - if the first Jan or last Dec values are missing, it is better to set the range bounds manually instead of using min and max
change the index format back with strftime
reset_index
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y').dt.to_period('M')
df = df.set_index('Date')
df = df.reindex(pd.period_range(df.index.min(), df.index.max(), freq='M'))
df.index = df.index.strftime('%b-%y')
df = df.rename_axis('date').reset_index()
df['otm'] = df.Value.diff()
df['oty'] = df.Value.diff(12)
print (df)
date Value otm oty
0 Jan-15 300.0 NaN NaN
1 Feb-15 302.0 2.0 NaN
2 Mar-15 NaN NaN NaN
3 Apr-15 NaN NaN NaN
4 May-15 307.0 NaN NaN
5 Jun-15 307.0 0.0 NaN
6 Jul-15 305.0 -2.0 NaN
7 Aug-15 306.0 1.0 NaN
8 Sep-15 308.0 2.0 NaN
9 Oct-15 310.0 2.0 NaN
10 Nov-15 309.0 -1.0 NaN
11 Dec-15 312.0 3.0 NaN
12 Jan-16 315.0 3.0 15.0
13 Feb-16 317.0 2.0 15.0
14 Mar-16 315.0 -2.0 NaN
15 Apr-16 315.0 0.0 NaN
16 May-16 312.0 -3.0 5.0
17 Jun-16 314.0 2.0 7.0
18 Jul-16 312.0 -2.0 7.0
19 Aug-16 313.0 1.0 7.0
20 Sep-16 316.0 3.0 8.0
21 Oct-16 316.0 0.0 6.0
22 Nov-16 316.0 0.0 7.0
23 Dec-16 312.0 -4.0 0.0
A more robust solution is to shift by month frequency:
# Create a datetime column
df['DateTime'] = pd.to_datetime(df['Date'], format='%b-%y')
# Set it as the index
df.set_index('DateTime', inplace=True)
# Then shift by month frequency (the subtraction aligns on the dates, so missing months are handled):
df['otm'] = df['Value'] - df['Value'].shift(1, freq='MS')
df['oty'] = df['Value'] - df['Value'].shift(12, freq='MS')
# If no months are missing, a plain positional shift gives the same result:
df['otm'] = df['Value'] - df['Value'].shift(1)
df['oty'] = df['Value'] - df['Value'].shift(12)
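To see the difference, here is a minimal sketch with a hypothetical gap at Mar-15 (the toy series below is made up for illustration):
import pandas as pd

# toy series with Mar-15 deliberately missing
idx = pd.to_datetime(['Jan-15', 'Feb-15', 'Apr-15', 'May-15'], format='%b-%y')
s = pd.Series([300, 302, 305, 307], index=idx)

# positional shift silently compares Apr-15 with Feb-15
print(s - s.shift(1))
# calendar-aware shift leaves Apr-15 as NaN because Mar-15 is absent
# (the aligned result also contains the shifted-only dates Mar-15 and Jun-15)
print(s - s.shift(1, freq='MS'))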
I have a CSV file that looks like the table below:
depth   x1    x2    x3
1000    NaN   NaN   NaN
1001    NaN   NaN   NaN
1002    NaN   NaN   NaN
1003    NaN   10    NaN
1004    NaN   NaN   NaN
1005    NaN   NaN   10
1006    NaN   NaN   NaN
1007    10    NaN   NaN
1008    11    NaN   NaN
1009    12    NaN   NaN
1010    13    NaN   NaN
1011    14    NaN   15
1012    15    20    NaN
1013    NaN   NaN   NaN
1014    NaN   NaN   NaN
1015    18    NaN   NaN
1016    19    NaN   NaN
1017    20    NaN   NaN
1018    21    NaN   20
1019    22    NaN   NaN
1020    23    NaN   NaN
1021    24    25    NaN
1022    25    NaN   NaN
1023    26    NaN   NaN
1024    27    NaN   25
1025    28    15    NaN
1026    NaN   NaN   NaN
1027    NaN   NaN   NaN
1028    NaN   NaN   NaN
I want to interpolate between the first and last valid values and then fill the remaining NaN values with zeros.
The result should look like this:
depth   x1    x2            x3
1000    0     0             0
1001    0     0             0
1002    0     0             0
1003    0     10            0
1004    0     11.11111111   0
1005    0     12.22222222   10
1006    0     13.33333333   10.83333333
1007    10    14.44444444   11.66666667
1008    11    15.55555556   12.5
1009    12    16.66666667   13.33333333
1010    13    17.77777778   14.16666667
1011    14    18.88888889   15
1012    15    20            15.71428571
1013    16    20.55555556   16.42857143
1014    17    21.11111111   17.14285714
1015    18    21.66666667   17.85714286
1016    19    22.22222222   18.57142857
1017    20    22.77777778   19.28571429
1018    21    23.33333333   20
1019    22    23.88888889   20.83333333
1020    23    24.44444444   21.66666667
1021    24    25            22.5
1022    25    22.5          23.33333333
1023    26    20            24.16666667
1024    27    17.5          25
1025    28    15            25
1026    28    0             0
1027    28    0             0
1028    28    0             0
I have tried the code below, but it does not give the correct result:
import pandas as pd
df = pd.read_csv(r"C:\Users\mohamed\OneDrive\Desktop\test_interpolate.csv")
df = df.interpolate()
df = df.fillna(0)
print (df)
df.to_csv(r"C:\Users\mohamed\OneDrive\Desktop\result.csv")
You can limit the interpolation to the inner NaN using limit_area='inside' (this is well covered in the documentation of interpolate):
df = df.interpolate(limit_area='inside').fillna(0)
output:
depth x1 x2 x3
0 1000 0.0 0.000000 0.000000
1 1001 0.0 0.000000 0.000000
2 1002 0.0 0.000000 0.000000
3 1003 0.0 10.000000 0.000000
4 1004 0.0 11.111111 0.000000
5 1005 0.0 12.222222 10.000000
6 1006 0.0 13.333333 10.833333
7 1007 10.0 14.444444 11.666667
8 1008 11.0 15.555556 12.500000
9 1009 12.0 16.666667 13.333333
10 1010 13.0 17.777778 14.166667
11 1011 14.0 18.888889 15.000000
12 1012 15.0 20.000000 15.714286
13 1013 16.0 20.555556 16.428571
14 1014 17.0 21.111111 17.142857
15 1015 18.0 21.666667 17.857143
16 1016 19.0 22.222222 18.571429
17 1017 20.0 22.777778 19.285714
18 1018 21.0 23.333333 20.000000
19 1019 22.0 23.888889 20.833333
20 1020 23.0 24.444444 21.666667
21 1021 24.0 25.000000 22.500000
22 1022 25.0 22.500000 23.333333
23 1023 26.0 20.000000 24.166667
24 1024 27.0 17.500000 25.000000
25 1025 28.0 15.000000 0.000000
26 1026 0.0 0.000000 0.000000
27 1027 0.0 0.000000 0.000000
28 1028 0.0 0.000000 0.000000
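To see why the original interpolate().fillna(0) gave unexpected trailing values, here is a minimal sketch on a toy Series (made up for illustration): by default, linear interpolation also fills NaNs after the last valid value, while limit_area='inside' only fills NaNs surrounded by valid values.
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 3, np.nan])

# default: the trailing NaN is filled with the last valid value
print(s.interpolate().fillna(0).tolist())                     # [0.0, 1.0, 2.0, 3.0, 3.0]

# limit_area='inside': only the inner NaN is interpolated
print(s.interpolate(limit_area='inside').fillna(0).tolist())  # [0.0, 1.0, 2.0, 3.0, 0.0]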
Basically, the task is that for every customer the last 5 transactions should show up, but only based on that customer's own rows.
df = pd.DataFrame({
"customer_id": [121,121,121,121,121,121,121,233,233,233,233,233,233,233,233],
"Amount": [500,300,400,239,568,243,764,890,456,420,438,234,476,568,243,]
})
So, I am trying to create 5 new columns based on shift of "Amount" column.
For this, the code below works well:
for obs in range(1,6):
    df['S_'+ str(obs)] = df.Amount.shift(obs)
output:
customer_id Amount S_1 S_2 S_3 S_4 S_5
0 121 500 NaN NaN NaN NaN NaN
1 121 300 500.0 NaN NaN NaN NaN
2 121 400 300.0 500.0 NaN NaN NaN
3 121 239 400.0 300.0 500.0 NaN NaN
4 121 568 239.0 400.0 300.0 500.0 NaN
5 121 243 568.0 239.0 400.0 300.0 500.0
6 121 764 243.0 568.0 239.0 400.0 300.0
7 233 890 764.0 243.0 568.0 239.0 400.0
8 233 456 890.0 764.0 243.0 568.0 239.0
9 233 420 456.0 890.0 764.0 243.0 568.0
10 233 438 420.0 456.0 890.0 764.0 243.0
11 233 234 438.0 420.0 456.0 890.0 764.0
12 233 476 234.0 438.0 420.0 456.0 890.0
13 233 568 476.0 234.0 438.0 420.0 456.0
14 233 243 568.0 476.0 234.0 438.0 420.0
Problem
With this method, the next customer at index 7 also shows the previous customer's transactions, which is wrong. Those values should be NaN.
I think I need to group by customer_id and then take the shift of Amount within each customer, but I am not able to do that.
You can use groupby when shifting:
for obs in range(1,6):
    df['S_'+ str(obs)] = df.groupby(["customer_id"]).Amount.shift(obs)
which results in
customer_id Amount S_1 S_2 S_3 S_4 S_5
0 121 500 NaN NaN NaN NaN NaN
1 121 300 500.0 NaN NaN NaN NaN
2 121 400 300.0 500.0 NaN NaN NaN
3 121 239 400.0 300.0 500.0 NaN NaN
4 121 568 239.0 400.0 300.0 500.0 NaN
5 121 243 568.0 239.0 400.0 300.0 500.0
6 121 764 243.0 568.0 239.0 400.0 300.0
7 233 890 NaN NaN NaN NaN NaN
8 233 456 890.0 NaN NaN NaN NaN
9 233 420 456.0 890.0 NaN NaN NaN
10 233 438 420.0 456.0 890.0 NaN NaN
11 233 234 438.0 420.0 456.0 890.0 NaN
12 233 476 234.0 438.0 420.0 456.0 890.0
13 233 568 476.0 234.0 438.0 420.0 456.0
14 233 243 568.0 476.0 234.0 438.0 420.0
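The loop can also be collapsed into a single statement with a dict comprehension (a compact variant, not part of the original answer, assuming the same column names):
df = df.assign(**{f'S_{obs}': df.groupby('customer_id').Amount.shift(obs) for obs in range(1, 6)})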
You can use .groupby and then .apply with your own logic, like this:
import pandas as pd
df = pd.DataFrame({
"customer_id": [121, 121, 121, 121, 121, 121, 121, 233, 233, 233, 233, 233, 233, 233, 233],
"Amount": [500, 300, 400, 239, 568, 243, 764, 890, 456, 420, 438, 234, 476, 568, 243]
})
def add_S_cols(df):
    for obs in range(1, 6):
        df['S_' + str(obs)] = df.Amount.shift(obs)
    return df
print(df.groupby("customer_id").apply(add_S_cols))
Output:
Amount customer_id S_1 S_2 S_3 S_4 S_5
0 500 121 NaN NaN NaN NaN NaN
1 300 121 500.0 NaN NaN NaN NaN
2 400 121 300.0 500.0 NaN NaN NaN
3 239 121 400.0 300.0 500.0 NaN NaN
4 568 121 239.0 400.0 300.0 500.0 NaN
5 243 121 568.0 239.0 400.0 300.0 500.0
6 764 121 243.0 568.0 239.0 400.0 300.0
7 890 233 NaN NaN NaN NaN NaN
8 456 233 890.0 NaN NaN NaN NaN
9 420 233 456.0 890.0 NaN NaN NaN
10 438 233 420.0 456.0 890.0 NaN NaN
11 234 233 438.0 420.0 456.0 890.0 NaN
12 476 233 234.0 438.0 420.0 456.0 890.0
13 568 233 476.0 234.0 438.0 420.0 456.0
14 243 233 568.0 476.0 234.0 438.0 420.0
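Note: depending on the pandas version, apply may prepend customer_id to the index of the result; if that happens, passing group_keys=False should restore the flat index shown above (this behavior has changed across versions, so treat it as a suggestion rather than a guarantee):
print(df.groupby("customer_id", group_keys=False).apply(add_S_cols))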
I am new to python and pandas and I am trying to solve this problem:
I have a dataset that looks something like this:
timestamp par_1 par_2
1486873206867 0 0
1486873207039 NaN 0
1486873207185 0 NaN
1486873207506 1 0
1486873207518 NaN NaN
1486873207831 1 0
1486873208148 0 NaN
1486873208469 0 1
1486873208479 1 NaN
1486873208793 1 NaN
1486873208959 NaN 1
1486873209111 1 NaN
1486873209918 NaN 0
1486873210075 0 NaN
I want to know the total duration of the event "1" for each parameter. (Parameters can only be NaN, 1 or 0)
I have already tried
df['duration_par_1'] = df.groupby(['par_1'])['timestamp'].apply(lambda x: x.max() - x.min())
but for further processing I only need the total duration of event "1" in new columns, repeated in every row of each new column, so that it looks like this:
timestamp par_1 par_2 duration_par_1 duration_par_2
1486873206867 0 0 2238 1449
1486873207039 NaN 0 2238 1449
1486873207185 0 NaN 2238 1449
1486873207506 1 0 2238 1449
1486873207518 NaN NaN 2238 1449
1486873207831 1 0 2238 1449
1486873208148 0 NaN 2238 1449
1486873208469 0 1 2238 1449
1486873208479 1 NaN 2238 1449
1486873208793 1 NaN 2238 1449
1486873208959 NaN 1 2238 1449
1486873209111 1 NaN 2238 1449
1486873209918 NaN 0 2238 1449
1486873210075 0 NaN 2238 1449
Thanks in advance!
I believe you need to multiply the values of the par columns by the differences of the timestamps, because the data contain no values other than 0, 1 and NaN:
d = df['timestamp'].diff()
df1 = df.filter(like='par')
# if you need the duration of some other value, e.g. 0
#df1 = df.filter(like='par').eq(0).astype(int)
s = df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_')
df = df.assign(**s)
print (df)
timestamp par_1 par_2 duration_par_1 duration_par_2
0 1486873206867 0.0 0.0 1110 487
1 1486873207039 NaN 0.0 1110 487
2 1486873207185 0.0 NaN 1110 487
3 1486873207506 1.0 0.0 1110 487
4 1486873207518 NaN NaN 1110 487
5 1486873207831 1.0 0.0 1110 487
6 1486873208148 0.0 NaN 1110 487
7 1486873208469 0.0 1.0 1110 487
8 1486873208479 1.0 NaN 1110 487
9 1486873208793 1.0 NaN 1110 487
10 1486873208959 NaN 1.0 1110 487
11 1486873209111 1.0 NaN 1110 487
12 1486873209918 NaN 0.0 1110 487
13 1486873210075 0.0 NaN 1110 487
Explanation:
First, get the differences of the timestamp column:
print (df['timestamp'].diff())
0 NaN
1 172.0
2 146.0
3 321.0
4 12.0
5 313.0
6 317.0
7 321.0
8 10.0
9 314.0
10 166.0
11 152.0
12 807.0
13 157.0
Name: timestamp, dtype: float64
Select all columns whose name contains the string par with filter:
print (df.filter(like='par'))
par_1 par_2
0 0.0 0.0
1 NaN 0.0
2 0.0 NaN
3 1.0 0.0
4 NaN NaN
5 1.0 0.0
6 0.0 NaN
7 0.0 1.0
8 1.0 NaN
9 1.0 NaN
10 NaN 1.0
11 1.0 NaN
12 NaN 0.0
13 0.0 NaN
Multiply the filtered columns by d with mul:
print (df1.mul(d, axis=0))
par_1 par_2
0 NaN NaN
1 0.0 0.0
2 0.0 0.0
3 321.0 0.0
4 0.0 0.0
5 313.0 0.0
6 0.0 0.0
7 0.0 321.0
8 10.0 0.0
9 314.0 0.0
10 0.0 166.0
11 152.0 0.0
12 0.0 0.0
13 0.0 0.0
And sum the values:
print (df1.mul(d, axis=0).sum())
par_1 1110.0
par_2 487.0
dtype: float64
Convert to integers and prefix the index labels with add_prefix:
print (df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_'))
duration_par_1 1110
duration_par_2 487
dtype: int32
Finally, create the new columns with assign.
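For reference, assign(**s) unpacks the Series into keyword arguments, so with the values above it is roughly equivalent to:
df = df.assign(duration_par_1=1110, duration_par_2=487)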
Here is my pandas DataFrame:
import pandas as pd
import numpy as np
data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
"column2": [NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, NaN, NaN, 3, 3, 3, 3, 3, NaN, NaN], ...}
df = pd.DataFrame(data)
df
>>> column1 column2
0 338 NaN
1 519 1.0
2 871 1.0
3 1731 1.0
4 2693 1.0
5 2963 NaN
6 3379 NaN
7 3789 2.0
8 3910 2.0
9 4109 NaN
10 4307 NaN
11 4800 3.0
12 4912 3.0
13 5111 3.0
14 5341 3.0
15 5820 3.0
16 6003 NaN
17 .... ....
The integers in column2 denote "groups" in column1, e.g. rows 1-4 is group "1", rows 7-8 is group "2", rows 11-15 is group "3", etc.
I would like to calculate the difference between the first row and last row in each group. The resulting dataframe would look like this:
df
>>> column1 column2 column3
0 338 NaN NaN
1 519 1.0 2174
2 871 1.0 2174
3 1731 1.0 2174
4 2693 1.0 2174
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121
8 3910 2.0 121
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020
12 4912 3.0 1020
13 5111 3.0 1020
14 5341 3.0 1020
15 5820 3.0 1020
16 6003 NaN NaN
17 .... .... ...
because:
2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020
What is the "pandas way" to calculate column3? Somehow, one must iterate through column3, looking for consecutive groups of values such that df.column2 != "NaN".
EDIT: I realized my example may lead readers to assume the values in column1 are only increasing. Actually, there are intervals, column intervals
df = pd.DataFrame(data)
df
>>> interval column1 column2
0 interval1 338 NaN
1 interval1 519 1.0
2 interval1 871 1.0
3 interval1 1731 1.0
4 interval1 2693 1.0
5 interval1 2963 NaN
6 interval1 3379 NaN
7 interval1 3789 2.0
8 interval1 3910 2.0
9 interval1 4109 NaN
10 interval1 4307 NaN
11 interval1 4800 3.0
12 interval1 4912 3.0
13 interval1 5111 3.0
14 interval1 5341 3.0
15 interval1 5820 3.0
16 interval1 6003 NaN
17 .... ....
18 interval2 12 13
19 interval2 115 13
20 interval2 275 NaN
....
You can filter first and then take the difference between the first and last value with transform:
df['col3'] = (df[df.column2.notnull()]
                .groupby('column2')['column1']
                .transform(lambda x: x.iat[-1] - x.iat[0]))
print (df)
column1 column2 col3
0 338 NaN NaN
1 519 1.0 2174.0
2 871 1.0 2174.0
3 1731 1.0 2174.0
4 2693 1.0 2174.0
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121.0
8 3910 2.0 121.0
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020.0
12 4912 3.0 1020.0
13 5111 3.0 1020.0
14 5341 3.0 1020.0
15 5820 3.0 1020.0
16 6003 NaN NaN
EDIT1: with your new data:
df['col3'] = (df[df.column2.notnull()]
                .groupby('column2')['column1']
                .transform(lambda x: x.iat[-1] - x.iat[0]))
print (df)
interval column1 column2 col3
0 interval1 338 NaN NaN
1 interval1 519 1.0 2174.0
2 interval1 871 1.0 2174.0
3 interval1 1731 1.0 2174.0
4 interval1 2693 1.0 2174.0
5 interval1 2963 NaN NaN
6 interval1 3379 NaN NaN
7 interval1 3789 2.0 121.0
8 interval1 3910 2.0 121.0
9 interval1 4109 NaN NaN
10 interval1 4307 NaN NaN
11 interval1 4800 3.0 1020.0
12 interval1 4912 3.0 1020.0
13 interval1 5111 3.0 1020.0
14 interval1 5341 3.0 1020.0
15 interval1 5820 3.0 1020.0
16 interval1 6003 NaN NaN
18 interval2 12 13.0 103.0
19 interval2 115 13.0 103.0
20 interval2 275 NaN NaN
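If you prefer to avoid the Python-level lambda, a sketch using two built-in transforms on the same filtered groupby should give the same result (not part of the original answer):
g = df[df.column2.notnull()].groupby('column2')['column1']
df['col3'] = g.transform('last') - g.transform('first')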
I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:
import numpy
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    'c': [numpy.sum, numpy.mean, numpy.size],
    'd': [numpy.sum, numpy.mean, numpy.size]
})
which gives:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
but I want the (a, b) combinations that appear under a=25 but not under a=26 to also show up under a=26, filled with zeros. In other words, something like:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 0 0 0 0 0 0
25 0 0 0 0 0 0
How can I do this?
Consider the dataframe df
df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)
c d
sum mean size sum mean size
a b
25 20 8 3 5 5 0 2
21 3 7 8 9 2 7
23 2 1 3 2 5 4
24 9 0 1 7 1 6
25 1 9 3 5 8 8
26 23 8 8 4 8 0 5
You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0) followed by stack
df.unstack(fill_value=0).stack()
c d
mean size sum mean size sum
a b
25 20 3 5 8 0 2 5
21 7 8 3 2 7 9
23 1 3 2 5 4 2
24 0 1 9 1 6 7
25 9 3 1 8 8 5
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 8 4 8 0 5 8
24 0 0 0 0 0 0
25 0 0 0 0 0 0
Note: Using fill_value=0 preserves the dtype int. Without it, when unstacked, the gaps get filled with NaN and the dtypes get converted to float.
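A quick way to check the dtype difference, using the random integer df defined above (a small sketch comparing this approach with the replace-based variant shown below):
print(df.unstack(fill_value=0).stack().dtypes.unique())          # int64 is preserved
print(df.unstack().replace(np.nan, 0).stack().dtypes.unique())   # float64, because NaN appeared first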
print(df)
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
I like:
df = df.unstack().replace(np.nan,0).stack(-1)
print(df)
c d
mean size sum mean size sum
a b
25 20 0.804511 133.0 107.0 40060.150376 133.0 5328000.0
21 0.774648 142.0 110.0 42471.830986 142.0 6031000.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.947368 76.0 72.0 38421.052632 76.0 2920000.0
25 0.818182 66.0 54.0 38939.393939 66.0 2570000.0
26 20 0.000000 0.0 0.0 0.000000 0.0 0.0
21 0.000000 0.0 0.0 0.000000 0.0 0.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.000000 0.0 0.0 0.000000 0.0 0.0
25 0.000000 0.0 0.0 0.000000 0.0 0.0