Resampling a dataframe can take it to either a higher or a lower temporal resolution. Most of the time this is used to go to lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:
import numpy as np
import pandas as pd
freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")
index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")
df = pd.DataFrame(index=index, data=data, columns=cols)
# now resample to daily average
df = df.resample(rule="1D").mean()
Most of the data in this dataframe is useless and can be removed via:
df.dropna(how="all", axis=0, inplace=True)
however, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resultant dataframe would have only two rows)?
Updating my original answer with (what I think) is an improvement, plus updated times.
Use groupby
There are a couple of ways you can use groupby instead of resample. In the case of daily ("1D") resampling, you can just use the date property of the DatetimeIndex:
df = df.groupby(df.index.date).mean()
This is in fact faster than the resample for your data:
%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The more general approach would be to use the floor of the timestamps to do the groupby operation:
rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2020-01-01 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
This will work with more irregular frequencies as well. The main snag is that, by default, the floor is calculated relative to some fixed initial date, which can give weird results (see my post):
rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 1999-12-30 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-26 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
The issue is that the binning doesn't start on the earliest timestamp in your data. However, it is fixable using the solution from the post above:
# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset
rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-28 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
# the cycle of 7 days is now starting from 01-01-2000
Just note that the custom floor() function is relatively slow compared to the built-in DatetimeIndex.floor() / Series.dt.floor(). So it is best to use the latter if you can, but both are faster than the original resample (in your example):
%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
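As a side note, newer pandas versions (1.1+) can anchor the bins to the first timestamp directly through the origin argument of resample or pd.Grouper, which removes the need for the custom floor(); the empty bins are still created and have to be dropped afterwards, so the groupby floor trick above remains the leaner option. A minimal sketch, assuming pandas >= 1.1:
# resample anchored at the earliest timestamp instead of the default 'start_day' origin
df.resample('7D', origin='start').mean().dropna(how='all')
# equivalent groupby form
df.groupby(pd.Grouper(freq='7D', origin='start')).mean().dropna(how='all')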
Related
The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
"Why doesn't that work when I want a formatted string?" The error is clear: '%d' expects a single decimal value, not a pandas.Series.
Provided there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to keep that formatting, you can use map to apply it to every row. The .dt accessor is not needed because you will be working with each date itself, not a Series of dates. Also, isocalendar() returns a tuple whose second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
    return d.date.dt.strftime('%V/%Y')
def test2(d):
    return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
    return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
    return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
    return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Here's the thing: I have the dataset below, where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and applying 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method gives you a single column of your desired output by shifting everything down n rows and filling in NaNs above. So we build the new dataframe by feeding pd.DataFrame a dictionary of the form {column name: column data, ...}, using a dictionary comprehension to generate one shifted column per row of your original dataframe.
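For intuition, here is what a single shift looks like on the example data (a sketch, assuming the five rows shown in the question with date as the index; the real frame continues past them):
df['value'].shift(2)
# date
# 2020-01-01      NaN
# 2020-02-01      NaN
# 2020-03-01    100.0
# 2020-04-01    140.0
# 2020-05-01    156.0
# ...
# Name: value, dtype: float64
# this matches the value_t2 column of the desired output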
I think the best approach is to use numpy:
values = np.asarray(df['value'].astype(float))
idx = np.arange(values.shape[0])
# entry [i, j] is values[i - j]; for i < j the (wrapped) entries fall in the
# upper triangle and are masked with NaN in the next step
new_values = values[idx[:, None] - idx[None, :]]
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
idx = np.arange(values.shape[0])
new_values = values[idx[:, None] - idx[None, :]]
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
idx = np.arange(values.shape[0])
new_values = values[idx[:, None] - idx[None, :]]
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
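A quick way to check that the numpy construction and the shift-based construction agree on the values (a sketch; it compares the raw arrays, so the differing index and column labels are ignored, and NaNs count as equal by default in assert_allclose):
shift_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
np.testing.assert_allclose(new_df.to_numpy(), shift_df.to_numpy())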
I'm using 12 hours of sensor data at 25 Hz that I query from MongoDB into a dataframe.
I'm trying to extract a list or a dict of 1-minute dataframes from the 12 hours.
I use a window of 1 minute and a stride/step of 10 seconds.
The goal is to build a dataset by creating a list or dict of 1-minute dataframes/samples from the 12 hours of data, which will be converted to tensors and fed to a deep learning model.
The index of the dataframe is a datetime, and there are 4 columns of sensor values.
Here is what part of the data looks like:
A B C D
2020-06-17 22:00:00.000 1.052 -0.147 0.836 0.623
2020-06-17 22:00:00.040 1.011 -0.147 0.820 0.574
2020-06-17 22:00:00.080 1.067 -0.131 0.868 0.607
2020-06-17 22:00:00.120 1.033 -0.163 0.820 0.607
2020-06-17 22:00:00.160 1.030 -0.147 0.820 0.607
Below is sample code similar to how I extract the 1-minute windows of data. For 12 hours it takes 5 minutes, which is a long time.
Any ideas on how to reduce the running time in this case?
step = 10 * 25
w = 60 * 25
df  # the 12-hour dataframe of sensor data
sensor_dfs = []
df_range = range(0, df.shape[0] - step, step)
for a in df_range:
    sample = df.iloc[a:a + w]
    sensor_dfs.append(sample)
I created random data and made the following experiments looking at runtime:
# create random normal samples
w= 60*25 # 1 minute window
step=w # no overlap
num_samples=50000
data= np.random.normal(size=(num_samples,3))
date_rng=pd.date_range(start="2020-07-09 00:00:00.000",
freq="40ms",periods=num_samples)
data=pd.DataFrame(data, columns=["x","y","z"], index=date_rng)
data.head()
x y z
2020-07-09 00:00:00.000 -1.062264 -0.008656 0.399642
2020-07-09 00:00:00.040 0.182398 -1.014290 -1.108719
2020-07-09 00:00:00.080 -0.489814 -0.020697 0.651120
2020-07-09 00:00:00.120 -0.776405 -0.596601 0.611516
2020-07-09 00:00:00.160 0.663900 0.149909 -0.552779
The numbers are of type float64:
data.dtypes
x float64
y float64
z float64
dtype: object
using for loops
minute_samples = []
for i in range(0, len(data) - w, step):
    minute_samples.append(data.iloc[i:i + w])
result: 6.45 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using list comprehension
minute_samples=[data.iloc[i:i+w] for i in range(0,len(data)-w,step)]
result: 6.13 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with list comprehension
minute_samples=[df for i, df in data.groupby(pd.Grouper(freq="1T"))]
result: 7.89 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with dict
minute_samples=dict(tuple(data.groupby(pd.Grouper(freq="1T"))))
result: 7.41 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
resample is also an option here, but since it uses Grouper behind the scenes, I don't think it will be much different in terms of runtime.
It seems like the list comprehension is slightly better than the rest.
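Note that the timings above use non-overlapping windows (step = w). For the overlapping case in the question (1-minute windows with a 10-second stride at 25 Hz), the same list comprehension pattern applies; using len(data) - w as the stop keeps every window a full minute. A sketch, reusing the variable names from the example:
step = 10 * 25   # 10-second stride at 25 Hz
w = 60 * 25      # 1-minute window
minute_samples = [data.iloc[i:i + w] for i in range(0, len(data) - w, step)]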
I'm trying to get the mean of each column while grouped by id, BUT only the 50% of values between the 25% quantile and the 75% quantile should be used for the calculation (so ignore the lowest 25% of values and the highest 25%).
The data:
ID Property3 Property2 Property1
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I tried:
data.groupby('id').quantile(0.75).mean();
#data.groupby('id').agg(lambda grp: grp.quantil(0.25, 0,75)).mean(); something like that?
CW 67.089733
fd 0.265917
fd_maxna -1929.522001
fd_maxv -1542.468399
fd_sumna -1928.239954
fd_sumv -1488.165382
planc -13.165445
slope 13.654163
Something like that, but as far as I know GroupBy.quantile doesn't accept an in-between range, and I don't know how to also remove the lower 25%.
And this also doesn't return a dataframe.
What I want
Ideally, I would like to have a dataframe as follows:
ID Property3 Property2 Property1
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
Where only the data between the 25% quantile and the 75% quantile are used for the mean calculation, i.e. only the middle 50%.
Using GroupBy.apply here can be slow, so I'll avoid it. Suppose this is your data frame:
print(df)
ID Property3 Property2 Property1
0 1 10.2 58.337589 45.083237
1 1 20.1 70.844807 29.423138
2 1 51.9 67.126043 90.558225
3 1 15.8 17.478715 41.492485
4 1 12.5 18.247211 26.449900
5 1203 104.4 113.728439 130.698964
6 1203 11.5 29.659894 45.991533
7 1203 19.4 78.910591 40.049054
8 1203 23.1 78.395974 67.345487
So I would use GroupBy.cumcount + DataFrame.pivot_table
to calculate quantiles without using apply:
df['aux']=df.groupby('ID').cumcount()
new_df=df.pivot_table(columns='ID',index='aux',values=['Property1','Property2','Property3'])
print(new_df)
Property1 Property2 Property3
ID 1 1203 1 1203 1 1203
aux
0 45.083237 130.698964 58.337589 113.728439 10.2 104.4
1 29.423138 45.991533 70.844807 29.659894 20.1 11.5
2 90.558225 40.049054 67.126043 78.910591 51.9 19.4
3 41.492485 67.345487 17.478715 78.395974 15.8 23.1
4 26.449900 NaN 18.247211 NaN 12.5 NaN
#remove aux column
df=df.drop('aux',axis=1)
Now we calculate the mean using boolean indexing:
new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
ID
Property1 1 59.963006
1203 70.661294
Property2 1 49.863814
1203 45.703292
Property3 1 15.800000
1203 21.250000
dtype: float64
or create DataFrame with the mean:
mean_df=( new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
.rename_axis(index=['Property','ID'])
.unstack('Property') )
print(mean_df)
Property Property1 Property2 Property3
ID
1 41.492485 58.337589 15.80
1203 56.668510 78.653283 21.25
Measure times:
%%timeit
df['aux']=df.groupby('ID').cumcount()
new_df=df.pivot_table(columns='ID',index='aux',values=['Property1','Property2','Property3'])
df=df.drop('aux',axis=1)
( new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
.rename_axis(index=['Property','ID'])
.unstack('Property') )
25.2 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()
df.groupby("ID").apply(lambda x: x.apply(mean_of_25_to_75_pct))
33 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()
means = df.groupby("ID").apply(filter_mean)
23 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is almost as fast even with a small data frame; on larger data frames, such as your original data, it should be much faster than the other proposed methods (see: when to use apply).
You can use the quantile function to return multiple quantiles. Then, you can filter out values based on this, and compute the mean:
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()
means = data.groupby("id").apply(filter_mean)
Please try this.
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()
data.groupby("id").apply(lambda x: x.apply(mean_of_25_to_75_pct))
You could use SciPy's ready-made function for the trimmed mean, trim_mean():
from scipy import stats
means = data.groupby("id").apply(stats.trim_mean, 0.25)
If you insist on getting a dataframe, you could:
data.groupby("id").agg(lambda x: stats.trim_mean(x, 0.25)).reset_index()
I want to generate a summary table from a tidy pandas DataFrame. I now use groupby and two for loops, which does not seem efficient. It seems stacking and unstacking would get me there, but I have failed.
Sample data
import pandas as pd
import numpy as np
import copy
import random
df_tidy = pd.DataFrame(columns = ['Stage', 'Exc', 'Cat', 'Score'])
for _ in range(10):
    df_tidy = df_tidy.append(
        {
            'Stage': random.choice(['OP', 'FUEL', 'EOL']),
            'Exc': str(np.random.randint(low=0, high=1000)),
            'Cat': random.choice(['CC', 'HT', 'PM']),
            'Score': np.random.random(),
        }, ignore_index=True
    )
df_tidy
returns
Stage Exc Cat Score
0 OP 929 HT 0.946234
1 OP 813 CC 0.829522
2 FUEL 114 PM 0.868605
3 OP 896 CC 0.382077
4 FUEL 10 CC 0.832246
5 FUEL 515 HT 0.632220
6 EOL 970 PM 0.532310
7 FUEL 198 CC 0.209856
8 FUEL 848 CC 0.479470
9 OP 968 HT 0.348093
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:
Working but probably inefficient approach
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df
Which returns what I want:
OP FUEL EOL Total
CC 1.2116 1.52157 NaN 2.733170
HT 1.29433 0.63222 NaN 1.926548
PM NaN 0.868605 0.53231 1.400915
But I cannot believe this is the simplest or most efficient path.
Question
What pandas magic am I missing out on?
Update - Timing the proposed solutions
To understand the differences between pivot_table and crosstab proposed below, I timed the three solutions with a 100,000 row dataframe built exactly as above:
groupby solution, which I thought was inefficient:
%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
crosstab solution, which requires creating a DataFrame in the background, even though the passed data is already in DataFrame format:
%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pivot_table solution:
%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, it would appear that the clunky groupby solution is the quickest.
A simple solution using crosstab:
pd.crosstab(index=df.Cat,columns=df.Stage,values=df.Score,aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
Out[342]:
Stage EOL FUEL OP Total
Cat
CC NaN 1.521572 1.211599 2.733171
HT NaN 0.632220 1.294327 1.926547
PM 0.53231 0.868605 NaN 1.400915
I was wondering whether a simpler solution than pd.crosstab is to use pd.pivot_table:
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
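Since the question mentions stacking and unstacking, the same table can also be built with a plain groupby sum followed by unstack. A sketch (the Total column is added separately, and the Stage columns come out in sorted order rather than in order of appearance):
summary = df_tidy.groupby(['Cat', 'Stage'])['Score'].sum().unstack('Stage')
summary['Total'] = summary.sum(axis=1)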