Adding total row to pandas DataFrame groupby - python

I am aware of this link, but I didn't manage to solve my problem with it.
I have the DataFrame below, produced by pandas.DataFrame.groupby().sum():
                    Value
Level Company Item
1     X       a       100
              b       200
      Y       a        35
              b       150
              c        35
2     X       a        48
              b       100
              c        50
      Y       a        80
and would like to add total rows for each level of index that I have to get:
                      Value
Level Company Item
1     X       a         100
              b         200
              Total     300
      Y       a          35
              b         150
              c          35
              Total     220
      Total             520
2     X       a          48
              b         100
              c          50
              Total     198
      Y       a          80
              Total      80
      Total             278
Total                   798
As requested, the code to reproduce the DataFrame:
import pandas as pd

level = list(map(int, list('111112222')))
company = list('XXYYYXXXY')
item = list('ababcabca')
value = [100, 200, 35, 150, 35, 48, 100, 50, 80]
col = ['Level', 'Company', 'Item', 'Value']
df = pd.DataFrame([level, company, item, value]).T
df.columns = col
df.groupby(['Level', 'Company', 'Item']).sum()
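As an aside, the transposed-list construction above upcasts every column, including Value, to object dtype; building the frame from a dict keeps each column's dtype:
# same lists as above; dict construction preserves per-column dtypes
df = pd.DataFrame({'Level': level, 'Company': company, 'Item': item, 'Value': value})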

You can use:
m = df.groupby(['Level', 'Company', 'Item'])['Value'].sum().unstack()
m.assign(total=m.sum(1)).stack().to_frame('Value')
                     Value
Level Company Item
1     X       a      100.0
              b      200.0
              total  300.0
      Y       a       35.0
              b      150.0
              c       35.0
              total  220.0
2     X       a       48.0
              b      100.0
              c       50.0
              total  198.0
      Y       a       80.0
              total   80.0
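This covers the per-company totals. If the per-level and grand totals from the desired output are also needed, one way is to build each subtotal separately and concatenate (a sketch, assuming the df defined in the question; the index labels are converted to strings and lowercase 'total' is used so that the total rows sort after the data rows):
s = df.groupby(['Level', 'Company', 'Item'])['Value'].sum()
s.index = pd.MultiIndex.from_tuples(
    [tuple(map(str, t)) for t in s.index], names=s.index.names)

# per-(Level, Company) subtotals
ct = s.groupby(level=['Level', 'Company']).sum()
ct.index = pd.MultiIndex.from_tuples(
    [(l, c, 'total') for l, c in ct.index], names=s.index.names)

# per-Level subtotals
lt = s.groupby(level='Level').sum()
lt.index = pd.MultiIndex.from_tuples(
    [(l, 'total', '') for l in lt.index], names=s.index.names)

# grand total
gt = pd.Series([s.sum()], index=pd.MultiIndex.from_tuples(
    [('total', '', '')], names=s.index.names))

out = pd.concat([s, ct, lt, gt]).sort_index().to_frame('Value')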

Try this: basically, it creates two new dfs from the sums of the two groupings and concatenates the three data frames:
level = list(map(int, list('111112222')))
company = list('XXYYYXXXY')
item = list('ababcabca')
value = [100,200,35,150,35,48,100,50,80]
col = ['Level', 'Company', 'Item', 'Value']
df = pd.DataFrame([level,company,item,value]).T
df.columns = col
df1 = df.groupby(['Level', 'Company', 'Item'])['Value'].sum()
df2 = (df1.groupby(level=0).sum()
          .to_frame().assign(Company='total').set_index('Company', append=True))
df3 = (df1.groupby(['Level', 'Company']).sum()
          .to_frame().assign(Item='total').set_index('Item', append=True))
dfx = pd.concat([df1.to_frame().reset_index(),
                 df2.reset_index(),
                 df3.reset_index()], sort=False)
print(dfx)
print(dfx)
Output:
   Level Company   Item  Value
0      1       X      a    100
1      1       X      b    200
2      1       Y      a     35
3      1       Y      b    150
4      1       Y      c     35
5      2       X      a     48
6      2       X      b    100
7      2       X      c     50
8      2       Y      a     80
0      1   total    NaN    520
1      2   total    NaN    278
0      1       X  total    300
1      1       Y  total    220
2      2       X  total    198
3      2       Y  total     80
This is not sorted as you expect, though.
If I concat the 3 dfs without resetting the index, I get the expected sort order, but the index collapses to a flat index of tuples, because the three MultiIndexes have different numbers of levels:
dfx = pd.concat([df1.to_frame(), df2, df3]).sort_index()
Output
Value
(1, X, a) 100
(1, X, b) 200
(1, X, total) 300
(1, Y, a) 35
(1, Y, b) 150
(1, Y, c) 35
(1, Y, total) 220
(1, total) 520
(2, X, a) 48
(2, X, b) 100
(2, X, c) 50
(2, X, total) 198
(2, Y, a) 80
(2, Y, total) 80
(2, total) 278
I am not sure how to convert this to columns as in your df.
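One way to get a proper 3-level index here (a sketch, reusing the df1/df2/df3 frames from above) is to give df2 a dummy third level before concatenating, so all three frames share the same index structure:
# give df2 an 'Item' level so its index matches df1 and df3
df2b = df2.assign(Item='total').set_index('Item', append=True)
dfx = pd.concat([df1.to_frame(), df2b, df3]).sort_index()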

You can try stacking it one level at a time:
m = df.groupby(['Level','Company','Item'])['Value'].sum().unstack(level=['Company','Item'])
m = m.assign(total=m.sum(1))
m = m.stack(level='Company')
m = m.assign(total=m.sum(1))
m = m.stack(level='Item')
The output has duplicate totals though:
Level Company Item
1     X       a        100.0
              b        200.0
              total    300.0
      Y       a         35.0
              b        150.0
              c         35.0
              total    220.0
      total            520.0
      total            520.0
2     X       a         48.0
              b        100.0
              c         50.0
              total    198.0
      Y       a         80.0
              total     80.0
      total            278.0
      total            278.0
dtype: float64
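The duplicate arises because the first 'total' column is summed again by the second assign. One way to drop the redundant rows afterwards (a sketch; the exact labels depend on how assign names the extra column level) is:
dup = ((m.index.get_level_values('Company') == 'total')
       & (m.index.get_level_values('Item') == 'total'))
m = m[~dup]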

Related

Pandas Group By With Overlapping Bins

I want to sum up data across overlapping bins. Basically the same as the question here, but instead of the bins being (0-8 years old), (9-17 years old), (18-26 years old), (27-35 years old), and (36-44 years old), I want them to be (0-8 years old), (1-9 years old), (2-10 years old), (3-11 years old), and (4-12 years old).
Starting with a df like this:
id  awards  age
 1     100   24
 1     150   26
 1      50   54
 2     193   34
 2     209   50
I am using the code from this answer to calculate summation across non-overlapping bins.
bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)
print(cuts)
0 [18, 27)
1 [18, 27)
2 [54, 63)
3 [27, 36)
4 [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]
df_out = (df.groupby(['id', cuts])
            .agg(total_awards=('awards', 'sum'))
            .reset_index(level=0)
            .reset_index(drop=True))
df_out['age_interval'] = df_out.groupby('id').cumcount()
Result
print(df_out)
id total_awards age_interval
0 1 0 0
1 1 0 1
2 1 250 2
3 1 0 3
4 1 0 4
5 1 0 5
6 1 50 6
7 2 0 0
8 2 0 1
9 2 0 2
10 2 193 3
11 2 0 4
12 2 209 5
13 2 0 6
Is it possible to work off the existing code to do this with overlapping bins?
First pivot_table your data to get one row per id with the ages as columns. Then reindex to get all the possible ages, from 0 up to at least the max of the age column (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to meaningful names. Finally, stack and reset_index to get a dataframe with the expected shape.
interval = 9  # includes both bounds, e.g. 0 and 8 for the first interval
res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc='sum', fill_value=0)
      .reindex(columns=range(0, df['age'].max() + interval), fill_value=0)
      .rolling(interval, axis=1, min_periods=interval).sum()
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)
With the input data provided in the question, you get:
print(res)
# id age awards
# 0 1 0-8 y.o. 0.0
# 1 1 1-9 y.o. 0.0
# ...
# 15 1 15-23 y.o. 0.0
# 16 1 16-24 y.o. 100.0
# 17 1 17-25 y.o. 100.0
# 18 1 18-26 y.o. 250.0
# 19 1 19-27 y.o. 250.0
# 20 1 20-28 y.o. 250.0
# 21 1 21-29 y.o. 250.0
# 22 1 22-30 y.o. 250.0
# 23 1 23-31 y.o. 250.0
# 24 1 24-32 y.o. 250.0
# 25 1 25-33 y.o. 150.0
# 26 1 26-34 y.o. 150.0
# 27 1 27-35 y.o. 0.0
# ...
# 45 1 45-53 y.o. 0.0
# 46 1 46-54 y.o. 50.0
# 47 1 47-55 y.o. 50.0
# 48 1 48-56 y.o. 50.0
# 49 1 49-57 y.o. 50.0
# ...
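In recent pandas versions, rolling along axis=1 is deprecated; an equivalent sketch of the same pipeline transposes, rolls along the rows, and transposes back:
res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc='sum', fill_value=0)
      .reindex(columns=range(0, df['age'].max() + interval), fill_value=0)
      .T.rolling(interval, min_periods=interval).sum().T
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)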
I think the best would be to first compute per-age sums, and then use a rolling window to get all the 9-year intervals. This only works because all your intervals have the same size; otherwise it would be much harder.
>>> import numpy as np
>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0 6
1 2
2 4
3 6
4 4
..
98 0
99 0
100 0
101 0
102 0
Name: age, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8 42
1-9 43
2-10 45
3-11 47
4-12 47
..
90-98 31
91-99 27
92-100 20
93-101 13
94-102 8
Name: age, Length: 95, dtype: int64
This is slightly complicated by the fact you also want to group by id, but the idea stays the same:
>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
awards
1 0 128
1 204
2 136
3 367
4 387
... ...
2 98 0
99 0
100 0
101 0
102 0
[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
id awards
age
8 1 3112
9 1 3390
10 1 3431
11 1 3609
12 1 3820
.. .. ...
98 2 1786
99 2 1226
100 2 900
101 2 561
102 2 317
[190 rows x 2 columns]
This is the same as @Ben.T's answer, except we keep the Series shape where his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between the two answers' formats.
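For example, reusing the totals frame from above (a sketch):
wide = totals['awards'].unstack('age')  # one row per id, one column per age (Ben.T's format)
long_again = wide.stack()               # back to the (id, age) MultiIndexed series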
IIUC, you can use pd.IntervalIndex with some list comprehension:
ii = pd.IntervalIndex.from_tuples(
    [(s, e) for e, s in pd.Series(np.arange(51)).rolling(9).agg(min).dropna().items()]
)
df_out = pd.concat(
    [pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
     for i, x in df[["age", "awards"]].iterrows()],
    axis=1,
).groupby(level=0).sum().T
df_out.stack()
Output:
0 (0.0, 8.0] 0
(1.0, 9.0] 0
(2.0, 10.0] 0
(3.0, 11.0] 0
(4.0, 12.0] 0
...
4 (38.0, 46.0] 0
(39.0, 47.0] 0
(40.0, 48.0] 0
(41.0, 49.0] 0
(42.0, 50.0] 209
Length: 215, dtype: int64
An old-school way without pd.cut, using a for loop and a query per interval.
import pandas as pd

max_age = df["age"].max()
interval_length = 8
values = []
for lo in range(max_age - interval_length + 1):
    hi = lo + interval_length  # inclusive upper bound
    awards = df.query("@lo <= age <= @hi")["awards"].sum()
    values.append([lo, hi, awards])
df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])
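A sketch extending the same loop to keep the per-id grouping from the question:
values = []
for g_id, group in df.groupby("id"):
    for lo in range(df["age"].max() - interval_length + 1):
        hi = lo + interval_length
        in_window = group["age"].between(lo, hi)  # inclusive on both ends
        values.append([g_id, lo, hi, group.loc[in_window, "awards"].sum()])
df_out = pd.DataFrame(values, columns=["id", "min_age", "max_age", "awards"])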
Let me know if this is what you want :)
Let df be a DataFrame:
import pandas as pd
import random

def r(b, e):
    return [random.randint(b, e) for _ in range(300)]

df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})
For binning by age, I would advise creating a new column since it is clearer (and faster):
df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)
The number of awards per id per bin can be obtained using simply:
totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)
If I understand correctly, you would like the sum for each window of size 9 rows:
totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html

Weighted Mean as a Column in Pandas

I am trying to add a column with the weighted average of 4 value columns, using 4 columns of weights:
df = pd.DataFrame.from_dict(dict([('A', [2000, 1000, 2509, 2145]),
                                  ('A_Weight', [37, 47, 33, 16]),
                                  ('B', [2100, 1500, 2000, 1600]),
                                  ('B_weights', [17, 21, 6, 2]),
                                  ('C', [2500, 1400, 0, 2300]),
                                  ('C_weights', [5, 35, 0, 40]),
                                  ('D', [0, 1600, 2100, 2000]),
                                  ('D_weights', [0, 32, 10, 5])]))
I want the weighted average in a new column named "WA", but every time I try it I get NaN.
The formula I used: (A*A_Weight + B*B_weights + C*C_weights + D*D_weights) / (sum of all four weight columns).
The desired result is the new column with the following values:
df['WA'] = [2071.19, 1323.70, 2363.20, 2214.60]
Thank you
A straightforward and simple way to do it is as follows.
(Since your weight columns are not consistently named, e.g. some with a trailing 's' and some without, some with capital 'W' and some with lowercase 'w', it is not convenient to select them as a group, e.g. with .filter().)
df['WA'] = (
    (df['A'] * df['A_Weight'] + df['B'] * df['B_weights']
     + df['C'] * df['C_weights'] + df['D'] * df['D_weights'])
    / (df['A_Weight'] + df['B_weights'] + df['C_weights'] + df['D_weights'])
)
Result:
print(df)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
The not so straightforward way:
Group columns by prefix via str.split.
Get the column-wise product via groupby prod.
Get the row-wise sum of the products with sum on axis 1.
Use filter + sum on axis 1 to get the sum of the weight columns.
Divide the group product sums by the weight sums.
df['WA'] = (
    df.groupby(df.columns.str.split('_').str[0], axis=1).prod().sum(axis=1)
    / df.filter(regex='_[wW]eight(s)?$').sum(axis=1)
)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
Another option to an old question:
Split data into numerator and denominator:
numerator = df.filter(regex=r"[A-Z]$")
denominator = df.filter(like='_')
Convert the denominator's columns into a MultiIndex; this comes in handy when computing with the numerator:
denominator.columns = denominator.columns.str.split('_', expand = True)
Multiply the numerator by the denominator, and divide the row sum of the outcome by the row sum of the denominator:
outcome = numerator.mul(denominator, level=0, axis=1).sum(1)
outcome = outcome.div(denominator.sum(1))
df.assign(WA = outcome)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
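Yet another way (a sketch, assuming the column names from the question) is numpy's np.average, which accepts a 2-D weights array of the same shape as the values:
import numpy as np

value_cols = ['A', 'B', 'C', 'D']
weight_cols = ['A_Weight', 'B_weights', 'C_weights', 'D_weights']
df['WA'] = np.average(df[value_cols].to_numpy(),
                      weights=df[weight_cols].to_numpy(), axis=1)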

Summing by string names Pandas

I'm working with a data frame like this, but bigger and with more zones. I am trying to sum the row values by their zone names: the total of the R and C zones goes in the total column, while the total of the M zones goes in total1.
Input:
total, total1 are the desired output.
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 total total1
1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
3 C2 40 4 C4 60 6 0 6 0 10 0
3 C1 100 8 0 0 0 0 100 0 8 0
5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
You can use filter to create DataFrames of the Zone and Value columns:
z = df.filter(like='Zone')
v = df.filter(like='Value')
Then create boolean DataFrames with contains and apply if you want to check substrings:
m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))
#for check strings
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
Finally, filter v with where and sum per row:
df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)
print (df)
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 t t1
0 1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
2 3 C2 40 4 C4 60 6 0 6 0 10 0
3 3 C1 100 8 0 0 0 0 100 0 8 0
4 5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
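To match the question's column names, the two new columns can then simply be renamed:
df = df.rename(columns={'t': 'total', 't1': 'total1'})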
Solution1 (simpler code but slower and less flexible)
total = []
total1 = []
for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    if "R2" in temp:
        total.append(temp[temp.index("R2") + 1])
    else:
        total.append(0)
    if ("C1" in temp) & ("C4" in temp):
        total1.append(temp[temp.index("C1") + 1] + temp[temp.index("C4") + 1])
    else:
        total1.append(0)
df["Total"] = total
df["Total1"] = total1
Solution2 (faster than solution1 and easier to customize but possibly memory intensive)
# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]
# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2  # "OR": 1, "AND": 2
original = df.copy()
# bucket1 check
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone) + 1]] = 0
original['Total'] = df[vals].sum(axis=1)
df = original.copy()
# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone) + 1]] = 0
df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)
When I expanded the original dataframe to 100k rows, solution 1 took 11.4 s ± 82.1 ms per loop, while solution 2 took 3.53 s ± 29.8 ms per loop. The difference is because solution 2 does not loop over the rows.

Getting sum of metrics for combinations of multiple columns

I have a pandas data frame that looks like this.
set language group version metric_1 metric_2 metric_3
X English 1 A 100 20 5
X French 2 A 90 10 10
X English 1 B 80 30 15
X French 2 B 70 20 20
Y English 1 A 200 20 30
Y French 2 A 180 30 20
Y English 1 B 160 10 10
Y French 2 B 140 20 5
I want to summarize the metrics with all the combinations of the experiment attributes - set, language, group & version. So the summary data frame will look like this.
set  language  group  version  metric_1  metric_2  metric_3
X                                    800       140        80
Y                                   1000       140        80
     English                        1200       200        80
     French                          600        80        80
               1                    1050       120        60
               2                     750       160       100
                      A              850       140        80
                      B              950       140        80
X    English                         500       100        40
X    French                          300        40        40
Y    English                         700       100        40
Y    French                          300        40        40
X              1                     350        60        30
X              2                     450        80        50
Y              1                     700        60        30
Y              2                     300        80        50
X                     A              350        70        40
X                     B              450        70        40
Y                     A              500        70        40
Y                     B              500        70        40
     English   1                     ...
     English   2                     ...
     French    1                     ...
     French    2                     ...
     English          A              ...
     English          B              ...
     French           A              ...
     French           B              ...
               1      A              ...
               1      B              ...
               2      A              ...
               2      B              ...
X    English   1                     ...
X    English   2                     ...
X    French    1                     ...
X    French    2                     ...
Y    English   1                     ...
Y    English   2                     ...
Y    French    1                     ...
Y    French    2                     ...
X    English          A              ...
X    English          B              ...
X    French           A              ...
X    French           B              ...
Y    English          A              ...
Y    English          B              ...
Y    French           A              ...
Y    French           B              ...
X              1      A              ...
X              1      B              ...
X              2      A              ...
X              2      B              ...
Y              1      A              ...
Y              1      B              ...
Y              2      A              ...
Y              2      B              ...
     English   1      A              ...
     English   1      B              ...
     English   2      A              ...
     English   2      B              ...
     French    1      A              ...
     French    1      B              ...
     French    2      A              ...
     French    2      B              ...
I know I can do this by brute force, using the different combinations with groupby and concatenating all of those into a single data frame. This can potentially extend to many more attributes, so I'm trying to find a solution that is more scalable. I've been reading about the functions available through itertools, but I'm not sure how they would apply.
Appreciate any ideas / directions on this. Thanks!
Indeed, the combinations function from itertools will help you create all the combinations you need. Let's suppose your data is in a dataframe called df.
from itertools import combinations

# create two lists: one with all columns to sum, one with the others
list_metric = [col for col in df.columns if 'metric' in col]
list_non_metric = [col for col in df.columns if 'metric' not in col]

# create the dataframe grouped on all columns
df_grouped = df.groupby(list_non_metric, as_index=False)[list_metric].sum()

# use concat and a list comprehension to create all the combinations
df_output = (pd.concat([df_grouped.groupby(list(combi), as_index=False)[list_metric].sum()
                        for j in range(1, len(list_non_metric) + 1)
                        for combi in combinations(list_non_metric, j)])
               .fillna(''))

# reorder the columns as in the input data (if necessary)
df_output = df_output[df.columns]
If you want to understand how combinations works, try printing this line:
[combi for combi in combinations(list_non_metric, 2)]
The outer loop, for j in range(1, len(list_non_metric)+1), then creates the combinations of 1, 2, 3, ... elements of list_non_metric.
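For instance, with the four attribute columns from the question, the pairs are:
>>> from itertools import combinations
>>> list(combinations(['set', 'language', 'group', 'version'], 2))
[('set', 'language'), ('set', 'group'), ('set', 'version'),
 ('language', 'group'), ('language', 'version'), ('group', 'version')]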
Here's one way to do it. I'm assuming you've only given a subset of your data because the totals don't add up:
In []:
import itertools as it

cols = df.columns.tolist()
index = ['set', 'language', 'group', 'version']
df = df.set_index(index)
pd.concat([df.groupby(level=x).sum().reset_index()
           for n in range(1, len(index) + 1)
           for x in it.combinations(range(len(index)), n)],
          sort=True)[cols].fillna('')
Out[]:
   set  language  group  version  metric_1  metric_2  metric_3
0    X                                  340        80        50
1    Y                                  680        80        65
0        English                        540        80        60
1        French                         480        80        55
0                  1                    540        80        60
1                  2                    480        80        55
0                         A             570        80        65
1                         B             450        80        50
0    X   English                        180        50        20
1    X   French                         160        30        30
2    Y   English                        360        30        40
3    Y   French                         320        50        25
...

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df, axis=1)
stock_returns.subtract(p_df)
But none of them work; they all return NaN values.
I'm passing it through this function, and using the for loop to get args:
def disp_calc(returns, p, wi):  # apply(disp_calc, rows=...)
    wi = wi / np.sum(wi)
    rp = (col_len(returns) * (returns - p)**2).sum()  # returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean, axis=1)
    portfolio_return = returns_rolling[i]
    p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is to subtract all 517 cols in the first df by the one col in p_df, so the final result would still have 517 cols. Thanks
You're almost there, just need to set axis=0 to subtract along the indexes:
>>> stock_returns = pd.DataFrame([[10, 100, 200],
...                               [15, 115, 215],
...                               [20, 120, 220],
...                               [25, 125, 225],
...                               [30, 130, 230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
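Applied to the frames from the question (a sketch; selecting p_df's single column as a Series lets the subtraction broadcast across all 517 columns):
result = stock_returns.sub(p_df.iloc[:, 0], axis=0)  # still 144 rows x 517 columns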
For simple column-to-column subtraction, it is just:
data['new_col3'] = data['col1'] - data['col2']
