I am creating a DataFrame with the code below:
import pandas as pd
import statistics

df1 = pd.DataFrame({'segment': ['abc','abc','abc','abc','abc','xyz','xyz','xyz','xyz','xyz','xyz','xyz'],
                    'prod_a_clients': [5,0,12,25,0,2,5,24,0,1,21,7],
                    'prod_b_clients': [15,6,0,12,8,0,17,0,2,23,15,0]})
abc_seg= df1[(df1['segment']=='abc')]
xyz_seg= df1[(df1['segment']=='xyz')]
seg_prod= df1[(df1['segment']=='abc') & (df1['prod_a_clients']>0)]
abc_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod= df1[(df1['segment']=='abc') & (df1['prod_b_clients']>0)]
abc_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
seg_prod= df1[(df1['segment']=='xyz') & (df1['prod_a_clients']>0)]
xyz_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod= df1[(df1['segment']=='xyz') & (df1['prod_b_clients']>0)]
xyz_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
segs_combined= [abc_seg,xyz_seg]
df2= pd.concat(segs_combined, ignore_index=True)
print(df2)
As you can see from the result, I need to calculate a mean for every product and segment combination I have. I'm going to be doing this for hundreds of products and segments. I have tried many different ways of doing this with a loop or a function, and have gotten close with something like the following:
def prod_seg(sg, prd):
    seg_prod = df1[(df1['segment']==sg) & (df1[prd+'_clients']>0)]
    prod_name = prd+'_clients'
    col_name = prd+'_average'
    df_name = sg+'_seg'
    df_name+"['"+prd+'_average'+"']" = statistics.mean(seg_prod[prod_name])
    return
The issue is that I need to create a unique column for every iteration and the way I am doing it above is obviously not working.
Is there any way I can recreate what I did above in a loop or function?
You could use groupby to calculate the mean per group. Also, replace the 0s with NaN so that they get skipped by the mean calculation. The script then looks like:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz',
                                'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})

df1.set_index("segment", inplace=True, drop=True)
df1[df1 == 0] = np.nan

mean_values = dict()
for seg_key, seg_df in df1.groupby(level=0):
    mean_value = seg_df.mean(numeric_only=True)
    mean_values[seg_key] = mean_value

results = pd.DataFrame.from_dict(mean_values)
print(results)
The result is:
abc xyz
prod_a_clients 14.00 10.00
prod_b_clients 10.25 14.25
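If you also want those means broadcast back onto the original rows (similar to your df2), a minimal sketch building on the results frame above (note that at this point df1 is indexed by segment and its 0s have already been replaced with NaN):
# broadcast the per-segment means back onto the rows of df1, aligned on the segment index
df2 = df1.join(results.T.add_suffix('_mean'))
print(df2)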
Instead of using a loop, you can derive the same result by first using where on the 0s in the clients columns (which replaces the 0s with NaN), then grouping by the "segment" column and transforming with "mean".
The point of where is that the mean method skips NaN values by default, so by converting 0s to NaN we make sure the 0s are not considered for the mean.
transform('mean') broadcasts the mean (which is an aggregate value) back to align with the original DataFrame, so every row gets its matching mean value.
clients = ['prod_a_clients', 'prod_b_clients']
out = (df1.join(df1[['segment']]
.join(df1[clients].where(df1[clients]>0))
.groupby('segment').transform('mean')
.add_suffix('_mean')))
Output:
segment prod_a_clients prod_b_clients prod_a_clients_mean prod_b_clients_mean
0 abc 5 15 14.0 10.25
1 abc 0 6 14.0 10.25
2 abc 12 0 14.0 10.25
3 abc 25 12 14.0 10.25
4 abc 0 8 14.0 10.25
5 xyz 2 0 10.0 14.25
6 xyz 5 17 10.0 14.25
7 xyz 24 0 10.0 14.25
8 xyz 0 2 10.0 14.25
9 xyz 1 23 10.0 14.25
10 xyz 21 15 10.0 14.25
11 xyz 7 0 10.0 14.25
My question is very similar to here, except I would like to round to the closest value instead of always rounding up, so cut() doesn't seem to work.
import pandas as pd
import numpy as np
df = pd.Series([11,16,21, 125])
rounding_logic = pd.Series([15, 20, 100])
labels = rounding_logic.tolist()
rounding_logic = pd.Series([-np.inf]).append(rounding_logic) # add infinity as leftmost edge
pd.cut(df, rounding_logic, labels=labels).fillna(rounding_logic.iloc[-1])
The result is [15,20,100,100], but I'd like [15,15,20,100], since 16 is closest to 15 and 21 closest to 20.
You can try pandas.merge_asof with direction=nearest
out = pd.merge_asof(df.rename('1'), rounding_logic.rename('2'),
left_on='1',
right_on='2',
direction='nearest')
print(out)
1 2
0 11 15
1 16 15
2 21 20
3 125 100
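One caveat worth noting: merge_asof expects both inputs to be sorted on their join keys. The df above happens to be sorted already; a sketch of handling an unsorted df (same df and rounding_logic as above) could look like:
# sort on the key for merge_asof, then restore the original row order afterwards
left = df.rename('1').sort_values().reset_index()   # keeps the old positions in an 'index' column
out = (pd.merge_asof(left, rounding_logic.rename('2').sort_values(),
                     left_on='1', right_on='2', direction='nearest')
         .sort_values('index')
         .set_index('index'))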
Get the absolute difference from each value, and take the entry of rounding_logic with the smallest difference:
>>> rounding_logic.reset_index(drop=True, inplace=True)
>>> df.apply(lambda x: rounding_logic[rounding_logic.sub(x).abs().idxmin()])
0 15.0
1 15.0
2 20.0
3 100.0
dtype: float64
PS: You need to reset the index on rounding_logic because it has duplicate index values after adding -inf to the start of the series.
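For longer Series, a vectorized alternative is numpy's searchsorted, picking whichever neighbouring edge is closer. This is only a sketch and assumes rounding_logic is the original pd.Series([15, 20, 100]) without the -inf edge:
import numpy as np
edges = np.sort(rounding_logic.to_numpy())                 # [15, 20, 100]
vals = df.to_numpy()
pos = np.clip(np.searchsorted(edges, vals), 1, len(edges) - 1)
lo, hi = edges[pos - 1], edges[pos]                        # nearest edge below / above each value
nearest = pd.Series(np.where(vals - lo <= hi - vals, lo, hi), index=df.index)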
I have a pandas dataframe. For each row, I need to select the value from just one of several columns into a single new column, according to the ID (bar in this example) of that row.
I need the fastest way to do this.
Dataframe for application is like this:
foo bar ID_A ID_B ID_C ID_D ID_E ...
1 B 1.5 2.3 4.1 0.5 6.6 ...
2 E 3 4 5 6 7 ...
3 A 9 6 3 8 1 ...
4 C 13 5 88 9 0 ...
5 B 6 4 6 9 4 ...
...
One way to do it (my fastest at present) is shown below; however, it is too slow for my purposes.
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo baz
1 2.3
2 7
3 9
4 88
5 4
...
I have tried with .apply() and it was very slow.
I tried with np.where() which was still much slower than the example shown above (which was 1000% faster than np.where()).
Would appreciate recommendations!
Many thanks
EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky.
My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to the single column per the example.
The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"
You can use a variant of the indexing lookup:
import numpy as np
import pandas as pd

# factorize maps each row's bar value to the position of its 'ID_*' column,
# then a single integer-array lookup picks that column's value per row
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
foo baz
0 1 2.3
1 2 7.0
2 3 9.0
3 4 88.0
4 5 4.0
testing speed
Setting up a test dataset (280k rows, 52 ID columns):
from string import ascii_uppercase, ascii_lowercase

letters = list(ascii_lowercase + ascii_uppercase)
N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N+1),
                    'bar': np.random.choice(letters, size=N)})
        .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                           columns=[f'ID_{l}' for l in letters]))
     )
speed testing:
%%timeit
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can try this. It should generalize to an arbitrary number of columns.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
[2, 'E', 3, 4, 5, 6, 7],
[3, 'A', 9, 6, 3, 8, 1],
[4, 'C', 13, 5, 88, 9, 0],
[5, 'B', 6, 4, 6, 9, 4]])
df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']
for val in np.unique(df['bar'].values):
    df.loc[df.bar == val, 'baz'] = df[f'ID_{val}']
To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:
out = (
pd.wide_to_long(
df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
)
.loc[lambda d:
d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
'ID'
]
.droplevel(-1)
.rename('baz')
.reset_index()
)
print(out)
foo bar baz
0 1 B 2.3
1 2 E 7.0
2 3 A 9.0
3 4 C 88.0
4 5 B 4.0
An alternative to the above leverages .melt and .query to shorten the code.
out = (
df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
.assign(id=lambda d: d['id'].str.get(-1))
.query('bar == id')
)
print(out)
foo bar id baz
2 3 A A 9.0
5 1 B B 2.3
9 5 B B 4.0
13 4 C C 88.0
21 2 E E 7.0
This question is based on Python - Pandas - Combining rows of multiple columns into single row in dataframe based on categorical value which I had asked earlier.
I have a table in the following format:
Var1 Var2 Var3 Var4 ID
0 0.70089 0.93120 1.867650 0.658020 1
1 0.15893 -0.74950 1.089150 -0.045123 1
2 0.13690 0.59210 -0.032990 0.672860 1
3 -0.50136 0.89913 0.440200 0.812150 1
4 1.08940 0.43036 0.669470 1.286000 1
5 0.09310 0.14979 -0.392335 0.040500 1
6  0.63339  1.27161  0.852072  0.474800   2
7 -0.54944 -0.04547  0.867050 -0.234800   2
8  1.28600  1.87650  0.976670  0.440200   2
I have created the above table using the following code:
import pandas as pd

df1 = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
       'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
       'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
       'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
       'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(data=df1)
I want to bring it into a particular format by grouping it based on the column 'ID'.
The desired output is similar in structure to the table below:
ID V1_0_0 V2_0_1 V3_0_2 V4_0_3 V1_1_0 V2_1_1 V3_1_2 V4_1_3
1 A B C D E F G H
2 I J K L 0 0 0 0
I achieved it with the help of user Allen in the last question that is referenced above. The code is printed below:
num_V = 4
max_row = df.groupby('ID').ID.count().max()

df = (df.groupby('ID').apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
        .apply(pd.Series)
        .fillna(0))
df.columns = ['V{}_{}_{}'.format(i+1, j, i) for j in range(max_row)
              for i in range(num_V)]
print(df)
The result of which produces the below output table:
V1_0_0 V2_0_1 V3_0_2 ***V4_0_3** V1_1_0 V2_1_1 V3_1_2 \
ID
1 0.93120 1.867650 0.65802 1 -0.74950 1.08915 -0.045123
2 1.27161 0.852072 0.47480 2 -0.04547 0.86705 -0.234800
**V4_1_3*** V1_2_0 V2_2_1 ...V3_3_2 **V4_3_3** V1_4_0 V2_4_1 \
ID ...
1 1 0.5921 -0.03299 ... 0.81215 1 0.43036 0.66947
2 2 1.8765 0.97667 ... 0.00000 0 0.00000 0.00000
V3_4_2 **V4_4_3** V1_5_0 V2_5_1 V3_5_2 **V4_5_3**
ID
1 1.286 1 0.14979 -0.392335 0.0405 1
2 0.000 0 0.00000 0.000000 0.0000 0
This is partially correct, but the problem is that there are certain columns that give the value of 1 and 2 after every 3 columns (the ones between ** **).
It then prints 1 and 0 after there are no values pertaining to the 'ID' value 2.
After examining it I realize that it is not printing the "Var1" values, and the values are off by one column. (That is, V1_0_0 should be 0.70089, and V4_0_3 should really hold the value currently shown under V3_0_2, which equals 0.65802.)
Is there any way to rectify this so that I get something exactly like my desired output table? How do I make sure the ** ** marked columns delete the values they have and return the proper values?
I am using Python 3.4, running it in a Linux terminal.
Thanks.
Not sure what's wrong with the code you have provided, but try this out and let me know if it gives you what you want:
import pandas as pd

df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
      'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
      'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
      'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
      'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(df)

newdataframe = pd.DataFrame(columns=df.columns)
newID = []
for agroup in df.ID.unique():
    temp_df = pd.DataFrame(columns=df.columns)
    adf = df[df.ID == agroup]
    for aline in adf.itertuples():
        a = ((pd.DataFrame(list(aline))).T).drop(columns=[0])
        a.columns = df.columns
        if a.ID.values[0] not in newID:
            suffix_count = 1
            temp_df = pd.concat([temp_df, a])
            newID.append(a.ID.values[0])
        else:
            temp_df = temp_df.merge(a, how='outer', on='ID',
                                    suffixes=('', '_' + str(suffix_count)))
            suffix_count += 1
    newdataframe = pd.concat([newdataframe, temp_df])
print(newdataframe)
Output :
ID Var1 Var1_1 Var1_2 Var1_3 Var1_4 Var1_5 Var2 Var2_1 \
0 1.0 0.70089 0.15893 0.1369 -0.50136 1.0894 0.0931 0.93120 -0.74950
0 2.0 0.63339 -0.54944 1.2860 NaN NaN NaN 1.27161 -0.04547
Var2_2 ... Var3_2 Var3_3 Var3_4 Var3_5 Var4 Var4_1 \
0 0.5921 ... -0.03299 0.4402 0.66947 -0.392335 0.65802 -0.045123
0 1.8765 ... 0.97667 NaN NaN NaN 0.47480 -0.234800
Var4_2 Var4_3 Var4_4 Var4_5
0 0.67286 0.81215 1.286 0.0405
0 0.44020 NaN NaN NaN
Another piece of code for achieving the output you are looking for:
import pandas as pd
import numpy as np
import re

df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
      'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
      'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
      'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
      'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(df)

df['duplicateID'] = df['ID'].duplicated()

newdf = df[df['duplicateID'] == False]
newdf = newdf.reset_index()
newdf = newdf.iloc[:, 1:]

df = df[df['duplicateID'] == True]
df = df.reset_index()
df = df.iloc[:, 1:]

del newdf['duplicateID']
del df['duplicateID']

merge_count = 0
newID = []
for aline in df.itertuples():
    a = ((pd.DataFrame(list(aline))).T).drop(columns=[0])
    a.columns = df.columns
    newdf = newdf.merge(a, how='left', on='ID',
                        suffixes=('_' + str(merge_count), '_' + str(merge_count + 1)))
    merge_count += 1

newdf.index = newdf['ID']
del newdf['ID']
newdf.columns = [col + '_' + str(int(re.findall(r'\d+', col)[0]) - 1) for col in newdf.columns]
print(newdf)
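For reference, a more idiomatic way to get the same wide layout (just a sketch, starting again from the original df built from the dict above, and assuming a reasonably recent pandas) is to number the rows within each ID with groupby().cumcount() and then pivot:
num_vars = 4
df['obs'] = df.groupby('ID').cumcount()                  # 0-based row number within each ID
wide = (df.pivot(index='ID', columns='obs',
                 values=[f'Var{i}' for i in range(1, num_vars + 1)])
          .fillna(0)
          .sort_index(axis=1, level=1))                  # cycle Var1..Var4 per observation
wide.columns = [f'V{int(var[-1])}_{obs}_{int(var[-1]) - 1}' for var, obs in wide.columns]
print(wide)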
I'm trying to vectorize a for loop in pandas to improve performance. I have a dataset comprising users, products, the date of each service, as well as the number of days supplied. Given the following subset of data:
from datetime import datetime

import pandas as pd

testdf = pd.DataFrame(data={"USERID": ["A"] * 6,
                            "PRODUCTID": [1] * 6,
                            "SERVICEDATE": [datetime(2016, 1, 1), datetime(2016, 2, 5),
                                            datetime(2016, 2, 28), datetime(2016, 3, 25),
                                            datetime(2016, 4, 30), datetime(2016, 5, 30)],
                            "DAYSSUPPLY": [30] * 6})
testdf = testdf.set_index(["USERID", "PRODUCTID"])

testdf["datediff"] = testdf["SERVICEDATE"].diff()
testdf.loc[testdf["datediff"].notnull(), "datediff"] = testdf.loc[
    testdf["datediff"].notnull(), "datediff"].apply(lambda x: x.days)
testdf["datediff"] = testdf["datediff"].fillna(0)
testdf["datediff"] = pd.to_numeric(testdf["datediff"])
testdf["over_under"] = testdf["DAYSSUPPLY"].shift() - testdf["datediff"]
I would like to get the following result:
DAYSSUPPLY SERVICEDATE datediff over_under desired
USERID PRODUCTID
A 1 30 2016-01-01 0 NaN 0
1 30 2016-02-05 35 -5.0 0
1 30 2016-02-28 23 7.0 7
1 30 2016-03-25 26 4.0 11
1 30 2016-04-30 36 -6.0 5
1 30 2016-05-30 30 0.0 5
Essentially, I want my desired column to be the running sum of over_under, but to only sum the negative values if the value of desired on the previous line is > 0. desired should never get below 0. A quick and dirty loop over a [user, product] group looks something like this:
running_total = 0
desired_loop = []
for row in testdf.itertuples():
    over_under = row[4]
    # skip first row
    if pd.isnull(over_under):
        desired_loop.append(0)
        continue
    running_total += over_under
    running_total = max(running_total, 0)
    desired_loop.append(running_total)
testdf["desired_loop"] = desired_loop
desired_loop
USERID PRODUCTID
A 1 0.0
1 0.0
1 7.0
1 11.0
1 5.0
1 5.0
I'm still new to vectorization and pandas in general. I've been able to vectorize every other calculation in this df, but I can't figure out how to go about this special case of a cumulative sum.
Thanks!
I had a similar problem and solved it using a somewhat unconventional iteration.
testdf["desired"] = testdf["over_under"].cumsum()
current = np.argmax( testdf["desired"] < 0 )
while current != 0:
testdf.loc[current:,"desired"] += testdf["desired"][current] # adjust the cumsum going forward
# the previous statement also implicitly sets
# testdf.loc[current, "desired"] = 0
current = np.argmax( testdf["desired"][current:] < 0 )
In essence, you are finding all the "events" and readjusting the running cumsum over time. All of the manipulation and test operations are vectorized, so if your desired column doesn't cross negative too often, you should be pretty fast.
It's definitely a hack but it got the job done for me.
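If you need this per (USERID, PRODUCTID) group rather than over the whole frame, the running total can also be wrapped in a small helper and applied group-wise. This is only a sketch (the clamped_cumsum helper and the desired_grouped column name are made up here, and it is a plain loop per group, not vectorized):
def clamped_cumsum(over_under):
    # running sum of over_under, floored at 0 after every step
    total, out = 0, []
    for v in over_under.fillna(0):
        total = max(total + v, 0)
        out.append(total)
    return pd.Series(out, index=over_under.index)

testdf["desired_grouped"] = (testdf.groupby(level=["USERID", "PRODUCTID"])["over_under"]
                                   .transform(clamped_cumsum))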
I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
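To illustrate, a quick sketch of both options on the question's example frame:
import pandas as pd

df = pd.DataFrame({'foo': list('abcde'),
                   'bar': [1, 3, 2, 9, 3],
                   'qux': [3.14, 2.72, 1.62, 1.41, 0.58]})

# concat-based replacement for the deprecated df.append(...)
total = df.sum(numeric_only=True).to_frame().T.assign(foo='total')
out = pd.concat([df, total], ignore_index=True)

# or modify in place (numeric columns only; 'foo' is left as NaN on the new row)
df.loc['Total'] = df.sum(numeric_only=True)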
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution so I'd recommend sticking to operations on the dataframe, though. eg.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
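Since DataFrame.append was removed in pandas 2.0, a concat-based equivalent of that line (still leaving df untouched) would be something like:
out = pd.concat([df, df.sum().rename('Total').to_frame().T])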
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in jupyter as this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get:
see other ways to style (such as bold font, or table lines) in the docs
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
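On pandas 2.0 or newer, where DataFrame.append has been removed, the last step can be written with concat instead (a drop-in equivalent):
dft1 = pd.concat([dft1, dft1_sum])   # same result as dft1.append(dft1_sum)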
Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df yields

   0
0  1
1  5
2  6
3  8
4  9

and df.describe() yields

            0
count  5.00000
mean   5.80000
std    3.11448
min    1.00000
25%    5.00000
50%    6.00000
75%    8.00000
max    9.00000
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
              0
count   6.00000
mean    9.66667
std     9.87252
min     1.00000
25%     5.25000
50%     7.00000
75%     8.75000
max    29.00000
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame.
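A minimal sketch of the "work on a copy" route, keeping the analysis frame intact:
display_df = df.copy()
display_df.loc['Totals'] = display_df.sum(numeric_only=True)
print(display_df)        # shows the Totals row
print(df.describe())     # statistics are still computed on the original 5 rows only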
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end (right before printing), so as to avoid breaking the integrity of the dataframe, I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)