I have a dataframe like the one shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
df = pd.DataFrame({'Id':[1,2,3,4,5],
'grade': rng.choice(list('ACD'),size=(5)),
'dash': rng.choice(list('PQRS'),size=(5)),
'dumeel': rng.choice(list('QWER'),size=(5)),
'dumma': rng.choice((1234),size=(5)),
'target': rng.choice([0,1],size=(5))
})
My objective is to compute the drill down info for each column
Let me explain by an example.
If we filter the dataframe by df[df['grade']=='A'], we get 2 records as the result. Let's call the filtered column, grade, the parent_variable. Out of those 2 records, how much do the values of the dumeel column (child_variable) and the dash column (child_variable) account for the target column values (which are 0 and 1)? All categorical/object columns other than the parent variable are called child variables.
We have to repeat the above example procedure for all the categorical/object variables in our dataset.
As a first step, I made use of the below from a SO post
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable','value': 'parent_value'}, inplace=True)
But the above gets me only the % and count of target for each parent variable. I would like to get the breakdown by child variables as well (for each parent variable).
%_contrib is obtained by computing the % of that record to the target value. ex: for dash=P, we have one grade values A (for target = 1). So, it has to be 100%. Hope this helps.
I expect my output to be like the one shown below. I have shown a sample for only a couple of columns under parent_variable, but in my real data there will be more than 20 categorical variables, so any efficient approach is welcome.
As you are using a random function to generate the DataFrame it is hard for me to reproduce your example, but I think you are looking for value_counts -
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
    print(df[(df['dash']=='P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    print(f'Parent column is {parent_col}\n')
    parent_vals = df[parent_col].unique()
    child_cols = loop_columns - {parent_col}
    for parent_val in parent_vals:
        for child_col in child_cols:
            print(df[(df[parent_col]==parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))
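The nested loops above print one result per parent value, which is hard to collect into a single table. A minimal sketch of a tabular alternative follows, assuming that %_contrib means "share of the parent value's target == 1 rows that fall in each child value" (my reading of the question, not something the asker confirmed). The helper name drill_down, the explicit cat_cols argument, and the three-row sample frame are all hypothetical.

```python
import pandas as pd

def drill_down(df, cat_cols, target="target"):
    """Hypothetical helper: one row per (parent value, child value) pair."""
    frames = []
    for parent in cat_cols:
        for child in (c for c in cat_cols if c != parent):
            # Count rows and sum the 0/1 target for every parent/child pair
            g = (df.groupby([parent, child])[target]
                   .agg(cnt="count", target_met="sum")
                   .reset_index())
            g.columns = ["parent_value", "child_value", "cnt", "target_met"]
            g.insert(0, "parent_variable", parent)
            g.insert(2, "child_variable", child)
            frames.append(g)
    out = pd.concat(frames, ignore_index=True)
    # Assumed definition: share of the parent value's target == 1 rows
    keys = ["parent_variable", "parent_value", "child_variable"]
    parent_met = out.groupby(keys)["target_met"].transform("sum")
    out["pct_contrib"] = (100 * out["target_met"] / parent_met).round(2)
    return out

# Tiny made-up sample, not the asker's data
sample = pd.DataFrame({"grade": ["A", "A", "C"],
                       "dash": ["P", "Q", "P"],
                       "target": [1, 1, 0]})
out = drill_down(sample, ["grade", "dash"])
print(out)
```

With 20+ categorical columns this still loops over column pairs, but each pair is a single groupby rather than one filter per parent value. Parent values with no target == 1 rows get NaN in pct_contrib.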
How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id  scores  cosine
1   100     0.8
2   75      0.7
3   50      0.4
4   25      0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
{
"score": score,
"cosine": cosine,
}
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
# Assumes correlation_df.reset_index() was run first, so an "index" column exists
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]
correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
The whole correlation matrix isn't made every time this way.
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
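For comparison, here is a sketch of the same leave-one-out correlation written as an explicit NumPy loop, which avoids building a dropped copy of the DataFrame for every row. The function name leave_one_out_corr is hypothetical; np.corrcoef is the only library call it relies on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [100, 75, 50, 25],
                   "cosine": [0.8, 0.7, 0.4, 0.05]})

def leave_one_out_corr(x, y):
    """Pearson correlation of x and y with row i left out, for every i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask that skips row i
        out[i] = np.corrcoef(x[keep], y[keep])[0, 1]
    return out

df["diff_correlations"] = leave_one_out_corr(df["score"], df["cosine"])
print(df)
```

This is still O(n^2) work overall; it only trims the per-row pandas overhead.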
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column named index, holding what was previously your row index. Now, when using DataFrame.apply with axis=1, access it via:
correlation_df.apply(lambda r: r["index"], axis=1)
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
I have some chemical data that I'm trying to process using Pandas. I have two dataframes:
C_atoms_all.head()
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
1 218 2 C C [4.8792, 11.9845, 14.6312]
2 219 3 C C [4.8373, 10.7563, 13.9466]
3 220 4 C C [4.7366, 10.9327, 12.5408]
4 6573 5 C C [1.9482, -3.8747, 19.6319]
C_atoms_a.head()
id_a index_a label_a species_a position
0 55 1 C C [6.609, 6.6024, 19.3302]
1 56 2 C C [4.8792, 11.9844, 14.6313]
2 57 3 C C [4.8372, 10.7565, 13.9467]
3 58 4 C C [4.7367, 10.9326, 12.5409]
4 59 5 C C [5.1528, 15.5976, 14.1249]
What I want to do is get a mapping of all of the id_all values to the id_a values where their position matches. You can see that for C_atoms_all.iloc[0]['id_all'] (which returns 217) and the same query for C_atoms_a (which returns 55), the position values match (within a small fudge factor, which I should also include in the query).
The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.
I'd ideally like to return a dataframe that looks like so:
id_all id_a position
217 55 [6.609, 6.6024, 19.3301]
... ... ...
for every row where the position values match.
You can do it like below:
I named your C_atoms_all as df_all and C_atoms_a as df_a:
# First we try to extract different values in "position" columns for both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]
# Then because the position values match (within a small fudge factor)
# we round them with three decimal
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]]= df_a[["val0", "val1", "val2"]].round(3)
# We use loc to modify the original dataframe, instead of a copy of it.
# Then we use merge on three extracted values from position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
suffixes=(None, "_y"))
# Finally we just keep the desired columns
df = df[["id_all", "id_a", "position"]]
print(df)
id_all id_a position
0 217 55 [6.609, 6.6024, 19.3301]
1 218 56 [4.8792, 11.9845, 14.6312]
2 219 57 [4.8373, 10.7563, 13.9466]
3 220 58 [4.7366, 10.9327, 12.5408]
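Rounding can miss pairs whose coordinates sit on opposite sides of a rounding boundary, so a direct tolerance comparison may be more robust. The following is a sketch using NumPy broadcasting, with an assumed tolerance tol = 1e-3 (the question does not state the fudge factor) and a shortened three-row version of the sample data. It builds an n_all x n_a comparison, so it suits moderately sized tables only.

```python
import numpy as np
import pandas as pd

C_atoms_all = pd.DataFrame({
    "id_all": [217, 218, 6573],
    "position": [[6.609, 6.6024, 19.3301],
                 [4.8792, 11.9845, 14.6312],
                 [1.9482, -3.8747, 19.6319]]})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 56, 59],
    "position": [[6.609, 6.6024, 19.3302],
                 [4.8792, 11.9844, 14.6313],
                 [5.1528, 15.5976, 14.1249]]})

tol = 1e-3  # assumed fudge factor, adjust to your data

# Turn the list-valued columns into (n, 3) float arrays
pos_all = np.array(C_atoms_all["position"].tolist())
pos_a = np.array(C_atoms_a["position"].tolist())

# close[i, j] is True when all three coordinates differ by at most tol
close = (np.abs(pos_all[:, None, :] - pos_a[None, :, :]) <= tol).all(axis=2)
i_all, i_a = np.nonzero(close)

matches = pd.DataFrame({
    "id_all": C_atoms_all["id_all"].to_numpy()[i_all],
    "id_a": C_atoms_a["id_a"].to_numpy()[i_a],
    "position": C_atoms_all["position"].iloc[i_all].tolist(),
})
print(matches)
```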
This isn't pretty, but it might work for you
def do(x, df_a):
    try:
        # Return the id_a of the first row whose position equals x exactly
        return next((df_a.iloc[i]['id_a'] for i in df_a.index if df_a.iloc[i]['position'] == x))
    except StopIteration:
        return np.nan
match = pd.DataFrame(C_atoms_all[['id_all', 'position']])
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))
You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.
# Custom hash function
def hash_position(position):
return hash(tuple(position))
# Create the hash column "hashed_position"
C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)
# merge datasets
C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')
# ... keep the columns you need
Your question is not clear, but it seems an interesting one to me. For that reason I have reproduced your data in a more useful format, in case someone can help more than I can.
Data
C_atoms_all = pd.DataFrame({
'id_all': [217,218,219,220,6573],
'index_all': [1,2,3,4,5],
'label_all': ['C','C','C','C','C'],
'species_all': ['C','C','C','C','C'],
'position':[[6.609, 6.6024, 19.3301],[4.8792, 11.9845, 14.6312],[4.8373, 10.7563, 13.9466],[4.7366, 10.9327, 12.5408],[1.9482,-3.8747, 19.6319]]})
C_atoms_a = pd.DataFrame({
'id_a': [55,56,57,58,59],
'index_a': [1,2,3,4,5],
'label_a': ['C','C','C','C','C'],
'species_a': ['C','C','C','C','C'],
'position':[[6.609, 6.6024, 19.3302],[4.8792, 11.9844, 14.6313],[4.8372, 10.7565, 13.9467],[4.7367, 10.9326, 12.5409],[5.1528, 15.5976, 14.1249]]})
C_atoms_ab
Solution
#new dataframe bringing together columns position
df3=C_atoms_all.set_index('index_all').join(C_atoms_a.set_index('index_a').loc[:,'position'].to_frame(),rsuffix='_r').reset_index()
#Create temp column that gives you the comparison tolerances
df3['temp']=df3.filter(regex='^position').apply(lambda x: np.round(np.array(x[0])-np.array(x[1]), 4), axis=1)
#Assume tolerance is where only one of the values is over 0.0
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x:x.eq(0).sum()).gt(1)]
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
def fuzzy_clean(i, dfr, merge_list, key):
    for col in range(0,len(merge_list)):
        if col == 0:
            scaled_down = dfr[dfr[merge_list[col]]==i[merge_list[col]]]
        else:
            scaled_down = scaled_down[scaled_down[merge_list[col]]==i[merge_list[col]]]
    if len(scaled_down)>0:
        if i[key] in scaled_down[key].values.tolist():
            return i[key]
        else:
            return pd.to_datetime(scaled_down[key][min(abs([scaled_down[key]-i[key]])).index].values[0])
    else:
        return i[key]
df[key]=df.apply(lambda i: fuzzy_clean(i,dfr,merge_list,key), axis=1)
I'm trying to eventually merge together two dataframes, dfr and df. The issue I have is that I need to merge on about 9 columns, one of which being a timestamp that doesn't quite match up between the two dataframes where sometimes it is slightly lagging, sometimes leading. I wrote a function that works when using the following; however, in practice it is just too slow running through hundreds of thousands of rows.
merge_list is a list of columns that each dataframe share that match up 100%
key is a string of a column, 'timestamp', that each share, which is what doesn't match up too well
Any suggestions in speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
    df_right[key]=df_right.apply(lambda i: fuzzy_clean(i,df_left,on,key), axis=1)
    return pd.merge(df_left, df_right, on=on+[key], how=how, indicator=True).sort_values(key)
I've found a solution that I believe works: pandas has merge_asof, shown below. I'm still verifying possible double counting, but it seems to do a decent job.
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')
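A runnable sketch of that merge_asof call on the sample data from the question follows. Two details are assumptions on my part: the timestamps are stored as floats in both frames (merge_asof needs matching key dtypes and both inputs sorted by the key), and tolerance=0.5 is an arbitrary cap on how far apart two timestamps may be. The right-hand key is renamed to timestamp_r (a hypothetical name) so that the matched value stays visible.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": [100.0, 101.0, 102.0, 103.0],
                   "A": list("xyrq"), "B": list("yiaw"), "C": list("zuee")})
dfr = pd.DataFrame({"timestamp": [100.01, 100.99, 101.05, 102.0, 103.01, 103.20],
                    "A": list("xyyrqq"), "B": list("yiiaww"), "C": list("zuueee")})
merge_list = ["A", "B", "C"]  # columns that match exactly

# Both sides must be sorted by their time key
left = df.sort_values("timestamp")
right = dfr.rename(columns={"timestamp": "timestamp_r"}).sort_values("timestamp_r")

merged = pd.merge_asof(left, right,
                       left_on="timestamp", right_on="timestamp_r",
                       by=merge_list,          # exact-match columns
                       direction="nearest",    # closest timestamp, earlier or later
                       tolerance=0.5)          # assumed maximum gap
print(merged)
```

Each left row receives at most one right row, so the unmatched right rows (101.05 and 103.20 here) simply drop out. A right row can still be reused by several left rows, which is worth checking if double counting is a concern.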
I'm relatively new to Pandas dataframes and I have to do a simple calculation, but so far I haven't found a good way to go about it.
Basically what I have is:
type group amount
1 A real 55
2 A fake 12
3 B real 610
4 B fake 23
5 B real 45
Now, I have to add a new column that would show the percentage of fakes in type total. So the simple formula for this table would be for A 12 / (55 + 12) * 100 and for B 23 / (610 + 23 + 45) * 100 and the table should look something like this:
type group amount percentage
1 A real 55
2 A fake 12 17.91
3 B real 610
4 B fake 23
5 B real 45 3.39
I know about groupby statements and basically all the components I need for this (I guess...), but can't figure out how to combine to get this result.
df['percentage'] = (df.amount
                    / df.groupby(['type'])
                        .amount.transform('sum').loc[df.group.eq('fake')]).fillna('')
df
If you need to handle multiple fake rows per type, we can be a bit more careful. I'll set the index to preserve the type and group columns while I transform.
c = ['type', 'group']
d1 = df.set_index(c, append=True)
d1.amount /= d1.groupby(level=['type']).amount.transform('sum')
d1.reset_index(c)
From here, you can choose to leave that alone or consolidate the group column.
d1.groupby(level=c).sum().reset_index()
Try this out:
percentage = {}
for type in df.type.unique():
    numerator = df[(df.type == type) & (df.group == 'fake')].amount.sum()
    denominator = df[(df.type == type)].amount.sum()
    percentage[type] = numerator / denominator * 100
df['percentage'] = list(df.type.map(percentage))
If you wanted to make sure you accounted for multiple fake groups per type you can do the following
type_group_total = df.groupby(['type', 'group']).transform('sum')
type_total = df.groupby('type')[['amount']].transform('sum')
df['percentage'] = type_group_total / type_total
Output
type group amount percentage
0 A real 55 0.820896
1 A fake 12 0.179104
2 B real 610 0.899705
3 B fake 23 0.100295
4 B fake 45 0.100295
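The formula from the question, 12 / (55 + 12) * 100 for type A, can also be sketched with two transform calls and no Python loop. This version sums every fake row per type (so it tolerates several fake rows) and writes the result on every row of the type, which is an assumption about the desired layout; blank out the rows you don't want afterwards.

```python
import pandas as pd

df = pd.DataFrame({"type": list("AABBB"),
                   "group": ["real", "fake", "real", "fake", "real"],
                   "amount": [55, 12, 610, 23, 45]})

# Amount counted only where group == 'fake', then summed per type
fake_total = (df["amount"].where(df["group"].eq("fake"), 0)
                .groupby(df["type"]).transform("sum"))
# Total amount per type
type_total = df.groupby("type")["amount"].transform("sum")

df["percentage"] = (100 * fake_total / type_total).round(2)
print(df)
```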
I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command but I end up with a Series, which although I can convert back to a Dataframe, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution so I'd recommend sticking to operations on the dataframe, though. eg.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
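For pandas versions without DataFrame.append, the pd.concat route can be sketched as below. Casting the one-row frame back to the original column dtypes is my own addition, aimed at the dtype concern in the question; the names total_row and with_total are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"foo": list("abcde"),
                   "bar": [1, 3, 2, 9, 3],
                   "qux": [3.14, 2.72, 1.62, 1.41, 0.58]})

# Sum only the numeric columns; the result is a float Series
total = df.sum(numeric_only=True)

# One-row frame, cast back to the dtypes of the original columns
total_row = total.to_frame().T.astype(df.dtypes[total.index].to_dict())
total_row["foo"] = "total"

# Original df is left untouched; with_total is a new frame
with_total = pd.concat([df, total_row], ignore_index=True)[df.columns]
print(with_total)
print(with_total.dtypes)
```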
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in jupyter as this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get:
see other ways to style (such as bold font, or table lines) in the docs
The following helped me add a column total and row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields
   0
0  1
1  5
2  6
3  8
4  9

       0
count  5
mean   5.8
std    3.11448
min    1
25%    5
50%    6
75%    8
max    9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
        0
0       1
1       5
2       6
3       8
4       9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
       0
count  6
mean   9.66667
std    9.87252
min    1
25%    5.25
50%    7
75%    8.75
max    29
So: Watch out! and apply this only after doing all other analyses of the data or work on a copy of the DataFrame!
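One way to follow that advice is to attach the totals row to a copy used only for display, so the original frame keeps feeding describe() and other statistics. A minimal sketch, where display_df is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame([1, 5, 6, 8, 9])

# Work on a copy so the data used for analysis stays untouched
display_df = df.copy()
display_df.loc["Totals"] = df.sum(numeric_only=True)

print(display_df)     # six rows, including the Totals row
print(df.describe())  # still computed from the five original values
```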
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end (right before printing), so as to avoid breaking the integrity of the dataframe, I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False
                      ) -> pd.DataFrame:
    # Work on a copy so the caller's dataframe is left untouched
    ret = df.copy()
    # Summary rows (one value per column)
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    # Summary columns (one value per row)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    # Cells where a summary row meets a summary column have no value
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)