I need help adding the percent distribution of the total (no decimals) in each section of a stacked bar plot in pandas created from a crosstab in a dataframe.
Here is sample data:
import pandas as pd
import matplotlib.pyplot as plt
data = {
'Name':['Alisa','Bobby','Bobby','Alisa','Bobby','Alisa',
'Alisa','Bobby','Bobby','Alisa','Bobby','Alisa'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','English','English','Science','Science',
'Mathematics','Mathematics','English','English','Science','Science'],
'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}
df = pd.DataFrame(data)
# display(df)
Name Exam Subject Result
0 Alisa Semester 1 Mathematics Pass
1 Bobby Semester 1 Mathematics Pass
2 Bobby Semester 1 English Fail
3 Alisa Semester 1 English Pass
4 Bobby Semester 1 Science Fail
5 Alisa Semester 1 Science Pass
6 Alisa Semester 2 Mathematics Pass
7 Bobby Semester 2 Mathematics Fail
8 Bobby Semester 2 English Fail
9 Alisa Semester 2 English Pass
10 Bobby Semester 2 Science Pass
11 Alisa Semester 2 Science Fail
Here is my code:
#crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax.plot.bar(figsize=(10,10),stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='best', bbox_to_anchor=(0.1, 1.0),title="Subject",)
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
plt.show()
I know I need to add plt.text somehow, but I can't figure out how. I would like each segment's percent of the total to be displayed inside the stacked bars.
Let's try:
# crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax_1 = ax.plot.bar(figsize=(10,10), stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='upper center', bbox_to_anchor=(0.1, 1.0), title="Subject")
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
for rec in ax_1.patches:
    height = rec.get_height()
    ax_1.text(rec.get_x() + rec.get_width() / 2,
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center',
              va='bottom')
plt.show()
Output:
Subject English Mathematics Science
Name
Alisa 33.333333 33.333333 33.333333
Bobby 33.333333 33.333333 33.333333
From matplotlib 3.4, use matplotlib.pyplot.bar_label.
See this answer for a thorough explanation of using the method, and for additional examples.
Using label_type='center' will annotate with the value of each segment, and label_type='edge' will annotate with the cumulative sum of the segments.
It is easiest to plot stacked bars using pandas.DataFrame.plot with kind='bar' and stacked=True
To get the percent in a vectorized manner (without .apply):
Get the frequency count using pd.crosstab
Divide ct along axis=0 by ct.sum(axis=1)
It is important to specify the correct axis with .div and .sum.
Multiply by 100, and round.
This is best done using .crosstab because it results in a dataframe with the correct shape for plotting the stacked bars. .groupby would require further reshaping of the dataframe.
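As a minimal sketch of why the axis matters in the steps above, the following uses a small hypothetical data set (the rows are made up so that the two names have different totals). It compares the vectorized .div call with the .apply version and with pd.crosstab's normalize='index' option, and shows what happens when the wrong axis is passed to .div.

```python
import pandas as pd

# hypothetical data, chosen so the rows have different totals
df = pd.DataFrame({
    'Name': ['Alisa', 'Alisa', 'Alisa', 'Bobby'],
    'Subject': ['English', 'Science', 'Science', 'English'],
})

ct = pd.crosstab(df['Name'], df['Subject'])

# row totals (sum over axis=1) divided down the index (axis=0):
# every row of the result sums to 100
pct = ct.div(ct.sum(axis=1), axis=0).mul(100)

# the same result with .apply, and with crosstab's own normalization
pct_apply = ct.apply(lambda r: r / r.sum() * 100, axis=1)
pct_norm = pd.crosstab(df['Name'], df['Subject'], normalize='index').mul(100)

# wrong axis: the row totals are aligned against the column labels,
# nothing matches, and every value becomes NaN
wrong = ct.div(ct.sum(axis=1), axis=1)

print(pct.round(0))
```

If only the percentages are needed, normalize='index' is the shortest route; the .div form is useful when the raw counts in ct are also needed later.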
Tested in python 3.10, pandas 1.3.4, matplotlib 3.5.0
import pandas as pd
import matplotlib.pyplot as plt
# get a frequency count using crosstab
ct = pd.crosstab(df['Name'], df['Subject'])
# vectorized calculation of the percent per row
ct = ct.div(ct.sum(axis=1), axis=0).mul(100).round(2)
# display(ct)
Subject English Mathematics Science
Name
Alisa 33.33 33.33 33.33
Bobby 33.33 33.33 33.33
# specify custom colors
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
# plot
ax = ct.plot(kind='bar', figsize=(10, 10), stacked=True, rot=0, color=pal, xlabel='Name', ylabel='Percent Distribution')
# move the legend
ax.legend(title='Subject', bbox_to_anchor=(1, 1.02), loc='upper left')
# iterate through each bar container
for c in ax.containers:
    # add the annotations
    ax.bar_label(c, fmt='%0.0f%%', label_type='center')
plt.show()
Using label_type='edge' annotates with the cumulative sum
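To make the difference between the two label types concrete, here is a small sketch with hypothetical percentages (two subjects only, values chosen so the arithmetic is exact). It labels the top segment of each stacked bar both ways and collects the label text; the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical percentages, each row sums to 100
pct = pd.DataFrame({'English': [25.0, 50.0], 'Science': [75.0, 50.0]},
                   index=['Alisa', 'Bobby'])

ax = pct.plot(kind='bar', stacked=True, rot=0)

# the last container holds the top segment ('Science') of each bar
top = ax.containers[-1]

# both calls are made here only to compare the resulting label text
center_labels = [t.get_text() for t in ax.bar_label(top, label_type='center', fmt='%.0f%%')]
edge_labels = [t.get_text() for t in ax.bar_label(top, label_type='edge', fmt='%.0f%%')]

print(center_labels)  # value of the segment itself
print(edge_labels)    # top of the stack, i.e. the cumulative sum
```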
Related
I have a pandas DataFrame containing the percentage of students that have a certain skill in each subject stratified according to their gender
iterables = [['Above basic','Basic','Low'], ['Female','Male']]
index = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(data=[[36,36,8,8,6,6],[46,46,2,3,1,2],[24,26,10,11,16,13]], index=["Math", "Literature", "Physics"], columns=index)
print(df)
Skills     Above basic      Basic       Low
Gender          Female Male Female Male Female Male
Math                36   36      8    8      6    6
Literature          46   46      2    3      1    2
Physics             24   26     10   11     16   13
Next I want to see how the skills are distributed according to the subjects
#plot how the skills are distributed according to the subjects
df.sum(axis=1,level=[0]).plot(kind='bar')
df.plot(kind='bar')
Now I would like to add the percentage of Male and Female to each bar in a stacked manner. For example, for the first bar ("Math", "Above basic") it should be 50/50; for the bar ("Literature", "Basic") it should be 40/60; for the bar ("Literature", "Low") it should be 33.3/66.7; and so on.
Could you give me a hand?
Using the level keyword in DataFrame and Series aggregations, df.sum(axis=1,level=[0]), is deprecated.
Use df.groupby(level=0, axis=1).sum()
df.div(dfg).mul(100).round(1).astype(str) creates a DataFrame of strings with the 'Female' and 'Male' percent for each of the 'Skills', which can be used to create a custom bar label.
As shown in this answer, use matplotlib.pyplot.bar_label to annotate the bars, which has a labels= parameter for custom labels.
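Note that in pandas 2.1 and later, passing axis=1 to groupby is itself deprecated. A minimal sketch of one alternative, assuming the df from the question, is to transpose, group on the 'Skills' index level, and transpose back:

```python
import pandas as pd

# the DataFrame from the question
iterables = [['Above basic', 'Basic', 'Low'], ['Female', 'Male']]
index = pd.MultiIndex.from_product(iterables, names=['Skills', 'Gender'])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=['Math', 'Literature', 'Physics'],
    columns=index,
)

# sum Female + Male for each Skill without passing axis=1 to groupby
dfg = df.T.groupby(level='Skills').sum().T

print(dfg)
```

The result has one column per skill and should be usable in place of the dfg computed with groupby(level=0, axis=1).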
Tested in python 3.11, pandas 1.5.3, matplotlib 3.7.0, seaborn 0.12.2
# group df to create the bar plot
dfg = df.groupby(level=0, axis=1).sum()
# calculate the Female / Male percent for each Skill
percent_s = df.div(dfg).mul(100).round(1).astype(str)
# plot the bars
ax = dfg.plot(kind='bar', figsize=(10, 7), rot=0, width=0.9, ylabel='Total Percent\n(Female/Male split)')
# iterate through the bar containers
for c in ax.containers:
    # get the Skill label
    label = c.get_label()
    # use the Skill label to get the current group based on level, join the strings, and get an array of custom labels
    labels = percent_s.loc[:, percent_s.columns.get_level_values(0).isin([label])].agg('/'.join, axis=1).values
    # add the custom labels to the center of the bars
    ax.bar_label(c, labels=labels, label_type='center')
    # add total percent to the top of the bars
    ax.bar_label(c, weight='bold', fmt='%g%%')
percent_s
Skills     Above basic       Basic        Low
Gender          Female  Male Female  Male Female  Male
Math              50.0  50.0   50.0  50.0   50.0  50.0
Literature        50.0  50.0   40.0  60.0   33.3  66.7
Physics           48.0  52.0   47.6  52.4   55.2  44.8
Optionally, melt df into a long form, and use sns.catplot with kind='bar' to plot each 'Gender' in a separate Facet.
# melt df into a long form
dfm = df.melt(ignore_index=False).reset_index(names='Subject')
# plot the melted dataframe
g = sns.catplot(kind='bar', data=dfm, x='Subject', y='value', col='Gender', hue='Skills')
# Flatten the axes for ease of use
axes = g.axes.ravel()
# relabel the yaxis
axes[0].set_ylabel('Percent')
# add bar labels
for ax in axes:
    for c in ax.containers:
        ax.bar_label(c, fmt='%0.1f%%')
Or swap x= and col= to col='Subject' and x='Gender'.
I am using seaborn to draw a plot; my code is below.
import seaborn as sns
sns.set_theme(style="whitegrid")
tips = sns.load_dataset("tips")
tips=tips.head()
ax = sns.barplot(x="day", y="total_bill",hue="sex", data=tips, palette="tab20_r")
I want to label each bar with its frequency, that is, the number of rows that fall into it. The expected image is below.
To add labels to the bars, I used the code below:
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
With the code above I can display the bar height, but I don't want the height. I want the frequency/count of rows in each group. In the example above, 3 males and 2 females gave a tip on Sunday, so the labels should show 3 and 2, not the tip amount.
Below is the code
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
df = sns.load_dataset("tips")
ax = sns.barplot(x='day', y='tip',hue="sex", data=df, palette="tab20_r")
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
How to display custom values on a bar plot does not clearly show how to annotate grouped bars, nor does it show how to determine the frequency of each hue category for each day.
How to plot and annotate grouped bars in seaborn / matplotlib shows how to annotate grouped bars, but not with custom labels.
for rect in ax.patches is an obsolete way to annotate bars. Use matplotlib.pyplot.bar_label, as fully described in How to add value labels on a bar chart.
Use pandas.crosstab or pandas.DataFrame.groupby to calculate the count of each category by the hue group.
As tips.info() shows, several columns have a category Dtype, which ensures the plotting order and is why the tp.index and tp.columns order matches the x-axis and hue order of ax. Use pandas.Categorical to set a column to a category Dtype.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import pandas as pd
import seaborn as sns
# load the data
tips = sns.load_dataset('tips')
# determine the number of each gender for each day
tp = pd.crosstab(tips.day, tips.sex)
# or use groupby
# tp = tips.groupby(['day', 'sex']).sex.count().unstack('sex')
# plot the data
ax = sns.barplot(x='day', y='total_bill', hue='sex', data=tips)
# move the legend if needed
sns.move_legend(ax, bbox_to_anchor=(1, 1.02), loc='upper left', frameon=False)
# iterate through each group of bars, zipped to the corresponding column name
for c, col in zip(ax.containers, tp):
    # add bar labels with custom annotation values
    ax.bar_label(c, labels=tp[col], padding=3, label_type='center')
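The note above about the category Dtype can be illustrated with a small sketch. The five rows below are hypothetical (they are not the tips data); the point is only that pd.crosstab sorts plain strings alphabetically, while columns converted with pandas.Categorical keep the category order that was specified.

```python
import pandas as pd

# hypothetical rows, stored as plain strings
df = pd.DataFrame({
    'day': ['Sun', 'Thur', 'Sat', 'Sun', 'Fri'],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
})

# plain strings: index and columns come out in alphabetical order
plain = pd.crosstab(df['day'], df['sex'])

# explicit category order, matching the order wanted on the plot
df['day'] = pd.Categorical(df['day'], categories=['Thur', 'Fri', 'Sat', 'Sun'], ordered=True)
df['sex'] = pd.Categorical(df['sex'], categories=['Male', 'Female'])

# categorical columns: index and columns follow the category order
ordered = pd.crosstab(df['day'], df['sex'])

print(plain)
print(ordered)
```

If the count table and the plot disagree on order, the labels passed to bar_label end up on the wrong bars, so it is worth checking both orders before annotating.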
DataFrame Views
tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
tp
sex Male Female
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18
I would like to add count and percentage labels to a grouped bar chart, but I haven't been able to figure it out.
I've seen examples for count or percentage for single bars, but not for grouped bars.
The data looks something like this (not the real numbers):
   age_group  Mis  surv  unk  death  total  surv_pct  death_pct
0        0-9    1     2    0      3      6     100.0        0.0
1      10-19    2     1    0      1      4      99.9        0.0
2      20-29    0     3    0      1      4      99.9        0.0
3      30-39    0     7    1      2     10     100.0        0.0
4      40-49    0     5    0      1      6      99.7        0.3
5      50-59    0     6    0      4     10      99.3        0.3
6      60-69    0     7    1      4     12      98.0        2.0
7      70-79    1     8    2      5     16      92.0        8.0
8        80+    0    10    0      7     17      81.0       19.0
The chart looks something like this:
I created the chart with this code:
ax = df.plot(y=['deaths', 'surv'],
kind='barh',
figsize=(20,9),
rot=0,
title= '\n\n surv and deaths by age group')
ax.legend(['Deaths', 'Survivals']);
ax.set_xlabel('\nCount');
ax.set_ylabel('Age Group\n');
How could I add count and percentage labels to the grouped bars? I would like it to look something like this chart
Since nobody else has suggested anything, here is one way to approach it with your dataframe structure.
from matplotlib import pyplot as plt
import pandas as pd
# sep=r"\s+" splits on any whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv("test.txt", sep=r"\s+")
cat = ['death', 'surv']
ax = df.plot(y=cat,
kind='barh',
figsize=(20, 9),
rot=0,
title= '\n\n surv and deaths by age group')
#making space for the annotation
xmin, xmax = ax.get_xlim()
ax.set_xlim(xmin, 1.05 * xmax)
#connecting bar series with df columns
for cont, col in zip(ax.containers, cat):
    #connecting each bar of the series with its absolute and relative values
    for rect, vals, perc in zip(cont.patches, df[col], df[col+"_pct"]):
        #annotating each bar
        ax.annotate(f"{vals} ({perc:.1f}%)", (rect.get_width(), rect.get_y() + rect.get_height() / 2.),
                    ha='left', va='center', fontsize=10, color='black', xytext=(3, 0),
                    textcoords='offset points')
ax.set_yticklabels(df.age_group)
ax.set_xlabel('\nCount')
ax.set_ylabel('Age Group\n')
ax.legend(['Deaths', 'Survivals'], loc="lower right")
plt.show()
Sample output:
If the percentages per category add up, one could also calculate the percentages on the fly. This would then not necessitate that the percentage columns have exactly the same name structure. Another problem is that the font size of the annotation, the scaling to make space for labeling the largest bar, and the distance between bar and annotation are not interactive and may need fine-tuning.
However, I am not fond of this mixing of pandas and matplotlib plotting functions. I had cases where the axis definition by pandas interfered with matplotlib, and datetime objects ... well, let's not talk about that.
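The on-the-fly calculation mentioned above might be sketched as follows. The small frame is hypothetical, and the percentage is assumed here to mean each bar's share of its own column total; a different denominator (for example the row total) would need a different division.

```python
import pandas as pd

# hypothetical counts, without precomputed *_pct columns
df = pd.DataFrame({
    'age_group': ['0-9', '10-19', '20-29'],
    'death': [1, 1, 2],
    'surv': [2, 3, 5],
})
cat = ['death', 'surv']

# share of each age group within its own column, in percent
pct = df[cat].div(df[cat].sum()).mul(100)

# label text for every bar of every series, e.g. "1 (25.0%)"
labels = {col: [f"{v} ({p:.1f}%)" for v, p in zip(df[col], pct[col])]
          for col in cat}

print(labels['death'])
print(labels['surv'])
```

Inside the annotation loop, pct[col] would then take the place of df[col+"_pct"], so the column naming convention no longer matters.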
I want to display the confidence interval for each bar in my plot, but the intervals do not show. I have two dataframes, and my plot displays the average of the NUMBER_GIRLS column from each of them.
For example, consider the two dataframes (shown below).
schools_north_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 32
2 SCHOOL_2 12
3 SCHOOL_3 26
schools_south_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 56
2 SCHOOL_2 33
3 SCHOOL_3 34
Therefore, I have used this code (shown below) to plot my barplot with the confidence intervals showing for each bar - but when plotting it, the confidence interval does not show up.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
objects = ('North', 'South')
y_pos = np.arange(len(objects))
avg_girls = [schools_north_df['NUMBER_GIRLS'].mean(), schools_south_df['NUMBER_GIRLS'].mean()]
sns.barplot(x=y_pos, y=avg_girls, ci=95)
plt.xticks(y_pos, objects)
plt.title('Average Number of Girls')
plt.show()
If anyone could kindly help me and indicate what is wrong with my code. I really need the confidence interval to display on my barplot.
Thank you very much!
If you want seaborn to display the confidence intervals, you need to let seaborn aggregate the data by itself (that is to say, provide the raw data instead of calculating the mean yourself).
I would create a new dataframe with an extra column (region) to indicate whether the data are from the "north" or the "south" and then request seaborn to plot NUMBER_GIRLS vs region:
df = pd.concat([schools_north_df.assign(region='North'), schools_south_df.assign(region='South')])
output:
ID NAME NUMBER_GIRLS region
0 1 SCHOOL_1 32 North
1 2 SCHOOL_2 12 North
2 3 SCHOOL_3 26 North
0 1 SCHOOL_1 56 South
1 2 SCHOOL_2 33 South
2 3 SCHOOL_3 34 South
plot:
sns.barplot(data=df, x='region', y='NUMBER_GIRLS', ci=95)
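Note that in seaborn 0.12 and later the ci parameter is deprecated in favor of errorbar=('ci', 95). To illustrate why the raw rows are needed at all, here is a minimal matplotlib-only sketch that computes the per-region mean and spread by hand. It uses the standard error of the mean rather than seaborn's bootstrapped confidence interval, so the bar heights are the same means but the error bars will not be identical to seaborn's.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# the two dataframes from the question (only the column that is plotted)
schools_north_df = pd.DataFrame({'NUMBER_GIRLS': [32, 12, 26]})
schools_south_df = pd.DataFrame({'NUMBER_GIRLS': [56, 33, 34]})

df = pd.concat([schools_north_df.assign(region='North'),
                schools_south_df.assign(region='South')])

# mean and standard error per region; the spread can only be computed
# from the individual rows, not from a precomputed average
stats = df.groupby('region')['NUMBER_GIRLS'].agg(['mean', 'sem'])

fig, ax = plt.subplots()
ax.bar(stats.index, stats['mean'], yerr=stats['sem'], capsize=5)
ax.set_ylabel('NUMBER_GIRLS')
ax.set_title('Average Number of Girls')
```

A single precomputed mean per region carries no information about spread, which is why the original call had nothing to draw as an interval.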
My csv data looks something like the sample below. I want to create a stacked bar plot with pandas/python where each bar shows the male and female portions in two colors, with the total count of males and females taking the drug on top of the bar. For instance, 7 people are aged 20, 6 of them male and 1 female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age and plot the counts, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
from io import StringIO
df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
Drug_ID
Age Gender
15 F 1
17 M 1
19 M 2
20 F 1
M 6
21 F 1
M 3
23 F 3
M 4
24 F 3
M 2
Using that, I can easily calculate the total number of individuals per age group:
totals = counts.groupby(level=0).sum()
print(totals)
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
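As a shorter route to the same table, a sketch using size() with unstack(fill_value=0) avoids the NaN step and keeps integer counts. This is an alternative, not what the steps above do; the data is a compact reconstruction of the (Age, Gender) pairs from the CSV, without the Drug_ID column.

```python
import pandas as pd

# (age, gender, number of rows) taken from the sample CSV
groups = [(15, 'F', 1), (17, 'M', 1), (19, 'M', 2), (20, 'F', 1), (20, 'M', 6),
          (21, 'F', 1), (21, 'M', 3), (23, 'F', 3), (23, 'M', 4),
          (24, 'F', 3), (24, 'M', 2)]
df = pd.DataFrame([(age, g) for age, g, n in groups for _ in range(n)],
                  columns=['Age', 'Gender'])

# count rows per (Age, Gender), then move Gender to the columns;
# missing combinations are filled with 0 instead of NaN
counts = df.groupby(['Age', 'Gender']).size().unstack(fill_value=0)

# total per age is the row sum
totals = counts.sum(axis=1)

print(counts)
print(totals)
```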
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in a single pass; instead we'll loop through the ages and the totals. Because totals is a one-column DataFrame, totals.values is a 2-D array, so I flatten() it to get a flat sequence of numbers.
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot) because, since matplotlib 2.0, bars are centered on their x value by default (align='center'), while tot is the full height of the bar. In older versions, where bars ran from x to x+width with width=0.8 by default, the center of the bar was at x+0.4. To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
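With matplotlib 3.4 or later, the same plot can also be sketched with pandas plotting and Axes.bar_label instead of a manual annotate() loop. This is an alternative to the approach above, not part of it; the data is a compact reconstruction of the (Age, Gender) pairs from the sample CSV, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# (age, gender, number of rows) taken from the sample CSV
groups = [(15, 'F', 1), (17, 'M', 1), (19, 'M', 2), (20, 'F', 1), (20, 'M', 6),
          (21, 'F', 1), (21, 'M', 3), (23, 'F', 3), (23, 'M', 4),
          (24, 'F', 3), (24, 'M', 2)]
df = pd.DataFrame([(age, g) for age, g, n in groups for _ in range(n)],
                  columns=['Age', 'Gender'])

# one row per age, one column per gender
counts = pd.crosstab(df['Age'], df['Gender'])
totals = counts.sum(axis=1)

# stacked bars: one container per column, the last one sits on top
ax = counts.plot(kind='bar', stacked=True, rot=0, ylabel='Count')

# put the total on top of each stack via custom labels on the top container
texts = ax.bar_label(ax.containers[-1],
                     labels=[f'N={t}' for t in totals], padding=3)

print([t.get_text() for t in texts])
```

Because the labels are attached to the top container, they sit at the full height of each stack, including ages where the top segment has zero height.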