How to annotate grouped bars with group count instead of bar height - python

To draw plot, I am using seaborn and below is my code
import seaborn as sns
sns.set_theme(style="whitegrid")
tips = sns.load_dataset("tips")
tips=tips.head()
ax = sns.barplot(x="day", y="total_bill",hue="sex", data=tips, palette="tab20_r")
I want to get and print frequency of data plots that is no. of times it occurred and below is the expected image
To Add label in bar,
I have used below code
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = 1
label = "{:.0f}".format(y_value)
ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
So, With above code. I am able to display height with respect to x-axis , but I don't want height. I want frequency/count that satisfies relationship. For above example, there are 2 male and 3 female who gave tip on Sunday. So it should display 2 and 3 and not the amount of tip
Below is the code
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
df = sns.load_dataset("tips")
ax = sns.barplot(x='day', y='tip',hue="sex", data=df, palette="tab20_r")
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = 1
label = "{:.0f}".format(y_value)
ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()

How to display custom values on a bar plot does not clearly show how to annotate grouped bars, nor does it show how to determine the frequency of each hue category for each day.
How to plot and annotate grouped bars in seaborn / matplotlib shows how to annotate grouped bars, but not with custom labels.
for rect in ax.patches is an obsolete way to annotate bars. Use matplotlib.pyplot.bar_label, as fully described in How to add value labels on a bar chart.
Use pandas.crosstab or pandas.DataFrame.groupby to calculate the count of each category by the hue group.
As tips.info() shows, several columns have a category Dtype, which insures the plotting order and why the tp.index and tp.column order matches the x-axis and hue order of ax. Use pandas.Categorical to set a column to a category Dtype.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import pandas as pd
import seaborn as sns
# load the data
tips = sns.load_dataset('tips')
# determine the number of each gender for each day
tp = pd.crosstab(tips.day, tips.sex)
# or use groupby
# tp = tips.groupby(['day', 'sex']).sex.count().unstack('sex')
# plot the data
ax = sns.barplot(x='day', y='total_bill', hue='sex', data=tips)
# move the legend if needed
sns.move_legend(ax, bbox_to_anchor=(1, 1.02), loc='upper left', frameon=False)
# iterate through each group of bars, zipped to the corresponding column name
for c, col in zip(ax.containers, tp):
# add bar labels with custom annotation values
ax.bar_label(c, labels=tp[col], padding=3, label_type='center')
DataFrame Views
tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
tp
sex Male Female
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18

Related

Modify a bar plot into a staked plot keeping the original values

I have a pandas DataFrame containing the percentage of students that have a certain skill in each subject stratified according to their gender
iterables = [['Above basic','Basic','Low'], ['Female','Male']]
index = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(data=[[36,36,8,8,6,6],[46,46,2,3,1,2],[24,26,10,11,16,13]], index=["Math", "Literature", "Physics"], columns=index)
print(df)
Skill Above basic Basic Low
Gender Female Male Female Male Female Male
Math 36 36 8 8 6 6
Literature 46 46 2 3 1 2
Physics 24 26 10 11 16 13
Next I want to see how the skills are distributed according to the subjects
#plot how the skills are distributed according to the subjects
df.sum(axis=1,level=[0]).plot(kind='bar')
df.plot(kind='bar')
Now I would like to add the percentage of Male and Female to each bar in a stacked manner.. eg. for the fist bar ("Math", "Above basic") it should be 50/50. For the bar ("Literature", "Basic") it should be 40/60, for the bar ("Literature","Low") it should be 33.3/66.7 and so on...
Could you give me a hand?
Using the level keyword in DataFrame and Series aggregations, df.sum(axis=1,level=[0]), is deprecated.
Use df.groupby(level=0, axis=1).sum()
df.div(dfg).mul(100).round(1).astype(str) creates a DataFrame of strings with the 'Female' and 'Male' percent for each of the 'Skills', which can be used to create a custom bar label.
As shown in this answer, use matplotlib.pyplot.bar_label to annotate the bars, which has a labels= parameter for custom labels.
Tested in python 3.11, pandas 1.5.3, matplotlib 3.7.0, seaborn 0.12.2
# group df to create the bar plot
dfg = df.groupby(level=0, axis=1).sum()
# calculate the Female / Male percent for each Skill
percent_s = df.div(dfg).mul(100).round(1).astype(str)
# plot the bars
ax = dfg.plot(kind='bar', figsize=(10, 7), rot=0, width=0.9, ylabel='Total Percent\n(Female/Male split)')
# iterate through the bar containers
for c in ax.containers:
# get the Skill label
label = c.get_label()
# use the Skill label to get the current group based on level, join the strings,and get an array of custom labels
labels = percent_s.loc[:, percent_s.columns.get_level_values(0).isin([label])].agg('/'.join, axis=1).values
# add the custom labels to the center of the bars
ax.bar_label(c, labels=labels, label_type='center')
# add total percent to the top of the bars
ax.bar_label(c, weight='bold', fmt='%g%%')
percent_s
Skills Above basic Basic Low
Gender Female Male Female Male Female Male
Math 50.0 50.0 50.0 50.0 50.0 50.0
Literature 50.0 50.0 40.0 60.0 33.3 66.7
Physics 48.0 52.0 47.6 52.4 55.2 44.8
Optionally, melt df into a long form, and use sns.catplot with kind='bar' to plot each 'Gender' in a separate Facet.
# melt df into a long form
dfm = df.melt(ignore_index=False).reset_index(names='Subject')
# plot the melted dataframe
g = sns.catplot(kind='bar', data=dfm, x='Subject', y='value', col='Gender', hue='Skills')
# Flatten the axes for ease of use
axes = g.axes.ravel()
# relabel the yaxis
axes[0].set_ylabel('Percent')
# add bar labels
for ax in axes:
for c in ax.containers:
ax.bar_label(c, fmt='%0.1f%%')
Or swap x= and col= to col='Subject' and x='Gender'.

Stacked Barplot dataset with percentage [duplicate]

I need help adding the percent distribution of the total (no decimals) in each section of a stacked bar plot in pandas created from a crosstab in a dataframe.
Here is sample data:
data = {
'Name':['Alisa','Bobby','Bobby','Alisa','Bobby','Alisa',
'Alisa','Bobby','Bobby','Alisa','Bobby','Alisa'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','English','English','Science','Science',
'Mathematics','Mathematics','English','English','Science','Science'],
'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}
df = pd.DataFrame(data)
# display(df)
Name Exam Subject Result
0 Alisa Semester 1 Mathematics Pass
1 Bobby Semester 1 Mathematics Pass
2 Bobby Semester 1 English Fail
3 Alisa Semester 1 English Pass
4 Bobby Semester 1 Science Fail
5 Alisa Semester 1 Science Pass
6 Alisa Semester 2 Mathematics Pass
7 Bobby Semester 2 Mathematics Fail
8 Bobby Semester 2 English Fail
9 Alisa Semester 2 English Pass
10 Bobby Semester 2 Science Pass
11 Alisa Semester 2 Science Fail
Here is my code:
#crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax.plot.bar(figsize=(10,10),stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='best', bbox_to_anchor=(0.1, 1.0),title="Subject",)
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
plt.show()
I know I need to add a plt.text some how, but can't figure it out. I would like the percent of the totals to be embedded within the stacked bars.
Let's try:
# crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax_1 = ax.plot.bar(figsize=(10,10), stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='upper center', bbox_to_anchor=(0.1, 1.0), title="Subject")
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
for rec in ax_1.patches:
height = rec.get_height()
ax_1.text(rec.get_x() + rec.get_width() / 2,
rec.get_y() + height / 2,
"{:.0f}%".format(height),
ha='center',
va='bottom')
plt.show()
Output:
Subject English Mathematics Science
Name
Alisa 33.333333 33.333333 33.333333
Bobby 33.333333 33.333333 33.333333
From matplotlib 3.4.2 use matplotlib.pyplot.bar_label
See this answer for a thorough explanation of using the method, and for additional examples.
Using label_type='center' will annotate with the value of each segment, and label_type='edge' will annotate with the cumulative sum of the segments.
It is easiest to plot stacked bars using pandas.DataFrame.plot with kind='bar' and stacked=True
To get the percent in a vectorized manner (without .apply):
Get the frequency count using pd.crosstab
Divide ct along axis=0 by ct.sum(axis=1)
It is important to specify the correct axis with .div and .sum.
Multiply by 100, and round.
This is best done using .crosstab because it results in a dataframe with the correct shape for plotting the stacked bars. .groupby would require further reshaping of the dataframe.
Tested in python 3.10, pandas 1.3.4, matplotlib 3.5.0
import pandas as pd
import matplotlib.pyplot as plt
# get a frequency count using crosstab
ct = pd.crosstab(df['Name'], df['Subject'])
# vectorized calculation of the percent per row
ct = ct.div(ct.sum(axis=1), axis=0).mul(100).round(2)
# display(ct)
Subject English Mathematics Science
Name
Alisa 33.33 33.33 33.33
Bobby 33.33 33.33 33.33
# specify custom colors
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
# plot
ax = ct.plot(kind='bar', figsize=(10, 10), stacked=True, rot=0, color=pal, xlabel='Name', ylabel='Percent Distribution')
# move the legend
ax.legend(title='Subject', bbox_to_anchor=(1, 1.02), loc='upper left')
# iterate through each bar container
for c in ax.containers:
# add the annotations
ax.bar_label(c, fmt='%0.0f%%', label_type='center')
plt.show()
Using label_type='edge' annotates with the cumulative sum

Show Count and percentage labels for grouped bar chart python

I would like to add count and percentage labels to a grouped bar chart, but I haven't been able to figure it out.
I've seen examples for count or percentage for single bars, but not for grouped bars.
the data looks something like this (not the real numbers):
age_group Mis surv unk death total surv_pct death_pct
0 0-9 1 2 0 3 6 100.0 0.0
1 10-19 2 1 0 1 4 99.9 0.0
2 20-29 0 3 0 1 4 99.9 0.0
3 30-39 0 7 1 2 10 100.0 0.0
`4 40-49 0 5 0 1 6 99.7 0.3
5 50-59 0 6 0 4 10 99.3 0.3
6 60-69 0 7 1 4 12 98.0 2.0
7 70-79 1 8 2 5 16 92.0 8.0
8 80+ 0 10 0 7 17 81.0 19.0
And The chart looks something like this
I created the chart with this code:
ax = df.plot(y=['deaths', 'surv'],
kind='barh',
figsize=(20,9),
rot=0,
title= '\n\n surv and deaths by age group')
ax.legend(['Deaths', 'Survivals']);
ax.set_xlabel('\nCount');
ax.set_ylabel('Age Group\n');
How could I add count and percentage labels to the grouped bars? I would like it to look something like this chart
Since nobody else has suggested anything, here is one way to approach it with your dataframe structure.
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv("test.txt", delim_whitespace=True)
cat = ['death', 'surv']
ax = df.plot(y=cat,
kind='barh',
figsize=(20, 9),
rot=0,
title= '\n\n surv and deaths by age group')
#making space for the annotation
xmin, xmax = ax.get_xlim()
ax.set_xlim(xmin, 1.05 * xmax)
#connecting bar series with df columns
for cont, col in zip(ax.containers, cat):
#connecting each bar of the series with its absolute and relative values
for rect, vals, perc in zip(cont.patches, df[col], df[col+"_pct"]):
#annotating each bar
ax.annotate(f"{vals} ({perc:.1f}%)", (rect.get_width(), rect.get_y() + rect.get_height() / 2.),
ha='left', va='center', fontsize=10, color='black', xytext=(3, 0),
textcoords='offset points')
ax.set_yticklabels(df.age_group)
ax.set_xlabel('\nCount')
ax.set_ylabel('Age Group\n')
ax.legend(['Deaths', 'Survivals'], loc="lower right")
plt.show()
Sample output:
If the percentages per category add up, one could also calculate the percentages on the fly. This would then not necessitate that the percentage columns have exactly the same name structure. Another problem is that the font size of the annotation, the scaling to make space for labeling the largest bar, and the distance between bar and annotation are not interactive and may need fine-tuning.
However, I am not fond of this mixing of pandas and matplotlib plotting functions. I had cases where the axis definition by pandas interfered with matplotlib, and datetime objects ... well, let's not talk about that.

How to create grouped bars charts with matplotlib with data in DataFrame

This is my current output:
Now i want the next bars next to the already plotted bars.
My DataFrame has 3 columns: 'Block', 'Cluster', and 'District'.
'Block' and 'Cluster' contain the numbers for plotting and the grouping is based
on the strings in 'District'.
How can I plot the other bars next to the existing bars?
df=pd.read_csv("main_ds.csv")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
plt.xticks(rotation=90)
bwidth=0.30
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
ax.autoscale(tight=False)
def autolabel(rects):
for rect in rects:
h = rect.get_height()
ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
ha='center', va='top')
autolabel(indic1)
autolabel(indic2)
plt.show()
Data:
District Block Cluster Villages Schools Decadal_Growth_Rate Literacy_Rate Male_Literacy Female_Literacy Primary ... Govt_School Pvt_School Govt_Sch_Rural Pvt_School_Rural Govt_Sch_Enroll Pvt_Sch_Enroll Govt_Sch_Enroll_Rural Pvt_Sch_Enroll_Rural Govt_Sch_Teacher Pvt_Sch_Teacher
0 Dimapur 5 30 278 494 23.2 85.4 88.1 82.5 147 ... 298 196 242 90 33478 57176 21444 18239 3701 3571
1 Kiphire 3 3 94 142 -58.4 73.1 76.5 70.4 71 ... 118 24 118 24 5947 7123 5947 7123 853 261
2 Kohima 5 5 121 290 22.7 85.6 89.3 81.6 128 ... 189 101 157 49 10116 26464 5976 8450 2068 2193
3 Longleng 2 2 37 113 -30.5 71.1 75.6 65.4 60 ... 90 23 90 23 3483 4005 3483 4005 830 293
4 Mon 5 5 139 309 -3.8 56.6 60.4 52.4 165 ... 231 78 219 58 18588 16578 17108 8665 1667 903
5 rows × 26 columns
Try using pandas.DataFrame.plot
import pandas as pd
import numpy as np
from io import StringIO
from datetime import date
import matplotlib.pyplot as plt
def add_value_labels(ax, spacing=5):
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = spacing
# Vertical alignment for positive values
va = 'bottom'
# If value of bar is negative: Place label below bar
if y_value < 0:
# Invert space to place label below
space *= -1
# Vertically align label at top
va = 'top'
# Use Y value as label and format number with one decimal place
label = "{:.1f}".format(y_value)
# Create annotation
ax.annotate(
label, # Use `label` as label
(x_value, y_value), # Place label at end of the bar
xytext=(0, space), # Vertically shift label by `space`
textcoords="offset points", # Interpret `xytext` as offset in points
ha='center', # Horizontally center label
va=va) # Vertically align label differently for
# positive and negative values.
first3columns = StringIO("""District Block Cluster
Dimapur 5 30
Kiphire 3 3
Kohima 5 5
Longleng 2
Mon 5 5
""")
df_plot = pd.read_csv(first3columns, delim_whitespace=True)
fig, ax = plt.subplots()
#df_plot.set_index(['District'], inplace=True)
df_plot[['Block', 'Cluster']].plot.bar(ax=ax, color=['r', 'b'])
ax.set_xticklabels(df_plot['District'])
add_value_labels(ax)
plt.show()
Try changing
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
to
indic1=ax.bar(df["District"]-bwidth/2,df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"]+bwidth/2,df["Cluster"], width=bwidth, color='b')

How to rotate seaborn barplot x-axis tick labels

I'm trying to get a barplot to rotate it's X Labels in 45° to make them readable (as is, there's overlap).
len(genero) is 7, and len(filmes_por_genero) is 20
I'm using a MovieLens dataset and making a graph counting the number of movies in each individual genre. Here's my code as of now:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
filmes_por_genero = filmes["generos"].str.get_dummies('|').sum().sort_values(ascending=False)
genero = filmes_com_media.index
chart = plt.figure(figsize=(16,8))
sns.barplot(x=genero,
y=filmes_por_genero.values,
palette=sns.color_palette("BuGn_r", n_colors=len(filmes_por_genero) + 4)
)
chart.set_xticklabels(
chart.get_xticklabels(),
rotation=45,
horizontalalignment='right'
)
Here's the full error:
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
623 in_axis=in_axis,
624 )
--> 625 if not isinstance(gpr, Grouping)
626 else gpr
627 )
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
254 self.name = name
255 self.level = level
--> 256 self.grouper = _convert_grouper(index, grouper)
257 self.all_grouper = None
258 self.index = index
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in _convert_grouper(axis, grouper)
653 elif isinstance(grouper, (list, Series, Index, np.ndarray)):
654 if len(grouper) != len(axis):
--> 655 raise ValueError("Grouper and axis must be same length")
656 return grouper
657 else:
ValueError: Grouper and axis must be same length
Data from MovieLens 25M Dataset at MovieLens
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# data
df = pd.read_csv('ml-25m/movies.csv')
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
# clean genres
df['genres'] = df['genres'].str.split('|')
df = df.explode('genres', ignore_index=True)
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy
Genres Counts
gc = df.genres.value_counts().to_frame()
print(gc)
genres
Drama 25606
Comedy 16870
Thriller 8654
Romance 7719
Action 7348
Horror 5989
Documentary 5605
Crime 5319
(no genres listed) 5062
Adventure 4145
Sci-Fi 3595
Children 2935
Animation 2929
Mystery 2925
Fantasy 2731
War 1874
Western 1399
Musical 1054
Film-Noir 353
IMAX 195
sns.barplot
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc) + 4), ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
plt.figure(figsize=(12, 6))
chart = sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc)))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
sns.countplot
Use sns.countplot to skip using .value_counts() if the plot order doesn't matter.
To order the countplot, order=df.genres.value_counts().index must be used, so countplot doesn't really save you from needing .value_counts(), if a descending order is desired.
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=df, x='genres', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
Shorter code for label rotation:
plt.xticks(rotation=45, ha='right')
Rotates labels by 45 degree
Aligns labels horizontally to the right for better readability
Full Example
sns.countplot with sorted x-axis
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('planets')
sns.countplot(data=df,
x='method',
order=df['method'].value_counts().index)
plt.xticks(rotation=45, ha='right');

Categories

Resources