Stacked Bar Plot By Group Count On Pandas Python - python

My CSV data looks like the sample below. I want to create a stacked bar plot with pandas/Python where each bar shows the male and female portions in two colors, and the top of the bar shows the total count of males and females taking the drug. For instance, 7 people are aged 20, of whom 6 are male and 1 is female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age and plot the counts, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:

This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of the ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
from io import StringIO
df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
            Drug_ID
Age Gender
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2
Using that, I can easily calculate the total number of individuals per age group:
totals = counts.groupby(level=0).sum()
print(totals)
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
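The group, unstack and fillna steps above can also be collapsed into one chain. This is a minimal sketch on a made-up subset of the data, assuming the same 'Age' and 'Gender' column names; size() counts rows per group and unstack(fill_value=0) fills the missing combinations directly, so the counts stay integers.

```python
import pandas as pd

# Made-up subset of the question's data (same column names).
df = pd.DataFrame({
    "Age": [15, 17, 19, 19, 20, 20, 20],
    "Gender": ["F", "M", "M", "M", "F", "M", "M"],
})

# Count rows per (Age, Gender), then move Gender into the columns.
# fill_value=0 replaces missing combinations without a separate fillna().
counts = df.groupby(["Age", "Gender"]).size().unstack(fill_value=0)

# Total per age is the row sum of the two gender columns.
totals = counts.sum(axis=1)

print(counts)
print(totals)
```

Because the totals come from the wide table, they are a plain Series and need no flatten() later on.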
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. It cannot be done in a single call, so we loop through the ages and the totals. totals is a one-column DataFrame, so totals.values is a 2-D array; flatten() turns it into a flat sequence of counts.
for age,tot in zip(ages,totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age+0.4, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age+0.4, tot) because the bars go from x to x+width, with width=0.8 by default, so x+0.4 is the center of the bar, while tot is the full height of the bar. (Since matplotlib 2.0, bars are centered on x by default, so use xy=(age, tot) there.) To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
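Putting the plotting and annotation steps together, here is a minimal sketch for matplotlib 2.0 or later, where bars are centered on their x value. The tiny CSV is made up, and pd.crosstab is swapped in for the groupby/unstack steps as a shortcut; neither is part of the original answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample with the same column names as the question.
csv = """Age,Gender
15,F
17,M
20,F
20,M
20,M
"""
df = pd.read_csv(io.StringIO(csv))

# Rows are ages, columns are genders, missing combinations become 0.
counts = pd.crosstab(df["Age"], df["Gender"])
totals = counts.sum(axis=1)

fig, ax = plt.subplots()
ax.bar(counts.index, counts["M"], color="blue", label="M")
ax.bar(counts.index, counts["F"], bottom=counts["M"], color="pink", label="F")

# Bars are centered on x, so the annotation sits at x=age, y=total.
for age, tot in totals.items():
    ax.annotate(f"N={tot}", xy=(age, tot), xytext=(0, 5),
                textcoords="offset points", ha="center", va="bottom")

ax.legend()
ax.set_xlabel("Ages")
ax.set_ylabel("Count")
```

Call plt.show() or fig.savefig(...) at the end, depending on whether you want a window or a file.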

Related

Modify a bar plot into a stacked plot keeping the original values

I have a pandas DataFrame containing the percentage of students that have a certain skill in each subject stratified according to their gender
iterables = [['Above basic','Basic','Low'], ['Female','Male']]
index = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(data=[[36,36,8,8,6,6],[46,46,2,3,1,2],[24,26,10,11,16,13]], index=["Math", "Literature", "Physics"], columns=index)
print(df)
Skills     Above basic      Basic         Low
Gender          Female Male Female Male Female Male
Math                36   36      8    8      6    6
Literature          46   46      2    3      1    2
Physics             24   26     10   11     16   13
Next I want to see how the skills are distributed according to the subjects
#plot how the skills are distributed according to the subjects
df.sum(axis=1,level=[0]).plot(kind='bar')
df.plot(kind='bar')
Now I would like to add the percentage of Male and Female to each bar in a stacked manner. For example, for the first bar ("Math", "Above basic") it should be 50/50, for the bar ("Literature", "Basic") it should be 40/60, for the bar ("Literature", "Low") it should be 33.3/66.7, and so on.
Could you give me a hand?
Using the level keyword in DataFrame and Series aggregations, as in df.sum(axis=1,level=[0]), is deprecated.
Use df.groupby(level=0, axis=1).sum() instead.
df.div(dfg).mul(100).round(1).astype(str) creates a DataFrame of strings with the 'Female' and 'Male' percent for each of the 'Skills', which can be used to create a custom bar label.
As shown in this answer, use matplotlib.pyplot.bar_label to annotate the bars, which has a labels= parameter for custom labels.
Tested in python 3.11, pandas 1.5.3, matplotlib 3.7.0, seaborn 0.12.2
# group df to create the bar plot
dfg = df.groupby(level=0, axis=1).sum()
# calculate the Female / Male percent for each Skill
percent_s = df.div(dfg).mul(100).round(1).astype(str)
# plot the bars
ax = dfg.plot(kind='bar', figsize=(10, 7), rot=0, width=0.9, ylabel='Total Percent\n(Female/Male split)')
# iterate through the bar containers
for c in ax.containers:
    # get the Skill label
    label = c.get_label()
    # use the Skill label to get the current group based on level, join the strings, and get an array of custom labels
    labels = percent_s.loc[:, percent_s.columns.get_level_values(0).isin([label])].agg('/'.join, axis=1).values
    # add the custom labels to the center of the bars
    ax.bar_label(c, labels=labels, label_type='center')
    # add total percent to the top of the bars
    ax.bar_label(c, weight='bold', fmt='%g%%')
percent_s
Skills Above basic Basic Low
Gender Female Male Female Male Female Male
Math 50.0 50.0 50.0 50.0 50.0 50.0
Literature 50.0 50.0 40.0 60.0 33.3 66.7
Physics 48.0 52.0 47.6 52.4 55.2 44.8
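On pandas 2.1 and later, groupby(..., axis=1) is itself deprecated, so the grouping step may emit a warning there. One way around it is to transpose, group on the index level, and transpose back. The following is a sketch of that variant, not the code tested in the answer; the names group_totals and pct are chosen here for illustration.

```python
import pandas as pd

iterables = [["Above basic", "Basic", "Low"], ["Female", "Male"]]
columns = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=["Math", "Literature", "Physics"],
    columns=columns,
)

# Per-skill totals without axis=1: transpose, group rows, transpose back.
dfg = df.T.groupby(level="Skills").sum().T

# Same totals broadcast back to the (Skills, Gender) columns, so the
# division below aligns column by column.
group_totals = df.T.groupby(level="Skills").transform("sum").T

# Female / Male share of each skill, in percent.
pct = df.div(group_totals).mul(100).round(1)

print(dfg)
print(pct)
```

pct.astype(str) can then stand in for percent_s in the labeling loop.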
Optionally, melt df into a long form, and use sns.catplot with kind='bar' to plot each 'Gender' in a separate Facet.
# melt df into a long form
dfm = df.melt(ignore_index=False).reset_index(names='Subject')
# plot the melted dataframe
g = sns.catplot(kind='bar', data=dfm, x='Subject', y='value', col='Gender', hue='Skills')
# Flatten the axes for ease of use
axes = g.axes.ravel()
# relabel the yaxis
axes[0].set_ylabel('Percent')
# add bar labels
for ax in axes:
    for c in ax.containers:
        ax.bar_label(c, fmt='%0.1f%%')
Or swap x= and col= to col='Subject' and x='Gender'.
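To see what the long form passed to sns.catplot looks like, here is a small sketch that rebuilds the frame from the question and melts it. It uses rename_axis(...).reset_index() in place of reset_index(names=...), on the assumption that this also runs on pandas versions older than 1.5.

```python
import pandas as pd

iterables = [["Above basic", "Basic", "Low"], ["Female", "Male"]]
columns = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=["Math", "Literature", "Physics"],
    columns=columns,
)

# One row per (Subject, Skills, Gender) combination; the column level
# names become the 'Skills' and 'Gender' columns of the long frame.
dfm = df.melt(ignore_index=False).rename_axis("Subject").reset_index()

print(dfm.head())
```

Each of the 18 rows holds one bar's value, which is the shape catplot expects for x=, hue= and col=.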

Show Count and percentage labels for grouped bar chart python

I would like to add count and percentage labels to a grouped bar chart, but I haven't been able to figure it out.
I've seen examples for count or percentage for single bars, but not for grouped bars.
the data looks something like this (not the real numbers):
  age_group  Mis  surv  unk  death  total  surv_pct  death_pct
0       0-9    1     2    0      3      6     100.0        0.0
1     10-19    2     1    0      1      4      99.9        0.0
2     20-29    0     3    0      1      4      99.9        0.0
3     30-39    0     7    1      2     10     100.0        0.0
4     40-49    0     5    0      1      6      99.7        0.3
5     50-59    0     6    0      4     10      99.3        0.3
6     60-69    0     7    1      4     12      98.0        2.0
7     70-79    1     8    2      5     16      92.0        8.0
8       80+    0    10    0      7     17      81.0       19.0
The chart looks something like this:
I created the chart with this code:
ax = df.plot(y=['deaths', 'surv'],
             kind='barh',
             figsize=(20,9),
             rot=0,
             title= '\n\n surv and deaths by age group')
ax.legend(['Deaths', 'Survivals']);
ax.set_xlabel('\nCount');
ax.set_ylabel('Age Group\n');
How could I add count and percentage labels to the grouped bars? I would like it to look something like this chart
Since nobody else has suggested anything, here is one way to approach it with your dataframe structure.
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv("test.txt", delim_whitespace=True)
cat = ['death', 'surv']
ax = df.plot(y=cat,
             kind='barh',
             figsize=(20, 9),
             rot=0,
             title= '\n\n surv and deaths by age group')
#making space for the annotation
xmin, xmax = ax.get_xlim()
ax.set_xlim(xmin, 1.05 * xmax)
#connecting bar series with df columns
for cont, col in zip(ax.containers, cat):
    #connecting each bar of the series with its absolute and relative values
    for rect, vals, perc in zip(cont.patches, df[col], df[col+"_pct"]):
        #annotating each bar
        ax.annotate(f"{vals} ({perc:.1f}%)", (rect.get_width(), rect.get_y() + rect.get_height() / 2.),
                    ha='left', va='center', fontsize=10, color='black', xytext=(3, 0),
                    textcoords='offset points')
ax.set_yticklabels(df.age_group)
ax.set_xlabel('\nCount')
ax.set_ylabel('Age Group\n')
ax.legend(['Deaths', 'Survivals'], loc="lower right")
plt.show()
Sample output:
If the percentages per category add up, one could also calculate the percentages on the fly. This would then not necessitate that the percentage columns have exactly the same name structure. Another problem is that the font size of the annotation, the scaling to make space for labeling the largest bar, and the distance between bar and annotation are not interactive and may need fine-tuning.
However, I am not fond of this mixing of pandas and matplotlib plotting functions. I had cases where the axis definition by pandas interfered with matplotlib, and datetime objects ... well, let's not talk about that.
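The idea of calculating the percentages on the fly could look like the sketch below. It assumes that each percentage is the value's share of death + surv within its age group, which may not match how the original surv_pct and death_pct columns were defined; the numbers are made up.

```python
import pandas as pd

# Made-up counts; only the columns needed for the labels.
df = pd.DataFrame({
    "age_group": ["0-9", "10-19", "20-29"],
    "surv": [2, 1, 3],
    "death": [3, 1, 1],
})
cat = ["death", "surv"]

# Assumption: percentages are relative to death + surv per age group.
row_total = df[cat].sum(axis=1)
pct = df[cat].div(row_total, axis=0).mul(100)

# One list of "count (percent%)" strings per plotted column.
labels = {
    col: [f"{v} ({p:.1f}%)" for v, p in zip(df[col], pct[col])]
    for col in cat
}

print(labels["death"])
print(labels["surv"])
```

These strings can replace the f-string inside the annotation loop, so no *_pct columns with matching names are needed.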

Categorical axis in Holoviews Curve leads to error

I have this data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQxY-rTQNBbMWtRI2p8m_gj0TvmkHt3_CqNkhRILq6s5xL3mGX-IXtZt9pej-ae3ZNN1EnAP_iC0unx/pub?output=csv', index_col=False)
df.head()
Id Year Period Cumulative YearPeriod YearPeriodStr
0 16432976 1 1 6.0 11 Period 11
1 16432976 1 2 7.0 12 Period 12
2 16432976 1 0 0.0 10 Period 10
3 16879454 1 1 6.0 11 Period 11
4 16879454 1 2 6.0 12 Period 12
What I want to achieve is a combination of two ordinal variables on the x-axis. The first level should be Year, the second Period. The following code is what I thought would lead to the graph I want:
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
ds = hv.Dataset(df).sort(['Year','Period'])
ds.to(hv.Curve, ['Year','Period'],'Cumulative', 'Id').overlay()
But this puts Period on the y-axis. The following code gets me closer to what I want:
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
# Numerical column
df['YearPeriod'] = (df.Year * 10) + (df.Period)
ds = hv.Dataset(df).sort(['Year','Period'])
ds.to(hv.Curve, 'YearPeriod','Cumulative', 'Id').overlay()
But the x-axis now is a continuous variable, while what I want is the second level, Period, to be ordinal.
The following code leads more or less to what I want, but fails when the number of rows exceeds 120.
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
# String
df['YearPeriodStr'] = 'Period'+ df['Year'].astype(str) + df['Period'].astype(str)
ds = hv.Dataset(df.head(120)).sort(['Year','Period'])
ds.to(hv.Curve, 'YearPeriodStr','Cumulative', 'Id').overlay()
What am I missing?

Appropriate handling of Pandas dataframe scatterplot with varying colors and marker sizes

Given a dataframe df with columns A, B, C, and D,
A B C D
0 88 38 15.66 30.0
1 88 34 15.66 40.0
2 15 15 12.00 20.0
3 15 19 8.00 15.0
4 45 12 6.00 15.0
5 45 30 4.00 30.0
6 29 31 3.60 15.0
7 88 20 3.60 10.0
8 64 25 3.60 15.0
9 45 43 3.60 20.0
I want to make a scatter plot that graphs A vs B, with sizes based on C and colors based on D. After trying many ways to do this, I settled on grouping the data by D, then plotting each group in D:
fig,axes=plt.subplots()
factor=df.groupby('D')
for name, group in factor:
    axes.scatter(group.A,group.B,s=(group.C)**2,c=group.D,
                 cmap='viridis',norm=Normalize(vmin=min(df.D),vmax=max(df.D)),label=name)
This yields the appropriate result, but the default legend() output is wrong. The groups listed in the legend have the correct names but incorrect colors and sizes (colors should vary by group, and all markers should be the same size).
I tried to set the legend manually; I can approximate the colors that way but can't get the sizes to be equal. Eventually I'd like a second legend that links sizes to the appropriate levels of C.
axes.legend(loc=1,scatterpoints=1,fontsize='small',frameon=False,ncol=2)
leg=axes.get_legend()
for i in range(len(factor)):
    z=plt.cm.viridis(np.linspace(0,1,len(factor)))
    leg.legendHandles[i].set_color(z[i])
Here's one approach that seems to satisfy your requirements, using Seaborn's lmplot(). (Inspiration taken from this post.)
First, generate some sample data:
import numpy as np
import pandas as pd
n = 10
min_size = 50
max_size = 300
A = np.random.random(n)
B = np.random.random(n)*2
C = np.random.randint(min_size, max_size, size=n)
D = np.random.choice(['Group1','Group2'], n)
df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D})
Now plot:
import seaborn as sns
sns.lmplot(x='A', y='B', hue='D',
           fit_reg=False, data=df,
           scatter_kws={'s':df.C})
UPDATE
Given updated example data from OP, the same lmplot() approach should fulfill specifications: group legend is tracked by color, size of legend indicators is equal.
sns.lmplot(x='A', y='B', hue='D', data=df,
           scatter_kws={'s':df.C**2}, fit_reg=False)
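If you would rather stay in plain matplotlib, one possible route is to build the legend from proxy handles, so that every entry gets its group's color and a fixed marker size. This is a sketch under assumptions: it uses the first five rows of the question's data, the marker size of 8 is an arbitrary choice, and the second legend for C is not covered.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import Normalize
from matplotlib.lines import Line2D

# First rows of the data from the question.
df = pd.DataFrame({
    "A": [88, 88, 15, 15, 45],
    "B": [38, 34, 15, 19, 12],
    "C": [15.66, 15.66, 12.0, 8.0, 6.0],
    "D": [30.0, 40.0, 20.0, 15.0, 15.0],
})

fig, ax = plt.subplots()
norm = Normalize(vmin=df.D.min(), vmax=df.D.max())
cmap = plt.cm.viridis

handles = []
for name, group in df.groupby("D"):
    color = cmap(norm(name))  # one RGBA color per group
    ax.scatter(group.A, group.B, s=group.C ** 2, color=color)
    # Proxy artist: same color as the group, fixed marker size in the legend.
    handles.append(Line2D([], [], marker="o", linestyle="", color=color,
                          markersize=8, label=str(name)))

ax.legend(handles=handles, title="D", frameon=False)
```

The proxies are independent of the plotted points, so the legend markers stay the same size regardless of C.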

Stacked Plot To Represent Genders For An Age Group From CSV containing Identifier , Age and Gender On Python / Pandas/ Matplotlib

I have CSV data with age, gender (Men, Women) and an identifier. I grouped the individuals by age and gender and counted the identifiers with pandas:
counts = df.groupby(['Age','Gender']).count()
print(counts)
and the result looked something like this:
Age Gender  Id_count
15  W              1
17  M              1
19  M              2
20  M              6
    W              1
21  M              3
    W              1
23  M              4
    W              3
24  M              8
    W              3
25  M              9
26  M              6
    W              1
27  M              3
    W              1
28  M              9
    W              2
29  M              5
    W              1
30  M              3
31  M              9
    W              1 ..
The unique ages in my dataset range from 15 to 90. I now want to do an age-group analysis, ending with a stacked plot. For that, I want to bin the ages into age groups (10-20, 21-30, 31-40 and so on) and plot the sum of identifiers for each age group, showing the sum on top of the bar, with two different colors in each stacked bar representing men and women in proportion to their id_count. To implement this, I created a dictionary mapping each age to its range, as shown below.
df['ids_counted']= np.round(df['Age'])
categories_dict = { 15 : 'Between 10 and 20',
16 : 'Between 10 and 20',
17 : 'Between 10 and 20',
18 : 'Between 10 and 20',
19 : 'Between 10 and 20',
20 : 'Between 10 and 20',
21 : 'Between 21 and 30',
22 : 'Between 21 and 30',..
90 : 'Between 81 and 90',}
Then I created this dataframe.
df['category'] = df['ids_counted'].map(categories_dict)
count2 = df.groupby(['category','Age','Gender','Id_Count']).count()
total = count2.groupby(level=0).sum()
print(total)
Now I have successfully counted the total number of identifiers in each age group. It looked something like this:
Between 10 and 20 11
Between 21 and 30 62
Between 31 and 40 82
Between 41 and 50 120
Between 51 and 60 125
Between 61 and 70 141
Between 71 and 80 192
Between 81 and 90 38
But I lost my way here, because I wanted to plot gender too. Take the ages between 10 and 20: the total, 11, should be on top of the bar, and the 9 men and 2 women should be plotted as portions of a stacked bar. I then tried another approach, because I don't think this one will get me to my result. I generated a grouped dataframe with the counts of men and women per age, then calculated the total number of individuals per age:
totals = counts.groupby(level=0).sum()
Now to plot:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['W'], bottom=counts['M'], color='red', label='W')
plt.legend()
plt.xlabel('Age Group')
plt.ylabel('Occurences Of Identifiers')
plt.title('ttl',fontsize=20)
for age,tot in zip(ages,totals.values.flatten()):
    plt.annotate('{:d}'.format(tot), xy=(age+0.39, tot), xytext=(0,1), textcoords='offset points', ha='center', va='bottom')
plt.show()
plt.save()
plt.close()
This gave me a plot that turned out okay, but it is for individual ages, whereas my target is to generate the same plot for the age groups in my dictionary. I would be very grateful if anyone could give me a suggestion or an idea for how to get there. Thank you so much for your time.
Assigning age groups is easier using np.digitize.
n = 100
age = np.random.randint(15, 91, size=n)
gender = np.random.randint(2, size=n)
df = pd.DataFrame({'Age': age, 'Gender': gender})
bins = np.arange(1, 10) * 10
df['category'] = np.digitize(df.Age, bins, right=True)
print(df.head())
Age Gender category
0 22 1 2
1 54 0 5
2 85 1 8
3 77 0 7
4 86 1 8
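To make the bin numbers less opaque, here is a short sketch of what np.digitize returns with right=True, plus one possible way to turn the numbers into readable range labels. The label format and the restriction to ages from 11 to 90 are assumptions for this illustration.

```python
import numpy as np

bins = np.arange(1, 10) * 10          # [10, 20, ..., 90]
ages = np.array([15, 20, 21, 54, 90])

# With right=True, category i means bins[i-1] < age <= bins[i].
category = np.digitize(ages, bins, right=True)

# Hypothetical label format; valid only for ages in 11..90, where
# 1 <= i <= 8 so both bins[i-1] and bins[i] exist.
labels = [f"{bins[i - 1] + 1}-{bins[i]}" for i in category]

print(category)
print(labels)
```

Ages of 10 or below map to category 0 and ages above 90 map to 9, so those cases would need their own labels.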
Now count grouping by category and gender, then unstack the result to have gender as columns.
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
Gender 0 1
category
1 2 7
2 7 5
3 6 4
4 11 9
5 5 8
6 2 4
7 10 7
8 6 7
Plotting is now a breeze.
counts.plot(kind='bar', stacked=True)
This turned out to be my final code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use('fivethirtyeight')
df = pd.read_csv('/home/Desktop/cocktail_ids_age_gender.csv')
df.values
bins = np.arange(10, 100, 10)
df['category'] = np.digitize(df.Age, bins, right=True)
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
ax = counts.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0).astype(np.int64), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.xlabel('Age Group')
plt.ylabel('Co-Occurences ')
plt.title('Comparison Of Occurences In An Age Group',fontsize=20)
plt.show()
I decided to leave it stacked anyway, because it made the analysis easier. Everything turned out well, thanks to goyo. The only thing still bothering me is my x-axis: instead of showing 1, 2, 3, 4, ... I want it to show 10-20, 20-30 and so on. I can't figure out how to do that. Can anyone help? Thank you.
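One possible way to get range labels on the x-axis is to bin with pd.cut and pass the labels directly, so the group names end up as the tick labels. This is a sketch with made-up data; the "11-20" label style reflects that the bins are closed on the right, and the column name 'group' is chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; the real data would come from the CSV file.
df = pd.DataFrame({
    "Age": [15, 20, 21, 35, 38, 90],
    "Gender": ["M", "W", "M", "M", "W", "W"],
})

edges = list(range(10, 100, 10))  # 10, 20, ..., 90
labels = [f"{lo + 1}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Right-closed bins such as (10, 20], named by the labels above.
df["group"] = pd.cut(df["Age"], bins=edges, labels=labels)

counts = (df.groupby(["group", "Gender"], observed=True)
            .size()
            .unstack(fill_value=0))

ax = counts.plot(kind="bar", stacked=True, rot=0)
# Label only the top segment of each stack with the group total.
ax.bar_label(ax.containers[-1], labels=[str(t) for t in counts.sum(axis=1)])
ax.set_xlabel("Age Group")
```

With an existing numeric category column, ax.set_xticklabels(...) with a list of range strings in bar order is another option.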
