Plot groupby percentage dataframe - python

I didn't find a complete answer to what I want to do:
I have a dataframe. I want to group by user, divide each user's number of good answers by their total number of answers, display the result as a percentage, and plot it.
I have an answer column that contains 1, 0 or -1. I want to filter it to exclude -1.
Here is what I did so far:
df_sample.groupby('user').filter(lambda x : x['answer'].mean() >-1)
or :
a = df_sample.loc[df_sample['answer']!=-1,['user','answer']]
b = a.groupby(['user','answer']).agg({'answer' : 'sum'})
As you can see, it's incomplete. Thank you for any suggestions you may have.

Let's try with some sample data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
df.head():
user answer
0 D 1
1 C 0
2 D -1
3 B 1
4 C 1
Option 1
Filter out the -1 values and use named aggregation to get the "good answers" and "total answers":
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df:
good_answer total_answer
user
A 9 15
B 11 20
C 15 19
D 7 14
Use division and multiplication to get percentage:
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
plot_df:
good_answer total_answer pct
user
A 9 15 60.000000
B 11 20 55.000000
C 15 19 78.947368
D 7 14 50.000000
Then this can be plotted with DataFrame.plot:
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add labels on top of the bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
Option 2
If just the percentage is needed groupby mean can be used to get to the resulting plot directly after filtering out the -1s:
plot_df = df[df['answer'].ne(-1)].groupby('user')['answer'].mean().mul(100)
ax = plot_df.plot(
kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add labels on top of the bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
plot_df:
answer
user
A 60.000000
B 55.000000
C 78.947368
D 50.000000
Both options produce the same bar chart.
All together:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add labels on top of the bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
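A side note on the first attempt in the question: groupby(...).filter(...) keeps or drops whole groups, so it does not remove individual -1 rows. Here is a minimal sketch contrasting it with a row-wise boolean mask; df_demo and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Tiny hand-made frame (illustrative values, not the question's data)
df_demo = pd.DataFrame({
    "user":   ["A", "A", "A", "B", "B", "C"],
    "answer": [1, -1, 0, 1, 1, -1],
})

# groupby.filter decides per group: user C (mean == -1) is dropped entirely,
# but user A keeps its -1 row because the group mean is above -1
kept_groups = df_demo.groupby("user").filter(lambda g: g["answer"].mean() > -1)

# A boolean mask removes the individual -1 rows instead
valid = df_demo[df_demo["answer"].ne(-1)]

# With only 0/1 left, the mean per user is the share of good answers
pct_good = valid.groupby("user")["answer"].mean().mul(100)
print(pct_good)
```

Users whose answers are all -1 (C here) disappear from pct_good, which is probably what you want for the plot.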

Here is a sample solution assuming you want to calculate percentage based on filtered dataframe.
import pandas as pd
import numpy as np
df_sample = pd.DataFrame(np.random.randint(-1,2,size=(10, 1)), columns=['answer'])
df_sample['user'] = [i for i in 'a b c d e f a b c d'.split(' ')]
df_filtered = df_sample[df_sample.answer>-1]
print(df_filtered.groupby('user').agg({'answer' : lambda x: x.sum()/len(df_filtered)*100}))
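Note that the snippet above divides each user's sum by len(df_filtered), i.e. by the number of all valid answers across all users, not by that user's own number of answers. A small sketch of the two denominators side by side, using made-up data (the values and the names share_of_all and share_per_user are assumptions for illustration):

```python
import pandas as pd

# Illustrative data, not the random sample from the answer above
df_sample = pd.DataFrame({
    "user":   ["a", "a", "b", "b", "b", "c"],
    "answer": [1, 0, 1, 1, -1, 0],
})
df_filtered = df_sample[df_sample["answer"] > -1]

good = df_filtered.groupby("user")["answer"].sum()

# Denominator = all valid answers in the frame (what the snippet above does)
share_of_all = good / len(df_filtered) * 100

# Denominator = each user's own valid answers
share_per_user = good / df_filtered.groupby("user")["answer"].size() * 100

print(pd.DataFrame({"of_all": share_of_all, "per_user": share_per_user}))
```

Which one is right depends on whether you want each user's contribution to the total or each user's own success rate.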

Related

Divide two columns in pivot table and plot grouped bar chart with pandas

I have a dataset that looks like this:
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
Vintage Model Count Case
0 2016Q1 A 1 0
1 2016Q1 A 1 1
2 2016Q2 A 1 1
3 2016Q3 A 1 0
4 2016Q4 A 1 1
5 2016Q1 B 1 1
6 2016Q2 B 1 0
7 2016Q2 B 1 0
8 2016Q2 B 1 1
9 2016Q3 B 1 1
10 2016Q4 B 1 0
What I need to do is:
Plot grouped bar chart, where vintage is the groups and model is the hue/color
Two line plots in the same chart that show the percentage of case over count, aka plot the division of case over count for each model and vintage.
I figured out how to do the first task with a pivot table but haven't been able to add the percentage from the same pivot.
This is the solution for point 1:
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
I tried dividing between columns in the pivot table but it's not the right format to plot.
How can I do the percentage calculation and the line plots without creating a different table?
Could the whole task be done with groupby instead? (as I find it easier to use in general)
Here's a solution using the seaborn plotting library; I'm not sure whether it's OK for you to use it for your problem:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
agg_df = df.groupby(['Vintage','Model']).sum().reset_index()
agg_df['Fraction'] = agg_df['Case']/agg_df['Count']
sns.barplot(
x = 'Vintage',
y = 'Count',
hue = 'Model',
alpha = 0.5,
data = agg_df,
)
sns.lineplot(
x = 'Vintage',
y = 'Fraction',
hue = 'Model',
marker = 'o',
legend = False,
data = agg_df,
)
plt.show()
plt.close()
IIUC you want the lines to be drawn on the same plot. I'd recommend creating a new y-axis after computing the division from the original df. Then you can plot the lines with seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
ax1 = dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
dfd = df.groupby(["Vintage", "Model"]).sum() \
.assign(div_pct=lambda x:100*x["Case"]/x["Count"]) \
.reset_index()
ax2 = ax1.twinx() # creating a second y axis
sns.lineplot(data=dfd, x="Vintage", y="div_pct", hue="Model", style="Model", ax=ax2, markers=True, dashes=False)
plt.show()
Output:
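The question also asks whether the whole task could be done with groupby. A rough sketch of one way to do that with pandas and matplotlib only, assuming a single groupby followed by unstack is acceptable; the names agg, counts and pct are my own, and the lines are drawn at the integer positions that pandas uses for bar categories.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Vintage': ['2016Q1', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2016Q1',
                '2016Q2', '2016Q2', '2016Q2', '2016Q3', '2016Q4'],
    'Model': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Case': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# One groupby gives both totals for every (Vintage, Model) pair
agg = df.groupby(['Vintage', 'Model'])[['Count', 'Case']].sum()
agg['pct'] = 100 * agg['Case'] / agg['Count']

# unstack moves Model into the columns, one column per bar/line series
counts = agg['Count'].unstack('Model')
pct = agg['pct'].unstack('Model')

ax = counts.plot(kind='bar', figsize=(8, 4), rot=45, title='Vintages')
ax.set_ylabel('Frequency')

# Second y axis for the percentages; bars sit at x = 0, 1, 2, ...
ax2 = ax.twinx()
for model in pct.columns:
    ax2.plot(range(len(pct)), pct[model].to_numpy(), marker='o', label=f'{model} %')
ax2.set_ylabel('Case / Count (%)')
ax2.set_ylim(0, 105)
ax2.legend(loc='upper right')
```

Call plt.show() afterwards to display the figure.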

Create a stacked bar plot and annotate with count and percent

I have the following dataframe
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# 3.5.3
df=pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
'Length': [42,21,11,6,6,42,21,11,6,6,42],
'label': [1,1,0,0,0,1,1,0,0,0,1],
})
print(df)
# Type Length label
#0 Sentence 42 1
#1 Array 21 1
#2 String 11 0
#3 - 6 0
#4 - 6 0
#5 Sentence 42 1
#6 Array 21 1
#7 String 11 0
#8 - 6 0
#9 - 6 0
#10 Sentence 42 1
I want to plot a stacked bar chart for an arbitrary column of the dataframe (either numerical, e.g. the Length column, or categorical, e.g. the Type column), stacked with respect to the label column and annotated with both count and percentage, so that small values of rare observations are also displayed. The following script gives me the wrong results:
ax = df.plot.bar(stacked=True)
#ax = df[["Type","label"]].plot.bar(stacked=True)
#ax = df.groupby('Type').size().plot(kind='bar', stacked=True)
ax.legend(["0: Normal", "1: Anomaly"])
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2,
            y+height/2,
            '{:.0f} %'.format(height),
            horizontalalignment='center',
            verticalalignment='center')
I can imagine that somehow I need to calculate the counts of the selected column with respect to label column:
## counts will be used for the labels
counts = df.apply(lambda x: x.value_counts())
## percents will be used to determine the height of each bar
percents = counts.div(counts.sum(axis=1), axis=0)
So far I have tried a solution inspired by this post, using df.groupby(['selected column', 'label']), without success: I got TypeError: unsupported operand type(s) for +: 'int' and 'str' for total = sum(dff.sum()), and I can't figure out whether the problem is in the indexing or in the df transformation.
BTW, I collected all possible solutions in this Google Colab Notebook, but I couldn't find a straightforward way to adapt them to my dataframe via Matplotlib. So I'm looking for an elegant way of using Seaborn or plotly.
df = df.groupby(["Type","label"]).count()
#dfp_Type = df.pivot_table(index='Type', columns='label', values= 'Length', aggfunc='mean')
dfp_Type = df.pivot_table(index='Type', columns='label', values= df.Type.size(), aggfunc='mean')
#dfp_Length = df.pivot_table(index='Length', columns='label', values= df.Length.size(), aggfunc='mean')
ax = dfp_Type.plot(kind='bar', stacked=True, rot=0)
# iterate through each bar container
for c in ax.containers:
    labels = [v.get_height() if v.get_height() > 0 else '' for v in c]
    # add the annotations
    ax.bar_label(c, fmt='%0.0f%%', label_type='center')
# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
output:
Expected output:
The values in Expected output do not match df in the OP, so the sample DataFrame has been updated.
Plot with pandas.DataFrame.plot, using kind='bar' and stacked=True. pandas uses and imports matplotlib as the default plotting backend, so there's no need to import other plotting libraries.
Resources:
How to aggregate unique count with pandas pivot_table for details about using aggfunc=len in .pivot_table.
How to add value labels on a bar chart for details and examples about .bar_label.
How to add multiple annotations to a bar plot & How to create and annotate a stacked proportional bar chart for adding count and percent to a bar plot.
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1
import pandas as pd
# sample dataframe
df = pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
'Length': [42, 21, 11, 6, 6, 42, 21, 11, 6, 6, 42],
'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)
# get the total for each row
total = dfp.sum(axis=1)
# calculate the percent for each row
per = dfp.div(total, axis=0).mul(100).round(2)
# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=True, figsize=(10, 8), rot=0)
# set the colors for each Class
segment_colors = {'0': 'white', '1': 'black'}
# iterate through the containers
for c in ax.containers:
    # get the current segment label (a string); corresponds to column / legend
    label = c.get_label()
    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='center', fontweight='bold', color=segment_colors[label])
# move the legend
_ = ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')
Comment Updates
How to always have a spot for 'Array' if it's not in the data:
Add 'Array' to dfp if it's not in dfp.index.
df.Type = pd.Categorical(df.Type, ['-', 'Array', 'Sentence', 'String'], ordered=True) does not ensure the missing categories are plotted.
How to have all the annotations, even if they're small:
Don't stack the bars, and set logy=True.
This uses the full-data, which was provided in a link.
import numpy as np
# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)
# append Array if it's not included
if 'Array' not in dfp.index:
    dfp = pd.concat([dfp, pd.DataFrame({0: [np.nan], 1: [np.nan]}, index=['Array'])])
# order the index
dfp = dfp.loc[['-', 'Array', 'Sentence', 'String'], :]
# calculate the percent for each row
per = dfp.div(dfp.sum(axis=1), axis=0).mul(100).round(2)
# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=False, figsize=(10, 8), rot=0, logy=True, width=0.75)
# iterate through the containers
for c in ax.containers:
    # get the current segment label (a string); corresponds to column / legend
    label = c.get_label()
    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='edge', fontsize=10, fontweight='bold')
# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')
# pad the spacing between the number and the edge of the figure
_ = ax.margins(y=0.1)
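As a possible alternative to the concat-then-order steps for keeping a slot for 'Array', reindex can add the missing row and set the order in one call. This is only a sketch: the small frame below is made up so that 'Array' is actually absent, and order is a name I introduced.

```python
import pandas as pd

# Illustrative data with no 'Array' rows (an assumption for this sketch)
df = pd.DataFrame({
    'Type':   ['Sentence', 'String', '-', '-', 'Sentence', 'String'],
    'Length': [42, 11, 6, 6, 42, 11],
    'label':  [1, 0, 0, 1, 1, 0],
})

# pivot the dataframe and get len, as in the answer above
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)

# reindex inserts any missing Type as an all-NaN row and fixes the row order
order = ['-', 'Array', 'Sentence', 'String']
dfp = dfp.reindex(order)
print(dfp)
```

The all-NaN row is plotted as an empty slot, so the x axis still shows 'Array'.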
DataFrame Views
Based on the sample data in the OP
df
Type Length label
0 Sentence 42 1
1 Array 21 1
2 String 11 0
3 - 6 0
4 - 6 0
5 Sentence 42 1
6 Array 21 1
7 String 11 0
8 - 6 0
9 - 6 1
10 Sentence 42 0
dfp
label 0 1
Type
- 3.0 1.0
Array NaN 2.0
Sentence 1.0 2.0
String 2.0 NaN
total
Type
- 4.0
Array 2.0
Sentence 3.0
String 2.0
dtype: float64
per
label 0 1
Type
- 75.00 25.00
Array NaN 100.00
Sentence 33.33 66.67
String 100.00 NaN
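For comparison, the same counts and row percentages can also be obtained with pd.crosstab, which fills missing combinations with 0 instead of NaN. A small sketch on the updated sample data (the variable names counts and per are mine):

```python
import pandas as pd

# Updated sample data from the answer above (Length is not needed for counting)
df = pd.DataFrame({
    'Type': ['Sentence', 'Array', 'String', '-', '-', 'Sentence',
             'Array', 'String', '-', '-', 'Sentence'],
    'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Number of rows per (Type, label) combination
counts = pd.crosstab(df['Type'], df['label'])

# normalize='index' divides every row by its own total
per = pd.crosstab(df['Type'], df['label'], normalize='index').mul(100).round(2)

print(counts)
print(per)
```

Because empty cells are 0 here rather than NaN, the `if v.get_height() > 0` check in the labelling loop still hides their annotations.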
I slightly adjusted the data so the graph would look identical to yours (e.g., Type '-' now has three rows with label 0 and one with label 1).
df
###
Type Length label
0 Sentence 42 1
1 Array 21 1
2 String 11 0
3 - 6 0
4 - 6 0
5 Sentence 42 1
6 Array 21 1
7 String 11 0
8 - 6 0
9 - 6 1
10 Sentence 42 0
df_plot = df.groupby(['Type','label']).size().reset_index()
df_plot.columns = ['Type', 'Class', 'count']
df_plot = df_plot.astype({'Class':'str'})
df_plot['percentage'] = df.groupby(['Type','label']).size().groupby(level=0).apply(lambda x: 100*x/float(x.sum())).values.round(2).astype(str)
df_plot['percentage'] = "(" + df_plot['percentage'] + '%)'
df_plot
###
Type Class count percentage
0 - 0 3 (75.0%)
1 - 1 1 (25.0%)
2 Array 1 2 (100.0%)
3 Sentence 0 1 (33.33%)
4 Sentence 1 2 (66.67%)
5 String 0 2 (100.0%)
import plotly.express as px
fig = px.bar(df_plot,
x='Type',
y='count',
color='Class',
text=df_plot['count'].astype(str) + "<br>" + df_plot['percentage'],
width=550,
height=400,
category_orders={'Type':['-','Array','Sentence','String']},
template='plotly_white',
log_y=True
)
fig.show('browser')
With your CSV file, the same transformation steps produce df_plot2.
Because the counts for Class 0 and Class 1 differ hugely, a stacked bar chart (the default setting) won't give you a distinguishable result, so we can use barmode='group' instead:
fig2 = px.bar(df_plot2,
barmode='group',
x='Type',
y='count',
color='Class',
color_discrete_map={'0':'#5DA597', '1':'#FFC851'},
text=df_plot2['count'].astype(str) + "<br>" + df_plot2['percentage'],
width=850,
height=800,
category_orders={'Type': ['-', 'Array', 'Sentence', 'String']},
template='plotly_white',
log_y=True,
)
fig2.update_yaxes(dtick=1)
fig2.show('browser')
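The percentage column above is built with groupby(level=0).apply(...); a variant that I find a bit easier to read uses transform('sum') for the per-Type totals. This is a sketch of that variant only (no plotly involved), on the adjusted sample data, and it is meant to give the same df_plot as above.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['Sentence', 'Array', 'String', '-', '-', 'Sentence',
             'Array', 'String', '-', '-', 'Sentence'],
    'Length': [42, 21, 11, 6, 6, 42, 21, 11, 6, 6, 42],
    'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Rows per (Type, label)
counts = df.groupby(['Type', 'label']).size()

# transform('sum') broadcasts each Type's total back onto its rows
pct = counts.div(counts.groupby(level='Type').transform('sum')).mul(100).round(2)

df_plot = counts.rename('count').reset_index().rename(columns={'label': 'Class'})
df_plot['Class'] = df_plot['Class'].astype(str)
# counts and pct share the same row order, so positional assignment is safe
df_plot['percentage'] = pct.map('({}%)'.format).to_numpy()
print(df_plot)
```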

Issue in Plotting multiple bars in one graph in python

I want to plot a bar graph from the dataframe below.
df2 = pd.DataFrame({'URL': ['A','B','C','D','E','F'],
'X': [5,0,7,1,0,6],
'Y': [21,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
URL X Y Z
0 A 5 21 11
1 B 0 0 0
2 C 7 4 8
3 D 1 7 4
4 E 0 9 0
5 F 6 0 0
I want to plot a bar graph with the URL counts on the y-axis and X, Y, Z on the x-axis, with two bars for each. One bar should show the total sum of all the numbers in the respective column, while the other bar should show the number of non-zero values in that column. The image of the bar graph should look like this. I'd appreciate any help. Thank you.
You can use:
import numpy as np
(df2
 .reset_index()
 .melt(id_vars=['index', 'URL'])
 .assign(category=lambda d: np.where(d['value'].eq(0), 'Z', 'NZ'))
 # pivot arguments are passed by keyword (required in pandas >= 2.0)
 .pivot(index=['index', 'URL', 'variable'], columns='category', values='value')
 .groupby('variable')
 .agg(**{'sum(non-zero)': ('NZ', 'sum'), 'count(zero)': ('Z', 'count')})
 .plot.bar()
)
output:
df2.melt("URL").\
groupby("variable").\
agg(sums=("value", "sum"),
nz=("value", lambda x: sum(x != 0))).\
plot(kind="bar")
Try:
import pandas as pd
import matplotlib.pyplot as plt
df2 = pd.DataFrame({'URL': ['A','B','C','D','E','F'],
'X': [5,0,7,1,0,6],
'Y': [21,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
df2_ = df2[["X", "Y", "Z"]]
sums = df2_.sum().to_frame(name="sums")
nonzero_count = (~(df2_==0)).sum().to_frame(name="count_non_zero")
pd.concat([sums,nonzero_count], axis=1).plot.bar()
plt.show()
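To check the numbers any of these charts should show, here is a small sketch that computes the column sums next to both the non-zero and the zero counts. The question asks for the non-zero count, while the first answer's 'count(zero)' column counts zeros, so it may help to see both. The names values and summary are mine.

```python
import pandas as pd

df2 = pd.DataFrame({'URL': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'X': [5, 0, 7, 1, 0, 6],
                    'Y': [21, 0, 4, 7, 9, 0],
                    'Z': [11, 0, 8, 4, 0, 0]})

# Keep only the numeric columns by moving URL into the index
values = df2.set_index('URL')

summary = pd.DataFrame({
    'sum': values.sum(),               # total of each column
    'non_zero': values.ne(0).sum(),    # number of entries != 0
    'zero': values.eq(0).sum(),        # number of entries == 0
})
print(summary)
```

summary[['sum', 'non_zero']].plot.bar(rot=0) would then give two bars for each of X, Y and Z.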

Plot: Stack bar with three columns

I have the dataframe below.
df = '''type time gender
0 A 660.0 M
1 B 445.0 M
2 C 68.0 M
3 A 192.2 F
4 B 82.9 F
'''
I use the code below to plot the data above.
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
df.groupby(['gender','type']).count().groupby(level=0).apply(
lambda x:100*x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend( prop={'size': 16})
plt.show()
My output:
Q: I want to plot these two bars so that they take the time column of the data into account.
Appreciate it if someone can help!
Use pivot_table to sum the time, then plot with pandas/matplotlib:
(df.pivot_table(index='gender', columns='type',
values='time', aggfunc='sum',
fill_value=0)
.apply(lambda x: x/x.sum(), axis=1)
.plot.bar(stacked=True)
)
Output:
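Since df in the question is shown as a text block, here is a self-contained sketch that builds it as a real DataFrame and reproduces the time-weighted shares, assuming "involving the time" means each type's share of the summed time per gender. The percent axis formatting is my addition to match the question's plot.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# The table from the question as an actual DataFrame
df = pd.DataFrame({
    'type':   ['A', 'B', 'C', 'A', 'B'],
    'time':   [660.0, 445.0, 68.0, 192.2, 82.9],
    'gender': ['M', 'M', 'M', 'F', 'F'],
})

# Total time per gender and type; missing combinations become 0
totals = df.pivot_table(index='gender', columns='type', values='time',
                        aggfunc='sum', fill_value=0)

# Divide each row by its own total so every bar stacks up to 1
shares = totals.div(totals.sum(axis=1), axis=0)

ax = shares.plot.bar(stacked=True, rot=0)
# The values are fractions, so tell the formatter that 1.0 means 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0))
```

Call plt.show() afterwards to display the figure.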

How to plot stacked & normalized histograms?

I have a dataset that maps continuous values to discrete categories. I want to display a histogram with the continuous values as x and categories as y, where bars are stacked and normalized. Example:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
},
columns=['score', 'category'])
print(df.head(10))
Output:
score category
0 0.649371 B
1 0.042309 B
2 0.689487 A
3 0.433064 B
4 0.978859 A
5 0.789140 C
6 0.215758 D
7 0.922389 B
8 0.105364 D
9 0.010274 C
If I try to plot this as a histogram using df.hist(by='category'), I get 4 graphs:
I managed to get the graph I wanted but I had to do a lot of manipulation.
# One column per category, 1 if maps to category, 0 otherwise
df2 = pd.DataFrame({
'score' : df.score,
'A' : (df.category == 'A').astype(float),
'B' : (df.category == 'B').astype(float),
'C' : (df.category == 'C').astype(float),
'D' : (df.category == 'D').astype(float)
},
columns=['score', 'A', 'B', 'C', 'D'])
# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])
# Sum over series for weights
df4 = df3.sum(1)
bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))
bars.plot.bar(stacked=True)
I expect there is a more straightforward way to do this, one that is easier to read and understand and has fewer intermediate steps. Any solutions?
I don't know if this is really that much more compact or readable than what you already have, but it is a suggestion (a late one as such :)).
import numpy as np
import pandas as pd
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)
# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size()
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0,level=0)
# Plot
df_a.unstack().plot.bar(stacked=True)
Consider assigning bins with cut, calculating the grouping percentages with a couple of groupby().transform calls, and then aggregating and reshaping with pivot_table:
# CREATE BIN INDICATORS
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0,1.1,0.1),
                         labels=np.arange(0,1,0.1).round(1))
# CALCULATE PCT OF CATEGORY OUT OF BINs
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
.div(df.groupby(['plot_bins'])['score'].transform('count')))
# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
.reset_index(drop=True))
# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
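Another compact option, sketched here under the assumption that ten equal-width bins on [0, 1] are what is wanted, is to let pd.crosstab do the counting and the normalization in one step. The seeded generator is only there to make the sketch reproducible; the names bins and shares are mine.

```python
import numpy as np
import pandas as pd

# Reproducible sample data in the same shape as the question's df
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'score': rng.random(1000),
    'category': rng.choice(list('ABCD'), 1000),
})

# Assign each score to one of ten bins of width 0.1
bins = pd.cut(df['score'], np.linspace(0, 1, 11))

# Count category per bin and divide each row (bin) by its total
shares = pd.crosstab(bins, df['category'], normalize='index')
print(shares.round(3))
```

shares.plot.bar(stacked=True) then draws the stacked, normalized bars, one per bin.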
