seaborn: 'rows' and 'x_vars' at the same time - python

I want a seaborn multiplot that varies the x-axis variable by column, but varies the subset of data shown by row. I can use PairGrid to vary the variables graphed, and I can use FacetGrid to vary the subsets graphed, but I don't see any facility to do both at once, even though it seems like a natural extension.
Is there a way to do this in seaborn currently? Or is this something that would need a feature request?
Here's a mockup of what I'm trying to do:
label:A
y:Y
(plot M vs Y where label == A)
(plot N vs Y where label == A)
label:B
y:Y
(plot M vs Y where label == B)
(plot N vs Y where label == B)
x:M
x:N
I'd also take the transpose of this scheme :)

This is not a feature that directly exists in seaborn (though it is likely to become one at some point).
That said, FacetGrid and PairGrid just instantiate different mappings between a dataframe and a figure (modulo the diagonal plots in PairGrid and a few features here and there). So a plot that is naturally expressed using one tool can generally be made with the other, given a data reshaping.
So you could do something like
x_var = "body_mass_g"
col_var = "sex"
hue_var = "species"
y_vars = ["bill_length_mm", "bill_depth_mm"]
(
    df
    .melt([x_var, col_var, hue_var], y_vars)
    .pipe(
        (sns.relplot, "data"),
        x=x_var,
        y="value",
        hue=hue_var,
        col=col_var,
        row="variable",
        facet_kws=dict(sharey="row"),
        height=3.5,
    )
)
There's your plot, basically, but the labels are a little confusing. Let's improve that:
g = (
    df
    .melt([x_var, col_var, hue_var], y_vars)
    .pipe(
        (sns.relplot, "data"),
        x=x_var,
        y="value",
        hue=hue_var,
        col=col_var,
        row="variable",
        facet_kws=dict(sharey="row", margin_titles=True),
        height=3.5,
    )
    .set_titles(col_template="{col_var} = {col_name}", row_template="")
)
for (row_name, _), ax in g.axes_dict.items():
    ax.set_ylabel(row_name)
A little more cumbersome, but also not so hard to wrap up into a pretty general function:
def paired_column_facets(
    data: pd.DataFrame, y_vars: list[str], other_vars: dict[str, str], **kwargs
) -> sns.FacetGrid:
    g = (
        data  # use the function argument rather than a global df
        .melt(list(other_vars.values()), y_vars)
        .pipe(
            (sns.relplot, "data"),
            **other_vars,
            y="value",
            row="variable",
            facet_kws=dict(sharey="row", margin_titles=True),
            **kwargs,
        )
        .set_titles(col_template="{col_var} = {col_name}", row_template="")
    )
    # label each row with the name of the y variable it shows
    for (row_name, _), ax in g.axes_dict.items():
        ax.set_ylabel(row_name)
    return g
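For the layout in the question's mockup (the x variable changes by column, the data subset changes by row), the same reshaping idea can be applied to the x variables instead. This is only a minimal sketch on made-up data: the frame `toy`, its columns `label`, `M`, `N`, `Y`, and the names `x_variable` / `x_value` are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two subsets (label A/B), two x variables (M, N), one y variable (Y)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "label": np.repeat(["A", "B"], 20),
    "M": rng.normal(size=40),
    "N": rng.normal(size=40),
    "Y": rng.normal(size=40),
})

# Melt the x variables into long form: one row per (observation, x variable)
long_df = toy.melt(
    id_vars=["label", "Y"], value_vars=["M", "N"],
    var_name="x_variable", value_name="x_value",
)

# Rows select the subset, columns select which x variable is plotted
g = sns.relplot(
    data=long_df, x="x_value", y="Y",
    row="label", col="x_variable",
    facet_kws=dict(sharex="col", margin_titles=True),
    height=3,
)

# Replace the generic "x_value" label with the variable shown in each column
for (_, col_name), ax in g.axes_dict.items():
    ax.set_xlabel(col_name)
```

Swapping the `row` and `col` arguments (and using `sharex="row"`) should give the transposed scheme.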

Related

Is there a way to set a custom baseline for a stacked area chart in Plotly?

For context, what I'm trying to do is make an emission abatement chart that has the abated emissions being subtracted from the baseline. Mathematically, this is the same as adding the abatement to the residual emission line:
Residual = Baseline - Abated
The expected results should look something like this:
Desired structure of stacked area chart:
I've currently got the stacked area chart to look like this:
As you can see, the stacking in the area chart starts at zero. I'm trying to get the stacked areas either added on top of the residual (red) line or subtracted from the baseline (black) line.
In Excel I would do this by defining a blank area as the first stacked item, equal to the residual line, so that the stacking occurs on top of it. However, I'm not sure if there is a Pythonic way to do this in Plotly while maintaining the structure and interactivity of the chart.
The shaping of the pandas dataframes is pretty simple, just a randomly generated series of abatement values for each of the subcategories I've set up, that are then grouped to form the baseline and the residual forecasts:
scenario = 'High'
# The baseline data as a line
baseline_line = baselines.loc[baselines['Scenario']==scenario].groupby(['Year']).sum()
# The abatement and residual data
df2 = deepcopy(abatement).drop(columns=['tCO2e'])
df2['Baseline'] = baselines['tCO2e']
df2['Abatement'] = abatement['tCO2e']
df2 = df2.fillna(0)
df2['Residual'] = df2['Baseline'] - df2['Abatement']
df2 = df2.loc[abatement['Scenario']==scenario]
display(df2)
# The residual forecast as a line
emissions_lines = df2.loc[df2['Scenario']==scenario].groupby(['Year']).sum()
The charting is pretty simple as well, using the plotly express functionality:
# Just plotting
fig = px.area(df2,
x = 'Year',
y = 'Abatement',
color = 'Site',
line_group = 'Fuel_type'
)
fig2 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Baseline',
color_discrete_sequence = ['black'])
fig3 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Residual',
color_discrete_sequence = ['red'])
fig.add_trace(
fig2.data[0],
)
fig.add_trace(
fig3.data[0],
)
fig.show()
To summarise, I wish to have the Plotly stacked area chart be 'elevated' so that it fits between the residual and baseline forecasts.
NOTE: I've used the term 'baseline' with two meanings here: one specific to my example, and one generic to stacked area charts (in the title). The first usage, in the title, is meant to be the series from which the stacked area chart starts. Currently, this series is just the x-axis, or zero; I'd like to customise this so that I can define a series (in this example, the red residual line) that the stacking starts from.
The second usage of the term 'baseline' refers to the 'baseline forecast', or BAU.
I think I've found a workaround, it is not ideal, but is similar to the approach I have taken in excel. I've ultimately added the 'residual' emissions in the same structure as the categories and concatenated it at the start of the DataFrame, so it bumps everything else up in between the residual and baseline forecasts.
Concatenation step:
# Me trying to make it cleanly at the residual line
df2b = deepcopy(emissions_lines)
df2b['Fuel_type'] = "Base"
df2b['Site'] = "Base"
df2b['Abatement'] = df2b['Residual']
df2c = pd.concat([df2b.reset_index(),df2],ignore_index=True)
Rejigged plotting step, with some recolouring/reformatting of the chart:
# Just plotting
fig = px.area(df2c,
x = 'Year',
y = 'Abatement',
color = 'Site',
line_group = 'Fuel_type'
)
# Making the baseline invisible and ignorable
fig.data[0].line.color='rgba(255, 255, 255,1)'
fig.data[0].showlegend = False
fig2 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Baseline',
color_discrete_sequence = ['black'])
fig3 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Residual',
color_discrete_sequence = ['red'])
fig.add_trace(
fig2.data[0],
)
fig.add_trace(
fig3.data[0],
)
fig.show()
Outcome:
I'm going to leave this unresolved, as I see this as not what I originally intended. It currently 'works', but this is not ideal and causes some issues with the interaction with the legend function in the Plotly object.

Plot bar chart using color to represent third dimension

My aim is to show a bar chart with 3-dim data, x, categorical and y1, y2 as continuous series; the bars should have heights from y1 and color to indicate y2.
This does not seem to be particularly obscure to me, but I didn't find a simple / built-in way to use a bar chart to visualise three dimensions -- I'm thinking mostly for exploratory purposes, before investigating relationships more formally.
Am I missing a type of plot in the libraries? Is there a good alternative to showing 3d data?
Anyway here are some things that I've tried that aren't particularly satisfying:
Some data for these attempts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Example data with explicit (-ve) correlation in the two series
n = 10; sd = 2.5
fruits = [ 'Lemon', 'Cantaloupe', 'Redcurrant', 'Raspberry', 'Papaya',
'Apricot', 'Cherry', 'Durian', 'Guava', 'Jujube']
np.random.seed(101)
cost = np.random.uniform(3, 15, n)
harvest = 50 - (np.random.randn(n) * sd + cost)
df = pd.DataFrame(data={'fruit':fruits, 'cost':cost, 'harvest':harvest})
df.sort_values(by="cost", inplace=True) # preferable to sort during plot only
# set up several subplots to show progress.
n_colors = 5; cmap_base = "coolwarm" # a diverging map
fig, axs = plt.subplots(3,2)
ax = axs.flat
Attempt 1 uses hue for the 3rd dim data in barplot. However, this produces a single color for each value in the series, and also seems to do odd things with the bar width & spacing.
import seaborn as sns
sns.barplot(ax=ax[0], x='fruit', y='cost', hue='harvest',
data=df, palette=cmap_base)
# fix the sns barplot label orientation
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
Attempt 2 uses the pandas DataFrame.plot.bar, with a continuous color range, then adds a colorbar (need scalar mappable). I borrowed some techniques from medium post among others.
import matplotlib as mpl
norm = mpl.colors.Normalize(vmin=min(df.harvest), vmax=max(df.harvest), clip=True)
mapper1 = mpl.cm.ScalarMappable(norm=norm, cmap=cmap_base)
colors1 = [mapper1.to_rgba(x) for x in df.harvest]
df.plot.bar(ax=ax[1], x='fruit', y='cost', color=colors1, legend=False)
mapper1._A = []
plt.colorbar(mapper1, ax=ax[1], label='harvest')
Attempt 3 builds on this, borrowing from https://gist.github.com/jakevdp/91077b0cae40f8f8244a to facilitate a discrete colormap.
def discrete_cmap(N, base_cmap=None):
    """Create an N-bin discrete colormap from the specified input map"""
    # from https://gist.github.com/jakevdp/91077b0cae40f8f8244a
    base = plt.get_cmap(base_cmap)  # plt.cm.get_cmap is no longer available in recent matplotlib
    color_list = base(np.linspace(0, 1, N))
    cmap_name = base.name + str(N)
    return base.from_list(cmap_name, color_list, N)
cmap_disc = discrete_cmap(n_colors, cmap_base)
mapper2 = mpl.cm.ScalarMappable(norm=norm, cmap=cmap_disc)
colors2 = [mapper2.to_rgba(x) for x in df.harvest]
df.plot.bar(ax=ax[2], x='fruit', y='cost', color=colors2, legend=False)
mapper2._A = []
cb = plt.colorbar(mapper2, ax=ax[2], label='harvest')
# take the color limits from the mappable; newer Colorbar objects have no get_clim()
cb.set_ticks(np.linspace(*mapper2.get_clim(), num=n_colors+1)) # indicate color boundaries
cb.set_ticklabels(["{:.0f}".format(t) for t in cb.get_ticks()]) # without too much precision
Finally, attempt 4 gives up on showing all three dimensions in one plot and presents them in two parts.
sns.barplot(ax=ax[4], x='fruit', y='cost', data=df, color='C0')
ax[4].set_xticklabels(ax[4].get_xticklabels(), rotation=90)
sns.regplot(x='harvest', y='cost', data=df, ax=ax[5])
(1) is unusable; I'm clearly not using it as intended. (2) is OK with 10 series, but with more series it is harder to tell whether a given sample is above/below average, for instance. (3) is quite nice and scales to 50 bars OK, but it is far from "out-of-the-box" and too involved for a quick analysis. Moreover, the mapper._A = [] line seems like a hack, but the code fails without it. Perhaps the couple of lines in (4) are a better way to go.
To come back to the question again: is it possible to easily produce a bar chart that displays 3d data? I've focused on using a small number of colors for the 3rd dimension for easier identification of trends, but I'm open to other suggestions.
I've posted a solution as well, which uses a lot of custom code to achieve what I can't really believe is not built in some graphing library of python.
edit:
the following code, using R's ggplot gives a reasonable approximation to (2) with built-in commands.
ggplot(data = df, aes(x =reorder(fruit, +cost), y = cost, fill=harvest)) +
geom_bar(data=df, aes(fill=harvest), stat='identity') +
scale_fill_gradientn(colours=rev(brewer.pal(7,"RdBu")))
The first 2 lines are more or less the minimal code for barplot, and the third changes the color palette.
So if this ease were available in python I'd love to know about it!
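For comparison, attempts (2) and (3) can be condensed somewhat in plain matplotlib. The sketch below is an assumption-laden illustration rather than a built-in feature: it uses a shortened, made-up version of the fruit data, asks `plt.get_cmap` for a colormap resampled to `n_colors` levels, and passes the mapped colors straight to `ax.bar`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Hypothetical data in the spirit of the question: cost and a negatively related harvest
rng = np.random.default_rng(101)
toy = pd.DataFrame({
    "fruit": ["Lemon", "Papaya", "Cherry", "Durian", "Guava", "Jujube"],
    "cost": rng.uniform(3, 15, 6),
})
toy["harvest"] = 50 - toy["cost"] + rng.normal(0, 2.5, 6)
toy = toy.sort_values("cost")

n_colors = 5
cmap = plt.get_cmap("coolwarm", n_colors)  # colormap resampled to n_colors discrete levels
norm = Normalize(vmin=toy["harvest"].min(), vmax=toy["harvest"].max())

fig, ax = plt.subplots()
# Bar height encodes cost, bar color encodes harvest
bars = ax.bar(toy["fruit"], toy["cost"], color=cmap(norm(toy["harvest"].to_numpy())))
ax.set_ylabel("cost")
ax.tick_params(axis="x", labelrotation=90)

# A ScalarMappable carries the norm and colormap so the colorbar can be drawn
sm = ScalarMappable(norm=norm, cmap=cmap)
sm.set_array([])  # harmless on new matplotlib, needed on some older releases
fig.colorbar(sm, ax=ax, label="harvest")
```

Dropping the second argument of `plt.get_cmap` gives the continuous version from attempt (2).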
I'm posting an answer that does solve my aims of being simple at the point of use, still being useful with ~100 bars, and by leveraging the Fisher-Jenks 1d classifier from PySAL ends up handling outliers quite well (post about d3 coloring)
-- but overall is quite involved (50+ lines in the BinnedColorScaler class, posted at the bottom).
# set up the color binner
quantizer = BinnedColorScaler(df.harvest, k=5, cmap='coolwarm' )
# and plot dataframe with it.
df.plot.bar(ax=ax, x='fruit', y='cost',
color=df.harvest.map(quantizer.map_by_class))
quantizer.add_legend(ax, title='harvest') # show meaning of bins in legend
Using the following class that uses a nice 1d classifier from PySAL and borrows ideas from geoplot/geopandas libraries.
# import path as in the PySAL release used here; newer releases ship this as mapclassify.FisherJenks
from pysal.esda.mapclassify import Fisher_Jenks
class BinnedColorScaler(object):
    '''
    give this an array-like data set, a bin count, and a colormap name, and it
    - quantizes the data
    - provides a bin lookup and a color mapper that can be used by pandas for selecting artist colors
    - provides a method for a legend to display the colors and bin ranges
    '''
    def __init__(self, values, k=5, cmap='coolwarm'):
        self.base_cmap = plt.cm.get_cmap(cmap) # can be None, text, or a cmap instance
        self.bin_colors = self.base_cmap(np.linspace(0, 1, k)) # evenly-spaced colors
        # produce bins - see _discrete_colorize in geoplot.geoplot.py:2372
        self.binning = Fisher_Jenks(np.array(values), k)
        self.bin_edges = np.array([self.binning.yb.min()] + self.binning.bins.tolist())
        # some text for the legend (as per geopandas approx)
        self.categories = [
            '{0:.2f} - {1:.2f}'.format(self.bin_edges[i], self.bin_edges[i + 1])
            for i in range(len(self.bin_edges) - 1)]
    def map_by_class(self, val):
        ''' return a color for a given data value '''
        #bin_id = self.binning.find_bin(val)
        bin_id = self.find_bin(val)
        return self.bin_colors[bin_id]
    def find_bin(self, x):
        ''' unfortunately the pysal implementation seems to fail on bin edge
        cases :(. So reimplement with the way we expect here.
        '''
        # wow, subtle. just <= instead of < in the uptos
        x = np.asarray(x).flatten()
        uptos = [np.where(value <= self.binning.bins)[0] for value in x]
        # values above the last edge fall into the top bin
        bins = [v.min() if v.size > 0 else len(self.binning.bins)-1 for v in uptos] #bail upwards
        bins = np.asarray(bins)
        if len(bins) == 1:
            return bins[0]
        else:
            return bins
    def add_legend(self, ax, title=None, **kwargs):
        ''' add legend showing the discrete colors and the corresponding data range '''
        # following the geoplot._paint_hue_legend functionality, approx.
        # generate a patch for each color in the set
        artists, labels = [], []
        for i in range(len(self.bin_colors)):
            labels.append(self.categories[i])
            artists.append(mpl.lines.Line2D(
                (0,0), (1,0), mfc='none', marker='None', ls='-', lw=10,
                color=self.bin_colors[i]))
        return ax.legend(artists, labels, fancybox=True, title=title, **kwargs)
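The edge handling in `find_bin` (a value equal to an upper edge belongs to that bin, and anything above the last edge goes to the top bin) can also be expressed with `np.searchsorted`. This is a standalone sketch with made-up edges, not a drop-in replacement for the class above; the function name and the example edges are assumptions.

```python
import numpy as np

def find_bin(x, upper_edges):
    """Return, for each value, the index of the first upper edge that is >= the value."""
    upper_edges = np.asarray(upper_edges)
    # side="left" gives index i with upper_edges[i-1] < value <= upper_edges[i]
    idx = np.searchsorted(upper_edges, np.atleast_1d(x), side="left")
    # values above the last edge are pushed into the top bin
    return np.minimum(idx, len(upper_edges) - 1)

upper_edges = [10.0, 20.0, 30.0]  # hypothetical upper bin edges
bins = find_bin([5.0, 10.0, 10.5, 30.0, 35.0], upper_edges)
print(bins)  # [0 0 1 2 2]
```

The value 10.0 lands in bin 0 rather than bin 1, which is the `<=` behaviour the docstring above describes.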

Combining FacetGrid and dual Y-axis in Pandas

I am trying to plot two different variables (linked by a relation of causality), delai_jour and date_sondage on a single FacetGrid. I can do it with this code:
g = sns.FacetGrid(df_verif_sum, col="prefecture", col_wrap=2, aspect=2, sharex=True,)
g = g.map(plt.plot, "date_sondage", "delai_jour", color="m", linewidth=2)
g = g.map(plt.bar, "date_sondage", "impossible")
which gives me this:
FacetGrid
(There are 33 of them in total).
I'm interested in comparing the patterns across the various prefecture, but due to the difference in magnitude I cannot see the changes in the line chart.
For this specific work, the best way to do it is to create a secondary y axis, but I can't seem to make anything work: it doesn't look like it's possible with FacetGrid, and I didn't understand the code, nor was I able to replicate the examples I've seen with pure matplotlib.
How should I go about it?
I got this to work by iterating through the axes and plotting a secondary axis as in a typical Seaborn graph.
Using the OP example:
g = sns.FacetGrid(df_verif_sum, col="prefecture", col_wrap=2, aspect=2, sharex=True)
g = g.map(plt.plot, "date_sondage", "delai_jour", color="m", linewidth=2)
# look facets up by name so each twin axis gets the matching prefecture's data
for prefecture, ax in g.axes_dict.items():
    subdata = df_verif_sum[df_verif_sum["prefecture"] == prefecture]
    ax2 = ax.twinx()
    subdata.plot(x='date_sondage', y='impossible', ax=ax2, legend=False, color='r')
If you do any formatting to the x-axis, you may have to do it to both ax and ax2.
Here's an example where you apply a custom mapping function to the dataframe of interest. Within the function, you can call plt.gca() to get the current axis at the facet being currently plotted in FacetGrid. Once you have the axis, twinx() can be called just like you would in plain old matplotlib plotting.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
def facetgrid_two_axes(*args, **kwargs):
    data = kwargs.pop('data')
    dual_axis = kwargs.pop('dual_axis')
    alpha = kwargs.pop('alpha', 0.2)
    kwargs.pop('color')  # drop the color FacetGrid passes in; fixed colors are used below
    ax = plt.gca()
    if dual_axis:
        ax2 = ax.twinx()
        ax2.set_ylabel('Second Axis!')
    ax.plot(data['x'], data['y1'], **kwargs, color='red', alpha=alpha)
    if dual_axis:
        # plot this facet's subset, not the global df
        ax2.plot(data['x'], data['y2'], **kwargs, color='blue', alpha=alpha)
df = pd.DataFrame()
df['x'] = np.arange(1,5,1)
df['y1'] = 1 / df['x']
df['y2'] = df['x'] * 100
df['facet'] = 'foo'
df2 = df.copy()
df2['facet'] = 'bar'
df3 = pd.concat([df,df2])
win_plot = sns.FacetGrid(df3, col='facet', height=6)  # `size` was renamed to `height` in seaborn 0.9
(win_plot.map_dataframe(facetgrid_two_axes, dual_axis=True)
    .set_axis_labels("X", "First Y-axis"))
plt.show()
This isn't the prettiest plot as you might want to adjust the presence of the second y-axis' label, the spacing between plots, etc. but the code suffices to show how to plot two series of differing magnitudes within FacetGrids.

How to plot multiple linear regressions in the same figure

Given the following:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(365)
x1 = np.random.randn(50)
y1 = np.random.randn(50) * 100
x2 = np.random.randn(50)
y2 = np.random.randn(50) * 100
df1 = pd.DataFrame({'x1':x1, 'y1': y1})
df2 = pd.DataFrame({'x2':x2, 'y2': y2})
sns.lmplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None)
sns.lmplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None)
This will create 2 separate plots. How can I add the data from df2 onto the SAME graph? All the seaborn examples I have found online seem to focus on how you can create adjacent graphs (say, via the 'hue' and 'col_wrap' options). Also, I prefer not to use the dataset examples where an additional column might be present as this does not have a natural meaning in the project I am working on.
If there is a mixture of matplotlib/seaborn functions that are required to achieve this, I would be grateful if someone could help illustrate.
You could use seaborn's FacetGrid class to get desired result.
You would need to replace your plotting calls with these lines:
import matplotlib.pyplot as plt  # needed for plt.scatter below
# sns.lmplot('x1', 'y1', df1, fit_reg=True, ci = None)
# sns.lmplot('x2', 'y2', df2, fit_reg=True, ci = None)
df = pd.concat([df1.rename(columns={'x1':'x','y1':'y'})
                .join(pd.Series(['df1']*len(df1), name='df')),
                df2.rename(columns={'x2':'x','y2':'y'})
                .join(pd.Series(['df2']*len(df2), name='df'))],
               ignore_index=True)
pal = dict(df1="red", df2="blue")
g = sns.FacetGrid(df, hue='df', palette=pal, height=5);  # `height` replaces the older `size` argument
g.map(plt.scatter, "x", "y", s=50, alpha=.7, linewidth=.5, edgecolor="white")
g.map(sns.regplot, "x", "y", ci=None, robust=1)
g.add_legend();
This will yield this plot:
which, if I understand correctly, is what you need.
Note that you will need to pay attention to .regplot parameters and may want to change the values I have put as an example.
; at the end of the line is to suppress output of the command (I use ipython notebook where it's visible).
Docs give some explanation on the .map() method. In essence, it does just that: it maps a plotting command to data. However, it will work with 'low-level' plotting commands like regplot, and not lmplot, which actually calls regplot behind the scenes.
Normally plt.scatter would take the parameters c='none', edgecolor='r' to make non-filled markers. But seaborn interferes with the process and enforces a color on the markers, so I don't see an easy/straightforward way to fix this other than manipulating ax elements after seaborn has produced the plot, which is best addressed as part of a different question.
Option 1: sns.regplot
In this case, the easiest to implement solution is to use sns.regplot, which is an axes-level function, because this will not require combining df1 and df2.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# create the figure and axes
fig, ax = plt.subplots(figsize=(6, 6))
# add the plots for each dataframe
sns.regplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None, ax=ax, label='df1')
sns.regplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None, ax=ax, label='df2')
ax.set(ylabel='y', xlabel='x')
ax.legend()
plt.show()
Option 2: sns.lmplot
As per sns.FacetGrid, it is better to use figure-level functions than to use FacetGrid directly.
Combine df1 and df2 into a long format, and then use sns.lmplot with the hue parameter.
When working with seaborn, it is almost always necessary for the data to be in a long format.
It's customary to use pandas.DataFrame.stack or pandas.melt to convert DataFrames from wide to long.
For this reason, df1 and df2 must have the columns renamed, and have an additional identifying column. This allows them to be concatenated on axis=0 (the default long format), instead of axis=1 (a wide format).
There are a number of ways to combine the DataFrames:
The combination method in the answer from Primer is fine if combining a few DataFrames.
However, a function, as shown below, is better for combining many DataFrames.
def fix_df(data: pd.DataFrame, name: str) -> pd.DataFrame:
    """rename columns and add a column"""
    # rename columns to a common name
    data.columns = ['x', 'y']
    # add an identifying value to use with hue
    data['df'] = name
    return data
# create a list of the dataframes
df_list = [df1, df2]
# update the dataframes by calling the function in a list comprehension
df_update_list = [fix_df(v, f'df{i}') for i, v in enumerate(df_list, 1)]
# combine the dataframes
df = pd.concat(df_update_list).reset_index(drop=True)
# plot the dataframe
sns.lmplot(data=df, x='x', y='y', hue='df', ci=None)
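Note that fix_df renames the columns of df1 and df2 in place. If the original frames should stay untouched, a non-mutating variant along these lines could be used instead. This is a sketch: the helper name `to_long` is hypothetical, and the data is regenerated here only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(365)
df1 = pd.DataFrame({"x1": rng.normal(size=50), "y1": rng.normal(size=50) * 100})
df2 = pd.DataFrame({"x2": rng.normal(size=50), "y2": rng.normal(size=50) * 100})

def to_long(frames: dict) -> pd.DataFrame:
    """Stack two-column frames into one long frame with common x/y names and an id column."""
    parts = [
        # set_axis and assign both return new objects, so the inputs are left as they were
        frame.set_axis(["x", "y"], axis=1).assign(df=name)
        for name, frame in frames.items()
    ]
    return pd.concat(parts, ignore_index=True)

long_df = to_long({"df1": df1, "df2": df2})
print(long_df.head())
```

The result can be passed to sns.lmplot(data=long_df, x='x', y='y', hue='df', ci=None) exactly as above.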
Notes
Package versions used for this answer:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Plotting percentage in seaborn bar plot

For a dataframe
import pandas as pd
df=pd.DataFrame({'group':list("AADABCBCCCD"),'Values':[1,0,1,0,1,0,0,1,0,1,0]})
I am trying to plot a barplot showing percentage of times A, B, C, D takes zero (or one).
I have a roundabout way which works, but I am thinking there has to be a more straightforward way
tempdf=df.groupby(['group','Values']).Values.count().unstack().fillna(0)
tempdf['total']=df['group'].value_counts()
tempdf['percent']=tempdf[0]/tempdf['total']*100
tempdf.reset_index(inplace=True)
print(tempdf)
sns.barplot(x='group',y='percent',data=tempdf)
If I were plotting just the mean value, I could simply call sns.barplot on the df dataframe rather than tempdf. I am not sure how to do it elegantly if I am interested in plotting percentages.
Thanks,
You can use Pandas in conjunction with seaborn to make this easier:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("tips")
x, y, hue = "day", "proportion", "sex"
hue_order = ["Male", "Female"]
(df[x]
.groupby(df[hue])
.value_counts(normalize=True)
.rename(y)
.reset_index()
.pipe((sns.barplot, "data"), x=x, y=y, hue=hue))
You could use your own function in sns.barplot estimator, as from docs:
estimator : callable that maps vector -> scalar, optional
Statistical function to estimate within each categorical bin.
For your case you could define the function as a lambda:
sns.barplot(x='group', y='Values', data=df, estimator=lambda x: sum(x==0)*100.0/len(x))
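To check what that estimator computes, the same percentages can be worked out with pandas alone on the question's data. A small sketch (the names `pct_zero` and `plot_df` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("AADABCBCCCD"),
    "Values": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a boolean column is the fraction of True values, so this is the
# share of zeros per group, expressed as a percentage
pct_zero = df["Values"].eq(0).groupby(df["group"]).mean().mul(100).rename("percent")
plot_df = pct_zero.reset_index()  # columns: group, percent
print(plot_df)
```

Group A has two zeros out of three rows (about 66.7%), and B, C and D each come out at 50%. The resulting plot_df could be handed to sns.barplot(x='group', y='percent', data=plot_df).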
You can follow these steps so that you can see the count and percentages on top of the bars in your plot. Check the example outputs down below
with_hue function will plot percentages on the bar graphs if you have the 'hue' parameter in your plots. It takes the actual graph, feature, Number_of_categories in feature, and hue_categories(number of categories in hue feature) as a parameter.
without_hue function will plot percentages on the bar graphs if you have a normal plot. It takes the actual graph and feature as a parameter.
def with_hue(plot, feature, Number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(Number_of_categories):
        total = feature.value_counts().values[i]
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
            x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
            y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height()
            # annotate the axes that was passed in rather than a global ax
            plot.annotate(percentage, (x, y), size = 12)
    plt.show()
def without_hue(plot, feature):
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x, y), size = 12)
    plt.show()
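On matplotlib 3.4 or newer, Axes.bar_label can place such annotations without computing positions by hand. The following is a sketch on the question's data, assuming the percentages have already been computed per group; it is an alternative to, not a rewrite of, the helper functions above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "group": list("AADABCBCCCD"),
    "Values": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})
# percentage of zeros per group
pct_zero = df["Values"].eq(0).groupby(df["group"]).mean().mul(100)

fig, ax = plt.subplots()
container = ax.bar(list(pct_zero.index), pct_zero.to_numpy())
# one text label per bar, placed just above the bar top
texts = ax.bar_label(container, labels=[f"{v:.1f}%" for v in pct_zero], padding=2)
ax.set_ylabel("percent of zeros")
```

With seaborn bar plots, the bar containers are typically available as ax.containers, so the same call can be applied to each of them.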
You can use the library Dexplot, which has the ability to return relative frequencies for categorical variables. It has a similar API to Seaborn. Pass the column you would like to get the relative frequency for to the count function. If you would like to subdivide this by another column, do so with the split parameter. The following returns raw counts.
import dexplot as dxp
dxp.count('group', data=df, split='Values')
To get the relative frequencies, set the normalize parameter to the column you want to normalize over. Use True to normalize over the overall total count.
dxp.count('group', data=df, split='Values', normalize='group')
Normalizing over the 'Values' column would produce the following graph, where the total of all the '0' bars are 1.
dxp.count('group', data=df, split='Values', normalize='Values')
