Creating box plots by looping multiple columns - python

I am trying to create multiple box plot charts for about 5 columns in my dataframe (df_summ):
columns = ['dimension_a','dimension_b']
for i in columns:
    sns.set(style = "ticks", palette = "pastel")
    box_plot = sns.boxplot(y="measure", x=i,
                           palette=["m","g"],
                           data=df_summ_1500_delta)
    sns.despine(offset=10, trim=True)
    medians = df_summ_1500_delta.groupby([i])['measure'].median()
    vertical_offset=df_summ_1500_delta['measure'].median()*-0.5
    for xtick in box_plot.get_xticks():
        box_plot.text(xtick,medians[xtick] + vertical_offset,medians[xtick],
                      horizontalalignment='center',size='small',color='blue',weight='semibold')
My only issue is that the plots aren't separated into different facets; they are drawn on top of each other.
Any help would be appreciated on how I can put each on its own chart, with the x-axis of the first chart being 'dimension_a' and the x-axis of the second chart being 'dimension_b'.

To draw two boxplots next to each other at each x-position, you can use dimension_a and dimension_b as the hue. These two columns first need to be transformed (with pd.melt()) to "long form".
Here is some example code, starting from generated test data. Note that the order of both the x-values and the hue-values needs to be enforced to be sure of their exact position. The individual box plots at each x-position are distributed over a total width of 0.8.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.DataFrame({'dimension_a': np.random.choice(['hot', 'cold'], 100),
                   'dimension_b': np.random.choice(['hot', 'cold'], 100),
                   'measure': np.random.uniform(100, 500, 100)})
df.loc[df['dimension_a'] == 'hot', 'measure'] += 100
df.loc[df['dimension_a'] == 'cold', 'measure'] -= 100
x_order = ['hot', 'cold']
columns = ['dimension_a', 'dimension_b']
df1 = df.melt(value_vars=columns, var_name='dimension', value_name='value', id_vars='measure')
sns.set(style="ticks", palette="pastel")
ax = sns.boxplot(data=df1, x='value', order=x_order, y='measure',
                 hue='dimension', hue_order=columns, palette=["m", "g"], dodge=True)
ax.set_xlabel('')
sns.despine(offset=10, trim=True)
# one dodge position per hue level (i.e. per column), spread over the total width of 0.8
for col, dodge_dist in zip(columns, np.linspace(-0.4, 0.4, 2 * len(columns) + 1)[1::2]):
    medians = df.groupby([col])['measure'].median()
    vertical_offset = df['measure'].median() * -0.5
    for x_ind, xtick in enumerate(x_order):
        ax.text(x_ind + dodge_dist, medians[xtick] + vertical_offset, f'{medians[xtick]:.2f}',
                horizontalalignment='center', size='small', color='blue', weight='semibold')
plt.show()
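The question also asks for each column on its own chart. A minimal sketch of that variant, assuming one matplotlib subplot per column is acceptable. The random test data, the fixed seed and the label formatting are illustrative choices, not part of the original answer:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical test data, shaped like the example above
rng = np.random.default_rng(0)
df = pd.DataFrame({'dimension_a': rng.choice(['hot', 'cold'], 100),
                   'dimension_b': rng.choice(['hot', 'cold'], 100),
                   'measure': rng.uniform(100, 500, 100)})
columns = ['dimension_a', 'dimension_b']
x_order = ['hot', 'cold']

# One Axes per column, side by side, sharing the y-axis
fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4), sharey=True)
for ax, col in zip(axes, columns):
    # ax=ax draws into this subplot instead of the current Axes
    sns.boxplot(data=df, x=col, y='measure', order=x_order, ax=ax)
    medians = df.groupby(col)['measure'].median()
    for x_ind, level in enumerate(x_order):
        # write each median just above its median line
        ax.text(x_ind, medians[level], f'{medians[level]:.1f}',
                ha='center', va='bottom', size='small', color='blue', weight='semibold')
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Passing ax= to sns.boxplot is what keeps each plot on its own Axes rather than stacking them in the same one.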

Related

seaborn jointplot prints partial legend

I'm getting something weird with the legend in a seaborn jointplot. I want to plot some quantity y as a function of a quantity x for 8 different datasets. Each dataset has only two columns, for x and y, and a different number of rows. First of all, I concatenate all rows of all datasets using numpy:
y = np.concatenate(((data1[:,1]), (data2[:,1]), (data3[:,1]), (data4[:,1]),(data5[:,1]), (data6[:,1]), (data7[:,1]), (data8[:,1])), axis=0)
x = np.concatenate(((data1[:,0]), (data2[:,0]), (data3[:,0]), (data4[:,0]), (data5[:,0]), (data6[:,0]), (data7[:,0]), (data8[:,0])), axis=0)
Then I create the array of values which I will use for the "hue" parameter in the jointplot, which will distinguish the datasets in the legend and colors. I do this by assigning each dataset a number from 1 to 8, which is repeated for every row of the cumulative dataset:
indexes = np.concatenate((np.ones(len(data1[:,0])), 2*np.ones(len(data2[:,0])), 3*np.ones(len(data3[:,0])), 4*np.ones(len(data4[:,0])), 5*np.ones(len(data5[:,0])), 6*np.ones(len(data6[:,0])), 7*np.ones(len(data7[:,0])), 8*np.ones(len(data8[:,0]))), axis=0)
Then I create the dataset:
all_together = np.column_stack((x, y, indexes))
df = pd.DataFrame(all_together, columns = ['x','y','Dataset'])
So now I can create the jointplot. This is simply done by:
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo')
handles, labels = g.ax_joint.get_legend_handles_labels()
g.ax_joint.legend(handles=handles, labels=['data1', 'data2', 'data3', 'data4', 'data5', 'data6', 'data7', 'data8'], fontsize=10)
At this point, the problem is: all points are getting plotted (at least I think), but the legend only shows: data1, data2, data3, data4 and data5. I don't understand why it is not showing also the other three labels, and in this way the plot is difficult to read. I have checked and the cumulative dataset df has the correct shape. Any ideas?
You can add legend='full' to obtain a full legend. By default, sns.jointplot uses sns.scatterplot for the central plot. The keyword parameters which aren't used by jointplot are sent to scatterplot. The legend parameter can be "auto", "brief", "full", or False.
From the docs:
If “brief”, numeric hue and size variables will be represented with a sample of evenly spaced values. If “full”, every group will get an entry in the legend. If “auto”, choose between brief or full representation based on number of levels. If False, no legend data is added and no legend is drawn.
The following code is tested with seaborn 0.11.2:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
N = 200
k = np.repeat(np.arange(1, 9), N // 8)
df = pd.DataFrame({'x': 5 * np.cos(2 * k * np.pi / 8) + np.random.randn(N),
                   'y': 5 * np.sin(2 * k * np.pi / 8) + np.random.randn(N),
                   'Dataset': k})
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo', legend='full')
plt.show()

How to extend a matplotlib axis if the ticks are labels and not numeric?

I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df = pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index('period')
fig, ax = plt.subplots(1,2)
sns.lineplot(data=df, ax=ax[0])
df_idx = df.index
df2 = df.set_index(np.arange(1, len(df_idx) + 1))
sns.lineplot(data=df2, ax=ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]:
left_buffer, right_buffer = 3, 2
labels = ['Q1','Q2','Q3','Q4']
extended_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extended_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extended_labels)
xticks = ax[0].xaxis.get_major_ticks()
# hide the tick marks of the empty buffer positions
for ind in aux_range[0:left_buffer]:
    xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]:
    xticks[ind].tick1line.set_visible(False)
Here, left_buffer and right_buffer are the numbers of empty tick positions to add to the left and to the right, respectively.
I may have actually found a simpler solution: I can draw a transparent line (alpha=0) by plotting x = the index of the dataframe, i.e. all the labels, including those for which all values are nans, and y = the average value of the dataframe, so as to be sure it's within the range:
sns.lineplot(x=df.index, y=np.ones(df.shape[0]) * df.mean().mean(), ax=ax[0], alpha=0)
This assumes the scale of the y-axis has not been changed manually; a more robust way of doing it is to use the centre of the current y-limits:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x=df.index, y=np.ones(df.shape[0]) * y_centre, ax=ax[0], alpha=0)
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
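The same idea can be sketched with plain matplotlib. This is only an illustration, assuming the x-axis is categorical (string labels); the data values are copied from the question and the variable names are arbitrary:

```python
import matplotlib.pyplot as plt
import numpy as np

periods = ['Q1', 'Q2', 'Q3', 'Q4']
a = [3, 4, 5, np.nan]   # no data for Q4
b = [4, 4, 6, np.nan]

fig, ax = plt.subplots()
ax.plot(periods, a, label='a')
ax.plot(periods, b, label='b')

# Invisible helper line: it has finite values at every period,
# so autoscaling has to include Q4 in the x-range.
y_centre = np.mean(ax.get_ylim())
ax.plot(periods, np.full(len(periods), y_centre), alpha=0)
ax.legend()
```

Because the helper line sits at the centre of the existing y-limits, it should not change the vertical range of the plot.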

How to plot multiple seaborn.distplot in a single figure

I want to plot multiple seaborn distplots in the same window, where each plot has the same x and y grid. My attempt, which does not work, is shown below.
# function to plot the density curve of the 200 Median Stn. MC-losses
def make_density(stat_list,color, layer_num):
    num_subplots = len(stat_list)
    ncols = 3
    nrows = (num_subplots + ncols - 1) // ncols
    fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(ncols * 6, nrows * 5))
    for i in range(len(stat_list)):
        # Plot formatting
        plt.title('Layer ' + layer_num)
        plt.xlabel('Median Stn. MC-Loss')
        plt.ylabel('Density')
        plt.xlim(-0.2,0.05)
        plt.ylim(0, 85)
        min_ylim, max_ylim = plt.ylim()
        # Draw the density plot.
        sns.distplot(stat_list, hist = True, kde = True,
                     kde_kws = {'linewidth': 2}, color=color)
# `stat_list` is a list of 6 lists
# I want to draw histogram and density plot of
# each of these 6 lists contained in `stat_list` in a single window,
# where each row containing the histograms and densities of the 3 plots
# so in my example, there would be 2 rows of 3 columns of plots (2 x 3 =6).
stat_list = [[0.3,0.5,0.7,0.3,0.5],[0.2,0.1,0.9,0.7,0.4],[0.9,0.8,0.7,0.6,0.5],
             [0.2,0.6,0.75,0.87,0.91],[0.2,0.3,0.8,0.9,0.3],[0.2,0.3,0.8,0.87,0.92]]
How can I modify my function to draw multiple distplots in the same window, where the x and y grid for each displayed plot is identical?
Thank you,
PS: As an aside, I want the 6 distplots to have the same color, preferably green.
The easiest method is to load the data into pandas and then use seaborn.displot.
.displot replaces .distplot in seaborn version 0.11.0
Technically, what you would have wanted before is a FacetGrid mapped with distplot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# data
stat_list = [[0.3,0.5,0.7,0.3,0.5], [0.2,0.1,0.9,0.7,0.4], [0.9,0.8,0.7,0.6,0.5], [0.2,0.6,0.75,0.87,0.91], [0.2,0.3,0.8,0.9,0.3], [0.2,0.3,0.8,0.87,0.92]]
# load the data into pandas and then transpose it for the correct column data
df = pd.DataFrame(stat_list).T
# name the columns; specify a layer number
df.columns = ['A', 'B', 'C', 'D', 'E', 'F']
# now stack the data into a long (tidy) format
dfl = df.stack().reset_index(level=1).rename(columns={'level_1': 'Layer', 0: 'Median Stn. MC-Loss'})
# plot a displot
g = sns.displot(data=dfl, x='Median Stn. MC-Loss', col='Layer', col_wrap=3, kde=True, color='green')
g.set_axis_labels(y_var='Density')
g.set(xlim=(0, 1.0), ylim=(0, 3.0))
sns.FacetGrid and sns.distplot
.distplot is deprecated
p = sns.FacetGrid(data=dfl, col='Layer', col_wrap=3, height=5)
p.map(sns.distplot, 'Median Stn. MC-Loss', bins=5, kde=True, color='green')
p.set(xlim=(0, 1.0))

Legends disappear when {"hist":False} in seaborn distplot

I have the following function:
Say hue="animals" has three categories (dog, bird, horse) and we have two dataframes, df_m and df_f, consisting of data on male animals and female animals only, respectively.
The function plots three distplots of y (e.g. y="weight"), one for each hue category in {dog, bird, horse}. In each subplot we plot df_m[y] and df_f[y], so that I can compare the weight of male dogs/female dogs, male birds/female birds, and male horses/female horses.
If I set distkwargs={"hist":False} when calling the function, the legends ["F","M"] disappear for some reason. Having distkwargs={"hist":True} shows the legends.
def plot_multi_kde_cat(self,dfs,y,hue,subkwargs={},distkwargs={},legends=[]):
    """
    Create a subplot multi_kde with categories in the same plot
    dfs: List
        - DataFrames for each category e.g one for male and one for females
    hue: string
        - column for which each category is plotted (in each subplot)
    """
    hues = dfs[0][hue].cat.categories
    if len(hues)==2: #Only two categories
        fig,axes = plt.subplots(1,2,**subkwargs) #Get axes and flatten them
        axes=axes.flatten()
        for ax,hu in zip(axes,hues):
            for df in dfs:
                sns.distplot(df.loc[df[hue]==hu,y],ax=ax,**distkwargs)
            ax.set_title(f"Segment: {hu}")
            ax.legend(legends)
    else: #More than two categories: create a square grid and remove unused axes
        n_rows = int(np.ceil(np.sqrt(len(hues)))) #number of rows
        fig,axes = plt.subplots(n_rows,n_rows,**subkwargs)
        axes = axes.flatten()
        for ax,hu in zip(axes,hues):
            for df in dfs:
                sns.distplot(df.loc[df[hue]==hu,y],ax=ax,**distkwargs)
            ax.set_title(f"Segment: {hu}")
            ax.legend(legends)
        n_remove = len(axes)-len(hues) #number of axes to remove
        if n_remove>0:
            for ax in axes[-n_remove:]:
                ax.set_visible(False)
    fig.tight_layout()
    return fig,axes
You can work around the problem by explicitly providing the label to the distplot. This forces a legend entry for each distplot. ax.legend() then already gets the correct labels.
Here is some minimal sample code to illustrate how everything works together:
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
def plot_multi_kde_cat(dfs, y, hue, subkwargs={}, distkwargs={}, legends=[]):
    hues = np.unique(dfs[0][hue])
    fig, axes = plt.subplots(1, len(hues), **subkwargs)
    axes = axes.flatten()
    for ax, hu in zip(axes, hues):
        for df, legend_label in zip(dfs, legends):
            sns.distplot(df.loc[df[hue] == hu, y], ax=ax, label=legend_label, **distkwargs)
        ax.set_title(f"Segment: {hu}")
        ax.legend()
N = 20
df_m = pd.DataFrame({'animal': np.random.choice(['tiger', 'horse'], N), 'weight': np.random.uniform(100, 200, N)})
df_f = pd.DataFrame({'animal': np.random.choice(['tiger', 'horse'], N), 'weight': np.random.uniform(80, 160, N)})
plot_multi_kde_cat([df_m, df_f], 'weight', 'animal',
                   subkwargs={}, distkwargs={'hist': False}, legends=['male', 'female'])
plt.show()

How to add median value labels to a Seaborn boxplot using the hue argument

In addition to the solution posted in this link, I would also like to use the hue parameter and add the median values to each of the boxes.
The Current Code:
testPlot = sns.boxplot(x='Pclass', y='Age', hue='Sex', data=trainData)
m1 = trainData.groupby(['Pclass', 'Sex'])['Age'].median().values
mL1 = [str(np.round(s, 2)) for s in m1]
p1 = range(len(m1))
for tick, label in zip(p1, testPlot.get_xticklabels()):
    print(testPlot.text(p1[tick], m1[tick] + 1, mL1[tick]))
This gives an output like:
I'm working on the Titanic dataset, which can be found in this link.
I'm getting the required values, but only when I use a print statement. How do I include them in my plot?
Place your labels manually, according to the hue parameter and the width of the boxes for every category, while looping over all xticklabels:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
trainData = pd.read_csv('titanic.csv')
testPlot = sns.boxplot(x='pclass', y='age', hue='sex', data=trainData)
m1 = trainData.groupby(['pclass', 'sex'])['age'].median().values
mL1 = [str(np.round(s, 2)) for s in m1]
ind = 0
for tick in range(len(testPlot.get_xticklabels())):
    testPlot.text(tick-.2, m1[ind+1]+1, mL1[ind+1], horizontalalignment='center', color='w', weight='semibold')
    testPlot.text(tick+.2, m1[ind]+1, mL1[ind], horizontalalignment='center', color='w', weight='semibold')
    ind += 2
plt.show()
This answer is nearly copied and pasted from here, but adapted to your example code. The linked answer is IMHO a bit misplaced there, because that question is just about labeling a boxplot and not about a boxplot using the hue argument.
I couldn't use your Train dataset because it is not available as Python package. So I used Titanic instead which has nearly the same column names.
#!/usr/bin/env python3
import pandas as pd
import matplotlib
import matplotlib.patheffects as path_effects
import seaborn as sns
def add_median_labels(ax, fmt='.1f'):
    """Credits: https://stackoverflow.com/a/63295846/4865723
    """
    lines = ax.get_lines()
    boxes = [c for c in ax.get_children() if type(c).__name__ == 'PathPatch']
    lines_per_box = int(len(lines) / len(boxes))
    for median in lines[4:len(lines):lines_per_box]:
        x, y = (data.mean() for data in median.get_data())
        # choose value depending on horizontal or vertical plot orientation
        value = x if (median.get_xdata()[1] - median.get_xdata()[0]) == 0 else y
        text = ax.text(x, y, f'{value:{fmt}}', ha='center', va='center',
                       fontweight='bold', color='white')
        # create median-colored border around white text for contrast
        text.set_path_effects([
            path_effects.Stroke(linewidth=3, foreground=median.get_color()),
            path_effects.Normal(),
        ])
df = sns.load_dataset('titanic')
plot = sns.boxplot(x='pclass', y='age', hue='sex', data=df)
add_median_labels(plot)
plot.figure.show()
As an alternative, you can create your boxplot with a figure-level function such as catplot. In that case you need to pass the Axes object to add_median_labels().
# imports and add_median_labels() unchanged
df = sns.load_dataset('titanic')
plot = sns.catplot(kind='box', x='pclass', y='age', hue='sex', data=df)
add_median_labels(plot.axes[0][0])
plot.figure.show()
The resulting plot
This solution also works with more than two categories in the column used for the hue argument.
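The label positions can also be computed from the data instead of being read from the plot artists. The sketch below generalizes the ±0.2 offsets of the first answer to any number of hue levels, assuming seaborn's default total box width of 0.8 and an explicitly fixed order and hue_order. The helper name dodge_offsets and the synthetic data are hypothetical, introduced only for this example:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def dodge_offsets(n_hue, width=0.8):
    """Centre offsets of n_hue dodged boxes that share a total width."""
    return (np.arange(n_hue) - (n_hue - 1) / 2) * width / n_hue

# Synthetic stand-in for the Titanic columns used above
rng = np.random.default_rng(0)
df = pd.DataFrame({'pclass': np.repeat([1, 2, 3], 40),
                   'sex': np.tile(['male', 'female'], 60),
                   'age': rng.uniform(1, 70, 120)})
order = [1, 2, 3]
hue_order = ['male', 'female']

ax = sns.boxplot(data=df, x='pclass', y='age', hue='sex',
                 order=order, hue_order=hue_order)
medians = df.groupby(['pclass', 'sex'])['age'].median()
for x_ind, pclass in enumerate(order):
    for offset, sex in zip(dodge_offsets(len(hue_order)), hue_order):
        m = medians.loc[(pclass, sex)]
        # place each label slightly above its median line
        ax.text(x_ind + offset, m + 1, f'{m:.1f}',
                ha='center', size='small', weight='semibold')
```

Fixing order and hue_order matters here: the offsets only line up with the boxes if the plotting order is known in advance.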
