Side-by-side boxplots with Pandas - python

I need to plot a comparison of five variables stored in a pandas DataFrame. I used an example from here and it worked, but now I need to change the axes and titles, and I'm struggling to do so.
Here is my data:
df1.groupby('cls').head()
Out[171]:
sensitivity specificity accuracy ppv auc cls
0 0.772091 0.824487 0.802966 0.799290 0.863700 sig
1 0.748931 0.817238 0.776366 0.785910 0.859041 sig
2 0.774016 0.805909 0.801975 0.789840 0.853132 sig
3 0.826670 0.730071 0.795715 0.784150 0.850024 sig
4 0.781112 0.803839 0.824709 0.791530 0.863411 sig
0 0.619048 0.748290 0.694969 0.686138 0.713899 baseline
1 0.642348 0.702076 0.646216 0.674683 0.712632 baseline
2 0.567344 0.765410 0.710650 0.665614 0.682502 baseline
3 0.644046 0.733645 0.754621 0.683485 0.734299 baseline
4 0.710077 0.653871 0.707933 0.684313 0.732997 baseline
Here is my code:
>> fig, axes = plt.subplots(ncols=5, figsize=(12, 5), sharey=True)
>> df1.query("cls in ['sig', 'baseline']").boxplot(by='cls', return_type='axes', ax=axes)
And the resulting pictures are:
How to:
change the title ('Boxplot grouped by cls')
get rid of the annoying [cls] plotted along the horizontal axis
reorder the plotted categories as they appear in df1? (first sensitivity, followed by speci...)

I suggest using seaborn
Here is an example that might help you:
Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
Make data
data = {'sensitivity' : np.random.normal(loc = 0, size = 10),
'specificity' : np.random.normal(loc = 0, size = 10),
'accuracy' : np.random.normal(loc = 0, size = 10),
'ppv' : np.random.normal(loc = 0, size = 10),
'auc' : np.random.normal(loc = 0, size = 10),
'cls' : ['sig', 'sig', 'sig', 'sig', 'sig', 'baseline', 'baseline', 'baseline', 'baseline', 'baseline']}
df = pd.DataFrame(data)
df
Seaborn has a nifty tool called factorplot (renamed catplot in seaborn 0.9) that creates a grid of subplots where the rows/cols are built from your data. To be able to do this, we need to "melt" the df into a long, more usable shape.
df_melt = df.melt(id_vars = 'cls',
value_vars = ['accuracy',
'auc',
'ppv',
'sensitivity',
'specificity'],
var_name = 'columns')
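To make the reshaping concrete, here is a minimal sketch on a tiny hypothetical frame (the name `toy` and its values are made up for illustration and are not taken from the question's data). Each metric column is turned into rows of a single 'value' column, labelled by the 'columns' column.

```python
import pandas as pd

# Hypothetical two-row frame with two metric columns (illustrative values only)
toy = pd.DataFrame({'sensitivity': [0.77, 0.62],
                    'auc': [0.86, 0.71],
                    'cls': ['sig', 'baseline']})

# Wide -> long: one row per (cls, metric) pair
toy_melt = toy.melt(id_vars='cls',
                    value_vars=['sensitivity', 'auc'],
                    var_name='columns')

print(toy_melt.shape)          # 2 rows x 2 metrics -> 4 rows, 3 columns
print(list(toy_melt.columns))  # 'cls', 'columns', 'value'
```

The long frame is what lets seaborn use 'columns' as a facet or x variable and 'cls' as the grouping.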
Now we can create the factorplot using the col "columns".
a = sns.catplot(data = df_melt, # factorplot was renamed catplot in seaborn 0.9
x = 'cls',
y = 'value',
kind = 'box', # type of plot
col = 'columns',
col_order = ['sensitivity', # custom order of boxplots
'specificity',
'accuracy',
'ppv',
'auc']).set_titles('{col_name}') # remove 'column = ' part of title
plt.show()
You can also just use Seaborn's boxplot.
b = sns.boxplot(data = df_melt,
hue = 'cls', # different colors for different 'cls'
x = 'columns',
y = 'value',
order = ['sensitivity', # custom order of boxplots
'specificity',
'accuracy',
'ppv',
'auc'])
plt.title('Boxplot grouped by cls') # You can change the title here (sns.plt is gone in newer seaborn)
plt.show()
This will give you the same plot but all in one figure instead of subplots. It also allows you to change the title of the figure with one line. Unfortunately I can't find a way to remove the 'columns' subtitle but hopefully this will get you what you need.
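One possible way to drop that 'columns' label is to clear the x-axis label on the Axes that sns.boxplot returns. The sketch below uses hypothetical stand-in data (`demo`, `demo_melt`) in the same long format, so the names and values are assumptions rather than the asker's frame.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data, melted into the same long format as df_melt
rng = np.random.default_rng(0)
demo = pd.DataFrame({'cls': np.repeat(['sig', 'baseline'], 10),
                     'sensitivity': rng.normal(size=20),
                     'auc': rng.normal(size=20)})
demo_melt = demo.melt(id_vars='cls', var_name='columns')

ax = sns.boxplot(data=demo_melt, x='columns', y='value', hue='cls',
                 order=['sensitivity', 'auc'])
ax.set_title('Boxplot grouped by cls')  # title set on the Axes itself
ax.set_xlabel('')                       # clears the 'columns' axis label
```

Call plt.show() (after importing matplotlib.pyplot as plt) to display the figure.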
EDIT
To view the plots sideways:
Factorplot
Swap your x and y values, change col = 'columns' to row = 'columns', change col_order = [...] to row_order = [...], and change '{col_name}' to '{row_name}' like so
a1 = sns.catplot(data = df_melt, # factorplot was renamed catplot in seaborn 0.9
x = 'value',
y = 'cls',
kind = 'box', # type of plot
row = 'columns',
row_order = ['sensitivity', # custom order of boxplots
'specificity',
'accuracy',
'ppv',
'auc']).set_titles('{row_name}') # remove 'column = ' part of title
plt.show()
Boxplot
Swap your x and y values then add the parameter orient = 'h' like so
b1 = sns.boxplot(data = df_melt,
hue = 'cls',
x = 'value',
y = 'columns',
order = ['sensitivity', # custom order of boxplots
'specificity',
'accuracy',
'ppv',
'auc'],
orient = 'h')
plt.title('Boxplot grouped by cls') # sns.plt is gone in newer seaborn; use plt directly
plt.show()

Maybe this helps you:
fig, axes = pyplot.subplots(ncols=4, figsize=(12, 5), sharey=True)
df.query("E in [1, 2]").boxplot(by='E', return_type='axes', ax=axes, column=list('bcda')) # Keeping original columns order
pyplot.suptitle('Boxplot') # Changing title
[ax.set_xlabel('') for ax in axes] # Removing the x-axis label (the 'by' column name) from all plots
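Applied to the column names from the question, the same idea might look like the sketch below. The frame `demo` is a hypothetical stand-in for df1 with random values, and the title text is only an example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df1: five metric columns plus 'cls'
rng = np.random.default_rng(0)
cols = ['sensitivity', 'specificity', 'accuracy', 'ppv', 'auc']
demo = pd.DataFrame(rng.uniform(0.6, 0.9, size=(10, 5)), columns=cols)
demo['cls'] = ['sig'] * 5 + ['baseline'] * 5

fig, axes = plt.subplots(ncols=len(cols), figsize=(12, 5), sharey=True)
# column= fixes the left-to-right order of the panels
demo.boxplot(by='cls', column=cols, ax=axes)
fig.suptitle('Sig vs. baseline')   # replaces 'Boxplot grouped by cls'
for ax in axes:
    ax.set_xlabel('')              # removes the 'cls' label under each panel
```

pandas may warn that sharex/sharey are ignored when several axes are passed in; the sharing set up by plt.subplots still applies.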

Related

Is there a way to get a composite bar graph (aka stacked bar graph)?

I have the following code and would like to get a graph like the one labeled 'want', but I am instead getting one where there is overlapping in color. I believe pandas may have a built in graph like the one I am looking for, but maybe this graph I am generating could do the same.
UPDATE:
I was able to get the graph working; now I need the colors not to repeat. Colors are repeated across the 'State' entries (e.g. Delhi). The code has been updated to reflect the changes.
code:
data = Table.read_table('IndiaStatus.csv')#.drop('Discharged', 'Discharge Ratio (%)','Total Cases','Active','Deaths')
data2 = data.to_df()
cols = list(data2.columns)
cols.remove('State/UTs')
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    data2[col_zscore] = (data2[col] - data2[col].mean())/data2[col].std(ddof=0)
print(data2)
data2.info()
data2["outlier"] = (abs(data2["Total Cases_zscore"])>1).astype(int)
print(data2)
delete_row = data2[data2["outlier"]== 1].index
data2 = data2.drop(delete_row)
print(data2)
data2["outlier2"] = ((data2["Active_zscore"])> 0.00).astype(int)
delete_row = data2[data2["outlier2"]== 1].index
data2 = data2.drop(delete_row)
'''
#Analyzing and removing outliers for Total Cases_zscore
sns.distplot(data2["Active_zscore"], kde = False, bins = 30)
g = sns.jointplot(x='Active_zscore', y='Active_zscore',
data=data2, hue='State/UTs')
plt.subplots_adjust(right=0.75)
g.ax_joint.legend(bbox_to_anchor=(1.25,1), loc='upper left', borderaxespad=0)
'''
print(data2)
print(data2.mean())
print(data2.std())
#data2.insert(1, column = "Level", value = np.where(data2["Active"] > 9700, "Severe", data["Active"] < 9700 & data["Active"] > 4850, 'Less_Severe','Not_Severe'))
col = 'Active'
conditions = [ data2['Active']<=600, data2['Active']<= 1200, data2['Active'] >1200 ]
choices = [ 'Not_Severe','Less_Severe',"Severe" ]
data2["Level"] = np.select(conditions, choices, default=np.nan)
print(data2)
ax=data2.pivot_table(index='Level', columns = 'State/UTs', values = 'Total Cases').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
#set ylim
#plt.ylim(-1, 20,5)
#plt.xlim(-1,4,8)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=8,loc=(1.0,0.2))
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
#setlabels
ax.set_xlabel('Level',fontsize=20,color='r')
ax.set_ylabel('Total Cases',fontsize=20,color='r')
#rotation
plt.xticks(rotation=0)
Want:
Actual Output:
UPDATE:
Make the graph background white using
sns.set_style("whitegrid")
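On the repeating colors: one thing that might help is passing a colormap with more distinct entries than the default color cycle. The sketch below is an assumption-laden illustration, not a tested fix for the asker's file: the frame `demo` is made up, and 'tab20' is picked only because it has 20 colors, enough for the 15 hypothetical states.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the IndiaStatus data (names and numbers are made up)
states = [f'State {i}' for i in range(15)]
levels = ['Not_Severe', 'Less_Severe', 'Severe']
demo = pd.DataFrame({'State/UTs': states,
                     'Level': np.resize(levels, len(states)),
                     'Total Cases': np.arange(100, 100 + 50 * len(states), 50)})

pivot = demo.pivot_table(index='Level', columns='State/UTs', values='Total Cases')
# 'tab20' offers 20 distinct colors, enough for the 15 columns here
ax = pivot.plot(kind='bar', stacked=True, colormap='tab20', figsize=(8, 6))
ax.legend(fontsize=8, loc=(1.0, 0.2))
```

With more than 20 states a larger or continuous colormap would be needed, and neighbouring colors become harder to tell apart.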

Creating box plots by looping multiple columns

I am trying to create multiple box plot charts for about 5 columns in my dataframe (df_summ):
columns = ['dimension_a','dimension_b']
for i in columns:
    sns.set(style = "ticks", palette = "pastel")
    box_plot = sns.boxplot(y="measure", x=i,
                           palette=["m","g"],
                           data=df_summ_1500_delta)
    sns.despine(offset=10, trim=True)
    medians = df_summ_1500_delta.groupby([i])['measure'].median()
    vertical_offset=df_summ_1500_delta['measure'].median()*-0.5
    for xtick in box_plot.get_xticks():
        box_plot.text(xtick,medians[xtick] + vertical_offset,medians[xtick],
                      horizontalalignment='center',size='small',color='blue',weight='semibold')
My only issue is that they aren't separated into different facets, but are drawn on top of each other.
Any help on how I can put each on its own separate chart, with the x-axis of the first chart being 'dimension a' and the x-axis of the second chart being 'dimension b'?
To draw two boxplots next to each other at each x-position, you can use a hue for dimension_a and dimension_b separately. These two columns need to be transformed (with pd.melt()) to "long form".
Here is some example code starting from generated test data. Note that the order of both the x-values and the hue-values needs to be enforced to be sure of their exact positions. The individual box plots are distributed over a width of 0.8.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.DataFrame({'dimension_a': np.random.choice(['hot', 'cold'], 100),
'dimension_b': np.random.choice(['hot', 'cold'], 100),
'measure': np.random.uniform(100, 500, 100)})
df.loc[df['dimension_a'] == 'hot', 'measure'] += 100
df.loc[df['dimension_a'] == 'cold', 'measure'] -= 100
x_order = ['hot', 'cold']
columns = ['dimension_a', 'dimension_b']
df1 = df.melt(value_vars=columns, var_name='dimension', value_name='value', id_vars='measure')
sns.set(style="ticks", palette="pastel")
ax = sns.boxplot(data=df1, x='value', order=x_order, y='measure',
hue='dimension', hue_order=columns, palette=["m", "g"], dodge=True)
ax.set_xlabel('')
sns.despine(offset=10, trim=True)
for col, dodge_dist in zip(columns, np.linspace(-0.4, 0.4, 2 * len(x_order) + 1)[1::2]):
    medians = df.groupby([col])['measure'].median()
    vertical_offset = df['measure'].median() * -0.5
    for x_ind, xtick in enumerate(x_order):
        ax.text(x_ind + dodge_dist, medians[xtick] + vertical_offset, f'{medians[xtick]:.2f}',
                horizontalalignment='center', size='small', color='blue', weight='semibold')
plt.show()
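If separate charts are what is wanted, as the question describes, one possible sketch is to create one Axes per column and hand it to sns.boxplot through its ax parameter. The frame `demo` is hypothetical test data, and the shared y-axis is a design choice made here, not something the question specifies.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical test data, shaped like the generated frame above
rng = np.random.default_rng(0)
demo = pd.DataFrame({'dimension_a': rng.choice(['hot', 'cold'], 100),
                     'dimension_b': rng.choice(['hot', 'cold'], 100),
                     'measure': rng.uniform(100, 500, 100)})
columns = ['dimension_a', 'dimension_b']

# One Axes per column; sharey keeps the panels comparable
fig, axes = plt.subplots(ncols=len(columns), figsize=(10, 4), sharey=True)
for ax, col in zip(axes, columns):
    sns.boxplot(data=demo, x=col, y='measure', order=['hot', 'cold'], ax=ax)
    ax.set_title(col)
```

Because each call draws into its own Axes, the boxes no longer land on top of each other; median labels could be added per Axes with ax.text in the same loop.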

How to extend a matplotlib axis if the ticks are labels and not numeric?

I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df =pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index( 'period')
fig, ax = plt.subplots(1,2)
sns.lineplot( data = df, ax =ax[0])
df_idx = df.index
df2 = df.set_index( np.arange(1, len(df_idx) + 1 ))
sns.lineplot(data = df2, ax = ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]
left_buffer,right_buffer = 3,2
labels = ['Q1','Q2','Q3','Q4']
extanded_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extanded_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extanded_labels)
xticks = ax[0].xaxis.get_major_ticks()
for ind in aux_range[0:left_buffer]: xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]: xticks[ind].tick1line.set_visible(False)
in which left_buffer and right_buffer are margins you want to add to the left and to the right, respectively. Running the code, you will get
I may have actually found a simpler solution: I can draw a transparent line (alpha = 0 ) by plotting x = index of the dataframe, ie with all the labels, including those for which all values are nans, and y = the average value of the dataframe, so as to be sure it's within the range:
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * df.mean().mean() , ax = ax[0], alpha =0 )
This assumes the scale of the y-axis has not been changed manually; a more robust way is to take the centre of the current y-limits instead:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * y_centre , ax = ax[0], alpha =0 )
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
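The same trick can be sketched with plain matplotlib, which also treats string x-values as categories. This is a minimal illustration under that assumption, using ax.plot rather than sns.lineplot, with the data from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [3, 4, 5, np.nan], 'b': [4, 4, 6, np.nan]},
                  index=['Q1', 'Q2', 'Q3', 'Q4'])

fig, ax = plt.subplots()
for name in df.columns:
    ax.plot(df.index, df[name], label=name)   # Q4 is NaN, so nothing is drawn there

# Invisible helper line through every label, placed mid-way in the current y-range
y_centre = np.mean(ax.get_ylim())
ax.plot(df.index, np.full(len(df), y_centre), alpha=0)
ax.legend()
```

The helper line has no label, so it does not show up in the legend, and placing it at the centre of the existing y-range keeps the y-limits unchanged.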

How to plot multiple seaborn.distplot in a single figure

I want to plot multiple seaborn distplot under a same window, where each plot has the same x and y grid. My attempt is shown below, which does not work.
# function to plot the density curve of the 200 Median Stn. MC-losses
def make_density(stat_list,color, layer_num):
    num_subplots = len(stat_list)
    ncols = 3
    nrows = (num_subplots + ncols - 1) // ncols
    fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(ncols * 6, nrows * 5))
    for i in range(len(stat_list)):
        # Plot formatting
        plt.title('Layer ' + layer_num)
        plt.xlabel('Median Stn. MC-Loss')
        plt.ylabel('Density')
        plt.xlim(-0.2,0.05)
        plt.ylim(0, 85)
        min_ylim, max_ylim = plt.ylim()
        # Draw the density plot.
        sns.distplot(stat_list, hist = True, kde = True,
                     kde_kws = {'linewidth': 2}, color=color)
# `stat_list` is a list of 6 lists
# I want to draw histogram and density plot of
# each of these 6 lists contained in `stat_list` in a single window,
# where each row containing the histograms and densities of the 3 plots
# so in my example, there would be 2 rows of 3 columns of plots (2 x 3 =6).
stat_list = [[0.3,0.5,0.7,0.3,0.5],[0.2,0.1,0.9,0.7,0.4],[0.9,0.8,0.7,0.6,0.5],
             [0.2,0.6,0.75,0.87,0.91],[0.2,0.3,0.8,0.9,0.3],[0.2,0.3,0.8,0.87,0.92]]
How can I modify my function to draw multiple distplot under the same window, where the x and y grid for each displayed plot is identical?
Thank you,
PS: Aside, I want the 6 distplots to have identical color, preferably green for all of them.
The easiest method is to load the data into pandas and then use seaborn.displot.
.displot replaces .distplot in seaborn version 0.11.0
Technically, what you would have wanted before, is a FacetGrid mapped with distplot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# data
stat_list = [[0.3,0.5,0.7,0.3,0.5], [0.2,0.1,0.9,0.7,0.4], [0.9,0.8,0.7,0.6,0.5], [0.2,0.6,0.75,0.87,0.91], [0.2,0.3,0.8,0.9,0.3], [0.2,0.3,0.8,0.87,0.92]]
# load the data into pandas and then transpose it for the correct column data
df = pd.DataFrame(stat_list).T
# name the columns; specify a layer number
df.columns = ['A', 'B', 'C', 'D', 'E', 'F']
# now stack the data into a long (tidy) format
dfl = df.stack().reset_index(level=1).rename(columns={'level_1': 'Layer', 0: 'Median Stn. MC-Loss'})
# plot a displot
g = sns.displot(data=dfl, x='Median Stn. MC-Loss', col='Layer', col_wrap=3, kde=True, color='green')
g.set_axis_labels(y_var='Density')
g.set(xlim=(0, 1.0), ylim=(0, 3.0))
sns.FacetGrid and sns.distplot
.distplot is deprecated
p = sns.FacetGrid(data=dfl, col='Layer', col_wrap=3, height=5)
p.map(sns.distplot, 'Median Stn. MC-Loss', bins=5, kde=True, color='green')
p.set(xlim=(0, 1.0))

Pandas plot: Assign Colors

I have many data frames that I am plotting for a presentation. These all have different columns, but all contain the same additional column foobar. At the moment, I am plotting these different data frames using
df.plot(secondary_y='foobar')
Unfortunately, since these data frames all have different additional columns with different ordering, the color of foobar is always different. This makes the presentation slides unnecessary complicated. I would like, throughout the different plots, assign that foobar is plotted bold and black.
Looking at the docs, the only thing coming close appears to be the parameter colormap - I would need to ensure that the xth color in the color map is always black, where x is the order of foobar in the data frame. Seems to be more complicated than it should be, also this wouldn't make it bold.
Is there a (better) approach?
I would suggest using matplotlib directly rather than the dataframe plotting methods. If df.plot returned the artists it added instead of an Axes object it wouldn't be too bad to change the color of the line after it was plotted.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
def pandas_plot(ax, df, callout_key):
    """
    Parameters
    ----------
    ax : mpl.Axes
        The axes to draw to
    df : DataFrame
        Data to plot
    callout_key : str
        key to highlight
    """
    artists = {}
    x = df.index.values
    for k, v in df.items():  # iteritems() was removed in pandas 2.0
        style_kwargs = {}
        if k == callout_key:
            style_kwargs['c'] = 'k'
            style_kwargs['lw'] = 2
        # label each line so ax.legend() has entries to show
        ln, = ax.plot(x, v.values, label=k, **style_kwargs)
        artists[k] = ln
    ax.legend()
    ax.set_xlim(np.min(x), np.max(x))
    return artists
Usage:
fig, ax = plt.subplots()
ax2 = ax.twinx()
th = np.linspace(0, 2*np.pi, 1024)
df = pd.DataFrame({'cos': np.cos(th), 'sin': np.sin(th),
'foo': np.sin(th + 1), 'bar': np.cos(th +1)}, index=th)
df2 = pd.DataFrame({'cos': -np.cos(th), 'sin': -np.sin(th)}, index=th)
pandas_plot(ax, df, 'sin')
pandas_plot(ax2, df2, 'sin')
Perhaps you could define a function which handles the special column in a separate plot call:
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax, **emphargs)
Using code from tcaswell's example,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax, **emphargs)
fig, ax = plt.subplots()
th = np.linspace(0, 2*np.pi, 1024)
df = pd.DataFrame({'cos': np.cos(th), 'foobar': np.sin(th),
'foo': np.sin(th + 1), 'bar': np.cos(th +1)}, index=th)
df2 = pd.DataFrame({'cos': -np.cos(th), 'foobar': -np.sin(th)}, index=th)
emphasize_plot(ax, df, 'foobar', lw=2, c='k')
emphasize_plot(ax, df2, 'foobar', lw=2, c='k')
plt.show()
yields
I used @unutbut's answer and extended it to allow for a secondary y axis and correct legends:
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    ax2 = ax.twinx()
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax2, **emphargs)
    lines, labels = ax.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines + lines2, labels + labels2, loc=0)
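Another option that might be worth trying is to let pandas plot everything and then restyle the one line afterwards, looking it up by its label. This sketch assumes a plain df.plot() without secondary_y, and the data is the toy sine/cosine frame used in the answers above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

th = np.linspace(0, 2 * np.pi, 256)
df = pd.DataFrame({'cos': np.cos(th), 'foobar': np.sin(th),
                   'foo': np.sin(th + 1)}, index=th)

ax = df.plot()
for line in ax.get_lines():
    if line.get_label() == 'foobar':   # pandas labels each line with its column name
        line.set_color('black')
        line.set_linewidth(2.5)
ax.legend()   # rebuild the legend so it shows the new style
```

The loop works regardless of where foobar sits among the columns, so the same few lines can be reused for every data frame in the presentation.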
