matplotlib categorical bar chart creates unwanted whitespace - python

I have a dataframe that looks like this:
import numpy as np
import pandas as pd
location = list(range(1, 34))
location += [102, 172]
stress = np.random.randint(1,1000, len(location))
group = np.random.choice(['A', 'B'], len(location))
df = pd.DataFrame({'location':location, 'stress':stress, 'group':group})
df[['location', 'group']] = df[['location', 'group']].astype(str)
Note: location and group are both strings
I'm trying to create a bar plot so that location (categorical) is on the x-axis and stress is the height of each bar. Furthermore, I want to colour each bar according to its group.
I've tried the following:
import matplotlib.pyplot as plt
f, axarr = plt.subplots(1, 1)
axarr.bar(df['location'], df['stress'])
plt.xticks(np.arange(df.shape[0]) + 1, df['location'])
plt.show()
However, this produces:
I'm not sure why there are blank spaces between the end bars. I'm guessing it's because of the 102 and 172 values in location. However, that column is a string, so I expected it to be treated as a categorical variable, with all bars placed next to each other regardless of the location "value". I tried to correct for this by manually specifying the xtick locations and labels, but it didn't work.
Finally, is there a quick way to colour each bar by group without having to manually iterate over each unique group value?

If your location is categorical data, don't make your bar plot with that. Use np.arange(df.shape[0]) to make the bar plot and set ticklabels later:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
location = list(range(1, 34))
location += [102, 172]
stress = np.random.randint(1,1000, len(location))
group = np.random.choice(['A', 'B'], len(location))
df = pd.DataFrame({'location':location, 'stress':stress, 'group':group})
df[['location', 'group']] = df[['location', 'group']].astype(str)
f, axarr = plt.subplots(1, 1)
bars = axarr.bar(np.arange(df.shape[0]), df['stress'])
for b, g in zip(bars.patches, df['group']):
    if g == 'A':
        b.set_color('b')
    elif g == 'B':
        b.set_color('r')
plt.xticks(np.arange(df.shape[0]) + bars.patches[0].get_width() / 2, df['location'])
plt.setp(axarr.xaxis.get_ticklabels(), rotation=90)
plt.show()
Don't know if there is a concise way to set bar color in bulk. An iteration is not too bad...
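One possibly more concise option is to pass a list of colours straight to bar() through its color argument, which accepts one colour per bar. The sketch below is a minimal illustration under assumptions: the small dataframe and the palette dictionary are made up for the example, and the Agg backend is selected only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in for the dataframe in the question (values are illustrative)
df = pd.DataFrame({
    "location": ["1", "2", "3", "102", "172"],
    "stress": [10, 40, 25, 60, 35],
    "group": ["A", "B", "A", "B", "A"],
})

# Hypothetical palette: one colour per group value
palette = {"A": "tab:blue", "B": "tab:red"}
colors = df["group"].map(palette).tolist()

# Plot against integer positions so the bars sit next to each other
positions = np.arange(len(df))
fig, ax = plt.subplots()
bars = ax.bar(positions, df["stress"], color=colors)

# Put the categorical labels back on the ticks
ax.set_xticks(positions)
ax.set_xticklabels(df["location"], rotation=90)
```

Mapping the group column through a dictionary keeps the group-to-colour choice in one place, so adding a third group only means adding one dictionary entry.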

Related

How can I add hatching for specific bars in sns.catplot?

I use seaborn to make a categorical barplot of a df containing Pearson correlation R-values for 17 vegetation classes, 3 carbon species and 4 regions. I try to recreate a smaller sample df here:
import pandas as pd
import seaborn as sns
import random
import numpy as np
df = pd.DataFrame({
    'veg class':12*['Tree bl dc','Shrubland','Grassland'],
    'Pearson R':np.random.uniform(0,1, 36),
    'Pearson p':np.random.uniform(0,0.1, 36),
    'carbon':4*['CO2','CO2','CO2', 'CO', 'CO', 'CO', 'CO2 corr', 'CO2 corr', 'CO2 corr'],
    'spatial':9*['SH'] + 9*['larger AU region'] + 9*['AU'] + 9*['SE-AU']
})
#In my original df, the number of vegetation classes where R-values are
#available is not the same for all spatial scales, so I drop random rows
#to make it more similar:
df.drop([11,14,17,20,23,26,28,29,31,32,34,35], inplace=True)
#I added columns indicating where hatching should be
#boolean:
df['significant'] = 1
df.loc[df['Pearson p'] > 0.05, 'significant'] = 0
#string:
df['hatch'] = ''
df.loc[df['Pearson p'] > 0.05, 'hatch'] = 'x'
df.head()
This is my plotting routine:
sns.set(font_scale=2.1)
#Draw a nested barplot by veg class
g = sns.catplot(
    data=df, kind="bar", row="spatial",
    x="veg class", y="Pearson R", hue="carbon",
    ci=None, palette="YlOrBr", aspect=5
)
g.despine(left=True)
g.set_titles("{row_name}")
g.set_axis_labels("", "Pearson R")
g.set(xlabel=None)
g.legend.set_title("")
g.set_xticklabels(rotation = 60)
(The plot looks as follows: seaborn categorical barplot)
The plot is exactly how I would like it, except that now I would like to add hatching (or any kind of distinction) for all bars where the Pearson R value is insignificant, i.e. where the p value is larger than 0.05. I found this stackoverflow entry, but my problem differs from this, as the plots that should be hatched are not in repetitive order.
Any hints will be highly appreciated!
To hatch individual bars, loop over the bars (patches) in each subplot, read each bar's height, compare it with a threshold, and then set the hatch and edge color. Add the following code at the end.
for ax in g.axes.flat:
    # ax.patches holds every bar drawn in this subplot
    for patch in ax.patches:
        h = patch.get_height()
        if h >= 0.8:
            patch.set_hatch('*')
            patch.set_edgecolor('k')
Edit: The data has been updated to match the actual data, and the code has been modified accordingly. Also, the logic is conditional on the value of the hatching column.
for i, ax in enumerate(g.axes.flat):
    s = ax.get_title()
    # '@s' refers to the local variable s inside the query string
    dff = df.query('spatial == @s')
    dff = dff.sort_values('veg class', ascending=False)
    ha = dff['hatch'].tolist()
    p = dff['Pearson R'].tolist()
    print(ha)
    for k in range(len(dff)):
        if ha[k] == 'x':
            ax.patches[k].set_hatch('*')
            ax.patches[k].set_edgecolor('k')
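The idea of driving the hatching from the hatch column can be sketched in plain matplotlib, where each bar corresponds to exactly one dataframe row, so no assumption about seaborn's patch ordering is needed. This is only an illustration under assumptions: the three-row dataframe and the bar colour are made up, and it does not reproduce the faceted catplot above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: one bar per row, with p-values deciding the hatching
df = pd.DataFrame({
    "veg class": ["Tree bl dc", "Shrubland", "Grassland"],
    "Pearson R": [0.8, 0.3, 0.6],
    "Pearson p": [0.01, 0.20, 0.07],
})
df["hatch"] = ""
df.loc[df["Pearson p"] > 0.05, "hatch"] = "x"

fig, ax = plt.subplots()
bars = ax.bar(df["veg class"], df["Pearson R"], color="goldenrod")

# The bars come back in row order, so they can be zipped with the hatch column
for bar, hatch in zip(bars, df["hatch"]):
    if hatch:
        bar.set_hatch(hatch)
        bar.set_edgecolor("k")  # hatch lines are drawn in the edge colour
```

Because ax.bar returns the bars in the order of the input rows, the zip pairs each bar with its own hatch value directly.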

How to extend a matplotlib axis if the ticks are labels and not numeric?

I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df =pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index( 'period')
fig, ax = plt.subplots(1,2)
sns.lineplot( data = df, ax =ax[0])
df_idx = df.index
df2 = df.set_index( np.arange(1, len(df_idx) + 1 ))
sns.lineplot(data = df2, ax = ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]
left_buffer,right_buffer = 3,2
labels = ['Q1','Q2','Q3','Q4']
extanded_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extanded_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extanded_labels)
xticks = ax[0].xaxis.get_major_ticks()
for ind in aux_range[0:left_buffer]: xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]: xticks[ind].tick1line.set_visible(False)
Here left_buffer and right_buffer are the margins (in tick positions) to add on the left and on the right, respectively.
I may have actually found a simpler solution: draw a transparent line (alpha = 0) with x = the index of the dataframe, i.e. all the labels, including those for which all values are NaN, and y = the average value of the dataframe, so that it is certain to lie within the plotted range:
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * df.mean().mean() , ax = ax[0], alpha =0 )
This assumes the scale of the y-axis has not been changed manually; a more robust way is to take the centre of the current y-limits:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * y_centre , ax = ax[0], alpha =0 )
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
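Another option, sketched here for plain matplotlib rather than seaborn, relies on the fact that matplotlib places string categories at integer positions 0, 1, 2, ... in the order they are first seen. Under the assumption that the all-NaN period is still passed as an x value, its label is registered even though no point is drawn, so widening the x-limits is enough to show it. Seaborn may drop NaN rows before they reach matplotlib, so this may not carry over to sns.lineplot unchanged.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

periods = ["Q1", "Q2", "Q3", "Q4"]
a = [3, 4, 5, np.nan]
b = [4, 4, 6, np.nan]

fig, ax = plt.subplots()
# All four labels are passed as x values, so "Q4" gets position 3
ax.plot(periods, a, label="a")
ax.plot(periods, b, label="b")

# Autoscaling skips the NaN point; widen the limits to cover every category
ax.set_xlim(-0.5, len(periods) - 0.5)
ax.legend()
```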

Is it possible to have a given number (n>2) of y-axes in matplotlib?

prices = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                      columns=['a', 'b', 'c'])
I have my prices dataframe, and it currently has 3 columns. But at other times, it could have more or fewer columns. Is there a way to use some sort of twinx() loop to create a line-chart of all the different timeseries with a (potentially) infinite number of y-axes?
I tried the double for loop below, but I got a TypeError: 'AxesSubplot' object does not support item assignment
# for i in range(0,len(prices.columns)):
# for column in list(prices.columns):
# fig, ax[i] = plt.subplots()
# ax[i].set_xlabel(prices.index())
# ax[i].set_ylabel(column[i])
# ax[i].plot(prices.Date, prices[column])
# ax[i].tick_params(axis ='y')
#
# ax[i+1] = ax[i].twinx()
# ax[i+1].set_ylabel(column[i+1])
# ax[i+1].plot(prices.Date, column[i+1])
# ax[i+1].tick_params(axis ='y')
#
# fig.suptitle('matplotlib.pyplot.twinx() function \ Example\n\n', fontweight ="bold")
# plt.show()
# =============================================================================
I believe I understand why I got the error - the ax object does not allow the assignment of the i variable. I'm hoping there is some ingenious way to accomplish this.
It turned out that the main problem was mixing the pandas plotting functions with matplotlib, which led to a duplication of the axes. Otherwise, the implementation is rather straightforward and is adapted from this matplotlib example.
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
from matplotlib import pyplot as plt
from itertools import cycle
import pandas as pd
#fake data creation with different spread for different axes
#this entire block can be deleted if you import your df
from pandas._testing import rands_array
import numpy as np
fakencol=5
fakenrow=7
np.random.seed(20200916)
df = pd.DataFrame(np.random.randint(1, 10, fakenrow*fakencol).reshape(fakenrow, fakencol), columns=rands_array(2, fakencol))
df = df.multiply(np.power(np.asarray([10]), np.arange(fakencol)))
df.index = pd.date_range("20200916", periods=fakenrow)
#defining a color scheme with unique colors
#if you want to include more than 20 axes, well, what can I say
sc_color = cycle(plt.cm.tab20.colors)
#defining the size of the figure in relation to the number of dataframe columns
#might need adjustment for optimal data presentation
offset = 60
plt.rcParams['figure.figsize'] = 10+df.shape[1], 5
#host figure and first plot
host = host_subplot(111, axes_class=AA.Axes)
h, = host.plot(df.index, df.iloc[:, 0], c=next(sc_color), label=df.columns[0])
host.set_ylabel(df.columns[0])
host.axis["left"].label.set_color(h.get_color())
host.set_xlabel("time")
#plotting the rest of the axes
for i, cols in enumerate(df.columns[1:]):
    curr_ax = host.twinx()
    new_fixed_axis = curr_ax.get_grid_helper().new_fixed_axis
    curr_ax.axis["right"] = new_fixed_axis(loc="right",
                                           axes=curr_ax,
                                           offset=(offset*i, 0))
    curr_p, = curr_ax.plot(df.index, df[cols], c=next(sc_color), label=cols)
    curr_ax.axis["right"].label.set_color(curr_p.get_color())
    curr_ax.set_ylabel(cols)
    curr_ax.yaxis.label.set_color(curr_p.get_color())
plt.legend()
plt.tight_layout()
plt.show()
Coming to think of it - it would probably have been better to distribute the axes equally to the left and the right of the plot. Oh, well.
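A similar result can probably be reached with plain twinx() axes by pushing each extra right-hand spine outward, without the axisartist toolkit. The following is a minimal sketch under assumptions: the prices values, the spacing of 60 points, and the tab10 colours are illustrative choices, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data with a different scale per column
prices = pd.DataFrame(np.array([[1, 20, 300], [4, 50, 600], [7, 80, 900]]),
                      columns=["a", "b", "c"])

fig, host = plt.subplots(figsize=(8, 4))
colors = plt.cm.tab10.colors
spacing = 60  # assumed gap in points between neighbouring right-hand axes

axes = []
for i, column in enumerate(prices.columns):
    # The first column uses the host axes; every other column gets a twin
    ax = host if i == 0 else host.twinx()
    if i >= 2:
        # Shift the third, fourth, ... y-axis further to the right
        ax.spines["right"].set_position(("outward", spacing * (i - 1)))
    color = colors[i % len(colors)]
    ax.plot(prices.index, prices[column], color=color, label=column)
    ax.set_ylabel(column, color=color)
    ax.tick_params(axis="y", colors=color)
    axes.append(ax)

fig.tight_layout()
```

Collecting the axes in a list avoids the item-assignment error from the question, because the list, not the Axes object, is what gets indexed.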

Pandas: label with a given set of yaxis values category plot

The csv has the following values
Name Grade
Jack B
Jill C
The labels for the y-axis are B and C from the CSV, but I want the y-axis to contain all the grades: A, B, C, D, F. The following plots only the given values (B, C) on the y-axis:
ax = sns.catplot(x = "Name", y = "Grade")
Is there any way to show all the grades on the y-axis?
When you call sns.catplot() without the kind argument, it invokes the default sns.stripplot, which only works if y is numerical. So if you really want this kind of plot, you should code the grades as numbers. You can still show the grade letters in the plot, by assigning them as labels:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# code grades as numbers (A: 1 etc.)
df = pd.DataFrame({'Name': ['Jack', 'Jill'],
                   'Grade': [2, 3]})
# catplot (i.e. the default stripplot) works, as y is numerical
sns.catplot(x='Name', y='Grade', data=df)
# provide y tick positions and labels (translate numbers back to grade letters)
plt.yticks(range(1, 7), [chr(ord('A') + i) for i in range(6)])
Edit: If you want to have A on top, just add this line at the end:
plt.gca().invert_yaxis()
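Rather than typing the numeric codes by hand, the letters can be mapped to positions with a dictionary. This is a sketch under assumptions: it uses the grade list A, B, C, D, F from the question, a plain matplotlib scatter instead of catplot, and a helper column named GradePos that is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

grades = ["A", "B", "C", "D", "F"]
grade_to_pos = {grade: pos for pos, grade in enumerate(grades)}

df = pd.DataFrame({"Name": ["Jack", "Jill"], "Grade": ["B", "C"]})
# Hypothetical helper column holding the numeric position of each grade
df["GradePos"] = df["Grade"].map(grade_to_pos)

fig, ax = plt.subplots()
ax.scatter(df["Name"], df["GradePos"])

# Show every grade on the y-axis, whether or not it occurs in the data
ax.set_yticks(range(len(grades)))
ax.set_yticklabels(grades)
# Reversed limits put A at the top
ax.set_ylim(len(grades) - 0.5, -0.5)
```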

Plotting colored lines connecting individual data points of two swarmplots

I have:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
# Generate random data
set1 = np.random.randint(0, 40, 24)
set2 = np.random.randint(0, 100, 24)
# Put into dataframe and plot
df = pd.DataFrame({'set1': set1, 'set2': set2})
data = pd.melt(df)
sb.swarmplot(data=data, x='variable', y='value')
The two random distributions plotted with seaborn's swarmplot function:
I want the individual plots of both distributions to be connected with a colored line such that the first data point of set 1 in the dataframe is connected with the first data point of set 2.
I realize that this would probably be relatively simple without seaborn but I want to keep the feature that the individual data points do not overlap.
Is there any way to access the individual plot coordinates in the seaborn swarmfunction?
EDIT: Thanks to @Mead, who pointed out a bug in my post prior to 2021-08-23 (I forgot to sort the locations in the prior version).
I gave the nice answer by Paul Brodersen a try, and despite him saying that
Madness lies this way
... I actually think it's pretty straightforward and yields nice results:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Generate random data
rng = np.random.default_rng(42)
set1 = rng.integers(0, 40, 5)
set2 = rng.integers(0, 100, 5)
# Put into dataframe
df = pd.DataFrame({"set1": set1, "set2": set2})
print(df)
data = pd.melt(df)
# Plot
fig, ax = plt.subplots()
sns.swarmplot(data=data, x="variable", y="value", ax=ax)
# Now connect the dots
# Find idx0 and idx1 by inspecting the elements return from ax.get_children()
# ... or find a way to automate it
idx0 = 0
idx1 = 1
locs1 = ax.get_children()[idx0].get_offsets()
locs2 = ax.get_children()[idx1].get_offsets()
# before plotting, we need to sort so that the data points
# correspond to each other as they did in "set1" and "set2"
sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)
# revert "ascending sort" through sort_idxs2.argsort(),
# and then sort into order corresponding with set1
locs2_sorted = locs2[sort_idxs2.argsort()][sort_idxs1]
for i in range(locs1.shape[0]):
    x = [locs1[i, 0], locs2_sorted[i, 0]]
    y = [locs1[i, 1], locs2_sorted[i, 1]]
    ax.plot(x, y, color="black", alpha=0.1)
It prints:
   set1  set2
0     3    85
1    30     8
2    26    69
3    17    20
4    17     9
And you can see that the data is linked correspondingly in the plot.
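The double argsort step can be checked in isolation with NumPy alone. The sketch below assumes, as the code above does, that the plotted points of each set come back in ascending order of their values; the sorted arrays stand in for the offsets returned by get_offsets(), and the data are the five rows printed above.

```python
import numpy as np

set1 = np.array([3, 30, 26, 17, 17])
set2 = np.array([85, 8, 69, 20, 9])

# Stable sorts so that tied values keep their original relative order
order1 = np.argsort(set1, kind="stable")
order2 = np.argsort(set2, kind="stable")

# Stand-ins for the plotted points, assumed to be in ascending order
plotted1 = set1[order1]
plotted2 = set2[order2]

# order2.argsort() is the inverse permutation: it restores the original row order.
# Indexing with order1 then arranges set2 in the same order as plotted1.
realigned2 = plotted2[order2.argsort()][order1]

# Each pair now comes from the same dataframe row
pairs = list(zip(plotted1.tolist(), realigned2.tolist()))
print(pairs)
```

Every pair printed matches one row of the dataframe, which is the correspondence the connecting lines rely on.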
Sure, it's possible (but you really don't want to).
seaborn.swarmplot returns the axis instance (here: ax). You can call ax.get_children() to get all plot elements. You will see that for each set of points there is an element of type PathCollection. You can determine the x, y coordinates by using the PathCollection.get_offsets() method.
I do not suggest you do this! Madness lies this way.
I suggest you have a look at the source code (found here), and derive your own _PairedSwarmPlotter from _SwarmPlotter and change the draw_swarmplot method to your needs.
