For some reason xticks on my histogram are shifted:
Here is the code:
data = list(df['data'].to_numpy())
bin = 40
plt.style.use('seaborn-colorblind')
plt.grid(axis='y', alpha=0.5, linestyle='--')
plt.hist(data, bins=bin, rwidth=0.7, align='mid')
plt.yticks(np.arange(0, 13000, 1000))
ticks = np.arange(0, 100000, 2500)
plt.xticks(ticks, rotation=-90, ha='center')
plt.show()
I'm wondering why the x ticks are shifted at the very beginning of the x-axis.
When setting bins=40, 40 equally sized bins are created between the lowest and highest data value. In this case, the highest data value seems to be around 90000 and the lowest about 0. Dividing this range into 40 regions results in boundaries at non-round values. It is therefore better to set the bin boundaries explicitly to the values you really want, for example by dividing the range 0-100000 into 40 bins (so 41 boundaries).
from matplotlib import pyplot as plt
import numpy as np
plt.style.use('seaborn-colorblind')
data = np.random.lognormal(10, 0.4, 100000)
data[data > 90000] = np.nan
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
for ax in axes:
    if ax == axes[0]:
        bins = 40
        ax.set_title('bins = 40')
    else:
        bins = np.linspace(0, 100000, 41)
        ax.set_title('bins = np.linspace(0, 100000, 41)')
    ax.grid(axis='y', alpha=0.5, linestyle='--')
    ax.hist(data, bins=bins, rwidth=0.7, align='mid')
    ax.set_yticks(np.arange(0, 13000, 1000))
    xticks = np.arange(0, 100000, 2500)
    ax.set_xticks(xticks)
    ax.tick_params(axis='x', labelrotation=-90)
plt.tight_layout()
plt.show()
The issue is related to the way bins are constructed.
You have two choices:
1. Set the range for the bins directly:
plt.hist(data, bins=bin, rwidth=0.7, range=(0, 100_000), align='mid')
2. Set the x-axis ticks according to the binning:
_, bin_edges, _ = plt.hist(data, bins=bin, rwidth=0.7, align='mid')
ticks = bin_edges
I recommend the second option. The ticks then coincide with the bin boundaries, which gives the histogram a more natural scale.
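The two options can be checked numerically with np.histogram, which plt.hist uses for its binning. This is only a minimal sketch: the lognormal sample is an assumption standing in for df['data'] (it mirrors the test data of the other answer), and only the resulting bin edges matter here.

```python
import numpy as np

# Stand-in data (assumption): values between roughly 0 and 90000
rng = np.random.default_rng(0)
data = rng.lognormal(10, 0.4, 100_000)
data = data[data <= 90_000]

# The tick positions used in the question
ticks = np.arange(0, 100_000, 2500)

# Option 1: fix the range, so the 41 edges fall on multiples of 2500
_, edges_fixed = np.histogram(data, bins=40, range=(0, 100_000))
print(np.allclose(edges_fixed[:-1], ticks))  # True

# Option 2: keep the automatic edges and reuse them as tick positions
_, edges_auto = np.histogram(data, bins=40)
print(edges_auto[:3])  # starts at data.min(), generally not round numbers
```

With option 1 the existing ticks already line up with the bars; with option 2 the ticks follow whatever edges the data produce.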
I am trying to plot a histogram of an exponential distribution ranging from 0 to 20, with mean value 2.2 and bin width 0.05. However, the bars turn out white when I plot it. The following is my code:
bins = np.linspace(0, 20, 401)
x = np.random.exponential(2.2, 3000)
counts, _ = np.histogram(x, bins)
df = pd.DataFrame({'bin': bins[:-1], 'count': counts})
p = sns.catplot(data = df, x = 'bin', y = 'count', yerr = [i**(1/2) for i in counts], kind = 'bar', height = 4, aspect = 2, palette = 'Dark2_r')
p.set(xlabel = r'Muon decay times ($\mu s$)', ylabel = 'Count', title = 'Distribution for muon decay times')
for ax in p.axes.flat:
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        if (i%40 != 0):
            labels[i] = ""
    ax.set_xticklabels(labels, rotation=30)
I believe that this is caused by the number of bins. If the first line of the code is changed to bins = np.linspace(0, 20, 11), the plot looks like this:
But I have no idea how to resolve this.
As @JohanC points out, if you're trying to draw elements that are close to or smaller than the resolution of your raster graphic, you have to expect some artifacts. But it also seems like you'd have an easier time making this plot directly in matplotlib, since catplot is not designed to make histograms:
f, ax = plt.subplots(figsize=(8, 4), dpi=96)
ax.bar(
    bins[:-1], counts,
    yerr=[i**(1/2) for i in counts],
    width=(bins[1] - bins[0]), align="edge",
    linewidth=0, error_kw=dict(linewidth=1),
)
ax.set(
    xmargin=.01,
    xlabel=r'Muon decay times ($\mu s$)',
    ylabel='Count',
    title='Distribution for muon decay times'
)
Matplotlib doesn't have a good way to deal with bars that are thinner than one pixel. If you save to an image file, you can increase the dpi and/or the figsize.
Some white space is due to the bars being 0.8 wide, leaving a gap of 0.2. Seaborn's barplot doesn't let you set the bar widths, but you could iterate through the generated bars and change their width (also updating their x-value to keep them centered around the tick position).
The edges of the bars get a fixed color (default 'none', or fully transparent). While iterating through the generated bars, you could set the edge color equal to the face color.
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns
import pandas as pd
import numpy as np
bins = np.linspace(0, 20, 401)
x = np.random.exponential(2.2, 3000)
counts, _ = np.histogram(x, bins)
df = pd.DataFrame({'bin': bins[:-1], 'count': counts})
g = sns.catplot(data=df, x='bin', y='count', yerr=[i ** (1 / 2) for i in counts], kind='bar',
                height=4, aspect=2, palette='Dark2_r', lw=0.5)
g.set(xlabel=r'Muon decay times ($\mu s$)', ylabel='Count', title='Distribution for muon decay times')
for ax in g.axes.flat:
    ax.xaxis.set_major_locator(MultipleLocator(40))
    ax.tick_params(axis='x', labelrotation=30)
    for bar in ax.patches:
        bar.set_edgecolor(bar.get_facecolor())
        bar.set_x(bar.get_x() - (1 - bar.get_width()) / 2)
        bar.set_width(1)
plt.tight_layout()
plt.show()
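The "thinner than one pixel" point can be made concrete with a rough estimate. The sketch below is only back-of-the-envelope arithmetic: bar_width_px is a hypothetical helper, the share of the figure width taken by the axes (axes_fraction) is an assumed value, and the dpi values are illustrative.

```python
def bar_width_px(fig_width_in, dpi, n_bars, axes_fraction=0.85, rel_width=0.8):
    """Approximate on-screen width of one bar in pixels.

    axes_fraction (share of the figure width used by the axes) is an
    assumption; rel_width=0.8 is the relative bar width mentioned above.
    """
    slot_px = fig_width_in * dpi * axes_fraction / n_bars
    return slot_px * rel_width


# height=4 and aspect=2 give a figure about 8 inches wide; there are 400 bins
for dpi in (72, 100, 300):
    print(dpi, round(bar_width_px(8, dpi, 400), 2))
```

At low dpi each bar gets roughly one pixel or less, which is why raising the dpi or the figure size helps when saving to a file.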
I have created a histogram in a Jupyter notebook to show the distribution of time on page in seconds for 100 web visits.
Code as follows:
from matplotlib.ticker import StrMethodFormatter
ax = df.hist(column='time_on_page', bins=25, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Despine
    x.spines['right'].set_visible(False)
    x.spines['top'].set_visible(False)
    x.spines['left'].set_visible(False)
    # Switch off ticks
    x.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)
    # Draw horizontal axis lines
    vals = x.get_yticks()
    for tick in vals:
        x.axhline(y=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)
    # Set title
    x.set_title("Time on Page Histogram", fontsize=20, weight='bold', size=12)
    # Set x-axis label
    x.set_xlabel("Time on Page Duration (Seconds)", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Page Views", labelpad=20, weight='bold', size=12)
    # Format y-axis label
    x.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
This produces the following visualisation:
I'm generally happy with the appearance; however, I'd like the axis to be a little more descriptive, perhaps showing the bin range for each bin and the percentage of the total that each bin constitutes.
I have looked for this in the Matplotlib documentation but cannot seem to find anything that would allow me to achieve my end goal.
Any help greatly appreciated.
When you set bins=25, 25 equally spaced bins are set between the lowest and highest values encountered. If you use these ranges to mark the bins, things can be confusing due to the arbitrary values. It seems more adequate to round these bin boundaries, for example to multiples of 20. Then, these values can be used as tick marks on the x-axis, nicely between the bins.
The percentages can be added by looping through the bars (rectangular patches). Their height indicates the number of rows belonging to the bin, so dividing by the total number of rows and multiplying by 100 gives a percentage. The bar height, x and half width can position the text.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'time_on_page': np.random.lognormal(4, 1.1, 100)})
max_x = df['time_on_page'].max()
bin_width = max(20, np.round(max_x / 25 / 20) * 20) # round to multiple of 20, use max(20, ...) to avoid rounding to zero
bins = np.arange(0, max_x + bin_width, bin_width)
axes = df.hist(column='time_on_page', bins=bins, grid=False, figsize=(12, 8), color='#86bf91', rwidth=0.9)
ax = axes[0, 0]
total = len(df)
ax.set_xticks(bins)
for p in ax.patches:
    h = p.get_height()
    if h > 0:
        ax.text(p.get_x() + p.get_width() / 2, h, f'{h / total * 100.0 :.0f} %\n', ha='center', va='center')
ax.grid(True, axis='y', ls=':', alpha=0.4)
ax.set_axisbelow(True)
for dir in ['left', 'right', 'top']:
    ax.spines[dir].set_visible(False)
ax.tick_params(axis="y", length=0) # Switch off y ticks
ax.margins(x=0.02) # tighter x margins
plt.show()
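If the bin range itself should appear next to each percentage, the same numbers can be computed without plotting. The following is a minimal sketch with np.histogram; the random sample stands in for df['time_on_page'], the fixed bin_width of 20 is an assumption (the code above derives it from the data), and the label format is only one possibility.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.lognormal(4, 1.1, 100)  # stand-in for df['time_on_page']

bin_width = 20  # assumed width, a multiple of 20 as in the answer above
bins = np.arange(0, values.max() + bin_width, bin_width)
counts, edges = np.histogram(values, bins=bins)
percentages = counts / counts.sum() * 100

# One "range: percentage" label per non-empty bin
labels = [
    f"{lo:.0f}-{hi:.0f} s: {pct:.0f} %"
    for lo, hi, pct in zip(edges[:-1], edges[1:], percentages)
    if pct > 0
]
print("\n".join(labels[:5]))
```

Such strings could be passed to ax.text in place of the bare percentage, at the cost of more crowded labels.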
I am trying to show the frequency of my data throughout the hours of the day, using a histogram, in 3 hour intervals. I therefore use 8 bins.
plt.style.use('seaborn-colorblind')
plt.figure(figsize=(10,5))
plt.hist(comments19['comment_hour'], bins = 8, alpha = 1, align='mid', edgecolor = 'white', label = '2019', density=True)
plt.title('2019 comments, 8 bins')
plt.xticks([0,3,6,9,12,15,18,21,24])
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
However, the ticks are not aligning with the bin edges, as seen from the image below.
You can do either:
plt.figure(figsize=(10,5))
# define the bin and pass to plt.hist
bins = [0,3,6,9,12,15,18,21,24]
plt.hist(comments19['comment_hour'], bins = bins, alpha = 1, align='mid',
         edgecolor = 'white', label = '2019', density=True)
# remove this line
# plt.xticks([0,3,6,9,12,15,18,21,24])
plt.title('2019 comments, 8 bins')
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
Or:
fig, ax = plt.subplots()
bins = np.arange(0,25,3)
comments19['comment_hour'].plot.hist(ax=ax,bins=bins)
# other plt format
If you set bins=8, matplotlib will set 9 evenly spaced boundaries, from the lowest value in the input array (0) to the highest (23), so at [0.0, 2.875, 5.75, 8.625, 11.5, 14.375, 17.25, 20.125, 23.0]. To get the 9 boundaries at 0, 3, 6, ... you need to set them explicitly.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('seaborn-colorblind')
comments19 = pd.DataFrame({'comment_hour': np.random.randint(0, 24, 100)})
plt.figure(figsize=(10, 5))
plt.hist(comments19['comment_hour'], bins=np.arange(0, 25, 3), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.title('2019 comments, 8 bins')
plt.xticks(np.arange(0, 25, 3))
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
Note that your density=True means that the total area of the histogram is 1. As each bin is 3 hours wide, the sum of all the bin heights will be 0.33 and not 1.00 as you might expect. To really get a y-axis with relative frequencies, you could make the internal bin widths 1 by dividing the hours by 3. Afterwards you can relabel the x-axis back to hours.
So, the following changes could be made so that the bin heights sum to 100 %:
from matplotlib.ticker import PercentFormatter
plt.hist(comments19['comment_hour'] / 3, bins=np.arange(9), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.xticks(np.arange(9), np.arange(0, 25, 3))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
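The 0.33 figure can be verified with np.histogram. The sketch below also shows an alternative that is not part of the answer above: weighting every sample by 1/N, which yields relative frequencies without rescaling the x-axis (plt.hist accepts the same weights argument). The random hours are an assumption standing in for comments19['comment_hour'].

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, 1000)  # stand-in for comments19['comment_hour']
bins = np.arange(0, 25, 3)

# density=True normalizes the area, so the heights sum to 1/3 for 3-hour bins
density, _ = np.histogram(hours, bins=bins, density=True)
print(density.sum())

# Alternative: weight each sample by 1/N so the heights are relative frequencies
weights = np.full(hours.shape, 1 / hours.size)
rel_freq, _ = np.histogram(hours, bins=bins, weights=weights)
print(rel_freq.sum())
```

Both routes give the same bar shape; they differ only in the scale of the y-axis.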
Is there a way of specifying the position of axis labels?
labelpad sets the space between tick labels and the axis label.
Since the width of the tick labels is unknown, it thus appears to be impossible to position axis labels precisely.
Here is an MWE where I would like the ylabels of both subplots to be vertically aligned:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(19680801)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
fig, axs = plt.subplots(2,1)
for ax in axs:
    n, bins, patches = ax.hist(x, 50, density=True, facecolor='g', alpha=0.75)
    ax.set_ylabel('Probability $y$')
    ax.grid(True)
    ax.set_yticklabels([ r'\$\num{{{:g}}}\$'.format(item) for item in ax.get_yticks().tolist() ])
fig.show()
I tried this, but it does not work:
fig.canvas.draw()
ylabelposition = ax.yaxis.label.get_position()
ax.set_yticklabels([ r'\$\num{{{:g}}}\$'.format(item) for item in ax.get_yticks().tolist() ])
ax.yaxis.label.set_position(ylabelposition)
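One approach that may help here, offered as a sketch rather than as part of the original post, is to place the label in axes coordinates with Axis.set_label_coords, which makes its position independent of the tick label widths. The offset -0.12 and the scaling of the second data set are assumptions chosen for illustration, and the Agg backend is only selected so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(19680801)
x = 100 + 15 * np.random.randn(10000)

fig, axs = plt.subplots(2, 1)
# Different scales give the two axes tick labels of different widths
for ax, scale in zip(axs, (1, 1000)):
    ax.hist(x * scale, 50, density=True, facecolor='g', alpha=0.75)
    ax.set_ylabel('Probability $y$')
    ax.grid(True)
    # Fix the label at the same spot (in axes coordinates) for every subplot
    ax.yaxis.set_label_coords(-0.12, 0.5)

# Alternative: keep automatic placement and let matplotlib align the labels
# fig.align_ylabels(axs)
```

With fixed coordinates the offset has to be large enough for the widest tick labels, whereas fig.align_ylabels works that out automatically.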
I have this code
bins = [0,1,10,20,30,40,50,75,100]
plt.figure(figsize=(15,15))
plt.hist(df.v1, bins = bins)
My problem is that the bin widths as they appear in the figure are proportional to their range in bins. However, I want all bins to come out having the same width.
I'm not sure how you'll make sense of the result, but you can use numpy.histogram to calculate the height of your bars, then plot those directly against an arbitrary x-scale.
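import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(loc=50, scale=200, size=(2000,))
bins = [0,1,10,20,30,40,50,75,100]
fig = plt.figure()
# top: regular histogram, bar widths proportional to the bin ranges
ax = fig.add_subplot(211)
ax.hist(x, bins=bins, edgecolor='k')
# bottom: the same counts drawn as equal-width bars at positions 0, 1, 2, ...
ax = fig.add_subplot(212)
h,e = np.histogram(x, bins=bins)
ax.bar(range(len(bins)-1),h, width=1, edgecolor='k')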
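Because the equal-width bars sit at positions 0, 1, 2, ..., the x-axis no longer shows the actual bin ranges. A small sketch of how matching labels could be built follows; the "lo-hi" label format is an assumption, and the random sample stands in for df.v1.

```python
import numpy as np

bins = [0, 1, 10, 20, 30, 40, 50, 75, 100]
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=200, size=2000)  # stand-in for df.v1

heights, _ = np.histogram(x, bins=bins)
positions = np.arange(len(heights))  # one equal-width slot per bin

# Describe each bar by the range it covers, e.g. "0-1", "1-10", ...
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]
print(list(zip(labels, heights)))
```

The labels can then be attached to the bar chart with ax.set_xticks(positions) followed by ax.set_xticklabels(labels).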