Matplotlib: incorrect histograms - python

So I just learned about histograms on Khan Academy:
When I go plot something similar in Matplotlib, it is plotted differently. Why?
Shouldn't bins be completely filled? And since bin 5-6 has 3 counts (5, 6, 6), shouldn't it consist of a single bar of height 3? I'm confused.

By default, plt.hist() creates 10 bins (that is, 11 edges). The default is stated in the documentation and comes from your rc parameter rcParams["hist.bins"] = 10.
So if you provide data in the range [1, 6], hist will count the number of values in the bins [1, 1.5), [1.5, 2), [2, 2.5), [2.5, 3), [3, 3.5), [3.5, 4), [4, 4.5), [4.5, 5), [5, 5.5), [5.5, 6]. You can tell that this is the case by looking at the text output of hist() (in addition to the graph).
hist() returns 3 objects when called:
the height of each bar (that is the number of items in each bin), equivalent to the column "#" in that Khan video
the edges of the bins, which is roughly equivalent to the column "Bucket" in the video
a list of matplotlib objects that you can use to tweak their appearance when needed.
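As a quick check of the three return values listed above, here is a minimal sketch that calls hist on the sample list [1, 2, 3, 4, 5, 6, 6] and inspects what comes back. It assumes the default rcParams["hist.bins"] of 10; the Agg backend line is only an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt

data = [1, 2, 3, 4, 5, 6, 6]
counts, edges, patches = plt.hist(data)  # no bins argument, so the default of 10 is used

print(counts)        # bar heights, one per bin
print(edges)         # 11 edges from 1.0 to 6.0 in steps of 0.5
print(len(patches))  # one Rectangle per bar
```

Only the last bin, [5.5, 6], holds more than one value (the two 6s), which is why the default plot never shows a bar of height 3.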
In summary:
If you want to have bars of width 1, then you need to specify either the number of bins (5), or the edges of your bins.
These two calls provide the same result:
plt.hist(counts, bins=5)
plt.hist(counts, bins=[1,2,3,4,5,6])
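To check that the two calls really bin the data identically, one option is to compare the underlying numpy.histogram results, since plt.hist delegates the binning to it. This is a small sketch on the sample list [1, 2, 3, 4, 5, 6, 6]; the variable names are made up for illustration.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 6]

# Five equal-width bins over the data range [1, 6]
counts_n, edges_n = np.histogram(data, bins=5)
# The same bins given explicitly as edges
counts_e, edges_e = np.histogram(data, bins=[1, 2, 3, 4, 5, 6])

print(counts_n, edges_n)
print(counts_e, edges_e)
```

The last bin is closed on both sides, so it collects 5, 6 and 6 and gives the single bar of height 3 the question expected.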
EDIT
Here is a function that can help you see the "buckets" chosen by hist:
import numpy as np
import matplotlib.pyplot as plt

def hist_and_bins(x, ax=None, **kwargs):
    """Plot a histogram and label each bar with its bin interval."""
    ax = ax or plt.gca()
    counts, edges, patches = ax.hist(x, **kwargs)
    bin_edges = [[a,b] for a,b in zip(edges, edges[1:])]
    ticks = np.mean(bin_edges, axis=1) # one tick at the center of each bin
    tick_labels = ['[{}-{})'.format(l,r) for l,r in bin_edges]
    tick_labels[-1] = tick_labels[-1][:-1]+']' # last bin is a closed interval
    ax.set_xticks(ticks)
    ax.set_xticklabels(tick_labels)
    return counts, edges, patches, ax.get_xticks()

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(9,3))
ax1.hist([1,2,3,4,5,6,6])
hist_and_bins([1,2,3,4,5,6,6], ax=ax2)
hist_and_bins([1,2,3,4,5,6,6], ax=ax3, bins=5, ec='w')
fig.autofmt_xdate() # rotates the x tick labels so they do not overlap

Related

Plotting with sns.catplot gives bad graphs

I am trying to plot my data so that the predicted values are superimposed on the actual data values. It does the job, but the bars that represent the y value become ridiculously small and uninterpretable, and the x-axis labels only show at the bottom of the last graph.
A bit of background: each class ID gets its own subplot, with its own actual and predicted values.
g = sns.catplot(data=plt_df,
                y='Outcome',
                x='DT',
                kind='bar',
                ci=None,
                hue='Outcome_Type',
                row='CLASS_ID',
                palette=sns.color_palette(['red', 'blue']),
                height=10,
                aspect=3.5)
g.fig.subplots_adjust(hspace=1)
fig, ax = plt.subplots(figsize=(20, 9))
g.fig.suptitle("Distribution Plot Comparing Actual and Predicted Visits given calibrated Betas - " + describe_plot)
g.set_xlabels('Drive Time (Mins)')
g.set_ylabels('Visits Percentage')
plt.xticks(rotation= 90)
plt.show()

seaborn distplot different bar width on each figure

Sorry for giving an image; however, I think it is the best way to show my problem.
As you can see, the bin widths are all different. From my understanding, each bin shows a range of rent_hour. I am not sure why different figures have different bin widths even though I didn't set any.
My code is as follows:
figure, axes = plt.subplots(nrows=4, ncols=3)
figure.set_size_inches(18,14)
plt.subplots_adjust(hspace=0.5)
for ax, age_g in zip(axes.ravel(), age_cat):
    group = total_usage_df.loc[(total_usage_df.age_group == age_g) & (total_usage_df.day_of_week <= 4)]
    sns.distplot(group.rent_hour, ax=ax, kde=False)
    ax.set(title=age_g)
    ax.set_xlim([0, 24])
figure.suptitle("Weekday usage pattern", size=25);
Additional question:
The question "Seaborn: How to get the count in y axis for distplot using PairGrid" says that kde=False makes the y-axis show counts. However, the doc at http://seaborn.pydata.org/generated/seaborn.distplot.html uses kde=False and still seems to show something else. How can I set the y-axis to show counts?
I've tried:
sns.distplot(group.rent_hour, ax=ax, norm_hist=True), which still seems to give something other than counts.
sns.distplot(group.rent_hour, ax=ax, kde=False), which gives me counts, but I don't know why.
Answer 1:
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count.
This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
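To make the difference between counts and density concrete, here is a small numeric sketch with numpy.histogram. The hours array is a made-up stand-in for rent_hour, and 12 bins is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=500)  # hypothetical stand-in for rent_hour

counts, edges = np.histogram(hours, bins=12)
density, _ = np.histogram(hours, bins=12, density=True)
widths = np.diff(edges)

# Density is the count divided by (number of points * bin width)
print(np.allclose(density, counts / (counts.sum() * widths)))

# The heights alone do not sum to 1, but the area (height * width) does
print(density.sum(), (density * widths).sum())
```

So with kde=False and norm_hist left at its default the y-axis shows plain counts, while any normalized variant shows heights whose area, not sum, equals 1.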
Answer 2:
# Plotting hist without kde
ax = sns.distplot(your_data, kde=False)
# Creating another Y axis
second_ax = ax.twinx()
# Plotting kde without hist on the second Y axis
sns.distplot(your_data, ax=second_ax, kde=True, hist=False)
# Removing Y ticks from the second axis
second_ax.set_yticks([])
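On the original question of why the bin widths differ between subplots: as far as I know, distplot chooses the number of bins from the data when bins is not given (a Freedman-Diaconis style reference rule), so every age group ends up with its own bins. The sketch below illustrates the effect with numpy's "fd" rule on two made-up samples, then builds one shared set of edges. The sample data and names are assumptions, not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical age groups with different sizes and spreads
group_a = rng.normal(9, 1.5, size=400).clip(0, 24)
group_b = rng.normal(15, 5.0, size=80).clip(0, 24)

# Data-driven edges: each group gets a different bin width
edges_a = np.histogram_bin_edges(group_a, bins="fd")
edges_b = np.histogram_bin_edges(group_b, bins="fd")
width_a = edges_a[1] - edges_a[0]
width_b = edges_b[1] - edges_b[0]
print(width_a, width_b)

# Shared edges: 24 one-hour bins, identical for every subplot
shared_edges = np.arange(0, 25)
```

Passing the same edges to each call, for example sns.distplot(group.rent_hour, ax=ax, kde=False, bins=shared_edges), should then give equal bar widths on every axis.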

Excluding a certain range of bins in a matplotlib histogram?

I'm using matplotlib to look at how wins are distributed based on betting odds for the MLB. The issue is that because betting odds are either >= 100 or <= -100, there's a big gap in the middle of my histogram.
Is there any way to exclude certain bins (specifically anything between -100 and 100) so that the bars of the chart flow more smoothly?
Link to current histogram
Here's the code I have right now:
num_bins = 20
fig, ax = plt.subplots()
n, bins, patches = ax.hist(winner_odds_df['WinnerOdds'], num_bins,
                           range=range_of_winner_odds)
ax.set_xlabel('Betting Odds')
ax.set_ylabel('Win Frequency')
ax.set_title('Histogram of Favorite Win Frequency Based on Betting Odds (2018)')
fig.tight_layout()
plt.show()
You could break your chart's x-axis as explained here, by plotting on two different axes that are made to visually look like one plot. The essential part, rewritten to apply to the x-axis instead of the y-axis, is:
f, (axl, axr) = plt.subplots(1, 2, sharey=True)
# plot the same data on both axes
axl.hist(winner_odds_df['WinnerOdds'], num_bins)
axr.hist(winner_odds_df['WinnerOdds'], num_bins)
# zoom-in / limit the view to different portions of the data
axl.set_xlim(-500, -100) # negative odds (<= -100)
axr.set_xlim(100, 500) # positive odds (>= 100)
# hide the spines between axl and axr
axl.spines['right'].set_visible(False)
axr.spines['left'].set_visible(False)
axr.yaxis.tick_right()
# How much space to leave between plots
plt.subplots_adjust(wspace=0.15)
See the linked document for how to polish this by adding diagonal break lines. The basic version produced by the code above then looks like this:
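For reference, here is a self-contained sketch of the same broken-axis idea that runs without the original DataFrame. It is a variation rather than the answer's exact code: the odds are synthetic, and each side gets its own explicit bin edges so that no bin spans the empty region between -100 and 100. The Agg backend line is an assumption for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical odds: 300 values in [-399, -100] and 100 values in [100, 299]
odds = np.concatenate([-rng.integers(100, 400, size=300),
                       rng.integers(100, 300, size=100)])

left_edges = np.linspace(-400, -100, 11)   # 10 bins for negative odds
right_edges = np.linspace(100, 300, 11)    # 10 bins for positive odds

fig, (axl, axr) = plt.subplots(1, 2, sharey=True)
n_left, _, _ = axl.hist(odds, bins=left_edges)    # values outside the edges are ignored
n_right, _, _ = axr.hist(odds, bins=right_edges)

axl.set_xlim(-400, -100)
axr.set_xlim(100, 300)
# Hide the spines between the two axes so they read as one plot
axl.spines['right'].set_visible(False)
axr.spines['left'].set_visible(False)
axr.yaxis.tick_right()
fig.subplots_adjust(wspace=0.15)
```

Because values outside the given edges are dropped, each axis only counts its own side of the gap.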

Bar heights and widths in histogram plot of several data

I'm trying to plot a simple histogram with multiple data in parallel.
My data are a set of 2D ndarrays, all of them with the same dimension (in this example 256 x 256).
I have this method to plot the data set:
def plot_data_histograms(data, bins, color, label, file_path):
    """
    Plot multiple data histograms in parallel
    :param data : a set of data to be plotted
    :param bins : the number of bins to be used
    :param color : the color of each data in the set
    :param label : the label of each color in the set
    :param file_path : the path where the output will be saved
    """
    plt.figure()
    plt.hist(data, bins, normed=1, color=color, label=label, alpha=0.75)
    plt.legend(loc='upper right')
    plt.savefig(file_path + '.png')
    plt.close()
And I'm passing my data as follows:
data = [sobel.flatten(), prewitt.flatten(), roberts.flatten(), scharr.flatten()]
labels = ['Sobel', 'Prewitt', 'Roberts Cross', 'Scharr']
colors = ['green', 'blue', 'yellow', 'red']
plot_data_histograms(data, 5, colors, labels, '../Visualizations/StatisticalMeasures/RMSEHistograms')
And I got this histogram:
I know this may be a silly question, but I don't understand why my yticks range from 0 to 4.5. I know that it is due to the normed parameter, but even after reading this:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin). In a
probability density, the integral of the histogram should be 1; you
can verify that with a trapezoidal integration of the probability
density function.
I didn't really get how it works.
Also, since I set the number of bins to five and the histogram has exactly 5 xticks (excluding borders), I don't understand why some bars sit on top of a tick, like the yellow one over the 0.6 tick. Since my number of bins and of xticks match, I thought that each set of four bars should be concentrated inside one interval, as happens with the first four bars, which lie completely inside the [0.0, 0.2] interval.
Thank you in advance.
The reason this is confusing is that you're squishing four histograms onto one plot. To do this, matplotlib narrows the bars and puts a gap between the groups. In a standard histogram, the total area of all bars is 1 if normed; otherwise the bar heights sum to N, the number of data points. Here's a simple example:
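import numpy as np
import matplotlib.pyplot as plt
a = np.random.rand(10)
bins = np.array([0, 0.5, 1.0]) # just two bins
# normed=True was renamed density=True (normed was removed in Matplotlib 3.1)
plt.hist(a, bins, density=True)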
First, note that each bar covers the entire range of its bin: the first bar ranges from 0 to 0.5, and its height is determined by the number of points in that range.
Next, you can see that the total area of the two bars is 1 because the histogram is normalized (normed=True): the width of each bar is 0.5 and the heights are 1.2 and 0.8, so 0.5 * 1.2 + 0.5 * 0.8 = 1.
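The heights 1.2 and 0.8 follow directly from the formula n/(len(x)*dbin) quoted in the question. Here is a minimal numeric sketch with numpy.histogram, using hand-picked values (an assumption, chosen so that 6 points fall below 0.5 and 4 above) instead of random ones. The second part shows one way a height of 4.5 can arise with five bins of width 0.2.

```python
import numpy as np

# Hand-picked sample: 6 values in [0, 0.5) and 4 values in [0.5, 1.0]
a = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])
bins = np.array([0, 0.5, 1.0])

counts, _ = np.histogram(a, bins)
density, _ = np.histogram(a, bins, density=True)
widths = np.diff(bins)

# height = n / (len(x) * dbin): 6 / (10 * 0.5) and 4 / (10 * 0.5)
print(counts, density, (density * widths).sum())

# Five bins of width 0.2: a bin holding 90% of the data reaches 0.9 / 0.2 = 4.5
b = np.array([0.1] * 9 + [0.9])
dens5, edges5 = np.histogram(b, bins=5, range=(0, 1), density=True)
print(dens5)
```

In other words, narrow bins push density values above 1, so yticks up to about 4.5 are plausible when most of the values fall into a single bin of width 0.2.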
Let's plot the same thing again with another distribution so you can see the effect:
b = np.random.rand(10)
plt.hist([a, b], bins, density=True) # density replaces the removed normed argument
Recall that the blue bars represent exactly the same data as in the first plot, but they're less than half the width now because they must make room for the green bars. You can see that two bars plus some whitespace now cover the range of each bin. So when calculating the bin range and bar area, we must pretend that each bar is as wide as the whole bin, that is, the width of all bars in the group plus the whitespace gap.
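One way to see this numerically is to compare the drawn bar width with the bin width. This is a sketch under the assumption of default settings (no rwidth argument); the Agg backend line is only there so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
a, b = rng.random(10), rng.random(10)
bins = np.array([0, 0.5, 1.0])

fig, ax = plt.subplots()
heights, edges, containers = ax.hist([a, b], bins, density=True)
heights = np.asarray(heights)  # one row of bar heights per dataset

bin_width = edges[1] - edges[0]
bar_width = containers[0][0].get_width()  # drawn width of the first bar of dataset a
print(bin_width, bar_width)

# The area only comes out as 1 when the full bin width is used, not the drawn width
print((heights * bin_width).sum(axis=1))
```

The drawn bars are narrower than half a bin, yet each dataset still integrates to 1 over the real bin width.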
Finally, notice that nowhere do the xticks align with the bin edges. If you wish, you can set this to be the case manually, with:
plt.xticks(bins)
If you haven't manually created bins first, you can grab them from plt.hist:
counts, bins, bars = plt.hist(...)
plt.xticks(bins)

Setting axis properties on axis-sharing plots

Say I have a plot with several axis-sharing subplots, such as the one below. How can I control where the x_ticks go in the x-axis shared by all the subplots?
For example, say I want to display the ticks only on the following values of X: 0, 50 and 100. As far as I understand, for the method ax.set_xticks I need to specify an axis, but they all share one, how do I get its handle?
import numpy as np
import matplotlib.pyplot as plt

f, axes = plt.subplots(3, sharex=True, sharey=True)
for ix in range(3): # range replaces Python 2's xrange
    ax = axes[ix]
    t = np.arange(0.0, 100.0, 0.1)
    s = np.sin(0.1*np.pi*t)*np.exp(-t*0.01)
    ax.plot(t,s)
Update:
How can I also have a ylabel for all my subplots that is centered vertically?
Using plt.setp:
plt.setp(axes[-1], xticks=[5,10,45])
FYI, more information here:
http://matplotlib.org/examples/pylab_examples/shared_axis_demo.html
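Putting it together, here is a runnable sketch that applies plt.setp with the tick values from the question (0, 50 and 100) and also covers the update about a single, vertically centered y label. The fig.text approach and the label text "amplitude" are my own assumptions, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(3, sharex=True, sharey=True)
t = np.arange(0.0, 100.0, 0.1)
s = np.sin(0.1 * np.pi * t) * np.exp(-t * 0.01)
for ax in axes:
    ax.plot(t, s)

# Setting the ticks on one axes is enough: axes created with sharex=True share their tick locator
plt.setp(axes[-1], xticks=[0, 50, 100])

# One label for the whole figure, centered vertically (position in figure coordinates)
label = fig.text(0.04, 0.5, "amplitude", va="center", rotation="vertical")
```

Any of the three axes would work as the target of plt.setp; the last one is simply where the tick labels are drawn.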
