Normed histogram y-axis larger than 1 - python

Sometimes when I create a histogram, using say seaborn's distplot function, with norm_hist=True, the y-axis stays below 1, as I would expect for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?

The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can be quite large even though their areas sum to one. The height of a bar times its width is the probability that a value falls in that range. For the height to equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because the bin width is 1.
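The same arithmetic can be checked without a plot. This is a minimal sketch, assuming NumPy's np.histogram with density=True as a stand-in for what the density histogram draws; the seed and variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
a = rng.normal(0, 0.01, 100_000)        # same distribution as above, in meters

# Density histogram in meters: bins are 0.001 wide
dens_m, edges_m = np.histogram(a, bins=np.arange(-0.04, 0.04, 0.001), density=True)
# Same data in millimeters: bins are 1 wide
dens_mm, edges_mm = np.histogram(a * 1000, bins=np.arange(-40, 40, 1), density=True)

# Probability of each bin = height * width
prob_m = dens_m * np.diff(edges_m)
prob_mm = dens_mm * np.diff(edges_mm)

print(dens_m.max(), prob_m.max())    # height far above 1, probability well below 1
print(dens_mm.max(), prob_mm.max())  # height and probability nearly equal (width 1)
print(prob_m.sum(), prob_mm.sum())   # both sets of areas sum to 1
```

Only the heights change with the unit of measurement; the per-bin probabilities stay the same.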
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
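As a rough check of the Pareto remark, here is a small sketch. It assumes the scale parameter x_m = 1 (the text above only fixes α = 3), and the helper names pareto_pdf and pareto_cdf are made up for this example.

```python
import numpy as np

def pareto_pdf(x, alpha=3.0, x_m=1.0):
    """Pareto density alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    x = np.asarray(x, dtype=float)
    safe = np.maximum(x, x_m)            # avoids dividing by values below x_m
    return np.where(x >= x_m, alpha * x_m**alpha / safe**(alpha + 1), 0.0)

def pareto_cdf(x, alpha=3.0, x_m=1.0):
    """Pareto cumulative distribution 1 - (x_m / x)**alpha for x >= x_m, else 0."""
    x = np.asarray(x, dtype=float)
    safe = np.maximum(x, x_m)
    return np.where(x >= x_m, 1.0 - (x_m / safe)**alpha, 0.0)

print(pareto_pdf(1.0))    # 3.0: the density is well above 1 at x = x_m
print(pareto_pdf(1.2))    # still above 1
print(pareto_cdf(1e6))    # very close to 1: the total probability is still 1
```

With these assumptions the density exceeds 1 between x = 1 and roughly x = 1.32, while the probability of any interval never exceeds 1.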

Related

Frequencies of values in histogram

This is my first post, so please bear with me
Here is the code
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=100)
ax1.hist(sample,bins=100)
ax1.set_title('n={}'.format(len(sample)))
print(len(np.unique(sample))) ##outputs 100 as expected
My question is: if I pass bins=100 and the number of samples is also 100, why doesn't it plot a bar for every single sample, and why does the output plot contain frequencies greater than 1?
With default parameters, all bins get the same width. 100 bins means the width of each bin is 1/100th of the total width. The total width goes from smallest to the largest of the list of samples.
Due to the choice of boundaries, at least one point will end up in the first bin, one in the last bin, but most will end up in the central bins and many of the outermost bins stay empty.
Having all bins the same width is often desired. A histogram is meant to show in which regions there are more samples and where there are fewer, and whether there is just one peak or several. Generally, to convey interesting information about the data, the number of bins should be much smaller than the number of samples.
Here is a plot to illustrate what's happening. As 100 bins create a very crowded plot, the example uses just 20 samples and 20 bins. With so few samples, they will be spread out a bit more than with more samples. hist returns 3 arrays: one with the contents of each bin, one with the boundaries between the bins (this is one more than the number of bins) and one with the graphical objects (rectangular patches). The boundaries can be used to show their position.
import matplotlib.pyplot as plt
import numpy as np
N = 20
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=N)
bin_values, bin_bounds, _ = ax1.hist(sample, bins=N, label='Histogram')
ax1.set_title(f'{len(np.unique(sample))} samples')
ax1.plot(np.repeat(bin_bounds, 3), np.tile([0, -1, np.nan], len(bin_bounds)), label='Bin boundaries' )
ax1.scatter(sample, np.full_like(sample, -0.5), facecolor='none', edgecolor='crimson', label='Sample values')
ax1.axhline(0, color='black')
plt.legend()
plt.show()
Here is what 100 samples and 100 bins look like:
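The bin contents can also be inspected numerically. A minimal sketch, assuming np.histogram uses the same default binning as ax.hist (equal-width bins from the smallest to the largest sample); the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
sample = rng.normal(loc=0.0, scale=1.0, size=100)

counts, edges = np.histogram(sample, bins=100)

print(np.diff(edges)[:3])      # all bins share one width: (max - min) / 100
print(counts[0], counts[-1])   # the smallest and largest sample sit in the first and last bin
print((counts == 0).sum())     # number of empty bins
print(counts.max())            # the fullest bin holds more than one sample
```

Because the samples cluster near the center, several of them share a bin there, which leaves other bins empty.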

Why density histogram shows a bit weird values on y-axis?

I have a dataframe with values:
user value
1 0
2 1
3 4
4 2
5 1
When I try to plot a histogram with density=True, it shows a pretty weird result:
df.plot(kind='hist', density=True)
I know for certain that the first bin covers almost 100% of the values, so the density in this case should be more than 0.8. But the plot shows something around 0.04.
How could that happen? Maybe I have the meaning of density wrong.
By the way, there are about 800,000 values in the dataframe, in case it's related. Here is the describe output of the dataframe:
count 795846.000000
mean 5.220350
std 20.600285
min -3.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 247.000000
If you are interested in probability and not probability density I think you want to use weights instead of density. Take a look at this example to see the difference:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'x':np.random.normal(loc=5, scale=10, size=80000)})
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))
# left: probability density; right: probability per bin
df.plot(kind='hist', density=True, bins=np.linspace(-100, 100, 30), ax=ax0)
df.plot(kind='hist', bins=np.linspace(-100, 100, 30), weights=np.ones(len(df))/len(df), ax=ax1)
With density=True, the heights are normalized so that the total area of the bars is 1. With weights of 1/N per sample, they are normalized so that the heights of the bars sum to 1.
You have misunderstood the meaning of density. Refer to the documentation of numpy.histogram (I couldn't find the exact pandas one, but the mechanism is the same):
https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
"Density ... If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1"
This means that the sum of the histogram areas is one, not the sum of the heights. In particular you will get the probability to be in a bin by multiplying the height by the width of the bin.
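To see where a value like 0.04 can come from, here is a sketch on made-up data that only roughly mimics the describe() output above (mostly 0 and 1, minimum -3, maximum 247). It assumes the default of 10 equal-width bins, so each bin is 25 wide.

```python
import numpy as np

# Made-up data: mostly 0 and 1, with min -3 and max 247 (1000 values in total)
values = np.concatenate([[-3.0], np.zeros(600), np.ones(300),
                         np.full(60, 4.0), np.full(38, 40.0), [247.0]])

# 10 equal-width bins over [min, max], so each bin is 25 wide
density, edges = np.histogram(values, bins=10, density=True)
width = edges[1] - edges[0]

# Weights of 1/N per sample give the probability per bin instead
weights = np.ones(len(values)) / len(values)
prob, _ = np.histogram(values, bins=10, weights=weights)

print(width)               # 25.0
print(density[0])          # about 0.038, i.e. 0.961 / 25
print(density[0] * width)  # about 0.961, the share of values in the first bin
print(prob[0])             # about 0.961, which is what the weights version plots
```

A first bin that holds nearly all the values and is 25 units wide ends up with a density height near 1/25 = 0.04.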

Seaborn Distplot with Density on y-axis

I have looked for solutions already but could not find one that worked for my problem. I am trying to plot a histogram with a density function showing the density on the y-axis. meanopa are average logreturns of the S&P500.
What I do not understand is the following.
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count.
This is implied if a KDE or fitted density is plotted.
Since kde=True in my case, I am wondering why there is the number of observations on the y-axis.
sns.distplot(meanopa, hist=True, kde=True, bins=20, color = 'darkblue',
hist_kws={'edgecolor':'black'}, kde_kws={'linewidth': 4})
Thanks in advance and again I would appreciate any sort of support.
Cheers!
Your result is ok. The y-axis is not showing counts, but the probability density (actually the kernel density estimate). Since your numbers are very small, the x-axis also covers a very narrow interval. From your plot, if you draw a rectangle of 0.002 x 500 to approximate the total area under the curve, the area is around 1, as expected.
As a note, this is a reproducible version of your problem, you can play with the rescaling (min_rescale and max_rescale values) if you want to see how the shape of the probability density changes.
import random
import seaborn as sns
random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=True, bins=5, color = 'darkblue',
hist_kws={'edgecolor':'black'}, kde_kws={'linewidth': 4})
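The 0.002 x 500 estimate can be reproduced with a plain density histogram. This sketch uses np.histogram rather than a KDE, and it fixes the histogram range to the full rescaling interval; both choices are assumptions made to keep the numbers exact.

```python
import random
import numpy as np

random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for _ in range(100)]

# Fix the range to the full interval so the 5 bins span exactly 0.002
heights, edges = np.histogram(close2, bins=5, range=(min_rescale, max_rescale), density=True)

area = (heights * np.diff(edges)).sum()
print(heights)          # individual heights scatter around 500
print(heights.mean())   # 500 up to rounding, i.e. 1 / 0.002
print(area)             # 1.0 up to rounding
```

Shrinking the interval further pushes the heights up in proportion, while the area stays at 1.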
In case you are not interested in the probability density function but in the probability/frequency of each bin, which is the count of samples in the bin divided by the total number of samples, you can use the 'weights' key of the hist_kws parameter. Applying this to the example code of lrnzcig,
import random
import numpy as np
import seaborn as sns
random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=False, bins=5, color = 'darkblue',
hist_kws={'edgecolor':'black', 'weights': np.ones(len(close2))/len(close2)})
results in the following plot:
[Figure: probabilities of histogram bins using seaborn's distplot]
Note that the result is not a probability density function; instead, the heights of the bins sum to 1 regardless of the bin widths.
However, this makes no sense when you are also plotting a KDE.
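What the weights do can be verified with NumPy alone. A minimal sketch, where the uniform data is only a stand-in for close2 and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed
close2 = rng.uniform(-0.001, 0.001, size=100)  # stand-in for the data above

# Each sample contributes 1/N, so a bin's height is count / N
weights = np.ones(len(close2)) / len(close2)
prob, edges = np.histogram(close2, bins=5, weights=weights)
counts, _ = np.histogram(close2, bins=5)

print(prob)        # per-bin probabilities, equal to counts / 100
print(prob.sum())  # 1.0 up to rounding, whatever the bin widths are
```

In seaborn 0.11 and later, where distplot is deprecated, sns.histplot(close2, bins=5, stat="probability") should plot these same values directly.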

What is y axis in seaborn distplot?

I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, bins heights don't add up to 1, which means y axis doesn't show probability, it's something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shows probability, as the bin heights sum up to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bin sum up to 0.6+0.6=1.2 which is already greater than 1, so y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small list l with 3 possible values. So the number of bins is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
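For the small list from the question, the numbers work out as follows. This is a sketch that assumes np.histogram with density=True matches what norm_hist=True draws for 3 equal-width bins.

```python
import numpy as np

l = [1, 3, 2, 1, 3]

# 3 equal-width bins over [1, 3], so each bin is 2/3 wide
density, edges = np.histogram(l, bins=3, density=True)
widths = np.diff(edges)

print(density)           # [0.6 0.3 0.6]: count / (N * width)
print(density * widths)  # [0.4 0.2 0.4]: the probabilities 2/5, 1/5, 2/5
print((density * widths).sum())
```

The heights 0.6 and 0.6 from the question are densities; multiplying by the bin width of 2/3 recovers the probabilities.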
The x-axis is the value of the variable just like in a histogram, but what exactly does the y-axis represent?
ANS-> The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
from the reference of https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
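To make the density-versus-probability distinction concrete, here is a sketch with a hypothetical narrow normal density (sigma = 0.1 is an arbitrary choice, and the area helper is made up for this example).

```python
import numpy as np

sigma = 0.1                                   # hypothetical narrow distribution
x = np.linspace(-1.0, 1.0, 20001)
pdf = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def area(xs, ys):
    """Trapezoidal area under ys over xs."""
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))

inside = np.abs(x) <= sigma                   # the interval [-0.1, 0.1]
total_area = area(x, pdf)
interval_prob = area(x[inside], pdf[inside])

print(pdf.max())       # about 3.99: a density value above 1
print(total_area)      # about 1.0: the whole curve integrates to one
print(interval_prob)   # about 0.68: an actual probability, below 1
```

The y-axis value at the peak is almost 4, yet every probability obtained by integrating over an interval stays between 0 and 1.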
This code will help you make something like this :
sns.set_style("whitegrid")
ax = sns.displot(data=df_p,
x='Volume_Tonnes', kind='kde', fill=True, height=5, aspect=2)
# Here you can define the x limit
ax.set(xlim=(-50,100))
ax.set(xlabel = 'Volume Tonnes', ylabel = 'Probability Density')
ax.fig.suptitle("Volume Tonnes Distribution",
fontsize=20, fontdict={"weight": "bold"})
plt.show()

Bar heights and widths in histogram plot of several data

I'm trying to plot a simple histogram with multiple data in parallel.
My data are a set of 2D ndarrays, all of them with the same dimension (in this example 256 x 256).
I have this method to plot the data set:
def plot_data_histograms(data, bins, color, label, file_path):
    """
    Plot multiple data histograms in parallel
    :param data : a set of data to be plotted
    :param bins : the number of bins to be used
    :param color : the color of each data in the set
    :param label : the label of each color in the set
    :param file_path : the path where the output will be saved
    """
    plt.figure()
    plt.hist(data, bins, normed=1, color=color, label=label, alpha=0.75)
    plt.legend(loc='upper right')
    plt.savefig(file_path + '.png')
    plt.close()
And I'm passing my data as follows:
data = [sobel.flatten(), prewitt.flatten(), roberts.flatten(), scharr.flatten()]
labels = ['Sobel', 'Prewitt', 'Roberts Cross', 'Scharr']
colors = ['green', 'blue', 'yellow', 'red']
plot_data_histograms(data, 5, colors, labels, '../Visualizations/StatisticalMeasures/RMSEHistograms')
And I got this histogram:
I know that this may be stupid, but I don't get why my yticks vary from 0 to 4.5. I know that it is due to the normed parameter, but even after reading this:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin). In a
probability density, the integral of the histogram should be 1; you
can verify that with a trapezoidal integration of the probability
density function.
I didn't really get how it works.
Also, since I set my bins equal to five and the histogram has exactly 5 xticks (excluding borders), I don't understand why some bars sit in the middle of some ticks, like the yellow one over the 0.6 tick. Since my number of bins and of xticks match, I thought that each set of four bars should be concentrated inside one interval, as happens with the first four bars, which lie completely inside the [0.0, 0.2] interval.
Thank you in advance.
The reason this is confusing is that you're squishing four histograms onto one plot. To do this, matplotlib narrows the bars and puts a gap between the groups. In a standard histogram, the total area of all the bars is 1 if normed; otherwise the bar heights sum to N, the number of samples. Here's a simple example:
a = np.random.rand(10)
bins = np.array([0, 0.5, 1.0]) # just two bins
plt.hist(a, bins, normed=True)
First note that each bar covers the entire range of its bin: the first bar ranges from 0 to 0.5, and its height is determined by the number of points in that range.
Next, you can see that the total area of the two bars is 1 because normed = True: The width of each bar is 0.5 and the heights are 1.2 and 0.8.
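The formula quoted in the question, n/(len(x)*dbin), can be checked directly. This sketch uses np.histogram with density=True, on the assumption that it computes the same normalization; newer Matplotlib releases no longer accept normed and use density for this as well. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed
a = rng.random(10)
bins = np.array([0, 0.5, 1.0])       # just two bins, as above

counts, _ = np.histogram(a, bins)
heights, _ = np.histogram(a, bins, density=True)

# The documented formula: n / (len(x) * dbin)
by_hand = counts / (len(a) * np.diff(bins))
total_area = (heights * np.diff(bins)).sum()

print(counts)             # points per bin
print(heights, by_hand)   # the same values computed two ways
print(total_area)         # 1.0 up to rounding
```

With bins 0.5 wide, a bin holding 6 of the 10 points gets height 6 / (10 * 0.5) = 1.2, which is how heights above 1 arise.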
Let's plot the same thing again with another distribution so you can see the effect:
b = np.random.rand(10)
plt.hist([a, b], bins, normed=True)
Recall that the blue bars represent exactly the same data as in the first plot, but they're less than half the width now because they must make room for the green bars. You can see that now two bars plus some whitespace covers the range of each bin. So we must pretend that the width of each bar is actually the width of all bars plus the width of the whitespace gap when we are calculating the bin range and bar area.
Finally, notice that the xticks do not necessarily align with the bin edges. If you wish, you can make them align manually with:
plt.xticks(bins)
If you hadn't manually created bins first, you can grab it from plt.hist:
counts, bins, bars = plt.hist(...)
plt.xticks(bins)
