Seaborn probability histplot - KDE normalization - python

When plotting a histplot with stat='density' and kde=True, the area under the KDE curve is equal to 1. From the Seaborn documentation:
"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."
Below is the example of density histplot with default KDE normalized to 1.
However, you can also plot a histogram with stats as count or probability. Plotting KDE on top of those will produce the below:
How is the KDE normalized? The area certainly is not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation covers the KDE plotted for a density histogram. Any help appreciated here, thank you!

Well, the kde has an area of 1. To draw a kde which matches the histogram, the kde needs to be multiplied by the area of the histogram.
For a density plot, the histogram has an area of 1, so the kde can be used as-is.
For a count plot, the sum of the histogram heights will be the length of the given data (each data item will belong to exactly one bar). The area of the histogram will be that total height multiplied by the width of the bins. (When the bins wouldn't have equal widths, adjusting the kde would be quite tricky).
For a probability plot, the sum of the histogram heights will be 1 (for 100 %). The total area will be the bin_width multiplied by the sum of the heights, so equal to the bin_width.
Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np
data = [115, 127, 128, 145, 160]
bin_values, bin_edges = np.histogram(data, bins=4)
bin_width = bin_edges[1] - bin_edges[0]
total_area = bin_width * len(data)
kde = gaussian_kde(data)
x = np.linspace(bin_edges[0], bin_edges[-1], 200)
fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}
axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
axs[0].plot(x, kde(x), color='dodgerblue')
axs[0].set_ylabel('density')
axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
axs[1].set_ylabel('probability')
axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
axs[2].set_ylabel('count')
plt.tight_layout()
plt.show()
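The scaling factors used above can also be checked numerically, without plotting. The sketch below is only an illustration: `gaussian_kde_values` and `area` are hypothetical helper names, the KDE is hand-rolled with numpy using a Scott's-rule bandwidth (an assumption; seaborn or scipy may be configured differently), and the integral is a simple trapezoid sum over a wide grid.

```python
import numpy as np

data = np.array([115, 127, 128, 145, 160], dtype=float)
counts, edges = np.histogram(data, bins=4)
bin_width = edges[1] - edges[0]


def gaussian_kde_values(samples, x):
    """Hand-rolled Gaussian KDE; Scott's rule bandwidth is an assumption."""
    n = len(samples)
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    z = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


def area(x, y):
    """Trapezoid-rule area under the curve y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))


# Grid wide enough that the Gaussian tails are negligible at both ends
x = np.linspace(data.min() - 200, data.max() + 200, 20001)
kde_y = gaussian_kde_values(data, x)
kde_area = area(x, kde_y)

# Area of the histogram for each stat (bar height times bar width, summed)
hist_areas = {
    "density": float((counts / (len(data) * bin_width) * bin_width).sum()),
    "probability": float((counts / len(data) * bin_width).sum()),
    "count": float((counts * bin_width).sum()),
}

# Factor by which the unit-area KDE is multiplied for each stat
scale = {"density": 1.0, "probability": bin_width, "count": bin_width * len(data)}
scaled_kde_areas = {stat: area(x, kde_y * factor) for stat, factor in scale.items()}

for stat in scale:
    print(stat, hist_areas[stat], scaled_kde_areas[stat])
```

For each stat, the area of the scaled KDE should agree with the area of the histogram, which is the whole point of the multiplication.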

As far as I understand it, the KDE (kernel density estimation) is simply smoothing the curve formed from the data points. What changes between the three representations is the values from which it is computed:
With density estimation, the total area under the KDE curve is 1, which means you can estimate the probability of finding a value between two bounding values by computing an integral. I think they smooth the data points with a curve, compute the area under the curve and divide all the values by the area so that the curve keeps the same shape but the area becomes 1.
With probability estimation, the total area under the KDE curve does not matter: each category has a certain probability (e.g. P(x in [115; 125]) = 0.2) and the sum of the probabilities for each category is equal to 1. So instead of computing the area under the KDE curve, they would count all the samples and divide each bin's count by the total.
With the counting estimation, you get a standard bin/count distribution and the KDE is just smoothing the numbers so that you can estimate the distribution of values, that is, what your observations might look like if you took more measurements or used more bins.
So all in all, the KDE curve stays the same: it is a smoothing of the sample data distribution. But a factor is applied to the sample values depending on which representation of the data you are interested in.
However, take what I am writing with a grain of salt: I think I am not far from the truth, from a mathematical point of view, but maybe someone could explain it in more precise terms, or correct me if I'm wrong.
Here is some reading about kernel density estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation. In short, it is a smoothing method with some special mathematical properties depending on the parameters used.

Related

Histograms intersection point - Python

I have two histograms generated with
plt.figure(figsize=(30,10))
sns.set()
sns.distplot(A, hist=True, kde=True, bins=50, color = 'darkgrey', hist_kws={'edgecolor':'white'}, kde_kws={'linewidth': 3})
sns.distplot(B, hist=True, kde=True, bins=50, color = 'lightgreen', hist_kws={'edgecolor':'white'}, kde_kws={'linewidth': 3})
plt.xlabel("X")
plt.ylabel("Frequency")
plt.title("A vs B")
Now I need to find the coordinates of the intersection point between the two histograms, any idea about how can I do that?
Histograms don't really have "intersections" per se because they are discrete binned distributions. However, you could approximate each histogram with a continuous probability distribution (this would require you to know the distribution type of the data though, e.g. normal, lognormal). With continuous probability distribution functions (pdfs), you could then set these pdfs equal to one another and solve for the intersection.
The discrete analog would be to identify the bin or bins at which the two distributions intersect (i.e. histogram A is greater than histogram B in bin 10-20, and less than histogram B in bin 20-30, so the "actual" continuous pdfs intersect somewhere within the range 10-30). Then, you could use some form of interpolation to estimate the exact intersection. The easiest, but not necessarily the most accurate, interpolation strategy would be linear interpolation.
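The discrete approach can be sketched roughly as follows. This is an illustration under several assumptions: `a` and `b` are synthetic stand-ins for A and B, both histograms share the same bin edges, each bar height is treated as a value at its bin center, and the crossing is located by linear interpolation between adjacent centers.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)  # synthetic stand-in for A
b = rng.normal(3.0, 1.0, 5000)  # synthetic stand-in for B

# Shared bin edges so that the bars of both histograms line up
edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 51)
hist_a, _ = np.histogram(a, bins=edges, density=True)
hist_b, _ = np.histogram(b, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2

diff = hist_a - hist_b
intersections = []
for i in range(len(diff) - 1):
    if diff[i] == 0 and hist_a[i] > 0:
        # The two bars have exactly the same (non-zero) height in this bin
        intersections.append((float(centers[i]), float(hist_a[i])))
    elif diff[i] * diff[i + 1] < 0:
        # Sign change between bin i and bin i + 1: interpolate linearly
        t = diff[i] / (diff[i] - diff[i + 1])
        x_cross = centers[i] + t * (centers[i + 1] - centers[i])
        y_cross = hist_a[i] + t * (hist_a[i + 1] - hist_a[i])
        intersections.append((float(x_cross), float(y_cross)))

print(intersections)
```

For these two synthetic normal samples the crossing should land near the midpoint of the two means. With noisy data more than one sign change can appear, so the result is a list rather than a single point.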

Normed histogram y-axis larger than 1

Sometimes when I create a histogram, using say seaborn's distplot function, with norm_hist=True, the y-axis is less than 1 as expected for a PDF. Other times it takes on values greater than one.
For example if I run
import numpy as np
import seaborn as sns
sns.set()
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can sum to quite a large number even though their areas sum to one. The height of a bar times its width is the probability that a value would fall in that range. For the height to equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because of the bin width of 1.
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
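The same comparison can be made numerically with numpy alone. This is a sketch, assuming a fixed seed and the same 80 bins as in the plots above: the density heights differ by a factor of 1000 between the two units, while height times width (the probability per bin) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
meters = rng.normal(0, 0.01, 100_000)
millimeters = meters * 1000  # same data, different unit

# 80 bins of width 0.001 m and 80 bins of width 1 mm
dens_m, edges_m = np.histogram(meters, bins=80, range=(-0.04, 0.04), density=True)
dens_mm, edges_mm = np.histogram(millimeters, bins=80, range=(-40, 40), density=True)

# Probability per bin = bar height * bar width
prob_m = dens_m * np.diff(edges_m)
prob_mm = dens_mm * np.diff(edges_mm)

print("max density in meters:     ", dens_m.max())
print("max density in millimeters:", dens_mm.max())
print("max probability per bin:   ", prob_m.max(), prob_mm.max())
```

The density axis depends on the unit of the data; the per-bin probabilities do not.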

Seaborn Distplot with Density on y-axis

I have looked for solutions already but could not find one that worked for my problem. I am trying to plot a histogram with a density function showing the density on the y-axis. meanopa are average logreturns of the S&P500.
What I do not understand is the following.
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count.
This is implied if a KDE or fitted density is plotted.
Since kde=True in my case, I am wondering why there is the number of observations on the y-axis.
sns.distplot(meanopa, hist=True, kde=True, bins=20, color = 'darkblue',
hist_kws={'edgecolor':'black'}, kde_kws={'linewidth': 4})
Thanks in advance and again I would appreciate any sort of support.
Cheers!
Your result is ok. The y-axis is not showing the counts of the histogram, but the probability density (actually the kernel density estimate). Since your numbers are very small, the x-axis also covers a very narrow interval. If you draw a rectangle of 0.002 x 500 on your plot to approximate the total area under the curve, the result is around 1, as expected for a probability density.
As a note, this is a reproducible version of your problem, you can play with the rescaling (min_rescale and max_rescale values) if you want to see how the shape of the probability density changes.
import random
import seaborn as sns
random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=True, bins=5, color='darkblue',
             hist_kws={'edgecolor': 'black'}, kde_kws={'linewidth': 4})
In case you are not interested in the probability density function but in the probabilities/frequencies of each bin which is given by the count of samples in the bin divided by the total number of samples, you can use the 'weights' attribute of the hist_kws parameter. Applying this to the example code of lrnzcig,
import random
import numpy as np
import seaborn as sns
random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=False, bins=5, color='darkblue',
             hist_kws={'edgecolor': 'black', 'weights': np.ones(len(close2)) / len(close2)})
results in the following plot:
probabilities of Histogram bins using seaborn's distplot
Note that the result is not a probability density function; instead, the heights of the bins sum to 1, regardless of the bin widths.
However, this makes no sense when you are performing kde.
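To see the difference between the two normalizations in numbers, here is a small numpy-only sketch. It assumes a numpy random generator instead of the `random` module used above, so the sample values differ from the plots; only the orders of magnitude matter.

```python
import numpy as np

rng = np.random.default_rng(2)
close2 = rng.uniform(-0.001, 0.001, 100)

# Probability per bin: each sample contributes 1/N to its bin
weights = np.ones(len(close2)) / len(close2)
prob, edges = np.histogram(close2, bins=5, weights=weights)

# Density per bin: probability divided by the bin width
dens, _ = np.histogram(close2, bins=5, density=True)

print("probabilities:", prob, "sum =", prob.sum())
print("densities:    ", dens, "sum =", dens.sum())
print("area under density bars:", (dens * np.diff(edges)).sum())
```

The probabilities are of the order of 0.2 and sum to 1, while the densities are of the order of 500 because the data span only about 0.002 on the x-axis.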

What is y axis in seaborn distplot?

I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, the bin heights don't add up to 1, which means the y axis doesn't show probability; it's something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shows probability, as the bin heights sum to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bin sum up to 0.6+0.6=1.2 which is already greater than 1, so y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small list l with 3 possible values. So the number of bins is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
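For the small list from the question, the arithmetic can be worked through with numpy. This is a sketch of the computation, not seaborn's internal code: with 3 bins between 1 and 3 each bin is 2/3 wide, so a probability of 0.4 becomes a density of 0.4 / (2/3) = 0.6.

```python
import numpy as np

l = [1, 3, 2, 1, 3]
counts, edges = np.histogram(l, bins=3)
widths = np.diff(edges)  # each bin is 2/3 wide

probability = counts / counts.sum()           # heights sum to 1
density = counts / (counts.sum() * widths)    # areas sum to 1

print("counts:     ", counts)
print("probability:", probability, "sum of heights =", probability.sum())
print("density:    ", density, "sum of areas =", (density * widths).sum())
```

The density heights (0.6, 0.3, 0.6) are what norm_hist=True shows on the y axis; multiplying each by the bin width recovers the probabilities (0.4, 0.2, 0.4).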
The x-axis is the value of the variable just like in a histogram, but what exactly does the y-axis represent?
ANS-> The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
from the reference of https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
This code will help you make something like this :
sns.set_style("whitegrid")
ax = sns.displot(data=df_p,
x='Volume_Tonnes', kind='kde', fill=True, height=5, aspect=2)
# Here you can define the x limit
ax.set(xlim=(-50,100))
ax.set(xlabel = 'Volume Tonnes', ylabel = 'Probability Density')
ax.fig.suptitle("Volume Tonnes Distribution",
fontsize=20, fontdict={"weight": "bold"})
plt.show()

Why doesn't the `normed` parameter for matplotlib histograms do anything?

I'm confused by the normed argument from matplotlib.pyplot.hist and why it does not change the plot output:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin), i.e.,
the integral of the histogram will sum to 1. If stacked is also True,
the sum of the histograms is normalized to 1.
Default is False
Seems pretty clear. I've seen it called a density function, probability density, etc.
That is, given a random uniform distribution of size 1000 in [0, 1):
Specifying normed=True should change the y-axis to a density axis, where the sum of the bars is 1.0:
But in reality it does nothing of the sort:
r = np.random.uniform(size=1000)
plt.hist(r, normed=True)
And furthermore:
print(plt.hist(r, normed=True)[0].sum())
# definitely not 1.0
10.012123595
So, I have seen @Carsten König's answers to similar questions and am not asking for a workaround. My question is, what then is the purpose of normed? Am I misinterpreting what this parameter actually does?
The matplotlib documentation even gives an example named "histogram_percent_demo", where the integral looks like it would be over a thousand percent.
The heights of the bars do not necessarily sum to one.
It is the area under the curve, which is the same as the integral of the histogram, which equals one:
import numpy as np
import matplotlib.pyplot as plt
r = np.random.uniform(size=1000)
hist, bins, patches = plt.hist(r, normed=True)
print((hist * np.diff(bins)).sum())
# 1.0
normed=True thus returns a histogram which can be interpreted as a probability distribution.
According to matplotlib version 3.0.2,
normed : bool, optional
Deprecated; use the density keyword argument instead.
So if you want density plot, use density=True instead.
Or you can use seaborn.distplot, which by default plots the histogram as a density (with a KDE overlay) rather than as counts.
What normed=True does is scale the area under the curve to be 1, as @unutbu has shown.
density=True keeps the same property (area under curve sums to 1) and is more meaningful and useful.
r = np.random.uniform(size=1000)
hist, bins, patches = plt.hist(r, density=True)
print((hist * np.diff(bins)).sum())
[Out] 1
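The formula from the documentation, n/(len(x)*dbin), can be reproduced by hand. The sketch below uses deliberately unequal bin edges (an arbitrary choice for illustration) to show that the division is done per bin, by that bin's own width.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(size=1000)

# Unequal bin widths, chosen arbitrarily for illustration
edges = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
counts, _ = np.histogram(r, bins=edges)

# n / (len(x) * dbin), computed per bin
manual_density = counts / (len(r) * np.diff(edges))
numpy_density, _ = np.histogram(r, bins=edges, density=True)

print("manual density: ", manual_density)
print("numpy density:  ", numpy_density)
print("sum of heights: ", manual_density.sum())
print("sum of areas:   ", (manual_density * np.diff(edges)).sum())
```

For uniform data every bar has a height close to 1, so the heights sum to roughly the number of bins, while the areas sum to 1.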
