Why density histogram shows a bit weird values on y-axis?

I have a dataframe with values:
user value
1 0
2 1
3 4
4 2
5 1
When I try to plot a histogram with density=True, it shows a pretty weird result:
df.plot(kind='hist', density=True)
I know for certain that the first bin covers almost 100% of the values, so the density in this case should be more than 0.8. But the plot shows something around 0.04.
How could that happen? Maybe I have the meaning of density wrong.
By the way, there are about 800 000 values in the dataframe, in case it's related. Here is the describe output of the dataframe:
count 795846.000000
mean 5.220350
std 20.600285
min -3.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 247.000000

If you are interested in probability and not probability density I think you want to use weights instead of density. Take a look at this example to see the difference:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'x':np.random.normal(loc=5, scale=10, size=80000)})
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))
df.plot(kind='hist', density=True, bins=np.linspace(-100, 100, 30), ax=ax0)
df.plot(kind='hist', bins=np.linspace(-100, 100, 30), weights=np.ones(len(df))/len(df), ax=ax1)
If you use density, the histogram is normalized so that the total area of the bars is 1. If you use weights as above, it is normalized so that the heights of the bars sum to 1.
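To check the two normalizations numerically rather than by eye, here is a minimal sketch that calls np.histogram directly. It assumes the pandas plot ends up doing the same histogram computation; the seed is arbitrary and only there for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
x = rng.normal(loc=5, scale=10, size=80000)
bins = np.linspace(-100, 100, 30)         # 29 equal-width bins

# density=True: bar areas (height * width) sum to 1
dens, edges = np.histogram(x, bins=bins, density=True)
area = np.sum(dens * np.diff(edges))

# weights=1/N: bar heights sum to the fraction of samples inside the bins
w = np.ones(len(x)) / len(x)
prob, _ = np.histogram(x, bins=bins, weights=w)
height_sum = prob.sum()

print(area)         # total area of the density bars
print(height_sum)   # total height of the weighted bars
```

Both printed numbers should be 1 up to floating-point error, but they are sums of different things: areas in the first case, heights in the second.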

You understood the meaning of density wrong. Refer to the documentation of numpy histogram (I couldn't find the exact pandas one, but it is the same mechanism):
https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
"Density ... If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1"
This means that the sum of the histogram areas is one, not the sum of the heights. In particular you will get the probability to be in a bin by multiplying the height by the width of the bin.
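As a rough illustration of why a nearly full first bin can still show a height around 0.04, here is a small sketch with made-up data (not the asker's dataframe): 95 zeros plus a handful of values that stretch the range to [-3, 247], split into 10 bins.

```python
import numpy as np

# Hypothetical data shaped like the question: almost everything near 0,
# a few large values stretching the range to [-3, 247].
data = np.array([0] * 95 + [-3, 50, 100, 200, 247], dtype=float)

heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)                  # each bin is 25 units wide here

first_bin_prob = heights[0] * widths[0]  # probability = height * width
total_area = np.sum(heights * widths)

print(heights[0])        # density of the first bin
print(first_bin_prob)    # probability of the first bin
print(total_area)        # total area of all bars
```

Here the first bin holds 96 of the 100 values, so its probability is 0.96, but its height is 0.96 / 25 = 0.0384 because the bin is 25 units wide.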

Related

Seaborn probability histplot - KDE normalization

When plotting histplot with stat='density' and the kde flag set to True, the area under the curve is equal to 1. From the Seaborn documentation:
"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."
Below is the example of density histplot with default KDE normalized to 1.
However, you can also plot a histogram with stats as count or probability. Plotting KDE on top of those will produce the below:
How is the KDE normalized? The area is certainly not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation concerns the KDE plotted for a density histogram. Any help appreciated here, thank you!

Well, the kde has an area of 1. To draw a kde which matches the histogram, the kde needs to be multiplied by the area of the histogram.
For a density plot, the histogram has an area of 1, so the kde can be used as-is.
For a count plot, the sum of the histogram heights will be the length of the given data (each data item belongs to exactly one bar). The area of the histogram will be that total height multiplied by the width of the bins. (If the bins didn't have equal widths, adjusting the kde would be quite tricky.)
For a probability plot, the sum of the histogram heights will be 1 (for 100 %). The total area will be the bin_width multiplied by the sum of the heights, so equal to the bin_width.
Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np
data = [115, 127, 128, 145, 160]
bin_values, bin_edges = np.histogram(data, bins=4)
bin_width = bin_edges[1] - bin_edges[0]
total_area = bin_width * len(data)
kde = gaussian_kde(data)
x = np.linspace(bin_edges[0], bin_edges[-1], 200)
fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}
axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
axs[0].plot(x, kde(x), color='dodgerblue')
axs[0].set_ylabel('density')
axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
axs[1].set_ylabel('probability')
axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
axs[2].set_ylabel('count')
plt.tight_layout()
plt.show()
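The scaling factors can also be checked without plotting. The sketch below is only an illustration: it replaces scipy's gaussian_kde with a hand-written Gaussian KDE using a fixed, hypothetical bandwidth h (gaussian_kde chooses its own), and integrates with a plain Riemann sum.

```python
import numpy as np

data = np.array([115, 127, 128, 145, 160], dtype=float)
h = 10.0  # hypothetical fixed bandwidth; gaussian_kde picks its own

def kde(x):
    """Average of Gaussian bumps of width h centred on each data point."""
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

counts, edges = np.histogram(data, bins=4)
bin_width = edges[1] - edges[0]

x = np.linspace(0, 300, 30001)        # wide grid so the tails are included
dx = x[1] - x[0]
curve = kde(x)

kde_area = np.sum(curve) * dx                                # unscaled KDE
count_hist_area = np.sum(counts * bin_width)                 # count histogram
scaled_kde_area = np.sum(curve * len(data) * bin_width) * dx # KDE for count plot

print(kde_area, count_hist_area, scaled_kde_area)
```

The unscaled curve integrates to about 1, and after multiplying by len(data) * bin_width its area matches the area of the count histogram, which is why the two line up on the same axes.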

As far as I understand it, the KDE (kernel density estimation) is simply smoothing the curve formed from the data points. What changes between the three representations is the values from which it is computed:
With density estimation, the total area under the KDE curve is 1, which means you can estimate the probability of finding a value between two bounding values with an integral computation. I think they smooth the data points with a curve, compute the area under the curve and divide all the values by the area so that the curve keeps the same look but the area becomes 1.
With probability estimation, the total area under the KDE curve does not matter: each category has a certain probability (e.g. P(x in [115; 125]) = 0.2) and the sum of the probabilities for each category is equal to 1. So instead of computing the area under the KDE curve, they would count all the samples and divide each bin's count by the total.
With the counting estimation, you get a standard bin/count distribution and the KDE is just smoothing the numbers so that you can estimate the distribution of values, that is, what your observations might look like if you took more measurements or used more bins.
So all in all, the KDE curve stays the same: it is a smoothing of the sample data distribution. But there is a factor that is applied to the sample values based on which representation of the data you are interested in.
However, take what I am writing with a grain of salt: I think I am not far from the truth, from a mathematical point of view, but maybe someone could explain it in more precise terms, or correct me if I'm wrong.
For further reading on kernel density estimation, see https://en.wikipedia.org/wiki/Kernel_density_estimation (in short, it is a smoothing method with some special mathematical properties depending on the parameters used).

Normed histogram y-axis larger than 1

Sometimes when I create a histogram, using say seaborn's distplot function, with norm_hist = True, the y-axis is less than 1 as expected for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?

The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can be quite large although their areas sum to one. The height of a bar times its width is the probability that a value would fall in that range. To have the height equal to the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins of 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measures in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because of the bin width of 1.
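The same comparison can be made on the numbers alone. This is a sketch using np.histogram instead of distplot, with an arbitrary seed and np.linspace for the bin edges (80 bins here, chosen for convenience):

```python
import numpy as np

rng = np.random.default_rng(1)                # arbitrary seed
a_m = rng.normal(0, 0.01, 100000)             # lengths in meters
edges_m = np.linspace(-0.04, 0.04, 81)        # 80 bins, 0.001 m wide

dens_m, _ = np.histogram(a_m, bins=edges_m, density=True)
dens_mm, _ = np.histogram(a_m * 1000, bins=edges_m * 1000, density=True)

prob_m = dens_m * np.diff(edges_m)            # height * width, in meters
prob_mm = dens_mm * np.diff(edges_m * 1000)   # height * width, in millimeters

print(dens_m.max(), dens_mm.max())   # heights differ by a factor of about 1000
print(prob_m.max(), prob_mm.max())   # bin probabilities agree
```

Changing the unit rescales the heights but leaves the per-bin probabilities unchanged.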
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
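For the Pareto case, a quick sketch using the standard closed-form density and CDF shows a density above 1 next to probabilities that stay below 1. The scale x_m = 1 is an assumption; only α = 3 is given above.

```python
def pareto_pdf(x, alpha=3.0, x_m=1.0):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m."""
    return alpha * x_m**alpha / x**(alpha + 1) if x >= x_m else 0.0

def pareto_cdf(x, alpha=3.0, x_m=1.0):
    """Probability of a value <= x: 1 - (x_m / x)**alpha for x >= x_m."""
    return 1.0 - (x_m / x)**alpha if x >= x_m else 0.0

print(pareto_pdf(1.0))   # 3.0: a density well above 1
print(pareto_cdf(2.0))   # 0.875: a probability, never above 1
```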

Cumulative distribution function in numpy not reaching 1?

I am trying to plot a CDF over a histogram using matplotlib with the following code:
values, base = np.histogram(df['0'], bins=50)
cumulative = np.cumsum(values) / df['0'].sum()
# plot the cumulative function
plt.hist(df['0'], bins=50, density=True)
plt.plot(base[:-1], cumulative, c='blue')
plt.show()
However my plot ends up looking like this, where the CDF looks like it is nearing .007 or thereabouts, when I would expect it to reach 1:
I'm not sure what I'm doing wrong, but I'd appreciate any help

I think the problem is that you are normalizing the cumulative sum of the bins by the sum of the values in your dataframe. The quantity stored in values is the number of entries of df['0'] that fall inside the corresponding bin.
If you want to show the cumulative sum of the bins you need to normalize it to the total number of elements of df['0']:
cumulative = np.cumsum(values)/df['0'].values.shape[0]
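To see the difference between the two normalizations, here is a small sketch on synthetic data. The exponential sample is only a stand-in for df['0'], which isn't shown in the question.

```python
import numpy as np

rng = np.random.default_rng(2)                    # arbitrary seed
sample = rng.exponential(scale=100.0, size=5000)  # stand-in for df['0']

values, base = np.histogram(sample, bins=50)

wrong = np.cumsum(values) / sample.sum()   # counts divided by a sum of data values
right = np.cumsum(values) / sample.size    # counts divided by the number of samples

print(wrong[-1])    # far below 1 for this data
print(right[-1])    # 1.0
```

Since each cumulative value counts everything up to the right edge of its bin, plotting against base[1:] rather than base[:-1] places the points slightly more accurately.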

Seaborn Distplot with Density on y-axis

I have looked for solutions already but could not find one that worked for my problem. I am trying to plot a histogram with a density function showing the density on the y-axis. meanopa are average logreturns of the S&P500.
What I do not understand is the following.
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count.
This is implied if a KDE or fitted density is plotted.
Since kde=True in my case, I am wondering why there is the number of observations on the y-axis.
sns.distplot(meanopa, hist=True, kde=True, bins=20, color = 'darkblue',
hist_kws={'edgecolor':'black'}, kde_kws={'linewidth': 4})
Thanks in advance and again I would appreciate any sort of support.
Cheers!

Your result is ok. The y-axis is not showing counts for the histogram, but the probability density (actually the kernel density estimate). Since your numbers are very small, the x-axis also spans a very narrow interval. In fact, if you draw a rectangle of 0.002 x 500 on your plot to approximate the total area under the curve, the total probability comes out around 1, as expected.
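The rectangle argument can be checked with a few lines. This sketch uses np.histogram rather than distplot, on uniformly distributed values in an interval of width 0.002 (the interval bounds are illustrative):

```python
import random
import numpy as np

random.seed(2)
lo, hi = -0.001, 0.001   # interval of width 0.002
sample = [lo + random.random() * (hi - lo) for _ in range(100)]

heights, edges = np.histogram(sample, bins=5, range=(lo, hi), density=True)
mean_height = heights.mean()
area = np.sum(heights * np.diff(edges))

print(mean_height)   # about 1 / 0.002 = 500
print(area)          # total area of the bars
```

The average bar height is 500 simply because the total area of 1 is spread over a width of 0.002.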
As a note, this is a reproducible version of your problem, you can play with the rescaling (min_rescale and max_rescale values) if you want to see how the shape of the probability density changes.
import random
import seaborn as sns

random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=True, bins=5, color = 'darkblue',
hist_kws={'edgecolor':'black'}, kde_kws={'linewidth': 4})

If you are not interested in the probability density function but in the probability/frequency of each bin, given by the count of samples in the bin divided by the total number of samples, you can use the 'weights' key of the hist_kws parameter. Applying this to the example code of lrnzcig,
import random
import numpy as np
import seaborn as sns

random.seed(2)
min_rescale = -0.001
max_rescale = 0.001
close2 = [min_rescale + random.random() * (max_rescale - min_rescale) for x in range(100)]
sns.distplot(close2, hist=True, kde=False, bins=5, color = 'darkblue',
hist_kws={'edgecolor':'black', 'weights': np.ones(len(close2))/len(close2)})
results in the following plot:
[Figure: probabilities of histogram bins using seaborn's distplot]
Note that the result is not a probability density function; instead, the bin heights sum to 1 regardless of the bin widths.
However, this makes no sense when you are performing kde.

What is y axis in seaborn distplot?

I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, the bin heights don't add up to 1, which means the y axis doesn't show probability, it's something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shows probability, as the bin heights sum up to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bin sum up to 0.6+0.6=1.2 which is already greater than 1, so y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small array l with 3 possible values. So the number of bins is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?

From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
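For the small list from the question, the relationship can be worked out directly. This is a sketch with np.histogram, on the assumption that distplot's normalized histogram computes the same density values:

```python
import numpy as np

l = [1, 3, 2, 1, 3]
heights, edges = np.histogram(l, bins=3, density=True)
widths = np.diff(edges)      # each bin is 2/3 wide
probs = heights * widths     # probability = density * bin width

print(heights)   # densities: approximately [0.6, 0.3, 0.6]
print(probs)     # probabilities: approximately [0.4, 0.2, 0.4]
```

The heights 0.6, 0.3 and 0.6 sum to 1.5, but multiplied by the bin width of 2/3 they give the expected 0.4, 0.2 and 0.4.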

The x-axis is the value of the variable just like in a histogram, but what exactly does the y-axis represent?
The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
Source: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0

This code will help you make a plot like this:
sns.set_style("whitegrid")
ax = sns.displot(data=df_p,
x='Volume_Tonnes', kind='kde', fill=True, height=5, aspect=2)
# Here you can define the x limit
ax.set(xlim=(-50,100))
ax.set(xlabel = 'Volume Tonnes', ylabel = 'Probability Density')
ax.fig.suptitle("Volume Tonnes Distribution",
fontsize=20, fontdict={"weight": "bold"})
plt.show()
