I have two histograms generated with
plt.figure(figsize=(30,10))
sns.set()
sns.distplot(A, hist=True, kde=True, bins=50, color = 'darkgrey', hist_kws={'edgecolor':'white'}, kde_kws={'linewidth': 3})
sns.distplot(B, hist=True, kde=True, bins=50, color = 'lightgreen', hist_kws={'edgecolor':'white'}, kde_kws={'linewidth': 3})
plt.xlabel("X")
plt.ylabel("Frequency")
plt.title("A vs B")
Now I need to find the coordinates of the intersection point between the two histograms. Any idea how I can do that?
Histograms don't really have "intersections" per se because they are discrete binned distributions. However, you could approximate each histogram with a continuous probability distribution (this would require you to know the distribution type of the data though, e.g. normal, lognormal). With continuous probability distribution functions (pdfs), you could then set these pdfs equal to one another and solve for the intersection.
The discrete analog would be to identify the bin or bins at which the two distributions cross (e.g. histogram A is greater than histogram B in bin 10-20 and less than histogram B in bin 20-30, so the "actual" continuous pdfs intersect somewhere within the range 10-30). Then you could use some form of interpolation to estimate the exact intersection; the easiest, though not necessarily the most accurate, strategy is linear interpolation.
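Not part of the original answer, just a minimal sketch of the discrete approach, assuming A and B are the 1-D arrays plotted above: bin both datasets on shared edges, look for the bins where the difference of the (density-normalized) histograms changes sign, and linearly interpolate between the neighboring bin centers.
import numpy as np

# Assumption: A and B are the arrays from the question.
edges = np.histogram_bin_edges(np.concatenate([A, B]), bins=50)  # shared bins
hist_a, _ = np.histogram(A, bins=edges, density=True)
hist_b, _ = np.histogram(B, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Bins where the sign of (A - B) flips: the curves cross somewhere in between.
diff = hist_a - hist_b
crossings = np.where(np.diff(np.sign(diff)) != 0)[0]

# Linear interpolation between the two bin centers around each crossing.
for i in crossings:
    x0, x1 = centers[i], centers[i + 1]
    y0, y1 = diff[i], diff[i + 1]
    x_cross = x0 - y0 * (x1 - x0) / (y1 - y0)
    y_cross = np.interp(x_cross, centers, hist_a)
    print(f"approximate intersection near x={x_cross:.3f}, density={y_cross:.3f}")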
When plotting a histplot with the density stat and the KDE flag set to True, the area under the curve is equal to 1. From the Seaborn documentation:
"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."
Below is an example of a density histplot with the default KDE, normalized to 1.
However, you can also plot a histogram with the stat set to count or probability. Plotting a KDE on top of those produces the plots below:
How is the KDE normalized in those cases? The area is certainly not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation concerns the KDE plotted on a density histogram. Any help appreciated, thank you!
Well, the kde has an area of 1. To draw a kde which matches the histogram, the kde needs to be multiplied by the area of the histogram.
For a density plot, the histogram has an area of 1, so the kde can be used as-is.
For a count plot, the sum of the histogram heights will be the length of the given data (each data item belongs to exactly one bar). The area of the histogram will be that total height multiplied by the width of the bins. (If the bins didn't have equal widths, adjusting the kde would be quite tricky.)
For a probability plot, the sum of the histogram heights will be 1 (i.e. 100 %). The total area will be the bin width multiplied by the sum of the heights, so it is equal to the bin width.
Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

data = [115, 127, 128, 145, 160]

# Histogram: 4 equal-width bins over the data range.
bin_values, bin_edges = np.histogram(data, bins=4)
bin_width = bin_edges[1] - bin_edges[0]
total_area = bin_width * len(data)  # area of the count histogram

# KDE of the raw data; its own area is 1.
kde = gaussian_kde(data)
x = np.linspace(bin_edges[0], bin_edges[-1], 200)

fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}

# Density: the histogram area is 1, so the kde can be used as-is.
axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
axs[0].plot(x, kde(x), color='dodgerblue')
axs[0].set_ylabel('density')

# Probability: the heights sum to 1, so scale the kde by the bin width.
axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
axs[1].set_ylabel('probability')

# Count: scale the kde by the full histogram area.
axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
axs[2].set_ylabel('count')

plt.tight_layout()
plt.show()
As far as I understand it, the KDE (kernel density estimation) simply smooths the curve formed from the data points. What changes between the three representations is how it is scaled:
With density estimation, the total area under the KDE curve is 1, which means you can estimate the probability of finding a value between two bounds with an integral. I think they smooth the data points into a curve, compute the area under that curve, and divide all the values by the area so that the curve keeps the same shape but its area becomes 1.
With probability estimation, the total area under the KDE curve does not matter: each category has a certain probability (e.g. P(x in [115; 125]) = 0.2) and the sum of the probabilities over the categories is 1. So instead of normalizing by the area under the curve, they count all the samples and divide each bin's count by the total.
With the count representation, you get a standard bin/count distribution and the KDE just smooths the numbers so that you can estimate the distribution of values, i.e. how your observations might look if you took more measurements or used more bins.
So all in all, the KDE curve stays the same: it is a smoothing of the sample data distribution. But a scaling factor is applied to it depending on which representation of the data you are interested in.
However, take what I am writing with a grain of salt: I think I am not far from the truth mathematically, but maybe someone could explain it in more precise terms, or correct me if I'm wrong.
Here is some reading about kernel density estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation. In short, it is a smoothing method with some special mathematical properties depending on the parameters used.
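Not from either answer, but if it helps to convince yourself numerically, here is a quick check, reusing kde, bin_width and total_area from the code above, that the raw KDE has unit area while the scaled versions match the areas of the probability and count histograms:
# Reuses `kde`, `bin_width` and `total_area` from the snippet above.
area_density = kde.integrate_box_1d(-np.inf, np.inf)   # ~1.0: the raw kde integrates to 1
area_probability = area_density * bin_width            # area under the probability-scaled kde
area_count = area_density * total_area                 # area under the count-scaled kde
print(area_density, area_probability, area_count)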
I am plotting a density map of ~40k points, but hist2d returns a uniform density map. This is my code:
plt.hist2d(x, y, bins=(1000, 1000), cmap=plt.cm.jet)
Here is the scatter plot
Here is the histogram
I was expecting a red horizontal band in the center that gradually turns blue towards higher/lower y values.
EDIT:
#bb1 suggested decreasing the number of bins, but by setting it to bins=(100, 1000) I get this result:
I think you are specifying too many bins. By setting bins=(1000, 1000) you get 1,000,000 bins. With 40,000 points, most of the bins will be empty, and they overwhelm the image.
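For illustration (not code from the original answer), something along these lines, with far fewer bins, should already show the structure, assuming x and y are the arrays from the question:
import matplotlib.pyplot as plt

# Assumption: x and y are the ~40k-point arrays from the question.
# Roughly 100 bins per axis keeps most bins populated enough to be visible.
plt.hist2d(x, y, bins=(100, 100), cmap=plt.cm.jet)
plt.colorbar(label='counts per bin')
plt.show()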
You may also consider using seaborn's kdeplot() function instead of plt.hist2d(). It will visualize the density of the data without subdividing it into bins:
import seaborn as sns
sns.kdeplot(x=x, y=y, levels = 100, fill=True, cmap="mako", thresh=0)
I am measuring the accuracy of a machine learning classifier which has two parameters. I would like to have the x and y axes represent these two parameters, and have the z index (the contour / depth) show the accuracy of the model.
The problem I'm having is that seaborn's kdeplot seems to be calculating the z index based on where the points are in the graph; it doesn't show the accuracy, but rather the concentration of points.
Is there a way to use the accuracy (the score of these points) to show the depth of the graph?
Or maybe this isn't the best way to represent this kind of information?
sns.jointplot(x="n_estimators", y="learning_rate",
              data=data, height=8, ratio=10, space=0,
              color="#383C65") \
   .plot_joint(sns.kdeplot, zorder=0, shade=True, shade_lowest=False,
               cmap=sns.cubehelix_palette(light=1, as_cmap=True), legend=True,
               cbar=False, cbar_kws={})
where data is a pandas DataFrame with three columns: learning_rate, n_estimators, accuracy.
I have also used matplotlib's contourf with the same results. Would really appreciate any help. Thanks!
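One possible direction (a sketch, not an answer from this thread): since every (n_estimators, learning_rate) pair already has an accuracy score, you can pass the score directly as the z values of a contour plot instead of letting kdeplot estimate a point density, for instance with matplotlib's tricontourf. This assumes data is the DataFrame with the three columns described above:
import matplotlib.pyplot as plt

# Assumption: `data` has columns 'n_estimators', 'learning_rate', 'accuracy'.
fig, ax = plt.subplots(figsize=(8, 6))
tcf = ax.tricontourf(data['n_estimators'], data['learning_rate'],
                     data['accuracy'], levels=20, cmap='viridis')
fig.colorbar(tcf, ax=ax, label='accuracy')
ax.set_xlabel('n_estimators')
ax.set_ylabel('learning_rate')
plt.show()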
In my program, I compute N values of three parameters and want to create three histograms, one for each parameter. I have strict conditions on the histograms: first, a condition on the range (the histogram should go strictly to zero at certain points), and second, it should be smooth.
I use np.histogram, as following:
hist, bins = np.histogram(Gamm1, bins=100)
center = (bins[:-1] + bins[1:]) / 2  # bin centers rather than left edges
plt.plot(center, hist)
plt.show()
but the solution is too sharp. After that, I tried the following construction with seaborn:
snsplot = sns.kdeplot(data['Third'], shade=True)
fig = snsplot.get_figure()
fig.savefig("output2.png")
but here the approximation goes outside the allowed range (the range comes from physical conditions).
I think that adjusting the bins for the seaborn solution, as can be done for np.histogram, might help.
But, in the end, I'm looking for a solution that is both smooth and stays within the range I specify.
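Not part of the original post, but one thing worth trying: seaborn's kdeplot accepts a clip argument that keeps the density estimate inside a given interval, which sounds like the physical range constraint. A rough sketch, where lower and upper are placeholders for the real bounds:
import seaborn as sns

# Assumption: data['Third'] is the column from the question;
# lower and upper are placeholders for the physically allowed range.
lower, upper = 0.0, 1.0
snsplot = sns.kdeplot(data['Third'], fill=True,   # fill replaces the older shade=True
                      clip=(lower, upper),        # do not evaluate the kde outside the range
                      bw_adjust=0.5)              # smaller bw_adjust -> less smoothing
snsplot.set_xlim(lower, upper)
snsplot.get_figure().savefig("output2.png")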
I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, the bin heights don't add up to 1, which means the y axis doesn't show probability; it shows something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis does show probability, as the bin heights sum to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bins alone sum to 0.6 + 0.6 = 1.2, which is already greater than 1, so the y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is a probability, since 0.4 + 0.4 + 0.2 = 1, as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small list l with 3 possible values. So the bin count is not the issue.
My question is: in seaborn's distplot called with norm_hist=True, what is the meaning of the y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
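To make that concrete with the small list from the question, here is a quick check (not part of the original answer) that the density heights times the bin width integrate to 1, even though the heights themselves sum to more than 1:
import numpy as np

l = [1, 3, 2, 1, 3]
heights, edges = np.histogram(l, bins=3, density=True)
bin_width = edges[1] - edges[0]
print(heights)                      # [0.6, 0.3, 0.6] -- the bar heights seen in the plot
print((heights * bin_width).sum())  # 1.0 -- the area, not the sum of the heights, is 1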
The x-axis is the value of the variable, just like in a histogram, but what exactly does the y-axis represent?
Answer: The y-axis in a density plot is the probability density function of the kernel density estimate. However, we need to be careful to specify that this is a probability density and not a probability. The difference is that the probability density is the probability per unit on the x-axis. To convert it to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally think of the y-axis on a density plot as a value useful only for relative comparisons between different categories.
Reference: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
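A tiny illustration (not from the linked article) of why a density can exceed 1 when the data span a narrow range:
import numpy as np

# Values packed into a 0.02-wide interval: the probability per unit of x is large.
narrow = [0.10, 0.11, 0.12]
heights, edges = np.histogram(narrow, bins=2, density=True)
print(heights)                           # roughly [33.3, 66.7] -- far above 1
print((heights * np.diff(edges)).sum())  # yet the total area is still 1.0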
This code will help you make something like this:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# df_p is your DataFrame with a 'Volume_Tonnes' column
g = sns.displot(data=df_p, x='Volume_Tonnes', kind='kde',
                fill=True, height=5, aspect=2)

# Here you can define the x limit
g.set(xlim=(-50, 100))
g.set(xlabel='Volume Tonnes', ylabel='Probability Density')

# displot returns a FacetGrid, so the title goes on its figure
g.fig.suptitle("Volume Tonnes Distribution", fontsize=20, fontweight="bold")
plt.show()