I'm visualizing a probability density function (PDF) using two plotting approaches: displot() and plot(). I don't understand why displot() doesn't produce a normally distributed plot whereas plot() does this perfectly. The density plots should look alike, but they don't. What's wrong with displot() here?
from scipy.stats import norm
import seaborn as sns
import numpy as np
data_x = np.arange(-4, 4, 0.001)
norm_pdf = norm.pdf(data_x)
sns.displot(data=norm_pdf, x=data_x, kind='kde')
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
data_x = np.arange(-4, 4, 0.001)
plt.plot(data_x, norm.pdf(data_x))
plt.show()
displot (or the underlying kdeplot) creates an approximation of a probability density function (pdf) that might have generated the given random data. As input, it needs random samples. It models those data as a sum of Gaussian bell shapes (a "kernel density estimation" with a Gaussian kernel).
Here is an example using 8000 random points as input. You'll notice the curve resembles the normal pdf, but is also a bit "bumpier" (that's what randomness looks like).
data_x = norm.rvs(size=8000)
sns.kdeplot(x=data_x)
When you call kdeplot (or displot(..., kind='kde')) with both data= and x=, and x= isn't a column name in a dataframe, data= gets ignored. So you are actually estimating the density of 8000 evenly spaced values between -4 and 4. The kde of such data looks like a flat line between -4 and 4. But as the kde assumes the underlying function locally resembles a Gaussian, the start and end are smoothed out.
data_x = np.arange(-4, 4, 0.001)
sns.kdeplot(x=data_x)
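So, to get the expected bell shape from displot, feed it random samples instead of precomputed pdf values; a minimal sketch of the fix for the code in the question:
from scipy.stats import norm
import seaborn as sns
samples = norm.rvs(size=8000)  # random draws, not pdf values
sns.displot(x=samples, kind='kde')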
Related
The pandas plot.kde() function is handy for plotting the estimated density function of a continuous random variable. It takes data x as input and displays the estimated probabilities p(x) of the smoothed input as its output.
How can I extract the values of the probabilities it computes? Instead of just plotting the probabilities of the bandwidth-smoothed samples, I would like an array or pandas Series containing the probability values it computes internally.
If this can't be done with pandas kde, let me know of any equivalent in scipy or elsewhere.
There are several ways to do that. You can either compute it yourself or extract it from the plot.
As pointed out in the comment by @RichieV following this post, you can extract the data from the plot using:
data.plot.kde().get_lines()[0].get_xydata()
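A minimal, self-contained sketch of that approach (the random series here is just a hypothetical stand-in for your own data):
import numpy as np
import pandas as pd
data = pd.Series(np.random.normal(size=500))
ax = data.plot.kde()  # plot.kde() returns the matplotlib Axes
x, y = ax.get_lines()[0].get_xydata().T  # the curve's xy data as two arrays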
You can use seaborn to estimate the kernel density and then matplotlib to extract the values, the same way as in 1). Either kdeplot or distplot works:
import seaborn as sns
# kde plot
x, y = sns.kdeplot(data).get_lines()[0].get_data()
# distplot (deprecated in newer seaborn versions)
x, y = sns.distplot(data, hist=False).get_lines()[0].get_data()
You can use the underlying methods of scipy.stats.gaussian_kde to estimate the kernel density, which is what pandas uses:
import scipy.stats
density = scipy.stats.gaussian_kde(data)
and then you can use this to evaluate it on a set of points:
x = np.linspace(0, 80, 200)
y = density(x)
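Note that these values are densities, not probabilities, so they can exceed 1 pointwise and only integrate to 1. If you want the probability of an interval, gaussian_kde can integrate its own estimate; a short sketch with hypothetical stand-in data:
import numpy as np
from scipy.stats import gaussian_kde
data = np.random.normal(loc=40, scale=10, size=1000)  # stand-in sample
density = gaussian_kde(data)
p = density.integrate_box_1d(30, 50)  # probability mass between 30 and 50
print(p)  # roughly 0.68 for this sample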
When plotting the estimated density function of my data using sns.kdeplot(), the algorithm extrapolates outside of the boundaries of the data, meaning that it draws the plot for values smaller than 0 or greater than 1, which is particularly annoying when dealing with probabilities. Example:
import numpy as np
import seaborn as sns
data = np.random.random(100)
sns.kdeplot(data)
How can I fix this so that the actual plot remains within [0, 1] in the x-direction, without simply calling plt.xlim((0, 1))?
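One common approach (a sketch, assuming a reasonably recent seaborn version) is to restrict the evaluation grid itself via kdeplot's cut and clip parameters rather than cropping the axis:
import numpy as np
import seaborn as sns
data = np.random.random(100)
# cut=0 stops the curve at the data extremes;
# clip=(0, 1) bounds the evaluation grid to the variable's support
sns.kdeplot(data, cut=0, clip=(0, 1))
This only truncates where the curve is drawn; the estimate near the boundaries is still biased downward, which is inherent to a plain Gaussian KDE.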
np.arange produces stepwise incrementing values and is not a random function, so why does it give a random distribution?
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
I expect a uniform distribution
scipy.stats.norm provides functionality for the normal distribution, not the uniform distribution. This means that when you apply its probability density function (pdf), you are not applying a constant function, but something else entirely (also known as the bell curve):
https://en.wikipedia.org/wiki/Normal_distribution
So in the end, what you are seeing are points between (-3, 3) visualised on the probability density function of the normal distribution. If you want to see the uniform distribution:
from scipy.stats import uniform
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-3, 3, 0.001)
# by default uniform.pdf is nonzero only on [0, 1];
# loc and scale spread it over [-3, 3]
plt.plot(x, uniform.pdf(x, loc=-3, scale=6))
But that is just a very fancy way to draw a constant line.
I have started to use Python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? I am using data that change over time, so if the data have one distribution now (e.g. Gaussian), they could have another distribution later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to try fitting the data to every distribution in order to find the data's distribution. So what's the difference between using gaussian_kde and the answer provided in that question? Is gaussian_kde a good way to estimate the pdf if the data will change over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
df = pd.read_csv(r'D:\dataset.csv')  # raw string so the backslash is not an escape
data = df.iloc[:, 0]  # assume a single column of values
pdf = scipy.stats.gaussian_kde(data)
x = np.linspace(data.min() - 1, data.max() + 1, len(data))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data, density=True)  # normed= was removed in newer matplotlib
pl.show(block=True)
I think you need to distinguish non-parametric density estimation (the one implemented in scipy.stats.gaussian_kde) from parametric density estimation (the one in the Stack Overflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)  # normed= was removed in newer matplotlib
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all probability mass is put on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation, that point is represented by a bell-shaped curve (the Gaussian kernel) spread over its neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x; its sole purpose is smoothing.
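To make the "sum of bumps" idea concrete, here is a minimal sketch that rebuilds the KDE by hand and compares it with scipy's result (assuming scipy's default Scott's-rule bandwidth convention for 1D data):
import numpy as np
from scipy.stats import norm, gaussian_kde
np.random.seed(0)
data = np.random.randn(200)
x = np.linspace(-4, 4, 400)
# Scott's rule bandwidth for 1D data, scipy's default
h = data.std(ddof=1) * len(data) ** (-1 / 5)
# a KDE is just the average of one Gaussian bump per observation
manual = norm.pdf(x[:, None], loc=data, scale=h).mean(axis=1)
print(np.allclose(manual, gaussian_kde(data)(x)))  # should print True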
To get the mode of a non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if a quasi-Newton optimization algorithm starts between [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the evaluation grid
x[np.argmax(nparam_density)]
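If you need all peaks rather than the single global mode, one option (continuing with the x and nparam_density arrays from the example above) is to search the evaluation grid for local maxima with scipy.signal.find_peaks:
from scipy.signal import find_peaks
# the density may be multi-modal, so find every local maximum
peak_indices, _ = find_peaks(nparam_density)
print(x[peak_indices])  # roughly -6 and 4 for the mixture above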
How can I make a figure like the following one, but with a smooth, continuous curve, using matplotlib in Python?
Instead of using a histogram to bin your data, have a look at using a KDE for a continuous estimate of the probability distribution. There is an implementation using a Gaussian kernel in scipy.stats.gaussian_kde.
As an example:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
data = np.random.normal(0.0, 1.0, 10000) #Generate some data
kde = gaussian_kde(data)
xplot = np.linspace(-5,5,1000)
plt.plot( xplot, kde(xplot), label='KDE' )
plt.hist(data, bins=50, histtype='step', density=True, label='histogram')  # normed= was removed in newer matplotlib
plt.legend()
plt.show()
This will produce a plot of the smooth KDE curve overlaid on the step histogram.
Note that when using KDEs, the bandwidth of the kernel you choose can have a very big impact on the representation of the data that gets produced. This is similar to the effect that the bin size has when making a histogram. Both the scipy documentation linked above and the Wikipedia page have good write-ups on how to make this selection in a well-motivated way.
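To see that effect directly, here is a small sketch comparing different bandwidths (a scalar bw_method sets gaussian_kde's kernel factor directly; None keeps the default Scott's rule):
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
np.random.seed(0)
data = np.random.normal(0.0, 1.0, 10000)
xplot = np.linspace(-5, 5, 1000)
for bw in (0.05, None, 1.0):  # under-smoothed, default, over-smoothed
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(xplot, kde(xplot), label='bw_method={}'.format(bw))
plt.legend()
plt.show()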