Calculating probability distribution from time series data in Python

I have a question about probability distribution functions. I have time series data and I want to calculate the probability distribution of the data in different time windows.
I have developed the following code, but I could not find the value of the probability distribution from it.
import pandas as pd

a = pd.DataFrame([0.0,
21.660332407421638,
20.56428943581567,
20.597329924045983,
19.313207915827956,
19.104973174542806,
18.031361568112377,
17.904747973652125,
16.705687654209264,
16.534206966165637,
16.347782724271802,
13.994284547628721,
12.870120434556945,
12.794530081249571,
10.660675400742669])
This is the histogram and density plot of my data:
a.plot.hist()
a.plot.density()
but I don't know how to calculate the value of the area under the density curve.

You can directly use scipy.stats.gaussian_kde, which pandas also uses internally for its density plots.
It returns the estimated density as a callable function.
You can then use one of the routines from scipy.integrate to calculate areas under the kernel density estimate, e.g.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, integrate
kde = stats.gaussian_kde(a[0])
# Calculate the integral of the kde between 10 and 20:
xmin, xmax = 10, 20
integral, err = integrate.quad(kde, xmin, xmax)
x = np.linspace(-5,20,100)
x_integral = np.linspace(xmin, xmax, 100)
plt.plot(x, kde(x), label="KDE")
plt.fill_between(x_integral, 0, kde(x_integral),
alpha=0.3, color='b', label="Area: {:.3f}".format(integral))
plt.legend()
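Since the question also asks about different time windows, here is a hedged sketch of one way to extend this; the fixed window length of 5 is purely illustrative and not taken from the question:
# Hypothetical sketch: one KDE and one area estimate per fixed-size window.
window = 5  # illustrative window length
for start in range(0, len(a) - window + 1, window):
    chunk = a[0].iloc[start:start + window]
    kde_w = stats.gaussian_kde(chunk)
    area, _ = integrate.quad(kde_w, 10, 20)  # P(10 < x < 20) under this window's KDE
    print("window %d-%d: %.3f" % (start, start + window - 1, area))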

Related

Seaborn probability histplot - KDE normalization

When plotting a histplot with the density stat and kde=True, the area under the curve is equal to 1. From the Seaborn documentation:
"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."
Below is an example of a density histplot with the default KDE, normalized to 1.
However, you can also plot a histogram with the stat set to count or probability. Plotting a KDE on top of those produces the plots below.
How is the KDE normalized there? The area certainly is not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation covers the KDE plotted for a density histogram. Any help appreciated, thank you!
Well, the kde has an area of 1. To draw a kde which matches the histogram, the kde needs to be multiplied by the area of the histogram.
For a density plot, the histogram has an area of 1, so the kde can be used as-is.
For a count plot, the sum of the histogram heights will be the length of the given data (each data item belongs to exactly one bar). The area of the histogram is that total height multiplied by the width of the bins. (If the bins didn't have equal widths, adjusting the kde would be quite tricky.)
For a probability plot, the sum of the histogram heights will be 1 (for 100 %). The total area will be the bin width multiplied by the sum of the heights, so it is equal to the bin width.
Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np
data = [115, 127, 128, 145, 160]
bin_values, bin_edges = np.histogram(data, bins=4)
bin_width = bin_edges[1] - bin_edges[0]
total_area = bin_width * len(data)
kde = gaussian_kde(data)
x = np.linspace(bin_edges[0], bin_edges[-1], 200)
fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}
# density: counts divided by the total area, so the bars integrate to 1; the kde is used as-is
axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
axs[0].plot(x, kde(x), color='dodgerblue')
axs[0].set_ylabel('density')
# probability: counts divided by the number of samples; the kde is scaled by the bin width
axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
axs[1].set_ylabel('probability')
# count: raw counts; the kde is scaled by the total histogram area
axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
axs[2].set_ylabel('count')
plt.tight_layout()
plt.show()
As far as I understand it, the KDE (kernel density estimation) simply smooths the curve formed from the data points. What changes between the three representations is the scaling applied to it:
With the density stat, the total area under the KDE curve is 1, which means you can estimate the probability of finding a value between two bounds with an integral. The data points are smoothed into a curve, the area under the curve is computed, and all values are divided by that area so that the curve keeps the same shape but its area becomes 1.
With the probability stat, the total area under the KDE curve does not matter: each bin has a certain probability (e.g. P(x in [115; 125]) = 0.2) and the sum of the probabilities over the bins is equal to 1. So instead of normalizing by the area under the KDE curve, the samples are counted and each bin's count is divided by the total.
With the count stat, you get a standard bin/count distribution and the KDE just smooths the numbers, so that you can estimate how your observations might look if you took more measurements or used more bins.
So all in all, the KDE curve stays the same: it is a smoothing of the sample distribution. But a scaling factor is applied to it based on which representation of the data you are interested in.
However, take this with a grain of salt: I think I am not far from the truth mathematically, but maybe someone can explain it in more precise terms, or correct me if I'm wrong.
Here is some reading about kernel density estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation. In short, it is a smoothing method with some special mathematical properties depending on the parameters used.
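As a minimal check of the claim that the kde itself has unit area (assuming the data and kde from the code example above):
# The kde integrates to ~1 over the whole real line ...
total_kde_area = kde.integrate_box_1d(-np.inf, np.inf)   # ~1.0
# ... and the integral between two bounds approximates the probability of that interval.
p_115_125 = kde.integrate_box_1d(115, 125)               # roughly P(115 <= x <= 125) under the kde
print(total_kde_area, p_115_125)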

How to find the probability from a normal probability density function in python?

Basically, I have plotted a normal curve by using the values of mean and standard deviation. The y-axis gives the probability density.
How do I find the probability at a certain value "x" on the x-axis? Is there any Python function for it or how do I code it?
I'm not sure whether you mean the probability density function, which for a given mean mu and standard deviation sigma is
f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)**2 / (2 * sigma**2)).
In Python you can use stats.norm.fit to estimate those parameters from data and stats.norm.pdf to evaluate the density. For example, with some data to which we fit a normal distribution:
from scipy import stats
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
data = stats.norm.rvs(10,2,1000)
x = np.linspace(min(data),max(data),1000)
mu, std = stats.norm.fit(data)  # fit returns the estimated mean (loc) and standard deviation (scale)
p = stats.norm.pdf(x, mu, std)
Now that we have estimated the mean and standard deviation, we use the pdf to evaluate the density at, for example, 12.5:
xval = 12.5
p_at_x = stats.norm.pdf(xval,mu,std)
We can plot to see if it is what you want:
fig, ax = plt.subplots(1,1)
sns.distplot(data, bins=50, ax=ax)  # distplot is deprecated/removed in newer seaborn; sns.histplot(data, bins=50, stat='density', kde=True, ax=ax) is the modern equivalent
plt.plot(x,p)
ax.hlines(p_at_x, 0, xval, linestyle="dotted")
ax.vlines(xval, 0, p_at_x, linestyle="dotted")
A scipy.stats distribution includes these 3 methods:
pdf(x) the value of the pdf at x. This is what you asked for.
cdf(x) the cumulative probability at x.
ppf(p) the inverse of the cdf(). The critical value that gives cumulative probability, p.
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# plot a normal distribution and use scipy.stats to obtain
# probabilities and critical values (percentiles).
# Using scipy.stats, this can be done for any distribution
# listed in the documentation: https://docs.scipy.org/doc/scipy/reference/stats.html.
# scipy is included in the standard Anaconda python distribution.
loc = 0 # the mean
scale = 1 # the standard deviation
# a scipy.stats normal distribution
# scipy.stats supports 50+ continuous distributions.
d = stats.norm(loc, scale)
# a scipy.stats distribution includes these 3 methods:
# norm.pdf(x) # the value of the pdf at x. This is what you asked for.
# norm.cdf(x) # the cumulative probability at x.
# norm.ppf(p) # the inverse of the cdf(). The critical value that gives cumulative probability, p.
# d.pdf(x) gives the probability you asked for.
print(f'The value of the pdf at x = 0 (the 50th percentile, a.k.a. the median): {d.pdf(0)}')
# d.cdf(x) gives the cumulative probability at x (x is a critical value of the normal distribution).
print(f'The cumulative probability at the 50th percentile (should be 0.5): {d.cdf(d.ppf(.5))}')
# d.ppf(p) is the inverse of cdf. The critical value that gives cumulative probability, p.
print(f'The normal critical value that gives a cumulative probability = .5: {d.ppf(.5)}')
# plot the distribution over these percentiles.
quantile_range = (.01, .99)
# generate sample_size quantile values for the x-axis
# of the plot of the probability distribution function (pdf)
sample_size = 100
x = np.linspace(d.ppf(quantile_range[0]), d.ppf(quantile_range[1]), sample_size)
y = d.pdf(x) # return an array of probabilities (pdf values) for x
# setup the plot area
plt.style.use('seaborn-v0_8-darkgrid')  # named 'seaborn-darkgrid' in matplotlib < 3.6
fig, ax = plt.subplots()
# If you move your mouse along the curve, you will
# see the value of the pdf in the lower left of the plot (mouse tips)
ax.plot(x, y, color='black', linewidth=1.5)
plt.show()
plt.close()
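As a small follow-up sketch (my own addition, reusing the distribution d from the code above): the pdf value at a point is a density, not a probability, so the probability of landing in an interval comes from differencing the cdf:
# Probability that a draw from d falls between -1 and 1 (about 0.683 for the standard normal)
p_interval = d.cdf(1) - d.cdf(-1)
print(f'P(-1 < X < 1) = {p_interval:.4f}')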

Using scipy gaussian kernel density estimation to calculate CDF inverse

The gaussian_kde class in scipy.stats has an evaluate method that returns the value of the PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is to generate Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?
The example below shows how this should work for the case of a Gaussian distribution. First I show how to do the PDF calculation to set up the specific API I'm trying to achieve:
import numpy as np
from scipy.stats import norm, gaussian_kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
Is there an analogously simple way to compute the inverse CDF? The norm function has a very handy isf function that does exactly this:
cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)
Does such a function exist for kde_gaussian? Or is it straightforward to construct such a function from the already implemented methods?
The method integrate_box_1d can be used to compute the CDF, but it is not vectorized; you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to special.ndtr) in vector form may speed things up.
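For comparison, a minimal sketch of that looped approach (assuming the kde and the grid x from the question's code); it calls integrate_box_1d once per evaluation point, so it is simple but slow for large grids. The vectorized version follows right after:
# CDF of the kde, one integrate_box_1d call per grid point (slow but simple)
kde_cdf_loop = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in x])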
from scipy.special import ndtr
import matplotlib.pyplot as plt

# the kde's kernel bandwidth (standard deviation of each Gaussian kernel)
stdev = np.sqrt(kde.covariance)[0, 0]
# vectorized CDF of the kde on the grid x: one standard-normal CDF per data point, averaged
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)
The plot of the inverse function would be plt.plot(pde_cdf, x). If the goal is to compute the inverse function at a specific point, consider interpolating the computed CDF values with a spline and evaluating that interpolant as the inverse, as sketched below.
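A hedged sketch of that spline idea (assuming the pde_cdf and x computed above; pde_cdf is strictly increasing for a Gaussian KDE, which the spline construction requires):
# Interpolate x as a function of the estimated CDF values, i.e. an approximate inverse CDF,
# then evaluate it at a probability of interest.
from scipy.interpolate import InterpolatedUnivariateSpline
inverse_cdf = InterpolatedUnivariateSpline(pde_cdf, x, k=3)
x_at_median = inverse_cdf(0.5)  # point where the estimated CDF reaches 0.5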
You can use some Python tricks for a fast and memory-efficient estimation of the CDF (based on this answer):
from scipy.special import ndtr
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
for item in x)
It is as fast as the previous answer, but has linear (len(kde.dataset)) space complexity instead of quadratic (actually, len(kde.dataset) * len(x)).
All you have to do next is to apply an inverse approximation, for instance with the tools from statsmodels.
The question has been answered in the other answers but it took me a while to wrap my mind around everything. Here is a complete example of the final solution:
import numpy as np
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
# cdf is available directly from scipy; the inverse cdf (ppf) is evaluated on a probability grid
norm_cdf = norm.cdf(x)
p_grid = np.linspace(0.01, 0.99, npts_sample)
norm_inv = norm.ppf(p_grid)
# estimate cdf
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
for item in x)
# estimate inv cdf
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)
fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p_grid, norm_inv, c='k')
ax[2].plot(p_grid, inversefunction(p_grid), c='r', ls='--')
ax[2].set_title("Inverse CDF")
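For the Monte Carlo motivation in the question, a short hedged sketch on top of this example: either push uniform random numbers through the interpolated inverse CDF, or use gaussian_kde's built-in resample, which draws from the KDE directly:
# Inverse-transform sampling with the interpolated inverse CDF
u = np.random.uniform(min(cdf), max(cdf), size=10000)
mc_samples = inversefunction(u)
# Or let scipy draw from the KDE directly (returns an array of shape (1, N) for 1-d data)
mc_samples_direct = kde.resample(10000)[0]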

How to plot the angle frequency distribution curve in python

I have only the angle values for a set of data. Now I need to plot an angle distribution curve, i.e., angle on the x-axis vs. the number of times/frequency an angle occurs on the y-axis.
These are the angles, sorted, for a set of data:
[98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381, 101.70374296,
103.15715294, 104.4653976,105.50441485, 106.82885361, 107.4605319, 108.93228646,
111.22463712, 112.23658018, 113.31223886, 113.4000603, 114.14565594, 114.79809084,
115.15788861, 115.42991416, 115.66216071, 115.69821092, 116.56319054, 117.09232139,
119.30835385, 119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387,]
How can I do this? How can I write a Python program for it and plot the result as a distribution curve?
The function hist actually returns the bin heights and edges. You can use them to prepare the data for the line plot:
y, x, _ = plt.hist(angles) # No need for the 3rd return value
xc = (x[:-1] + x[1:]) / 2 # Take centerpoints
# plt.clf()
plt.plot(xc, y)
plt.show() # Etc.
You will end up having both the histogram and the line plot. If this is not desirable, clean the canvas before plotting the line by uncommenting the call to clf().
EDIT:
If you want a line plot as well, it is better to generate the histogram with numpy and then use that information also for the line:
from matplotlib import pyplot as plt
import numpy as np
angles = [98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381,
101.70374296, 103.15715294, 104.4653976, 105.50441485, 106.82885361,
107.4605319, 108.93228646, 111.22463712, 112.23658018, 113.31223886,
113.4000603, 114.14565594, 114.79809084, 115.15788861, 115.42991416,
115.66216071, 115.69821092, 116.56319054, 117.09232139, 119.30835385,
119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387, ]
hist, edges = np.histogram(angles, bins=20)
bin_centers = 0.5 * (edges[:-1] + edges[1:])
bin_widths = edges[1:] - edges[:-1]
plt.bar(bin_centers, hist, width=bin_widths)
plt.plot(bin_centers, hist, 'r')
plt.xlabel(r'angle [$^\circ$]')
plt.ylabel('frequency')
plt.show()
The result looks like this:
If you are not interested in the histogram itself, leave out the line plt.bar(bin_centers, hist, width=bin_widths).
EDIT2:
I don't really see the scientific value in a smoothed histogram. If you increase the resolution of the histogram (the bins parameter in the np.histogram command), it can change quite considerably. For instance, new peaks may occur if you increase the bin count, or two peaks may merge into one if you decrease the bin count. Keeping this in mind, smoothing the histogram curve suggests that you have more data than you do. However, if you really must, you can smooth a curve as explained in this answer, i.e.
# scipy.interpolate.spline has been removed from SciPy; make_interp_spline is the current replacement
from scipy.interpolate import make_interp_spline
x = np.linspace(edges[0], edges[-1], 500)
y = make_interp_spline(bin_centers, hist)(x)
and then plot y over x.

A lognormal distribution in python

I have seen several questions on Stack Overflow regarding how to fit a log-normal distribution. Still, there are two clarifications that I need.
I have sample data whose logarithm follows a normal distribution, so I can fit the data using scipy.stats.lognorm.fit (i.e. a log-normal distribution).
The fit is working fine, and also gives me the standard deviation. Here is my piece of code with the results.
import numpy as np
from scipy import stats
sample = np.log10(data) #taking the log10 of the data
scatter, loc, mean = stats.lognorm.fit(sample)  # gives the parameters of the fit
x_fit = np.linspace(13.0, 15.0, 100)
pdf_fitted = stats.lognorm.pdf(x_fit, scatter, loc, mean)  # gives the PDF
print("scatter for data is %s" % scatter)
print("mean of data is %s" % mean)
THE RESULT
scatter for data is 0.186415047243
mean for data is 1.15731050926
From the plot you can clearly see that the mean is around 14.2, but what I get is 1.15! Why is this? Clearly log(mean) is not near 14.2 either!
In THIS POST and in THIS QUESTION it is mentioned that log(mean) is the actual mean.
But as you can see from my code above, the fit I obtained uses sample = log10(data), and it also seems to fit well. However, when I tried
sample = data
pdf_fitted = stats.lognorm.pdf(x_fit, scatter, loc, np.log10(mean))
the fit does not seem to work.
1) Why is the mean not 14.2?
2) How do I fill/draw vertical lines showing the 1-sigma confidence region?
You say
I have a sample data, the logarithm of which follows a normal distribution.
Suppose data is the array containing the samples. To fit this data to
a log-normal distribution using scipy.stats.lognorm, use:
s, loc, scale = stats.lognorm.fit(data, floc=0)
Now suppose mu and sigma are the mean and standard deviation of the
underlying normal distribution. To get the estimate of those values
from this fit, use:
estimated_mu = np.log(scale)
estimated_sigma = s
(These are not the estimates of the mean and standard deviation of
the samples in data. See the wikipedia page for the formulas
for the mean and variance of a log-normal distribution in terms of mu and sigma.)
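For completeness, a small sketch of those formulas (my addition, using the estimated_mu and estimated_sigma from above); the mean and variance of the log-normal itself, as opposed to those of log(data), are:
lognormal_mean = np.exp(estimated_mu + estimated_sigma**2 / 2)
lognormal_var = (np.exp(estimated_sigma**2) - 1) * np.exp(2 * estimated_mu + estimated_sigma**2)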
To combine the histogram and the PDF, you can use, for example:
import matplotlib.pyplot as plt
plt.hist(data, bins=50, density=True, color='c', alpha=0.75)
xmin = data.min()
xmax = data.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.lognorm.pdf(x, s, scale=scale)
plt.plot(x, pdf, 'k')
If you want to see the log of the data, you could do something like the following. Note that the PDF of the normal distribution is used here.
logdata = np.log(data)
plt.hist(logdata, bins=40, density=True, color='c', alpha=0.75)
xmin = logdata.min()
xmax = logdata.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.norm.pdf(x, loc=estimated_mu, scale=estimated_sigma)
plt.plot(x, pdf, 'k')
By the way, an alternative to fitting with stats.lognorm is to fit log(data)
using stats.norm.fit:
logdata = np.log(data)
estimated_mu, estimated_sigma = stats.norm.fit(logdata)
Related questions:
Fitting lognormal distribution using Scipy vs Matlab
Lognormal Random Numbers Centered around a high value
