Using scipy gaussian kernel density estimation to calculate CDF inverse - python

The gaussian_kde class in scipy.stats has a method evaluate that returns the value of the PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is to generate Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?
The example below shows how this should work for the case of a Gaussian distribution. First I show how to do the PDF calculation to set up the specific API I'm trying to achieve:
import numpy as np
from scipy.stats import norm, gaussian_kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
Is there an analogously simple way to compute the inverse CDF? The norm distribution has a very handy isf method (the inverse survival function) that does exactly this:
cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)
Does such a function exist for gaussian_kde? Or is it straightforward to construct such a function from the already implemented methods?

The method integrate_box_1d can be used to compute the CDF, but it is not vectorized; you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to special.ndtr) in vector form may speed things up.
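from scipy.special import ndtr
import matplotlib.pyplot as plt
# kernel standard deviation (bandwidth) of the 1-D KDE
stdev = np.sqrt(kde.covariance)[0, 0]
# CDF on the grid x: average of the Gaussian-kernel CDFs centred on each data point
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)
The plot of the inverse function would be plt.plot(pde_cdf, x). If the goal is to compute the inverse function at a specific point, consider interpolating the computed CDF values with a spline and inverting that interpolant.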
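For a single probability, an alternative to interpolation is to solve CDF(q) = p with a root finder. The following is only a sketch under stated assumptions: the helper names kde_cdf and kde_ppf are hypothetical (they are not part of scipy), the sample is synthetic, and the bracketing interval is a heuristic of ten bandwidths beyond the data range.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde

# synthetic standard-normal sample, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=5000)
kde = gaussian_kde(sample)

def kde_cdf(q):
    """CDF of the 1-D KDE at a scalar q (hypothetical helper)."""
    return kde.integrate_box_1d(-np.inf, q)

def kde_ppf(p):
    """Inverse CDF of the KDE at a scalar probability 0 < p < 1 (hypothetical helper)."""
    bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation
    lo = sample.min() - 10 * bw         # CDF is essentially 0 here
    hi = sample.max() + 10 * bw         # CDF is essentially 1 here
    return brentq(lambda q: kde_cdf(q) - p, lo, hi)

median = kde_ppf(0.5)
```

Each call runs a separate root search, so this suits a handful of probabilities; for many points the interpolation approach is likely cheaper.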

You can use some python tricks for fast and memory-effective estimation of the CDF (based on this answer):
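from scipy.special import ndtr
# kernel standard deviation; kde.factor alone is only the bandwidth factor
stdev = np.sqrt(kde.covariance[0, 0])
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / stdev).mean()
            for item in x)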
It runs about as fast as this answer, but its space complexity is linear, O(len(kde.dataset)), instead of O(len(kde.dataset) * len(x)).
All that remains is to invert the CDF numerically, for instance with an inverse-function approximation from statsmodels.

The question has been answered in the other answers but it took me a while to wrap my mind around everything. Here is a complete example of the final solution:
import numpy as np
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
# cdf and inv cdf are available directly from scipy
norm_cdf = norm.cdf(x)
# the inverse CDF takes probabilities, so evaluate it on a grid inside (0, 1)
p = np.linspace(0.01, 0.99, npts_sample)
norm_inv = norm.ppf(p)
# estimate cdf (the kernel standard deviation is the sqrt of the KDE covariance)
stdev = np.sqrt(kde.covariance[0, 0])
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / stdev).mean()
            for item in x)
# estimate inv cdf by interpolating x as a function of the cdf values
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)
fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p, norm_inv, c='k')
ax[2].plot(p, inversefunction(p), c='r', ls='--')
ax[2].set_title("Inverse CDF")
plt.show()
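Since the original motivation was generating Monte Carlo realizations, here is a hedged sketch of inverse transform sampling from the KDE. It swaps the cubic interp1d for plain linear np.interp, uses a synthetic sample, and picks the grid limits (three bandwidths beyond the data) as an assumption. It also shows gaussian_kde.resample, which draws from the KDE directly and may be all that is needed.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde

# synthetic standard-normal sample, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=2000)
kde = gaussian_kde(data)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

# CDF of the KDE on a grid that extends a little beyond the data
grid = np.linspace(data.min() - 3 * bw, data.max() + 3 * bw, 1000)
cdf = ndtr(np.subtract.outer(grid, data) / bw).mean(axis=1)

# inverse transform sampling: push uniform draws through the inverse CDF
u = rng.uniform(size=1000)
samples = np.interp(u, cdf, grid)

# alternative: let scipy draw from the KDE; the result has shape (1, 1000)
direct = kde.resample(1000)[0]
```

Uniform draws that fall outside the tabulated CDF range are clamped to the grid ends by np.interp, so extreme tails are slightly truncated in this sketch.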

Related

Why does pdf of arange function have normal distribution?

arange produces stepwise incrementing values and is not a random function, so why does it give a random distribution?
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
I expected a uniform distribution.
scipy.stats.norm provides the normal distribution, not the uniform distribution. When you apply its probability density function (pdf), you are not applying a constant function but something else entirely (also known as the bell curve):
https://en.wikipedia.org/wiki/Normal_distribution
So in the end what you are seeing is the probability density function of the normal distribution evaluated at points between -3 and 3. If you want to see a uniform distribution:
import numpy as np
from scipy.stats import uniform
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.001)
# loc and scale define the support [loc, loc + scale]; the default is [0, 1]
plt.plot(x, uniform.pdf(x, loc=-3, scale=6))
But that is just a very fancy way to draw a constant line.

Python piecewise function interpolation

I am trying to construct a function that gives me interpolated values of a piecewise linear function. I tried linear spline interpolation (which should be able to do exactly this?), but without any luck. The problem is most visible on a log-scale plot. Below is the code of a small example I prepared:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
from scipy import interpolate
#Original Data
pwl_data = np.array([[0,1e3, 1e5, 1e8], [-90,-90, -90, -130]])
# spline interpolation
pwl_spline = interpolate.splrep(pwl_data[0], pwl_data[1])
spline_x = np.linspace (0,1e8, 10000)
legend = []
plt.plot(pwl_data[0],pwl_data[1])
plt.plot(spline_x,interpolate.splev(spline_x,pwl_spline ),'*')
legend.append("Data")
legend.append("Interpolated Data")
plt.xscale('log')
plt.legend(legend)
plt.grid(True)
plt.grid(True, which='minor', linestyle='--')
plt.show()
What am I doing wrong?
The spline fitting has to be performed on the linearized data, i.e. using log(x) instead of x:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
#Original Data
pwl_data = np.array([[1, 1e3, 1e5, 1e8], [-90, -90, -90, -130]])
x = pwl_data[0]
y = pwl_data[1]
log_x = np.log(x)
# spline interpolation
pwl_spline = interpolate.splrep(log_x, y)
spline_log_x = np.linspace(0, 18, 30)
spline_y = interpolate.splev(spline_log_x, pwl_spline )
plt.plot(log_x, y, '-o')
plt.plot(spline_log_x, spline_y, '-*')
plt.xlabel('log(x)');
Note: I removed the zero from the data, because log(0) is undefined. Also, spline fitting may not be the best choice if you want a piecewise linear function; you could have a look at this question for example: https://datascience.stackexchange.com/q/8457/53362
For plotting with matplotlib, consider matplotlib's step function, which internally performs a piecewise constant interpolation:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.step.html
You can invoke it simply via plt.step(x, y), given your inputs x and y.
In plotly, the argument line_shape='hv' for the Scatter plot achieves similar results; see https://plotly.com/python/line-charts/
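If a true piecewise linear interpolant is the goal, np.interp may be the simplest tool. The sketch below is not taken from the answers above; the helper names pwl_log and pwl_lin are hypothetical, and the zero is again dropped from the data so that the logarithm is defined.

```python
import numpy as np

# breakpoints of the piecewise linear function (zero removed so log10 is defined)
x = np.array([1, 1e3, 1e5, 1e8])
y = np.array([-90, -90, -90, -130])

def pwl_log(x_new):
    """Piecewise linear in log10(x): straight segments on a log-scale plot."""
    return np.interp(np.log10(x_new), np.log10(x), y)

def pwl_lin(x_new):
    """Piecewise linear in x itself: straight segments on a linear-scale plot."""
    return np.interp(x_new, x, y)

query = np.array([10.0, 1e4, 10**6.5, 1e8])
log_values = pwl_log(query)
```

Which of the two helpers is appropriate depends on whether the segments are meant to be straight on the logarithmic or on the linear axis.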

How to estimate density function and calculate its peaks?

I have started to use python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? I am using data that change over time, so if the data follow one distribution now (e.g. Gaussian), they could follow another one later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to try fitting the data to every distribution in order to find the data distribution. What is the difference between using gaussian_kde and the answer provided in that question? I used the code below. Is gaussian_kde a good way to estimate the pdf if the data change over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb as in here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
df = pd.read_csv(r'D:\dataset.csv')
# assumption: the values of interest are in the first column of the file
data_column = df.iloc[:, 0]
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data_column, density=True)
pl.show(block=True)
I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df.values.ravel())
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values.ravel(), bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all of its probability mass goes into the bin containing x0 and none elsewhere), whereas in non-parametric density estimation we use a bell-shaped curve (the Gaussian kernel) to represent that point, so its mass spreads over its neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with any distributional assumption on the underlying data x. Its sole purpose is smoothing.
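To make the "one bell curve per observation" picture concrete, the following sketch rebuilds the KDE by hand and compares it with gaussian_kde. The sample is synthetic and the variable names are illustrative; the only scipy detail relied on is that the kernel variance of a 1-D gaussian_kde is stored in kde.covariance.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# synthetic sample, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(size=200)
kde = gaussian_kde(data)

# kernel standard deviation chosen by scipy's default bandwidth rule
bw = np.sqrt(kde.covariance[0, 0])

# one Gaussian bump per observation, evaluated on a grid, then averaged
x = np.linspace(-4, 4, 101)
bumps = norm.pdf(x[:, None], loc=data[None, :], scale=bw)
manual_density = bumps.mean(axis=1)
```

Up to floating-point error, manual_density should agree with kde(x), which shows that the Gaussian kernel only controls the smoothing and not the shape assumed for the data.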
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if your quasi-Newton optimization algorithm starts in [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the grid
x[np.argmax(nparam_density)]
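The question also asks for all peaks, not only the global mode. One possible approach, sketched here as an assumption rather than as part of the answer above, is to run scipy.signal.find_peaks on the density evaluated on a grid. The prominence threshold (5% of the maximum density) is an arbitrary choice meant to suppress tiny wiggles, and the mixture is regenerated with a different random generator than in the example above.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# same kind of two-component mixture as above, regenerated for a self-contained example
rng = np.random.default_rng(0)
mixture = np.hstack([-6 + 3 * rng.standard_normal(1700),
                     4 + 1.5 * rng.standard_normal(300)])

grid = np.linspace(-20, 10, 601)
density = gaussian_kde(mixture)(grid)

# local maxima of the gridded density; prominence filters out negligible bumps
peaks, props = find_peaks(density, prominence=0.05 * density.max())
peak_locations = grid[peaks]

# the global mode is simply the highest grid point
mode = grid[np.argmax(density)]
```

The peak locations are only as precise as the grid spacing, and the number of peaks found depends on both the bandwidth and the prominence threshold.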

Confidence interval for LOWESS in Python

How would I calculate the confidence intervals for a LOWESS regression in Python? I would like to add these as a shaded region to the LOESS plot created with the following code (other packages than statsmodels are fine as well).
import numpy as np
import pylab as plt
import statsmodels.api as sm
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)
plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()
I've added an example plot with a confidence interval below, from the blog Serious Stats (it was created using ggplot in R).
LOESS doesn't have an explicit concept of standard error. It just doesn't mean anything in this context. Since that's out, you're stuck with the brute-force approach.
Bootstrap your data. You're going to fit a LOESS curve to each bootstrapped data set. See the middle of this page for a pretty picture of what you're doing: http://statweb.stanford.edu/~susan/courses/s208/node20.html
Once you have a large number of different LOESS curves, you can find the top and bottom Xth percentiles.
This is a very old question but it's one of the first that pops up on google search. You can do this using the loess() function from scikit-misc. Here's an example (I tried to keep your original variable names, but I bumped up the noise a bit to make it more visible)
import numpy as np
import pylab as plt
from skmisc.loess import loess
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.4
l = loess(x,y)
l.fit()
pred = l.predict(x, stderror=True)
conf = pred.confidence()
lowess = pred.values
ll = conf.lower
ul = conf.upper
plt.plot(x, y, '+')
plt.plot(x, lowess)
plt.fill_between(x,ll,ul,alpha=.33)
plt.show()
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie, a Python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the ability to compute intervals.
In the case of LowessSmoother:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import LowessSmoother
from tsmoothie.utils_func import sim_randomwalk
# generate 10 randomwalks of length 200
np.random.seed(33)
data = sim_randomwalk(n_series=10, timesteps=200,
                      process_noise=10, measure_noise=30)
# operate smoothing
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)
# generate intervals
low, up = smoother.get_intervals('prediction_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
I point out also that tsmoothie can carry out the smoothing of multiple time-series in a vectorized way. Hope this can help someone

fitting curve to histogram and extracting functional form - Python

If I have plotted a histogram in Python using matplotlib, how can I easily extract the functional form of the histogram, or, I suppose, the function of the best-fit curve to the histogram? I'm not sure how to plot this best-fit curve. Any help is appreciated, thanks.
The shape of my histogram is like an inverted lennard-jones potential.
I'm just going to answer both to be thorough. These are two separate problems: fitting a function to your histogram data and then plotting the function. First of all, scipy has an optimization module that you can use to fit your function. Among those curve_fit is probably the easiest.
To give an example,
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
# Model function
def f(x, a, b):
    return a * x + b
# Example data
x = np.linspace(0, 10, 20)
y = f(x, 0.2, 3.4) + 0.2 * np.random.normal(size=len(x))
# Do the fit
popt, pcov = curve_fit(f, x, y, [1.0, 1.0])
From curve_fit you get the optimized parameters a, b of your function f (popt) and their covariance matrix (pcov). You can also pass the measurement errors as sigma to use them as statistical weights.
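As a hedged illustration of the sigma argument and of reading uncertainties off the covariance matrix, the sketch below assumes a known, constant measurement error of 0.2 on every point; y_err and perr are illustrative names, and absolute_sigma=True tells curve_fit to treat sigma as absolute errors rather than relative weights.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x + b

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y_err = np.full(x.size, 0.2)  # assumed known measurement error per point
y = f(x, 0.2, 3.4) + y_err * rng.normal(size=x.size)

# weighted fit: points with smaller sigma count more
popt, pcov = curve_fit(f, x, y, p0=[1.0, 1.0], sigma=y_err, absolute_sigma=True)

# one-standard-deviation uncertainties of a and b
perr = np.sqrt(np.diag(pcov))
```

The off-diagonal entries of pcov describe how the errors in a and b are correlated.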
Now you can plot the data and the fitted curve. It makes sense to use a higher resolution in x for the curve.
# Plot data
plt.plot(x, y, 'o')
# Plot fit curve
fit_x = np.linspace(0, 10, 200)
plt.plot(fit_x, f(fit_x, *popt))
plt.show()
I haven't specifically dealt with a histogram nor a Lennard-Jones potential here to limit the complexity of the code and focus on the part you asked about. But this example can be adapted to any kind of least squares optimization issue.
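To show how the same recipe might be adapted to a histogram, here is a minimal sketch that fits a Gaussian-shaped model to bin heights. Everything specific in it is an assumption for illustration: the synthetic normal samples, the 40 bins, and the model function gaussian, which would need to be swapped for whatever form (for example an inverted Lennard-Jones shape) describes the real data.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amplitude, mu, sigma):
    """Illustrative model function; replace with the form that suits the data."""
    return amplitude * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# synthetic samples standing in for the data behind the histogram
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=5000)

# histogram as (bin centre, height) pairs
heights, edges = np.histogram(samples, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# least-squares fit of the model to the bin heights
p0 = [heights.max(), samples.mean(), samples.std()]
popt, pcov = curve_fit(gaussian, centers, heights, p0=p0)
amplitude, mu, sigma = popt
```

The fitted curve can be drawn over the histogram by evaluating gaussian(fit_x, *popt) on a fine grid, exactly as in the plotting code above.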
