Fitting a curve to a histogram and extracting the functional form - Python

If I have plotted a histogram in Python using matplotlib, how can I easily extract the functional form of the histogram, or I suppose the function of the best-fit curve to the histogram? I'm not sure how to plot this best-fit curve. Any help is appreciated, thanks.
The shape of my histogram is like an inverted Lennard-Jones potential.

I'm just going to answer both to be thorough. These are two separate problems: fitting a function to your histogram data, and then plotting the function. First of all, scipy has an optimization module that you can use to fit your function; among its routines, curve_fit is probably the easiest to use.
To give an example,
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
# Model function
def f(x, a, b):
    return a * x + b
# Example data
x = np.linspace(0, 10, 20)
y = f(x, 0.2, 3.4) + 0.2 * np.random.normal(size=len(x))
# Do the fit
popt, pcov = curve_fit(f, x, y, [1.0, 1.0])
From curve_fit you get the optimized parameters a, b for your function f, plus the statistical covariance matrix. You can also pass measurement errors as statistical weights via the sigma argument.
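As a quick aside (not in the original answer): a common idiom is to read one-sigma parameter uncertainties off the diagonal of pcov.
# Standard errors of a and b from the covariance matrix
perr = np.sqrt(np.diag(pcov))
print(popt, perr)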
Now you can plot the data and the fitted curve. I guess it makes sense to use a higher resolution in x for the curve.
# Plot data
plt.plot(x, y, 'o')
# Plot fit curve
fit_x = np.linspace(0, 10, 200)
plt.plot(fit_x, f(fit_x, *popt))
plt.show()
I haven't specifically dealt with a histogram or a Lennard-Jones potential here, to limit the complexity of the code and focus on the part you asked about. But this example can be adapted to any kind of least-squares optimization problem.
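To connect this back to the histogram itself: a histogram is just binned data, so you can fit the model to the bin centers and counts. A minimal sketch, where the samples array and the Gaussian-shaped model are placeholders for your own data and model:
import numpy as np
from scipy.optimize import curve_fit
samples = np.random.normal(size=1000)     # stand-in for your data
counts, edges = np.histogram(samples, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])  # bin centers as x-values
def model(x, a, mu, sigma):
    return a * np.exp(-(x - mu)**2 / (2 * sigma**2))
popt, pcov = curve_fit(model, centers, counts, p0=[counts.max(), 0.0, 1.0])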

Related

How to plot the graph of a non-linear function

I want to plot the graph of this function:
y = 2[1-e^(-x+1)]^2-2
When I plotted a linear function, I used this code:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(...)
y = np.array(...)
z = np.polyfit(x, y, 2)
p = np.poly1d(z)
xp = np.linspace(...)
_ = plt.plot(x, y, '.', xp, p(xp), '-')
plt.ylim(0, 200)
plt.show()
When the function is non-linear, this does not work, because it is hard to produce each x, y value by hand.
How can I plot a non-linear function?
I hate to be the one to break this news to you, but polynomials of order greater than one are technically nonlinear too.
When you plot in matplotlib, you're really supplying discrete x and y values at a resolution sufficient to be visually pleasing. In this case, you've chosen xp to determine the points you plot for the parabola. You then call p(xp) to generate an array of y-values at those locations.
There's nothing stopping you from generating y-values for your formula of interest using plain numpy functions:
y = 2 * (1 - np.exp(1 - xp))**2 - 2
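Putting it together, a minimal sketch (the plotting range is an assumption; pick whatever interval you care about):
import numpy as np
import matplotlib.pyplot as plt
xp = np.linspace(0, 5, 200)              # plotting grid, assumed range
y = 2 * (1 - np.exp(1 - xp))**2 - 2      # the formula from the question
plt.plot(xp, y, '-')
plt.show()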

Use of curve_fit to fit data of 2 variables in a list

I am kind of new to scipy and curve_fit.
I have 2 lists:
x values:
[0.723938224, 0.965250965, 1.206563707, 1.447876448, 1.689189189,
1.930501931, 2.171814672]
y values:
[2.758, 2.443, 2.142333333, 1.911, 1.817666667, 1.688333333, 1.616]
I would like to perform a curve_fit on these 2 datasets, but I cannot seem to figure out the relationship. I know roughly the equation that fits them both together:
0.74/((9.81*(x/100))^(1/2))
But how would I confirm, using a Python curve fit, that the equation above really is the right one? If I do a similar thing in Excel, it automatically gives me the equation. How would it work in Python?
I am not sure how to perform the curve_fit and draw the trendline. Could someone help? Thanks.
For a start, let's define the curve fit function. You say that Excel tells you the function is of the form a/(b*(x/c)**d). I tell you, Excel knows nothing about anything apart from autofill; this equation can easily be transformed into ((a*c**d)/b)/x**d, so the function we actually have to consider is of the form a/x**b.
Now to the actual curve fitting with scipy:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [0.723938224, 0.965250965, 1.206563707, 1.447876448, 1.689189189, 1.930501931, 2.171814672]
y = [2.758, 2.443, 2.142333333, 1.911, 1.817666667, 1.688333333, 1.616]
def func(x, a, b):
    return a/(x**b)
#start values, not really necessary here but good to know the concept
p0 = [2, 0.5]
#the actual curve fitting, returns the parameters in popt and the covariance matrix in pcov
popt, pcov = curve_fit(func, np.asarray(x), np.asarray(y), p0)
#print out the parameters a, b
print(*popt)
#a=2.357411406488454, b=0.5027391574181408
#plot the function to see, if the fit is any good
#first the raw data
plt.scatter(x, y, marker="x", color="red", label="raw data")
#then the fitted curve
x_fit = np.linspace(0.9*min(x), 1.1*max(x), 1000)
y_fit = func(x_fit, *popt)
plt.plot(x_fit, y_fit, color="blue", label="fitted data")
plt.legend()
plt.show()
Output: (plot of the raw data with the fitted curve, omitted here)
Looks good to me. And one shouldn't let Excel near any statistical data, if you ask me.
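As a sanity check (my own calculation, not part of the original answer): the proposed formula 0.74/((9.81*(x/100))^(1/2)) reduces exactly to a/x**b with b = 0.5 and a = 0.74/sqrt(9.81/100) ≈ 2.363, which is close to the fitted values above.
import numpy as np
a_expected = 0.74 / np.sqrt(9.81 / 100)  # ~2.3627 vs fitted ~2.3574
b_expected = 0.5                         # vs fitted ~0.5027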

Using scipy gaussian kernel density estimation to calculate CDF inverse

The gaussian_kde class in scipy.stats has an evaluate method that returns the value of the PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is generating Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?
The example below shows how this should work for the case of a Gaussian distribution. First I show how to do the PDF calculation to set up the specific API I'm trying to achieve:
import numpy as np
from scipy.stats import norm, gaussian_kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
Is there an analogously simple way to compute the inverse CDF? The norm class has a very handy isf method that does exactly this:
cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)
Does such a function exist for gaussian_kde? Or is it straightforward to construct one from the already implemented methods?
The method integrate_box_1d can be used to compute the CDF, but it is not vectorized, so you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to scipy.special.ndtr) in vector form may speed things up.
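For reference, the looped version (assuming the kde and x defined in the question, and numpy as np) would simply be:
cdf_loop = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in x])
The vectorized form then looks like this: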
from scipy.special import ndtr
stdev = np.sqrt(kde.covariance)[0, 0]  # KDE bandwidth in the one-dimensional case
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)
The plot of the inverse function would then be plt.plot(pde_cdf, x). If the goal is to compute the inverse function at a specific point, consider building an interpolating spline through the computed CDF values and using it in inverse form, i.e. interpolating x as a function of the CDF.
You can use some Python tricks for a fast and memory-efficient estimation of the CDF (based on the answer above):
from scipy.special import ndtr
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
            for item in x)
It works about as fast as the answer above, but has linear space complexity (len(kde.dataset)) instead of quadratic (actually, len(kde.dataset) * len(x)).
All you have to do next is apply an inverse approximation, for instance from statsmodels.
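A possible sketch of that last step, reusing the pieces above (n, x, and stdev as defined in the previous answer; monotone_fn_inverter builds an interpolator mapping fn(x) back to x):
from scipy.special import ndtr
from statsmodels.distributions.empirical_distribution import monotone_fn_inverter
def kde_cdf(q):
    # vectorized CDF of the KDE, same construction as above
    return ndtr(np.subtract.outer(q, n) / stdev).mean(axis=1)
inv_cdf = monotone_fn_inverter(kde_cdf, x)  # callable: probability -> value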
The question has been answered in the other answers but it took me a while to wrap my mind around everything. Here is a complete example of the final solution:
import numpy as np
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
# cdf and inv cdf for the normal distribution are available directly from scipy
norm_cdf = norm.cdf(x)
# the inverse CDF (ppf) takes probabilities, so evaluate it on a grid in (0, 1)
p = np.linspace(0.01, 0.99, npts_sample)
norm_inv = norm.ppf(p)
# estimate cdf
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
            for item in x)
# estimate inv cdf
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)
fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p, norm_inv, c='k')
ax[2].plot(p, inversefunction(p), c='r', ls='--')
ax[2].set_title("Inverse CDF")
plt.show()

Confidence interval for LOWESS in Python

How would I calculate the confidence intervals for a LOWESS regression in Python? I would like to add these as a shaded region to the LOESS plot created with the following code (other packages than statsmodels are fine as well).
import numpy as np
import pylab as plt
import statsmodels.api as sm
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)
plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()
I've added an example plot with a confidence interval below, taken from the weblog Serious Stats (it was created using ggplot in R).
LOESS doesn't have an explicit concept of standard error; it just doesn't mean anything in this context. Since that's out, you're stuck with the brute-force approach:
Bootstrap your data, and fit a LOESS curve to each bootstrapped sample. See the middle of this page for a pretty picture of what you're doing: http://statweb.stanford.edu/~susan/courses/s208/node20.html
Once you have your large number of different LOESS curves, you can find the top and bottom Xth percentiles pointwise, as sketched below.
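A minimal sketch of that procedure, using the statsmodels lowess from the question (the grid, the resampling scheme, and the interpolation onto a common grid are my own choices, not a canonical API):
import numpy as np
import statsmodels.api as sm
def lowess_with_ci(x, y, frac=0.1, n_boot=1000, alpha=0.05, seed=0):
    # Pointwise bootstrap confidence band for a LOWESS fit.
    rng = np.random.default_rng(seed)
    grid = np.sort(x)
    boot_curves = np.empty((n_boot, grid.size))
    for i in range(n_boot):
        idx = rng.integers(0, x.size, x.size)  # resample pairs with replacement
        fit = sm.nonparametric.lowess(y[idx], x[idx], frac=frac)
        # evaluate each bootstrap curve on the common grid by interpolation
        boot_curves[i] = np.interp(grid, fit[:, 0], fit[:, 1])
    lower = np.percentile(boot_curves, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot_curves, 100 * (1 - alpha / 2), axis=0)
    return grid, lower, upper
The band can then be drawn with plt.fill_between(grid, lower, upper, alpha=0.33) on top of the plot from the question.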
This is a very old question, but it's one of the first that pops up in a Google search. You can do this using the loess() function from scikit-misc. Here's an example (I tried to keep your original variable names, but I bumped up the noise a bit to make it more visible):
import numpy as np
import pylab as plt
from skmisc.loess import loess
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.4
l = loess(x,y)
l.fit()
pred = l.predict(x, stderror=True)
conf = pred.confidence()
lowess = pred.values
ll = conf.lower
ul = conf.upper
plt.plot(x, y, '+')
plt.plot(x, lowess)
plt.fill_between(x,ll,ul,alpha=.33)
plt.show()
Result: (plot of the data, the LOESS curve, and the shaded confidence band, omitted here)
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way. It provides different smoothing algorithms together with the possibility to compute intervals.
In the case of LowessSmoother:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
from tsmoothie.utils_func import sim_randomwalk
# generate 10 randomwalks of length 200
np.random.seed(33)
data = sim_randomwalk(n_series=10, timesteps=200,
                      process_noise=10, measure_noise=30)
# operate smoothing
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)
# generate intervals
low, up = smoother.get_intervals('prediction_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
plt.show()
I also point out that tsmoothie can carry out the smoothing of multiple time-series in a vectorized way. I hope this can help someone.

Fitting either Gaussian or Gamma distribution to data in Python

I have some measured data which can be either a well-established Gaussian or something that seems to be a gamma distribution. I currently have the following code (snippet), which performs quite well for data that is nicely Gaussian:
import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def gaussFunction(x, A, mu, sigma):
    return A*numpy.exp(-(x-mu)**2/(2.*sigma**2))

# Snippet of the code that does the fitting
p0 = [numpy.max(y_points), x_points[numpy.argmax(y_points)], 0.1]
# Attempt to fit a gaussian function to the calibrant space
try:
    coeff, var_matrix = curve_fit(gaussFunction, x_points, y_points, p0)
    newX = numpy.linspace(x_points[0], x_points[-1], 1000)
    newY = gaussFunction(newX, *coeff)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(x_points, y_points, 'b*')
    plt.plot(newX, newY, '--')
    plt.show()
except RuntimeError:
    pass  # the fit failed to converge
Demonstration that it works well for data points which are nicely Gaussian: (figure omitted)
The problem, however, is that some of my data do not match a Gaussian well, and I get this: (figure omitted)
I would be tempted to try a cubic spline, but conceptually I would like to stick to a Gaussian curve fit, since that is the structure that should be within the data (possibly with a knee or a tail, as shown in the second figure). I would highly appreciate any tip or suggestion on how to deal with this issue.
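One possible direction, not from the original thread: keep the same curve_fit machinery but swap in a gamma-shaped model via scipy.stats.gamma, which captures exactly the kind of knee or tail asymmetry described. A hedged sketch, with x_points and y_points as in the question and rough p0 guesses:
import numpy
from scipy.optimize import curve_fit
from scipy.stats import gamma
def gammaFunction(x, A, k, loc, theta):
    # Gamma-shaped peak: amplitude A, shape k, onset loc, scale theta
    return A * gamma.pdf(x, k, loc=loc, scale=theta)
p0 = [numpy.max(y_points), 2.0, numpy.min(x_points), 1.0]
coeff, var_matrix = curve_fit(gammaFunction, x_points, y_points, p0, maxfev=5000)
One could then fit both models and keep whichever gives the smaller residual sum of squares.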
