I have a Gaussian function that I have fitted to data from a data file. I now need to integrate the Gaussian to get the area under it.
This is my Gaussian function:
def I(theta, max_x, max_y, sigma):
    # requires: import math, import numpy as np, and pi (e.g. from math import pi)
    return (max_y / (sigma * math.sqrt(2 * pi))) * np.exp(-((theta - max_x)**2) / (2 * sigma**2))
Comparing with the general formula
N(x | mu, sigma, n) := (n / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))
i.e. n = max_y, mu = max_x, x = theta
This is what is given on another page:
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral(N(x | mu, sigma, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution given parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2)))).
I am unsure how this helps. I just want to integrate my function over the range of x values I plotted. Is it saying that this is the integral:
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2))))
What you have there is the indefinite integral, i.e. the integral from minus infinity up to z. If you want a numerical answer between two x limits, evaluate that function at the two limits and take the difference.
Your Gaussian function is defined over all real numbers (−∞, +∞), but in practice you are only interested in the middle part, since the tails are very close to 0. To get a numerical estimate of the total area, do exactly that: evaluate Phi at a point on each side of the Gaussian's peak, far enough out that the density is essentially 0, and take the difference.
If Phi(z | mu, sigma, n) returned a function, you could do:
integral = Phi(z | mu, sigma, n)
area = integral(X_HIGH) - integral(X_LOW)
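For example, a minimal sketch in Python using scipy.special.erf and the formula above (the function name phi and the limits x_low / x_high are placeholders; you would plug in your fitted max_x, sigma and max_y):
import numpy as np
from scipy.special import erf

def phi(z, mu, sigma, n):
    # integral of the scaled Gaussian from minus infinity up to z
    return (n / 2) * (1 + erf((z - mu) / (sigma * np.sqrt(2))))

mu, sigma, n = 0.0, 1.0, 1.0                    # placeholders for the fitted parameters
x_low, x_high = mu - 5 * sigma, mu + 5 * sigma  # limits far out in the tails
area = phi(x_high, mu, sigma, n) - phi(x_low, mu, sigma, n)
print(area)  # close to n, since these limits cover essentially the whole curve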
If the PDF of the normal distribution is:
scipy.stats.norm.pdf(x, mu, sigma)
and its first derivative with respect to x is:
scipy.stats.norm.pdf(x, mu, sigma)*(mu - x)/sigma**2
what would be the second derivative?
You can apply the product rule:
(f(x)*g(x))' = f(x)*g'(x) + f'(x)*g(x)
where f(x) = pdf(x, mu, sigma) and g(x) = (mu - x)/sigma**2.
Then f'(x) = f(x) * g(x)
and g'(x) = -1/sigma**2.
Putting it all together, the second derivative of the PDF is
def second_derivative(x, mu, sigma):
    # requires: import scipy.stats
    g = (mu - x) / sigma**2  # the factor g(x) from the first derivative
    return scipy.stats.norm.pdf(x, mu, sigma) * (g**2 - 1/sigma**2)
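As a quick sanity check (a sketch, assuming scipy.stats is imported and using the second_derivative function above), you can compare it against a central finite-difference approximation of the second derivative:
x, mu, sigma, h = 0.7, 0.0, 1.0, 1e-4
fd = (scipy.stats.norm.pdf(x + h, mu, sigma)
      - 2 * scipy.stats.norm.pdf(x, mu, sigma)
      + scipy.stats.norm.pdf(x - h, mu, sigma)) / h**2
print(second_derivative(x, mu, sigma), fd)  # the two values should agree to several decimals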
I am trying to fit a Gaussian model to Gaussian-distributed data (x, y) using scipy's curve_fit, and I am trying to tweak the fitting parameters to get a better fit. I saw that curve_fit calls scipy.optimize.leastsq with the LM (Levenberg-Marquardt) method. It seems to me that it constructs a function that evaluates the least-squares criterion at each data point. In my example, I have 8 data points. As I understand it, and according to scipy's documentation, gtol is the "Orthogonality desired between the function vector and the columns of the Jacobian."
popt, pcov = optimize.curve_fit(parametrized_gaussian, patch_indexes * pixel_size, sub_sig,
                                p0=p0, jac=gaussian_derivative_wrt_param, maxfev=max_fev,
                                gtol=1e-11, ftol=1e-11, xtol=1e-11)
parametrized_gaussian is simply:
def parametrized_gaussian(x, a, x0, sigma):
    res = a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
    return res.astype('float64')
and gaussian_derivative_wrt_param is:
def gaussian_derivative_wrt_param(x, a, x0, sigma):
    return np.array([parametrized_gaussian(x, a, x0, sigma) / a,
                     2 * (x - x0) / (sigma ** 2) * parametrized_gaussian(x, a, x0, sigma),
                     (x - x0) ** 2 / (sigma ** 3) * parametrized_gaussian(x, a, x0, sigma)]).swapaxes(0, -1).astype('float64')
I wanted to check the value of the Jacobian at the resulting optimal parameters, but I do not understand the values that I get. When curve_fit calls leastsq, it then uses:
retval = _minpack._lmder(func, Dfun, x0, args, full_output,
col_deriv, ftol, xtol, gtol, maxfev,
factor, diag)
I print Dfun(retval[0]), because retval[0] contains the optimal parameter values. This is what I get:
0.18634,  -6175.62246,  5660.31995
0.50737, -10685.47212,  6223.84575
0.88394,  -7937.93400,  1971.45501
0.98540,   3054.98273,   261.93803
0.70291,  10670.53623,  4479.93075
0.32083,   8746.05579,  6594.01140
0.09370,   3686.25245,  4010.79420
0.01751,    900.40686,  1280.50557
These are the results of Dfun(optimal parameters) on the grid of 8 points. How does this respect gtol? That is why I think I do not understand how gtol works.
From scipy/optimize/minpack/lmder.f, we find a more detailed description:
c gtol is a nonnegative input variable. termination
c occurs when the cosine of the angle between fvec and
c any column of the jacobian is at most gtol in absolute
c value. therefore, gtol measures the orthogonality
c desired between the function vector and the columns
c of the jacobian.
This just means that if gtol=0, then f(x_optimal) and the columns of the Jacobian are perpendicular at convergence. If that is the case, then f'(x_optimal).T @ f(x_optimal) is a zero vector. Since this product is used as part of the iteration, it makes sense to stop when it is 0, because no more progress can be made.
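As an illustration of what the criterion measures (a conceptual sketch, not exactly what MINPACK computes internally; fvec and J stand for the residual vector and the Jacobian evaluated at the optimum, however you obtain them):
import numpy as np

def gtol_cosines(fvec, J):
    # |cos(angle)| between the residual vector and each column of the Jacobian;
    # the termination test asks for all of these to be at most gtol
    return np.abs(J.T @ fvec) / (np.linalg.norm(J, axis=0) * np.linalg.norm(fvec))

# e.g. with the question's functions (the sign of the residual does not matter here):
# residuals = parametrized_gaussian(patch_indexes * pixel_size, *popt) - sub_sig
# print(gtol_cosines(residuals, gaussian_derivative_wrt_param(patch_indexes * pixel_size, *popt)))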
I have very little knowledge of statistics, so forgive me, but I'm very confused by how the numpy function std works, and the documentation is unfortunately not clearing it up.
From what I understand it will compute the standard deviation of a distribution from the array, but when I set up a Gaussian with a standard deviation of 0.5 with the following code, numpy.std returns 0.2:
import numpy as np
import matplotlib.pyplot as plt

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-1 / 2) * ((x - mu) / sigma)**2)
plt.plot(x, f)
plt.show()
print(np.std(f))
This is the distribution:
I have no idea what I'm misunderstanding about how the function works. I thought maybe I would have to tell it the x-values associated with the y-values of the distribution but there's no argument for that in the function. Why is numpy.std not returning the actual standard deviation of my distribution?
I suspect that you understand perfectly well how the function works, but are misunderstanding the meaning of your data. Standard deviation is a measure of the spread of data about the mean value.
When you say std(f), you are computing the spread of the y-values about their mean. Looking at the graph in the question, a vertical mean of ~0.5 and a standard deviation of ~0.2 are not far fetched. Notice that std(f) does not involve the x-values in any way.
What you are expecting to get is the standard deviation of the x-values, weighted by the y-values. This is essentially the idea behind a probability density function (PDF).
Let's go through the computation manually to understand the differences. The mean of the x-values is normally x.sum() / x.size. But that is only true if the weight of each value is 1. If you weight each value by the corresponding f value, you can write
m = (x * f).sum() / f.sum()
Standard deviation is the root-mean-square deviation about the mean: compute the average squared deviation from the mean, then take the square root. We can compute the weighted mean of the squared deviations in exactly the same way as before:
s = np.sqrt(np.sum((x - m)**2 * f) / f.sum())
Notice that the value of s computed this way from your question is not 0.5, but rather 0.44. This is because your PDF is incomplete, and the missing tails add significantly to the spread.
Here is an example showing that the standard deviation converges to the expected value as you compute it for a larger sample of the PDF:
>>> def s(x, y):
... m = (x * y).sum() / y.sum()
... return np.sqrt(np.sum((x - m)**2 * y) / y.sum())
>>> sigma = 0.5
>>> x1 = np.linspace(-1, 1, 100)
>>> y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x1 / sigma)**2)
>>> s(x1, y1)
0.4418881290522094
>>> x2 = np.linspace(-2, 2, 100)
>>> y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x2 / sigma)**2)
>>> s(x2, y2)
0.49977093783005005
>>> x3 = np.linspace(-3, 3, 100)
>>> y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
>>> s(x3, y3)
0.49999998748515206
np.std computes the standard deviation of the array you pass to it, as follows:
First, compute the mean of the values.
Then find the sum of (x - mean)**2 over all values.
Then divide that sum by the number of elements to get the mean squared deviation.
Then take the square root of that mean.
Thus the function is calculating the standard deviation of the values passed to it, not of the distribution they describe.
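For example, a minimal check (assuming numpy is imported and f is the array from the question):
manual = np.sqrt(np.mean((f - f.mean())**2))
print(manual, np.std(f))  # both are about 0.2: the spread of the y-values, not sigma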
I do not understand this homework question I have received. Our task is to develop the Python function normdist(x, mu, sigma), which evaluates the multivariate Gaussian probability density function for the k-dimensional vector x, the mean vector μ and the covariance matrix Σ. In the special case where k = 1, this function evaluates the univariate Gaussian probability density function for the scalar x, the mean μ and the standard deviation σ.
My attempt is below:
def normcdf(x, mu, sigma):
    t = x - mu
    # erfcc is a complementary error function, assumed to be defined elsewhere
    y = 0.5 * erfcc(-t / (sigma * sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x - mu) / abs(sigma)
    y = (1 / (sqrt(2 * pi) ** k * abs(sigma))) * exp(-u * u / 2)  # note: k is not defined in this function
    return y

def normdist(x, mu, sigma, k):
    if k:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y
Above code credited to Cerin
How do I handle the case of k = 1?
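For reference, a minimal sketch of one way normdist could handle both cases, treating the third argument as a scalar standard deviation when k = 1 and as a covariance matrix otherwise (this is an assumption about what the assignment wants, not a confirmed solution; it uses numpy):
import numpy as np

def normdist(x, mu, sigma):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    cov = np.atleast_2d(np.asarray(sigma, dtype=float))
    if x.size == 1:
        cov = cov ** 2  # a scalar sigma is a standard deviation, so square it to get the variance
    k = x.size
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

print(normdist(0.0, 0.0, 1.0))  # about 0.3989, the standard normal density at 0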
I am learning gradient descent for calculating coefficients. Below is what I am doing:
#!/usr/bin/Python
import numpy as np

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost per example (the 2 in 2*m doesn't really matter here.
        # But to be consistent with the gradient, I include it)
        cost = np.sum(loss ** 2) / (2 * m)
        # print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update
        theta = theta - alpha * gradient
    return theta
X = np.array([41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8])
y = np.array([251.3,251.3,248.3,267.5,273.0,276.5,270.3,274.9,285.0,290.0,297.0,302.5,304.5,309.3,321.7,330.7,349.0])
n = np.max(X.shape)
x = np.vstack([np.ones(n), X]).T
m, n = np.shape(x)
numIterations= 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)
Now my above code works fine. If I now try multiple variables and replace X with X1 like the following:
X1 = np.array([[41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8], [29.1,29.3,29.5,29.7,29.9,30.3,30.5,30.7,30.8,30.9,31.5,31.7,31.9,32.0,32.1,32.5,32.9]])
then my code fails and shows me the following error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
[ nan nan nan]
Can anybody tell me how can I do gradient descent using X1? My expected output using X1 is:
[-153.5 1.24 12.08]
I am open to other Python implementations also. I just want the coefficients (also called thetas) for X1 and y.
The problem is that your algorithm does not converge; it diverges instead. The first error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
comes from the fact that at some point the square can no longer be represented, because 64-bit floats cannot hold numbers larger than about 1.8 * 10^308.
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
This is only a consequence of the previous error: the values are no longer finite, so the subtraction produces NaNs.
You can actually see the divergence by uncommenting your debug print line. The cost starts to grow, as there is no convergence.
If you try your function with X1 and a smaller value for alpha, it converges.
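For example, a sketch (assuming the gradientDescent function and the X1, y arrays from the question are already defined; the learning rate and iteration count are illustrative guesses, and with unscaled features like these, plain gradient descent needs a very large number of iterations to get close):
n = X1.shape[1]                     # 17 examples
x1 = np.vstack([np.ones(n), X1]).T  # add the intercept column -> shape (17, 3)
m = x1.shape[0]

theta = gradientDescent(x1, y, np.ones(3), alpha=1e-4, m=m, numIterations=1000000)
print(theta)  # drifts towards the least-squares solution as the iteration count grows

# Independent check of the target coefficients via ordinary least squares:
print(np.linalg.lstsq(x1, y, rcond=None)[0])  # approximately [-153.5, 1.24, 12.08]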