I do not understand a homework question I have received. Our task is to develop the Python function normdist(x, mu, sigma), which evaluates the multivariate
Gaussian probability density function for the k-dimensional vector x, the mean vector μ and the covariance
matrix Σ. In the special case where k = 1, this function evaluates the univariate Gaussian probability
density function for the scalar x, the mean μ and the standard deviation σ.
My attempt is below:
def normcdf(x, mu, sigma):
    t = x-mu
    y = 0.5*erfcc(-t/(sigma*sqrt(2.0)))
    if y>1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x-mu)/abs(sigma)
    y = (1/(sqrt(2*pi)**k*abs(sigma)))*exp(-u*u/2)
    return y

def normdist(x, mu, sigma, k):
    if k:
        y = normcdf(x,mu,sigma)
    else:
        y = normpdf(x,mu,sigma)
    return y
Above code credited to Cerin
How do I handle the case of k = 1?
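One possible way to handle both cases is sketched below. It is not taken from the thread: it assumes, as the task statement says, that sigma is a standard deviation when k = 1 and a covariance matrix otherwise, and it infers k from the size of x instead of taking it as a fourth argument. The use of np.linalg.solve and np.linalg.det is a design choice of this sketch.

```python
import numpy as np

def normdist(x, mu, sigma):
    """Gaussian density; sigma is a std. dev. if x is scalar, else a covariance matrix."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    k = x.size  # dimension inferred from x (assumption of this sketch)

    if k == 1:
        # Univariate case: sigma is the standard deviation.
        s = abs(float(np.ravel(sigma)[0]))
        u = (x[0] - mu[0]) / s
        return float(np.exp(-0.5 * u * u) / (np.sqrt(2 * np.pi) * s))

    # Multivariate case: sigma is the k x k covariance matrix.
    cov = np.asarray(sigma, dtype=float)
    diff = x - mu
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * mahalanobis_sq) / norm_const)

print(normdist(0.0, 0.0, 1.0))                # standard normal at 0
print(normdist([0, 0], [0, 0], np.eye(2)))    # 2-D standard normal at the origin
```

For a cross-check, scipy.stats.norm.pdf and scipy.stats.multivariate_normal.pdf should give the same values for the univariate and multivariate cases respectively.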
If the PDF of the normal distribution is:
scipy.stats.norm.pdf(x, mu, sigma)
and its first derivative with respect to x is:
scipy.stats.norm.pdf(x, mu, sigma)*(mu - x)/sigma**2
what would the second derivative be?
You can apply the product rule
(f(x)*g(x))' = f(x)*g'(x) + f'(x)*g(x)
where f(x) = pdf(x, mu, sigma) and g(x) = (mu - x)/sigma**2.
Then f'(x) = f(x)*g(x)
and g'(x) = -1/sigma**2.
Putting it all together, the second derivative of the PDF is f''(x) = f(x)*g'(x) + f'(x)*g(x) = f(x)*(g(x)**2 - 1/sigma**2):
import scipy.stats

def second_derivative(x, mu, sigma):
    g = (mu - x)/sigma**2
    return scipy.stats.norm.pdf(x, mu, sigma)*(g**2 - 1/sigma**2)
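A quick way to gain confidence in a formula like this is to compare it against a central finite difference. The sketch below does that for f''(x) = f(x)*(g(x)**2 - 1/sigma**2) with g(x) = (mu - x)/sigma**2; the helper names, the step size h and the test point are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import norm

def pdf_second_derivative(x, mu, sigma):
    # f''(x) = f(x) * (g(x)**2 + g'(x)), with g(x) = (mu - x) / sigma**2
    g = (mu - x) / sigma**2
    return norm.pdf(x, mu, sigma) * (g**2 - 1 / sigma**2)

def central_second_difference(f, x, h=1e-4):
    # Standard three-point approximation of the second derivative.
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

x, mu, sigma = 0.7, 0.2, 1.3  # arbitrary test point
analytic = pdf_second_derivative(x, mu, sigma)
numeric = central_second_difference(lambda t: norm.pdf(t, mu, sigma), x)
print(analytic, numeric)  # the two values should agree to several digits
```

The second derivative vanishes at x = mu ± sigma, the inflection points of the bell curve, which makes a handy extra check.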
I would like to use the curve_fit function from the scipy.optimize module to determine the amplitudes, frequencies and phases of a sum of sine functions (plus one offset y0). This is easy when I know the number of sines to use. For example, when I know two frequencies from the DFT (Discrete Fourier Transform), 1.152 and 0.432, I can define a function:
def func(x, amp1, amp2, freq1, freq2, phase1, phase2, y0):
    return amp1*np.sin(freq1*x + phase1) + amp2*np.sin(freq2*x + phase2) + y0
Then, using curve_fit and constraining the frequency intervals, I can find a good fit:
param, _ = curve_fit(func, t, data, bounds=([-np.inf, -np.inf, 1.14, 0.43, -np.inf, -np.inf, -np.inf], [np.inf, np.inf, 1.16, 0.44, np.inf, np.inf, np.inf]))
It looks great:
But in this case I prepared the data myself and knew the number of frequencies. Do you know how to define func only once and handle all cases (for example, five sine functions)? I've tried putting the parameters into lists, e.g. amp = [amp1, amp2, ...], and iterating over their length, but then it is a problem to define bounds for the parameter lists. bounds is very important for keeping the model realistic.
The solution does not have to be based on curve_fit.
Assuming you know the frequencies beforehand, the problem is simple. For each phase offset, you can set the lower bound to 0 and the upper bound to 2 * pi * freq. For the amplitudes, set any number (or np.inf if you want no bound).
You can write the function in the form lambda x, amp1, phase1, amp2, phase2, ...: y; curve_fit accepts a function with a variable number of arguments as long as you supply an initial guess of the right length.
Sample code for five frequencies:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,10,60)
w = [1,2,3,4,5]
a = [1,4,2,3,0.1]
x0 = [0,1,0,1,0.5]
y = sum(a_i * np.sin(w_i * x - x0_i) for w_i, a_i, x0_i in zip(w, a, x0))  # base data (built-in sum; np.sum on a generator is deprecated)
yr = y + np.random.normal(0,0.5, size=x.size) #noisy data
def func(x, *args):
    """Function of the form lambda x, amp1, phase1, amp2, phase2, ..."""
    return sum(a_i * np.sin(w_i * (x - x0_i))
               for w_i, a_i, x0_i in zip(w, args[::2], args[1::2]))
ubounds = np.zeros(len(w) * 2)
ubounds[::2] = 10 #setting amp max value to 10 (arbitrary)
ubounds[1::2] = np.asarray(w) * 2 * np.pi
p0 = [0] * 10 # note p0 size
popt, pcov = curve_fit(func, x, yr, p0, bounds=(0, ubounds))
amps, phases = popt[::2], popt[1::2]
plt.plot(x,func(x, *popt))
plt.plot(x,yr, 'go')
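If the frequencies are only known approximately (for example from the DFT, as in the question) and should be fitted too, the same *args idea can be extended so that the bounds are built in a loop. The following is only a sketch under that assumption: sum_of_sines, fit_sines, the parameter layout and the freq_tol window are hypothetical names and choices, not part of the answer above.

```python
import numpy as np
from scipy.optimize import curve_fit

def sum_of_sines(x, *params):
    """params = [amp1, freq1, phase1, ..., ampN, freqN, phaseN, y0]."""
    y0 = params[-1]
    amps, freqs, phases = params[0:-1:3], params[1:-1:3], params[2:-1:3]
    return y0 + sum(a * np.sin(f * x + p) for a, f, p in zip(amps, freqs, phases))

def fit_sines(x, y, freq_guesses, freq_tol=0.02):
    """Fit N sines, constraining each frequency to guess +/- freq_tol."""
    lower, upper, p0 = [], [], []
    for f in freq_guesses:
        lower += [-np.inf, f - freq_tol, -np.inf]   # amp, freq, phase
        upper += [np.inf, f + freq_tol, np.inf]
        p0 += [1.0, f, 0.0]                         # arbitrary starting values
    lower.append(-np.inf)                           # y0 is unbounded
    upper.append(np.inf)
    p0.append(0.0)
    popt, _ = curve_fit(sum_of_sines, x, y, p0=p0, bounds=(lower, upper))
    return popt

# Synthetic, noise-free data using the two frequencies from the question.
t = np.linspace(0, 50, 400)
data = sum_of_sines(t, 2.0, 1.152, 0.3, 1.0, 0.432, 1.0, 0.5)
popt = fit_sines(t, data, [1.15, 0.435], freq_tol=0.01)
print(popt)
```

Because the length of p0 determines how many parameters curve_fit passes in, the same two functions work for five or any other number of sines.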
I'm trying to find a way to integrate the density of a multivariate normal distribution.
I have a dataset of 100 points (x, y) and the covariance matrix (sigma) of these data.
My idea is to integrate the density for each value of the covariance matrix (from x[i] to x[j]) and then sum all the integrated values. Is this correct?
def gaussian(x, mu, sig):
    return np.exp(-(x - mu)**2 / (2 * sig**2))

I = np.zeros(len(sigma), dtype=float)
for i in range(0, len(sigma)):
    I[i] = quad(gaussian, x1[i], x1[i+1], args=(0, sigma[i]))[0]
sum(I)
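For comparison, one way to integrate a bivariate normal density over a rectangle is to work with the joint CDF rather than with separate one-dimensional integrals, since a sum of 1-D integrals does not account for the correlation between x and y. The sketch below assumes a two-dimensional problem and an axis-aligned box; box_probability and the example mean and covariance are made up for illustration. It uses inclusion-exclusion on scipy.stats.multivariate_normal.cdf.

```python
import numpy as np
from scipy.stats import multivariate_normal

def box_probability(lower, upper, mean, cov):
    """P(a1 < X <= b1, a2 < Y <= b2) for a bivariate normal, via the joint CDF."""
    (a1, a2), (b1, b2) = lower, upper

    def F(x, y):
        return multivariate_normal.cdf([x, y], mean=mean, cov=cov)

    # Inclusion-exclusion over the four corners of the rectangle.
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

mean = [0.0, 0.0]
cov = [[1.0, 0.5], [0.5, 2.0]]   # hypothetical covariance matrix
print(box_probability((-1, -1), (1, 1), mean, cov))
```

With real data, the mean and covariance could be estimated first, for example with np.mean and np.cov.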
In scipy the negative binomial distribution is defined as:
nbinom.pmf(k) = choose(k+n-1, n-1) * p**n * (1-p)**k
This is the common definition, see also wikipedia:
https://en.wikipedia.org/wiki/Negative_binomial_distribution
However, there exists a different parametrization where the negative Binomial is defined by the mean mu and the dispersion parameter.
In R this is easy, as the negbin can be defined by both parametrizations:
dnbinom(x, size, prob, mu, log = FALSE)
How can I use the mean/dispersion parametrization in scipy?
edit:
straight from the R help:
The negative binomial distribution with size = n and prob = p has density
Γ(x+n)/(Γ(n) x!) p^n (1-p)^x
An alternative parametrization (often used in ecology) is by the mean mu (see above), and size, the dispersion parameter, where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization.
It is also described here in more detail:
https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
from scipy.stats import nbinom

def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p

def pmf(counts, mu, theta):
    """
    >>> import numpy as np
    >>> from scipy.stats import poisson
    >>> np.isclose(pmf(10, 10, 10000), poisson.pmf(10, 10), atol=1e-3)
    True
    """
    return nbinom.pmf(counts, *convert_params(mu, theta))

def logpmf(counts, mu, theta):
    return nbinom.logpmf(counts, *convert_params(mu, theta))

def cdf(counts, mu, theta):
    return nbinom.cdf(counts, *convert_params(mu, theta))

def sf(counts, mu, theta):
    return nbinom.sf(counts, *convert_params(mu, theta))
The Wikipedia page you linked gives a precise formula for p and r in terms of mu and sigma; see the very last bullet item in the Alternative parametrization section: https://en.m.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
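As a sketch of that mean/variance conversion: for scipy's nbinom(n, p) the mean is n*(1-p)/p and the variance is n*(1-p)/p**2, which can be solved for p = mu/var and n = mu**2/(var - mu). The function name params_from_mean_var is hypothetical, and the conversion only makes sense when the variance exceeds the mean.

```python
from scipy.stats import nbinom

def params_from_mean_var(mu, var):
    """Return scipy's (n, p) for a negative binomial with the given mean and variance."""
    if var <= mu:
        raise ValueError("the negative binomial requires var > mu")
    p = mu / var
    n = mu ** 2 / (var - mu)
    return n, p

n, p = params_from_mean_var(10.0, 30.0)
mean, variance = nbinom.stats(n, p, moments="mv")
print(n, p, mean, variance)  # the mean and variance should be recovered
```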
I have a Gaussian function that I have fitted to data from a file. I now need to integrate it to get the area under the curve.
This is my Gaussian function:
def I(theta,max_x,max_y,sigma):
    return (max_y/(sigma*(math.sqrt(2*pi))))*np.exp(-((theta-max_x)**2)/(2*sigma**2))
Comparing with the general formula:
N(x | mu, sigma, n) := (n/(sigma*sqrt(2*pi))) * exp((-(x-mu)^2)/(2*sigma^2))
i.e. n = max_y, mu = max_x, x = theta
this is what is given on another page:
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral(N(x | mu, sigma, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution given parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2)))).
I am unsure how this helps. I just want to integrate my function over the plotted values under the curve. Is it saying that this is the integral?
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2))))
The answer you have there is the antiderivative (the integral from minus infinity up to z). If you would like a numerical answer between two x limits, you can evaluate that function at the two points and take the difference.
Your Gaussian function is defined over all real numbers (−∞, +∞), but in practice you are only interested in the middle part, as the tails are very close to 0. To obtain a numerical estimate of the total area you can do as you say: evaluate Phi at two points, one on each side of the Gaussian's peak, where the Gaussian is suitably close to 0, and take the difference.
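If Phi(z | mu, sigma, n) is written as a Python function of z, you could do:
from math import erf, sqrt
def Phi(z, mu, sigma, n):
    return (n / 2) * (1 + erf((z - mu) / (sigma * sqrt(2))))
area = Phi(X_HIGH, mu, sigma, n) - Phi(X_LOW, mu, sigma, n)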
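To see that the erf-based expression and a direct numerical integration agree, the two can be compared on made-up fit parameters. The values of max_x, max_y and sigma below are hypothetical, the ±5 sigma limits are an arbitrary choice, and scipy.integrate.quad is used only as an independent check.

```python
import math
from scipy.integrate import quad

def gaussian(theta, max_x, max_y, sigma):
    # Same form as the fitted function in the question.
    coeff = max_y / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((theta - max_x) ** 2) / (2 * sigma ** 2))

def gaussian_area(lo, hi, max_x, max_y, sigma):
    # Difference of the antiderivative Phi evaluated at the two limits.
    def Phi(z):
        return (max_y / 2) * (1 + math.erf((z - max_x) / (sigma * math.sqrt(2))))
    return Phi(hi) - Phi(lo)

max_x, max_y, sigma = 1.0, 3.0, 0.5            # hypothetical fit results
lo, hi = max_x - 5 * sigma, max_x + 5 * sigma   # limits far out in the tails
closed_form = gaussian_area(lo, hi, max_x, max_y, sigma)
numeric, _ = quad(gaussian, lo, hi, args=(max_x, max_y, sigma))
print(closed_form, numeric)
```

Since Phi tends to n as z grows, the total area under this curve is simply max_y, so both numbers should come out very close to max_y when the limits reach far into the tails.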