ValueError when fitting subclassed scipy distribution

I'm trying to subclass rv_continuous from scipy.stats in order to implement the three-parameter log-normal distribution. So far I've only re-implemented the _pdf method, just to see if I can get a minimal example running:
from scipy import stats
from math import sqrt, pi, exp, log, e

class LogNormal3P(stats.rv_continuous):
    def _pdf(self, x, alpha, m, sigma):
        return 1 / (sigma * (x - alpha) * sqrt(2 * pi)) * exp(-(log(x - alpha, e) - m)**2 / (2 * sigma**2))

lognorm3p = LogNormal3P(name='lognorm3p')

if __name__ == "__main__":
    import numpy as np
    data = np.array([16.66, 28.0, 14.3, 15.99, 16.26, 22.69, 19.1, 14.82, 13.91, 11.1])
    ln3p = lognorm3p.fit(data)
    print('Done')
When running this code I get the error ValueError: math domain error. This happens because, on the first iteration of the fitting process, a point is reached where x = -11 and all of the other parameters are equal to 1, which passes a negative value to log and raises the error.
What is the typical workaround for this? None of the data values are negative, so I'm assuming this happens during a normalization step. Am I missing any methods of rv_continuous that should be re-implemented before trying to fit the data? The documentation mentions that at a minimum either _pdf or _cdf should be re-implemented.
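A common workaround (a sketch, not an official scipy recipe) is to write _pdf with numpy operations and explicitly return 0 outside the support x > alpha, so that invalid trial parameters encountered during fit produce a zero density instead of a math domain error:
import numpy as np
from scipy import stats

class LogNormal3P(stats.rv_continuous):
    def _pdf(self, x, alpha, m, sigma):
        shifted = np.asarray(x, dtype=float) - alpha
        # Evaluate log only where x > alpha; elsewhere substitute a dummy
        # value and zero out the density, so no domain error is raised.
        safe = np.where(shifted > 0, shifted, 1.0)
        pdf = 1 / (sigma * safe * np.sqrt(2 * np.pi)) * np.exp(-(np.log(safe) - m)**2 / (2 * sigma**2))
        return np.where(shifted > 0, pdf, 0.0)
Note that fit may still need sensible starting guesses for the shape parameters, passed positionally after the data.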


Why does scipy.integrate.quad fail for this integral? [duplicate]

I am trying to compute the Kullback Leibler divergence between two probability distributions. To do this I need to perform this integral.
Here is my simplified code, which currently fails:
import scipy.stats
from scipy.integrate import quad
import numpy as np

def f(x):
    return sum([ps[idx]*lambdas[idx]*np.exp(- lambdas[idx] * x) for idx in range(len(ps))])

def g(x):
    return scipy.stats.weibull_min.pdf(x, c=c)

c = 0.9
ps = [1]
lambdas = [1]
eps = 0.001  # weibull_min is only defined for x > 0
print(quad(lambda x: f(x) * np.log(f(x) / g(x)), eps, np.inf))  # Output should be greater than 0
This gives:
(nan, nan)
/home/user/.local/lib/python3.5/site-packages/ipykernel_launcher.py:11: RuntimeWarning: divide by zero encountered in log
# This is added back by InteractiveShellApp.init_path()
/home/user/.local/lib/python3.5/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in double_scalars
# This is added back by InteractiveShellApp.init_path()
/home/user/.local/lib/python3.5/site-packages/ipykernel_launcher.py:11: IntegrationWarning: The occurrence of roundoff error is detected, which prevents
the requested tolerance from being achieved. The error may be
underestimated.
# This is added back by InteractiveShellApp.init_path()
Why doesn't it work and how can I get it to work?
The problem is that f(x)/g(x) tends towards zero, which causes numerical errors in the log. Since the whole integrand tends towards zero quite fast, you can simply integrate over a finite range (say [0.001, 30]) and still get a precise estimate of the integral:
from scipy.stats import weibull_min
from scipy.integrate import quad
import numpy as np

c = 0.9
ps = [1]
lambdas = [1]

def f(x):
    return sum([ps[idx]*lambdas[idx]*np.exp(- lambdas[idx] * x) for idx in range(len(ps))])

def g(x):
    return weibull_min.pdf(x, c=c)

print(quad(lambda x: f(x) * np.log(f(x) / g(x)), 0.001, 30))
I did not do a numerical analysis of the precision, but comparing against the result from Mathematica, it agrees to the 9th decimal place. Here is the test code in Mathematica (simplified for your parameters):
f[x_] := Exp[-x];
c = 0.9;
g[x_] := c*x^(c - 1)*Exp[-x^c];
SetPrecision[Integrate[f[x]*Log[f[x]/g[x]], {x, 0.001, \[Infinity]}],20]
Mathematica result: 0.010089328699390866240
Scipy result: 0.01008932870010536

Improving accuracy in scipy.optimize.fsolve with equations involving integration

I'm trying to solve an integral equation using the following code (irrelevant parts removed):
def _pdf(self, a, b, c, t):
    pdf = some_pdf(a, b, c, t)
    return pdf

def _result(self, a, b, c, flag):
    return fsolve(lambda t: flag - 1 + quad(lambda tau: self._pdf(a, b, c, tau), 0, t)[0], x0)[0]
This takes a probability density function and finds a result tau such that the integral of the pdf from tau to infinity is equal to flag. Note that x0 is a (float) estimate of the root defined elsewhere in the script. Also note that flag is an extremely small number, on the order of 1e-9.
In my application fsolve only successfully finds a root about 50% of the time. It often just returns x0, significantly biasing my results. There is no closed form for the integral of the pdf, so I am forced to integrate numerically, and I suspect this might be introducing some inaccuracy.
EDIT:
This has since been solved using a method other than that described below, but I'd like to get quadpy to work and see if the results improve at all. The specific code I'm trying to get to work is as follows:
import quadpy
import numpy as np
from scipy.optimize import *
from scipy.special import gammaln, kv, gammaincinv, gamma
from scipy.integrate import quad, simps

l = 226.02453163
mu = 0.00212571582056
nu = 4.86569872444
flag = 2.5e-09
estimate = 3 * mu

def pdf(l, mu, nu, t):
    return np.exp(np.log(2) + (l + nu - 1 + 1) / 2 * np.log(l * nu / mu)
                  + (l + nu - 1 - 1) / 2 * np.log(t)
                  + np.log(kv(nu - l, 2 * np.sqrt(l * nu / mu * t)))
                  - gammaln(l) - gammaln(nu))

def tail_cdf(l, mu, nu, tau):
    i, error = quadpy.line_segment.adaptive_integrate(
        lambda t: pdf(l, mu, nu, t), [tau, 10000], 1.0e-10
    )
    return i

result = fsolve(lambda tau: flag - tail_cdf(l, mu, nu, tau[0]), estimate)
When I run this I get an assertion error from assert all(lengths > minimum_interval_length). I'm not quite sure of how to remedy this; any help would be very much appreciated!
As an example, I tried integrating 1 / x between 1 and alpha to recover the target integral value of 2.0. This
import quadpy
from scipy.optimize import fsolve

def f(alpha):
    beta, _ = quadpy.quad(lambda x: 1.0/x, 1, alpha)
    return beta

target = 2.0
res = fsolve(lambda alpha: target - f(alpha), x0=2.0)
print(res)
correctly returns 7.38905611.
The failing quadpy assertion
assert all(lengths > minimum_interval_length)
means that the adaptive integration hit its limit: either relax your tolerance a bit, or decrease the minimum_interval_length (see here).
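For instance, reusing the call from the question with the tolerance relaxed by a couple of orders of magnitude (a sketch; the quadpy API shown is the one used in the question and has changed across releases):
i, error = quadpy.line_segment.adaptive_integrate(
    lambda t: pdf(l, mu, nu, t), [tau, 10000], 1.0e-8  # was 1.0e-10
)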

Python lognorm.cdf vs. formula based implementation not matching

Okay, I am converting the scipy.stats.lognorm.cdf function over to a Cython function, using the formula from http://www.cs.unitn.it/~taufer/SR/P-LN.pdf: 1/2 + 1/2*erf((ln(x) - mu)/(sigma*sqrt(2))). The results don't match, despite many other references to the same formula online. EDIT: just fixed it; I only had to apply np.log(mu) twice. Fixed code:
import numpy as np
from scipy.stats import lognorm
from scipy.special import erf

def lognormcdf(x, mu, sigma):
    return 0.5 + 0.5*erf((np.log(x)-np.log(mu))/(np.sqrt(2.0)*sigma))

mu = 3.85
sigma = 0.346
x = [-9.997137267734412802e-01,-9.984919506395958377e-01,-9.962951347331251428e-01,-9.931249370374434227e-01,-9.889843952429917540e-01,-9.838775407060570410e-01,-9.778093584869183008e-01,-9.707857757637063933e-01,-9.628136542558155542e-01,-9.539007829254917414e-01,-9.440558701362560257e-01,-9.332885350430795146e-01,-9.216092981453339883e-01,-9.090295709825296777e-01,-8.955616449707269888e-01,-8.812186793850184108e-01,-8.660146884971646752e-01,-8.499645278795913139e-01,-8.330838798884008245e-01,-8.153892383391762033e-01,-7.968978923903144995e-01,-7.776279096494954635e-01,-7.575981185197071532e-01,-7.368280898020207470e-01,-7.153381175730564312e-01,-6.931491993558019926e-01,-6.702830156031409636e-01,-6.467619085141292912e-01,-6.226088602037077591e-01,-5.978474702471787694e-01,-5.725019326213811599e-01,-5.465970120650941455e-01,-5.201580198817630230e-01,-4.932107892081909473e-01,-4.657816497733580086e-01,-4.378974021720314913e-01,-4.095852916783015440e-01,-3.808729816246299582e-01,-3.517885263724216949e-01,-3.223603439005291449e-01,-2.926171880384719759e-01,-2.625881203715034751e-01,-2.323024818449739570e-01,-2.017898640957360157e-01,-1.710800805386032686e-01,-1.402031372361139672e-01,-1.091892035800611088e-01,-7.806858281343663497e-02,-4.687168242159163445e-02,-1.562898442154308370e-02,1.562898442154308370e-02,4.687168242159163445e-02,7.806858281343663497e-02,1.091892035800611088e-01,1.402031372361139672e-01,1.710800805386032686e-01,2.017898640957360157e-01,2.323024818449739570e-01,2.625881203715034751e-01,2.926171880384719759e-01,3.223603439005291449e-01,3.517885263724216949e-01,3.808729816246299582e-01,4.095852916783015440e-01,4.378974021720314913e-01,4.657816497733580086e-01,4.932107892081909473e-01,5.201580198817630230e-01,5.465970120650941455e-01,5.725019326213811599e-01,5.978474702471787694e-01,6.226088602037077591e-01,6.467619085141292912e-01,6.702830156031409636e-01,6.931491993558019926e-01,7.153381175730564312e-01,7.368280898020207470e-01,7.575981185197071532e-01,7.776279096494954635e-01,7.968978923903144995e-01,8.153892383391762033e-01,8.330838798884008245e-01,8.499645278795913139e-01,8.660146884971646752e-01,8.812186793850184108e-01,8.955616449707269888e-01,9.090295709825296777e-01,9.216092981453339883e-01,9.332885350430795146e-01,9.440558701362560257e-01,9.539007829254917414e-01,9.628136542558155542e-01,9.707857757637063933e-01,9.778093584869183008e-01,9.838775407060570410e-01,9.889843952429917540e-01,9.931249370374434227e-01,9.962951347331251428e-01,9.984919506395958377e-01,9.997137267734412802e-01]
mycdf = lognormcdf(x, np.log(mu), sigma)
scipycdf = lognorm.cdf(x, scale=np.log(mu), s=sigma)
# This line comparing the scipy function and mine displays the result below
np.sum(np.nan_to_num(mycdf) - scipycdf)
Results:
1.2011928779531548e-15
The original post was edited to reflect the correct formula.
def lognormcdf(x, mu, sigma):
    return 0.5 + 0.5*erf((np.log(x)-np.log(mu))/(np.sqrt(2.0)*sigma))
Pass np.log(mu) in for mu and it works.
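For reference, scipy parametrizes lognorm with s = sigma and scale = exp(mu), where mu is the log-space mean. A self-contained check of the textbook formula against scipy (a sketch with arbitrary example values, using mu directly rather than the question's double-log convention):
import numpy as np
from scipy.stats import lognorm
from scipy.special import erf

def textbook_lognormcdf(x, mu, sigma):
    # Textbook log-normal CDF: Phi((ln x - mu) / sigma)
    return 0.5 + 0.5*erf((np.log(x) - mu)/(np.sqrt(2.0)*sigma))

mu, sigma = 1.2, 0.35  # hypothetical example parameters
x = np.linspace(0.1, 10.0, 50)
print(np.allclose(textbook_lognormcdf(x, mu, sigma),
                  lognorm.cdf(x, s=sigma, scale=np.exp(mu))))  # True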

Fitting a variable Sinc function in python

I would like to fit a sinc function to a bunch of data lines.
Using a Gaussian, the fit itself does work, but the data does not seem to be sufficiently Gaussian, so I figured I could just switch to sinc.
I tried to put together a short piece of self-running code, but realized that I probably do not fully understand how arrays are handled when handed over to a function, which could be part of the reason why I get error messages calling my program.
So my code currently looks as follows:
from numpy import exp
from scipy.optimize import curve_fit
from math import sin, pi

def gauss(x, *p):
    print(p)
    A, mu, sigma = p
    return A*exp(-1*(x[:]-mu)*(x[:]-mu)/sigma/sigma)

def sincSquare_mod(x, *p):
    A, mu, sigma = p
    return A * (sin(pi*(x[:]-mu)*sigma) / (pi*(x[:]-mu)*sigma))**2

p0 = [1., 30., 5.]
xpos = range(100)
fitdata = gauss(xpos, p0)
p1, var_matrix = curve_fit(sincSquare_mod, xpos, fitdata, p0)
What I get is:
Traceback (most recent call last):
  File "orthogonal_fit_test.py", line 18, in <module>
    fitdata = gauss(xpos,p0)
  File "orthogonal_fit_test.py", line 7, in gauss
    A, mu, sigma = p
ValueError: need more than 1 value to unpack
From my understanding p is not handed over correctly, which is odd, because it is in my actual code. I then get a similar message from the sincSquare function, when fitted, which could probably be the same type of error. I am fairly new to the star operator, so there might be a glitch hidden...
Anybody some ideas? :)
Thanks!
You need to make three changes:
def gauss(x, A, mu, sigma):
    return A*exp(-1*(x[:]-mu)*(x[:]-mu)/sigma/sigma)

def sincSquare_mod(x, A, mu, sigma):
    x = np.array(x)
    return A * (np.sin(pi*(x[:]-mu)*sigma) / (pi*(x[:]-mu)*sigma))**2

fitdata = gauss(xpos, *p0)
1. Unpack the parameters in the function signature instead of using *p (see the curve_fit documentation).
2. Replace sin with the numpy version for array broadcasting (this also requires import numpy as np).
3. Straightforward, right? :P
Note: I think you are looking for p1, var_matrix = curve_fit(gauss, ...) rather than the call in the OP, which does not appear to have a solution.
Also worth noting: you will get rounding errors as x*pi gets close to zero, and these can be magnified. You can approximate sin(x)/x for small x to get better results, as demonstrated below (VB.NET, sorry):
Private Function sinc(x As Double) As Double
    x = (x * Math.PI)
    'The Taylor series expansion of Sin(x)/x is used to limit rounding errors for small values of x
    If x < 0.01 And x > -0.01 Then
        Return 1.0 - x ^ 2 / 6.0 + x ^ 4 / 120.0
    End If
    Return Math.Sin(x) / x
End Function
http://www.wolframalpha.com/input/?i=taylor+series+sin+%28x%29+%2F+x&dataset=&equal=Submit
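For reference, a direct Python translation of the VB.NET helper above (a sketch):
import math

def sinc(x):
    x = x * math.pi
    # The Taylor series of sin(x)/x limits rounding errors for small x
    if -0.01 < x < 0.01:
        return 1.0 - x**2 / 6.0 + x**4 / 120.0
    return math.sin(x) / x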

How to calculate cumulative normal distribution?

I am looking for a function in Numpy or Scipy (or any rigorous Python library) that will give me the cumulative normal distribution function in Python.
Here's an example:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
In other words, approximately 95% of the standard normal distribution lies within 1.96 standard deviations of its mean of zero.
If you need the inverse CDF:
>>> norm.ppf(norm.cdf(1.96))
array(1.9599999999999991)
It may be too late to answer the question, but since Google still leads people here, I decided to write my solution up.
Since Python 2.7, the math library has included the error function math.erf(x).
The erf() function can be used to compute traditional statistical functions such as the cumulative standard normal distribution:
from math import erf, sqrt

def phi(x):
    """Cumulative distribution function for the standard normal distribution."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0
Ref:
https://docs.python.org/2/library/math.html
https://docs.python.org/3/library/math.html
How are the Error Function and Standard Normal distribution function related?
Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the cumulative distribution function (cdf - probability that a random sample X will be less than or equal to x) for a given mean (mu) and standard deviation (sigma):
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
Which can be simplified for the standard normal distribution (mu = 0 and sigma = 1):
NormalDist().cdf(1.96)
# 0.9750021048517796
NormalDist().cdf(-1.96)
# 0.024997895148220428
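NormalDist also exposes the inverse CDF via inv_cdf, mirroring the norm.ppf example earlier:
from statistics import NormalDist
NormalDist().inv_cdf(0.975)
# 1.959963984540054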
Adapted from here http://mail.python.org/pipermail/python-list/2000-June/039873.html
from math import *

def erfcc(x):
    """Complementary error function."""
    z = abs(x)
    t = 1. / (1. + 0.5*z)
    r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
            t*(.09678418+t*(-.18628806+t*(.27886807+
            t*(-1.13520398+t*(1.48851587+t*(-.82215223+
            t*.17087277)))))))))
    if x >= 0.:
        return r
    else:
        return 2. - r

def ncdf(x):
    return 1. - 0.5*erfcc(x/(2**0.5))
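A quick sanity check against the scipy values quoted earlier (this rational approximation to erfc is accurate to roughly 1e-7, so the digits below are approximate):
print(ncdf(1.96))   # approximately 0.9750021
print(ncdf(-1.96))  # approximately 0.0249979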
To build upon Unknown's example, the Python equivalent of the function normdist() implemented in a lot of libraries would be:
def normcdf(x, mu, sigma):
    t = x - mu
    y = 0.5 * erfcc(-t / (sigma * sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x - mu) / abs(sigma)
    y = (1 / (sqrt(2 * pi) * abs(sigma))) * exp(-u * u / 2)
    return y

def normdist(x, mu, sigma, f):
    if f:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y
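For example, a short usage sketch (these functions rely on erfcc from the previous answer being defined):
print(normdist(1.96, 0.0, 1.0, True))   # CDF, approximately 0.975
print(normdist(1.96, 0.0, 1.0, False))  # PDF, approximately 0.0584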
Alex's answer shows a solution for the standard normal distribution (mean = 0, standard deviation = 1). If you have a normal distribution with mean m and standard deviation s (which is sqrt(var)) and you want to calculate:
from scipy.stats import norm

# cdf(x < val)
print(norm.cdf(val, m, s))
# cdf(x > val)
print(1 - norm.cdf(val, m, s))
# cdf(v1 < x < v2)
print(norm.cdf(v2, m, s) - norm.cdf(v1, m, s))
Read more about cdf here and scipy implementation of normal distribution with many formulas here.
Taken from above:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
For a two-tailed test:
import numpy as np
z = 1.96
p_value = 2 * norm.cdf(-np.abs(z))
# 0.04999579029644087
Simple like this:
import math

def my_cdf(x):
    return 0.5*(1 + math.erf(x/math.sqrt(2)))

I found the formula on this page: https://www.danielsoper.com/statcalc/formulas.aspx?id=55
