I am looking for a function in Numpy or Scipy (or any rigorous Python library) that will give me the cumulative normal distribution function in Python.
Here's an example:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
In other words, approximately 95% of the standard normal distribution lies within 1.96 standard deviations (roughly two) of the mean of zero.
If you need the inverse CDF:
>>> norm.ppf(norm.cdf(1.96))
array(1.9599999999999991)
It may be too late to answer the question, but since Google still leads people here, I decided to write my solution here.
That is, since Python 2.7, the math library has integrated the error function math.erf(x)
The erf() function can be used to compute traditional statistical functions such as the cumulative standard normal distribution:
from math import erf, sqrt
def phi(x):
    """Cumulative distribution function for the standard normal distribution."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0
Ref:
https://docs.python.org/2/library/math.html
https://docs.python.org/3/library/math.html
How are the Error Function and Standard Normal distribution function related?
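One caveat with the erf-based formula: for very negative x, 1 + erf(...) cancels to exactly zero in floating point. A minimal sketch of a variant built on math.erfc, which keeps the small lower-tail probabilities; the names phi_erf and phi_erfc are illustrative only:

```python
from math import erf, erfc, sqrt

def phi_erf(x):
    """Standard normal CDF via erf, as in the answer above."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0

def phi_erfc(x):
    """Same CDF via erfc; avoids the 1 + erf(...) cancellation for very negative x."""
    return 0.5 * erfc(-x / sqrt(2.0))

# Both agree in the central region.
print(phi_erf(1.96), phi_erfc(1.96))
# Far in the lower tail the erf version underflows to 0.0, the erfc version does not.
print(phi_erf(-10.0), phi_erfc(-10.0))
```

For typical z-scores the two are interchangeable; the difference only matters when tiny tail probabilities are needed.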
Starting with Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the cumulative distribution function (cdf - probability that a random sample X will be less than or equal to x) for a given mean (mu) and standard deviation (sigma):
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
Which can be simplified for the standard normal distribution (mu = 0 and sigma = 1):
NormalDist().cdf(1.96)
# 0.9750021048517796
NormalDist().cdf(-1.96)
# 0.024997895148220428
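NormalDist also offers inv_cdf, the standard-library counterpart of the norm.ppf call shown in the SciPy answer. A short sketch of the round trip (the variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()          # mu=0, sigma=1 by default
p = std_normal.cdf(1.96)           # probability that X <= 1.96
z = std_normal.inv_cdf(p)          # inverse CDF (quantile function)
print(p, z)
```

Up to floating-point rounding, z recovers the original 1.96.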
Adapted from here http://mail.python.org/pipermail/python-list/2000-June/039873.html
from math import *
def erfcc(x):
    """Complementary error function."""
    z = abs(x)
    t = 1. / (1. + 0.5*z)
    r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
        t*(.09678418+t*(-.18628806+t*(.27886807+
        t*(-1.13520398+t*(1.48851587+t*(-.82215223+
        t*.17087277)))))))))
    if (x >= 0.):
        return r
    else:
        return 2. - r
def ncdf(x):
    return 1. - 0.5*erfcc(x/(2**0.5))
To build upon Unknown's example, the Python equivalent of the function normdist() implemented in a lot of libraries would be:
def normcdf(x, mu, sigma):
    t = x - mu
    y = 0.5*erfcc(-t/(sigma*sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y
def normpdf(x, mu, sigma):
    u = (x-mu)/abs(sigma)
    y = (1/(sqrt(2*pi)*abs(sigma)))*exp(-u*u/2)
    return y
def normdist(x, mu, sigma, f):
    # f truthy: cumulative distribution; f falsy: probability density
    if f:
        y = normcdf(x,mu,sigma)
    else:
        y = normpdf(x,mu,sigma)
    return y
Alex's answer shows a solution for the standard normal distribution (mean = 0, standard deviation = 1). If you have a normal distribution with mean m and standard deviation s (which is sqrt(var)), you can calculate the following:
from scipy.stats import norm
# P(X < val)
print(norm.cdf(val, m, s))
# P(X > val)
print(1 - norm.cdf(val, m, s))
# P(v1 < X < v2)
print(norm.cdf(v2, m, s) - norm.cdf(v1, m, s))
Read more about cdf here and scipy implementation of normal distribution with many formulas here.
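The same three quantities can also be computed without SciPy. A minimal sketch using statistics.NormalDist (Python 3.8+); the values of m, s, v1 and v2 are hypothetical and chosen only for illustration:

```python
from statistics import NormalDist

# Hypothetical example values, not taken from the answer above.
m, s = 100.0, 15.0
v1, v2 = 85.0, 130.0

dist = NormalDist(mu=m, sigma=s)
p_below = dist.cdf(v2)                    # P(X < v2)
p_above = 1.0 - dist.cdf(v2)              # P(X > v2)
p_between = dist.cdf(v2) - dist.cdf(v1)   # P(v1 < X < v2)
print(p_below, p_above, p_between)
```

Because v1 and v2 sit one standard deviation below and two above the mean, p_between equals the standard normal probability of the interval (-1, 2).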
Taken from above:
from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
For a two-tailed test:
import numpy as np
z = 1.96
p_value = 2 * norm.cdf(-np.abs(z))
# 0.04999579029644087
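The two-tailed calculation can be wrapped in a small helper. This is a sketch using only the standard library (Python 3.8+); the function name two_sided_p is an assumption made for illustration:

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    # By symmetry, both tails together carry twice the lower-tail mass at -|z|.
    return 2.0 * NormalDist().cdf(-abs(z))

print(two_sided_p(1.96))
print(two_sided_p(-1.96))   # the sign of z does not matter
```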
Simple like this:
import math
def my_cdf(x):
    return 0.5*(1+math.erf(x/math.sqrt(2)))
I found the formula in this page https://www.danielsoper.com/statcalc/formulas.aspx?id=55
Related
I am trying to fit my data to a Negative Binomial Distribution with the package scipy in Python. However, my validation seems to fail.
These are my steps:
I have some demand data which is described by the statistics:
mu = 1.4
std = 1.59
print(mu, std)
I use the parameterization function below, taken from this post to compute the two NB parameters.
def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p
I pass (hopefully correctly...) my two statistics; the naming convention between different sources (p, r, k) is rather confusing at this point:
firstParam, secondParam = convert_params(mu, std)
I would then use these two parameters to fit the distribution:
from scipy.stats import nbinom
rv = nbinom(firstParam, secondParam)
Then I calculate a value R with the Percent Point Function .ppf(0.95). The value R in the context of my problem is a Reorder Point.
R = rv.ppf(0.95)
Now is when I expect to validate the previous steps, but I do not manage to retrieve my original statistics mu and std with mean and math.sqrt(var) respectively.
import math
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print(mean, math.sqrt(var))
What am I missing? Any feedback about the parameterization implemented in Scipy?
I believe the conversion code is wrong: SciPy is NOT using the Wikipedia convention, but the Mathematica (MathWorld) convention.
#%%
import numpy as np
from scipy.stats import nbinom
def convert_params(mean, std):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
    See https://mathworld.wolfram.com/NegativeBinomialDistribution.html
    """
    p = mean/std**2
    n = mean*p/(1.0 - p)
    return n, p
mean = 1.4
std = 1.59
n, p = convert_params(mean, std)
print((n, p))
#%%
m, v = nbinom.stats(n, p, moments='mv')
print(m, np.sqrt(v))
The code prints back the pair 1.4, 1.59.
The reorder point, computed as
rv = nbinom(n, p)
print("reorder point:", rv.ppf(0.95))
outputs 5
It looks like you are using a different conversion. The last bullet at the cited wikipedia section gives the formulas shown below. With these formulas you get back the exact same mu and std:
import numpy as np
from scipy.stats import nbinom
def convert_mu_std_to_r_p(mu, std):
    r = mu ** 2 / (std ** 2 - mu)
    p = 1 - mu / std ** 2
    return r, 1 - p
mu = 1.4
std = 1.59
print("mu, std:", mu, std)
firstParam, secondParam = convert_mu_std_to_r_p(mu, std)
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print("mean, sqrt(var):", mean, np.sqrt(var))
rv = nbinom(firstParam, secondParam)
print("reorder point:", rv.ppf(0.95))
Output:
mu, std: 1.4 1.59
mean, sqrt(var): 1.4 1.59
reorder point: 5.0
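Both answers rely on the moments of SciPy's (n, p) parameterization: mean = n(1-p)/p and variance = n(1-p)/p^2. A minimal standard-library sketch of that round trip, including a direct pmf summation for the quantile; the function names are hypothetical and this is not SciPy's implementation:

```python
from math import exp, lgamma, log

def mean_std_to_n_p(mean, std):
    """Method-of-moments conversion to the (n, p) pair used by scipy.stats.nbinom."""
    var = std ** 2
    if var <= mean:
        raise ValueError("a negative binomial needs variance > mean")
    p = mean / var
    n = mean * p / (1.0 - p)
    return n, p

def nbinom_moments(n, p):
    """Mean and variance implied by (n, p)."""
    return n * (1.0 - p) / p, n * (1.0 - p) / p ** 2

def nbinom_quantile(n, p, q):
    """Smallest k with P(X <= k) >= q, found by summing the pmf directly."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    k, cdf = 0, 0.0
    while True:
        # log pmf(k) = log Gamma(k+n) - log k! - log Gamma(n) + n log p + k log(1-p)
        log_pmf = (lgamma(k + n) - lgamma(k + 1) - lgamma(n)
                   + n * log(p) + k * log(1.0 - p))
        cdf += exp(log_pmf)
        if cdf >= q:
            return k
        k += 1

n, p = mean_std_to_n_p(1.4, 1.59)
mean, var = nbinom_moments(n, p)
print(mean, var ** 0.5)
print("reorder point:", nbinom_quantile(n, p, 0.95))
```

The explicit variance check makes the overdispersion requirement visible: when std**2 <= mean, no valid (n, p) pair exists.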
I have the following code below that prints the PDF graph for a particular mean and standard deviation.
http://imgur.com/a/oVgML
Now I need to find the actual probability of a particular value. So, for example, if my mean is 0 and my value is 0, my probability is 1. This is usually done by calculating the area under the curve, similar to this:
http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
I am not sure how to approach this problem
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
def normal(power, mean, std, val):
    a = 1/(np.sqrt(2*np.pi)*std)
    diff = np.abs(np.power(val-mean, power))
    b = np.exp(-(diff)/(2*std*std))
    return a*b
pdf_array = []
array = np.arange(-2,2,0.1)
print(array)
for i in array:
    print(i)
    pdf = normal(2, 0, 0.1, i)
    print(pdf)
    pdf_array.append(pdf)
plt.plot(array, pdf_array)
plt.ylabel('some numbers')
plt.axis([-2, 2, 0, 5])
plt.show()
print()
Unless you have a reason to implement this yourself, all these functions are available in scipy.stats.norm.
I think you are asking for the cdf, in which case use this code:
from scipy.stats import norm
print(norm.cdf(x, mean, std))
If you want to write it from scratch:
import math
class PDF():
    def __init__(self, mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma
        self.data = []
    def calculate_mean(self):
        # true division; floor division (//) would truncate the mean
        self.mean = sum(self.data) / len(self.data)
        return self.mean
    def calculate_stdev(self, sample=True):
        # sample=True uses the n-1 (Bessel-corrected) denominator
        if sample:
            n = len(self.data)-1
        else:
            n = len(self.data)
        mean = self.mean
        sigma = 0
        for el in self.data:
            sigma += (el - mean)**2
        sigma = math.sqrt(sigma / n)
        self.stdev = sigma
        return self.stdev
    def pdf(self, x):
        return (1.0 / (self.stdev * math.sqrt(2*math.pi))) * math.exp(-0.5*((x - self.mean) / self.stdev) ** 2)
The area under a curve y = f(x) from x = a to x = b is the same as the integral of f(x)dx from x = a to x = b. Scipy has a quick easy way to do integrals. And just so you understand, the probability of finding a single point in that area cannot be one because the idea is that the total area under the curve is one (unless MAYBE it's a delta function). So you should get 0 ≤ probability of value < 1 for any particular value of interest. There may be different ways of doing it, but a conventional way is to assign confidence intervals along the x-axis like this. I would read up on Gaussian curves and normalization before continuing to code it.
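To make the area argument concrete, here is a minimal sketch that integrates the normal PDF numerically with the trapezoidal rule and compares it with the CDF difference. It uses statistics.NormalDist (Python 3.8+) rather than SciPy; the helper name area_under_pdf and the step count are assumptions:

```python
from statistics import NormalDist

def area_under_pdf(dist, a, b, steps=10_000):
    """Trapezoidal-rule estimate of the integral of dist.pdf over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * h)
    return total * h

dist = NormalDist(mu=0.0, sigma=0.1)        # same mean and std as the question's plot
approx = area_under_pdf(dist, -0.1, 0.1)    # P(-0.1 < X < 0.1) by integration
exact = dist.cdf(0.1) - dist.cdf(-0.1)      # same probability from the CDF
point = area_under_pdf(dist, 0.0, 0.0)      # a single point has zero width
print(approx, exact, point)
```

The single-point "area" is zero even though the density at 0 is about 3.99, which is why a density value above 1 is not a contradiction.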
I am trying to evaluate the density of multivariate t distribution of a 13-d vector. Using the dmvt function from the mvtnorm package in R, the result I get is
[1] 1.009831e-13
When I tried to write the function myself in Python (thanks to the suggestions in this post: multivariate student t-distribution with python), I realized that the gamma function was taking very high values (given that I have n=7512 observations), causing my function to overflow.
I tried to modify the algorithm, using the math.lgamma() and np.linalg.slogdet() functions to transform it to the log scale, but the result I got was
8.97669876e-15
The function that I used in Python is the following:
def dmvt(x,mu,Sigma,df,d):
    '''
    Multivariate t-student density:
    output:
        the density of the given element
    input:
        x = parameter (d dimensional numpy array or scalar)
        mu = mean (d dimensional numpy array or scalar)
        Sigma = scale matrix (dxd numpy array)
        df = degrees of freedom
        d: dimension
    '''
    Num = math.lgamma( 1. *(d+df)/2 ) - math.lgamma( 1.*df/2 )
    (sign, logdet) = np.linalg.slogdet(Sigma)
    Denom =1/2*logdet + d/2*( np.log(pi)+np.log(df) ) + 1.*( (d+df)/2 )*np.log(1 + (1./df)*np.dot(np.dot((x - mu),np.linalg.inv(Sigma)), (x - mu)))
    d = 1. * (Num - Denom)
    return np.exp(d)
Any ideas why this function does not produce the same results as the R equivalent?
Using x = (0,0) produces similar results (up to a point, due to rounding), but with x = (1,1) I get a significant difference!
I finally managed to 'translate' the code from the mvtnorm package in R and the following script works without numerical underflows.
import numpy as np
import scipy.stats
import math
from math import lgamma
from numpy import matrix
from numpy import linalg
from numpy.linalg import slogdet
import scipy.special
from scipy.special import gammaln
mu = np.array([3,3])
x = np.array([1, 1])
Sigma = np.array([[1, 0], [0, 1]])
p=2
df=1
def dmvt(x, mu, Sigma, df, log):
    '''
    Multivariate t-student density. Returns the density
    of the function at points specified by x.
    input:
        x = parameter (n x d numpy array)
        mu = mean (d dimensional numpy array)
        Sigma = scale matrix (d x d numpy array)
        df = degrees of freedom
        log = log scale or not
    '''
    p = Sigma.shape[0] # Dimensionality
    dec = np.linalg.cholesky(Sigma)
    R_x_m = np.linalg.solve(dec,np.matrix.transpose(x)-mu)
    rss = np.power(R_x_m,2).sum(axis=0)
    # np.log1p (rather than math.log1p) also accepts array-valued rss
    logretval = lgamma(1.0*(p + df)/2) - (lgamma(1.0*df/2) + np.sum(np.log(dec.diagonal())) \
        + p/2 * np.log(math.pi * df)) - 0.5 * (df + p) * np.log1p(rss/df)
    if log == False:
        return(np.exp(logretval))
    else:
        return(logretval)
print(dmvt(x,mu,Sigma,df,True))
print(dmvt(x,mu,Sigma,df,False))
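As a sanity check on the log-scale formula, the special case of a diagonal scale matrix can be written with the standard library alone and compared against a closed form. This is a sketch under that diagonal assumption, not a translation of mvtnorm; the name dmvt_diag is hypothetical:

```python
from math import exp, lgamma, log, log1p, pi

def dmvt_diag(x, mu, scale_diag, df, log_scale=False):
    """Multivariate t density for a *diagonal* scale matrix (illustrative helper)."""
    p = len(x)
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu); trivial for a diagonal Sigma.
    rss = sum((xi - mi) ** 2 / si for xi, mi, si in zip(x, mu, scale_diag))
    # 0.5 * log det(Sigma), matching the sum of log Cholesky diagonals above.
    half_logdet = 0.5 * sum(log(si) for si in scale_diag)
    logval = (lgamma((p + df) / 2.0) - lgamma(df / 2.0)
              - half_logdet - 0.5 * p * log(pi * df)
              - 0.5 * (df + p) * log1p(rss / df))
    return logval if log_scale else exp(logval)

# Same inputs as the script above: x = (1, 1), mu = (3, 3), identity scale, df = 1.
density = dmvt_diag([1, 1], [3, 3], [1, 1], df=1)
print(density)
```

For these inputs the density reduces by hand to Gamma(1.5)/Gamma(0.5) / pi * 9^(-1.5) = 1/(54*pi), which gives an independent value to compare against.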
I am converting the scipy.stats.lognorm.cdf function to a Cython function, using the formula from http://www.cs.unitn.it/~taufer/SR/P-LN.pdf: 1/2 + 1/2 * erf((ln(x) - mu) / (sigma * sqrt(2))). The results don't match, despite many other references to the same function online. EDIT: just fixed it; I only had to apply np.log(mu) twice. Fixed code:
import numpy as np
from scipy.stats import lognorm
from scipy.special import erf
def lognormcdf(x, mu, sigma):
    return 0.5 + 0.5*erf((np.log(x)-np.log(mu))/(np.sqrt(2.0)*sigma))
mu = 3.85
sigma = 0.346
x = [-9.997137267734412802e-01,-9.984919506395958377e-01,-9.962951347331251428e-01,-9.931249370374434227e-01,-9.889843952429917540e-01,-9.838775407060570410e-01,-9.778093584869183008e-01,-9.707857757637063933e-01,-9.628136542558155542e-01,-9.539007829254917414e-01,-9.440558701362560257e-01,-9.332885350430795146e-01,-9.216092981453339883e-01,-9.090295709825296777e-01,-8.955616449707269888e-01,-8.812186793850184108e-01,-8.660146884971646752e-01,-8.499645278795913139e-01,-8.330838798884008245e-01,-8.153892383391762033e-01,-7.968978923903144995e-01,-7.776279096494954635e-01,-7.575981185197071532e-01,-7.368280898020207470e-01,-7.153381175730564312e-01,-6.931491993558019926e-01,-6.702830156031409636e-01,-6.467619085141292912e-01,-6.226088602037077591e-01,-5.978474702471787694e-01,-5.725019326213811599e-01,-5.465970120650941455e-01,-5.201580198817630230e-01,-4.932107892081909473e-01,-4.657816497733580086e-01,-4.378974021720314913e-01,-4.095852916783015440e-01,-3.808729816246299582e-01,-3.517885263724216949e-01,-3.223603439005291449e-01,-2.926171880384719759e-01,-2.625881203715034751e-01,-2.323024818449739570e-01,-2.017898640957360157e-01,-1.710800805386032686e-01,-1.402031372361139672e-01,-1.091892035800611088e-01,-7.806858281343663497e-02,-4.687168242159163445e-02,-1.562898442154308370e-02,1.562898442154308370e-02,4.687168242159163445e-02,7.806858281343663497e-02,1.091892035800611088e-01,1.402031372361139672e-01,1.710800805386032686e-01,2.017898640957360157e-01,2.323024818449739570e-01,2.625881203715034751e-01,2.926171880384719759e-01,3.223603439005291449e-01,3.517885263724216949e-01,3.808729816246299582e-01,4.095852916783015440e-01,4.378974021720314913e-01,4.657816497733580086e-01,4.932107892081909473e-01,5.201580198817630230e-01,5.465970120650941455e-01,5.725019326213811599e-01,5.978474702471787694e-01,6.226088602037077591e-01,6.467619085141292912e-01,6.702830156031409636e-01,6.931491993558019926e-01,7.153381175730564312e-01,7.368280898020207470e-01,7.575981185197071532e-01,7.776279096494954635e-01,7.968978923903144995e-01,8.153892383391762033e-01,8.330838798884008245e-01,8.499645278795913139e-01,8.660146884971646752e-01,8.812186793850184108e-01,8.955616449707269888e-01,9.090295709825296777e-01,9.216092981453339883e-01,9.332885350430795146e-01,9.440558701362560257e-01,9.539007829254917414e-01,9.628136542558155542e-01,9.707857757637063933e-01,9.778093584869183008e-01,9.838775407060570410e-01,9.889843952429917540e-01,9.931249370374434227e-01,9.962951347331251428e-01,9.984919506395958377e-01,9.997137267734412802e-01]
mycdf = lognormcdf(x, np.log(mu), sigma)
scipycdf = lognorm.cdf(x, scale=np.log(mu), s=sigma)
# This line comparing the Scipy function and mine displays the results below
np.sum(np.nan_to_num(mycdf)-scipycdf)
Results:
1.2011928779531548e-15
The original post was edited to reflect the correct formula.
def lognormcdf(x, mu, sigma):
    return 0.5 + 0.5*erf((np.log(x)-np.log(mu))/(np.sqrt(2.0)*sigma))
Pass np.log(mu) in for mu and it works.
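The confusion comes from naming: in this formula the mu argument plays the role of the distribution's median, which is what SciPy calls scale (scale = exp of the mean of ln(x)). A minimal standard-library sketch that makes this explicit; the name lognorm_cdf and the parameter name median are assumptions made for illustration:

```python
from math import erf, log, sqrt
from statistics import NormalDist

def lognorm_cdf(x, median, sigma):
    """Lognormal CDF, with `median` = exp(mean of ln X) and `sigma` = std of ln X."""
    if x <= 0:
        return 0.0   # the lognormal distribution has no mass at or below zero
    return 0.5 + 0.5 * erf((log(x) - log(median)) / (sqrt(2.0) * sigma))

median, sigma = 3.85, 0.346
print(lognorm_cdf(median, median, sigma))   # half the mass lies below the median
# Cross-check: ln X is normal with mean ln(median) and standard deviation sigma.
print(lognorm_cdf(5.0, median, sigma), NormalDist(log(median), sigma).cdf(log(5.0)))
```

The early return for x <= 0 also avoids the nan values that np.log produces for the negative inputs in the question's x array.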
I can implement the error function, erf, myself, but I'd prefer not to. Is there a python package with no external dependencies that contains an implementation of this function? I have found this but this seems to be part of some much larger package (and it's not even clear which one!).
Since Python 2.7, the standard math module contains an erf function. This should be the easiest way.
http://docs.python.org/2/library/math.html#math.erf
I recommend SciPy for numerical functions in Python, but if you want something with no dependencies, here is a function whose error is less than 1.5 * 10^-7 for all inputs.
import math
def erf(x):
    # save the sign of x
    sign = 1 if x >= 0 else -1
    x = abs(x)
    # constants
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429
    p = 0.3275911
    # A&S formula 7.1.26
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)
    return sign*y # erf(-x) = -erf(x)
The algorithm comes from Handbook of Mathematical Functions, formula 7.1.26.
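The stated error bound can be checked empirically against math.erf. The sketch below repeats the approximation under the hypothetical name erf_as so the block is self-contained, then scans a grid on [-5, 5]; the grid and its spacing are arbitrary choices:

```python
import math

def erf_as(x):
    """Abramowitz & Stegun formula 7.1.26, same coefficients as the answer above."""
    sign = 1 if x >= 0 else -1
    x = abs(x)
    a1, a2, a3 = 0.254829592, -0.284496736, 1.421413741
    a4, a5, p = -1.453152027, 1.061405429, 0.3275911
    t = 1.0 / (1.0 + p * x)
    # Horner evaluation of the degree-5 polynomial in t
    y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * math.exp(-x * x)
    return sign * y

# Largest absolute deviation from the standard library's erf on the grid.
grid = [i / 1000.0 for i in range(-5000, 5001)]
max_err = max(abs(erf_as(x) - math.erf(x)) for x in grid)
print(max_err)
```

A grid scan like this does not prove the bound, but it is a quick way to catch a mistyped coefficient.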
I would recommend you download numpy (to have efficient matrices in Python) and scipy (a Matlab toolbox substitute, which uses numpy). The erf function lives in scipy.
>>> from scipy.special import erf
>>> help(erf)
You can also use the erf function defined in pylab, but pylab is intended more for plotting the results of what you compute with numpy and scipy. If you want an all-in-one installation of this software, you can use the Enthought Python Distribution directly.
A pure python implementation can be found in the mpmath module (http://code.google.com/p/mpmath/)
From the doc string:
>>> from mpmath import *
>>> mp.dps = 15
>>> print erf(0)
0.0
>>> print erf(1)
0.842700792949715
>>> print erf(-1)
-0.842700792949715
>>> print erf(inf)
1.0
>>> print erf(-inf)
-1.0
For large real x, \mathrm{erf}(x) approaches 1 very
rapidly::
>>> print erf(3)
0.999977909503001
>>> print erf(5)
0.999999999998463
The error function is an odd function::
>>> nprint(chop(taylor(erf, 0, 5)))
[0.0, 1.12838, 0.0, -0.376126, 0.0, 0.112838]
:func:erf implements arbitrary-precision evaluation and
supports complex numbers::
>>> mp.dps = 50
>>> print erf(0.5)
0.52049987781304653768274665389196452873645157575796
>>> mp.dps = 25
>>> print erf(1+j)
(1.316151281697947644880271 + 0.1904534692378346862841089j)
Related functions
See also :func:erfc, which is more accurate for large x,
and :func:erfi which gives the antiderivative of
\exp(t^2).
The Fresnel integrals :func:fresnels and :func:fresnelc
are also related to the error function.
To answer my own question, I have ended up using the following code, adapted from a Java version I found elsewhere on the web:
import math
# from: http://www.cs.princeton.edu/introcs/21function/ErrorFunction.java.html
# Implements the Gauss error function.
# erf(z) = 2 / sqrt(pi) * integral(exp(-t*t), t = 0..z)
#
# fractional error in math formula less than 1.2 * 10 ^ -7.
# although subject to catastrophic cancellation when z is very close to 0
# from Chebyshev fitting formula for erf(z) from Numerical Recipes, 6.2
def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp( -z*z - 1.26551223 +
                        t * ( 1.00002368 +
                        t * ( 0.37409196 +
                        t * ( 0.09678418 +
                        t * (-0.18628806 +
                        t * ( 0.27886807 +
                        t * (-1.13520398 +
                        t * ( 1.48851587 +
                        t * (-0.82215223 +
                        t * ( 0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans
I have a function which makes 10^5 erf calls. On my machine:
scipy.special.erf takes 6.1s
the erf from Handbook of Mathematical Functions takes 8.3s
the erf from Numerical Recipes 6.2 takes 9.5s
(three-run averages, code taken from the posters above).
One note for those aiming for higher performance: vectorize, if possible.
import numpy as np
from scipy.special import erf
def vectorized(n):
    x = np.random.randn(n)
    return erf(x)
def loopstyle(n):
    x = np.random.randn(n)
    return [erf(v) for v in x]
# n must be an integer for np.random.randn; 10**6 equals the original 10e5
%timeit vectorized(10**6)
%timeit loopstyle(10**6)
gives results
# vectorized
10 loops, best of 3: 108 ms per loop
# loops
1 loops, best of 3: 2.34 s per loop
SciPy has an implementation of the erf function, see scipy.special.erf.
The comments in the C source of Python's math.erf function show that it uses up to 50 terms in the approximation:
Implementations of the error function erf(x) and the complementary error
function erfc(x).
Method: we use a series approximation for erf for small x, and a continued
fraction approximation for erfc(x) for larger x;
combined with the relations erf(-x) = -erf(x) and erfc(x) = 1.0 - erf(x),
this gives us erf(x) and erfc(x) for all x.
The series expansion used is:
erf(x) = x*exp(-x*x)/sqrt(pi) * [
2/1 + 4/3 x**2 + 8/15 x**4 + 16/105 x**6 + ...]
The coefficient of x**(2k-2) here is 4**k*factorial(k)/factorial(2*k).
This series converges well for smallish x, but slowly for larger x.
The continued fraction expansion used is:
erfc(x) = x*exp(-x*x)/sqrt(pi) * [1/(0.5 + x**2 -) 0.5/(2.5 + x**2 - )
3.0/(4.5 + x**2 - ) 7.5/(6.5 + x**2 - ) ...]
after the first term, the general term has the form:
k*(k-0.5)/(2*k+0.5 + x**2 - ...).
This expansion converges fast for larger x, but convergence becomes
infinitely slow as x approaches 0.0. The (somewhat naive) continued
fraction evaluation algorithm used below also risks overflow for large x;
but for large x, erfc(x) == 0.0 to within machine precision. (For
example, erfc(30.0) is approximately 2.56e-393).
Parameters: use series expansion for abs(x) < ERF_SERIES_CUTOFF and
continued fraction expansion for ERF_SERIES_CUTOFF <= abs(x) <
ERFC_CONTFRAC_CUTOFF. ERFC_SERIES_TERMS and ERFC_CONTFRAC_TERMS are the
numbers of terms to use for the relevant expansions.
#define ERF_SERIES_CUTOFF 1.5
#define ERF_SERIES_TERMS 25
#define ERFC_CONTFRAC_CUTOFF 30.0
#define ERFC_CONTFRAC_TERMS 50
Error function, via power series.
Given a finite float x, return an approximation to erf(x).
Converges reasonably fast for small x.
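The series quoted above can be tried directly in Python. This is a minimal sketch of the same expansion, not CPython's actual implementation (the summation order and cutoffs may differ); it uses the fact that consecutive coefficients 4**k*factorial(k)/factorial(2*k) differ by a factor of 2/(2*k+1). The name erf_series is an assumption:

```python
import math

def erf_series(x, terms=25):
    """erf(x) = x*exp(-x*x)/sqrt(pi) * [2/1 + 4/3 x**2 + 8/15 x**4 + ...]."""
    x2 = x * x
    term = 2.0      # k = 1 term: coefficient 2/1 times x**0
    acc = 0.0
    for k in range(1, terms + 1):
        acc += term
        # next coefficient is the current one times 2/(2k+1); the power of x grows by x**2
        term *= 2.0 * x2 / (2 * k + 1)
    return acc * x * math.exp(-x2) / math.sqrt(math.pi)

# Compare with the library function below the 1.5 series cutoff.
for x in (0.1, 0.5, 1.0, 1.4):
    print(x, erf_series(x), math.erf(x))
```

With 25 terms the truncated series should agree with math.erf to near machine precision for abs(x) < 1.5, which is consistent with the ERF_SERIES_CUTOFF and ERF_SERIES_TERMS values quoted above.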