alternative parametrization of the negative binomial in scipy - python

In scipy the negative binomial distribution is defined as:
nbinom.pmf(k) = choose(k+n-1, n-1) * p**n * (1-p)**k
This is the common definition; see also Wikipedia:
https://en.wikipedia.org/wiki/Negative_binomial_distribution
However, there is a different parametrization in which the negative binomial is defined by the mean mu and a dispersion parameter.
In R this is easy, as the negative binomial can be defined by either parametrization:
dnbinom(x, size, prob, mu, log = FALSE)
How can I use the mean/dispersion parametrization in scipy?
edit:
straight from the R help:
The negative binomial distribution with size = n and prob = p has density
Γ(x+n)/(Γ(n) x!) p^n (1-p)^x
An alternative parametrization (often used in ecology) is by the mean mu (see above), and size, the dispersion parameter, where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization.
It is also described in more detail here:
https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations

from scipy.stats import nbinom

def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    # scipy's success probability 1 - p equals theta / (theta + mu),
    # i.e. size / (size + mu) from the R help quoted above
    return r, 1 - p

def pmf(counts, mu, theta):
    """
    >>> import numpy as np
    >>> from scipy.stats import poisson
    >>> np.isclose(pmf(10, 10, 10000), poisson.pmf(10, 10), atol=1e-3)
    True
    """
    return nbinom.pmf(counts, *convert_params(mu, theta))

def logpmf(counts, mu, theta):
    return nbinom.logpmf(counts, *convert_params(mu, theta))

def cdf(counts, mu, theta):
    return nbinom.cdf(counts, *convert_params(mu, theta))

def sf(counts, mu, theta):
    return nbinom.sf(counts, *convert_params(mu, theta))
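
A quick sanity check (my own addition, not from the original answer): with mu = 5 and theta = 2 the wrapped distribution should report a mean of 5 and a variance of mu + mu**2/theta = 17.5.

r, p = convert_params(5.0, 2.0)
mean, var = nbinom.stats(r, p, moments='mv')
print(mean, var)   # expected: 5.0 17.5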

The Wikipedia page you linked gives a precise formula for p and r in terms of mu and sigma; see the very last bullet item in the Alternative parametrization section, which works out to r = mu**2 / (sigma**2 - mu) and, for scipy's nbinom, p = mu / sigma**2: https://en.m.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations


Implementation of the second derivative of a normal probability distribution function in python

If the PDF of the normal distribution is:
scipy.stats.norm.pdf(x, mu, sigma)
its first derivative with respect to x would be:
scipy.stats.norm.pdf(x, mu, sigma)*(mu - x)/sigma**2
What would be the second derivative?
You can apply the product rule:
(f(x)*g(x))' = f(x)*g'(x) + f'(x)*g(x)
where f(x) = pdf(x, mu, sigma) and g(x) = (mu - x)/sigma**2.
Then f'(x) = f(x) * g(x)
and g'(x) = -1/sigma**2.
Putting it all together, the second derivative of the PDF is:
import scipy.stats

def second_derivative(x, mu, sigma):
    g = (mu - x) / sigma**2
    return scipy.stats.norm.pdf(x, mu, sigma) * (g**2 - 1/sigma**2)
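
As a quick numerical check (my own addition, not part of the original answer), the result can be compared against a central finite difference of the PDF:

import numpy as np
import scipy.stats

x, mu, sigma, h = 0.7, 0.0, 1.3, 1e-4
# second-order central difference: (f(x+h) - 2*f(x) + f(x-h)) / h**2
numeric = (scipy.stats.norm.pdf(x + h, mu, sigma)
           - 2 * scipy.stats.norm.pdf(x, mu, sigma)
           + scipy.stats.norm.pdf(x - h, mu, sigma)) / h**2
print(np.isclose(second_derivative(x, mu, sigma), numeric))   # True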

parameterization of the negative binomial in scipy via mean and std

I am trying to fit my data to a Negative Binomial Distribution with the package scipy in Python. However, my validation seems to fail.
These are my steps:
I have some demand data which is described by the statistics:
mu = 1.4
std = 1.59
print(mu, std)
I use the parameterization function below, taken from this post to compute the two NB parameters.
def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p
I pass my two statistics (hopefully correctly...); the naming conventions across different sources (p, r, k) are rather confusing at this point:
firstParam, secondParam = convert_params(mu, std)
I would then use these two parameters to fit the distribution:
from scipy.stats import nbinom
rv = nbinom(firstParam, secondParam)
Then I calculate a value R with the Percent Point Function .ppf(0.95). The value R in the context of my problem is a Reorder Point.
R = rv.ppf(0.95)
This is where I expect to validate the previous steps, but I do not manage to recover my original statistics mu and std from mean and math.sqrt(var) respectively.
import math
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print(mean, math.sqrt(var))
What am I missing? Any feedback about the parameterization implemented in Scipy?
The conversion code is wrong, I believe; SciPy is NOT using the Wiki convention, but the Mathematica convention. With SciPy's nbinom the mean is n*(1-p)/p and the variance is n*(1-p)/p**2, so from a given mean and variance you get p = mean/var and n = mean*p/(1 - p):
#%%
import numpy as np
from scipy.stats import nbinom

def convert_params(mean, std):
    """
    Convert mean/std parameterization of a negative binomial to the ones scipy supports

    See https://mathworld.wolfram.com/NegativeBinomialDistribution.html
    """
    p = mean / std**2
    n = mean * p / (1.0 - p)
    return n, p

mean = 1.4
std = 1.59
n, p = convert_params(mean, std)
print((n, p))

#%%
m, v = nbinom.stats(n, p, moments='mv')
print(m, np.sqrt(v))
The code prints back the pair 1.4, 1.59.
And the reorder point, computed as
rv = nbinom(n, p)
print("reorder point:", rv.ppf(0.95))
outputs 5.
It looks like you are using a different conversion. The last bullet of the cited Wikipedia section gives the formulas shown below. With these formulas you get back the exact same mu and std:
import numpy as np
from scipy.stats import nbinom

def convert_mu_std_to_r_p(mu, std):
    r = mu ** 2 / (std ** 2 - mu)
    p = 1 - mu / std ** 2
    return r, 1 - p

mu = 1.4
std = 1.59
print("mu, std:", mu, std)

firstParam, secondParam = convert_mu_std_to_r_p(mu, std)
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print("mean, sqrt(var):", mean, np.sqrt(var))

rv = nbinom(firstParam, secondParam)
print("reorder point:", rv.ppf(0.95))
Output:
mu, std: 1.4 1.59
mean, sqrt(var): 1.4 1.59
reorder point: 5.0
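
For completeness (my own addition, not part of either answer): the mean/dispersion converter from the first question gives exactly the same scipy parameters once the dispersion is derived from the mean and standard deviation as theta = mu**2 / (std**2 - mu); the original mistake was passing the standard deviation where theta was expected.

# not from the original answers: derive theta from mu and std, then apply the
# mean/dispersion conversion used in the first question above
mu, std = 1.4, 1.59
theta = mu**2 / (std**2 - mu)      # dispersion implied by this mean/std pair
var = mu + mu**2 / theta           # equals std**2 by construction
r, p_scipy = theta, 1 - (var - mu) / var
print(r, p_scipy)                  # same pair as convert_mu_std_to_r_p(1.4, 1.59)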

The generalized Student-T probability distribution I coded in Python doesn't integrate to 1 (in some cases)

I've been trying to implement the skewed generalized t distribution in Python to model some financial returns. I based my code on formulas found on Wikipedia, and I used the beta function from scipy.special.
from scipy.special import beta
import numpy as np
from math import sqrt

def sgt(x, params):
    # This function accepts an array of 5 parameters [mu, sigma, lambda, p, q]
    mu, sigma, lam, p, q = params
    v = (q**(-1/p)) / (sqrt((3*lam*lam + 1)*beta(3/p, q-2/p)/beta(1/p, q) - 4*lam*lam*(beta(2/p, q-1/p)/(beta(1/p, q)))**2))
    m = 2*v*sigma*lam*q**(1/p)*beta(2/p, q - 1/p) / beta(1/p, q)
    fx = p / (2*v*sigma*(q**(1/p))*beta(1/p, q)*((abs(x-mu+m)**p/(q*(v*sigma)**p*(lam*np.sign(x-mu+m)+1)**p + 1)+1)**(1/p + q)))
    return fx
Now, the function seems to work perfectly fine for some sets of parameters, but terribly for other sets of parameters.
For example:
dx = 0.001
x_axis = np.arange(-10, 10, dx)
ok_parameters = [0, 2, 0, 3, 8]
bad_parameters = [0, 2, 0, 1.05, 2.1]
ok_distribution = sgt(x_axis, ok_parameters)
bad_distribution = sgt(x_axis, bad_parameters)
If I try to compute the integrals of those two distributions:
a = np.sum(ok_distribution*dx)
b = np.sum(bad_distribution*dx)
I obtain the results a = 1.0013233154393804 and b = 2.2799746093533346.
Now, in theory both of these should be 1, but since I approximated the integral numerically I assume the value won't always be exactly 1. In the second case, however, I don't understand why the value is so far off.
Does anyone know what the issue is?
These are the graphs of the ok distribution (blue) and bad distribution (orange)
I believe there was just a typo (though I couldn't pin down exactly where) in your definition of sgt. Here is an implementation that works.
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.special import beta
import numpy as np
from math import sqrt
from typing import Union
from scipy import integrate

# Generalised Student t probability distribution
def generalized_student_t(x: Union[float, np.ndarray], mu: float, sigma: float,
                          lam: float, p: float, q: float) -> Union[float, np.ndarray]:
    v = q**(-1/p) * ((3*lam**2 + 1)*(beta(3/p, q - 2/p)/beta(1/p, q)) - 4*lam**2*(beta(2/p, q - 1/p)/beta(1/p, q))**2)**(-1/2)
    m = 2*v*sigma*lam*q**(1/p)*beta(2/p, q - 1/p)/beta(1/p, q)
    fx = p / (2*v*sigma*q**(1/p)*beta(1/p, q)*(abs(x - mu + m)**p/(q*(v*sigma)**p*(lam*np.sign(x - mu + m) + 1)**p) + 1)**(1/p + q))
    return fx

def plot_cdf_pdf(x_axis: np.ndarray, pmf: np.ndarray) -> None:
    """
    Plot the PDF and CDF of the array returned from the function.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    ax1.plot(x_axis, pmf)
    ax1.set_title('PDF')
    ax2.plot(x_axis, integrate.cumtrapz(x=x_axis, y=pmf, initial=0))
    ax2.set_title('CDF')

dx = 0.0001
x_axis = np.arange(-10, 10, dx)

# Create the two distributions
distribution1 = generalized_student_t(x=x_axis, mu=0, sigma=1, lam=0, p=2, q=100)
distribution2 = generalized_student_t(x=x_axis, mu=0, sigma=2, lam=0, p=1.05, q=2.1)

plot_cdf_pdf(x_axis=x_axis, pmf=distribution1)
plot_cdf_pdf(x_axis=x_axis, pmf=distribution2)
We can also check that the integrals of the PDFs are 1:
integrate.simps(x=x_axis, y = distribution1)
integrate.simps(x=x_axis, y = distribution2)
We can see the results of the integrals are 0.99999999999999978 and 0.99752026308335162. The reason they are not exactly 1 is that the CDF is defined as the integral of the PDF from -infinity to +infinity, while here we only integrate numerically over the finite range [-10, 10].

Improving accuracy in scipy.optimize.fsolve with equations involving integration

I'm trying to solve an integral equation using the following code (irrelevant parts removed):
def _pdf(self, a, b, c, t):
    pdf = some_pdf(a, b, c, t)
    return pdf

def _result(self, a, b, c, flag):
    return fsolve(lambda t: flag - 1 + quad(lambda tau: self._pdf(a, b, c, tau), 0, t)[0], x0)[0]
This takes a probability density function and finds a value tau such that the integral of the pdf from tau to infinity equals flag. Note that x0 is a (float) estimate of the root defined elsewhere in the script. Also note that flag is an extremely small number, on the order of 1e-9.
In my application fsolve only successfully finds a root about 50% of the time. It often just returns x0, significantly biasing my results. There is no closed form for the integral of the pdf, so I am forced to integrate numerically, and I suspect this might be introducing some inaccuracy.
EDIT:
This has since been solved using a method other than that described below, but I'd like to get quadpy to work and see if the results improve at all. The specific code I'm trying to get to work is as follows:
import quadpy
import numpy as np
from scipy.optimize import *
from scipy.special import gammaln, kv, gammaincinv, gamma
from scipy.integrate import quad, simps

l = 226.02453163
mu = 0.00212571582056
nu = 4.86569872444
flag = 2.5e-09
estimate = 3 * mu

def pdf(l, mu, nu, t):
    return np.exp(np.log(2) + (l + nu - 1 + 1) / 2 * np.log(l * nu / mu)
                  + (l + nu - 1 - 1) / 2 * np.log(t)
                  + np.log(kv(nu - l, 2 * np.sqrt(l * nu / mu * t)))
                  - gammaln(l) - gammaln(nu))

def tail_cdf(l, mu, nu, tau):
    i, error = quadpy.line_segment.adaptive_integrate(
        lambda t: pdf(l, mu, nu, t), [tau, 10000], 1.0e-10
    )
    return i

result = fsolve(lambda tau: flag - tail_cdf(l, mu, nu, tau[0]), estimate)
When I run this I get an assertion error from assert all(lengths > minimum_interval_length). I'm not quite sure how to remedy this; any help would be very much appreciated!
As an example, I tried integrating 1/x between 1 and alpha and solving for the target integral 2.0. This
import quadpy
from scipy.optimize import fsolve

def f(alpha):
    beta, _ = quadpy.quad(lambda x: 1.0 / x, 1, alpha)
    return beta

target = 2.0
res = fsolve(lambda alpha: target - f(alpha), x0=2.0)
print(res)
correctly returns 7.38905611, i.e. exp(2), since the integral of 1/x from 1 to alpha is log(alpha).
The failing quadpy assertion
assert all(lengths > minimum_interval_length)
you're getting means that the adaptive integration hit its limit: either relax your tolerance a bit, or decrease the minimum_interval_length (see here).
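
Not part of the original answer, but related to the fsolve behaviour described in the question: a bracketing solver such as scipy.optimize.brentq cannot silently return its starting point, so the same toy example can also be solved with plain scipy:

from scipy.integrate import quad
from scipy.optimize import brentq

target = 2.0

def g(alpha):
    return target - quad(lambda x: 1.0 / x, 1, alpha)[0]

# g(1) > 0 and g(100) < 0, so the root is bracketed
print(brentq(g, 1.0, 100.0))   # 7.389056... = exp(2)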

Develop the Python function normdist(x, mu ,sigma)

I do not understand this homework question I have received. Our task is to develop the Python function normdist(x, mu, sigma), which evaluates the multivariate
Gaussian probability density function for the k-dimensional vector x, the mean vector μ and the covariance
matrix Σ. In the special case where k = 1, this function evaluates the univariate Gaussian probability
density function for the scalar x, the mean μ and the standard deviation σ.
My attempt is below:
from math import sqrt, pi, exp
# erfcc is a complementary error function helper (math.erfc would serve the same purpose)

def normcdf(x, mu, sigma):
    t = x - mu
    y = 0.5 * erfcc(-t / (sigma * sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x - mu) / abs(sigma)
    y = (1 / (sqrt(2 * pi)**k * abs(sigma))) * exp(-u * u / 2)   # NB: k is not defined in this function
    return y

def normdist(x, mu, sigma, k):
    if k:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y
The code above is credited to Cerin.
How do I handle the case of k = 1?
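
One possible direction (my own sketch, not an accepted answer): treat x, mu and sigma as arrays and evaluate the general k-dimensional density, which collapses to the univariate formula when k = 1. Here sigma is assumed to be passed as a covariance matrix, so in the k = 1 case it is the variance (the squared standard deviation):

import numpy as np

def normdist(x, mu, sigma):
    # sketch only: multivariate Gaussian density with covariance matrix sigma;
    # for k = 1 pass the variance (std**2) to recover the univariate PDF
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    cov = np.atleast_2d(np.asarray(sigma, dtype=float))
    k = x.size
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi)**k * np.linalg.det(cov))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

# univariate check: the N(0, 1) density at x = 0 is 1/sqrt(2*pi)
print(normdist(0.0, 0.0, 1.0), 1 / np.sqrt(2 * np.pi))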
