parameterization of the negative binomial in scipy via mean and std - python

I am trying to fit my data to a Negative Binomial Distribution with the package scipy in Python. However, my validation seems to fail.
These are my steps:
I have some demand data which is described by the statistics:
mu = 1.4
std = 1.59
print(mu, std)
I use the parameterization function below, taken from this post, to compute the two NB parameters.
def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p
I pass (hopefully correctly...) my two statistics; the naming conventions between different sources (p, r, k) are rather confusing at this point:
firstParam, secondParam = convert_params(mu, std)
I would then use these two parameters to fit the distribution:
from scipy.stats import nbinom
rv = nbinom(firstParam, secondParam)
Then I calculate a value R with the Percent Point Function .ppf(0.95). The value R in the context of my problem is a Reorder Point.
R = rv.ppf(0.95)
This is where I expect to validate the previous steps, but I do not manage to recover my original statistics mu and std from mean and math.sqrt(var), respectively.
import math
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print(mean, math.sqrt(var))
What am I missing? Any feedback about the parameterization implemented in Scipy?

The conversion code is wrong, I believe: SciPy is NOT using the Wikipedia convention, but the Mathematica convention.
#%%
import numpy as np
from scipy.stats import nbinom
def convert_params(mean, std):
    """
    Convert the mean/std parameterization of a negative binomial to the one scipy supports
    See https://mathworld.wolfram.com/NegativeBinomialDistribution.html
    """
    p = mean/std**2
    n = mean*p/(1.0 - p)
    return n, p
mean = 1.4
std = 1.59
n, p = convert_params(mean, std)
print((n, p))
#%%
m, v = nbinom.stats(n, p, moments='mv')
print(m, np.sqrt(v))
The code prints back the 1.4, 1.59 pair.
And the reorder point, computed as
rv = nbinom(n, p)
print("reorder point:", rv.ppf(0.95))
outputs 5

It looks like you are using a different conversion. The last bullet in the cited Wikipedia section gives the formulas shown below. With these formulas you get back the exact same mu and std:
import numpy as np
from scipy.stats import nbinom
def convert_mu_std_to_r_p(mu, std):
    r = mu ** 2 / (std ** 2 - mu)
    p = 1 - mu / std ** 2
    return r, 1 - p
mu = 1.4
std = 1.59
print("mu, std:", mu, std)
firstParam, secondParam = convert_mu_std_to_r_p(mu, std)
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print("mean, sqrt(var):", mean, np.sqrt(var))
rv = nbinom(firstParam, secondParam)
print("reorder point:", rv.ppf(0.95))
Output:
mu, std: 1.4 1.59
mean, sqrt(var): 1.4 1.59
reorder point: 5.0
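As a rough sanity check of what that reorder point means (just a sketch): ppf(0.95) returns the smallest k whose CDF reaches 0.95, so with R = 5 one would expect rv.cdf(4) to be just below 0.95 and rv.cdf(5) at or above it:
print("P(X <= 4):", rv.cdf(4))  # expected just below 0.95
print("P(X <= 5):", rv.cdf(5))  # expected at or above 0.95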

Related

Binomial distribution: How to calculate alpha, so that the probability is covered by the confidence interval?

So, I have the code that calculates the bounds of a confidence interval:
import statsmodels.api as sm
from statsmodels.stats.proportion import proportion_confint
def bin_conf(k, n, a):
    alpha, count, nobs = a, k, n
    return proportion_confint(count, nobs, alpha, method='normal')
bin_conf(75, 300, 0.05)
>>> (0.20100090038649865, 0.29899909961350135)
But I need to compute the alpha such that an already defined probability is covered by an already defined confidence interval.
For example: number of trials (n) = 500. Successful trials = 200. CI = [0.35; 0.45]. Alpha = ?
Does 'statsmodels' or any other Python library have a way to find that out?
statsmodels does not have a helper function for this because it is not a usual use case.
However, confidence intervals based on normal distribution can be easily inverted. The width of the confidence interval is two times the critical value times the standard deviation of the mean.
import numpy as np
from scipy import stats
ci = [0.35, 0.45]
count, nobs = 200, 500
p = count / nobs
std = np.sqrt(p * (1 - p) / nobs)
critval = (ci[1] - ci[0]) / std / 2
alpha = stats.norm.sf(critval) * 2 # two-sided
alpha
0.02247887336612522
Check that it is the same as the "normal" proportion_confint:
proportion_confint(count, nobs, alpha, method='normal')
(0.35, 0.45000000000000007)
The same could be used for confidence intervals based on the t-distribution, by replacing stats.norm with stats.t and using the appropriate degrees of freedom.
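A minimal sketch of that t-based variant, assuming the usual nobs - 1 degrees of freedom:
from scipy import stats

df = nobs - 1  # assumed degrees of freedom
critval_t = (ci[1] - ci[0]) / std / 2
alpha_t = stats.t.sf(critval_t, df) * 2  # two-sided
alpha_t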

alternative parametrization of the negative binomial in scipy

In scipy the negative binomial distribution is defined as:
nbinom.pmf(k) = choose(k+n-1, n-1) * p**n * (1-p)**k
This is the common definition, see also wikipedia:
https://en.wikipedia.org/wiki/Negative_binomial_distribution
However, there exists a different parametrization where the negative binomial is defined by the mean mu and a dispersion parameter.
In R this is easy, as the negbin can be defined by both parametrizations:
dnbinom(x, size, prob, mu, log = FALSE)
How can I use the mean/dispersion parametrization in scipy?
edit:
straight from the R help:
The negative binomial distribution with size = n and prob = p has density
Γ(x+n)/(Γ(n) x!) p^n (1-p)^x
An alternative parametrization (often used in ecology) is by the mean mu (see above), and size, the dispersion parameter, where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization.
It is also described here in more detail:
https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
from scipy.stats import nbinom
def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p

def pmf(counts, mu, theta):
    """
    >>> import numpy as np
    >>> from scipy.stats import poisson
    >>> np.isclose(pmf(10, 10, 10000), poisson.pmf(10, 10), atol=1e-3)
    True
    """
    return nbinom.pmf(counts, *convert_params(mu, theta))

def logpmf(counts, mu, theta):
    return nbinom.logpmf(counts, *convert_params(mu, theta))

def cdf(counts, mu, theta):
    return nbinom.cdf(counts, *convert_params(mu, theta))

def sf(counts, mu, theta):
    return nbinom.sf(counts, *convert_params(mu, theta))
The Wikipedia page you linked gives a precise formula for p and r in terms of mu and sigma; see the very last bullet item in the Alternative parametrization section: https://en.m.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
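A minimal sketch of that bullet's conversion, assuming the same mu/sigma notation as above and sigma**2 > mu so the formula is defined (the helper name mu_sigma_to_scipy is just for illustration):
from scipy.stats import nbinom

def mu_sigma_to_scipy(mu, sigma):
    # last bullet of the linked section: r = mu**2 / (sigma**2 - mu), scipy's p = mu / sigma**2
    r = mu ** 2 / (sigma ** 2 - mu)
    p = mu / sigma ** 2
    return r, p

# round-trip check of the moments
r, p = mu_sigma_to_scipy(1.4, 1.59)
mean, var = nbinom.stats(r, p, moments='mv')
print(mean, var ** 0.5)  # should recover roughly 1.4 and 1.59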

Python: Random number generator with mean and Standard Deviation

I need to know how to generate 1000 random numbers between 500 and 600 that have a mean of 550 and a standard deviation of 30 in Python.
import pylab
import random
xrandn = pylab.zeros(1000,float)
for j in range(500, 601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
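As a quick sanity check (just a sketch), the samples stay inside [500, 600], though note that truncation pulls the sample standard deviation somewhat below the requested 30:
print(values.min(), values.max())   # both within [500, 600]
print(values.mean(), values.std())  # mean near 550; std below 30 because of the truncation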
There are other choices for your problem too. Wikipedia has a list of continuous distributions with bounded intervals; depending on the distribution, you may be able to get the required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot (image not shown here).
Note that not every possible combination of bounds, mean and standard deviation will produce a valid distribution in this case, though, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (even though mean and standard deviation would still be correct).
I'm not exactly sure what the OP desired, but if they just wanted an array xrandn matching the bottom plot, below are the steps:
First, create a standard distribution (Gaussian distribution), the easiest way might be to use numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]

Jensen-Shannon Divergence

I have another question that I was hoping someone could help me with.
I'm using the Jensen-Shannon-Divergence to measure the similarity between two probability distributions. The similarity scores appear to be correct in the sense that they fall between 0 and 1, given that one uses the base-2 logarithm, with 0 meaning that the distributions are equal.
However, I'm not sure whether there is in fact an error somewhere and was wondering whether someone might be able to say 'yes it's correct' or 'no, you did something wrong'.
Here is the code:
from numpy import zeros, array
from math import sqrt, log
class JSD(object):
    def __init__(self):
        self.log2 = log(2)

    def KL_divergence(self, p, q):
        """ Compute KL divergence of two vectors, K(p || q)."""
        return sum(p[x] * log((p[x]) / (q[x])) for x in range(len(p)) if p[x] != 0.0 or p[x] != 0)

    def Jensen_Shannon_divergence(self, p, q):
        """ Returns the Jensen-Shannon divergence. """
        self.JSD = 0.0
        weight = 0.5
        average = zeros(len(p))  # Average
        for x in range(len(p)):
            average[x] = weight * p[x] + (1 - weight) * q[x]
        self.JSD = (weight * self.KL_divergence(array(p), average)) + ((1 - weight) * self.KL_divergence(array(q), average))
        return 1 - (self.JSD / sqrt(2 * self.log2))

if __name__ == '__main__':
    J = JSD()
    p = [1.0/10, 9.0/10, 0]
    q = [0, 1.0/10, 9.0/10]
    print J.Jensen_Shannon_divergence(p, q)
The problem is that I feel that the scores are not high enough when comparing two text documents, for instance. However, this is purely a subjective feeling.
Any help is, as always, appreciated.
Note that the scipy entropy call below is the Kullback-Leibler divergence.
See: http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
#!/usr/bin/env python
from scipy.stats import entropy
from numpy.linalg import norm
import numpy as np
def JSD(P, Q):
    _P = P / norm(P, ord=1)
    _Q = Q / norm(Q, ord=1)
    _M = 0.5 * (_P + _Q)
    return 0.5 * (entropy(_P, _M) + entropy(_Q, _M))
Also note that the test case in the question seems wrong: the sum of the p distribution does not add up to 1.0.
See: http://www.itl.nist.gov/div898/handbook/eda/section3/eda361.htm
Since the Jensen-Shannon distance (distance.jensenshannon) has been included in Scipy 1.2, the Jensen-Shannon divergence can be obtained as the square of the Jensen-Shannon distance:
from scipy.spatial import distance
distance.jensenshannon([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10]) ** 2
# 0.5306056938642212
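As a side note, distance.jensenshannon also accepts a base argument, so the base-2 value the question refers to can be obtained directly (the natural-log result above divided by ln 2, roughly 0.766):
distance.jensenshannon([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10], base=2) ** 2
# roughly 0.766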
Get some data for distributions with known divergence and compare your results against those known values.
BTW: the sum in KL_divergence may be rewritten using the zip built-in function like this:
sum(_p * log(_p / _q) for _p, _q in zip(p, q) if _p != 0)
This does away with lots of "noise" and is also much more "pythonic". The double comparison with 0.0 and 0 is not necessary.
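For example, a minimal check against a known value, reusing the JSD helper defined above: two distributions with disjoint support have a Jensen-Shannon divergence of exactly log(2), about 0.693 nats (1 bit in base 2).
import numpy as np

# disjoint distributions: the divergence should equal log(2)
print(JSD(np.array([1.0, 0.0]), np.array([0.0, 1.0])), np.log(2))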
A general version, for n probability distributions, in python
import numpy as np
from scipy.stats import entropy as H

def JSD(prob_distributions, weights, logbase=2):
    # left term: entropy of mixture
    # reshape weights so that each distribution (row) gets its own weight
    wprobs = np.reshape(weights, (-1, 1)) * prob_distributions
    mixture = wprobs.sum(axis=0)
    entropy_of_mixture = H(mixture, base=logbase)

    # right term: sum of the weighted entropies
    entropies = np.array([H(P_i, base=logbase) for P_i in prob_distributions])
    wentropies = weights * entropies
    sum_of_entropies = wentropies.sum()

    divergence = entropy_of_mixture - sum_of_entropies
    return divergence
# From the original example with three distributions:
P_1 = np.array([1/2, 1/2, 0])
P_2 = np.array([0, 1/10, 9/10])
P_3 = np.array([1/3, 1/3, 1/3])
prob_distributions = np.array([P_1, P_2, P_3])
n = len(prob_distributions)
weights = np.empty(n)
weights.fill(1/n)
print(JSD(prob_distributions, weights))
#0.546621319446
Explicitly following the math in the Wikipedia article:
import numpy as np

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A / B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))
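A rough usage check on the question's two vectors: with base-2 logs this should give about 0.766 bits, i.e. the 0.5306 nats from the distance.jensenshannon answer above divided by ln 2.
p = [1.0/10, 9.0/10, 0]
q = [0, 1.0/10, 9.0/10]
print(jsdiv(p, q))  # roughly 0.766 bits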

How to calculate cumulative normal distribution?

I am looking for a function in Numpy or Scipy (or any rigorous Python library) that will give me the cumulative normal distribution function in Python.
Here's an example:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
In other words, approximately 95% of the standard normal distribution lies within about two standard deviations of the mean (zero).
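As a quick illustration, the two CDF values can be subtracted to get that central probability directly:
>>> norm.cdf(1.96) - norm.cdf(-1.96)  # roughly 0.95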
If you need the inverse CDF:
>>> norm.ppf(norm.cdf(1.96))
array(1.9599999999999991)
It may be too late to answer the question, but since Google still leads people here, I decided to write my solution here.
Since Python 2.7, the math library has included the error function math.erf(x).
The erf() function can be used to compute traditional statistical functions such as the cumulative standard normal distribution:
from math import *
def phi(x):
    """Cumulative distribution function for the standard normal distribution."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0
Ref:
https://docs.python.org/2/library/math.html
https://docs.python.org/3/library/math.html
How are the Error Function and Standard Normal distribution function related?
Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the cumulative distribution function (cdf - probability that a random sample X will be less than or equal to x) for a given mean (mu) and standard deviation (sigma):
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
Which can be simplified for the standard normal distribution (mu = 0 and sigma = 1):
NormalDist().cdf(1.96)
# 0.9750021048517796
NormalDist().cdf(-1.96)
# 0.024997895148220428
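NormalDist also exposes the inverse CDF via inv_cdf, should you need the quantile (mirroring the norm.ppf example above):
NormalDist().inv_cdf(0.975)
# roughly 1.96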
Adapted from here http://mail.python.org/pipermail/python-list/2000-June/039873.html
from math import *
def erfcc(x):
    """Complementary error function."""
    z = abs(x)
    t = 1. / (1. + 0.5*z)
    r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
        t*(.09678418+t*(-.18628806+t*(.27886807+
        t*(-1.13520398+t*(1.48851587+t*(-.82215223+
        t*.17087277)))))))))
    if x >= 0.:
        return r
    else:
        return 2. - r

def ncdf(x):
    return 1. - 0.5*erfcc(x/(2**0.5))
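As a quick cross-check of this approximation (just a sketch), it should agree with scipy's norm.cdf to several decimal places:
from scipy.stats import norm

print(ncdf(1.96))      # approximately 0.9750
print(norm.cdf(1.96))  # 0.9750021048517795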
To build upon Unknown's example, the Python equivalent of the function normdist() implemented in a lot of libraries would be:
def normcdf(x, mu, sigma):
    t = x - mu
    y = 0.5 * erfcc(-t / (sigma * sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x - mu) / abs(sigma)
    y = (1 / (sqrt(2 * pi) * abs(sigma))) * exp(-u * u / 2)
    return y

def normdist(x, mu, sigma, f):
    if f:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y
Alex's answer shows you a solution for the standard normal distribution (mean = 0, standard deviation = 1). If you have a normal distribution with mean m and std s (which is sqrt(var)) and you want to calculate:
from scipy.stats import norm
# cdf(x < val)
print(norm.cdf(val, m, s))
# cdf(x > val)
print(1 - norm.cdf(val, m, s))
# cdf(v1 < x < v2)
print(norm.cdf(v2, m, s) - norm.cdf(v1, m, s))
Read more about cdf here and scipy implementation of normal distribution with many formulas here.
Taken from above:
from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
For a two-tailed test:
import numpy as np
z = 1.96
p_value = 2 * norm.cdf(-np.abs(z))
print(p_value)  # 0.04999579029644087
Simple like this:
import math
def my_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))
I found the formula in this page https://www.danielsoper.com/statcalc/formulas.aspx?id=55
