Modified BPMF in PyMC3 using `LKJCorr` priors: PositiveDefiniteError using `NUTS`

I previously implemented the original Bayesian Probabilistic Matrix Factorization (BPMF) model in pymc3. See my previous question for reference, data source, and problem setup. Per the answer to that question from @twiecki, I've implemented a variation of the model using LKJCorr priors for the correlation matrices and uniform priors for the standard deviations. In the original model, the covariance matrices are drawn from Wishart distributions, but due to current limitations of pymc3, the Wishart distribution cannot be sampled from properly. This answer to a loosely related question provides a succinct explanation for the choice of LKJCorr priors. The new model is below.
import logging

import pymc3 as pm
import numpy as np
import theano.tensor as t

n, m = train.shape
dim = 10     # dimensionality
beta_0 = 1   # scaling factor for lambdas; unclear on its use
alpha = 2    # fixed precision for likelihood function
std = .05    # how much noise to use for model initialization

# We will use separate priors for sigma and correlation matrix.
# In order to convert the upper triangular correlation values to a
# complete correlation matrix, we need to construct an index matrix:
n_elem = dim * (dim - 1) // 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)

logging.info('building the BPMF model')
with pm.Model() as bpmf:
    # Specify user feature matrix
    sigma_u = pm.Uniform('sigma_u', shape=dim)
    corr_triangle_u = pm.LKJCorr(
        'corr_u', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)
    corr_matrix_u = corr_triangle_u[tri_index]
    corr_matrix_u = t.fill_diagonal(corr_matrix_u, 1)
    cov_matrix_u = t.diag(sigma_u).dot(corr_matrix_u.dot(t.diag(sigma_u)))
    lambda_u = t.nlinalg.matrix_inverse(cov_matrix_u)

    mu_u = pm.Normal(
        'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
        testval=np.random.randn(dim) * std)
    U = pm.MvNormal(
        'U', mu=mu_u, tau=lambda_u,
        shape=(n, dim), testval=np.random.randn(n, dim) * std)

    # Specify item feature matrix
    sigma_v = pm.Uniform('sigma_v', shape=dim)
    corr_triangle_v = pm.LKJCorr(
        'corr_v', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)
    corr_matrix_v = corr_triangle_v[tri_index]
    corr_matrix_v = t.fill_diagonal(corr_matrix_v, 1)
    cov_matrix_v = t.diag(sigma_v).dot(corr_matrix_v.dot(t.diag(sigma_v)))
    lambda_v = t.nlinalg.matrix_inverse(cov_matrix_v)

    mu_v = pm.Normal(
        'mu_v', mu=0, tau=beta_0 * lambda_v, shape=dim,
        testval=np.random.randn(dim) * std)
    V = pm.MvNormal(
        'V', mu=mu_v, tau=lambda_v,
        shape=(m, dim), testval=np.random.randn(m, dim) * std)

    # Specify rating likelihood function
    R = pm.Normal(
        'R', mu=t.dot(U, V.T), tau=alpha * np.ones((n, m)),
        observed=train)

# `start` is the start dictionary obtained from running find_MAP for PMF.
# See the previous post for PMF code.
for key in bpmf.test_point:
    if key not in start:
        start[key] = bpmf.test_point[key]

with bpmf:
    step = pm.NUTS(scaling=start)
The goal with this reimplementation was to produce a model that could be estimated using the NUTS sampler. Unfortunately, I'm still getting the same error at the last line:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 ... 1030 1031 1032 1033 1034 ]
I've made all the code for PMF, BPMF, and this modified BPMF available in this gist to make it simple to replicate the error. All you need to do is download the data (also referenced in the gist).

It looks like you are passing the complete precision matrix into the normal distribution:
mu_u = pm.Normal(
    'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
    testval=np.random.randn(dim) * std)
I assume you only want to pass the diagonal values:
mu_u = pm.Normal(
    'mu_u', mu=0, tau=beta_0 * t.diag(lambda_u), shape=dim,
    testval=np.random.randn(dim) * std)
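The analogous change for mu_v would presumably be (a sketch, mirroring the line above):
mu_v = pm.Normal(
    'mu_v', mu=0, tau=beta_0 * t.diag(lambda_v), shape=dim,
    testval=np.random.randn(dim) * std)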
Does this change to mu_u and mu_v fix it for you?

Related

Image reconstruction with compressed sensing

I'm trying to code a demonstration of compressed sensing for my final year project but am getting poor image reconstruction when using the Lasso algorithm. I've relied on the following as a reference: http://www.pyrunner.com/weblog/2016/05/26/compressed-sensing-python/
However my code has some differences:
I use scikit-learn to perform a lasso optimisation (basis pursuit) as opposed to using cvxpy to perform an l_1 minimisation with an equality constraint as in the article.
I construct psi differently/more simply; testing seems to show that it's correct.
I use a different package to read and write the image.
import numpy as np
import scipy.fftpack as spfft
import scipy.ndimage as spimg
import imageio
from sklearn.linear_model import Lasso
x_orig = imageio.imread('gt40.jpg', pilmode='L') # read in grayscale
x = spimg.zoom(x_orig, 0.2) #zoom for speed
ny,nx = x.shape
k = round(nx * ny * 0.5) #50% sample
ri = np.random.choice(nx * ny, k, replace=False)
y = x.T.flat[ri] #y is the measured sample
# y = np.expand_dims(y, axis=1) ---- this doesn't seem to make a difference, was presumably required with cvxpy
psi = spfft.idct(np.identity(nx*ny), norm='ortho', axis=0) #my construction of psi
# psi = np.kron(
# spfft.idct(np.identity(nx), norm='ortho', axis=0),
# spfft.idct(np.identity(ny), norm='ortho', axis=0)
# )
# psi = 2*np.random.random_sample((nx*ny,nx*ny)) - 1
theta = psi[ri,:] #equivalent to phi*psi
lasso = Lasso(alpha=0.001, max_iter=10000)
lasso.fit(theta, y)
s = np.array(lasso.coef_)
x_recovered = psi @ s
x_recovered = x_recovered.reshape(nx, ny).T
x_recovered_final = x_recovered.astype('uint8') #recovered image is float64 and has negative values..
imageio.imwrite('gt40_recovered.jpg', x_recovered_final)
Unfortunately I'm not allowed to post images yet so here is a link to the original zoomed image, the image recovered with lasso and the image recovered with cvxpy (described later):
https://imgur.com/a/LROSug6
As you can see, not only is the recovery poor but the image is completely corrupted - the colours seem to be negative and the detail from the 50% sample lost. I think I've managed to track down the problem to the Lasso regression - it returns a vector that, when inverse transformed, has values that are not necessarily in the 0-255 range as expected for the image. So the conversion from dtype float64 to uint8 is rather random (e.g. -55 becomes 255-55=200).
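One possible workaround (a sketch assuming the variables above, not something taken from the reference article) is to rescale, or alternatively np.clip, the recovered floats into the 0-255 range before the cast, so the uint8 conversion no longer wraps negative values around:
# hypothetical post-processing of the float64 result of psi @ s
x_rescaled = (x_recovered - x_recovered.min()) / (x_recovered.max() - x_recovered.min()) * 255
x_recovered_final = x_rescaled.astype('uint8')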
Following this I tried swapping out lasso for the same optimisation as in the article (minimising the l_1 norm subject to theta*s=y using cvxpy):
import cvxpy as cvx
x_orig = imageio.imread('gt40.jpg', pilmode='L') # read in grayscale
x = spimg.zoom(x_orig, 0.2)
ny,nx = x.shape
k = round(nx * ny * 0.5)
ri = np.random.choice(nx * ny, k, replace=False)
y = x.T.flat[ri]
psi = spfft.idct(np.identity(nx*ny), norm='ortho', axis=0)
theta = psi[ri,:] #equivalent to phi*psi
#NEW CODE STARTS:
vx = cvx.Variable(nx * ny)
objective = cvx.Minimize(cvx.norm(vx, 1))
constraints = [theta @ vx == y]
prob = cvx.Problem(objective, constraints)
result = prob.solve(verbose=True)
s = np.array(vx.value).squeeze()
x_recovered = psi @ s
x_recovered = x_recovered.reshape(nx, ny).T
x_recovered_final = x_recovered.astype('uint8')
imageio.imwrite('gt40_recovered_altopt.jpg', x_recovered_final)
This took nearly 6 hours but finally I got a somewhat satisfactory result. However I would like to perform a demonstration of lasso if possible. Any help in getting the lasso to return appropriate values or somehow converting its result appropriately would be very much appreciated.

Checking if Frequentist approach is correct? Bayesian approach using MCMC for AB test. How to calculate Bayes Factors in Python?

I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me. I am struggling to understand the results, or whether I have computed them (in)correctly (which is quite likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes Factors. I've seen packages in R that make this look somewhat easy. Alas, I am not familiar with R and would prefer to be able to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis, to determine the number of required samples with a power of 0.8, alpha of 0.05 and a practical significance of 2%. I'm not sure whether expected conversion rates should be supplied, or the baseline + some proportion. Depending on the effect size, the required number of samples increases significantly.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance) # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance) # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size=effect_size,
                                               power=power,
                                               alpha=alpha,
                                               nobs1=nobs1,
                                               ratio=1)
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (b_success / b_total) - (a_success / a_total)

# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error

if practical_significance < lower_bound:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement with regards to conversion rate over group A (control) which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of successes and failures (and only tested on 1,000 observations) due to how long this thing takes, and ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_b = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
    np.random.shuffle(obs_a)
    np.random.shuffle(obs_b)

with pm.Model() as model:
    p_A = pm.Beta("p_A", 1, 1)
    p_B = pm.Beta("p_B", 1, 1)
    delta = pm.Deterministic("delta", p_A - p_B)
    obs_A = pm.Bernoulli("obs_A", p_A, observed=obs_a[:1000])
    obs_B = pm.Bernoulli("obs_B", p_B, observed=obs_b[:1000])
    step = pm.NUTS()
    trace = pm.sample(1000, step=step, chains=2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.

How to fit multiple datasets which have a combination of shared and non-shared parameters

I'm trying to fit multiple datasets which should have some variables that are shared between datasets and others which are not. However I'm unsure of the steps I need to take to do this. Below I've shown the approach that I'm trying to use (everything from 'Issues begin here' onward doesn't work and is just for illustrative purposes).
In this answer somebody is able to share parameters across datasets - is there some way that this could be adapted so that I can also have some non-shared parameters?
Does anybody have an idea how I could achieve this, or could somebody suggest a better approach to achieve the same result? Thanks.
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt
import pandas as pd
from lmfit import minimize, Minimizer, Parameters, Parameter, report_fit, Model

# Create datasets to fit
a = 1.99
start = gamma.ppf(0.001, a)
stop = gamma.ppf(.99, a)
xvals = np.linspace(start, stop, 100)
yvals = gamma.pdf(xvals, a)

data_dict = {}
for dataset in range(4):
    name = 'dataset_' + str(dataset)
    rand_offset = np.random.uniform(-.1, .1)
    noise = np.random.uniform(-.05, .05, len(yvals)) + rand_offset
    data_dict[name] = yvals + noise
df = pd.DataFrame(data_dict)

# Create some non-shared parameters
non_shared_param1 = np.random.uniform(0.009, .21, 4)
non_shared_param2 = np.random.uniform(0.01, .51, 4)

# Create the independent variable
ind_var = np.linspace(.001, 100, 100)

# Create a model
def model_func(time, Qi, at, vw, R, rhob_cb, al, NSP1, NSP2):
    Dt = at * vw
    Dl = al * vw
    t = time
    first_bot = 8 * np.pi * t * rhob_cb
    sec_bot = np.sqrt((np.pi * (Dl * R) * t))
    exp_top = R * np.power((NSP1 - ((t * vw) / R)), 2)
    exp_bot = 4 * Dl * t
    exp_top2 = R * np.power(NSP2, 2)
    exp_bot2 = 4 * Dt * t
    return (Qi / first_bot * sec_bot) * np.exp(- (exp_top / exp_bot) - (exp_top2 / exp_bot2))

model = Model(model_func)

### Issues begin here ###
all_results = {}
index = 0
for col in df:
    # This block assigns the correct non-shared parameter for the particular fit
    nsp1 = non_shared_param1[index]
    nsp2 = non_shared_param2[index]
    index += 1

    params = Parameters()
    at = 0.1
    al = 0.15
    vw = 10**-4
    Dt = at * vw
    Dl = al * vw

    # Non-shared parameters
    model.set_param_hint('NSP1', value=nsp1)
    model.set_param_hint('NSP2', value=nsp2)
    # Shared and varying parameters
    model.set_param_hint('vw', value=10**-4, min=10**-10)
    model.set_param_hint('at', value=0.1)
    model.set_param_hint('al', value=0.15)
    # Shared and fixed parameters
    model.set_param_hint('Qi', value=1000, vary=True)
    model.set_param_hint('R', value=1.7, vary=True)
    model.set_param_hint('rhob_cb', value=2895, vary=True)

    # One set of parameters should be returned
    result = model.fit(df[col], time=ind_var)
    all_results[index] = result
A fit with lmfit always uses a single instance of a Parameters object; it does not take multiple Parameters objects.
In order to simultaneously fit multiple data sets with similar models (perhaps the same mathematical model, but expecting different parameter values for each model), you need to have an objective function that concatenates the residuals from the different component models. And each of those models has to have parameters that are taken from the single instance of Parameters(), each parameter having a unique name.
So, to fit 2 data sets with the same function (let's use a Gaussian, with parameters "center", "amplitude", and "sigma"), you might define the Parameters as
params = Parameters()
params.add('center_1', 5., vary=True)
params.add('amplitude_1', 10., vary=True)
params.add('sigma_1', 1.0, vary=True)
params.add('center_2', 8., vary=True)
params.add('amplitude_2', 3., vary=True)
params.add('sigma_2', 2.0, vary=True)
Then use 'center_1', 'amplitude_1', and 'sigma_1' to calculate the model for the first data set and 'center_2', etc. to calculate the model for the second, perhaps as
def residual(params, x, datasets):
    model1 = params['amplitude_1'] * gaussian(x, params['center_1'], params['sigma_1'])
    model2 = params['amplitude_2'] * gaussian(x, params['center_2'], params['sigma_2'])
    resid1 = datasets[0] - model1
    resid2 = datasets[1] - model2
    return np.concatenate((resid1, resid2))

fit = lmfit.minimize(residual, params, fcn_args=(x, datasets))
As you might be able to see from this, Parameter values are independent by default. In order to share parameter values to be used in different datasets, you have to do that explicitly (as shown in the linked answer you provide).
For example, if you want to require the sigma values to be the same, you would not change the residual function, just the Parameter definition above to:
params.add('sigma_2', expr='sigma_1')
You could require the two amplitudes to add to some value:
params.add('amplitude_2', expr='10 - amplitude_1')
or perhaps you would want to ensure that 'center_2' is larger than 'center_1', but by an amount to be determined in the fit:
params.add('center_offset', value=0.5, min=0)
params.add('center_2', expr='center_1 + center_offset')
Those are all ways to tie parameter values. By default, they're independent. Of course, you can also have some parameters that get used in all the models (say, just call the parameter 'sigma' and use it for all models), as in the sketch below.
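For instance, a minimal sketch of that fully shared 'sigma' (assuming the same gaussian helper and two datasets as above):

params = Parameters()
params.add('sigma', 1.0, vary=True)        # one width shared by both models
params.add('center_1', 5., vary=True)
params.add('amplitude_1', 10., vary=True)
params.add('center_2', 8., vary=True)
params.add('amplitude_2', 3., vary=True)

def residual(params, x, datasets):
    # both components read the same 'sigma' parameter
    model1 = params['amplitude_1'] * gaussian(x, params['center_1'], params['sigma'])
    model2 = params['amplitude_2'] * gaussian(x, params['center_2'], params['sigma'])
    return np.concatenate((datasets[0] - model1, datasets[1] - model2))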

pymc3: how to model correlated intercept and slope in multilevel linear regression

In the Pymc3 example for multilevel linear regression (the example is here, with the radon data set from Gelman et al.'s (2007)), the intercepts (for different counties) and slopes (for apartments with and without basements) each have a Normal prior. How can I model them together with a multivariate normal prior, so that I can examine the correlation between them?
The hierarchical model given in the example is like this:
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Intercept for each county, distributed around group mean mu_a.
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)
    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    radon_est = a[county_idx] + b[county_idx] * data.floor.values

    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
    hierarchical_trace = pm.sample(2000)
And I'm trying to make some changes to the priors:
with pm.Model() as correlation_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)

    # Here I want to model a and b together.
    # I borrowed some code from a multivariate normal model,
    # but the code does not work.
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = T.fill_diagonal(C_triu[np.zeros((2,2), 'int')], 1)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_dot(sigma, C, sigma))
    tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
    a, b = pm.MvNormal('mu', mu=(mu_a, mu_b), tau=tau,
                       shape=(n_counties, n_counties))

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    radon_est = a[county_idx] + b[county_idx] * data.floor.values

    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
    correlation_trace = pm.sample(2000)
Here is the error message I got:
File "<ipython-input-108-ce400c54cc39>", line 14, in <module>
tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
File "/home/olivier/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 611, in __call__
node = self.make_node(*inputs, **kwargs)
File "/home/olivier/anaconda3/lib/python3.5/site-packages/theano/tensor/nlinalg.py", line 73, in make_node
assert x.ndim == 2
AssertionError
Clearly I've made some mistakes about the covariance matrix, but I'm new to pymc3 and completely new to theano so have no idea how to fix it. I gather this should be a rather common use case so maybe there have been some examples on it? I just can't find them.
The full replicable code and data can be seen on the example page (link given above). I didn't include it here because it's too long and also I thought those familiar with pymc3 are very likely already quite familiar with it:)
You forgot to add one line when creating the covariance matrix, and you mis-specified the shape of the MvNormal. Your model should look something like this:
with pm.Model() as correlation_model:
    mu = pm.Normal('mu', mu=0., sd=10, shape=2)
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = tt.fill_diagonal(C_triu[np.zeros((2,2), 'int')], 1.)
    sigma_diag = tt.nlinalg.diag(sigma)  # this line
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    tau = tt.nlinalg.matrix_inverse(cov)

    ab = pm.MvNormal('ab', mu=mu, tau=tau, shape=(n_counties, 2))

    eps = pm.HalfCauchy('eps', 5)
    radon_est = ab[:, 0][county_idx] + ab[:, 1][county_idx] * data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)

    trace = pm.sample(2000)
Notice that, alternatively, you can evaluate the correlation of the intercept and the slope from the posterior of hierarchical_model. You can use a frequentist method, or build another Bayesian model that takes as observed data the result of hierarchical_model. Maybe this could be faster.
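For instance, the frequentist route can be as simple as correlating the posterior draws directly (a minimal sketch, reusing hierarchical_trace from above):

chain = hierarchical_trace[100:]                  # discard burn-in
print(np.corrcoef(chain['mu_a'], chain['mu_b']))  # 2x2 correlation matrix of the draws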
EDIT
If you want to evaluate the correlation of two variables from the posterior you can do something like:
chain = hierarchical_trace[100:]
x_0 = chain['mu_a']
x_1 = chain['mu_b']
X = np.vstack((x_0, x_1)).T
and then you can run the following model:
with pm.Model() as correlation:
    mu = pm.Normal('mu', mu=0., sd=10, shape=2)
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = tt.fill_diagonal(C_triu[np.zeros((2,2), 'int')], 1.)
    sigma_diag = tt.nlinalg.diag(sigma)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    tau = tt.nlinalg.matrix_inverse(cov)

    yl = pm.MvNormal('yl', mu=mu, tau=tau, shape=(2, 2), observed=X)

    trace = pm.sample(5000, pm.Metropolis())
You can replace x_0 and x_1 according to your needs. For example you may want to do:
x_0 = np.random.normal(chain['mu_a'], chain['sigma_a'])
x_1 = np.random.normal(chain['mu_b'], chain['sigma_b'])

Fitting negative binomial in python

In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there a similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
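A minimal usage sketch of that class (statsmodels.api exposes the same NegativeBinomial model; the intercept-only design and the sample here are made up for illustration):

import numpy as np
import statsmodels.api as sm
from scipy.stats import nbinom

sample = nbinom.rvs(n=5, p=0.3, size=2000)  # hypothetical data

# NB2 regression on a constant: estimates the log-mean and the dispersion alpha by MLE
res = sm.NegativeBinomial(sample, np.ones_like(sample)).fit(disp=0)
mu = np.exp(res.params[0])   # fitted mean
alpha = res.params[1]        # fitted dispersion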
Not only because it is discrete, but also because a maximum likelihood fit to the negative binomial can be quite involved, especially with an additional location parameter. That is probably the reason why a .fit() method is not provided for it (and for other discrete distributions in SciPy). Here is an example:
In [163]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
    n = np.round(P[0])  # by definition, it should be an integer
    p = P[1]
    loc = np.round(P[2])
    return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start paramters, the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result = []
for i in range(40, 120):  # in fact (80, 120) should probably be enough
    _ = so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
    result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, normed=True)
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread and found an answer, for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
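A quick sanity check of that moment-matching (a sketch with a made-up sample):

import numpy as np
from scipy.stats import nbinom

sample = nbinom.rvs(n=4, p=0.5, size=50000)      # hypothetical sample
mu = np.mean(sample)
sigma_sqr = np.var(sample)
print(mu**2 / (sigma_sqr - mu), mu / sigma_sqr)  # should be close to n=4, p=0.5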
If you need the dispersion parameter, you can use a negative binomial regression model from statsmodels with just an intercept term. This will find the dispersion parameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom

def convert_params(mu, alpha):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    Parameters
    ----------
    mu : float
        Mean of NB distribution.
    alpha : float
        Overdispersion parameter used for variance calculation.

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    var = mu + alpha * mu ** 2
    p = mu / var
    r = mu ** 2 / (var - mu)
    return r, p

# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)

# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)

## Dispersion parameter from nb model with only an intercept term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1]  # Dispersion parameter

# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)

# Check that estimates are close to the true values:
print("""
                 {:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates      : {:<3} {:<3}""".format('n', 'p', n, p,
                                       np.round(n_est, 2), np.round(p_est, 2)))
