I'm having trouble obtaining the dispersion parameter of simulated data using statsmodels' GLM function.
import statsmodels.api as sm
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
np.random.seed(1)
# Generate data
x = np.random.uniform(0, 100, 50000)
x2 = sm.add_constant(x)
a = 0.5
b = 0.2
y_true = 1/(a+(b*x))
# Add error
scale = 2 # the scale parameter I'm trying to obtain
shape = y_true/scale # given that, for Gamma, mu = scale*shape
y = np.random.gamma(shape=shape, scale=scale)
# Run model
model = sm.GLM(y, x2, family=sm.families.Gamma()).fit()
model.summary()
In the resulting summary, the coefficient estimates are correct (0.5 and 0.2), but the reported scale (21.995) is far from the scale I set (2).
Can someone point out what it is I'm misunderstanding/doing wrong? Thanks!
As Josef noted in the comments, statsmodels uses a different parameterization: for the Gamma family, the reported scale is the dispersion parameter (estimated by default from the Pearson chi-square), which corresponds to 1/shape in numpy's shape/scale convention, not to numpy's scale. In the simulation above the numpy scale is held fixed at 2 while the shape varies with the mean, so the dispersion is not constant and the reported scale has no reason to equal 2.
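A minimal sketch of a simulation whose dispersion statsmodels can recover (the value of phi below is purely illustrative): keep the Gamma shape constant and let numpy's scale follow the mean.
import numpy as np
import statsmodels.api as sm
np.random.seed(1)
x = np.random.uniform(0, 100, 50000)
x2 = sm.add_constant(x)
mu = 1/(0.5 + 0.2*x)
phi = 0.5            # target dispersion (illustrative choice)
shape = 1/phi        # constant shape implies constant dispersion
y = np.random.gamma(shape=shape, scale=mu/shape)  # E[y] = mu, Var[y] = phi*mu**2
res = sm.GLM(y, x2, family=sm.families.Gamma()).fit()
print(res.params)    # approximately [0.5, 0.2]
print(res.scale)     # approximately phi (Pearson chi^2 / df_resid), not the numpy scale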
I'm obviously doing something wrong here... Please have a look at the following program. It runs fine, but it gives me a lambda parameter for an exponential distribution that is far from the parameter I used to generate the random observations:
import numpy as np
import arviz as az
import pymc as pm
lambda_param = 0.25
random_size = 1000
x = np.random.exponential(lambda_param, random_size)
basic_model = pm.Model()
with basic_model:
    _lam_ = pm.HalfNormal("lambda", sigma=1)
    Y_obs = pm.Exponential("Y_obs", lam=_lam_, observed=x)
    start = pm.find_MAP(model=basic_model)
    idata = pm.sample(1000, start=start)
summary = az.summary(idata, round_to = 6)
summary
After the last run of the program, the summary shows a mean for lambda greater than 4, even though I used lambda = 0.25.
Pointing out my programming errors would be highly appreciated.
I found the problem: the uncertainty on _lam_ was too large, and since the exponential probability distribution is not symmetric, the high uncertainty skewed the result. The fix is simply to use a smaller standard deviation; I also used Normal rather than HalfNormal for simplicity:
import numpy as np
import pymc3 as pm
import arviz as az
lambda_param = 0.25
random_size = 1000
x = np.random.exponential(lambda_param, random_size)
with pm.Model() as basic_model:
    lam = pm.Normal("lam", mu=lambda_param, sigma=0.0001)
    Y_obs = pm.Exponential("Y_obs", lam=lam, observed=x)
    trace = pm.sample(1000, tune=1000)
summary = az.summary(trace, round_to=6)
summary
This gives a mean of 0.25 for lambda, within a small margin of error.
I'm running the code below:
import numpy as np
from lmfit import Model
def exp_model(x, ampl1=1.0, tau1=0.1):
    exponential = ampl1*np.exp(-x/tau1)
    return exponential
x = np.array([2.496,2.528,2.56,2.592,2.624])
y = np.array([8774.52,8361.68,7923.42,7502.43,7144.11])
dec_model = Model(exp_model, nan_policy='propagate')
results = dec_model.fit(y, x=x, ampl1=y[0])
results.plot()
The resulting plot shows that the fit is just failing for some reason. I can't figure out why. It had worked for similar data before. Any help would be greatly appreciated.
It wasn't converging because the initial value for the tau1 parameter was too far away from the real value. The code below works well.
import numpy as np
from lmfit import Model
def exp_model(x, ampl1=1.0, tau1=1.0): # The initial value of tau1 was changed from 0.1 to 1.0
    exponential = ampl1*np.exp(-x/tau1)
    return exponential
x = np.array([2.496,2.528,2.56,2.592,2.624])
y = np.array([8774.52,8361.68,7923.42,7502.43,7144.11])
dec_model = Model(exp_model, nan_policy='propagate')
results = dec_model.fit(y, x=x, ampl1=y[0])
results.plot()
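Equivalently, the starting value can be supplied at fit time rather than by editing the function's default, since lmfit's Model.fit accepts initial parameter values as keyword arguments; a small variation on the fix above:
# Keep the original model function (default tau1=0.1) and override the start value here
results = dec_model.fit(y, x=x, ampl1=y[0], tau1=1.0)
results.plot()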
In GPflow one can add a fitted mean function to the GP regression. When doing this as in the basic example, the result is that the predictions carry no uncertainty from the fit of the mean itself. E.g. in the example below the error bars don't grow outside the range of available data, because the slope of the linear mean remains fixed at its optimized value. Is there a way to account for these uncertainties, so that the error bands grow when extrapolating?
(The question was originally stated in an issue report but moved here to be more accessible)
import numpy as np
import matplotlib.pyplot as plt
import gpflow
from gpflow.utilities import print_summary
def f(x):
    return np.sin(3*x) + x
xtrain = np.linspace(0, 3, 50).reshape([-1, 1])
ytrain = f(xtrain) + 0.5*(np.random.randn(len(xtrain)).reshape([-1, 1]) - 0.5)
k = gpflow.kernels.SquaredExponential()
meanf = gpflow.mean_functions.Linear()
m = gpflow.models.GPR(data=(xtrain, ytrain), kernel=k, mean_function=meanf)
opt = gpflow.optimizers.Scipy()
def objective_closure():
    return -m.log_marginal_likelihood()
opt_logs = opt.minimize(objective_closure,
                        m.trainable_variables,
                        options=dict(maxiter=100))
print_summary(m)
xpl = np.linspace(-5, 10, 100).reshape(100, 1)
mean, var = m.predict_f(xpl)
plt.figure(figsize=(12, 6))
plt.plot(xtrain, ytrain, 'x')
plt.plot(xpl, mean, 'C0', lw=2)
plt.fill_between(xpl[:, 0],
                 mean[:, 0] - 1.96 * np.sqrt(var[:, 0]),
                 mean[:, 0] + 1.96 * np.sqrt(var[:, 0]),
                 color='C0', alpha=0.2)
Most of GPflow's models only optimise for the MAP estimate of the hyperparameters of the kernel, mean function and likelihood. The models do not account for uncertainty on these hyperparameters during training or prediction. While this could be limiting for certain problems, we often find that this is a sensible compromise between computational complexity and uncertainty quantification.
That being said, in your specific case (i.e. a linear mean function) we can account for uncertainty in the linear trend of the data by specifying a linear kernel function, rather than a linear mean function.
Using your snippet with this model specification:
k = gpflow.kernels.SquaredExponential() + gpflow.kernels.Linear()
meanf = gpflow.mean_functions.Zero()
m = gpflow.models.GPR(data=(xtrain, ytrain), kernel=k, mean_function=meanf)
This gives a fit with error bars that grow outside the data range.
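For completeness, a minimal end-to-end sketch of that specification, reusing the training data, optimisation closure style, and prediction step from the question; only the kernel and mean function change:
import numpy as np
import gpflow
xtrain = np.linspace(0, 3, 50).reshape([-1, 1])
ytrain = np.sin(3*xtrain) + xtrain + 0.5*(np.random.randn(len(xtrain)).reshape([-1, 1]) - 0.5)
# The linear trend is now part of the GP prior, so uncertainty about the
# slope contributes to the predictive variance outside the data range
k = gpflow.kernels.SquaredExponential() + gpflow.kernels.Linear()
m = gpflow.models.GPR(data=(xtrain, ytrain), kernel=k,
                      mean_function=gpflow.mean_functions.Zero())
opt = gpflow.optimizers.Scipy()
opt.minimize(lambda: -m.log_marginal_likelihood(),
             m.trainable_variables, options=dict(maxiter=100))
xpl = np.linspace(-5, 10, 100).reshape(-1, 1)
mean, var = m.predict_f(xpl)  # var grows as xpl moves away from the training data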
I have a dataset whose samples are discrete values (in particular, the size of a queue over time). I'd like to find which distribution they follow. To do this I'd proceed the same way I did for the other quantities, i.e. by plotting a qqplot, running
import statsmodels.api as sm
sm.qqplot(df, dist = 'geom', sparams = (.5,), line ='s', alpha = 0.3, marker ='.')
This works if dist is not a discrete random variable (e.g. 'exp' or 'norm'), and indeed I did get some results, but when the distribution is discrete (say, 'geom'), I get
AttributeError: 'geom_gen' object has no attribute 'fit'
I searched the Internet for how to make a qqplot (or something similar) to identify which distribution my samples follow, but I found nothing. Here is what I tried:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def discreteQQ(x_sample):
    # probabilities at which to compare the two quantile functions
    p_test = np.linspace(0, 1, 1001)
    x_sample = np.sort(x_sample)
    # theoretical geometric quantiles evaluated at the sample's ECDF positions
    ecdf_sample = np.arange(1, len(x_sample) + 1)/(len(x_sample) + 1)
    x_theor = stats.geom.ppf(ecdf_sample, p=0.5)
    for p in p_test:
        plt.scatter(np.quantile(x_theor, p), np.quantile(x_sample, p), c='blue')
    plt.xlabel('Theoretical quantiles')
    plt.ylabel('Sample quantiles')
    plt.show()
Generate a theoretical geometric distribution using scipy.stats.geom, convert the sample and theoretical data using statsmodels' ProbPlot and pass these to statsmodels' qqplot_2samples.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.gofplots import qqplot_2samples
p_theor = 1/4 # The probability we check for
p_sample = 1/5 # The true probability of the sample distribution
# The experimental data
x_sample = stats.geom.rvs(p_sample, size=50)
# The model data
x_theor = stats.geom.rvs(p_theor, size=100)
qqplot_2samples(ProbPlot(x_sample), ProbPlot(x_theor), line='45')
plt.show()
So, let's say I have the following 2-dimensional target distribution that I would like to sample from (a mixture of bivariate normal distributions) -
import numba
import numpy as np
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
def targ_dist(x):
    target = (stats.multivariate_normal.pdf(x, [0, 0], [[1, 0], [0, 1]])
              + stats.multivariate_normal.pdf(x, [-6, -6], [[1, 0.9], [0.9, 1]])
              + stats.multivariate_normal.pdf(x, [4, 4], [[1, -0.9], [-0.9, 1]]))/3
    return target
and the following proposal distribution (a bivariate random walk) -
def T(x,y,sigma):
    return stats.multivariate_normal.pdf(y, x, [[sigma**2, 0], [0, sigma**2]])
The following is the Metropolis-Hastings code for updating the "entire" state in every iteration -
#Initialising
n_iter = 30000
# tuning parameter i.e. variance of proposal distribution
sigma = 2
# initial state
X = stats.uniform.rvs(loc=-5, scale=10, size=2, random_state=None)
# count number of acceptances
accept = 0
# store the samples
MHsamples = np.zeros((n_iter,2))
# MH sampler
for t in range(n_iter):
    # proposals
    Y = X + stats.norm.rvs(0, sigma, 2)
    # accept or reject
    u = stats.uniform.rvs(loc=0, scale=1, size=1)
    # acceptance probability
    r = (targ_dist(Y)*T(Y, X, sigma))/(targ_dist(X)*T(X, Y, sigma))
    if u < r:
        X = Y
        accept += 1
    MHsamples[t] = X
However, I would like to update "per component" (i.e. component-wise updating) in every iteration. Is there a simple way of doing this?
Thank you for your help!
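One simple option is to propose and accept/reject one coordinate at a time within each iteration. A minimal sketch reusing targ_dist from above (the per-coordinate proposal scale sigma is kept at the question's value, and since the Gaussian proposal is symmetric, the T terms cancel in the acceptance ratio):
n_iter = 30000
sigma = 2
X = stats.uniform.rvs(loc=-5, scale=10, size=2)
MHsamples = np.zeros((n_iter, 2))
accept = np.zeros(2)              # acceptances per component
for t in range(n_iter):
    for j in range(2):            # sweep over the two components
        Y = X.copy()
        Y[j] = X[j] + stats.norm.rvs(0, sigma)   # perturb only component j
        r = targ_dist(Y)/targ_dist(X)            # symmetric proposal: T terms cancel
        if stats.uniform.rvs() < r:
            X = Y
            accept[j] += 1
    MHsamples[t] = X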
From the tone of your question I assume you are looking for performance improvements.
Monte Carlo algorithms are quite compute-intensive. You will get better results if you implement the algorithm at a lower level than an interpreted language like Python, e.g. by writing a C extension.
There are also ready-made implementations available for Python (PyStan, PyMC3).