Sampling from posterior distribution of custom density in PyMC3

I am trying to get the predictive distribution from my model, whose likelihood is a custom-defined density: a mixture of Normals.
with RE2:
    trace = pm.variational.sample_vp(v_params, draws=5000)
trigger.set_value(triggers_test)
cc.set_value(cc_test)
y_output.set_value(np.zeros((len(y_test),)))
ppc = pm.sample_ppc(trace, model=RE2, samples=2000)
y_pred = ppc['R'].mean(axis=0)[:, None]
However, I get the error: AttributeError: 'DensityDist' object has no attribute 'random'. Is there a way to sample from the distribution? I am able to get the trace, and I can play around with this a bit, but I'm hoping that there is something better.
If it helps:
R = pm.DensityDist('R', logp_nmix(mus, stds, pi), observed=y_output)
I was able to get the posterior predictive properly (i.e. pm.sample_ppc working) when pm.DensityDist was applied to a latent variable rather than an observed variable.
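Since pm.sample_ppc needs the distribution to define a random method, one workaround is to draw posterior-predictive samples from the normal mixture by hand. This is only a hedged sketch: it assumes the trace can be indexed by variable name and that it contains 'mus', 'stds' and 'pi' (the names passed to logp_nmix above), each with one row per posterior draw and one column per mixture component.
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_ppc(trace, n_obs, samples=2000):
    # Number of posterior draws available in the trace
    n_draws = len(trace['pi'])
    out = np.empty((samples, n_obs))
    for s in range(samples):
        d = rng.integers(n_draws)                      # pick one posterior draw
        pi, mus, stds = trace['pi'][d], trace['mus'][d], trace['stds'][d]
        comp = rng.choice(len(pi), size=n_obs, p=pi)   # component index per observation
        out[s] = rng.normal(mus[comp], stds[comp])     # sample from that component
    return out

# y_ppc = sample_mixture_ppc(trace, n_obs=len(y_test))
# y_pred = y_ppc.mean(axis=0)[:, None]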

Related

'ThetaModelResults' object has no attribute 'get_prediction'

I am trying to create a theta model to forecast time series data, using the statsmodels library.
The code is:
theta = ThetaModel(d, period=12)
res_theta = theta.fit()
predictions_test = np.round(np.exp(res_theta.forecast(extra_periods, theta = 2).values))
This gives me predictions for the next extra_periods (set to 12) steps.
However, I would also like to look at predictions made over the training data, so I use the code:
predictions_train = res_theta.get_prediction()
Which results in:
AttributeError: 'ThetaModelResults' object has no attribute 'get_prediction'
Would anyone know how to get predictions for the training set using the theta model in Python's statsmodels package? I have searched the statsmodels docs and could not find anything that solves this.
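As a possible workaround (not a statsmodels API), in-sample predictions can be approximated with expanding-window one-step-ahead forecasts: refit the theta model on the first i observations and forecast one step, for increasing i. This is only a sketch; it assumes d is a pandas Series, and the minimum window of three full seasons and the helper name theta_insample are arbitrary choices.
import numpy as np
import pandas as pd
from statsmodels.tsa.forecasting.theta import ThetaModel

def theta_insample(d, period=12, min_obs=3 * 12):
    preds = []
    for i in range(min_obs, len(d)):
        # Refit on the first i points and predict the next one
        res = ThetaModel(d.iloc[:i], period=period).fit()
        preds.append(res.forecast(1).iloc[0])
    return pd.Series(preds, index=d.index[min_obs:])

# predictions_train = np.round(np.exp(theta_insample(d)))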

How to properly regularize a np.polyval() function?

I'm attempting to recreate and plot this regularization function
based on Bishop's Pattern Recognition and Machine Learning book, but I am getting vastly different results.
I am using np.polyfit() to obtain the coefficients w and np.polyval() to evaluate the resulting polynomial at an input. I get a correctly fit model (though I'm aware of the overfitting problem) when lambda=0 (meaning no regularization has actually taken place), but I struggle to get anything even resembling the examples in the book when lambda is anything else.
Here's a snippet of the relevant code
def poly_fit(M, x, t):
    poly = np.polyfit(x, t, deg=M)
    return poly

poly = poly_fit(M, x, t)
reg = np.polyval(poly, rand_gen) + ((lamb/2) * (poly @ poly))
where rand_gen is a vector with random variables, and in my three attempts I set lambda as lamb=np.exp(-np.inf), lamb=np.exp(-18), and lamb=np.exp(0)
This is what I'm attempting to recreate, compared with what I get for ln(lambda) = -infinity, ln(lambda) = -18, and ln(lambda) = 0 (plots omitted).
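For comparison, in Bishop's setup the penalty lambda enters the fit itself rather than being added to the evaluated polynomial: the regularized least-squares coefficients solve (X^T X + lambda*I) w = X^T t, where X is the Vandermonde matrix of the inputs. A minimal sketch of that fit (the function name poly_fit_reg is made up for illustration):
import numpy as np

def poly_fit_reg(M, x, t, lamb):
    # Vandermonde matrix with columns x^M, ..., x^0 (same ordering as np.polyfit)
    X = np.vander(x, M + 1)
    # Regularized (ridge) least squares: w = (X^T X + lamb*I)^{-1} X^T t
    w = np.linalg.solve(X.T @ X + lamb * np.eye(M + 1), X.T @ t)
    return w

# w = poly_fit_reg(M, x, t, lamb=np.exp(-18))
# y_fit = np.polyval(w, rand_gen)   # evaluate only; no penalty term added here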

Python curve fitting using MLE and obtaining standard errors for parameter estimates

The original problem
While translating MATLAB code to python, I have the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike that returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be coupled to the confidence intervals as well, however my statistics background is not strong enough to understand if this is true.
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stackoverflow for 2 days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the generalized pareto fit, the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats: While scipy.stats provides good fitting performance, it does not offer a way to calculate confidence intervals on the parameter estimates of the fitted distribution.
scipy.optimize.curve_fit: As an alternative, I have seen it suggested to use scipy.optimize.curve_fit, since it does provide the estimated covariance of the fitted parameters. However, that fitting method uses least squares, whereas I need to use MLE, and I did not see a way to make curve_fit use MLE. Therefore it seems that I cannot use curve_fit.
statsmodels.GenericLikelihoodModel: Next I found a suggestion to use statsmodels' GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Data contains 24 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale)
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()

res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping: Lastly I found mentions of bootstrapping. While I broadly understand the idea behind it, I have no clue how to implement it. What I understand is that you resample N times (1000, for example) from your data set (24 points in my case), do a fit on each sub-sample, and record the fit results. Then you do a statistical analysis on the N results, i.e. calculate the mean, standard deviation and then a confidence interval, as in Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
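A minimal sketch of that resampling idea, under the same assumptions as the code above (loc fixed at 0 via floc=0; N=1000 resamples and percentile intervals are arbitrary choices, and bootstrap_ci is not a library function):
import numpy as np
from scipy.stats import genpareto

def bootstrap_ci(data, N=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    fits = np.empty((N, 2))
    for i in range(N):
        # Resample the data with replacement and refit with loc fixed at 0
        resample = rng.choice(data, size=len(data), replace=True)
        c, loc, scale = genpareto.fit(resample, floc=0)
        fits[i] = (c, scale)
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    # Percentile confidence intervals; columns correspond to (shape, scale)
    return np.percentile(fits, [lo, hi], axis=0)

# parmci = bootstrap_ci(data)   # rows: lower/upper bound, columns: shape/scale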
However I need my code to run fast, and I am not sure if any implementation that I make will do this calculation fast.
Conclusions: Does anyone know of a Python function that calculates this efficiently, or can point me to a topic where this has already been explained in a way that works for my case?
I had the same issue with GenericLikelihoodModel and came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers), which suggests using different starting parameter values to get a result with a positive-definite Hessian. That solved my problem.
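For reference, a hedged sketch of how this can look with the code above (this is not the linked post's exact fix): if the log-likelihood is written over only the two free parameters, shape and scale, with loc fixed at 0, the Hessian no longer involves an unused parameter, and the fitted result exposes standard errors and confidence intervals, i.e. the analogue of MATLAB's parmci. The name Genpareto2 is purely illustrative; data and params are reused from the code above.
class Genpareto2(GenericLikelihoodModel):
    def loglike(self, params):
        shape, scale = params        # loc is fixed at 0
        return genpareto.logpdf(self.endog, shape, 0, scale).sum()

res2 = Genpareto2(data).fit(start_params=[params[0], params[2]])
res2.df_model = 2
res2.df_resid = len(data) - res2.df_model
print(res2.bse)                      # asymptotic standard errors
print(res2.conf_int(alpha=0.05))     # 95% confidence intervals (parmci analogue)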

One-sample Cramer-von Mises test with unknown parameters in python

I am looking for a one-sample Cramer-Von Mises test for a normal distribution with unknown parameters in python.
I found some discussion here
https://github.com/chrisb83/scipy/commit/9274d22fc1ca7ce40596b01322be84c81352899d
but this does not seem to be released?
There is also this:
https://pypi.org/project/scikit-gof/
but these tests only work for fully specified distributions (i.e. known parameters).
Is anyone aware of a CVM-test implementation in python for a normal dist with unknown parameters?
Thanks
The test is done on the sample. Here is an example using OpenTURNS in Python.
import openturns as ot
First, let's build a random sample of size 200 from a standard Normal distribution.
You may have your own data:
sample = ot.Sample([0, 0.3, -0.1, 0.23, -0.5], 1)
but OpenTURNS offers a simple way to build samples:
sample = ot.Normal().getSample(200)
Now, to execute the Cramer-von Mises normality test, you just call this method:
test_result = ot.NormalityTest.CramerVonMisesNormal(sample)
Then print the result
print('Component is normal?', test_result.getBinaryQualityMeasure(),
'p-value=%.6g' % test_result.getPValue(),
'threshold=%.6g' % test_result.getThreshold())
>>> Component is normal? True
p-value=0.624469
threshold=0.01
But always remember that the threshold is arbitrary and that the test can give a false negative even though the sample really comes from a Normal distribution.
If you want to test a sample coming from a Uniform distribution, replace the sample line with: sample = ot.Uniform().getSample(200)
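Another option, if a newer SciPy is available: scipy.stats.cramervonmises (the implementation referenced in the question's GitHub link, included in SciPy 1.6 and later) handles only fully specified distributions, but combined with a parametric bootstrap it covers the estimated-parameter case in a Lilliefors-style way. A sketch under assumed choices (B=1000 replications; the helper name cvm_normal_unknown_params is made up here):
import numpy as np
from scipy import stats

def cvm_normal_unknown_params(x, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(x), np.std(x, ddof=1)
    # Observed CvM statistic with parameters estimated from the data
    stat = stats.cramervonmises(x, 'norm', args=(mu, sigma)).statistic
    boot = np.empty(B)
    for b in range(B):
        xb = rng.normal(mu, sigma, size=len(x))
        mub, sigmab = np.mean(xb), np.std(xb, ddof=1)   # re-estimate each time
        boot[b] = stats.cramervonmises(xb, 'norm', args=(mub, sigmab)).statistic
    # Monte-Carlo p-value: fraction of simulated statistics at least as extreme
    return stat, np.mean(boot >= stat)

# stat, pval = cvm_normal_unknown_params(my_data)   # my_data: 1-D array of observations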

Goodness of fit test for Weibull distribution in python

I have some data that I have to test to see if it comes from a Weibull distribution with unknown parameters. In R I could use https://cran.r-project.org/web/packages/KScorrect/index.html but I can't find anything in Python.
Using scipy.stats I can fit parameters with:
scipy.stats.weibull_min.fit(values)
However, in order to turn this into a test I think I need to perform some Monte-Carlo simulation (e.g. https://en.m.wikipedia.org/wiki/Lilliefors_test), but I am not sure what to do exactly.
How can I make such a test in Python?
The Lilliefors test is implemented in OpenTURNS. To do this, all you have to do is use the factory that corresponds to the distribution you want to fit.
In the following script, I simulate a Weibull sample of size 10 and perform the Kolmogorov-Smirnov test using a sampling size equal to 1000. This means that the KS statistic is simulated 1000 times.
import openturns as ot
sample = ot.WeibullMin().getSample(10)
ot.ResourceMap.SetAsUnsignedInteger("FittingTest-KolmogorovSamplingSize", 1000)
distributionFactory = ot.WeibullMinFactory()
dist, result = ot.FittingTest.Kolmogorov(sample, distributionFactory, 0.01)
print('Conclusion=', result.getBinaryQualityMeasure())
print('P-value=', result.getPValue())
More details can be found at:
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_test.html
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_distribution.html
One way around: estimate distribution parameters, draw data from the estimated distribution and run KS test to check that both samples come from the same distribution.
Let's create some "original" data:
>>> values = scipy.stats.weibull_min.rvs( 0.33, size=1000)
Now,
>>> args = scipy.stats.weibull_min.fit(values)
>>> print(args)
(0.32176317627928856, 1.249788665927261e-09, 0.9268793667654682)
>>> scipy.stats.kstest(values, 'weibull_min', args=args, N=100000)
KstestResult(statistic=0.033808945722737016, pvalue=0.19877935361964738)
The last line is equivalent to:
scipy.stats.ks_2samp(values, scipy.stats.weibull_min.rvs(*args, size=100000))
So, once you estimate the parameters of the distribution, you can test it pretty reliably. But the scipy estimator is not very good; it took me several runs to get even "close" to the original distribution.
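The Monte-Carlo simulation alluded to in the question (the Lilliefors idea) can also be sketched directly in scipy: because the Weibull parameters are estimated from the same data being tested, the KS null distribution should be simulated with the parameters re-estimated on each simulated sample. A hedged sketch with assumed choices (B=1000 replications; ks_weibull_estimated is not a library function):
import numpy as np
import scipy.stats

def ks_weibull_estimated(values, B=1000, seed=0):
    np.random.seed(seed)
    args = scipy.stats.weibull_min.fit(values)
    # Observed KS statistic against the fitted Weibull
    stat = scipy.stats.kstest(values, 'weibull_min', args=args).statistic
    boot = np.empty(B)
    for b in range(B):
        sim = scipy.stats.weibull_min.rvs(*args, size=len(values))
        args_b = scipy.stats.weibull_min.fit(sim)       # re-fit on simulated data
        boot[b] = scipy.stats.kstest(sim, 'weibull_min', args=args_b).statistic
    # Monte-Carlo p-value: fraction of simulated statistics at least as extreme
    return stat, np.mean(boot >= stat)

# stat, pval = ks_weibull_estimated(values)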
