Goodness of fit test for Weibull distribution in python

I have some data that I have to test to see if it comes from a Weibull distribution with unknown parameters. In R I could use https://cran.r-project.org/web/packages/KScorrect/index.html but I can't find anything in Python.
Using scipy.stats I can fit parameters with:
scipy.stats.weibull_min.fit(values)
However, in order to turn this into a test I believe I need to perform some Monte Carlo simulation (e.g. as in the Lilliefors test, https://en.m.wikipedia.org/wiki/Lilliefors_test), but I am not sure exactly how to do this.
How can I make such a test in Python?

The Lilliefors test is implemented in OpenTURNS. To perform it, all you have to do is use the Factory that corresponds to the distribution you want to fit.
In the following script, I simulate a Weibull sample of size 10 and perform the Kolmogorov-Smirnov test using a sampling size equal to 1000. This means that the KS statistic is simulated 1000 times.
import openturns as ot
# simulate a sample of size 10 from a standard WeibullMin distribution
sample = ot.WeibullMin().getSample(10)
# number of Monte Carlo replications used to simulate the KS statistic
ot.ResourceMap.SetAsUnsignedInteger("FittingTest-KolmogorovSamplingSize", 1000)
# the factory estimates the unknown parameters from the sample
distributionFactory = ot.WeibullMinFactory()
dist, result = ot.FittingTest.Kolmogorov(sample, distributionFactory, 0.01)
print('Conclusion=', result.getBinaryQualityMeasure())
print('P-value=', result.getPValue())
More details can be found at:
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_test.html
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_distribution.html

One way around this: estimate the distribution parameters, draw data from the estimated distribution, and run a KS test to check that both samples come from the same distribution.
Let's create some "original" data:
>>> values = scipy.stats.weibull_min.rvs(0.33, size=1000)
Now,
>>> args = scipy.stats.weibull_min.fit(values)
>>> print(args)
(0.32176317627928856, 1.249788665927261e-09, 0.9268793667654682)
>>> scipy.stats.kstest(values, 'weibull_min', args=args, N=100000)
KstestResult(statistic=0.033808945722737016, pvalue=0.19877935361964738)
The last line is approximately equivalent to:
scipy.stats.ks_2samp(values, scipy.stats.weibull_min.rvs(*args, size=100000))
So, once you estimate the parameters of the distribution, you can test it pretty reliably. Note, however, that the scipy estimator is not very good; it took me several runs to get even "close" to the original distribution.
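If you want a Lilliefors-style test that accounts for the parameters being estimated from the same data, a parametric bootstrap is easy to write with scipy alone. Below is a minimal sketch, not a polished implementation: fixing the location at 0 and using 1000 replications are assumptions you may want to change.
import numpy as np
from scipy import stats
def weibull_lilliefors_test(values, n_boot=1000, seed=0):
    # Parametric-bootstrap (Lilliefors-style) KS test for weibull_min
    # with unknown shape and scale (location fixed at 0).
    rng = np.random.default_rng(seed)
    n = len(values)
    # fit the observed data and compute the observed KS statistic
    c, loc, scale = stats.weibull_min.fit(values, floc=0)
    d_obs = stats.kstest(values, 'weibull_min', args=(c, loc, scale)).statistic
    # re-simulate from the fitted distribution, re-fit, and recompute the statistic
    d_boot = np.empty(n_boot)
    for i in range(n_boot):
        sim = stats.weibull_min.rvs(c, loc=loc, scale=scale, size=n, random_state=rng)
        c_i, loc_i, scale_i = stats.weibull_min.fit(sim, floc=0)
        d_boot[i] = stats.kstest(sim, 'weibull_min', args=(c_i, loc_i, scale_i)).statistic
    # p-value: fraction of simulated statistics at least as large as the observed one
    p_value = np.mean(d_boot >= d_obs)
    return d_obs, p_value
Something like d, p = weibull_lilliefors_test(values) then gives a p-value that already reflects the estimation step; a small p-value argues against the Weibull hypothesis.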

Related

Python curve fitting using MLE and obtaining standard errors for parameter estimates

The original problem
While translating MATLAB code to Python, I have the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike that returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be linked to the confidence intervals as well, but my statistics background is not strong enough to tell whether that is true.
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stackoverflow for 2 days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the generalized Pareto fit; the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats: While scipy.stats provides nice fitting performance, this library does not offer a way to calculate confidence intervals on the parameter estimates of the distribution fitter.
scipy.optimize.curve_fit: As an alternative I have seen it suggested to use scipy.optimize.curve_fit instead, as this does provide the estimated covariance of the fitted parameters. However, that fitting method uses least squares, whereas I need to use MLE, and I did not see a way to make curve_fit use MLE. Therefore it seems I cannot use curve_fit.
statsmodels GenericLikelihoodModel: Next I found a suggestion to use statsmodels' GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Data contains 24 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale)
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()

res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping: Lastly I found mentions of bootstrapping. While I sort of understand the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000, for example) from your data set (24 points in my case), do a fit on each sub-sample, and record the fit result. Then do a statistical analysis on the N results, i.e. calculate the mean, std_dev and then the confidence interval, as in Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
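Something like the following rough sketch is what I have in mind (not optimized; the 1000 resamples and the percentile intervals are arbitrary choices on my part):
import numpy as np
from scipy.stats import genpareto
def bootstrap_ci(data, n_boot=1000, alpha=0.05, seed=0):
    # percentile-bootstrap confidence intervals for the genpareto
    # shape and scale parameters, with the location fixed at 0
    rng = np.random.default_rng(seed)
    estimates = np.empty((n_boot, 2))
    for i in range(n_boot):
        resample = rng.choice(data, size=len(data), replace=True)
        c, loc, scale = genpareto.fit(resample, floc=0)
        estimates[i] = (c, scale)
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return np.column_stack([lower, upper])  # rows: (shape, scale)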
However, I need my code to run fast, and I am not sure whether any implementation I write myself will do this calculation quickly enough.
Conclusions: Does anyone know of a Python function that calculates this efficiently, or can someone point me to a topic where this has already been explained in a way that works for my case?
I had the same issue with GenericLikelihoodModel and came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers), which suggests using different starting parameter values to get a result with a positive-definite Hessian. That solved my problem.
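To expand on that a little: once the Hessian inversion succeeds, the standard errors and confidence intervals come directly from the statsmodels fit result. Below is a sketch that frees only the shape and scale (so the fixed loc=0 does not make the Hessian singular); the two-parameter reparametrisation and the class name are my own, not the original code, and it reuses the data array from the question.
import numpy as np
from scipy.stats import genpareto
from statsmodels.base.model import GenericLikelihoodModel
class GenparetoFixedLoc(GenericLikelihoodModel):
    # GPD with loc fixed at 0; the free parameters are (shape, scale)
    def loglike(self, params):
        shape, scale = params
        return genpareto.logpdf(self.endog, shape, loc=0, scale=scale).sum()
# start from the scipy point estimates (the fixed loc is dropped)
c0, _, scale0 = genpareto.fit(data, floc=0)
res = GenparetoFixedLoc(data).fit(start_params=np.array([c0, scale0]))
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.bse)         # asymptotic standard errors from the inverse Hessian
print(res.conf_int())  # 95% confidence intervals, analogous to parmci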

Generating gamma random samples in python

Can someone help me generate random numbers from the gamma distribution in Python? I have tried these two possibilities, but I am still wondering about the main difference between them:
The first one is :
shape, scale= 0.5,1
size=(1024,10)
np.random.gamma(shape, scale, size)
and the second one is :
from scipy.stats import gamma
gamma.rvs(0.5, 1, (1024,10))
I think both of them are used to generate random samples following the gamma distribution, so what is the difference between these syntaxes? When should we use the first method and when the second one?
There is no difference between the two except for the fact that one comes from numpy and the other from the scipy library. The probability density function used to create the gamma distribution is the same in both cases.
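One caveat, though: the positional arguments do not line up. In scipy.stats.gamma.rvs the second and third positional arguments are loc and scale, and the output shape is the keyword size, so gamma.rvs(0.5, 1, (1024, 10)) is not the same call as np.random.gamma(0.5, 1, (1024, 10)). A small sketch of calls that should actually be equivalent:
import numpy as np
from scipy.stats import gamma
shape, scale, size = 0.5, 1.0, (1024, 10)
samples_np = np.random.gamma(shape, scale, size)       # numpy: (shape, scale, size)
samples_sp = gamma.rvs(shape, scale=scale, size=size)  # scipy: loc defaults to 0
# both draw from the same Gamma(shape, scale) distribution; only the RNG machinery differs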

One-sample Cramer-VonMises test with unknown parameters in python

I am looking for a one-sample Cramer-Von Mises test for a normal distribution with unknown parameters in python.
I found some discussion here
https://github.com/chrisb83/scipy/commit/9274d22fc1ca7ce40596b01322be84c81352899d
but this does not seem to be released?
There is also this:
https://pypi.org/project/scikit-gof/
but these tests only work for fully specified distributions (i.e. known parameters).
Is anyone aware of a CVM-test implementation in python for a normal dist with unknown parameters?
Thanks
The test is done on the sample. Here is an example using OpenTURNS in Python.
import openturns as ot
First, let's build a random sample of size 200 from a standard Normal distribution.
You may have your own data:
sample = ot.Sample([0, 0.3, -0.1, 0.23, -0.5], 1)
but OpenTURNS offers a simple way to build samples:
sample = ot.Normal().getSample(200)
Now, to execute the Cramer-von Mises normality test, you just call this method:
test_result = ot.NormalityTest.CramerVonMisesNormal(sample)
Then print the result
print('Component is normal?', test_result.getBinaryQualityMeasure(),
'p-value=%.6g' % test_result.getPValue(),
'threshold=%.6g' % test_result.getThreshold())
>>> Component is normal? True
p-value=0.624469
threshold=0.01
But always remember that the threshold is arbitrary and that the test can give a false negative even though the sample really comes from a Normal distribution.
If you want to test a sample coming from a Uniform distribution, replace the line 'sample' by: sample = ot.Uniform().getSample(200)

Sampling from posterior distribution of custom density in PyMC3

I am trying to get the predictive distribution from my model, which uses a custom-defined probability, a mixture of Normals.
with RE2:
    trace = pm.variational.sample_vp(v_params, draws=5000)
trigger.set_value(triggers_test)
cc.set_value(cc_test)
y_output.set_value(np.zeros((len(y_test),)))
ppc = pm.sample_ppc(trace, model=RE2, samples=2000)
y_pred = ppc['R'].mean(axis=0)[:,None]
However, I get the error: AttributeError: 'DensityDist' object has no attribute 'random'. Is there a way to sample from the distribution? I am able to get the trace, and I can play around with this a bit, but I'm hoping that there is something better.
If it helps:
R = pm.DensityDist('R', logp_nmix(mus, stds, pi), observed=y_output)
I was able to get the posterior predictive properly (i.e. pm.sample_ppc working) when pm.DensityDist was applied to a latent variable rather than an observed variable.
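One workaround that is sometimes suggested for PyMC3 is to give the DensityDist an explicit random callable so that sample_ppc has something to draw from. The sketch below assumes the mixture parameters mus, stds and pi are plain arrays of numbers (not free random variables); the helper name mixture_random is hypothetical:
import numpy as np
import pymc3 as pm
def mixture_random(point=None, size=None):
    # draw from the Normal mixture defined by pi, mus, stds
    size = size or len(y_output.get_value())
    comps = np.random.choice(len(pi), size=size, p=pi)  # pick a component per draw
    return np.random.normal(np.asarray(mus)[comps], np.asarray(stds)[comps])
# i.e. define R with a random method attached
with RE2:
    R = pm.DensityDist('R', logp_nmix(mus, stds, pi),
                       random=mixture_random, observed=y_output)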

Is there a Python equivalent to the smooth.spline function in R

The smooth.spline function in R allows a tradeoff between roughness (as defined by the integrated square of the second derivative) and fitting the points (as defined by the sum of the squared residuals). This tradeoff is controlled by the spar or df parameter. At one extreme you get the least-squares line, and at the other you get a very wiggly curve that passes through all of the data points (or the mean if you have duplicated x values with different y values).
I have looked at scipy.interpolate.UnivariateSpline and other spline variants in Python; however, they seem to only trade off by increasing the number of knots and setting a threshold (called s) for the allowed sum of squared residuals. By contrast, smooth.spline in R allows having knots at all the x values, without necessarily having a wiggly curve that hits all the points -- the penalty comes from the second derivative.
Does Python have a spline fitting mechanism that behaves in this way? Allowing all knots but penalizing the second derivative?
You can use R functions in Python with rpy2:
import numpy as np
import matplotlib.pyplot as plt
import rpy2.robjects as robjects
r_y = robjects.FloatVector(y_train)
r_x = robjects.FloatVector(x_train)
r_smooth_spline = robjects.r['smooth.spline']  # extract the R function
# run the smoothing function
spline1 = r_smooth_spline(x=r_x, y=r_y, spar=0.7)
ySpline = np.array(robjects.r['predict'](spline1, robjects.FloatVector(x_smooth)).rx2('y'))
plt.plot(x_smooth, ySpline)
If you want to set lambda directly: spline1 = r_smooth_spline(x=r_x, y=r_y, lambda=42) doesn't work, because lambda is a reserved keyword in Python, but there is a solution: How to use the lambda argument of smooth.spline in RPy WITHOUT Python interprating it as lambda.
To get the code running you first need to define the data x_train and y_train, and you can define x_smooth=np.array(np.linspace(-3,5,1920)) if you want to plot the spline between -3 and 5 at Full-HD resolution.
Note that this code is not fully compatible with Jupyter notebooks for the latest versions of rpy2. You can fix this by using !pip install -Iv rpy2==3.4.2, as described in "NotImplementedError: Conversion 'rpy2py' not defined for objects of type '<class 'rpy2.rinterface.SexpClosure'>' only after I run the code twice".
I've been looking for exactly the same thing, but would rather not have to translate the code to Python. The Splinter package seems like an option, however: https://github.com/bgrimstad/splinter
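If a recent SciPy is available, scipy.interpolate.make_smoothing_spline (added in SciPy 1.10) may also be worth a look: it fits a cubic smoothing spline with a penalty on the second derivative, much like smooth.spline. A sketch, where the toy data and the default GCV-chosen smoothing parameter are assumptions of mine:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_smoothing_spline  # requires SciPy >= 1.10
# toy data standing in for x_train / y_train
rng = np.random.default_rng(0)
x_train = np.linspace(-3, 5, 200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=x_train.size)
# lam=None lets SciPy pick the smoothing parameter by generalized cross-validation;
# pass an explicit lam to trade roughness against fit, like lambda in smooth.spline
spl = make_smoothing_spline(x_train, y_train, lam=None)
x_smooth = np.linspace(-3, 5, 1920)
plt.plot(x_train, y_train, '.', alpha=0.4)
plt.plot(x_smooth, spl(x_smooth))
plt.show()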
