I am looking for a one-sample Cramer-Von Mises test for a normal distribution with unknown parameters in python.
I found some discussion here
https://github.com/chrisb83/scipy/commit/9274d22fc1ca7ce40596b01322be84c81352899d
but this does not seem to be released?
There is also this:
https://pypi.org/project/scikit-gof/
but these tests only work for fully specified distributions (i.e. known parameters).
Is anyone aware of a CVM-test implementation in python for a normal dist with unknown parameters?
Thanks
The test is done on the sample itself. Here is an example using OpenTURNS in Python.
import openturns as ot
First, let's build a random sample of size 200 from a standard Normal distribution.
You may have your own data:
sample = ot.Sample([0, 0.3, -0.1, 0.23, -0.5], 1)
but OpenTURNS offers a simple way to build samples:
sample = ot.Normal().getSample(200)
Now, to execute the Cramér-von Mises normality test, you just call this method:
test_result = ot.NormalityTest.CramerVonMisesNormal(sample)
Then print the result
print('Component is normal?', test_result.getBinaryQualityMeasure(),
'p-value=%.6g' % test_result.getPValue(),
'threshold=%.6g' % test_result.getThreshold())
Component is normal? True p-value=0.624469 threshold=0.01
But always remember that the threshold is arbitrary and that the test can reject normality (a false negative) even though the sample really comes from a Normal distribution.
If you want to test a sample coming from a Uniform distribution instead, replace the line defining 'sample' with: sample = ot.Uniform().getSample(200)
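If your data is a plain Python list or numpy array rather than an OpenTURNS Sample, one simple way to wrap it before running the test is to build the Sample from a list of one-element lists; my_data below is just a placeholder for your own 1-D data.

import random
import openturns as ot

my_data = [random.gauss(0.0, 1.0) for _ in range(100)]   # placeholder for your own 1-D data
sample = ot.Sample([[x] for x in my_data])               # one row per value, one column
test_result = ot.NormalityTest.CramerVonMisesNormal(sample)
print(test_result.getPValue())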
The original problem
While translating MATLAB code to Python, I have the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike that returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be coupled to the confidence intervals as well; however, my statistics background is not strong enough to understand if this is true.
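If I understand it correctly, the connection would be a Wald-type interval like the sketch below, where parmhat and acov are placeholders for the MLEs and the inverse Fisher information; this is my own reading and I am not sure it is what gpfit does internally.

import numpy as np
from scipy.stats import norm

alpha = 0.05                                  # for 95% intervals
parmhat = np.array([0.1, 1.0])                # placeholder MLEs (shape, scale)
acov = np.array([[0.02, 0.0], [0.0, 0.05]])   # placeholder inverse Fisher information

half_width = norm.ppf(1 - alpha / 2) * np.sqrt(np.diag(acov))
parmci = np.column_stack([parmhat - half_width, parmhat + half_width])
print(parmci)                                 # one row of (lower, upper) per parameter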
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stackoverflow for 2 days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the generalized pareto fit, the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats: While scipy.stats provides nice fitting performance, this library does not offer a way to calculate confidence intervals on the parameter estimates of the fitted distribution.
scipy.optimize.curve_fit: As an alternative, I have seen it suggested to use scipy.optimize.curve_fit instead, as this does provide the estimated covariance of the parameter estimates. However, that fitting method uses least squares, whereas I need to use MLE, and I didn't see a way to make curve_fit use MLE instead. Therefore it seems that I cannot use curve_fit.
statsmodels.GenericLikelihoodModel: Next I found a suggestion to use statsmodels' GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Experimentally obtained data
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale)
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()


res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping: Lastly, I found mentions of bootstrapping. While I sort of understand the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000 for example) from your data set (24 points in my case), do a fit on each sub-sample, and register the fit results. Then you do a statistical analysis on the N results, i.e. calculate the mean, std_dev and then the confidence interval, like Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
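To make that concrete, the kind of procedure I have in mind looks like the sketch below. It is untested; the function name, the percentile intervals and the 1000 resamples are my own choices, and the location is fixed at 0 as in my fit.

import numpy as np
from scipy.stats import genpareto

def bootstrap_ci(data, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap intervals for (shape, scale) of a GPD with loc fixed at 0
    rng = np.random.default_rng(seed)
    estimates = np.empty((n_boot, 2))
    for i in range(n_boot):
        resample = rng.choice(data, size=len(data), replace=True)
        c, loc, scale = genpareto.fit(resample, floc=0)
        estimates[i] = (c, scale)
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return lower, upper   # each is an array of (shape, scale)

lower, upper = bootstrap_ci(data)   # 'data' is the array shown above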
However, I need my code to run fast, and I am not sure whether any implementation I write myself will do this calculation quickly enough.
Conclusion: Does anyone know of a Python function that calculates this in an efficient manner, or can you point me to a topic where this has already been explained in a way that works for my case?
I had the same issue with GenericLikelihoodModel, and I came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers), which suggests using different starting parameter values to get a result with a positive definite Hessian. That solved my problem.
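For illustration, with the Genpareto class and data from the question above, the suggestion amounts to something like the sketch below; the alternative start values are arbitrary guesses, and whether the Hessian actually becomes invertible depends on the data.

# Same model as in the question, but started from hand-picked values
# instead of the scipy.stats estimates.
alt_start = [0.05, 0.0, 0.9]   # arbitrary (shape, loc, scale) guesses
res = Genpareto(data).fit(start_params=alt_start)
print(res.summary())           # the std err column is filled in once the Hessian inverts cleanly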
Can someone help me generate random numbers from the gamma distribution in Python? I have tried these two possibilities but I'm still wondering about the main difference between them.
The first one is:
import numpy as np

shape, scale = 0.5, 1
size = (1024, 10)
np.random.gamma(shape, scale, size)
and the second one is:
from scipy.stats import gamma
gamma.rvs(0.5, 1, (1024,10))
I think both of them are used to generate random samples following the gamma distribution, so what's the difference between these syntaxes? When should we use the first method and when the second one?
There is no difference between the two except for the fact that one is from the numpy library and the other from scipy. The probability density function used to generate the gamma samples is the same in both cases.
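For example, assuming the goal is a (1024, 10) array of draws with shape 0.5 and scale 1, the two calls below sample from the same distribution; note that in the scipy version the size has to be passed as a keyword, because the positional arguments after the shape parameter are loc and scale.

import numpy as np
from scipy.stats import gamma

shape, scale = 0.5, 1
size = (1024, 10)

samples_np = np.random.gamma(shape, scale, size)        # numpy
samples_sp = gamma.rvs(shape, scale=scale, size=size)   # scipy, same distribution

print(samples_np.shape, samples_sp.shape)               # (1024, 10) (1024, 10)

The scipy version is convenient when you also need the pdf, cdf or fitting methods of the same distribution object; the numpy version is a little lighter if you only need the samples.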
I have some data that I have to test to see if it comes from a Weibull distribution with unknown parameters. In R I could use https://cran.r-project.org/web/packages/KScorrect/index.html but I can't find anything in Python.
Using scipy.stats I can fit parameters with:
scipy.stats.weibull_min.fit(values)
However, in order to turn this into a test I think I need to perform some Monte Carlo simulation (e.g. https://en.m.wikipedia.org/wiki/Lilliefors_test), but I am not sure what to do exactly.
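To be explicit, the kind of procedure I imagine is the sketch below, where the parameters are re-estimated on every simulated sample and the observed KS statistic is compared against the simulated ones; the function name and the 1000 simulations are my own choices, and I am not sure this is statistically correct.

import numpy as np
from scipy import stats

def lilliefors_weibull(data, n_sim=1000, seed=0):
    # Monte-Carlo KS test for a Weibull whose parameters are estimated from the data
    rng = np.random.default_rng(seed)
    params = stats.weibull_min.fit(data)
    d_obs = stats.kstest(data, 'weibull_min', args=params).statistic
    d_sim = np.empty(n_sim)
    for i in range(n_sim):
        fake = stats.weibull_min.rvs(*params, size=len(data), random_state=rng)
        refit = stats.weibull_min.fit(fake)
        d_sim[i] = stats.kstest(fake, 'weibull_min', args=refit).statistic
    p_value = np.mean(d_sim >= d_obs)
    return d_obs, p_value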
How can I make such a test in Python?
The Lilliefors test is implemented in OpenTURNS. To use it, all you have to do is use the Factory which corresponds to the distribution you want to fit.
In the following script, I simulate a Weibull sample of size 10 and perform the Kolmogorov-Smirnov test using a sampling size equal to 1000. This means that the KS statistic is simulated 1000 times.
import openturns as ot
sample = ot.WeibullMin().getSample(10)
ot.ResourceMap.SetAsUnsignedInteger("FittingTest-KolmogorovSamplingSize", 1000)
distributionFactory = ot.WeibullMinFactory()
dist, result = ot.FittingTest.Kolmogorov(sample, distributionFactory, 0.01)
print('Conclusion=', result.getBinaryQualityMeasure())
print('P-value=', result.getPValue())
More details can be found at:
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_test.html
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_distribution.html
One way around it: estimate the distribution parameters, draw data from the estimated distribution, and run a KS test to check that both samples come from the same distribution.
Let's create some "original" data:
>>> import scipy.stats
>>> values = scipy.stats.weibull_min.rvs(0.33, size=1000)
Now,
>>> args = scipy.stats.weibull_min.fit(values)
>>> print(args)
(0.32176317627928856, 1.249788665927261e-09, 0.9268793667654682)
>>> scipy.stats.kstest(values, 'weibull_min', args=args, N=100000)
KstestResult(statistic=0.033808945722737016, pvalue=0.19877935361964738)
The last line is roughly equivalent to:
scipy.stats.ks_2samp(values, scipy.stats.weibull_min.rvs(*args, size=100000))
So, once you estimate the parameters of the distribution, you can test it pretty reliably. But the scipy estimator is not very good; it took me several runs to get even "close" to the original distribution.
I am trying to get the predictive distribution from my model, which uses a custom-defined probability, namely a mixture of Normals.
with RE2:
    trace = pm.variational.sample_vp(v_params, draws=5000)

trigger.set_value(triggers_test)
cc.set_value(cc_test)
y_output.set_value(np.zeros((len(y_test),)))

ppc = pm.sample_ppc(trace, model=RE2, samples=2000)
y_pred = ppc['R'].mean(axis=0)[:, None]
However, I get the error: AttributeError: 'DensityDist' object has no attribute 'random'. Is there a way to sample from the distribution? I am able to get the trace, and I can play around with this a bit, but I'm hoping that there is something better.
If it helps:
R = pm.DensityDist('R', logp_nmix(mus, stds, pi), observed=y_output)
I was able to get the posterior predictive samples properly (i.e. pm.sample_ppc working) when pm.DensityDist was applied to a latent variable rather than an observed variable.
I am trying to estimate a few parameters using constrained maximum likelihood in R, more specifically with constrOptim() from the stats package. I am programming in Python and using R via RPy2.
In my model, I am assuming that the data follow a Beta distribution, so I created a simulated dataset using prespecified values for the parameters, and now I am trying to estimate these parameters in order to verify that my estimation program works fine.
What I have observed is that my estimation is quite sensitive to the initial parameters. For example, I have 11 parameters to estimate (let's call them pam1..pam11) and their true values are:
pam1=0.2 pam2=0.3 pam3=0.4 pam4=0.7 pam5=0.55 pam6=0.45 pam7=0.1 pam8=0.01 pam9=0.01 pam10=45 pam11=45
In the constrOptim() I am setting the starting parameters as:
start_param = FloatVector((pam1, pam2, pam3, pam4, pam5, pam6, pam7, pam8, pam9, pam10, pam11))
where I set the starting values. I have observed that when I use different sets of starting values, the results change. For example, when I use the set
start_param = FloatVector((0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.3, 0.011, 0.011, 15, 15))
and I obtain the following estimates
$par
[1]  0.20851065  0.30348571  0.43616932  0.73695654  0.58287221  0.45541506
[7]  0.11191879  0.02233908  0.01988878 46.57249043 45.48544918
$value
[1] -215.9711
$convergence
[1] 0
but when I use another set, for example:
start_param = FloatVector((0.2, 0.3, 0.4, 0.75, 0.55, 0.45, 0.3, 0.05, 0.05, 59, 59))
the results change and it seems that I lose convergence:
$par
[1] 0.17218738 0.27165359 0.48458978 0.80295773 0.62618983 0.43254786
[7] 0.12426385 0.02991442 0.01853252 57.78269692 59.35376216
$value
[1] -146.9858
$convergence
[1] 1
My question is the following:
I have seen that in Stata there is an option that searches for better starting values for the numerical optimization algorithm. I tried to provide multiple starting values by passing a matrix, but this did not work.
Is there an option in constrOptim that will allow me to do something like this?
Many thanks in advance.
For additional information, the specification I use for the constrOptim() is:
res = statsr.constrOptim(start_param, Rmaxlikelihood, grad='NULL', ui=ui, ci=ci, method="Nelder-Mead", control=list("maxit=3000,trace=F"))
I came across a function in R which does exactly what I was looking for.
The package 'Rsolnp' has the function "gosolnp", which is described as performing random initialization and multiple restarts of the solnp solver.
It is quite efficient and the documentation provides examples on how to use it.
More: http://cran.r-project.org/web/packages/Rsolnp/Rsolnp.pdf
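Since the question drives R through RPy2, a rough sketch of how such a call could be wired up from Python is shown below. The bounds are placeholders, Rmaxlikelihood stands for the R objective function already used with constrOptim, and the keyword names (n.restarts, n.sim, mapped to underscores by RPy2) should be checked against the Rsolnp documentation; any inequality constraints (the ui/ci pair in constrOptim) would have to be expressed through gosolnp's ineqfun/ineqLB/ineqUB arguments.

from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import FloatVector

rsolnp = importr('Rsolnp')

# Placeholder box constraints for the 11 parameters (not values taken from the question)
lb = FloatVector([0.0] * 9 + [1.0, 1.0])
ub = FloatVector([1.0] * 9 + [100.0, 100.0])

# gosolnp draws n_sim random starting points inside [LB, UB], keeps the best
# n_restarts of them and runs the solnp solver from each one.
res = rsolnp.gosolnp(fun=Rmaxlikelihood, LB=lb, UB=ub, n_restarts=5, n_sim=2000)
print(res.rx2('pars'))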