I'm using a certain StatsModels distribution (Azzalini's Skew Student-t) and I'd like to perform a (one-sample) Kolmogorov-Smirnov test with it.
Is it possible to use Scipy's kstest with a StatsModels distribution? Scipy's documentation (rather vaguely) suggests that the cdf argument may be a string or a callable, with no further details or examples about the latter.
On the other hand, the StatsModels' distribution I'm using has many of the methods that Scipy distributions do; thus, I'm supposing there is some way of using it as a callable argument passed to kstest. Am I wrong?
Here is what I have so far. What I'd like to achieve is commented out in the last line:
import statsmodels.sandbox.distributions.extras as azt
import scipy.stats as stats
x = ([-0.2833379 , -3.05224565, 0.13236267, -0.24549146, -1.75106484,
0.95375723, 0.28628686, 0. , -3.82529261, -0.26714159,
1.07142857, 2.56183746, -1.89491817, -0.3414301 , 1.11589663,
-0.74540174, -0.60470106, -1.93307821, 1.56093656, 1.28078818])
# This is how kstest works.
print(stats.kstest(x, stats.norm.cdf))  # (0.21003262911224113, 0.29814145956367311)
# This is Statsmodels' distribution I'm using. It has a cdf function as well.
ast = azt.ACSkewT_gen()
# This is what I'd want. Executing this will throw a TypeError because ast.cdf
# needs some shape parameters etc.
# print(stats.kstest(x, ast.cdf))
Note: I'll happily use a two-sample KS test if what I'm asking for is not possible; I just wanted to know whether it is.
Those functions were written a long time ago with scipy compatibility in mind, but scipy has changed in several ways since then.
kstest has an args keyword for the distribution parameters.
To get the distribution parameters we can try to estimate them by using the fit method of the scipy.stats distributions. However, estimating all parameters prints some warnings and the estimated df parameter is large. If we fix df at specific values we get estimates without warnings that we can use in the call of kstest.
>>> ast.fit(x)
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\integrate\quadpack.py:352: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\integrate\quadpack.py:352: IntegrationWarning: The integral is probably divergent, or slowly convergent.
warnings.warn(msg, IntegrationWarning)
(31834.800527154337, -2.3475921468088172, 1.3720725621594987, 2.2766515091760722)
>>> p = ast.fit(x, f0=100)
>>> print(stats.kstest(x, ast.cdf, args=p))
(0.13897385693057401, 0.83458552699682509)
>>> p = ast.fit(x, f0=5)
>>> print(stats.kstest(x, ast.cdf, args=p))
(0.097960232618178544, 0.990756154198281)
However, the null distribution of the Kolmogorov-Smirnov test statistic assumes that the distribution parameters are fixed, not estimated. If we estimate the parameters as above, then the p-value is not correct, since it is not based on the right distribution.
For some distributions we can use tables for the kstest with estimated mean and scale parameters, e.g. the Lilliefors test kstest_normal in statsmodels. If shape parameters are estimated as well, then the distribution of the KS test statistic depends on the parameters of the model, and we could get the p-value from bootstrapping.
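As a rough sketch (not part of the original answer) of that bootstrap idea: simulate from the fitted distribution, re-estimate the parameters on each simulated sample, and compare the observed KS statistic with the bootstrap distribution of the statistic. The helper name ks_pvalue_bootstrap and the defaults (n_boot, the fixed df) are illustrative, and each fit involves numerical integration, so this can be quite slow.
import numpy as np
import scipy.stats as stats
import statsmodels.sandbox.distributions.extras as azt

ast = azt.ACSkewT_gen()  # as above

def ks_pvalue_bootstrap(x, n_boot=200, df_fixed=5):
    """Illustrative parametric bootstrap for the KS p-value with estimated parameters."""
    x = np.asarray(x)
    n = len(x)
    # Fit with df held fixed (as above) and compute the observed KS statistic.
    p_hat = ast.fit(x, f0=df_fixed)
    d_obs = stats.kstest(x, ast.cdf, args=p_hat)[0]
    d_boot = []
    for _ in range(n_boot):
        # Simulate from the fitted distribution, re-fit, and recompute the statistic.
        xb = ast.rvs(*p_hat, size=n)
        pb = ast.fit(xb, f0=df_fixed)
        d_boot.append(stats.kstest(xb, ast.cdf, args=pb)[0])
    # Bootstrap p-value: fraction of simulated statistics at least as large as the observed one.
    return np.mean(np.array(d_boot) >= d_obs)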
(I don't remember anything about estimating the parameters of the SkewT distribution and whether maximum likelihood estimation has any specific problems.)
Related
I am looking for help with implementing a logit model in statsmodels for binary variables.
Here is my code:
(I am using the feature selection methods MinimumRedundancyMaximumRelevance and RecursiveFeatureElimination, available in Python.)
from statsmodels.discrete.discrete_model import Logit

for i_mrmr in range(4, 20):
    for i_rfe in range(3, i_mrmr):
        regressors_step1 = ...  # features selected with the MRMR method
        regressors_step2 = ...  # features selected from regressors_step1 with the RFE method
        for method in ['newton', 'nm', 'bfgs', 'lbfgs', 'powell', 'cg', 'ncg']:
            logit_model = Logit(y, X.loc[:, regressors_step2])
            try:
                result = logit_model.fit(method=method, cov_type='HC1')
                print(result.summary())
            except Exception:
                result = "error"
I am using Logit from statsmodels.discrete.discrete_model.
The y variable, the target, is binary.
All explanatory variables in X are binary too.
The logit model is "functioning" for the different optimization methods; that is to say, I end up with a summary to print. Nonetheless, various warnings are printed, such as: "Maximum Likelihood optimization failed to converge."
The optimization methods offered by statsmodels are the ones from scipy:
‘newton’ for Newton-Raphson, ‘nm’ for Nelder-Mead
‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
‘lbfgs’ for limited-memory BFGS with optional box constraints
‘powell’ for modified Powell’s method
‘cg’ for conjugate gradient
‘ncg’ for Newton-conjugate gradient
These methods can be found in scipy.optimize.
Here are my questions:
I did not find any argument anywhere against using these optimization methods with a binary set of variables. But, because of the warnings, I am asking myself whether it is correct to do so. And if so, which method is the most appropriate in this case?
Here: Scipy minimize: how to restrict x only to 0 and 1? it is implicitly suggested that a Python MIP (Mixed-Integer Linear Programming) model could be better suited to the binary-variables case. In the documentation of the Python MIP package, it appears that to implement this kind of model I should explicitly give a function to minimize or maximize and also express the constraints... (see: https://docs.python-mip.com/en/latest/quickstart.html#creating-models)
Therefore I am wondering: do I need to define a logit function as the objective? What constraints should I express? Is there an easier way to do this?
In my program, I am applying a Box-Cox transform to my data, and I would like to reverse the transformation at a certain step in my experiment. However, I noticed there are two variants of boxcox:
scipy.special.boxcox
scipy.stats.boxcox
I learned that the first option has a companion function that reverses the Box-Cox transform.
However, I just want to know why the lambda parameter cannot be None in scipy.special while it can in scipy.stats. In my code I am actually using scipy.stats with lambda set to None. Now, if I want to switch to scipy.special in order to use its reverse function, what should I set lambda to?
Here is my current code:
elif self.output_box:
    # boxcox here is scipy.stats.boxcox; with lmbda=None it also returns the estimated lambda
    y_train, self.y_train_lambda_ = boxcox(y_train)
    y_test, self.y_test_lambda_ = boxcox(y_test)
Both use the same formula for the transformation, so the only difference seems to be that with scipy.stats you can have the optimal lambda calculated from the data. If you call scipy.stats.boxcox with lmbda=None it returns two values: the transformed array and the lambda that maximizes the log-likelihood function (and, if alpha is not None, also the confidence interval for lambda). Therefore, that's the lambda you have to use with the inverse transformation.
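A minimal sketch of that round trip, assuming strictly positive data (the array names here are illustrative): the lambda returned by scipy.stats.boxcox is the one to pass to scipy.special.inv_boxcox.
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

y = np.random.exponential(size=100) + 0.1   # illustrative, strictly positive data
y_trans, lmbda = boxcox(y)                  # lmbda=None by default, so lambda is estimated by MLE
y_back = inv_boxcox(y_trans, lmbda)         # undo the transform with the estimated lambda
print(np.allclose(y_back, y))               # True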
My understanding is that both should give the quantile corresponding to the lower-tail probability. However, I get different results.
e.g., qgeom(0.99, 0.5) gives 6 in R, whereas geom.ppf(0.99, 0.5) gives 7 in Python.
tl;dr: the pmfs of the geometric distribution are defined differently in R and SciPy.
First off, it's good to confirm that quantiles calculated in R and Python generally agree, for example for the normal distribution:
# Python
from scipy.stats import norm
norm.ppf(0.99)
# 2.3263478740408408

# R
qnorm(0.99)
# [1] 2.326348
For the case of the geometric distribution, the quantile functions differ because the probability mass functions (pmf) are different. In R, the pmf of the geometric distribution is defined as p(1 - p)^k (see help("Geometric")); in Python's SciPy module the geometric distribution is defined as p(1 - p)^(k-1) (see scipy.stats.geom).
You can find a summary of key quantities for both definitions in the Wikipedia article. In essence, the ^k definition is "used for modeling the number of failures until the first success", whereas the ^(k-1) definition relates to "the probability that the kth trial (out of k trials) is the first success".
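As a small check (not part of the original answer), the two definitions differ only by a shift of one, so SciPy reproduces R's value if you shift the support with loc=-1:
from scipy.stats import geom

print(geom.ppf(0.99, 0.5))           # 7.0 -- number of the trial on which the first success occurs
print(geom.ppf(0.99, 0.5, loc=-1))   # 6.0 -- number of failures before the first success, as in R's qgeom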
See also: Which geometric distribution to use?
The function scipy.stats.linregress automatically calculates the standard error of the fitted slope. How do I get the standard error of the fitted intercept?
One alternative would be to use the pyfinance.ols module, which has separate standard error attributes for the intercept (alpha) and other coefficients. Disclosure: I wrote this module. It was uploaded to PyPI recently for easier install.
A quick example:
import numpy as np
from pyfinance.ols import OLS
x = np.random.randn(50)
y = np.random.beta(1, 2, 50)
model = OLS(y=y, x=x)
model.se_alpha
# 0.029413047270740914
Under the hood, the class adds a column vector of ones, and the alpha/intercept term is just an ordinary coefficient on that column. Unlike with statsmodels and sklearn, .fit() is effectively called at class instantiation.
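To illustrate what that means, here is a plain-NumPy sketch (not the module's actual code): append a column of ones, run ordinary least squares, and the intercept's standard error is the square root of the corresponding diagonal entry of the coefficient covariance matrix.
import numpy as np

x = np.random.randn(50)
y = np.random.beta(1, 2, 50)

X = np.column_stack([np.ones_like(x), x])        # intercept column plus regressor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS coefficients
resid = y - X @ beta
n, k = X.shape
sigma2 = resid @ resid / (n - k)                 # residual variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficients
se_alpha = np.sqrt(cov_beta[0, 0])               # standard error of the intercept
print(se_alpha)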
I'm using the differential_evolution algorithm in scipy to fit some data with various exponential functions convolved with Gaussian functions. This in itself is not a problem; the function fits the data well.
However, it does not return the Jacobian in the result dictionary (which I would like to use to calculate the errors on my fit constants), even though I have set "polish" (i.e. use scipy.optimize.minimize with the L-BFGS-B method to polish the best population member at the end) to True, in which case the documentation says it should include the Jacobian. My function takes the Gaussian width and any number of exponents, and is being fit like so:
from scipy.optimize import differential_evolution

result = differential_evolution(exponentialfit, bounds, args=(avgspectra, c, fitfrom, errors, numcomponents, 1), tol=1e-12, disp=True, polish=True)
Is there any reason it is not giving the jacobian in the result output?