I am using statsmodels.formula.api.quantreg() for quantile regression in Python. I see that when fitting the quantile regression model, there is an option to specify the significance level for the confidence intervals of the regression coefficients, and the confidence intervals appear in the summary of the fit.
What statistical method is being used to generate confidence intervals about the regression coefficients? It does not appear to be documented and I've dug through the source code for quantile_regression.py and summary.py to find this with no luck. Can anyone shed some light on this?
Inference for parameters is the same across models and is mostly inherited from the base classes.
Quantile regression has a model specific covariance matrix of the parameters.
tvalues, pvalues, confidence intervals, t_test and wald_test are all based on the assumption that the estimated parameters are asymptotically normally distributed with the given covariance, and are "generic".
Linear models like OLS and WLS, and optionally some other models, can use the t and F distributions instead of the normal and chi-square distributions for Wald-test based inference.
Specifically, conf_int is defined in statsmodels.base.model.LikelihoodModelResults.
partial correction:
QuantReg uses the t and F distributions for inference, since it is currently treated as a linear regression model, rather than the normal and chi-square distributions used by the related M-estimators, RLM, in statsmodels.robust.
Most models now have a use_t option to choose the inference distributions, but it has not been added to QuantReg yet.
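A minimal sketch of how those confidence intervals can be reproduced by hand (the DataFrame df and its columns y and x are hypothetical): QuantReg builds a t-based Wald interval from the parameter estimates, the model-specific standard errors and the residual degrees of freedom.

import statsmodels.formula.api as smf
from scipy import stats

res = smf.quantreg("y ~ x", df).fit(q=0.5)
alpha = 0.05  # the significance level passed to summary/conf_int
t_crit = stats.t.ppf(1 - alpha / 2, res.df_resid)
lower = res.params - t_crit * res.bse  # bse comes from the QuantReg covariance of the parameters
upper = res.params + t_crit * res.bse
# lower/upper should agree with res.conf_int(alpha=alpha)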
Related
I would like to fit a multiple linear regression. It will have a couple of input parameters and no intercept.
Is it possible to fit a GLM with statsmodels and then use that fit to simulate a distribution of predicted values?
In the upcoming 0.14 release, there is a get_distribution method, based on the predicted distribution parameters, that returns an instance of a scipy or scipy-compatible distribution class. The methods of those classes include rvs for simulating data. (The Tweedie family is excluded, since no distribution class exists for it yet.)
Some of this is already available in earlier versions.
For some distribution families, like gaussian, poisson and binomial, the parameters are directly available. For other distributions, like negative binomial or gamma, there is a conversion from the GLM parameterization to the standard distribution parameterization.
For example, for the gaussian family, the dependent variable is normally distributed with mean given by results.fittedvalues or results.predict(x) and variance given by results.scale,
using from scipy import stats and import numpy as np:

# scale is the estimated variance; var_weights default to ones
scale_n = scale / var_weights
return stats.norm(loc=mu, scale=np.sqrt(scale_n))
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/genmod/families/family.py#L675
Note:
scale in scipy.stats distributions corresponds to the standard deviation, while it corresponds to the variance in GLM and OLS.
var_weights are like the weights in WLS, for the case when the variance differs across observations (heteroscedasticity). By default they are just ones, as in OLS, and can be ignored. (var_weights are currently only implemented for GLM.)
For non-gaussian families, var_weights act as a multiplicative factor on top of the inherent heteroscedasticity, where the variance is a function of the mean.
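Putting the pieces together, a small self-contained sketch (with made-up data) of simulating predicted values from a fitted gaussian GLM; in 0.14+ the get_distribution method should give an equivalent frozen distribution.

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(123)
x = sm.add_constant(rng.normal(size=(200, 2)))
y = x @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.7, size=200)

res = sm.GLM(y, x, family=sm.families.Gaussian()).fit()

mu = res.predict(x)                          # predicted means, same as res.fittedvalues here
sd = np.sqrt(res.scale)                      # res.scale is the variance estimate, so take the square root
dist = stats.norm(loc=mu, scale=sd)          # one normal distribution per observation
simulated = dist.rvs(size=(1000, len(mu)))   # 1000 simulated values for each observation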
I am looking for a Python package to work on a probabilistic graphical model (PGM) with categorical and continuous variables. Some nodes in the PGM are latent variables. The conditional probabilities are defined by continuous functions such as the Beta distribution and the logistic function. I would like to estimate the parameters of these probability distributions from data using expectation maximization (EM).
I checked pgmpy, and it seems to let you model conditional probabilities with Gaussian functions and some other continuous forms. But as far as I understand, pgmpy only allows sampling data points from such PGMs. I am not sure whether pgmpy can estimate the parameters (say, the mean and std of the Gaussian distribution) in such cases.
Is there a Python package that can:
define the PGM structure by defining the nodes and edges
associate a conditional probability function with each edge, in the form of a Beta distribution or similar function
use a built-in inference algorithm (such as EM) to learn the parameters (say alpha, beta of the Beta distributions) of the conditional probabilities
Can anyone let me know what method is used to estimate the parameters of a fractional logit model in the statsmodels package for Python?
And can anyone point me to the specific part of the source code for the fractional logit model?
I assume fractional Logit in the question refers to using the Logit model to obtain quasi-maximum likelihood estimates for continuous data within the interval (0, 1) or [0, 1].
Models in statsmodels like GLM and GEE, and Logit, Probit, Poisson and similar models in statsmodels.discrete, do not impose an integer condition on the response or endogenous variable. So those models can be used for fractional or positive continuous data.
The parameter estimates are consistent if the mean function is correctly specified. However, the covariance of the parameter estimates is not correct under quasi-maximum likelihood. The sandwich covariance is available with the fit argument cov_type='HC0'. Also available are robust sandwich covariance matrices for cluster robust, panel robust or autocorrelation robust cases.
e.g.
result = sm.Logit(y, x).fit(cov_type='HC0')
Given that the likelihood is not assumed to be correctly specified, the reported statistics based on the resulting maximized log-likelihood, i.e. llf, ll_null and the likelihood ratio tests, are not valid.
The only exceptions are multinomial (logit) models, which might impose the integer constraint on the response variable, and might or might not work with compositional data. (The support for compositional data with QMLE is still an open question because there are computational advantages to supporting only the standard cases.)
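As a fuller sketch (with made-up fractional data), the QMLE fit with a robust sandwich covariance looks like this:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(500, 2)))
# fractional response in [0, 1], not integer counts
y = np.clip(1 / (1 + np.exp(-(x @ np.array([0.2, 0.8, -0.5])))) + rng.normal(scale=0.05, size=500), 0, 1)

res = sm.Logit(y, x).fit(cov_type='HC0')  # QMLE point estimates with HC0 sandwich standard errors
print(res.summary())  # ignore llf, ll_null and the LR test here; they are not valid under QMLE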
I'd like to run a logistic regression on a dataset with a 0.5% positive class, re-balancing the dataset through class or sample weights. I can do this in scikit-learn, but it doesn't provide any of the inferential statistics for the model (confidence intervals, p-values, residual analysis).
Is this possible to do in statsmodels? I don't see a sample_weights or class_weights argument in statsmodels.discrete.discrete_model.Logit.fit
Thank you!
programmer's answer:
statsmodels Logit and other discrete models don't have weights yet. (*)
GLM Binomial has implicitly defined case weights through the number of successful and unsuccessful trials per observation. It would also allow manipulating the weights through the GLM variance function, but that is not officially supported and tested yet.
Update: statsmodels Logit still does not have weights, but GLM gained var_weights and freq_weights several statsmodels releases ago. GLM with the Binomial family can be used to estimate a Logit or a Probit model.
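A minimal sketch (with made-up data and an arbitrary weight of 20 for the positive class) of using GLM Binomial with freq_weights to mimic a rebalanced logit; whether the resulting inference is meaningful is the statistician's question below.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(2000, 2)))
y = rng.binomial(1, 0.05, size=2000)   # rare positive class, purely for illustration

# upweight the rare positive class so the weighted classes are roughly balanced
w = np.where(y == 1, 20, 1)

res = sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
print(res.summary())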
statistician's/econometrician's answer:
Inference, standard errors, confidence intervals, tests and so on, are based on having a random sample. If weights are manipulated, then this should affect the inferential statistics.
However, I have never looked at the problem of rebalancing the data based on the observed response. In general, this creates a selection bias. A quick internet search shows several answers, ranging from rebalancing having no positive effect in Logit to penalized estimation as an alternative.
One possibility is to also try a different link function: cloglog and some other link functions have asymmetric or heavier tails, which are more appropriate for data with a small risk in one class or category.
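For example, reusing y and X from the sketch above, a binomial GLM with a cloglog link can be requested like this (recent statsmodels versions expose the link class as CLogLog; older versions use the lowercase cloglog alias):

res_cll = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.CLogLog())).fit()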
(*) One problem with implementing weights is to decide what their interpretation is for inference. Stata, for example, allows for 3 kinds of weights.
Given time-series data, I want to find the best fitting logarithmic curve. What are good libraries for doing this in either Python or SQL?
Edit: Specifically, what I'm looking for is a library that can fit data resembling a sigmoid function, with upper and lower horizontal asymptotes.
If your data were categorical, then you could use a logistic regression to fit the probabilities of belonging to a class (classification).
However, I understand you are trying to fit the data to a sigmoid curve, which means you just want to minimize the mean squared error of the fit.
I would point you to the SciPy function scipy.optimize.leastsq: it is used to perform least-squares fits.
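A self-contained sketch (with synthetic, noisy sigmoid-shaped data) of fitting a four-parameter sigmoid with lower and upper asymptotes using scipy.optimize.leastsq:

import numpy as np
from scipy.optimize import leastsq

# synthetic time series resembling a sigmoid between two horizontal asymptotes
t = np.linspace(0, 10, 50)
y = 2.0 + 3.0 / (1.0 + np.exp(-1.5 * (t - 5.0))) + np.random.default_rng(0).normal(scale=0.1, size=t.size)

def sigmoid(params, t):
    lower, upper, rate, midpoint = params
    return lower + (upper - lower) / (1.0 + np.exp(-rate * (t - midpoint)))

def residuals(params, t, y):
    return y - sigmoid(params, t)

p0 = [y.min(), y.max(), 1.0, np.median(t)]   # rough starting values taken from the data
params_fit, ier = leastsq(residuals, p0, args=(t, y))

scipy.optimize.curve_fit is a convenience wrapper around the same least-squares machinery and works just as well for this kind of fit.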