I'd like to run a logistic regression on a dataset with 0.5% positive class by re-balancing the dataset through class or sample weights. I can do this in scikit-learn, but it doesn't provide any of the inferential stats for the model (confidence intervals, p-values, residual analysis).
Is this possible to do in statsmodels? I don't see a sample_weights or class_weights argument in statsmodels.discrete.discrete_model.Logit.fit.
Thank you!
programmer's answer:
statsmodels Logit and other discrete models don't have weights yet. (*)
GLM Binomial has implicitly defined case weights through the number of successful and unsuccessful trials per observation. It would also allow manipulating the weights through the GLM variance function, but that is not officially supported and tested yet.
Update: statsmodels Logit still does not have weights, but GLM gained var_weights and freq_weights several statsmodels releases ago. GLM with a Binomial family can be used to estimate a Logit or a Probit model.
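For illustration, a minimal sketch of estimating a weighted "Logit" via GLM with made-up data and made-up re-balancing weights (whether manipulated weights give valid inference is exactly what the next answer discusses):

import numpy as np
import statsmodels.api as sm

# synthetic example: rare positive class, two predictors plus a constant
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(5000, 2)))
y = (rng.uniform(size=5000) < 0.005).astype(float)

# hypothetical re-balancing weights: up-weight the rare positives
w = np.where(y == 1, 50, 1)

# Binomial family with its default logit link reproduces a Logit model
res = sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
print(res.summary())   # coefficients, standard errors, p-values, confidence intervals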
statistician's/econometrician's answer:
Inferential statistics (standard errors, confidence intervals, tests and so on) are based on having a random sample. If the weights are manipulated, then this should affect those statistics.
However, I have never looked at the problem of rebalancing the data based on the observed response. In general, this creates selection bias. A quick internet search turns up several answers, ranging from "rebalancing has no positive effect in Logit" to penalized estimation as an alternative.
One possibility is also to try a different link function: cloglog and some other links have asymmetric or heavier tails that are more appropriate for data with a small risk in one class or category.
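A hedged sketch of that idea, again on made-up data (the link class is spelled links.cloglog() in older statsmodels releases):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(5000, 2)))
y = (rng.uniform(size=5000) < 0.005).astype(float)   # rare positive class

# Binomial GLM with a complementary log-log link instead of the default logit
link = sm.families.links.CLogLog()   # links.cloglog() in older statsmodels versions
res = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
print(res.summary())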
(*) One problem with implementing weights is to decide what their interpretation is for inference. Stata, for example, allows for 3 kinds of weights.
I have a neural network which maps a set of 4 floating-point input parameters to a set of 10 floating point outputs trained on a dataset of ~300 points. The points themselves are intrinsically multi-modal, and there are some sparse areas in the training set that I don't currently have any good way to gather data for (although in real-world deployment they will eventually be encountered).
The model trained as expected (the test-split loss decreased steadily during training, and the errors are all within acceptable levels), so I believe the model is mapping the variables well against each other. However, I'm concerned about how well the model generalizes in areas where it doesn't have training data.
So I'm looking to add an additional output to the model to provide a "closeness" estimate to the training points. My current implementation is to use scipy to calculate a Gaussian KDE from the 4 parameters and then check a point's closeness to the training space based on that. Then, in deployment, I return a warning/error if the inputs are too far from the space the model was trained on. This works okay, but I have to pass the entire test "X" set around with the model, which is a little inconvenient and kludgy.
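Roughly what I'm doing now, as a minimal sketch with placeholder names and a made-up threshold:

import numpy as np
from scipy.stats import gaussian_kde

X_train = np.random.rand(300, 4)       # placeholder for the ~300 x 4 training inputs
kde = gaussian_kde(X_train.T)          # scipy expects shape (n_dims, n_samples)

DENSITY_THRESHOLD = 1e-3               # made-up cutoff, tuned by eye on the training data

def check_inputs(x):
    # warn if a query point falls in a low-density region of the training inputs
    density = kde(np.asarray(x).reshape(4, 1))[0]
    if density < DENSITY_THRESHOLD:
        print("warning: input is far from the space the model was trained on")
    return density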
Is there a way to embed this closeness estimate in the model itself? Or is there a more formalized way to handle this (e.g., to give a "confidence" estimate in the model output)?
I think you want to take another look at the loss value of your model. At its core, a loss function measures how well your prediction model does at predicting the expected outcome (or value). We convert the learning problem into an optimization problem, define a loss function, and then optimize the algorithm to minimize that loss function.
The loss value gives you this closeness estimate between the data and the model output.
This loss value can be accessed in the history object returned while fitting the tensorflow model.
>>> # assumes numpy is imported as np and `model` is a compiled Keras model
>>> history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
...                     epochs=10, verbose=1)
>>> print(history.history.keys())
dict_keys(['loss'])
If you want a closeness estimate between two different models, you need the KL divergence (Kullback–Leibler divergence), also called relative entropy.
From Wikipedia:
KL divergence is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P.
In the simple case, a relative entropy of 0 indicates that the two distributions in question have identical quantities of information.
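For reference, for discrete distributions P and Q the divergence is
$$ D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, $$
which is non-negative and equals zero exactly when P and Q assign the same probabilities.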
KLDivergence is used to distill the knowledge of a teacher model into a student model and to check whether both of them carry identical quantities of information.
Trivia: knowledge distillation is used to compress a deep learning model with only a small compromise in quality.
KLDivergence can be directly imported in tensorflow as follows:
from tensorflow.keras.losses import KLDivergence
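A short usage sketch with two toy distributions (the numbers are made up):

import tensorflow as tf
from tensorflow.keras.losses import KLDivergence

# two toy probability distributions over three classes (each row sums to 1)
p = tf.constant([[0.7, 0.2, 0.1]])   # "true" / teacher distribution
q = tf.constant([[0.5, 0.3, 0.2]])   # "model" / student distribution

kl = KLDivergence()
print(float(kl(p, q)))   # KL(p || q); 0 only if the two distributions match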
You can check a full-fledged KD implementation using Keras at https://keras.io/examples/vision/knowledge_distillation/.
Context:
In Gaussian process (GP) regression we can use two approaches:
(I) Fit the kernel parameters via maximum likelihood (maximize the data likelihood) and use the GP defined by these parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters and now either
(IIa) fit the kernel parameters by maximizing the posterior kernel-parameter likelihood (MAP parameters) and use the GP defined by the MAP parameters for prediction, or
(IIb) (the full Bayesian approach) predict using the mixture model which integrates over all the GPs defined by the admissible kernel parameters, weighted by the posterior distribution of the kernel parameters.
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on the kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here, where it is stated that "Kernels are parameterized by a vector of hyperparameters". This must be interpreted as a sort of indirect parameterization via conditioning on the data, as the hyperparameters do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
The confusion comes from the fact that a Gaussian process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting it to the finite data dimensions is some sort of "marginalization". This is also confusing, since in general you have marginalization only in the Bayesian approach, where you have a joint distribution of data and parameters, so you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian process is the model, and the kernel parameters are the model parameters; in scikit-learn there are no hyperparameters since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters; and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I) and not (II).
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In your question, that is approach (I).
I would note two more things though:
In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (II)), but you must provide the inference method.
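To make approach (I) concrete, here is a minimal sketch on toy data:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# toy 1-D data
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

# kernel parameters ("hyperparameters" in the docs) after maximum-likelihood optimization
print(gpr.kernel_)

# the same objective can be evaluated externally, e.g. from an MCMC sampler,
# if you want to build the fully Bayesian approach yourself
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))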
For my university project I'm given a dataset in which almost all features have a very weak correlation with the target (only one feature has a moderate correlation). Its distribution is not normal either. I already tried a simple linear regression, which underfit; then I applied a plain random forest regressor, which overfit; and when I tuned the random forest with RandomizedSearchCV it took far too long. Is there any way to get a decent model out of a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (min_samples_split), or similarly the minimum number of samples a leaf should have (min_samples_leaf).
Try increasing the number of trees in the RF. A short sketch combining these settings follows below.
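As promised, a hedged sketch of these settings on synthetic data; every number is a made-up starting point, not a recommendation:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# settings that constrain each tree to fight overfitting
rf = RandomForestRegressor(
    n_estimators=500,        # more trees: averaging reduces variance
    max_depth=6,             # shallower trees
    min_samples_split=10,    # a node needs at least 10 samples to split
    min_samples_leaf=5,      # every leaf keeps at least 5 samples
    random_state=0,
).fit(X_train, y_train)

print(rf.score(X_train, y_train), rf.score(X_test, y_test))  # compare train vs. test fit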
Underfitting on linear regression
Add more parameters. If you have variables a, b, ... etc., adding their polynomial features, i.e. a^2, a^3, ... b^2, b^3, ... etc., may help. If you add enough polynomial features you should be able to overfit, although that doesn't necessarily mean the fit on the train set (RMSE value) will be good. See the sketch after this list.
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing your regression (or if the software is applying regularization automatically), try reducing the regularization parameter.
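And here is the polynomial-features idea sketched on synthetic data (the degree, data, and split are all made up):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# expand a, b, ... into a^2, a*b, b^2, ... before the linear fit
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))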
Software:
I am using Python 3.7.3 and the statsmodels library.
Problem:
I am trying to model a continuous target with a zero-inflated distribution (zeros are around 12.2% of the target variable). The target variable is annual income in US$. When I fit a generalized linear model using a Gamma distribution, the model can't capture the zero values, and the minimum predicted value is around US$5,000.
I did some quick research and read about zero-inflated GLMs and hurdle models, but statsmodels, as far as I know, doesn't support those kinds of models. So, my question is: is there a Python library for that? If not, what can I do?
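To make the idea concrete, here is a hand-rolled two-part (hurdle-style) sketch of what I have been reading about, with made-up data and column names:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# made-up data: ~12% exact zeros, otherwise Gamma-distributed income
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 65, 1000)})
df["income"] = np.where(rng.uniform(size=1000) < 0.12, 0.0,
                        rng.gamma(shape=2.0, scale=(500 * df["age"] / 40).to_numpy()))

# part 1: logistic model for P(income > 0)
df["positive"] = (df["income"] > 0).astype(int)
part1 = smf.logit("positive ~ age", data=df).fit()

# part 2: Gamma GLM fitted on the strictly positive incomes only
gamma_log = sm.families.Gamma(link=sm.families.links.Log())   # links.log() in older releases
part2 = smf.glm("income ~ age", data=df[df["income"] > 0], family=gamma_log).fit()

# combined prediction: P(positive) * E[income | positive]
pred = part1.predict(df) * part2.predict(df)
print(pred.describe())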
I am using statsmodels.formula.api.quantreg() for quantile regression in Python. I see that when fitting the quantile regression model, there is an option to specify the significance level for confidence intervals of the regression coefficients, and the confidence interval result appears in the summary of the fit.
What statistical method is being used to generate confidence intervals about the regression coefficients? It does not appear to be documented and I've dug through the source code for quantile_regression.py and summary.py to find this with no luck. Can anyone shed some light on this?
Inference for parameters is the same across models and is mostly inherited from the base classes.
Quantile regression has a model specific covariance matrix of the parameters.
tvalues, pvalues, confidence intervals, t_test and wald_test are all based on the assumption of an asymptotic normal distribution of the estimated parameters with the given covariance, and are "generic".
Linear models like OLS and WLS, and optionally some other models, can use the t and F distributions instead of the normal and chi-square distributions for Wald-test-based inference.
Specifically, conf_int is defined in statsmodels.base.model.LikelihoodModelResults.
partial correction:
QuantReg uses the t and F distributions for inference, since it is currently treated as a linear regression model, and not the normal and chi-square distributions used by the related M-estimators (RLM) in statsmodels.robust.
Most models now have a use_t option to choose the inference distributions, but it hasn't been added to QuantReg.
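For illustration, a small sketch on made-up data; the reported intervals are the generic params +/- t_crit * bse described above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# made-up heteroscedastic data
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(scale=(1.0 + 0.2 * df["x"]).to_numpy())

res = smf.quantreg("y ~ x", df).fit(q=0.5)     # median regression
print(res.conf_int(alpha=0.05))                # t-based intervals around the coefficients
print(res.summary())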