Software:
I am using Python 3.7.3 and the statsmodels library.
Problem:
I am trying to model a continuous target with a zero-inflated distribution (zeros are around 12.2% of the target variable). The target variable is annual income in US$. When I fit a Generalized Linear Model using a Gamma distribution, the model can't capture the zero values, and the minimum predicted value is around US$5,000.
I did some quick research and read about zero-inflated GLMs and hurdle models, but statsmodels, as far as I know, doesn't support those kinds of models. So my question is: is there a Python library for that? If not, what can I do?
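For context, the kind of hand-rolled hurdle model I have been considering looks roughly like this (a sketch on made-up data: a logit for the zero vs. non-zero part combined with a Gamma GLM with log link on the positive incomes only; I'm not sure this is statistically sound):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))            # toy design matrix
p_pos = 1 / (1 + np.exp(-(X @ [0.5, 1.0, -0.5])))        # P(income > 0)
y = np.where(rng.random(n) < p_pos,
             rng.gamma(2.0, np.exp(X @ [9.0, 0.3, 0.1]) / 2.0),
             0.0)                                         # zero-inflated "income"

# Stage 1: logit "hurdle" for P(y > 0)
hurdle = sm.Logit((y > 0).astype(int), X).fit(disp=0)

# Stage 2: Gamma GLM with log link on the positive incomes only
pos = y > 0
gamma = sm.GLM(y[pos], X[pos],
               family=sm.families.Gamma(link=sm.families.links.log())).fit()  # links.Log() in newer statsmodels

# Combined prediction: E[y | x] = P(y > 0 | x) * E[y | y > 0, x]
expected_income = hurdle.predict(X) * gamma.predict(X)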
Related
I have a neural network that maps a set of 4 floating-point input parameters to a set of 10 floating-point outputs, trained on a dataset of ~300 points. The points themselves are intrinsically multi-modal, and there are some sparse areas in the training set that I don't currently have any good way to gather data for (although in real-world deployment they will eventually be encountered).
The model trained as expected (the test-split loss decreased uniformly during training, and the errors are all within acceptable levels), so I believe the model maps the variables to each other well. However, I'm concerned about how well 'generalized' the model is in areas where it doesn't have training data.
So I'm looking to add an additional output to the model to provide a "closeness" estimate to the training points. My current implementation uses scipy to compute a Gaussian KDE over the 4 input parameters and then checks a point's closeness to the training space based on that. Then, in deployment, I return a warning/error if the inputs are too far from the space the model was trained on. This works okay, but I have to pass the entire test "X" set around with the model, which is a little inconvenient and kludgy.
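For reference, the check I have now looks roughly like this (a simplified sketch; X_train stands in for my 4-parameter training inputs, and the 1st-percentile cutoff is an arbitrary choice):

import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the real training inputs, shape (n_samples, 4)
X_train = np.random.rand(300, 4)

# gaussian_kde expects variables in rows and samples in columns
kde = gaussian_kde(X_train.T)

# Flag anything whose density is below the 1st percentile of the
# training points' own densities (arbitrary cutoff)
threshold = np.percentile(kde(X_train.T), 1)

def check_input(x):
    """Return the KDE density of x and warn if it lies in a low-density region."""
    density = kde(np.asarray(x, dtype=float).reshape(4, 1))[0]
    if density < threshold:
        print("Warning: input is far from the training distribution.")
    return density

check_input([0.5, 0.5, 0.5, 0.5])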
Is there a way to embed this closeness estimate in the model itself? Or is there a more formalized way to handle this (e.g., to give a "confidence" estimate in the model output)?
I think you want to take another look at the loss value of your model. At its core, a loss function is a measure of how well your prediction model does at predicting the expected outcome (or value). We convert the learning problem into an optimization problem, define a loss function, and then optimize the algorithm to minimize that loss function.
The loss value gives you this closeness estimate between the data and the model output.
This loss value can be accessed in the History object returned when fitting the TensorFlow model.
>>> history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
...                     epochs=10, verbose=1)
>>> print(history.history.keys())
dict_keys(['loss'])
If you want a closeness estimate between two different models, you need the KL divergence (Kullback–Leibler divergence), also known as relative entropy.
From Wikipedia:
KL divergence is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P.
In the simple case, a relative entropy of 0 indicates that the two distributions in question have identical quantities of information.
KLDivergence is used to distill the knowledge of a teacher model into a student model and to check whether the two have identical quantities of information.
Trivia: knowledge distillation is used to compress a deep learning model with only a small compromise in quality.
KLDivergence can be imported directly from TensorFlow as follows:
from tensorflow.keras.losses import KLDivergence
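As a quick illustration (a minimal sketch, not from the Keras docs), the loss can be evaluated directly on two probability vectors:

import tensorflow as tf
from tensorflow.keras.losses import KLDivergence

# Two toy probability distributions over three classes (each row sums to 1)
p = tf.constant([[0.7, 0.2, 0.1]])  # e.g. teacher output
q = tf.constant([[0.5, 0.3, 0.2]])  # e.g. student output

kl = KLDivergence()
print(kl(p, q).numpy())  # non-negative; 0 only when the distributions match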
You can find a full-fledged knowledge-distillation implementation using Keras at https://keras.io/examples/vision/knowledge_distillation/
I want to estimate the following model using a Gamma log-link GLM and 2SLS:
'Unemployment_total~log_usd_prize_y+Exports_gs+GDP_growth+CPI+log_GDP_per_capita+Country+C(t_Year)+Income_level+Income_level:log_usd_prize_y'
The first stage is the following:
'log_usd_prize_y~log_Internet_s+log_t_count+Exports_gs+GDP_growth+CPI+log_GDP_per_capita+Country+C(t_Year)'
My log(usd_prize_y) is endogenous, and I want to estimate the model manually since Python does not have a package for estimating 2SLS with GLMs.
My working theory is that I would have to perform the first-stage estimation using OLS, but I'm very unsure, and I don't know how to estimate the first stage using a GLM, since the log-link model has a different expected-value formula.
Any insight would be helpful. Thanks
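For what it's worth, the plug-in (2SLS-style) version of the two stages I have in mind would look roughly like this with statsmodels formulas. This is only a sketch: df is a placeholder DataFrame holding the variables from my model strings, the second-stage standard errors are not corrected for the generated regressor, and a control-function variant would include the first-stage residual instead of replacing the regressor.

import statsmodels.api as sm
import statsmodels.formula.api as smf

# df is a placeholder DataFrame containing all variables used below

# Stage 1: OLS of the endogenous regressor on instruments + exogenous controls
first = smf.ols(
    'log_usd_prize_y ~ log_Internet_s + log_t_count + Exports_gs + GDP_growth'
    ' + CPI + log_GDP_per_capita + Country + C(t_Year)',
    data=df).fit()
df['log_usd_prize_y_hat'] = first.fittedvalues

# Stage 2: Gamma GLM with log link, with the endogenous regressor replaced
# by its first-stage fitted values
second = smf.glm(
    'Unemployment_total ~ log_usd_prize_y_hat + Exports_gs + GDP_growth + CPI'
    ' + log_GDP_per_capita + Country + C(t_Year) + Income_level'
    ' + Income_level:log_usd_prize_y_hat',
    data=df,
    family=sm.families.Gamma(link=sm.families.links.log())).fit()  # links.Log() in newer statsmodels
print(second.summary())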
Context:
In Gaussian process (GP) regression we can use two approaches:
(I) Fit the kernel parameters via maximum likelihood (maximize the data likelihood) and use the GP defined by these parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters, and then either
(IIa) fit the kernel parameters by maximizing the posterior kernel-parameter likelihood (MAP parameters) and use the GP defined by the MAP parameters for prediction, or
(IIb) (the full Bayesian approach) predict using the mixture model that integrates over all the GPs defined by the admissible kernel parameters, weighted by the posterior distribution of the kernel parameters.
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on the kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here, where it is stated that "Kernels are parameterized by a vector of hyperparameters". This must be interpreted as a sort of indirect parameterization via conditioning on the data, as the hyperparameters do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
the confusion comes from the fact that a Gaussian process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting it to the finite data dimensions is some sort of "marginalization".
This is also confusing since, in general, you have marginalization only in the Bayesian approach, where you have a joint distribution of data and parameters, so you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian process is the model, the kernel parameters are the model parameters; in scikit-learn there are no hyperparameters, since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I), not (II).
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In your question's terminology, that is approach (I).
I would note two more things though:
In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (ii)) but you must provide the inference method.
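To make that concrete, here is a small sketch on toy data (my own example using the standard scikit-learn API) showing that fit() optimizes the kernel parameters by maximizing the LML, and that log_marginal_likelihood(theta) is exposed for external schemes such as MCMC:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Toy 1-D regression data
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(30)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

# Kernel parameters after maximum (marginal) likelihood fitting, i.e. approach (I)
print(gpr.kernel_)                                      # optimized kernel
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))   # LML at the fitted theta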
I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats perspective of what tuning it actually does; I haven't been able to find any good ones online.
A Gaussian curve can serve as a "low-pass" filter, allowing only the samples close to its mean to "pass." In the context of naive Bayes, assuming a Gaussian distribution essentially gives more weight to the samples closer to the distribution's mean. This may or may not be appropriate, depending on whether what you want to predict follows a normal distribution.
The parameter var_smoothing artificially adds a user-defined value to the distribution's variance (whose default value is derived from the training data set). This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.
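As a quick sketch on toy data (my own illustration, not from scikit-learn's docs), you can see that increasing var_smoothing scales up epsilon_, the absolute amount added to every per-feature variance:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for vs in (1e-9, 1e-2):  # the default value vs. a much larger smoothing value
    clf = GaussianNB(var_smoothing=vs).fit(X, y)
    # epsilon_ = var_smoothing * max per-feature variance of the training data
    print(f"var_smoothing={vs:g}  epsilon_={clf.epsilon_:.6f}")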
I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In statistics, a probability density function such as the Gaussian depends on sigma^2 (the variance); and the greater the variance between two features, the less correlated they are and the better the estimator, since naive Bayes assumes the features are i.i.d. (basically, it assumes the features are independent).
However, in computational terms, it is very common in machine learning that vectors with very large or very small values, or certain floating-point operations, produce errors such as "ValueError: math domain error". This extra variable can serve as an adjustable safeguard in case such a numerical error occurs.
It would be interesting to explore whether this value can be used for further control, such as avoiding over-fitting, since this new self.epsilon_ is added to the variance (sigma^2) and therefore widens the standard deviation (sigma).
I'd like to run a logistic regression on a dataset with a 0.5% positive class by re-balancing the dataset through class or sample weights. I can do this in scikit-learn, but it doesn't provide any of the inferential stats for the model (confidence intervals, p-values, residual analysis).
Is this possible to do in statsmodels? I don't see a sample_weights or class_weights argument in statsmodels.discrete.discrete_model.Logit.fit
Thank you!
programmer's answer:
statsmodels Logit and other discrete models don't have weights yet. (*)
GLM Binomial has implicitly defined case weights through the number of successful and unsuccessful trials per observation. It would also allow manipulating the weights through the GLM variance function, but that is not officially supported and tested yet.
Update: statsmodels Logit still does not have weights, but GLM gained var_weights and freq_weights several statsmodels releases ago. GLM Binomial can be used to estimate a Logit or a Probit model.
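For example, here is a minimal sketch on toy data of the GLM Binomial route; whether freq_weights or var_weights is the right choice depends on how you want the weights to enter the inference:

import numpy as np
import statsmodels.api as sm

# Toy imbalanced data: binary y, design matrix X, per-observation weights w
rng = np.random.default_rng(0)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = (rng.random(n) < 0.05).astype(float)   # rare positive class
w = np.where(y == 1, 20.0, 1.0)            # up-weight the positives

# Weighted "Logit" via GLM Binomial (the default link is logit);
# var_weights could be used instead, which gives different standard errors
res = sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
print(res.summary())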
statistician's/econometrician's answer:
Inference, standard errors, confidence intervals, tests and so on, are based on having a random sample. If weights are manipulated, then this should affect the inferential statistics.
However, I have never looked at the problem of rebalancing the data based on the observed response. In general, this creates a selection bias. A quick internet search turns up several answers, ranging from "rebalancing doesn't have a positive effect in Logit" to penalized estimation as an alternative.
One possibility is to also try a different link function; cloglog and other link functions with asymmetric or heavier tails are more appropriate for data with a small risk in one class or category.
(*) One problem with implementing weights is to decide what their interpretation is for inference. Stata, for example, allows for 3 kinds of weights.