Statsmodels OLS Regression: Log-likelihood, uses and interpretation

Statsmodels OLS Regression: Log-likelihood, uses and interpretation - python

I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?

Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. Its used for everything.
For your first point, likelihood is not the same as the value of the parameter. Likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density for each one, and then multiplying the probability densities for all the observations together (this follows from probability theory in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and then calculate the density for each observation y at that set of parameter estimates. Then, multiply them all together. Typically, we choose to work with the log-likelihood because it's easier to calculate because instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is computationally faster. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (but penalizes on parameter numbers, because more parameters = better fit regardless).

Related

Results of sklearn/statsmodels ordinary least squares under singular covariance matrix

When computing ordinary least squares regression either using sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, they don't seem to throw any errors when covariance matrix is exactly singular. Looks like under the hood they use Moore-Penrose pseudoinverse rather than the usual inverse which would be impossible under singular covariance matrix.
The question is then twofold:
What is the point of this design? Under what circumstances it is deemed useful to compute OLS regardless of whether the covariance matrix is singular?
What does it output as coefficients then? To my understanding since the covariance matrix is singular, there would be an infinite (in a sense of a scaling constant) number of solutions via pseudoinverse.

two related questions and answers
Differences in Linear Regression in R and Python
Statsmodels with partly identified model
To 1) Under what circumstances it is deemed useful to compute OLS regardless of whether the covariance matrix is singular?
Even though some parameters are not identified and picked by an "arbitrary" unique solution out of the infinte possible solutions, some results statsitics are not affected by the non-identification, the main ones are estimable linear combinations, prediction and r-squared.
Some linear combinations of parameters are identified even if not all parameters are identified separately. For example we can still test whether all means in a oneway categorical variable are equal. These are estimable functions even under singularity and the reason statsmodels inherited pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions from a singular covariance matrix of the parameter estimate.
We get a unique prediction for any values of the explanatory variables which is still useful if the perfect collinearity persists.
Some summary and inferential statistics like Rsquared are independent of the way unique parameters are chosen. This is sometimes convenient and used, for example, in diagnostics and specification tests where LM-test can be computed from rsquared.
To 2) What does it output as coefficients then?
The parameters estimated by Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution also obtains when we have Ridge Regression and the penalization weight goes to zero. (I don't remember where I read this.)
Also, in some cases with singular design, the indeterminacy only affects some parameters. Even though we have to be careful in what we infer about those parameters, other parameters might still be identified and unaffected by the perfectly collinear part.
A software package has essentially 3 options to handle singular cases
raise an exception and refuse to compute anything
drop some variables, question is which variables to drop
switch to penalized solution including generalized inverse
statsmodels picks 3 mainly because of the symmetric treatment of variables. R and Stata pick 2 in many models (where I think it's difficult to predict which variable is lost).
One reason for symmetric treatment is that it makes it easier to compare the same regression across many datasets, which will be more difficult if not always the same variable is dropped when using case 2.

That's indeed the case. As you can see here
sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq or scipy.optimize.nnls, which in turn compute the pseudoinverse of the feature matrix via SVD decomposition (they do not exploit the Normal Equation - for which you would have the mentioned issue - as it is less efficient). Moreover, observe that each sklearn.linear_model.LinearRegression's instance returns the singular values of the feature matrix into the singular_ attribute and its rank into the rank_ attribute.
A similar argument applies to statsmodels.regression.linear_model.OLS, where the fit() method of class RegressionModel uses the following:
The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization.
(see here for reference).

I noticed the same thing, seems sklearn and statsmodel are pretty robust, a little too robust making you wondering how to interprete the results after all. Guess it is still up to the modeler to do due diligence to identify any collinearity between variables and eliminate unnecessary variables. Funny sklearn won't even give you pvalue, which is the most important measure out of these regression. When play with the variables, the coefficient will change, that is why I pay much more attention on pvalues.

Curve_fit - Set maximum allowed error for a given coefficient returned during fitting for a model function

I am using scipy.curve_fit to fit a sum of gaussian functions to histogram data taken from an experiment. I can get a decent enough looking overall fit, with an acceptable combined Root-Mean-Sqaure-Error for the fit model as a whole, but I would like to do more with the individual root mean square error on each derived coefficient as it propagates to later calculations. I am using the var_matrix returned from curve_fit to extract the RMSE for each individual parameter.
Is it possible to set a maximum error allowed per coefficient during fitting e.g. 10% of a given coefficients value. I have initial guess parameters and bounds for each coefficient, but My concern is I have 21 fitted coefficients and some of them have miniscule error while others have errors larger than the coefficient value it's self which leads me to believe the fitting is done solely based on the total error of the model, which may be "shoving" a lot of error on a given coefficient to make it easier for the other parameters.

Can someone give a good math/stats explanation as to what the parameter var_smoothing does for GaussianNB in scikit learn?

I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.

A Gaussian curve can serve as a "low pass" filter, allowing only the samples close to its mean to "pass." In the context of Naive Bayes, assuming a Gaussian distribution is essentially giving more weights to the samples closer to the distribution mean. This might or might not be appropriate depending if what you want to predict follows a normal distribution.
The variable, var_smoothing, artificially adds a user-defined value to the distribution's variance (whose default value is derived from the training data set). This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.

I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In Stats, probability distribution function such as Gaussian depends on sigma^2 (variance); and the more variance between two features the less correlational and better estimator since naive Bayes as the model used is a iid (basically, it assume the feature are independent).
However, in terms computation, it is very common in machine learning that high or low values vectors or float operations can bring some errors, such as, "ValueError: math domain error". Which this extra variable may serve its purpose as a adjustable limit in case some-type numerical error occurred.
Now, it will be interesting to explore if we can use this value for further control such as avoiding over-fitting since this new self-epsilon is added into the variance(sigma^2) or standard deviations(sigma).

Multiple-output Gaussian Process regression in scikit-learn

I am using scikit learn for Gaussian process regression (GPR) operation to predict data. My training data are as follows:
x_train = np.array([[0,0],[2,2],[3,3]]) #2-D cartesian coordinate points
y_train = np.array([[200,250, 155],[321,345,210],[417,445,851]]) #observed output from three different datasources at respective input data points (x_train)
The test points (2-D) where mean and variance/standard deviation need to be predicted are:
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()])
x_test = (np.array(positions)).T
Now, after running the GPR (GausianProcessRegressor) fit (Here, the product of ConstantKernel and RBF is used as Kernel in GaussianProcessRegressor), mean and variance/standard deviation can be predicted by following the line of code:
y_pred_test, sigma = gp.predict(x_test, return_std =True)
While printing the predicted mean (y_pred_test) and variance (sigma), I get following output printed in the console:
In the predicted values (mean), the 'nested array' with three objects inside the inner array is printed. It can be presumed that the inner arrays are the predicted mean values of each data source at each 2-D test point locations. However, the printed variance contains only a single array with 16 objects (perhaps for 16 test location points). I know that the variance provides an indication of the uncertainty of the estimation. Hence, I was expecting the predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test points? Is it due to wrong code?

Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
Continuing on the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):
Returns:
y_mean : array, shape = (n_samples, [n_output_dims])
Mean of predictive distribution a query points
y_std : array, shape = (n_samples,), optional
Standard deviation of predictive distribution at query points. Only returned when return_std is True.
y_cov : array, shape = (n_samples, n_samples), optional
Covariance of joint predictive distribution a query points. Only returned when return_cov is True.
Combined with my previous remark about the covariance matrix, the first choice would be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
most GPR implementations model only a single response variable, due to
the difficulty in the formulation of covariance function for
correlated multiple response variables, which describes not only the
correlation between data points, but also the correlation between
responses. In the paper we propose a direct formulation of the
covariance function for multi-response GPR, based on the idea that [...]
Despite the high uptake of GPR for various modelling tasks, there
still exists some outstanding issues with the GPR method. Of
particular interest in this paper is the need to model multiple
response variables. Traditionally, one response variable is treated as
a Gaussian process, and multiple responses are modelled independently
without considering their correlation. This pragmatic and
straightforward approach was taken in many applications (e.g. [7, 26,
27]), though it is not ideal. A key to modelling multi-response
Gaussian processes is the formulation of covariance function that
describes not only the correlation between data points, but also the
correlation between responses.
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Typical GPs are usually designed for single-output scenarios wherein
the output is a scalar. However, the multi-output problems have
arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t}, 1 ≤t ≤T , one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.
The study of MOGP has a long history and is known as multivariate
Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Gaussian process analysis of processes with multiple outputs is
limited by the fact that far fewer good classes of covariance
functions exist compared with the scalar (single-output) case. [...]
The difficulty of finding “good” covariance models for multiple
outputs can have important practical consequences. An incorrect
structure of the covariance matrix can significantly reduce the
efficiency of the uncertainty quantification process, as well as the
forecast efficiency in kriging inferences [16]. Therefore, we argue,
the covariance model may play an even more profound role in co-kriging
[7, 17]. This argument applies when the covariance structure is
inferred from data, as is typically the case.
Hence, my understanding, as I said, is that sckit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).

First of all, if the parameter used is "sigma", that's referring to standard deviation, not variance (recall, variance is just standard deviation squared).
It's easier to conceptualize using variance, since variance is defined as the Euclidean distance from a data point to the mean of the set.
In your case, you have a set of 2D points. If you think of these as points on a 2D plane, then the variance is just the distance from each point to the mean. The standard deviation than would be the positive root of the variance.
In this case, you have 16 test points, and 16 values of standard deviation. This makes perfect sense, since each test point has its own defined distance from the mean of the set.
If you want to compute the variance of the SET of points, you can do that by summing the variance of each point individually, dividing that by the number of points, then subtracting the mean squared. The positive root of this number will yield the standard deviation of the set.
ASIDE: this also means that if you change the set through insertion, deletion, or substitution, the standard deviation of EVERY point will change. This is because the mean will be recomputed to accommodate the new data. This iterative process is the fundamental force behind k-means clustering.

statsmodel fractional logit model

can anyone let me know what is the method of estimating the parameters in fractional logit model in statsmodel package of python?
And can anyone refer me the specific part of the source code of fractional logit model?

I assume fractional Logit in the question refers to using the Logit model to obtain the quasi-maximum likelihood for continuous data within the interval (0, 1) or [0, 1].
The discrete models in statsmodels like GLM, GEE, and Logit, Probit, Poisson and similar in statsmodels.discrete, do not impose an integer condition on the response or endogenous variable. So those models can be used for fractional or positive continuous data.
The parameter estimates are consistent if the mean function is correctly specified. However, the covariance for the parameter estimates are not correct under quasi-maximum likelihood. The sandwich covariance is available with the fit argument, cov_type='HC0'. Also available are robust sandwich covariance matrices for cluster robust, panel robust or autocorrelation robust cases.
eg.
result = sm.Logit(y, x).fit(cov_type='HC0')
Given that the likelihood is not assumed to be correctly specified, the reported statistics based on the resulting maximized log-likelihood, i.e. llf, ll_null and likelihood ratio tests are not valid.
The only exceptions are multinomial (logit) models which might impose the integer constraint on the explanatory variable, and might or might not work with compositional data. (The support for compositional data with QMLE is still an open question because there are computational advantages to only support the standard cases.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.