I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.
A Gaussian curve can serve as a "low pass" filter, allowing only the samples close to its mean to "pass." In the context of Naive Bayes, assuming a Gaussian distribution essentially gives more weight to the samples closer to the distribution mean. This may or may not be appropriate, depending on whether your data actually follow a normal distribution.
The parameter var_smoothing artificially adds a user-controlled amount to each distribution's variance (the variance itself is estimated from the training data set). The amount added is var_smoothing times the largest feature variance. This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.
I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In statistics, a probability density function such as the Gaussian depends on sigma^2 (the variance). The less correlated the features are, the better the estimator, since naive Bayes assumes that the features are independent of each other.
However, in terms of computation, it is very common in machine learning that vectors with very high or very low values, or floating-point operations on them, produce errors such as "ValueError: math domain error". This extra variable may serve as an adjustable floor on the variance in case such a numerical error occurs.
Now, it would be interesting to explore whether we can use this value for further control, such as avoiding over-fitting, since self.epsilon_ is added to the variance (sigma^2).
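To make the effect concrete, here is a minimal NumPy sketch of the epsilon computation quoted above. It is not scikit-learn's actual code path: the data are made up, the helper gaussian_log_pdf is hypothetical, and var_smoothing is set to 1e-2 (far above the 1e-9 default) only so that the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: one wide feature (std 10) and one very narrow feature (std 0.01).
X = np.column_stack([rng.normal(0.0, 10.0, 500), rng.normal(0.0, 0.01, 500)])

var_smoothing = 1e-2                      # illustrative value, not the default
variances = X.var(axis=0)
epsilon = var_smoothing * variances.max() # same formula as the quoted source line
smoothed = variances + epsilon            # epsilon is added to every feature's variance

def gaussian_log_pdf(x, mean, var):
    """Log-density of a univariate Gaussian (hypothetical helper)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

mean_narrow = X[:, 1].mean()
far_point = 1.0  # roughly 100 standard deviations from the narrow feature's mean

far_raw = gaussian_log_pdf(far_point, mean_narrow, variances[1])
far_smooth = gaussian_log_pdf(far_point, mean_narrow, smoothed[1])
near_raw = gaussian_log_pdf(mean_narrow, mean_narrow, variances[1])
near_smooth = gaussian_log_pdf(mean_narrow, mean_narrow, smoothed[1])

print("far point :", far_raw, "->", far_smooth)
print("at mean   :", near_raw, "->", near_smooth)
```

The far point becomes much less "impossible" after smoothing, while the density at the mean drops: the curve is wider and flatter, which is the widening described above.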
I have fit a logistic regression model to my data. Imagine, I have four features: 1) which condition the participant received, 2) whether the participant had any prior knowledge/background about the phenomenon tested (binary response in post-experimental questionnaire), 3) time spent on the experimental task, and 4) participant age. I am trying to predict whether participants ultimately chose option A or option B. My logistic regression outputs the following feature coefficients with clf.coef_:
[[-0.68120795 -0.19073737 -2.50511774 0.14956844]]
If option A is my positive class, does this output mean that feature 3 is the most important feature for binary classification and has a negative relationship with participants choosing option A (note: I have not normalized/re-scaled my data)? I want to ensure that my understanding of the coefficients, and the information I can extract from them, is correct so I don't make any generalizations or false assumptions in my analysis.
Thanks for your help!
You are on the right track there. If all the features are of a very similar magnitude, a larger positive or negative coefficient means a larger effect, all else being equal.
However, if your data isn't normalized, Marat is correct in that the magnitudes of the coefficients don't mean anything (without context). For instance, you could get different coefficients by changing the units of measure to be larger or smaller.
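A small arithmetic sketch of the units point, assuming (hypothetically) that the time feature from the question was recorded in minutes. Re-expressing it in seconds shrinks the coefficient by a factor of 60 without changing the model's predictions:

```python
import numpy as np

# Coefficient of "time spent" from the question; the unit (minutes) is an assumption.
coef_minutes = -2.50511774

time_minutes = np.array([0.5, 1.0, 2.0])   # made-up task durations
time_seconds = time_minutes * 60.0

# The same model expressed per second has a coefficient 60 times smaller.
coef_seconds = coef_minutes / 60.0

logit_minutes = coef_minutes * time_minutes
logit_seconds = coef_seconds * time_seconds

print(coef_minutes, coef_seconds)
print(np.allclose(logit_minutes, logit_seconds))
```

The contribution to the log odds is identical in both cases, so the raw size of a coefficient says nothing about importance unless the features share a scale.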
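I can't see whether you've included a non-zero intercept here, but keep in mind that logistic regression coefficients are in fact log odds ratios: exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature, and you need to transform the model's log odds to probabilities to get something more directly interpretable.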
Check out this page for a good explanation:
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
Logistic regression returns information in log odds. So you must first convert log odds to odds using np.exp and then take odds/(1 + odds).
To convert to probabilities, use a list comprehension and do the following:
[np.exp(x)/(1 + np.exp(x)) for x in clf.coef_[0]]
This page had an explanation in R for converting log odds that I referenced:
https://sebastiansauer.github.io/convert_logit2prob/
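As a rough sketch of both conversions, using the coefficients from the question: np.exp turns each coefficient into an odds ratio, and the logistic function turns a full linear predictor (intercept plus weighted features) into a probability. The intercept and the example participant below are hypothetical, since neither appears in the question.

```python
import numpy as np

# Coefficients printed by clf.coef_ in the question.
coef = np.array([-0.68120795, -0.19073737, -2.50511774, 0.14956844])

# Odds ratio per one-unit increase of each feature.
odds_ratios = np.exp(coef)

def sigmoid(z):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, for illustration only.
intercept = 0.0
participant = np.array([1.0, 0.0, 0.5, 30.0])  # condition, prior knowledge, time, age

log_odds = intercept + coef @ participant
probability = sigmoid(log_odds)

print(odds_ratios)
print(probability)
```

Note that applying the logistic function to a single coefficient, as in the list comprehension above, gives the probability for a case where that feature equals 1, every other feature is 0, and the intercept is 0. A probability for a real participant needs the whole linear predictor.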
I am using scikit learn for Gaussian process regression (GPR) operation to predict data. My training data are as follows:
x_train = np.array([[0,0],[2,2],[3,3]]) #2-D cartesian coordinate points
y_train = np.array([[200,250, 155],[321,345,210],[417,445,851]]) #observed output from three different datasources at respective input data points (x_train)
The test points (2-D) where mean and variance/standard deviation need to be predicted are:
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()])
x_test = (np.array(positions)).T
Now, after running the GPR (GaussianProcessRegressor) fit (here, the product of ConstantKernel and RBF is used as the kernel in GaussianProcessRegressor), the mean and variance/standard deviation can be predicted with the following line of code:
y_pred_test, sigma = gp.predict(x_test, return_std =True)
While printing the predicted mean (y_pred_test) and variance (sigma), I get following output printed in the console:
In the predicted values (mean), a nested array with three objects in each inner array is printed. Presumably, the inner arrays are the predicted mean values of each data source at each 2-D test point. However, the printed variance contains only a single array with 16 objects (perhaps one for each of the 16 test points). I know that the variance provides an indication of the uncertainty of the estimation. Hence, I was expecting a predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test point? Is it due to wrong code?
Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
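To illustrate what a covariance matrix looks like for a 3-element output, here is a short NumPy sketch on the y_train values from the question. It only describes the sample spread of the three training rows, not a GP predictive covariance:

```python
import numpy as np

# Observed outputs from the question: 3 training points x 3 data sources.
y_train = np.array([[200, 250, 155], [321, 345, 210], [417, 445, 851]])

# rowvar=False treats each column (data source) as one variable.
cov = np.cov(y_train, rowvar=False)

print(cov.shape)     # one row and one column per output variable
print(np.diag(cov))  # the diagonal holds the per-output sample variances
```

The diagonal entries are the ordinary scalar variances of each output. The off-diagonal entries, which have no counterpart in a single "variance" number, describe how the outputs vary together.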
Continuing on the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):
Returns:
y_mean : array, shape = (n_samples, [n_output_dims])
Mean of predictive distribution a query points
y_std : array, shape = (n_samples,), optional
Standard deviation of predictive distribution at query points. Only returned when return_std is True.
y_cov : array, shape = (n_samples, n_samples), optional
Covariance of joint predictive distribution a query points. Only returned when return_cov is True.
Combined with my previous remark about the covariance matrix, the first choice would be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
most GPR implementations model only a single response variable, due to
the difficulty in the formulation of covariance function for
correlated multiple response variables, which describes not only the
correlation between data points, but also the correlation between
responses. In the paper we propose a direct formulation of the
covariance function for multi-response GPR, based on the idea that [...]
Despite the high uptake of GPR for various modelling tasks, there
still exists some outstanding issues with the GPR method. Of
particular interest in this paper is the need to model multiple
response variables. Traditionally, one response variable is treated as
a Gaussian process, and multiple responses are modelled independently
without considering their correlation. This pragmatic and
straightforward approach was taken in many applications (e.g. [7, 26,
27]), though it is not ideal. A key to modelling multi-response
Gaussian processes is the formulation of covariance function that
describes not only the correlation between data points, but also the
correlation between responses.
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Typical GPs are usually designed for single-output scenarios wherein
the output is a scalar. However, the multi-output problems have
arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t)}, 1 ≤ t ≤ T, one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.
The study of MOGP has a long history and is known as multivariate
Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Gaussian process analysis of processes with multiple outputs is
limited by the fact that far fewer good classes of covariance
functions exist compared with the scalar (single-output) case. [...]
The difficulty of finding “good” covariance models for multiple
outputs can have important practical consequences. An incorrect
structure of the covariance matrix can significantly reduce the
efficiency of the uncertainty quantification process, as well as the
forecast efficiency in kriging inferences [16]. Therefore, we argue,
the covariance model may play an even more profound role in co-kriging
[7, 17]. This argument applies when the covariance structure is
inferred from data, as is typically the case.
Hence, my understanding, as I said, is that scikit-learn is not really capable of handling such cases, despite the fact that nothing of the sort is mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).
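For the fallback of modeling each output separately, a minimal sketch could look like the following. It reuses the data from the question and a ConstantKernel * RBF kernel. The kernel starting values, normalize_y=True and random_state=0 are my own assumptions, and the outputs are treated as independent, so any correlation between them is ignored. Newer scikit-learn releases may return a differently shaped standard deviation for multi-output targets, and this sketch does not rely on that.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

x_train = np.array([[0, 0], [2, 2], [3, 3]])
y_train = np.array([[200, 250, 155], [321, 345, 210], [417, 445, 851]])

# 4 x 4 grid of 2-D test points, as in the question.
xvalues = np.array([0, 1, 2, 3])
yvalues = np.array([0, 1, 2, 3])
x, y = np.meshgrid(xvalues, yvalues)
x_test = np.vstack([x.ravel(), y.ravel()]).T

means, stds = [], []
for k in range(y_train.shape[1]):
    # One independent single-output GP per data source (assumed hyperparameters).
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    gp.fit(x_train, y_train[:, k])
    mean_k, std_k = gp.predict(x_test, return_std=True)
    means.append(mean_k)
    stds.append(std_k)

y_pred_test = np.column_stack(means)  # one column of means per data source
sigma = np.column_stack(stds)         # one column of standard deviations per data source
print(y_pred_test.shape, sigma.shape)
```

This gives one standard deviation per data source at each test point, which is what the question was expecting, at the price of discarding whatever the three outputs could tell you about each other.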
First of all, if the parameter used is "sigma", that's referring to standard deviation, not variance (recall, variance is just standard deviation squared).
It's easier to conceptualize using variance, since variance is defined as the average squared distance from the data points to the mean of the set.
In your case, you have a set of 2D points. If you think of these as points on a 2D plane, then each point contributes its squared distance from the mean, and the variance is the average of those contributions. The standard deviation would then be the positive root of the variance.
In this case, you have 16 test points and 16 values of standard deviation. This makes sense, since the predictor returns one standard deviation per test point.
If you want to compute the variance of the SET of points, you can do that by summing the squares of the individual values, dividing that by the number of points, then subtracting the mean squared. The positive root of this number will yield the standard deviation of the set.
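A quick numeric check of that recipe on made-up numbers, comparing it with the definition as the average squared deviation from the mean:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data
mean = values.mean()

# Definition: average squared deviation from the mean.
var_definition = np.mean((values - mean) ** 2)

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = np.mean(values ** 2) - mean ** 2

std = np.sqrt(var_definition)
print(mean, var_definition, var_shortcut, std)  # 5.0 4.0 4.0 2.0
```

Both routes give the population variance (dividing by n). np.var uses the same convention by default.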
ASIDE: this also means that if you change the set through insertion, deletion, or substitution, the deviation of EVERY point from the mean will change. This is because the mean will be recomputed to accommodate the new data. Iteratively recomputing means in this way is the core step of k-means clustering.
I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
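from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)  # expand the inputs into all terms up to degree 2
X_ = poly.fit_transform(input_data)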
And you can constrain the weights through Lasso regression:
from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.5, positive=True)  # L1 penalty; positive=True forces non-negative weights
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into an ordinary linear regression. alpha controls the strength of the L1 penalty on the weights: larger values shrink the weights harder and push more of them to exactly zero. With positive=True you can also force the weights to be non-negative. Check this out here.
Run it with a small degree and perform cross-validation to check how well it fits.
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
You should also take a look at this question. This explains how you can curve fit.
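Putting the advice above together, here is one way the "small degree plus cross-validation" loop might be sketched. The synthetic data, the candidate degrees and the use of LassoCV (which picks alpha by internal cross-validation) are my own choices for illustration, not something prescribed above:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data: 3 effects, 1 response.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

scores = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),               # put all polynomial terms on one scale
        LassoCV(cv=5, max_iter=10000),  # chooses alpha by internal cross-validation
    )
    # Mean out-of-fold R^2 for this degree.
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best_degree = max(scores, key=scores.get)
print(scores, best_degree)
```

Because the Lasso zeroes out terms that do not help, this replaces the manual step of guessing a polynomial and then pruning insignificant terms. The cross-validated score, not the in-sample fit, decides which degree to keep.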
ANOVA (analysis of variance) uses covariance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?
Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. It's used for everything.
For your first point, likelihood is not the same as the value of the parameter. Likelihood is the probability density of the observed data under the model, evaluated at a given set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density of each observation, and then multiplying the densities for all the observations together (this follows from probability theory in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression, and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and calculate the density of each observation y at that set of parameter estimates. Then you multiply them all together. Typically, we choose to work with the log-likelihood because it's easier to calculate: instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is computationally faster and avoids numerical underflow from multiplying many small densities. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
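Here is a small NumPy sketch of that calculation on synthetic data. It fits the line by least squares, plugs the estimates into the normal pdf, and compares the sum of log densities with the standard closed form. I use the maximum-likelihood variance (residual sum of squares divided by n), which, as far as I know, is also the convention behind the value statsmodels reports; treat that as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)  # synthetic data

# Least-squares estimates of beta (intercept and slope).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = np.mean(resid ** 2)  # maximum-likelihood estimate of the variance

# Normal density of each observation at the estimated parameters.
densities = np.exp(-resid ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

likelihood = np.prod(densities)             # product of densities
log_likelihood = np.sum(np.log(densities))  # sum of log densities

# Closed form of the same quantity for a Gaussian linear model.
closed_form = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1)

print(likelihood, log_likelihood, closed_form)
```

With only 50 points the raw product is already a tiny number. With thousands of observations it would underflow to zero, which is the practical reason for summing logs.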
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (AIC penalizes the number of parameters, because adding parameters never makes the fit worse).
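As a sketch of the AIC comparison, using AIC = 2k - 2 * log-likelihood on synthetic data where only x1 actually drives the response. The helper names are hypothetical, and I count the error variance as a parameter; conventions for k differ between packages, which shifts every AIC by the same constant and does not change the ranking.

```python
import numpy as np

def gaussian_loglik(X, y):
    """Maximized Gaussian log-likelihood of a least-squares fit (hypothetical helper)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(X, y):
    """AIC = 2k - 2*loglik, counting coefficients plus the error variance."""
    k = X.shape[1] + 1
    return 2 * k - 2 * gaussian_loglik(X, y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # irrelevant predictor
y = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)   # same response for every model

ones = np.ones(n)
X_right = np.column_stack([ones, x1])
X_wrong = np.column_stack([ones, x2])
X_big = np.column_stack([ones, x1, x2])

for name, design in [("x1", X_right), ("x2", X_wrong), ("x1 + x2", X_big)]:
    print(name, gaussian_loglik(design, y), aic(design, y))
```

The larger model never has a lower log-likelihood than the model nested inside it, which is why the raw log-likelihood alone cannot be used to choose between them and a penalty such as AIC's 2k is added.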