SVM, scikit-learn: Decision values with RBF kernel

SVM, scikit-learn: Decision values with RBF kernel - python

I have read somewhere that it's not possible interpret SVM decision values on non-linear kernels, so only the sign matters. However, I saw couple of articles putting a threshold on decision values (with SVMlight though) [1] [2]. So i'm not sure whether putting thresholds on decision values logical as well but i'm curious on the results anyways.
So, LibSVM python interface directly returns the decision values with predicted target when you call predict(), is there any way to do it with scikit-learn? I have trained a binary classification SVM model using svm.SVC(), but got stuck there right now.
In source codes i have found svm.libsvm.decision_function() function commented as "(libsvm name for this is predict_values)". Then i have seen the svm.SVC.decision_function() and checked its source code:
dec_func = libsvm.decision_function(
X, self.support_, self.support_vectors_, self.n_support_,
self.dual_coef_, self._intercept_, self._label,
self.probA_, self.probB_,
svm_type=LIBSVM_IMPL.index(self._impl),
kernel=kernel, degree=self.degree, cache_size=self.cache_size,
coef0=self.coef0, gamma=self._gamma)
# In binary case, we need to flip the sign of coef, intercept and
# decision function.
if self._impl in ['c_svc', 'nu_svc'] and len(self.classes_) == 2:
return -dec_func
It seems like it's doing the libsvm's predict equivalent, but why does it changes the sign of decision values, if it's the equivalent of ?
Also, is there any way to calculate confidence value for an SVM decision using this value or any prediction output (except probability estimates and Platt's method, my model is not good when probability estimates are calculated)? Or as it has been argued, the only sign matters for decision value in non-linear kernels?
[1] http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039195#pone.0039195-Teng1
[2] http://link.springer.com/article/10.1007%2Fs00726-011-1100-2

It seems like it's doing the libsvm's predict equivalent, but why does it changes the sign of decision values, if it's the equivalent of ?
These are just implementation hacks regarding internal representation of class signs. Nothing to truly be worried about.
sklearn decision_function is the value of inner product between SVM's hyerplane w and your data x (possibly in the kernel induced space), so you can use it, shift or analyze. Its interpretation, however is very abstract, as in case of rbf kernel it is simply the integral of the product of normal distribution centered in x with variance equal to 1/(2*gamma) and the weighted sum of normal distributions centered in support vectors (and the same variance), where weights are alpha coefficients.
Also, is there any way to calculate confidence value for an SVM decision using this value or any prediction
Platt's scaling is used not because there is some "lobby" forcing us to - simply this is the "correct" way of estimating SVM's confidence. However, if you are not interested in "probability sense" confidence, but rather any value that you can qualitatively compare (which point is more confident) than decision function can be used to do it. It is roughly the distance between the point image in kernel space and the separating hyperplane (up to the normalizing constant being the norm of w). So it is true, that
abs(decision_function(x1)) < abs(decision_function(x2)) => x1 is less confident than x2.
In short - bigger the decision_function value, the "deeper" the point is in its hyperplane.

Related

Can someone give a good math/stats explanation as to what the parameter var_smoothing does for GaussianNB in scikit learn?

I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.

A Gaussian curve can serve as a "low pass" filter, allowing only the samples close to its mean to "pass." In the context of Naive Bayes, assuming a Gaussian distribution is essentially giving more weights to the samples closer to the distribution mean. This might or might not be appropriate depending if what you want to predict follows a normal distribution.
The variable, var_smoothing, artificially adds a user-defined value to the distribution's variance (whose default value is derived from the training data set). This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.

I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In Stats, probability distribution function such as Gaussian depends on sigma^2 (variance); and the more variance between two features the less correlational and better estimator since naive Bayes as the model used is a iid (basically, it assume the feature are independent).
However, in terms computation, it is very common in machine learning that high or low values vectors or float operations can bring some errors, such as, "ValueError: math domain error". Which this extra variable may serve its purpose as a adjustable limit in case some-type numerical error occurred.
Now, it will be interesting to explore if we can use this value for further control such as avoiding over-fitting since this new self-epsilon is added into the variance(sigma^2) or standard deviations(sigma).

Interpreting logistic regression feature coefficient values in sklearn

I have fit a logistic regression model to my data. Imagine, I have four features: 1) which condition the participant received, 2) whether the participant had any prior knowledge/background about the phenomenon tested (binary response in post-experimental questionnaire), 3) time spent on the experimental task, and 4) participant age. I am trying to predict whether participants ultimately chose option A or option B. My logistic regression outputs the following feature coefficients with clf.coef_:
[[-0.68120795 -0.19073737 -2.50511774 0.14956844]]
If option A is my positive class, does this output mean that feature 3 is the most important feature for binary classification and has a negative relationship with participants choosing option A (note: I have not normalized/re-scaled my data)? I want to ensure that my understanding of the coefficients, and the information I can extract from them, is correct so I don't make any generalizations or false assumptions in my analysis.
Thanks for your help!

You are getting to the right track there. If everything is a very similar magnitude, a larger pos/neg coefficient means larger effect, all things being equal.
However, if your data isn't normalized, Marat is correct in that the magnitude of the coefficients don't mean anything (without context). For instance you could get different coefficients by changing the units of measure to be larger or smaller.
I can't see if you've included a non-zero intercept here, but keep in mind that logistic regression coefficients are in fact odds ratios, and you need to transform them to probabilities to get something more directly interpretable.
Check out this page for a good explanation:
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

Logistic regression returns information in log odds. So you must first convert log odds to odds using np.exp and then take odds/(1 + odds).
To convert to probabilities, use a list comprehension and do the following:
[np.exp(x)/(1 + np.exp(x)) for x in clf.coef_[0]]
This page had an explanation in R for converting log odds that I referenced:
https://sebastiansauer.github.io/convert_logit2prob/

Multiple-output Gaussian Process regression in scikit-learn

I am using scikit learn for Gaussian process regression (GPR) operation to predict data. My training data are as follows:
x_train = np.array([[0,0],[2,2],[3,3]]) #2-D cartesian coordinate points
y_train = np.array([[200,250, 155],[321,345,210],[417,445,851]]) #observed output from three different datasources at respective input data points (x_train)
The test points (2-D) where mean and variance/standard deviation need to be predicted are:
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()])
x_test = (np.array(positions)).T
Now, after running the GPR (GausianProcessRegressor) fit (Here, the product of ConstantKernel and RBF is used as Kernel in GaussianProcessRegressor), mean and variance/standard deviation can be predicted by following the line of code:
y_pred_test, sigma = gp.predict(x_test, return_std =True)
While printing the predicted mean (y_pred_test) and variance (sigma), I get following output printed in the console:
In the predicted values (mean), the 'nested array' with three objects inside the inner array is printed. It can be presumed that the inner arrays are the predicted mean values of each data source at each 2-D test point locations. However, the printed variance contains only a single array with 16 objects (perhaps for 16 test location points). I know that the variance provides an indication of the uncertainty of the estimation. Hence, I was expecting the predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test points? Is it due to wrong code?

Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
Continuing on the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):
Returns:
y_mean : array, shape = (n_samples, [n_output_dims])
Mean of predictive distribution a query points
y_std : array, shape = (n_samples,), optional
Standard deviation of predictive distribution at query points. Only returned when return_std is True.
y_cov : array, shape = (n_samples, n_samples), optional
Covariance of joint predictive distribution a query points. Only returned when return_cov is True.
Combined with my previous remark about the covariance matrix, the first choice would be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
most GPR implementations model only a single response variable, due to
the difficulty in the formulation of covariance function for
correlated multiple response variables, which describes not only the
correlation between data points, but also the correlation between
responses. In the paper we propose a direct formulation of the
covariance function for multi-response GPR, based on the idea that [...]
Despite the high uptake of GPR for various modelling tasks, there
still exists some outstanding issues with the GPR method. Of
particular interest in this paper is the need to model multiple
response variables. Traditionally, one response variable is treated as
a Gaussian process, and multiple responses are modelled independently
without considering their correlation. This pragmatic and
straightforward approach was taken in many applications (e.g. [7, 26,
27]), though it is not ideal. A key to modelling multi-response
Gaussian processes is the formulation of covariance function that
describes not only the correlation between data points, but also the
correlation between responses.
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Typical GPs are usually designed for single-output scenarios wherein
the output is a scalar. However, the multi-output problems have
arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t}, 1 ≤t ≤T , one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.
The study of MOGP has a long history and is known as multivariate
Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Gaussian process analysis of processes with multiple outputs is
limited by the fact that far fewer good classes of covariance
functions exist compared with the scalar (single-output) case. [...]
The difficulty of finding “good” covariance models for multiple
outputs can have important practical consequences. An incorrect
structure of the covariance matrix can significantly reduce the
efficiency of the uncertainty quantification process, as well as the
forecast efficiency in kriging inferences [16]. Therefore, we argue,
the covariance model may play an even more profound role in co-kriging
[7, 17]. This argument applies when the covariance structure is
inferred from data, as is typically the case.
Hence, my understanding, as I said, is that sckit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).

First of all, if the parameter used is "sigma", that's referring to standard deviation, not variance (recall, variance is just standard deviation squared).
It's easier to conceptualize using variance, since variance is defined as the Euclidean distance from a data point to the mean of the set.
In your case, you have a set of 2D points. If you think of these as points on a 2D plane, then the variance is just the distance from each point to the mean. The standard deviation than would be the positive root of the variance.
In this case, you have 16 test points, and 16 values of standard deviation. This makes perfect sense, since each test point has its own defined distance from the mean of the set.
If you want to compute the variance of the SET of points, you can do that by summing the variance of each point individually, dividing that by the number of points, then subtracting the mean squared. The positive root of this number will yield the standard deviation of the set.
ASIDE: this also means that if you change the set through insertion, deletion, or substitution, the standard deviation of EVERY point will change. This is because the mean will be recomputed to accommodate the new data. This iterative process is the fundamental force behind k-means clustering.

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?

Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constraint the weights through the Lasso Regression
clf = linear_model.Lasso(alpha = 0.5, positive = True)
clf.fit(X_,Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into a simple linear regression. alpha is basically the penalty imposed for smaller weights. You can also make the weights strictly positive. Check this out here.
Run it with a small degree and perform a cross-validation to check how good it fits.
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
You should also take a look at this question. This explains how you can curve fit.

ANOVA (analysis of variance) uses covariance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogenous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.

Statsmodels OLS Regression: Log-likelihood, uses and interpretation

I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?

Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. Its used for everything.
For your first point, likelihood is not the same as the value of the parameter. Likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density for each one, and then multiplying the probability densities for all the observations together (this follows from probability theory in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and then calculate the density for each observation y at that set of parameter estimates. Then, multiply them all together. Typically, we choose to work with the log-likelihood because it's easier to calculate because instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is computationally faster. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (but penalizes on parameter numbers, because more parameters = better fit regardless).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.