This may not be possible in theory, if so please elaborate.
I am trying to fit some data with Python's sklearn SVM class.
When I use a linear kernel, I can extract the coefficients from the fitted model's coef_ attribute, which the documentation describes as:
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes,
n_features] Weights assigned to the features (coefficients in the
primal problem). This is only available in the case of linear kernel.
So I can find the equation of best fit that depends on all the independent variables, and am able to use this equation elsewhere.
Is it possible to do the same (get a non-linear equation) from a nonlinear kernel (like the RBF or the polynomial kernel) using sklearn?
Thanks!
Tim
According to the documentation:
The decision function is:
...
These parameters can be accessed through the members dual_coef_ which holds the product y_i alpha_i, support_vectors_ which holds the support vectors, and intercept_ which holds the independent term \rho ...
("support vectors" means the x_i in the decision function equation).
Each kernel has a different function, which you'll need to understand to compute the K(x_i,x) term.
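As a minimal sketch of how those three attributes combine, the snippet below rebuilds the decision function of a polynomial-kernel SVC by hand and compares it with decision_function. The data set, the hyper-parameter values and the helper name poly_decision are assumptions made for illustration; the kernel form (gamma * <x_i, x> + coef0) ** degree is the one scikit-learn documents for kernel="poly".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary problem (illustrative data, not from the question).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# gamma and coef0 are given explicitly so the formula below has no hidden defaults.
clf = SVC(kernel="poly", degree=2, gamma=0.1, coef0=1.0).fit(X, y)

def poly_decision(clf, X_new):
    """Hypothetical helper: sum_i dual_coef_[i] * K(x_i, x) + intercept_."""
    # Polynomial kernel between every new point and every support vector.
    K = (clf.gamma * (X_new @ clf.support_vectors_.T) + clf.coef0) ** clf.degree
    # dual_coef_ already holds y_i * alpha_i; binary models have a single row.
    return K @ clf.dual_coef_[0] + clf.intercept_[0]

manual = poly_decision(clf, X)
print(np.allclose(manual, clf.decision_function(X)))
```

The same pattern should carry over to other kernels by swapping the line that computes K; only the binary case is covered here, since multi-class models store several rows in dual_coef_.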
Which algorithm from algebra is used to implement LinearRegression() function in scikit-learn?
From the documentation:
LinearRegression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
So LinearRegression uses OLS, i.e. ordinary least squares regression.
Sklearn is open source, and the source is available on GitHub.
The comments in the LinearRegression source code say:
From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares
(scipy.optimize.nnls) wrapped as a predictor object.
If you want greater detail, you can read the code.
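A quick way to check this without reading the source is to solve the same least-squares problem with scipy.linalg.lstsq and compare. This is only a sketch on synthetic data: the appended column of ones is one way of handling the intercept for the comparison, not a description of how scikit-learn handles it internally.

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients plus a little noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)

model = LinearRegression().fit(X, y)

# Append a column of ones so the intercept is estimated as one more coefficient.
X1 = np.column_stack([X, np.ones(len(X))])
beta, *_ = lstsq(X1, y)

print(np.allclose(beta[:3], model.coef_), np.isclose(beta[3], model.intercept_))
```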
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept that minimise half the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and the results coincide.
The results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. Looking at the GitHub code, I'm not experienced enough to untangle it.
The reason I'm asking is that I like the RidgeClassifier results: it generalises a bit better on my problem. But before I use it, I would like to at least have an idea of where it comes from.
Thanks for possible help.
RidgeClassifier() works differently compared to LogisticRegression() with l2 penalty. The loss function for RidgeClassifier() is not cross entropy.
RidgeClassifier() uses Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert the target variable into +1 or -1 depending on the class it belongs to.
Build a Ridge() model (which is a regression model) to predict this target variable. The loss function is MSE + l2 penalty.
If the Ridge() regression's predicted value (which is what decision_function() returns) is greater than 0, predict the positive class; otherwise predict the negative class.
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).
Get prediction from each class's Ridge() regression model (a real number for each class) and then use argmax to predict the class.
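The binary recipe above can be checked with a short sketch: fit RidgeClassifier, then fit a plain Ridge on targets recoded to -1/+1 and threshold its output at 0. The data set and alpha value are arbitrary choices for illustration, and the comparison assumes default settings (no class weights).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

# Toy binary problem with labels 0 and 1 (illustrative data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Step 1: recode the classes as -1 / +1.
y_pm = np.where(y == 1, 1.0, -1.0)
# Step 2: ordinary ridge regression on the recoded target.
reg = Ridge(alpha=1.0).fit(X, y_pm)
# Step 3: predict the positive class when the regression output is above 0.
manual_pred = (reg.predict(X) > 0).astype(int)

print(np.allclose(reg.coef_, clf.coef_.ravel()))
print(np.array_equal(manual_pred, clf.predict(X)))
```

If both lines print True, the classifier's coefficients are those of the regression on the recoded target, which is why its results differ from LogisticRegression: the loss is squared error rather than cross-entropy.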
I want to have the formula of the model in order to use it in other languages/projects. Is there a way to export the formula from the model?
I will use sklearn's linear regression model.
What I want to do eventually: given a formula f() and a data set d, I will have JavaScript code that gives me predictions on d based on f().
The formula can be described essentially by the learned coefficients. The coefficients can be obtained using the attributes coef_ and intercept_. The dot product between the coefficients and the input vector plus the intercept gives the output of the model.
The actual code that implements this "formula" in scikit-learn is something like:
    return safe_sparse_dot(X, self.coef_.T,
                           dense_output=True) + self.intercept_
which should not be too difficult for you to port over to your other project.
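As a rough sketch of such a port, the snippet below exports the learned numbers as JSON and re-implements the prediction with nothing but a loop, which is the part you would rewrite in the other language. The data, the function name predict_by_formula and the JSON layout are made up for illustration.

```python
import json
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 2.0

model = LinearRegression().fit(X, y)

# Everything the "formula" needs: one weight per feature plus the intercept.
coef = model.coef_.tolist()
intercept = float(model.intercept_)
exported = json.dumps({"coef": coef, "intercept": intercept})

def predict_by_formula(x, coef, intercept):
    """Hypothetical helper: intercept + sum_j coef[j] * x[j], written as a plain loop."""
    total = intercept
    for c, v in zip(coef, x):
        total += c * v
    return total

params = json.loads(exported)
manual = [predict_by_formula(row, params["coef"], params["intercept"])
          for row in X.tolist()]
print(np.allclose(manual, model.predict(X)))
```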
I have read somewhere that it's not possible to interpret SVM decision values for non-linear kernels, so only the sign matters. However, I saw a couple of articles putting a threshold on decision values (with SVMlight, though) [1] [2]. So I'm not sure whether putting thresholds on decision values is logical, but I'm curious about the results anyway.
The LibSVM Python interface directly returns the decision values along with the predicted target when you call predict(). Is there any way to do this with scikit-learn? I have trained a binary classification SVM model using svm.SVC(), but I am stuck at this point.
In the source code I found the svm.libsvm.decision_function() function, commented as "(libsvm name for this is predict_values)". Then I looked at svm.SVC.decision_function() and checked its source code:
    dec_func = libsvm.decision_function(
        X, self.support_, self.support_vectors_, self.n_support_,
        self.dual_coef_, self._intercept_, self._label,
        self.probA_, self.probB_,
        svm_type=LIBSVM_IMPL.index(self._impl),
        kernel=kernel, degree=self.degree, cache_size=self.cache_size,
        coef0=self.coef0, gamma=self._gamma)

    # In binary case, we need to flip the sign of coef, intercept and
    # decision function.
    if self._impl in ['c_svc', 'nu_svc'] and len(self.classes_) == 2:
        return -dec_func
It seems like it's doing the equivalent of libsvm's predict, but why does it change the sign of the decision values, if it's the equivalent of ?
Also, is there any way to calculate a confidence value for an SVM decision using this value or any other prediction output (other than probability estimates and Platt's method; my model does not perform well when probability estimates are calculated)? Or, as has been argued, does only the sign of the decision value matter for non-linear kernels?
[1] http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039195#pone.0039195-Teng1
[2] http://link.springer.com/article/10.1007%2Fs00726-011-1100-2
It seems like it's doing the equivalent of libsvm's predict, but why does it change the sign of the decision values, if it's the equivalent of ?
These are just implementation hacks regarding internal representation of class signs. Nothing to truly be worried about.
sklearn's decision_function is the value of the inner product between the SVM's hyperplane w and your data point x (possibly in the kernel-induced space), so you can use it, shift it, or analyze it. Its interpretation, however, is very abstract: in the case of the rbf kernel it is simply the integral of the product of a normal distribution centered at x with variance equal to 1/(2*gamma) and the weighted sum of normal distributions centered at the support vectors (with the same variance), where the weights are the alpha coefficients.
Also, is there any way to calculate confidence value for an SVM decision using this value or any prediction
Platt's scaling is used not because there is some "lobby" forcing us to; it is simply the "correct" way of estimating an SVM's confidence. However, if you are not interested in confidence in the "probability sense", but rather in any value that you can compare qualitatively (which point is more confident), then the decision function can be used for that. It is roughly the distance between the point's image in kernel space and the separating hyperplane (up to a normalizing constant, the norm of w). So it is true that
abs(decision_function(x1)) < abs(decision_function(x2)) => x1 is less confident than x2.
In short: the larger the absolute value of decision_function, the "deeper" the point lies on its side of the hyperplane.
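A small sketch of using the decision values this way with scikit-learn, on a toy binary problem: the sign reproduces predict(), the absolute value gives a qualitative confidence ranking, and a custom threshold can be applied directly. The data, the gamma value and the threshold of 0.5 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary problem (illustrative data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# One signed score per sample; positive scores correspond to clf.classes_[1].
scores = clf.decision_function(X)
pred_from_sign = clf.classes_[(scores > 0).astype(int)]
print(np.array_equal(pred_from_sign, clf.predict(X)))

# Qualitative confidence: sample indices ordered from most to least confident.
ranking = np.argsort(-np.abs(scores))

# A stricter rule than the default: only call a point positive above a chosen threshold.
threshold = 0.5  # arbitrary value for illustration
strict_positive = scores > threshold
```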
I have trained a bunch of RBF SVMs using scikits.learn in Python and then Pickled the results. These are for image processing tasks and one thing I want to do for testing is run each classifier on every pixel of some test images. That is, extract the feature vector from a window centered on pixel (i,j), run each classifier on that feature vector, and then move on to the next pixel and repeat. This is far too slow to do with Python.
Clarification: When I say "this is far too slow..." I mean that even the Libsvm under-the-hood code that scikits.learn uses is too slow. I'm actually writing a manual decision function for the GPU so classification at each pixel happens in parallel.
Is it possible for me to load the classifiers with Pickle, and then grab some kind of attribute that describes how the decision is computed from the feature vector, and then pass that info to my own C code? In the case of linear SVMs, I could just extract the weight vector and bias vector and add those as inputs to a C function. But what is the equivalent thing to do for RBF classifiers, and how do I get that info from the scikits.learn object?
Added: First attempts at a solution.
It looks like the classifier object has the attribute support_vectors_ which contains the support vectors as each row of an array. There is also the attribute dual_coef_ which is a 1 by len(support_vectors_) array of coefficients. From the standard tutorials on non-linear SVMs, it appears then that one should do the following:
Compute the feature vector v from your data point under test. This will be a vector that is the same length as the rows of support_vectors_.
For each row i in support_vectors_, compute the squared Euclidean distance d[i] between that support vector and v.
Compute t[i] as exp{-gamma * d[i]} where gamma is the RBF parameter.
Sum up dual_coef_[i] * t[i] over all i. Add the value of the intercept_ attribute of the scikits.learn classifier to this sum.
If the sum is positive, classify as 1. Otherwise, classify as 0.
Added: On numbered page 9 at this documentation link it mentions that indeed the intercept_ attribute of the classifier holds the bias term. I have updated the steps above to reflect this.
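The steps above can be sketched in Python before porting them to C or GPU code, with the kernel written in the standard RBF form exp(-gamma * d[i]). The function rbf_decision, the data set and the gamma value are hypothetical; the loop is deliberately written without vectorisation so it maps onto a per-pixel kernel, and the result is compared with decision_function for a binary classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary problem standing in for the per-pixel feature vectors (illustrative).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
gamma = 0.25
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def rbf_decision(v, support_vectors, dual_coef, intercept, gamma):
    """Hypothetical helper: decision value for one feature vector v."""
    total = intercept
    for sv, a in zip(support_vectors, dual_coef):
        d = np.sum((sv - v) ** 2)         # squared Euclidean distance
        total += a * np.exp(-gamma * d)   # coefficient times RBF kernel value
    return total

manual = np.array([
    rbf_decision(v, clf.support_vectors_, clf.dual_coef_[0], clf.intercept_[0], gamma)
    for v in X
])
print(np.allclose(manual, clf.decision_function(X)))
```

The arrays passed to rbf_decision (support vectors, one row of dual coefficients, the intercept) plus gamma are all that native code would need for the binary case.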
Yes, your solution looks alright. To pass the raw memory of a numpy array directly to a C program you can use the ctypes helpers from numpy, or wrap your C program with cython and call it directly by passing the numpy array (see the doc at http://cython.org for more details).
However, I am not sure that trying to speed up the prediction on a GPU is the easiest approach: kernel support vector machines are known to be slow at prediction time since their complexity depends directly on the number of support vectors, which can be high for highly non-linear (multi-modal) problems.
Alternative approaches that are faster at prediction time include neural networks (probably more complicated or slower to train right than SVMs that only have 2 hyper-parameters C and gamma) or transforming your data with a non linear transformation based on distances to prototypes + thresholding + max pooling over image areas (only for image classification).
For the first method, you will find good documentation in the deep learning tutorial.
For the second, read the recent papers by Adam Coates and have a look at this page on kmeans feature extraction.
Finally, you can also try NuSVC models, whose regularization parameter nu has a direct impact on the number of support vectors in the fitted model: fewer support vectors mean faster prediction times (check the accuracy though; it will be a trade-off between prediction speed and accuracy in the end).
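A minimal sketch of that trade-off, assuming a toy two-blob data set and two arbitrary nu values: it counts the support vectors of each fitted NuSVC via n_support_ and prints the training accuracy alongside.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

# Two well-separated, balanced blobs (illustrative data, 100 points per class).
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

n_sv = {}
for nu in (0.05, 0.5):  # arbitrary values chosen for the comparison
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_sv[nu] = int(clf.n_support_.sum())
    print(f"nu={nu}: {n_sv[nu]} support vectors, "
          f"training accuracy {clf.score(X, y):.3f}")
```

Since nu acts as a lower bound on the fraction of training points that become support vectors, the smaller value should yield the smaller model here; on real data the accuracy should be checked on held-out samples rather than the training set.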