Which algorithm from linear algebra is used to implement the LinearRegression() function in scikit-learn?
From the documentation:
LinearRegression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
So LinearRegression uses OLS, i.e. Ordinary Least Squares regression.
Sklearn is open source, and the source is available on GitHub.
The code for LinearRegression is there as well; from the comments in the source:
From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares
(scipy.optimize.nnls) wrapped as a predictor object.
If you want greater detail, you can read the code.
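For intuition, here is a minimal sketch (not the library's exact code; the toy data and coefficients are my own) showing that LinearRegression's result matches a plain least-squares solve via scipy.linalg.lstsq:

import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

# Toy data: y = 1.5*x1 - 2.0*x2 + 0.5*x3 + 3.0 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# scikit-learn's estimator
lr = LinearRegression().fit(X, y)

# Equivalent OLS solve: append a column of ones to model the intercept
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = lstsq(X1, y)

print(lr.coef_, lr.intercept_)  # roughly [ 1.5 -2.0  0.5] and 3.0
print(coef[:-1], coef[-1])      # should agree to numerical precision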
Related
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept that minimise half of the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and the results coincide.
The results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. Looking at the GitHub code, I'm not experienced enough to untangle it.
The reason I'm asking is that I like the RidgeClassifier results -- it generalises a bit better on my problem. But before I use it, I would like to at least have an idea of where it comes from.
Thanks for any help.
RidgeClassifier() works differently compared to LogisticRegression() with l2 penalty. The loss function for RidgeClassifier() is not cross entropy.
RidgeClassifier() uses the Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert the target variable into +1 or -1 based on the class it belongs to.
Build a Ridge() model (which is a regression model) to predict the target variable. The loss function is MSE + l2 penalty.
If the Ridge() regression's prediction (returned by decision_function()) is greater than 0, predict the positive class, else the negative class (see the sketch after the multi-class notes below).
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).
Get a prediction from each class's Ridge() regression model (a real number per class) and then use argmax over those scores to predict the class.
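To make the binary recipe above concrete, here is a minimal sketch (the synthetic data and the default alpha are my choices) checking that RidgeClassifier matches Ridge regression on +/-1 targets followed by a sign threshold:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RidgeClassifier(alpha=1.0).fit(X, y)

# The same thing by hand: regress on +/-1 targets, then threshold at 0
y_pm = np.where(y == 1, 1.0, -1.0)
reg = Ridge(alpha=1.0).fit(X, y_pm)
pred_by_hand = (reg.predict(X) > 0).astype(int)

print(np.array_equal(clf.predict(X), pred_by_hand))  # expected: True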
I want to have the formula of the model in order to use it in other languages/projects. Is there a way to export the formula from the model?
I am using the sklearn linear regression model.
What I want to do eventually: given a formula f() and a data set d, I will have JavaScript code that gives me predictions on d based on f().
The formula can be described essentially by the learned coefficients. The coefficients can be obtained using the attributes coef_ and intercept_. The dot product between the coefficients and the input vector plus the intercept gives the output of the model.
The actual code that implements this "formula" in scikit-learn is something like:
return safe_sparse_dot(X, self.coef_.T,
                       dense_output=True) + self.intercept_
which should not be too difficult for you to port over to your other project.
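For example, here is a short sketch (toy data of my own) showing that reproducing predict() outside scikit-learn only needs coef_ and intercept_:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

model = LinearRegression().fit(X, y)

# The "formula" to port to another language is just w and b
w, b = model.coef_, model.intercept_

x_new = np.array([5.0, 6.0])
manual = float(np.dot(w, x_new) + b)

print(manual, model.predict([x_new])[0])  # the two numbers should match

In JavaScript this reduces to a loop computing sum(w[i] * x[i]) + b.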
This may not be possible in theory; if so, please elaborate.
I am trying to fit some data with Python's sklearn SVM class.
When I use a linear kernel, I can extract the coefficients from the coef_ attribute, which the documentation describes as:
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes,
n_features] Weights assigned to the features (coefficients in the
primal problem). This is only available in the case of linear kernel.
So I can find the equation of best fit that depends on all the independent variables, and am able to use this equation elsewhere.
Is it possible to do the same (get a non-linear equation) from a nonlinear kernel (like the RBF or the polynomial kernel) using sklearn?
Thanks!
Tim
According to the documentation:
The decision function is:
sgn( sum_i y_i alpha_i K(x_i, x) + rho )
These parameters can be accessed through the attributes dual_coef_, which holds the products y_i alpha_i; support_vectors_, which holds the support vectors; and intercept_, which holds the independent term rho ...
("support vectors" means the x_i in the decision function equation).
Each kernel has a different function, which you'll need to understand to compute the K(x_i,x) term.
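As an illustration (toy data and an RBF kernel chosen by me), here is a sketch that rebuilds the decision function by hand from those attributes:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

gamma = 0.5  # must match the gamma passed to SVC above

def rbf_decision(x):
    # RBF kernel: K(x_i, x) = exp(-gamma * ||x_i - x||^2) for each support vector x_i
    K = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    # dual_coef_ holds the products y_i * alpha_i
    return float(clf.dual_coef_[0] @ K + clf.intercept_[0])

x_new = X[0]
print(rbf_decision(x_new), clf.decision_function([x_new])[0])  # should match

So the "equation" does exist, but it is a sum over all support vectors rather than a compact closed form in the original features.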
I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to keep using LinearSVC because of speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model (dual=False is required when penalty='l1' with the default squared hinge loss):
svmmodel = suppmach.LinearSVC(penalty='l1', dual=False, C=1)
svmmodel.fit(x_train, y_train)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)
I want to check whether it makes sense to obtain probability estimates simply as [1 / (1 + exp(-x))], where x is the decision score.
Alternatively, are there other options with respect to classifiers that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it lets you add probability output to LinearSVC or any other classifier that implements the decision_function method:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default, CalibratedClassifierCV + LinearSVC will get you Platt scaling (method='sigmoid'), but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
I took a look at the APIs in the sklearn.svm.* family. All of the models below, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out, but two specific constants A and B are learned in a post-processing step. Also see this Stack Overflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn SVM family. :) Also feel free to use the above four SVM variations that support predict_proba.
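If you do want to roll it yourself, a rough sketch (the approach and all parameter choices here are mine, not scikit-learn's internals) is to learn A and B in sigma(A * score + B) by fitting a one-feature logistic regression on out-of-fold decision scores:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

svm = LinearSVC(dual=False).fit(X, y)

# Out-of-fold scores avoid calibrating on scores the SVM has already overfit
scores = cross_val_predict(LinearSVC(dual=False), X, y,
                           method='decision_function', cv=5)

platt = LogisticRegression().fit(scores.reshape(-1, 1), y)

# Probability estimates for the positive class
proba = platt.predict_proba(svm.decision_function(X).reshape(-1, 1))[:, 1]
print(proba[:5])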
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
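A small sketch of the LogisticRegression suggestion above (synthetic data and the default solver are my assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

clf = LogisticRegression(C=1.0).fit(X, y)
print(clf.predict_proba(X[:3]))      # proper class probabilities
print(clf.decision_function(X[:3]))  # raw scores are still available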
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which performs stochastic gradient descent on a linear SVM (hinge loss) by default. For estimating the binary probabilities it uses the modified Huber loss, via
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)
I am implementing SVR using the sklearn SVR class in Python. My sparse matrix is of size 146860 x 10202. I have divided it into several sub-matrices of size 2500 x 10202. For each sub-matrix, SVR fitting takes about 10 minutes.
What could be the ways to speed up the process? Please suggest any different approach or different python package for the same.
Thanks!
You can average the SVR sub-models' predictions, as sketched below.
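A rough sketch of that averaging idea (the chunk size and kernel settings are my choices, and the data here is synthetic):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=10000, n_features=50, noise=5.0, random_state=0)

# Fit one SVR per chunk of rows
chunk = 2500
models = [SVR(kernel='rbf', C=1.0).fit(X[i:i + chunk], y[i:i + chunk])
          for i in range(0, len(X), chunk)]

# Average the sub-model predictions
X_new = X[:5]
pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(pred)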
Alternatively, you can try fitting a linear regression model on the output of a kernel expansion computed with the Nystroem method, as in the sketch below.
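A minimal sketch of that Nystroem idea (n_components and the Ridge regressor are my own choices):

from sklearn.datasets import make_regression
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=20000, n_features=100, noise=5.0, random_state=0)

model = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.1, n_components=300, random_state=0),
    Ridge(alpha=1.0),
)
model.fit(X, y)  # roughly linear in n_samples, unlike kernel SVR
print(model.score(X, y))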
Or you can try other non-linear regression models such as ensemble of randomized trees or gradient boosted regression trees.
Edit: I forgot to say: the kernel SVR model itself is not scalable, as its training complexity is more than quadratic in the number of samples, hence there is no way to "speed it up".
Edit 2: Actually, scaling the input variables to [0, 1] or [-1, 1], or to unit variance using StandardScaler, can often speed up convergence by quite a bit.
Also, it is very unlikely that the default parameters will yield good results: you have to grid-search the optimal value for gamma, and maybe also epsilon, on subsamples of increasing size (to check the stability of the optimal parameters) before fitting the large models. A sketch of such a search follows.
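A small sketch of that search (the parameter grid and subsample size are my choices):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=20000, n_features=50, noise=5.0, random_state=0)

# Grid-search on a small random subsample first
idx = np.random.RandomState(0).choice(len(X), size=2000, replace=False)
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
grid = GridSearchCV(pipe,
                    {'svr__gamma': [0.01, 0.1, 1.0],
                     'svr__epsilon': [0.01, 0.1, 1.0]},
                    cv=3, n_jobs=-1)
grid.fit(X[idx], y[idx])
print(grid.best_params_)  # repeat on larger subsamples to check stability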