Meaning of `penalty` and `loss` in LinearSVC

Meaning of `penalty` and `loss` in LinearSVC - python

Anti-closing preamble: I have read the question "difference between penalty and loss parameters in Sklearn LinearSVC library" but I find the answer there not to be specific enough. Therefore, I’m reformulating the question:
I am familiar with SVM theory and I’m experimenting with LinearSVC class in Python. However, the documentation is not quite clear regarding the meaning of penalty and loss parameters. I recon that loss refers to the penalty for points violating the margin (usually denoted by the Greek letter xi or zeta in the objective function), while penalty is the norm of the vector determining the class boundary, usually denoted by w. Can anyone confirm or deny this?
If my guess is right, then penalty = 'l1' would lead to minimisation of the L1-norm of the vector w, like in LASSO regression. How does this relate to the maximum-margin idea of the SVM? Can anyone point me to a publication regarding this question? In the original paper describing LIBLINEAR I could not find any reference to L1 penalty.
Also, if my guess is right, why doesn't LinearSVC support the combination of penalty='l2' and loss='hinge' (the standard combination in SVC) when dual=False? When trying it, I get the
ValueError: Unsupported set of arguments

Though very late, I'll try to give my answer. According to the doc, here's the considered primal optimization problem for LinearSVC:
,phi being the Identity matrix, given that LinearSVC only solves linear problems.
Effectively, this is just one of the possible problems that LinearSVC admits (it is the L2-regularized, L1-loss in the terms of the LIBLINEAR paper) and not the default one (which is the L2-regularized, L2-loss).
The LIBLINEAR paper gives a more general formulation for what concerns what's referred to as loss in Chapter 2, then it further elaborates also on what's referred to as penalty within the Appendix (A2+A4).
Basically, it states that LIBLINEAR is meant to solve the following unconstrained optimization pb with different loss functions xi(w;x,y) (which are hinge and squared_hinge); the default setting of the model in LIBLINEAR does not consider the bias term, that's why you won't see any reference to b from now on (there are many posts on SO on this).
, hinge or L1-loss
, squared_hinge or L2-loss.
For what concerns the penalty, basically this represents the norm of the vector w used. The appendix elaborates on the different problems:
L2-regularized, L1-loss (penalty='l2', loss='hinge'):
L2-regularized, L2-loss (penalty='l2', loss='squared_hinge'), default in LinearSVC:
L1-regularized, L2-loss (penalty='l1', loss='squared_hinge'):
Instead, as stated within the documentation, LinearSVC does not support the combination of penalty='l1' and loss='hinge'. As far as I see the paper does not specify why, but I found a possible answer here (within the answer by Arun Iyer).
Eventually, effectively the combination of penalty='l2', loss='hinge', dual=False is not supported as specified in here (it is just not implemented in LIBLINEAR) or here; not sure whether that's the case, but within the LIBLINEAR paper from Appendix B onwards it is specified the optimization pb that's solved (which in the case of L2-regularized, L1-loss seems to be the dual).
For a theoretical discussion on SVC pbs in general, I found that chapter really useful; it shows how the minimization of the norm of w relates to the idea of the maximum-margin.

Related

Results of sklearn/statsmodels ordinary least squares under singular covariance matrix

When computing ordinary least squares regression either using sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, they don't seem to throw any errors when covariance matrix is exactly singular. Looks like under the hood they use Moore-Penrose pseudoinverse rather than the usual inverse which would be impossible under singular covariance matrix.
The question is then twofold:
What is the point of this design? Under what circumstances it is deemed useful to compute OLS regardless of whether the covariance matrix is singular?
What does it output as coefficients then? To my understanding since the covariance matrix is singular, there would be an infinite (in a sense of a scaling constant) number of solutions via pseudoinverse.

two related questions and answers
Differences in Linear Regression in R and Python
Statsmodels with partly identified model
To 1) Under what circumstances it is deemed useful to compute OLS regardless of whether the covariance matrix is singular?
Even though some parameters are not identified and picked by an "arbitrary" unique solution out of the infinte possible solutions, some results statsitics are not affected by the non-identification, the main ones are estimable linear combinations, prediction and r-squared.
Some linear combinations of parameters are identified even if not all parameters are identified separately. For example we can still test whether all means in a oneway categorical variable are equal. These are estimable functions even under singularity and the reason statsmodels inherited pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions from a singular covariance matrix of the parameter estimate.
We get a unique prediction for any values of the explanatory variables which is still useful if the perfect collinearity persists.
Some summary and inferential statistics like Rsquared are independent of the way unique parameters are chosen. This is sometimes convenient and used, for example, in diagnostics and specification tests where LM-test can be computed from rsquared.
To 2) What does it output as coefficients then?
The parameters estimated by Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution also obtains when we have Ridge Regression and the penalization weight goes to zero. (I don't remember where I read this.)
Also, in some cases with singular design, the indeterminacy only affects some parameters. Even though we have to be careful in what we infer about those parameters, other parameters might still be identified and unaffected by the perfectly collinear part.
A software package has essentially 3 options to handle singular cases
raise an exception and refuse to compute anything
drop some variables, question is which variables to drop
switch to penalized solution including generalized inverse
statsmodels picks 3 mainly because of the symmetric treatment of variables. R and Stata pick 2 in many models (where I think it's difficult to predict which variable is lost).
One reason for symmetric treatment is that it makes it easier to compare the same regression across many datasets, which will be more difficult if not always the same variable is dropped when using case 2.

That's indeed the case. As you can see here
sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq or scipy.optimize.nnls, which in turn compute the pseudoinverse of the feature matrix via SVD decomposition (they do not exploit the Normal Equation - for which you would have the mentioned issue - as it is less efficient). Moreover, observe that each sklearn.linear_model.LinearRegression's instance returns the singular values of the feature matrix into the singular_ attribute and its rank into the rank_ attribute.
A similar argument applies to statsmodels.regression.linear_model.OLS, where the fit() method of class RegressionModel uses the following:
The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization.
(see here for reference).

I noticed the same thing, seems sklearn and statsmodel are pretty robust, a little too robust making you wondering how to interprete the results after all. Guess it is still up to the modeler to do due diligence to identify any collinearity between variables and eliminate unnecessary variables. Funny sklearn won't even give you pvalue, which is the most important measure out of these regression. When play with the variables, the coefficient will change, that is why I pay much more attention on pvalues.

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

Running the code of linear binary pattern for Adrian. This program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running python2.7 with opencv3.7, what should I do?

Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. There are a few things you can try.
Normalize your training data so that the problem hopefully becomes more well
conditioned, which in turn can speed up convergence. One
possibility is to scale your data to 0 mean, unit standard deviation using
Scikit-Learn's
StandardScaler
for an example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
Related to 1), make sure the other arguments such as regularization
weight, C, is set appropriately. C has to be > 0. Typically one would try various values of C in a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before finetuning it at finer granularity within a particular interval. These days, it probably make more sense to tune parameters using, for e.g., Bayesian Optimization using a package such as Scikit-Optimize.
Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some in appropriate features or strong correlations in the features. Debug those first before taking this easy way out.
Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks #Nino van Hooff for pointing this out, and #JamesKo for spotting my mistake.
Use a different solver, for e.g., the L-BFGS solver if you are using Logistic Regression. See #5ervant's answer.
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters are not guaranteed to be any good, hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by #Nino van Hooff and #5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particular designed for and helps with the conditioning of the optimization problem. Credit to #5ervant for noticing and pointing this out.
Furthermore, #5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick rough explanation for those who are interested (I am :)) why this matters in this case. Second-order methods, and in particular approximate second-order method like the L-BFGS solver, will help with ill-conditioned problems because it is approximating the Hessian at each iteration and using it to scale the gradient direction. This allows it to get better convergence rate but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish but each iteration will be slower than a typical first-order method like gradient-descent or its variants.
For e.g., a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second order method, for e.g., Newton, will have an update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
In particular, L-BFGS mentioned in #5ervant's answer is a way to approximate the inverse of the Hessian as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., requires fewer iterations) than first-order methods like the usual gradient-descent based solvers, which as you guys know by now sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days with many models optimizing non-convex problems (e.g., those in DL models), second order methods such as L-BFGS methods plays a different role there and there are evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.

I reached the point that I set, up to max_iter=1200000 on my LinearSVC classifier, but still the "ConvergenceWarning" was still present. I fix the issue by just setting dual=False and leaving max_iter to its default.
With LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine have reached max_iter=7600 before the "ConvergenceWarning" disappears when training with large dataset's features.

Explicitly specifying the max_iter resolves the warning as the default max_iter is 100. [For Logistic Regression].
logreg = LogisticRegression(max_iter=1000)

Please incre max_iter to 10000 as default value is 1000. Possibly, increasing no. of iterations will help algorithm to converge. For me it converged and solver was -'lbfgs'
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

Determine WHY Features Are Important in Decision Tree Models

Often-times stakeholders don't want a black-box model that's good at predicting; they want insights about features to have a better understanding about their business, and so they can explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed).
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()
Update:
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see if a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y).
Update2:
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with
purpose of interpreting complex Machine Learning algorithms.
Interpreting a linear regression model is not as complicated as
interpreting Support Vector Machine, Random Forest or Gradient
Boosting Machine models, this is were Partial Dependence Plot can come
into use. For some statistical explaination you can refer hereand More
Advance. Some of the algorithms have methods for finding variable
importance but they do not express whether a varaible is positively or
negatively affecting the model .

tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively
correlated, i.e., does a change in the feature X cause the prediction y to increase/decrease
1. Predictive power
Your feature importance shows you what retains the most information, and are the most significant features. Power could imply what causes the biggest change - you would have to check by plugging in dummy values to see their overall impact, much like you would have to do with linear regression coefficients.
2. Correlation/Dependence
As pointed out by #Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM are additively building a committee of stubs (decision trees with a low number of trees, usually only one split).
In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step it calculates a vector for the "direction" of the weak learner, so you in principle know the direction of the influence from it (but keep in mind it may appear many times in one tree, in multiple steps of the additive model).
But, to cut to the chase; you could just fix all your features apart from f19 and make a prediction for a range of f19 values and see how it is related to the response value.
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.

The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes we do. Feature importance is not some magical object, it is a well defined mathematical criterion - its exact definition depends on particular model (and/or some additional choices), but it is always an object which tells "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example for random forest feature importance is a measure of how probable it is for this feature to be used on a decision path when randomly selected training data point is pushed through the tree. So it gives "why" in a proper, mathematical sense.

Scikit-learn: Assigning dual_coef_ to multi-class SVM (one vs. one scheme) from another SVM implementation

I had some trouble finding a good transductive svm (semi-supervised support vector machine or s3vm) implementation for python. Finally I found the implementation of Fabian Gieseke of Oldenburg University, Germany (code is here: https://www.ci.uni-oldenburg.de/60506.html, paper title: Fast and Simple Gradient-Based Optimization for Semi-Supervised Support Vector Machines).
I now try to integrate the learned model into my scikit-learn code.
1) This works already:
I've got a binary classification problem. I defined a new method inside the S3VM-code returning the self.__c-coeficients (these are needed for the decision function of the classifier).
I then assign these (in my own scikit-code where clf stands for a svm.SVC-classifier) to clf.dual_coefs_ and properly change clf.support_ too (which holds the indices of the support vectors). It takes a while because sometimes you need numpy-arrays and sometimes lists etc. But it works quite well.
2) This doesnt work at all:
I want to adapt this now to a multi-class-classification problem (again with an svm.SVC-classifier).
I see the scheme for multi-class dual_coef_ in the docs at
http://scikit-learn.org/stable/modules/svm.html
I tried some things already but seem to mess it up all the time. My strategy is as follows:
for all pairs in classes:
calculate the coefficients with qns3vm for the properly binarized labeled training set (filling 0s into spaces in the coef-vector where instances have been in the labeled training set that are not in the current class-pair) --> get a 1x(l+u)-np.array of coefficients
horizontally stack these to get a (n_class*(n_class-1)/2)x(l+u) matrix | I do not have a clue why the specs say that this should be of shape [n_class-1, n_SV(=l+u)]?
replace clf.dual_coef_ with this matrix
Does anybody know the right way to replace dual_coef_ in the multi-class-setting? Or is there a neat piece of example code someone can recommend? Or at least a better explanation for the shape of dual_coef_ in the one-vs-one-multiclass-setting?
Thanks!
Damian

How is Elastic Net used?

This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).
Resource for simple theory? Is there a simple and easy explanation somewhere about what it does, when and why reguarization is neccessary, and how to use it - for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that more simply the problem and solution?
How to use in sklearn? Is there a step by step example showing why elastic net is chosen (over ridge, lasso, or just simple OLS) and how the parameters are calculated? Many of the examples on sklearn just include alpha and rho parameters directly into the prediction model, for example:
from sklearn.linear_model import ElasticNet
alpha = 0.1
enet = ElasticNet(alpha=alpha, rho=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
However, they don't explain how these were calculated. How do you calculate the parameters for the lasso or net?

The documentation is lacking. I created a new issue to improve it. As Andreas said the best resource is probably ESL II freely available online as PDF.
To automatically tune the value of alpha it is indeed possible to use ElasticNetCV which will spare redundant computation as apposed to using GridSearchCV in the ElasticNet class for tuning alpha. In complement, you can use a regular GridSearchCV for finding the optimal value of rho. See the docstring of ElasticNetCV fore more details.
As for Lasso vs ElasticNet, ElasticNet will tend to select more variables hence lead to larger models (also more expensive to train) but also be more accurate in general. In particular Lasso is very sensitive to correlation between features and might select randomly one out of 2 very correlated informative features while ElasticNet will be more likely to select both which should lead to a more stable model (in terms of generalization ability so new samples).

I would point you towards this blog post: http://www.datarobot.com/blog/regularized-linear-regression-with-scikit-learn/.

I will try helping you out with the question 'What is ElasticNet?'
The Elastic-Net is a regularised regression method that linearly combines both penalties (i.e.) L1 and L2 of the Lasso and Ridge regression methods.
It is useful when there are multiple correlated features. The difference between Lass and Elastic-Net lies in the fact that Lasso is likely to pick one of these features at random while elastic-net is likely to pick both at once.
The below listed two links have got wonderful explanations for ElasticNet.
ElasticNet- TutorialsPoint
Lasso, Ridge and Elastic Net Regularization

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.