Running the code of linear binary pattern for Adrian. This program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running python2.7 with opencv3.7, what should I do?
Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. There are a few things you can try.
Normalize your training data so that the problem hopefully becomes more well
conditioned, which in turn can speed up convergence. One
possibility is to scale your data to 0 mean, unit standard deviation using
Scikit-Learn's
StandardScaler
for an example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
Related to 1), make sure the other arguments such as regularization
weight, C, is set appropriately. C has to be > 0. Typically one would try various values of C in a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before finetuning it at finer granularity within a particular interval. These days, it probably make more sense to tune parameters using, for e.g., Bayesian Optimization using a package such as Scikit-Optimize.
Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some in appropriate features or strong correlations in the features. Debug those first before taking this easy way out.
Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks #Nino van Hooff for pointing this out, and #JamesKo for spotting my mistake.
Use a different solver, for e.g., the L-BFGS solver if you are using Logistic Regression. See #5ervant's answer.
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters are not guaranteed to be any good, hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by #Nino van Hooff and #5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particular designed for and helps with the conditioning of the optimization problem. Credit to #5ervant for noticing and pointing this out.
Furthermore, #5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick rough explanation for those who are interested (I am :)) why this matters in this case. Second-order methods, and in particular approximate second-order method like the L-BFGS solver, will help with ill-conditioned problems because it is approximating the Hessian at each iteration and using it to scale the gradient direction. This allows it to get better convergence rate but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish but each iteration will be slower than a typical first-order method like gradient-descent or its variants.
For e.g., a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second order method, for e.g., Newton, will have an update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
In particular, L-BFGS mentioned in #5ervant's answer is a way to approximate the inverse of the Hessian as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., requires fewer iterations) than first-order methods like the usual gradient-descent based solvers, which as you guys know by now sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days with many models optimizing non-convex problems (e.g., those in DL models), second order methods such as L-BFGS methods plays a different role there and there are evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.
I reached the point that I set, up to max_iter=1200000 on my LinearSVC classifier, but still the "ConvergenceWarning" was still present. I fix the issue by just setting dual=False and leaving max_iter to its default.
With LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine have reached max_iter=7600 before the "ConvergenceWarning" disappears when training with large dataset's features.
Explicitly specifying the max_iter resolves the warning as the default max_iter is 100. [For Logistic Regression].
logreg = LogisticRegression(max_iter=1000)
Please incre max_iter to 10000 as default value is 1000. Possibly, increasing no. of iterations will help algorithm to converge. For me it converged and solver was -'lbfgs'
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)
Related
Anti-closing preamble: I have read the question "difference between penalty and loss parameters in Sklearn LinearSVC library" but I find the answer there not to be specific enough. Therefore, I’m reformulating the question:
I am familiar with SVM theory and I’m experimenting with LinearSVC class in Python. However, the documentation is not quite clear regarding the meaning of penalty and loss parameters. I recon that loss refers to the penalty for points violating the margin (usually denoted by the Greek letter xi or zeta in the objective function), while penalty is the norm of the vector determining the class boundary, usually denoted by w. Can anyone confirm or deny this?
If my guess is right, then penalty = 'l1' would lead to minimisation of the L1-norm of the vector w, like in LASSO regression. How does this relate to the maximum-margin idea of the SVM? Can anyone point me to a publication regarding this question? In the original paper describing LIBLINEAR I could not find any reference to L1 penalty.
Also, if my guess is right, why doesn't LinearSVC support the combination of penalty='l2' and loss='hinge' (the standard combination in SVC) when dual=False? When trying it, I get the
ValueError: Unsupported set of arguments
Though very late, I'll try to give my answer. According to the doc, here's the considered primal optimization problem for LinearSVC:
,phi being the Identity matrix, given that LinearSVC only solves linear problems.
Effectively, this is just one of the possible problems that LinearSVC admits (it is the L2-regularized, L1-loss in the terms of the LIBLINEAR paper) and not the default one (which is the L2-regularized, L2-loss).
The LIBLINEAR paper gives a more general formulation for what concerns what's referred to as loss in Chapter 2, then it further elaborates also on what's referred to as penalty within the Appendix (A2+A4).
Basically, it states that LIBLINEAR is meant to solve the following unconstrained optimization pb with different loss functions xi(w;x,y) (which are hinge and squared_hinge); the default setting of the model in LIBLINEAR does not consider the bias term, that's why you won't see any reference to b from now on (there are many posts on SO on this).
, hinge or L1-loss
, squared_hinge or L2-loss.
For what concerns the penalty, basically this represents the norm of the vector w used. The appendix elaborates on the different problems:
L2-regularized, L1-loss (penalty='l2', loss='hinge'):
L2-regularized, L2-loss (penalty='l2', loss='squared_hinge'), default in LinearSVC:
L1-regularized, L2-loss (penalty='l1', loss='squared_hinge'):
Instead, as stated within the documentation, LinearSVC does not support the combination of penalty='l1' and loss='hinge'. As far as I see the paper does not specify why, but I found a possible answer here (within the answer by Arun Iyer).
Eventually, effectively the combination of penalty='l2', loss='hinge', dual=False is not supported as specified in here (it is just not implemented in LIBLINEAR) or here; not sure whether that's the case, but within the LIBLINEAR paper from Appendix B onwards it is specified the optimization pb that's solved (which in the case of L2-regularized, L1-loss seems to be the dual).
For a theoretical discussion on SVC pbs in general, I found that chapter really useful; it shows how the minimization of the norm of w relates to the idea of the maximum-margin.
I'm using LassoCV in sklearn. There's always a lot of ConvergenceWarnings. I checked the results and they look good. So I was wondering if it is safe to simply ignore those warnings. A few thoughts I have are:
Does the warning happens because the magnitude of my response is too big? It will make the loss bigger, too.
In most cases, would it be solved when I increase the number of iterations? I'm hesitating to do it because sometimes it takes longer to run but the results didn't improve.
tl;dr It is fine almost always, to be sure watch learning curve.
So, LassoCV implements Lasso regression, which parameters are optimized via some kind of gradient descent (coordinate descent to be more precise, which is even simpler method) and as all gradient methods this approach requires defining:
step size
stop criterion
probably most popular stop criteria are:
a) fixed amount of steps (good choice time-wise, since 1000 steps takes exactly x1000 time compare to 1 step, so it is easy to manage the time spent on training).
b) fixed delta (difference) between values of a loss function during step n and n-1 (possibly better classification/regression quality)
The warning you observe is because LassoCV uses the the first criterion (fixed amount of steps), but also checks for the second (delta), once number of fixed steps is reached the algorithm stops, default value of delta is too small for most real datasets.
To be sure that you train your model long enough you could plot a learning curve: loss value after every 10-20-50 steps of the training, once it goes to a plateau you are good to stop.
I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.
A Gaussian curve can serve as a "low pass" filter, allowing only the samples close to its mean to "pass." In the context of Naive Bayes, assuming a Gaussian distribution is essentially giving more weights to the samples closer to the distribution mean. This might or might not be appropriate depending if what you want to predict follows a normal distribution.
The variable, var_smoothing, artificially adds a user-defined value to the distribution's variance (whose default value is derived from the training data set). This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.
I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In Stats, probability distribution function such as Gaussian depends on sigma^2 (variance); and the more variance between two features the less correlational and better estimator since naive Bayes as the model used is a iid (basically, it assume the feature are independent).
However, in terms computation, it is very common in machine learning that high or low values vectors or float operations can bring some errors, such as, "ValueError: math domain error". Which this extra variable may serve its purpose as a adjustable limit in case some-type numerical error occurred.
Now, it will be interesting to explore if we can use this value for further control such as avoiding over-fitting since this new self-epsilon is added into the variance(sigma^2) or standard deviations(sigma).
Somehow for my implementation, I have to define the weights first and can not use high level functions in tensorflow like tf.layers.batch_normalization or tf.layers.dense. So to do batch normalization I need to use tf.nn.batch_normalization. I know that for calculating mean and variance of each minibatch I can use tf.nn.moments, but what about moving mean and variance? Does anyone have experience with doing that or knows an example of implementation? I see that people talk about using tf.nn.batch_normalization could be tricky so I want to know the complexity of doing that. In other words, what makes it tricky and what points I should be careful about during my implementation? Is there any other point besides moving average and variance that I should be aware of?
You have to be wary of the terms running_mean and running_variance. In mathematics and in traditional cocmputer science, they are referred to as methods which compute these values without having seen the complete data. They are also known as online versions of mean and variance. Not that they are able to accurately determine the mean and variance beforehand. They simply keep on updaing the value of some variables mean and variance, as more data comes in. If your data size is finite, then once having seen the complete data, their values would match the values one would compute, if the complete data is made available.
The case of batch normalization is different. You should not think of running mean and running variance in the same way as in the above paragraph.
Training Time
During training, the mean and variance are computed for a batch. They is not running mean or running variance. So, you can safely use tf.nn.moments to do that.
Testing Time
During testing time, you use what should be called as population_estimated_mean and population_estimated_variance. These quantities are computed during training but are not directly used. They are computed for later use during testing time.
Now one pitfall is that, some people might want to use the Knuth Formula for computing these quantities. This is not advisable. Why? : Because, training is done over several epochs. So, the same dataset is seen for as many times as the number of epochs. Since data augmentation is also usually random, computing the standard running mean and running variance, could be dangerous. Instead what is usually used is an exponentially decaying estimate.
You can achieve this, by using tf.train.ExponentialMovingAverage over batch_mean and batch_variance. Here you specify that how much relevance will be given to past samples vis-a-vis present samples. Be sure that the variables you use for computing this should be non-trainable by setting trainable=False.
During test time, you will use these variables as mean and variance.
For more details on implementation you can take a look at this link.
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my clasifier spits out a .tsv containing :
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The most simple way to check your classifier "efficiency" is to perform a cross validation:
Take your data, lets call them X
Split X into K batches of equal sizes
For each i=1 to K:
Train your classifier on all batches but i'th
Test on i'th
Return the average result
One more important aspect - if your classifier uses any parameters, some constants, thresholds etc. which are not trained, but rather given by the user you cannot just select the ones giving the best results in the above procedure. This has to be somehow automatized in the "Train your classifier on all batches but i'th". In other words - you cannot use the testing data to fit any parameters to your model. Once done this, there are four possible outcomes:
Training error is low but is much lower than testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have a high bias because the model must be linear but lower variance in the optimal solution because of the high bias. This means that you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase / decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the close C gets to zero the greater weight is applied to regularization)
Regardless of the specifics, regularization works by reducing the variance in a model by biasing it to solutions with low regularization weight.
Finally, boosting is a method of training that mysteriously/magically does not overfit. Not sure if anyone has discovered why, but it is a process of combining high bias low variance simple learns into a high variance low bias model. Its pretty slick.