Is it safe to ignore the "ConvergenceWarning" in sklearn? - python

I'm using LassoCV in sklearn. There are always a lot of ConvergenceWarnings. I checked the results and they look good, so I was wondering if it is safe to simply ignore those warnings. A few thoughts I have are:
Does the warning happen because the magnitude of my response is too big? That would make the loss bigger, too.
In most cases, would it be solved by increasing the number of iterations? I'm hesitant to do that because it sometimes takes longer to run and the results don't improve.

tl;dr It is almost always fine; to be sure, watch the learning curve.
So, LassoCV implements Lasso regression, whose parameters are optimized via a kind of gradient descent (coordinate descent to be more precise, which is an even simpler method), and like all gradient methods this approach requires defining:
step size
stop criterion
The most popular stop criteria are probably:
a) a fixed number of steps (a good choice time-wise, since 1000 steps take exactly 1000x the time of 1 step, so it is easy to manage the time spent on training).
b) a fixed delta (difference) between the values of the loss function at step n and step n-1 (possibly better classification/regression quality).
The warning you observe appears because LassoCV uses the first criterion (a fixed number of steps) but also checks the second (delta): once the fixed number of steps is reached the algorithm stops, even if the delta criterion was never met, and the default value of delta (tol) is too small for most real datasets.
To be sure that you train your model long enough, you could plot a learning curve: the loss value after every 10-20-50 steps of training. Once it reaches a plateau, you are good to stop. A rough sketch of that is shown below.
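A minimal sketch of that idea, assuming a plain Lasso with warm_start so the fit can be resumed at increasing iteration budgets (the data, alpha and iteration grid below are illustrative, not taken from the question):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=50, noise=1.0, random_state=0)
alpha = 0.1
lasso = Lasso(alpha=alpha, warm_start=True)   # warm_start resumes from the previous coefficients

for n_iter in [10, 20, 50, 100, 200, 500, 1000]:
    lasso.set_params(max_iter=n_iter)
    lasso.fit(X, y)                           # may emit ConvergenceWarning for small budgets
    resid = y - lasso.predict(X)
    # sklearn's Lasso objective: 1/(2n) * ||y - Xw||^2 + alpha * ||w||_1
    objective = 0.5 * np.mean(resid ** 2) + alpha * np.sum(np.abs(lasso.coef_))
    print(n_iter, objective)                  # once this flattens out, more iterations won't help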

Related

"reg_alpha" parameter in XGBoost regressor. Is it bad to use high values?

I'm performing hyperparameter tuning with grid search and I realized that I was getting overfitting... I tried a lot of ways to reduce it, changing the "gamma", "subsample", and "max_depth" parameters, but I was still overfitting...
Then, I increased the "reg_alpha" parameter's value to > 30... and then my model reduced overfitting drastically. I know that this parameter refers to the L1 regularization term on weights, and maybe that's why it solved my problem.
I just want to know if there is any problem with using high values for reg_alpha like this?
I would appreciate your help :D
reg_alpha penalizes the features which increase the cost function. Meaning, it finds the features that don't increase accuracy. But this makes the prediction line smoother.
On some problems I also increase reg_alpha > 30 because it reduces both overfitting and test error.
But if it is a regression problem, its predictions will be close to the mean on the test set and it may not catch anomalies well.
So I would say you can increase it as long as your test accuracy doesn't begin to fall.
Lastly, when increasing reg_alpha, keeping max_depth small might be a good practice. A rough sketch of that combination is shown below.
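A rough sketch of trying higher reg_alpha values with a small max_depth, assuming xgboost's sklearn wrapper and synthetic data (the values are illustrative, not the asker's setup):
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0, 1, 10, 30, 50]:
    model = XGBRegressor(reg_alpha=alpha, max_depth=3, n_estimators=300,
                         learning_rate=0.1, random_state=0)
    model.fit(X_train, y_train)
    # keep increasing reg_alpha only while the test error does not start to rise
    print(alpha,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))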

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

Running the local binary patterns code from Adrian's tutorial. The program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running python2.7 with opencv3.7, what should I do?
When an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to poor scaling of the decision variables. There are a few things you can try.
1) Normalize your training data so that the problem hopefully becomes more well-conditioned, which in turn can speed up convergence. One possibility is to scale your data to zero mean and unit standard deviation using Scikit-Learn's StandardScaler (a combined code sketch follows this list). Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
2) Related to 1), make sure the other arguments, such as the regularization weight C, are set appropriately. C has to be > 0. Typically one tries various values of C on a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before fine-tuning it at finer granularity within a particular interval. These days, it probably makes more sense to tune parameters using, e.g., Bayesian optimization with a package such as Scikit-Optimize.
3) Set max_iter to a larger value. The default is 1000. This should be your last resort: if the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some inappropriate features or strong correlations in the features. Debug those first before taking this easy way out.
4) Set dual=True if the number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks to @Nino van Hooff for pointing this out, and @JamesKo for spotting my mistake.
5) Use a different solver, e.g., the L-BFGS solver if you are using logistic regression. See @5ervant's answer.
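A minimal sketch combining points 1), 2) and 4), assuming a LinearSVC on synthetic data (the dataset and C grid are illustrative); the Pipeline guarantees the scaler is fitted on the training folds only:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
use_dual = X.shape[1] > X.shape[0]        # dual formulation only when n_features > n_samples

pipe = make_pipeline(StandardScaler(), LinearSVC(dual=use_dual, max_iter=1000))
grid = GridSearchCV(pipe, {"linearsvc__C": [1e-3, 1e-2, 1e-1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)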
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters is not guaranteed to be any good; hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by @Nino van Hooff and @5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is greater than the number of training examples N. This is what the dual formulation of the SVM is particularly designed for, and it helps with the conditioning of the optimization problem. Credit to @5ervant for noticing and pointing this out.
Furthermore, @5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick, rough explanation for those who are interested (I am :)) of why this matters in this case. Second-order methods, and in particular approximate second-order methods like the L-BFGS solver, help with ill-conditioned problems because they approximate the Hessian at each iteration and use it to scale the gradient direction. This allows them to get a better convergence rate, but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish, but each iteration will be slower than a typical first-order method like gradient descent or its variants.
For example, a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second-order method, e.g., Newton's method, will have an update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
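To make that concrete, here is a small illustrative sketch (not tied to the SVM above) comparing plain gradient steps with a Newton step on an ill-conditioned quadratic f(x) = 0.5 * x^T A x:
import numpy as np

A = np.diag([1.0, 1000.0])               # condition number 1000
x_gd = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])

alpha = 1.0 / 1000.0                     # step size limited by the largest eigenvalue
for _ in range(100):
    x_gd = x_gd - alpha * (A @ x_gd)     # first-order update: very slow along the flat direction

x_newton = x_newton - np.linalg.inv(A) @ (A @ x_newton)   # one Newton step lands on the optimum

print(x_gd)      # first coordinate is still far from 0 after 100 steps
print(x_newton)  # [0. 0.]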
In particular, L-BFGS, mentioned in @5ervant's answer, is a way to approximate the inverse of the Hessian, as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., require fewer iterations) than first-order methods like the usual gradient-descent-based solvers, which, as you know by now, sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days, with many models optimizing non-convex problems (e.g., those in DL models), second-order methods such as L-BFGS play a different role there, and there is evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.
I reached the point where I set max_iter=1200000 on my LinearSVC classifier, but the "ConvergenceWarning" was still present. I fixed the issue by just setting dual=False and leaving max_iter at its default.
With a LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine reached max_iter=7600 before the "ConvergenceWarning" disappeared when training with a large dataset's features.
Explicitly specifying max_iter resolves the warning, as the default max_iter is 100 (for logistic regression):
logreg = LogisticRegression(max_iter=1000)
Please increase max_iter to 10000, as the default value is 1000. Possibly, increasing the number of iterations will help the algorithm converge. For me it converged and the solver was 'lbfgs':
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

Why is my Deep Q Net and Double Deep Q Net unstable?

I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both of the approaches are able to learn and solve this problem sometimes, but not always.
My network is simply a feed-forward network (I've tried using 1 and 2 hidden layers). I created one network in DQN, and two networks in DDQN: a target network to evaluate the Q value and a primary network to choose the best action; I train the primary network and copy it to the target network after some episodes.
The problem in DQN is:
Sometimes it can achieve the perfect 200 score within 100 episodes, but sometimes it gets stuck and only achieves a score of 10 no matter how long it is trained.
Also, in the case of successful learning, the learning speed differs.
The problem in DDQN is:
It can learn to achieve a 200 score, but then it seems to forget what it has learned and the score drops dramatically.
I've tried tuning batch size, learning rate, number of neurons in the hidden layer, the number of hidden layers, exploration rate, but instability persists.
Are there any rules of thumb on the size of the network and the batch size? I think a reasonably larger network and a larger batch size will increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!
These kinds of problems happen pretty often and you shouldn't give up. First, of course, you should do one or two more checks that the code is all right - try to compare your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is all fine - and, as you say the model can learn the task from time to time, it probably is - you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters like the exploration technique, the learning rate, the way you are updating the target networks, and the experience replay memory. I would not play around with the hidden layer sizes - find the values for which the model learned once and keep them fixed.
Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too. Starting with a low epsilon value may be the cause of the different learning speeds and success rates - if you go full random, you always populate your memory with similar kinds of transitions at the beginning. With lower epsilon rates at the start, there is a bigger chance that your model does not explore enough before the exploitation phase begins.
Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model not to escape back from global minima to some local, worse ones. Also, adaptive learning rates such as those calculated by Adam might help you. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
Target network update (rate and value): This is an important one as well. You have to experiment a bit - not only with how often you perform the update, but also with how much of the primary values you copy into the target ones. People often do a hard update each episode or so, but try doing soft updates instead if the first technique does not work (see the sketch after this list).
Experience replay: Do you use it? You should. How big is your memory size? This is a very important factor and the memory size can influence the stability and success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if it affects your learning curve a lot, try the technique proposed in the mentioned paper.
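A hedged sketch of two of the knobs above - epsilon decay with a floor and a soft (Polyak) target-network update - assuming tf.keras-style models exposing get_weights/set_weights; all constants are illustrative, not tuned values:
import numpy as np

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995   # start fully random, keep a floor
tau = 0.01                                               # fraction of primary weights copied per update

def select_action(q_values, n_actions, epsilon):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_values))

def soft_update(primary, target, tau):
    # target <- tau * primary + (1 - tau) * target, layer by layer
    mixed = [tau * w + (1.0 - tau) * tw
             for w, tw in zip(primary.get_weights(), target.get_weights())]
    target.set_weights(mixed)

# after each episode (or step):
# epsilon = max(epsilon_min, epsilon * epsilon_decay)
# soft_update(primary_network, target_network, tau)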
Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity
I was also thinking that the problem is the unstable (D)DQN, or that "CartPole" is bugged or "not stably solvable"!
After searching for a few weeks, I have checked my code several times, changed every setting but one...
The discount factor: setting it to 1.0 (really) stabilized my training much more on CartPole-v1 with 500 max steps.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
The creator has Gamma at 1.0 (I read about it on reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step and was "stable" in 7 out of 10 runs.
I spent a whole day solving this problem. The return climbs to above 400, and suddenly falls to 9.x.
In my case I think it's due to unstable gradients. The L2 norm of the gradients varies from 1 or 2 to several thousand.
I finally solved it. See whether it could help.
Clip the gradients before applying them, and use a learning rate decay schedule:
# inside the training step: `model`, `loss`, `tape` (a tf.GradientTape) and `total_steps`
# come from the surrounding training loop
variables = model.trainable_variables
grads = tape.gradient(loss, variables)
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)   # clip to a maximum global norm of 30
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)        # decaying learning rate
for g, var in zip(grads, variables):
    var.assign_sub(g * learning_rate)                     # manual SGD update
Use an exploration rate decay schedule:
epsilon = 0.85 ** math.log(total_steps + 1, 2)

How to take care of moving mean and moving variance using tf.nn.batch_normalization?

Somehow for my implementation, I have to define the weights first and cannot use high-level functions in tensorflow like tf.layers.batch_normalization or tf.layers.dense. So to do batch normalization I need to use tf.nn.batch_normalization. I know that for calculating the mean and variance of each minibatch I can use tf.nn.moments, but what about the moving mean and variance? Does anyone have experience with doing that or know an example implementation? I have seen people say that using tf.nn.batch_normalization can be tricky, so I want to know the complexity of doing that. In other words, what makes it tricky and what points should I be careful about during my implementation? Is there any other point besides the moving mean and variance that I should be aware of?
You have to be wary of the terms running_mean and running_variance. In mathematics and in traditional computer science, these refer to methods which compute such values without having seen the complete data; they are also known as online versions of the mean and variance. It is not that they can accurately determine the mean and variance beforehand; they simply keep on updating the values of the variables mean and variance as more data comes in. If your data size is finite, then once they have seen the complete data, their values would match the values one would compute if the complete data were made available.
The case of batch normalization is different. You should not think of running mean and running variance in the same way as in the above paragraph.
Training Time
During training, the mean and variance are computed for a batch. These are not a running mean or running variance. So, you can safely use tf.nn.moments to do that.
Testing Time
During testing time, you use what should be called the population_estimated_mean and population_estimated_variance. These quantities are computed during training but are not directly used there; they are computed for later use at testing time.
Now, one pitfall is that some people might want to use the Knuth formula for computing these quantities. This is not advisable. Why? Because training is done over several epochs, so the same dataset is seen as many times as the number of epochs. Since data augmentation is also usually random, computing the standard running mean and running variance could be dangerous. Instead, what is usually used is an exponentially decaying estimate.
You can achieve this by using tf.train.ExponentialMovingAverage over batch_mean and batch_variance. Here you specify how much relevance is given to past samples vis-a-vis present samples. Make sure the variables you use for computing this are non-trainable, by setting trainable=False.
During test time, you will use these variables as mean and variance.
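A rough TF1-style sketch of this pattern (graph mode; names, shapes and the decay value are illustrative): batch statistics during training, exponentially decayed population estimates at test time.
import tensorflow as tf  # assumes TF 1.x-style graph mode

def batch_norm(x, is_training, decay=0.99, eps=1e-5):
    depth = x.get_shape()[-1]
    beta = tf.Variable(tf.zeros([depth]), name="beta")
    gamma = tf.Variable(tf.ones([depth]), name="gamma")
    batch_mean, batch_var = tf.nn.moments(x, axes=[0])

    # Shadow variables holding the population estimates (non-trainable by default).
    ema = tf.train.ExponentialMovingAverage(decay=decay)

    def mean_var_with_update():
        ema_apply_op = ema.apply([batch_mean, batch_var])
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    # Training: use batch statistics (and update the EMA as a side effect).
    # Testing: use the decayed population estimates.
    mean, var = tf.cond(is_training,
                        mean_var_with_update,
                        lambda: (ema.average(batch_mean), ema.average(batch_var)))
    return tf.nn.batch_normalization(x, mean, var, beta, gamma, eps)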
For more details on implementation you can take a look at this link.

How to speed up Stochastic Gradient Descent?

I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (side note: several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time; I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something that I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat)), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
Increasing the step size (and decreasing n_iter), but this often makes the loss function explode
Changing the learning rate types (from inverse scaling to an amount based off of the number of iterations), and this also didn't seem to make a huge difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way too many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
You could try Vowpal Wabbit, which is an even faster implementation, but it shouldn't be necessary. A rough sketch of the scaling suggestions above is shown below.
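A hedged sketch of those suggestions - standardize the features, rescale the target to unit variance, and try a few initial learning rates while watching the objective - on a smaller synthetic dataset (all parameter values are illustrative; n_iter in older sklearn versions corresponds to max_iter here):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=20_000, n_features=500, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y_std = y.std()
y_scaled = y / y_std                       # remember to multiply predictions back by y_std

for eta0 in [1e-4, 1e-3, 1e-2]:
    sgd = SGDRegressor(penalty="l1", alpha=1e-4, learning_rate="invscaling",
                       eta0=eta0, max_iter=20, tol=None, random_state=0)
    sgd.fit(X, y_scaled)
    resid = y_scaled - sgd.predict(X)
    print(eta0, 0.5 * np.mean(resid ** 2))  # watch how quickly the objective drops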
