How to speed up up Stochastic Gradient Descent?

How to speed up up Stochastic Gradient Descent? - python

I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (sidenote; several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time, I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something that I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
Increasing the step size (and decreasing n_iter), but this often makes the loss function explode
Changing the learning rate types (from inverse scaling to an amount based off of the number of iterations), and this also didn't seem to make a huge difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.

Partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way to many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
You could try vopal wabbit, which is an even faster implementation, but it shouldn't be necessary.

Related

Is it safe to ignore the "ConvergenceWarning" in sklearn?

I'm using LassoCV in sklearn. There's always a lot of ConvergenceWarnings. I checked the results and they look good. So I was wondering if it is safe to simply ignore those warnings. A few thoughts I have are:
Does the warning happens because the magnitude of my response is too big? It will make the loss bigger, too.
In most cases, would it be solved when I increase the number of iterations? I'm hesitating to do it because sometimes it takes longer to run but the results didn't improve.

tl;dr It is fine almost always, to be sure watch learning curve.
So, LassoCV implements Lasso regression, which parameters are optimized via some kind of gradient descent (coordinate descent to be more precise, which is even simpler method) and as all gradient methods this approach requires defining:
step size
stop criterion
probably most popular stop criteria are:
a) fixed amount of steps (good choice time-wise, since 1000 steps takes exactly x1000 time compare to 1 step, so it is easy to manage the time spent on training).
b) fixed delta (difference) between values of a loss function during step n and n-1 (possibly better classification/regression quality)
The warning you observe is because LassoCV uses the the first criterion (fixed amount of steps), but also checks for the second (delta), once number of fixed steps is reached the algorithm stops, default value of delta is too small for most real datasets.
To be sure that you train your model long enough you could plot a learning curve: loss value after every 10-20-50 steps of the training, once it goes to a plateau you are good to stop.

RandomizedSearchCV in parallel: Too many dispatched jobs causing NaN loss?

Just a heads up: I'm fairly new to the worlds of both machine learning and parallel computing. I'll do my best to use the right terminology, but friendly corrections would be much appreciated.
I've been trying to tune hyperparameters for a Keras MLP (running on top of TensorFlow) by using sklearn's RandomizedSearchCV, wrapping my model in KerasClassifier as per numerous tutorials. I've been trying to parallelize this process via RandomizedSearchCV's built-in system for this. This was working more or less fine, but then when I started using around 6 threads the program would still run, but I would start getting NaN losses, which would lead to errors after the search had concluded.
Now, I know there are a bunch of usual suspects for gradient blow-ups, but there was something interesting here: this problem went away when I reduced pre_dispatch sufficiently (in this case down to n_jobs instead of 2*n_jobs).
Is there any reason why this would happen? My datasets are pretty large, but it seems that once the searching begins, each job is only using about 1-2% of available memory (also, this percent does not change when I cut the pre-dispatched jobs?), and I'm not getting any other memory use issues.
Secondly, on a more practical note, is there any easy way to get my code to break in RandomSearchCV as soon as an NaN loss comes up? It would save some time as opposed to letting it keep running until the end of the search--which, when fully implemented, will probably take a day or more--and then throw the error. I was also thinking of changing error_score to -1 or something. Would this actually be better? I think it is probably worth knowing that certain combinations of hyperparameters lead to gradient blow-ups, but not if its only because of this parallelization issue.

Does pack-pad sacrifices accuracy?

I tried pack-pad technique. It does reduce many computing time a lot, but my accuracy is significantly worse than former one.
It sounds make sense with approximation theory that keep longer computing time you will get more fine value. But this time it is neural network. I don't think training by embedded zero vector will get me more accuracy. Please correct me if I am wrong.
Here is my files. If you would like to see my code.
This is plain one with 50% accuracy
pack-pad with 25% accuracy
It is not a homework assignment. It is my self-study.
Questions:
1. Am I get a correct result?
2. Does pack-pad sacrifices accuracy?
PS:
I feel my question would fit to stackoverflow most than Datascience or CodeReview. Please let me move it if need.

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

Running the code of linear binary pattern for Adrian. This program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running python2.7 with opencv3.7, what should I do?

Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. There are a few things you can try.
Normalize your training data so that the problem hopefully becomes more well
conditioned, which in turn can speed up convergence. One
possibility is to scale your data to 0 mean, unit standard deviation using
Scikit-Learn's
StandardScaler
for an example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
Related to 1), make sure the other arguments such as regularization
weight, C, is set appropriately. C has to be > 0. Typically one would try various values of C in a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before finetuning it at finer granularity within a particular interval. These days, it probably make more sense to tune parameters using, for e.g., Bayesian Optimization using a package such as Scikit-Optimize.
Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some in appropriate features or strong correlations in the features. Debug those first before taking this easy way out.
Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks #Nino van Hooff for pointing this out, and #JamesKo for spotting my mistake.
Use a different solver, for e.g., the L-BFGS solver if you are using Logistic Regression. See #5ervant's answer.
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters are not guaranteed to be any good, hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by #Nino van Hooff and #5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particular designed for and helps with the conditioning of the optimization problem. Credit to #5ervant for noticing and pointing this out.
Furthermore, #5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick rough explanation for those who are interested (I am :)) why this matters in this case. Second-order methods, and in particular approximate second-order method like the L-BFGS solver, will help with ill-conditioned problems because it is approximating the Hessian at each iteration and using it to scale the gradient direction. This allows it to get better convergence rate but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish but each iteration will be slower than a typical first-order method like gradient-descent or its variants.
For e.g., a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second order method, for e.g., Newton, will have an update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
In particular, L-BFGS mentioned in #5ervant's answer is a way to approximate the inverse of the Hessian as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., requires fewer iterations) than first-order methods like the usual gradient-descent based solvers, which as you guys know by now sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days with many models optimizing non-convex problems (e.g., those in DL models), second order methods such as L-BFGS methods plays a different role there and there are evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.

I reached the point that I set, up to max_iter=1200000 on my LinearSVC classifier, but still the "ConvergenceWarning" was still present. I fix the issue by just setting dual=False and leaving max_iter to its default.
With LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine have reached max_iter=7600 before the "ConvergenceWarning" disappears when training with large dataset's features.

Explicitly specifying the max_iter resolves the warning as the default max_iter is 100. [For Logistic Regression].
logreg = LogisticRegression(max_iter=1000)

Please incre max_iter to 10000 as default value is 1000. Possibly, increasing no. of iterations will help algorithm to converge. For me it converged and solver was -'lbfgs'
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

Large memory consumption with Ridge

I'm trying to do cross-validation with scikit-learn, and I'm running into some memory issues that are hard to figure out.
Basically, I've found that when I increase the number of hyperparameters searched, or when I increase the number of cross-validation loops for a GridSearchCV object, I get a nearly linear increase in memory consumption. This makes for dangerously high memory consumption if I use large enough matrices.
Here's a little gist of what I'm talking about:
http://nbviewer.ipython.org/gist/choldgraf/6a7be7866f2a3a3d3f98
Does anyone know why this might be happening? It seems that GridSearchCV is basically just looping through the cv object and the model parameter options in a list comprehension-style. It doesn't seem like that should increase memory usage...
UPDATE: After looking into this a bit more, it turns out that the problem isn't with GridSearchCV, but rather with a few of the solvers in Ridge (I have updated the gist accordingly). It may be a problem with the scipy linear algebra libraries, see issue here

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.