Will probably spelunk more into the Catboost implementation to figure out what's going on, but wanted to check in with the SO community first to make sure I'm not doing anything stupid before I waste my time. I was trying to test out Catboost's early_stopping rounds to speed up parameter search, and was surprised to see that even when I raised up learning rates to stupidly high values the model still fit through all iterations!
Asking for just a quick confirmation that my code looks ok/if anyone's had a similar experience working with Catboost. I've confirmed here that the loss value here changes sporadically as expected, but the fitting continues for the 10 iterations.
Probably the fastest self-answer, but leaving it up here since this wasn't clear to me. Early stopping rounds ONLY works if you have eval_set specified. The more you know.
Related
May I know should I stop the training process at epoch 19, since the overfitting happens right after that.
Or it is still ok to be used since the difference is not too big.
I would like to ask for clarification regarding how small is the difference should we stop the training process.
Thank you.
Graph of "Loss vs Epoch" and "Acc vs Epoch"
i guess, i'd like to know more about your problem, your network architecture and so on, but given what you provided, i would test first with different batch sizes before making any decision about stopping at epoch 19. Take a look at this.
I think you should study early stopping; this article can help you learn more about Early Stopping to avoid overfitting. It can help you not worry about the number of epochs and the value of the loss.
Just a heads up: I'm fairly new to the worlds of both machine learning and parallel computing. I'll do my best to use the right terminology, but friendly corrections would be much appreciated.
I've been trying to tune hyperparameters for a Keras MLP (running on top of TensorFlow) by using sklearn's RandomizedSearchCV, wrapping my model in KerasClassifier as per numerous tutorials. I've been trying to parallelize this process via RandomizedSearchCV's built-in system for this. This was working more or less fine, but then when I started using around 6 threads the program would still run, but I would start getting NaN losses, which would lead to errors after the search had concluded.
Now, I know there are a bunch of usual suspects for gradient blow-ups, but there was something interesting here: this problem went away when I reduced pre_dispatch sufficiently (in this case down to n_jobs instead of 2*n_jobs).
Is there any reason why this would happen? My datasets are pretty large, but it seems that once the searching begins, each job is only using about 1-2% of available memory (also, this percent does not change when I cut the pre-dispatched jobs?), and I'm not getting any other memory use issues.
Secondly, on a more practical note, is there any easy way to get my code to break in RandomSearchCV as soon as an NaN loss comes up? It would save some time as opposed to letting it keep running until the end of the search--which, when fully implemented, will probably take a day or more--and then throw the error. I was also thinking of changing error_score to -1 or something. Would this actually be better? I think it is probably worth knowing that certain combinations of hyperparameters lead to gradient blow-ups, but not if its only because of this parallelization issue.
I tried pack-pad technique. It does reduce many computing time a lot, but my accuracy is significantly worse than former one.
It sounds make sense with approximation theory that keep longer computing time you will get more fine value. But this time it is neural network. I don't think training by embedded zero vector will get me more accuracy. Please correct me if I am wrong.
Here is my files. If you would like to see my code.
This is plain one with 50% accuracy
pack-pad with 25% accuracy
It is not a homework assignment. It is my self-study.
Questions:
1. Am I get a correct result?
2. Does pack-pad sacrifices accuracy?
PS:
I feel my question would fit to stackoverflow most than Datascience or CodeReview. Please let me move it if need.
The TensorFlow tutorial for using CNN for the cifar10 data set has the following advice:
EXERCISE: When experimenting, it is sometimes annoying that the first training step can take so long. Try decreasing the number of images that initially fill up the queue. Search for NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN in cifar10.py.
In order to play around with it, I tried decreasing this number by a lot but it doesn't seem to change the training time. Is there anything I can do? I tried even changing it to something as low as 5 and the training session still continued very slowly.
Any help would be appreciated!
Note that this exercise only speeds up the first step time by skipping the prefetching of a larger from of the data. This exercise does not speed up the overall training
That said, the tutorial text needs to be updated. It should read
Search for min_fraction_of_examples_in_queue in cifar10_input.py.
If you lower this number, the first step should be much quicker because the model will not attempt to prefetch the input.
I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (sidenote; several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time, I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something that I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
Increasing the step size (and decreasing n_iter), but this often makes the loss function explode
Changing the learning rate types (from inverse scaling to an amount based off of the number of iterations), and this also didn't seem to make a huge difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
Partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way to many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
You could try vopal wabbit, which is an even faster implementation, but it shouldn't be necessary.