Just a heads up: I'm fairly new to the worlds of both machine learning and parallel computing. I'll do my best to use the right terminology, but friendly corrections would be much appreciated.
I've been trying to tune hyperparameters for a Keras MLP (running on top of TensorFlow) by using sklearn's RandomizedSearchCV, wrapping my model in KerasClassifier as per numerous tutorials. I've been trying to parallelize this process via RandomizedSearchCV's built-in system for this. This was working more or less fine, but then when I started using around 6 threads the program would still run, but I would start getting NaN losses, which would lead to errors after the search had concluded.
Now, I know there are a bunch of usual suspects for gradient blow-ups, but there was something interesting here: this problem went away when I reduced pre_dispatch sufficiently (in this case down to n_jobs instead of 2*n_jobs).
Is there any reason why this would happen? My datasets are pretty large, but it seems that once the searching begins, each job is only using about 1-2% of available memory (also, this percent does not change when I cut the pre-dispatched jobs?), and I'm not getting any other memory use issues.
Secondly, on a more practical note, is there any easy way to get my code to break in RandomSearchCV as soon as an NaN loss comes up? It would save some time as opposed to letting it keep running until the end of the search--which, when fully implemented, will probably take a day or more--and then throw the error. I was also thinking of changing error_score to -1 or something. Would this actually be better? I think it is probably worth knowing that certain combinations of hyperparameters lead to gradient blow-ups, but not if its only because of this parallelization issue.
Related
I'm using Tensorflow 2.5 to train a starGAN network for generating images (128x128 jpeg). I am using tf.keras.preprocessing.image_dataset_from_directory to load the images from the subfolders.
Additionally I am using arguments to maximize loading performance as suggested in various posts and threads such as loadedDataset.cache().repeat.prefetch
I'm also using the num_parallel_calls=tf.data.AUTOTUNE for the mapping functions for post-processing the images after loading.
While training the network on GPU the performance I am getting for GPU Utilization is in the picture attached below.
My question regarding this are:
Is the GPU utlization normal or is it not supposed to be so erratic for traning GANs?
Is there any way to make this performance more consistent?
Is there any way to improve the training performance to fully utlize the GPU?
Note that Ive logged my disk I/O also and there is no bottleneck reading/writing from the disk (nvme ssd).
The system has 32GB RAM and a RTX3070 with 8GB Vram. I have tried the running it on colab also; but the performance was similarly erratic.
It is fairly normal for utilization to be erratic like for any kind of parallelized software, including training GANs. Of course, it would be better if you could fully utilize your GPU, but writing software that does this is challenging and becomes virtually impossible when you are talking about complex applications like GANs.
Let me try to demonstrate with a trivial example. Say you have two threads, threadA and threadB. threadA is running the following python code:
x = some_time_comsuming_task()
y = get_y_from_threadB()
print(x+y)
Here threadA is performing lots of calculations to get the value for x, retrieving the value for y, and printing out the sum of x+y. Imagine threadB is also doing some kind of time consuming calculation to generate the value for y. Unless threadA is ready to retrieve y at the exact same time threadB finishes calculating it, you won't have 100% utilization of both threads for the entire duration of the program. And this is just two threads, when you have 100s of threads working together with multiple chained data dependencies, you can see how it becomes exponentially more difficult to eliminate any and all time threads spend waiting on other threads to deliver input to the next step of computation.
Trying to make your "performance more consistent" is pointless. Whether your GPU utilization went up and down (like in the graph you shared) or it stayed exactly at the average utilization for the entire execution would not change the overall execution time, which is probably the actually important metric here. Utilization is mostly useful to identify where you can optimize your code.
Fully utilize? Probably not. As explained in my answer to question one, it's going to be virtually impossible to orchestrate your GAN to completely remove bottlenecks. I would encourage you to try and improve execution time, rather than utilization, when optimizing your GAN. There's no magic setting that you're missing that will completely unlock all of your GPU's potential.
Will probably spelunk more into the Catboost implementation to figure out what's going on, but wanted to check in with the SO community first to make sure I'm not doing anything stupid before I waste my time. I was trying to test out Catboost's early_stopping rounds to speed up parameter search, and was surprised to see that even when I raised up learning rates to stupidly high values the model still fit through all iterations!
Asking for just a quick confirmation that my code looks ok/if anyone's had a similar experience working with Catboost. I've confirmed here that the loss value here changes sporadically as expected, but the fitting continues for the 10 iterations.
Probably the fastest self-answer, but leaving it up here since this wasn't clear to me. Early stopping rounds ONLY works if you have eval_set specified. The more you know.
I'm training a mask-r-cnn network, which is built on tensorflow and keras. I'm searching for a way to reduce training time, so I thought implementing it with tensorflow-distributed.
I've been working with mask-r-cnn for some time, but it seems what I'm trying to do will require me to modify the source code of mask-r-cnn, which is above my current skills.
So, my question is, has someone ever done it, or something similar? is it possible at all, or am I misunderstand the use of tensorflow-distributed.
Thanks ahead.
Even without distributed training, the tensorpack implementation of Mask R-CNN in tensorflow runs 5x faster (and more accurate as well) than the one you linked to.
It also supports distributed training with MPI.
I need to fit a deep neural network to data coming from a data generating process, think of an AR(5). So I have five features per observation and one y for some large number N observations in each simulation. I am interested only in the root mean squared error of the best performing DNN in each simulation.
Since it's a simulation setting, I have to do a large number of these simulations and within each simulation fit a neural network to the data. The only reasonable way I can think of doing this is fit the DNN via hyper-parameter optimisation given each simulation (dlib's find_min_global will be my optimiser).
Does it make sense to do this exercise in C++ (slow development because I am not proficient) or Python (faster iteration because I am fairly proficient).
From where I am sitting, C++ or Python might not make much of a difference in execution time, because the model has to be compiled each time the optimiser proposes a new hyper-parameter vector (am I wrong here?).
If it is possible to compile once, and test all hyper-parameters between the lower and upper bounds, then C++ would be my go to solution(Is this possible in any of the open source DNN languages?).
If anyone has done this exercise before, please advice.
Thank you all for your help.
See looking at your problem, one way to implement this is to use genetic/evolutionary algorithm. Considering that I understood your problem correctly, you want to sweep through all the hyper-parameters to get the get the best solution.
So, I would recommend using python for this and tensorflow, keras all support this. So this might not be a problem.
Note - If I understood your question differently, then please feel free to correct me.
I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (sidenote; several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time, I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something that I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
Increasing the step size (and decreasing n_iter), but this often makes the loss function explode
Changing the learning rate types (from inverse scaling to an amount based off of the number of iterations), and this also didn't seem to make a huge difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
Partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way to many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
You could try vopal wabbit, which is an even faster implementation, but it shouldn't be necessary.