TensorFlow distributed training pauses after each epoch - python

I am training a neural network in parallel on 2 GPUs using the TensorFlow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete, whereas with 2 GPUs each epoch takes 13 seconds. This does not surprise me, since I know the scaling is not perfect because of the all_reduce overhead of updating the variables during training.
However, after each epoch of the distributed training there is a pause of about 8 seconds. When using a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training on multiple GPUs?
Alternatively, can anyone explain what happens differently at the end of an epoch in distributed training?

Apparently this had something to do with running TF in graph mode. Calling tf.compat.v1.enable_eager_execution() made the problem go away. It also fixed a memory leak that was causing issues, so perhaps the pause was caused by TF making copies of something I wasn't expecting.
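A minimal sketch of that fix, with a toy model and random data standing in for the original network; eager execution is enabled before anything is built under MirroredStrategy:
import numpy as np
import tensorflow as tf

# Enable eager execution before the strategy and the model are constructed,
# per the answer above.
tf.compat.v1.enable_eager_execution()

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Placeholder data; the pause after each epoch is what should disappear here.
X = np.random.rand(512, 32).astype("float32")
y = np.random.randint(0, 10, size=(512,))
model.fit(X, y, epochs=2, batch_size=64)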

Related

Graph initialized on every epoch after tensorflow upgrade

I've been running my TensorFlow code without any issues on TF2.4.
That is, training was slow in the first epoch, as I understand it because the graph was being built, and the following epochs executed fast.
Now I have upgraded to TF2.10, and on every epoch I get the message that the loop optimizer was skipped.
The message itself is not the issue; it is just an indicator that the initial graph construction now happens on every epoch.
As a result, every epoch of my training is now as slow as the first epoch was on TF2.4.
Does anyone know why this happens, and how to fix it?
I tried to disable the Grappler loop optimizer, but it did not resolve the issue.
I found the reason:
Because of memory leak problems in the past, I had created a callback with a tf.keras.backend.clear_session() call.
In TF2.10 this seems to destroy the graph, so it has to be rebuilt on every epoch.
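An illustrative reconstruction of that pattern (the callback name here is made up):
import tensorflow as tf

# Clearing the Keras session at the end of every epoch throws away the cached
# graph, so TF2.10 rebuilds it at the start of the next epoch.
class ClearSessionCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        tf.keras.backend.clear_session()

# Dropping this callback (or clearing the session only once, after training)
# brings back the usual "slow first epoch, fast later epochs" behaviour.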

Keras EarlyStopping not working, too few epochs

I am currently working on a multi-layer 1d-CNN. Recently I shifted my work over to an HPC server to train on both CPU and GPU (NVIDIA).
My code runs beautifully (albeit slowly) on my own laptop with TensorFlow 2.7.3. The HPC server I am using has newer versions of Python (3.9.0) and TensorFlow installed.
Onto my problem: the Keras callback EarlyStopping no longer works as it should on the server. If I set the patience to 5, training only runs for 5 epochs despite my specifying epochs=50 in model.fit(). It seems as if the callback assumes the val_loss of the first epoch is the lowest value and counts from there.
I don't know how to fix this. On my own laptop, training reaches the lowest val_loss around epoch 15 and runs to about epoch 20. On the server, the training time and number of epochs are not sufficient, and accuracy on the test dataset is very low (~40%).
Please help.
For some reason, reducing my batch_size from 16 to 8 in model.fit() allowed the EarlyStopping callback to work properly. I don't know why, but it works now.
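A minimal sketch of that setup, with a toy model and random data standing in for the original 1D CNN:
import numpy as np
from tensorflow import keras

# Toy data and model; placeholders for the original 1D CNN and dataset.
X = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=(256,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# patience=5 as in the question; batch_size reduced from 16 to 8 as in the answer.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model.fit(
    X, y,
    validation_split=0.2,
    epochs=50,
    batch_size=8,
    callbacks=[early_stop],
)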

Best way to import data in google-colaboratory for fast computing and training?

I am running a simple deep learning model on Google Colab, but it's running slower than my MacBook Air with no GPU.
I read this question and found out the problem is that the dataset is imported over the internet, but I am unable to figure out how to speed this up.
My model can be found here. Any idea how I can make the epochs faster?
My local machine takes 0.5-0.6 seconds per epoch, while Google Colab takes 3-4 seconds.
Is the GPU always faster than the CPU? No. Why? Because the speedup from a GPU depends on a few factors:
How much of your code actually runs in parallel, i.e. how much of it creates work that can execute concurrently. This is taken care of automatically by Keras and should not be a problem in your scenario.
The time spent moving data between CPU and GPU. This is where people often go wrong: it is assumed that the GPU will always outperform the CPU, but if the data being passed is too small, the computation itself takes less time than splitting the work up, executing it on the GPU, and recombining the results on the CPU.
The second scenario looks probable in your case, since you have used a batch_size of 5:
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5)
If your dataset is big enough, increasing the batch_size will increase the performance of the GPU over the CPU (see the sketch below).
Other than that, you have used a fairly simple model, and as @igrinis pointed out, the data is loaded only once from Drive into memory, so in theory loading time should not be the problem.
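A rough sketch of that suggestion, assuming the legacy scikit-learn wrapper implied by the snippet above (newer code would use the scikeras package instead) and a placeholder build_classifier:
from tensorflow import keras
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # legacy wrapper; removed in recent TF

def build_classifier():
    # Placeholder model; the real build_classifier comes from the question.
    model = keras.Sequential([
        keras.Input(shape=(11,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Larger batches give the GPU more work per host-to-device transfer.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=64)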

How do I find bottlenecks in my model?

I have been testing a word2vec model. For some reason this model doesn't use the GPU much. My performance is roughly 1 epoch every 30 seconds with a dataset of ~2000 samples.
This doesn't seem normal. There are researchers with gigabytes of training data, and I doubt they wait months for training to finish.
My GPU is a GTX 970. Memory usage is around 10% (note that I have a few other programs open too).
The problem might be the batching itself, although I am not sure.
Basically I run a method at the start of training to build a list of batches, and then while training I iterate over the entries in that list.
This is roughly how I do it.
Is my approach wrong? (I would guess it's not suitable for huge datasets.)
batch_method(batch_size=x)  # I tested different sizes, from 2 to 512; all seem to train fine.
for epoch in range(self.epochs_num):  # assuming epochs_num is the epoch count
    for batch in self.batch_list:
        for inputs, target in batch:
            ...
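One rough way to locate the bottleneck is to time the Python-side batch handling separately from the model update; the batch_list and train_step below are stand-ins for the user's own code:
import time

# Dummy stand-ins for the user's batch list and word2vec update step.
batch_list = [[(i, i % 2) for i in range(256)] for _ in range(8)]

def train_step(inputs, targets):
    pass  # the real model update (GPU work) would happen here

data_time = step_time = 0.0
for batch in batch_list:
    t0 = time.perf_counter()
    inputs, targets = zip(*batch)   # batch preparation / host-side work
    t1 = time.perf_counter()
    train_step(inputs, targets)     # model update
    t2 = time.perf_counter()
    data_time += t1 - t0
    step_time += t2 - t1

print(f"batch prep: {data_time:.3f}s  training: {step_time:.3f}s")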

Running Neural Networks Takes Longer Over Time

I'm running a series of neural networks (Keras library using a TensorFlow backend), and I have the following results for the time it took to train each neural network in a Jupyter Notebook:
ELAPSED TIME: 2.7005105018615723
0
ELAPSED TIME: 2.4810903072357178
1
ELAPSED TIME: 2.801435708999634
2
ELAPSED TIME: 2.6753993034362793
3
ELAPSED TIME: 2.8625667095184326
4
ELAPSED TIME: 2.5828065872192383
5
while later on you have:
ELAPSED TIME: 5.062163829803467
0
ELAPSED TIME: 5.162402868270874
1
ELAPSED TIME: 5.301288366317749
2
ELAPSED TIME: 5.386904001235962
3
ELAPSED TIME: 6.126806020736694
4
The program consists of a function that trains separate neural network models on their respective datasets and exports only their final training accuracies (saved to another file).
I thought the reason the later networks took longer to train was that the program was consuming too much memory, so I deleted the models (with the del keyword) after obtaining their training accuracy, but that doesn't seem to help much.
If I restart the Jupyter Notebook kernel, the time to run each network shortens back to about 2 seconds (the original duration), but the later models again take longer to run.
What could be the possible reasons for this, and what solutions could be implemented?
NOTE: I did not include any code because it would make this post more dense, but if necessary I can upload it.
Are you running on an NVIDIA GPU? If so, it's possible some part of the old model is still on the GPU. Try running nvidia-smi while the slow model is running and see if anything else is using GPU memory/resources.
If that doesn't work, you can also generate a TensorFlow timeline and compare the slow and fast runs. There is more info on how to generate the timeline in Keras here: https://github.com/tensorflow/tensorflow/issues/9868; the code below for creating the timeline was taken from that link.
import tensorflow as tf
from tensorflow.python.client import timeline

# Make your keras model
# ...
# (On TF2, use tf.compat.v1.RunOptions / tf.compat.v1.RunMetadata.)
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
model.compile(loss='MSE', optimizer='Adam', options=run_options, run_metadata=run_metadata)

# Run model in your usual way
# ...

# Dump the collected step stats as a Chrome trace (open it in chrome://tracing)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.ctf.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
For more info on the TensorFlow timeline, see https://stackoverflow.com/a/37774470/2826818. The timeline lets you see how much time each operation takes and determine which one is causing the slowdown.
You can clear your session after you are done with each model; this fixed the issue for me.
from keras import backend as K

for model in models_to_train:
    model.fit(X, y)
    # save necessary metrics
    K.clear_session()
(Also, I was getting a segmentation fault until I updated tensorflow-gpu to 1.9.0.)
