I've been running my TensorFlow code without any issues on TF 2.4.
Meaning, the first training epoch was slow, as I understand it because the graph was being built and initialized.
After that, the following epochs executed fast.
Now I've upgraded to TF 2.10, and on every epoch I get the message that the loop optimizer was skipped.
The message itself is not the issue, just an indicator that the initial graph work is now being redone on every epoch.
As a result, my training is now as slow on every epoch as it was in the first epoch with TF 2.4.
Does anyone know why this happens, and how to fix it?
I tried to disable the grappler loop optimizer but it did not resolve the issue.
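For reference, this is roughly how I tried to disable it, via the grappler options under tf.config.optimizer (the exact option key is from memory, so treat it as an assumption):

import tensorflow as tf

# turn off grappler's loop optimizer for graphs built after this call
tf.config.optimizer.set_experimental_options({'loop_optimization': False})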
I found the reason:
I had memory leak problems in the past and had created a callback with a tf.keras.backend.clear_session() call.
In TF 2.10 this seems to destroy the graph, so it has to be rebuilt on every epoch.
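For anyone hitting the same thing, the problematic pattern was roughly the following (a sketch rather than my exact code; that it ran in on_epoch_end is my assumption, but it matches the per-epoch slowdown):

import tensorflow as tf

class ClearSessionCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # works around the memory leak, but in TF 2.10 it also discards the
        # traced graph, so the next epoch has to rebuild it from scratch
        tf.keras.backend.clear_session()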
I am training a neural network in parallel on 2 GPUs using the Tensorflow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete whereas with 2 GPUs, each epoch takes 13 seconds to finish. I am not surprised at this since I know the scaling is not perfect due to the all_reduce overhead for updating the variables during training.
However, after each epoch of the distributed training, there is a pause of about 8 seconds. When using a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training distributed?
Alternatively, can anyone explain what happens differently in distributed training at the end of an epoch?
Apparently this had something to do with running TF in graph mode. By setting tf.compat.v1.enable_eager_execution() the problem went away. This also fixed a memory leak that was causing issues, so perhaps the pause was being caused by TF making copies of something that I wasn't expecting.
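For anyone else hitting this: the call has to happen before any graphs are built. A minimal sketch (build_model() below is just a placeholder for your own model construction):

import tensorflow as tf

tf.compat.v1.enable_eager_execution()  # must be called before any graph is built

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # placeholder for your own model-building code
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')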
I'm trying to train an SSD MobileNet v2 using the TensorFlow Object Detection API, with TensorFlow GPU. The training goes well and fast until the first checkpoint save (after some hundreds of steps), where it gets stuck after restoring the last checkpoint. GPU usage goes down and never comes back up, and sometimes Python itself crashes.
I'm running TensorFlow GPU on Windows 7, with an NVIDIA Quadro M4000 and CUDA 8.0 (the only version I managed to get working). The model is an SSD MobileNet v2 pretrained on COCO, using a very low batch size of 4.
The config file is the same as it comes from the TensorFlow Model Zoo, of course changing the paths, batch size, number of classes and number of steps, and adding shuffle: true to the training section.
I'm adding the terminal output below. This is where it gets stuck.
Has anyone experienced the same kind of problem, or any idea why this happens?
Thanks in advance.
I faced the same problem you describe. I waited a long time and found something interesting: I eventually got some evaluation results, and the training process continued after that. It seems the evaluation process takes a lot of time, and since it gives no output at the beginning, it just looks stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm trying...
I am working on an audio dataset to train a neural network using the TensorFlow library, but there is a weird issue that I can't figure out. I am following this blog, Urban Sound Classification; the only difference is that I have my own dataset.
Everything works fine if I have a small amount of data, about 30 audio files or so, but when I use the complete dataset my training code simply runs a couple of iterations, outputs the cost, and then that is about it: no error, exception or warning is thrown, and the TensorFlow session simply doesn't give any further results. Let's look at the code for a better explanation:
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        # one full-pass optimisation step per epoch
        _, cost = sess.run([optimizer, cost_function],
                           feed_dict={X: tr_features, Y: tr_labels})
        cost_history = np.append(cost_history, cost)

    # evaluate on the test set after training
    y_pred = sess.run(tf.argmax(y_, 1), feed_dict={X: ts_features})
    y_true = sess.run(tf.argmax(ts_labels, 1))
    print("Test accuracy: ", round(sess.run(accuracy,
          feed_dict={X: ts_features, Y: ts_labels}), 3))
So when I run the above code for training on the complete data (about 9000 files), it generates cost history for about 2 epochs and then stops generating it, but the code keeps executing as normal; the sess.run() calls just stop producing output. My guess is that the session stops due to some exception, but how do I debug this? I have nothing to go on. Can anyone advise on this?
Note: I am not sure if this is the right forum, but point me in the right direction and I will move the question if need be.
UPDATE 01:
So I have figured out some correlation between the amount of data, the learning rate and the error. Here is my understanding of what is happening. While prototyping I used a subset of my original data, about 10-15 files, for training with a learning rate of 0.01, and it worked well (as in, it completed all its epochs).
When I used 500 files for training it repeated the same behavior described in the original question (it would output 2 iterations and then, kaboom, no more output and no exception or error). I noticed that in those iterations the cost was increasing, so I tried lowering the learning rate, and voilà, it worked like a charm with a new learning rate of 0.001 (again, all epochs ran and successfully output the results).
Finally, when I ran the training on all of my data, about 9000 files, I observed the same behavior as previously discussed. So my question now is: how much should I lower the learning rate? What is the correlation between the learning rate and the amount of data?
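In case it helps anyone reproduce this, one check I'm adding is to test whether the cost has blown up to NaN/inf after each epoch (using the same names as the snippet above), since a diverging loss would explain the silent stall:

for epoch in range(training_epochs):
    _, cost = sess.run([optimizer, cost_function],
                       feed_dict={X: tr_features, Y: tr_labels})
    if np.isnan(cost) or np.isinf(cost):
        # the loss has diverged; a smaller learning rate is needed
        print("Cost diverged at epoch", epoch)
        break
    cost_history = np.append(cost_history, cost)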
With a Keras model, I've included the TensorBoard callback to generate log files to visualise later.
The problem is that if I train my model multiple times, it generates multiple log files, and the step number always restarts at 0 instead of continuing from the last step of the previous run.
This leads to an unusable graph in TensorBoard (screenshot below).
With raw TensorFlow, I've seen this can be solved by adding a "global_step" tensor to keep track of the epoch number between runs.
But how can I do this using Keras?
model.fit has an initial_epoch argument, 0 by default, that lets you tell the model which epoch it is starting at. You can use this to resume a previous training run.
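For example (the model, data names, log directory and epoch counts below are just illustrative, assuming tf.keras):

from tensorflow import keras

tensorboard_cb = keras.callbacks.TensorBoard(log_dir='logs')

# first run: trains epochs 0-9
model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_cb])

# later run: continues from epoch 10 up to epoch 20, so the epoch
# numbering in TensorBoard carries on instead of restarting at 0
model.fit(x_train, y_train, epochs=20, initial_epoch=10,
          callbacks=[tensorboard_cb])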
So I have a CNN implemented. I have made custom callbacks that are confirmed to be working, but I have an issue.
This is a sample output.
Example from iteration 5 (batch size of 10,000 for simplicity):
50000/60000 [========================>.....] - ETA: 10s ('new lr:', 0.01)
('accuracy:', 0.70)
I have 2 callbacks (tested to work as shown in the output):
(1) Changes the learning rate at each iteration. (2) Prints the accuracy at each iteration.
I have an external script that determines the learning rate by taking in the accuracy.
Question:
How can I make the accuracy at each iteration available so that an external script can access it? In essence, an accessible variable at each iteration. At the moment I'm able to access it only once the whole process is over, via AccuracyCallback.accuracy.
Problem
I can pass in a changing learning rate. But how do I get the accuracy back, as an accessible variable, at each iteration once that learning rate has been passed?
Example
My external script determines the learning rate at iteration 1: 0.01. How do I get the accuracy at iteration 1 as an accessible variable in my external script, instead of just a print statement?
You can create your own callback:

import keras

class AccCallback(keras.callbacks.Callback):
    def on_batch_end(self, batch, logs={}):
        accuracy = logs.get('acc')
        # pass accuracy to your 'external' script and set the new lr here
In order for logs.get('acc') to work, you have to tell Keras to monitor it:
model.compile(optimizer='...', loss='...', metrics=['accuracy'])
Lastly, note that the type of accuracy is ndarray here. Should it cause you any issue, I suggest wrapping it: float(accuracy).
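If your external script runs as a separate process, one simple option is to extend the callback so it dumps the latest value somewhere the script can read while training is still running. A sketch (AccToFileCallback and accuracy.txt are just placeholders):

import keras

class AccToFileCallback(keras.callbacks.Callback):
    def __init__(self, out_path='accuracy.txt'):  # placeholder path
        super(AccToFileCallback, self).__init__()
        self.out_path = out_path

    def on_batch_end(self, batch, logs={}):
        accuracy = logs.get('acc')
        if accuracy is not None:
            # overwrite the file with the latest per-batch accuracy so an
            # external script can poll it during training
            with open(self.out_path, 'w') as f:
                f.write(str(float(accuracy)))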