I am currently working on a multi-layer 1D CNN. Recently I moved my work over to an HPC server so I can train on both CPU and GPU (NVIDIA).
My code runs beautifully (albeit slowly) on my own laptop with TensorFlow 2.7.3. The HPC server I am using has a newer version of Python (3.9.0) and of TensorFlow installed.
On to my problem: the Keras callback "EarlyStopping" no longer works as it should on the server. If I set the patience to 5, training only runs for 5 epochs despite specifying epochs = 50 in model.fit(). It seems as if the callback treats the val_loss of the first epoch as the lowest value and counts from there.
I don't know how to fix this. On my own laptop, training would reach its lowest val_loss at 15 epochs and run to 20 epochs. On the server, training stops too early, leaving very low accuracy (~40%) on the test dataset.
Please help.
For some reason, reducing my batch_size from 16 to 8 in model.fit() allowed the EarlyStopping callback to work properly. I don't know why, but it works now.
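To make the two behaviours in the question concrete, here is a simplified pure-Python model of a patience rule. It is a sketch, not the Keras source: the function name epochs_run and both loss curves are made up for illustration, and the exact epoch at which Keras stops may differ by one depending on the version. It only shows that if val_loss never beats its first-epoch value, a patience of 5 ends training almost immediately, whereas a curve that keeps improving until epoch 15 runs to epoch 20.

```python
def epochs_run(val_losses, patience, min_delta=0.0):
    """Return how many epochs run before a patience-based stop.

    Simplified model (an assumption, not the Keras implementation):
    training stops once `patience` consecutive epochs pass without
    val_loss improving on the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best value, reset the counter
            wait = 0
        else:
            wait += 1        # one more epoch without improvement
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation-loss curves, 50 epochs each.
improving = [1.0 / e for e in range(1, 16)] + [0.5] * 35  # best at epoch 15
stuck = [0.7] + [0.9] * 49                                 # best at epoch 1

laptop_like = epochs_run(improving, patience=5)
server_like = epochs_run(stuck, patience=5)
print(laptop_like, server_like)
```

If the server run looks like the second curve, the callback is doing what it was asked to do, and the thing to investigate is why val_loss stops improving after the first epoch there (which a change in batch size could plausibly affect).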
Related
I am training a neural network in parallel on 2 GPUs using the TensorFlow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete, whereas with 2 GPUs each epoch takes 13 seconds. I am not surprised by this, since I know the scaling is not perfect due to the all_reduce overhead for updating the variables during training.
However, after each epoch of the distributed training, there is a pause of about 8 seconds. When using a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training distributed?
Alternatively, can anyone explain what happens differently in distributed training at the end of an epoch?
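A quick bit of arithmetic on the numbers above shows why the pause matters. The epoch times are the ones quoted in the question; treating the single-GPU pause as exactly 1 second is an assumption (the question only says "less than 1 second").

```python
# Seconds per epoch, taken from the question.
single_epoch, dual_epoch = 19.0, 13.0
# Pause after each epoch; 1.0 is an assumed upper bound for the single-GPU case.
single_pause, dual_pause = 1.0, 8.0

speedup = single_epoch / dual_epoch   # how much faster the epoch itself is
efficiency = speedup / 2              # fraction of ideal 2x scaling

single_total = single_epoch + single_pause
dual_total = dual_epoch + dual_pause

print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
print(f"wall time per epoch incl. pause: 1 GPU {single_total:.0f}s, 2 GPUs {dual_total:.0f}s")
```

Under these assumptions the epoch itself is about 1.46x faster on 2 GPUs, but once the pause is included the 2-GPU run is no faster per epoch than the single-GPU run.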
Apparently this had something to do with running TF in graph mode. After calling tf.compat.v1.enable_eager_execution() the problem went away. This also fixed a memory leak that was causing issues, so perhaps the pause was caused by TF making copies of something that I wasn't expecting.
I am training a neural network on the Google Colab CPU with the fit_generator method. (I cannot use a GPU because of another issue: FileNotFoundError: No such file: -> Error occuring only on GPU, not on CPU.)
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 128,
                    validation_steps=num_val_samples // 128,
                    epochs=10,
                    use_multiprocessing=True,
                    workers=6)
Training for the first epoch seems to run fine, but the second epoch does not start. The notebook does not crash and the cell keeps running, yet the second epoch never begins.
Is there something wrong with my code?
Heyy
The epoch looks stuck because Keras is most likely still calculating the validation loss and metrics. This is common: you can see training progress but not validation progress, unless you build a custom callback for it.
The issue with your fit_generator call is that you don't seem to have understood how to use steps_per_epoch and validation_steps. Unless your validation and training data have the same size (number of images), they can't have the same number of steps. (I mean they "can", but you know what I mean.)
I really recommend using a GPU for this kind of data, since it takes too long on CPU. Try debugging your GPU error, because the GPU is well worth it.
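As a rough illustration of the steps_per_epoch / validation_steps point, each count is derived from its own dataset size and the batch size. This is only a sketch: the helper name steps_for and the sample counts are hypothetical, and whether to round down or up depends on whether your generator yields the final partial batch.

```python
import math


def steps_for(num_samples, batch_size, drop_remainder=False):
    """Number of batches needed to cover num_samples once.

    drop_remainder=True mirrors the `//` used in the question and skips
    the final partial batch; False rounds up so every sample is seen.
    """
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset sizes, for illustration only.
num_train_samples, num_val_samples = 10_000, 1_500
batch_size = 128

train_steps = steps_for(num_train_samples, batch_size)
val_steps = steps_for(num_val_samples, batch_size)
print(train_steps, val_steps)
```

With these made-up sizes the training set needs 79 steps and the validation set 12, so the two values normally differ. Asking for more steps than a finite generator can deliver is one way a run can appear to stall.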
So I have the following model for sentiment analysis (using pre-trained word embeddings):
As you can see, I have a pre-trained embedding matrix and only about 500k trainable parameters. So why does it take an eternity to train this model? The batch size is 128 and the number of epochs is 25. The ETA for the first epoch is about 10 minutes, and I haven't even completed that.
Just to mention, I am not using CUDA or anything. I don't think I have a GPU-enabled TensorFlow. I'm willing to do anything to increase the speed. I have TensorFlow 2.1.0.
The answer is in your own question: "I am not using CUDA or anything." Training on CPU is much slower than on GPU. If you don't have a sufficiently powerful graphics card, you can use a service such as Google Colab or Kaggle.
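If you are not sure whether your TensorFlow install can see a GPU at all, a small check like the following can tell you. It is a minimal sketch: the helper name visible_gpu_count is made up, and it relies on tf.config.list_physical_devices, which I believe is available from TensorFlow 2.1 onwards.

```python
def visible_gpu_count():
    """Return the number of GPUs TensorFlow can see, or None if TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU"))


count = visible_gpu_count()
if count is None:
    print("TensorFlow is not installed in this environment")
elif count == 0:
    print("TensorFlow is running on CPU only")
else:
    print(f"TensorFlow can see {count} GPU(s)")
```

If this reports CPU only, a slow first epoch is expected and moving to a GPU runtime (for example on Colab or Kaggle) is the most direct way to speed things up.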
I'm currently learning TensorFlow using the fashion_mnist dataset. I created a simple neural network with 3 layers, trained it for 10 epochs and then evaluated it on unseen data.
My issue arises when I run the script in the terminal (Windows). It displays the progress of each epoch with the "loading bar" represented by:
"[===========>.....] "
But once the training finishes, the terminal screen is completely filled with "================" and then, at the very end, the result.
Mine:
https://imgur.com/a/KUY8QjQ
Expected:
https://imgur.com/a/P3rh7yA
This is a problem because I cannot analyze the progression of the model over the epochs.
This is using TensorFlow 2.0 on Python 3.7, 64-bit, Windows 10.
Any help appreciated.
I ran into the same issue and found your question, but no answers.
I took a guess and added this to my evaluate:
results = model.evaluate(test_data, test_labels, verbose=0)
Setting verbose=0 seems to have resolved the issue for me. No more equal signs.
I'm running the TensorFlow tf.estimator Quickstart example on a Raspberry Pi 3. It works well, but prediction is very slow.
The instruction predictions = list(classifier.predict(input_fn=predict_input_fn)) takes seconds to return.
This is a problem for me because I want to predict a single tuple immediately after I receive it from my sensors, so I cannot batch the predictions.
My model_dir is a tmpfs folder (i.e. in RAM), so I don't think it's related to IO latency. Perhaps the network is rebuilt on every call, I don't know.
I'm probably doing something wrong. Do you know how to run TensorFlow predictions on the same classifier in a short time?