I am training an English-Vietnamese NMT model using fairseq.
fairseq reports that it is training the model on 1 GPU. However, when I check the GPU it does not seem to be used, and the training process is very slow.
screenshot: GPU usage
Training on a 63k-sentence corpus: an epoch takes about 1 hour (model: fconv).
Training on a 233k-sentence corpus: an epoch takes about 4 hours (model: transformer).
screenshot: console log
My GPU is an NVIDIA GeForce GTX 1050 and the CUDA version is 10.2.
Am I actually training the model on the GPU?
I would be glad to see any solutions or suggestions.
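For reference, here is a minimal sanity check I can run (a sketch, assuming a standard PyTorch install, since fairseq runs on PyTorch) to confirm that the GPU is visible at all:

import torch

# If any of these print False/0 or raise an error, fairseq is almost
# certainly falling back to the CPU.
print(torch.cuda.is_available())      # expect True
print(torch.cuda.device_count())      # expect 1 for a single GTX 1050
print(torch.cuda.get_device_name(0))  # expect something like "GeForce GTX 1050"

While training runs, nvidia-smi should also show a python process with memory allocated and non-zero GPU utilization.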
I want to train a network using multiple GPUs (2x NVIDIA RTX A6000) on a Windows 11 machine.
I tried copying the multi-GPU and distributed training code from https://keras.io/guides/distributed_training/
However, I see that GPU 0 is utilized just fine, but GPU 1 is only utilized a little bit.
Here is a picture of the utilization:
GPUs utilization
While using the following memory-growth setting:
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)
I can even see huge gaps in the utilization of GPU 1, as seen in:
GPUs utilization.
Meaning that for several epochs the second GPU was not utilized at all.
The only difference between the code in the example and my code is that I set epochs to 20, and I use:
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
since running without HierarchicalCopyAllReduce() results in this error:
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
Increasing the batch size to 512 seems to help a lot and the second GPU is utilized: GPUs utilization using 512 batch size
I also tried running the code with strategy.experimental_distribute_dataset, again with batch size 512 since that batch size utilized both GPUs well. However, doing so makes the second GPU go unused, as seen in the picture below:
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
train_dataset = strategy.experimental_distribute_dataset(train_dataset)
val_dataset = strategy.experimental_distribute_dataset(val_dataset)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)
#model.fit(train_dataset, epochs=20, validation_data=val_dataset)
model.fit(train_dataset, epochs=20, validation_data=val_dataset, steps_per_epoch=98, validation_steps=98)
And again I see that the GPU utilization vanished: GPUs utilization using experimental_distribute_dataset
My questions are:
Why is the second GPU hardly utilized? Isn't the batch split equally between the GPUs, i.e. if the batch size is 128, doesn't each GPU receive 64? I assumed that the same model is replicated on both GPUs and that each replica gets half of the batch to process, after which the all-reduce happens (see the sketch below for how I picture this).
If the batch were split that way, wouldn't both GPUs be similarly utilized even with a small batch size?
Also, why does distributing the dataset using the strategy make utilization worse?
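For reference, here is a minimal sketch of how I picture the batch splitting (a toy model stands in for the one from the Keras guide; this is just my understanding, not my actual training script):

import tensorflow as tf

# With 2 replicas, a global batch of 128 should be split into two
# per-replica batches of 64, one per GPU, before the gradients are reduced.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync  # 128 with 2 GPUs

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset.batch(global_batch), epochs=20)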
I'm playing with this Colab notebook locally, with an 8 GB RTX 3070 on Fedora 35 and TensorFlow 2.4.0:
https://github.com/tensorflow/similarity/blob/master/examples/kaggle.ipynb
I get the same consistent errors on Windows with an NVIDIA GTX 1050 Ti (4 GB).
I tried to isolate the origin of the OOM error in model.fit, and it seems linked to the preprocessing phase in which I resize the validation set. In this phase GPU VRAM is allocated.
# load validation images in memory
x_test = []
with tf.device('/CPU:0'):  # workaround to avoid occupying GPU memory and OOM in model.fit
    for p in tqdm(x_test_p):
        img = tf.io.read_file(p)
        img = tf.io.decode_image(img, dtype=tf.dtypes.float32)
        img = tf.image.resize_with_pad(img, IMG_SIZE, IMG_SIZE)
        # if grayscale, convert to RGB
        if tf.shape(img)[2] != 3:
            img = tf.image.grayscale_to_rgb(img)
        x_test.append(img)
If I reduce the validation set size a lot, model.fit succeeds.
If I process the whole validation set on the CPU instead of the GPU, model.fit succeeds.
Even if I don't pass the whole preprocessed validation set to model.fit, the OOM error is still present. That's why I suspect the problem is the preprocessing alone occupying GPU memory that model.fit needs.
Is it possible that this preprocessing is loaded into VRAM and not released, therefore limiting the GPU memory left for model.fit?
The problem is somewhat similar to this question, from which I took the idea of preprocessing the validation set on the CPU:
Keras OOM for data validation using GPU
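For completeness, here is an alternative I am considering (just a sketch, not tested): building the validation set as a lazy tf.data pipeline so the preprocessed images are never all materialized as GPU tensors up front. IMG_SIZE and x_test_p are the same names used in my snippet above.

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_image(path):
    # Same preprocessing as above, but executed lazily, one batch at a time.
    img = tf.io.read_file(path)
    img = tf.io.decode_image(img, channels=3, dtype=tf.dtypes.float32,
                             expand_animations=False)
    img = tf.image.resize_with_pad(img, IMG_SIZE, IMG_SIZE)
    return img

val_ds = (tf.data.Dataset.from_tensor_slices(x_test_p)
          .map(load_image, num_parallel_calls=AUTOTUNE)
          .batch(32)
          .prefetch(AUTOTUNE))

The idea is that nothing is decoded or resized until a batch is actually requested, so only one batch at a time needs to live in memory.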
So I have the following model for sentiment analysis (using pre-trained word embeddings):
As you can see, I have a pre-trained embedding matrix and only about 500k trainable parameters. So why does it take an eternity to train this model? The batch size is 128 and the number of epochs is 25, and the ETA for the first epoch is about 10 minutes; I haven't even completed it.
Just to mention, I am not using CUDA or anything, and I don't think I have GPU-enabled TensorFlow. I'm willing to do anything to increase the speed. I have TensorFlow 2.1.0.
And here's the answer: "I am not using CUDA or anything" is the key point. Training on the CPU is much slower than on a GPU. If you don't have a powerful enough video card, you can use services such as Google Colab or Kaggle.
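As a quick check (assuming TensorFlow 2.x), you can also ask TensorFlow directly whether it sees a GPU:

import tensorflow as tf

# An empty list here means every operation runs on the CPU.
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())  # False for the CPU-only pip package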
I have tried to train a VGG16 model using TensorFlow 2.0 on the ImageNet data, on a Titan RTX GPU (24 GB memory).
But the estimated training time is about 52 days (batch size 128, for 300 epochs).
Is this a normal situation? It seems very strange to me.
Also, could you tell me whether there is some way to accelerate training the model? One direction I have been looking at is the input pipeline (sketch below).
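For context, here is the kind of tf.data input pipeline with parallel decoding and prefetching I am considering (a sketch with placeholder names, not my actual code), in case data loading rather than the GPU is the bottleneck:

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def decode_and_resize(path, label):
    # Placeholder preprocessing; my actual parsing may differ.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, label

def build_input(paths, labels, batch_size=128):
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    ds = ds.shuffle(10000)
    ds = ds.map(decode_and_resize, num_parallel_calls=AUTOTUNE)
    return ds.batch(batch_size).prefetch(AUTOTUNE)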
This gist contains all my code. Thank you.
I am doing research on soccer result prediction using a Bayesian network, but I am just a beginner. I want to start studying Bayesian networks (BN) with the MNIST dataset, because my soccer dataset is quite similar to MNIST (you could say it imitates MNIST). I followed the tutorial from this website, https://alpha-i.co/blog/MNIST-for-ML-beginners-The-Bayesian-Way.html, and tried it, but I can't get a result because of an out-of-memory error. I train the tutorial's model on the CPU of a machine with an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, a GeForce GTX 1060 3GB/PCIe/SSE2 GPU (OpenGL renderer), and 32 GB of RAM. Honestly, I don't quite understand the tutorial. Before trying the BN, I had trained an ANN (artificial neural network) using TensorFlow with 500 samples, and I got the ANN result. So why can't I get the BN result? Does a BN need more memory to run? Can anyone point me to other BN tutorials that use MNIST?
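In case it helps to show what I mean, here is a minimal mini-batch Bayesian neural network on MNIST (a sketch using TensorFlow Probability, not the tutorial's code; I am assuming that full-batch training is what runs out of memory):

import tensorflow as tf
import tensorflow_probability as tfp

# Sketch only: a single Bayesian (Flipout) layer trained on MNIST in
# mini-batches of 128, so the whole dataset never has to be processed at once.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

kl_weight = 1.0 / x_train.shape[0]  # scale the KL term by the dataset size
model = tf.keras.Sequential([
    tfp.layers.DenseFlipout(
        10,
        kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) * kl_weight,
        input_shape=(784,)),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=128, epochs=5)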