Getting CUDA error when trying to train MBART Model - python

from transformers import MBart50TokenizerFast
from transformers import MBartForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="", tgt_lang="")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

batch_size = 8
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)

trainer.train()
RuntimeError: CUDA out of memory. Tried to allocate 978.00 MiB (GPU 0; 15.74 GiB total capacity; 13.76 GiB already allocated; 351.00 MiB free; 14.02 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have recently started working in NLP and was trying to train an MBART model on my dataset, but every time I set it up for training I get a CUDA error. I have tried decreasing the batch size as well as killing all processes on the GPU to prevent this error, but I cannot seem to figure out a solution. Would anyone have an idea of how I could fix this and train the model?
The dataset I am using has approximately 2 million sentences, but that didn't lead to a problem when I tried other models, so I have no idea why this is occurring; any help would be appreciated.
The GPU I am using is an NVIDIA Quadro RTX 5000.

There are a few things that you can try in order to reduce the memory footprint and avoid OOM issues:
Gradient accumulation: With gradient accumulation, gradients are computed on smaller sub-batches and accumulated over several steps before the optimizer update, rather than on the whole batch at once. To use it, set per_device_train_batch_size to a value that fits into memory and set gradient_accumulation_steps so that their product equals the batch size you actually want to train with. For example, assuming your GPU can take at most a batch size of 2 (and ideally you want to max it out) and you intend to train with an effective batch size of 8, this is how you should set up your training arguments so that it fits into memory:
batch_size = 8
gradient_accumulation_steps = 4  # chosen so that batch_size // gradient_accumulation_steps fits into memory
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_train_batch_size=batch_size // gradient_accumulation_steps,  # = 2
    per_device_eval_batch_size=batch_size // gradient_accumulation_steps,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")
Gradient checkpointing: This approach saves memory by storing only a subset of the activations during the forward pass and recomputing the rest during the backward pass, trading extra compute for a smaller memory footprint. To use it, set gradient_checkpointing to True.
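As a minimal sketch, reusing the training arguments from above (note that whether the gradient_checkpointing flag is accepted directly in Seq2SeqTrainingArguments depends on your transformers version; on older versions you can call model.gradient_checkpointing_enable() instead):
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_checkpointing=True,  # recompute activations during backward to save memory
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")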
I highly recommend reading the Performance and scalability section of the transformers documentation to better understand the pros and cons of the approaches above, and to familiarize yourself with other techniques as well, such as mixed-precision training and optimizer choice.

Related

Are deep and wide autoencoder trainings just slow or is there something wrong here?

I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around ~30 million samples that add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+ reduction servers) and a batch size of 256 per replica, the performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be slow like this.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be slow like this, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit, did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)

def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)

return dataset.prefetch(tf.data.AUTOTUNE)

Obtain a set of embedding from pretrained model - vgg16 pytorch

For a certain project I am trying to store the 1 x 4096 embeddings (the output right before the final layer) of around 6000 images into a pkl file. To do this, I run an iteration over the 6000 images through a modified VGG16 model in Google Colab. But it returns a 'CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.90 GiB total capacity; 14.86 GiB already allocated; 1.88 MiB free; 342.26 MiB cached)' error.
I have used the same dataset, split into train and test, for training and validating my model, and that runs fine. I am wondering why obtaining and storing the embeddings alone is becoming such a heavy task in Colab.
Is there any other way I can obtain the embeddings and store them in a pkl file, other than the code below?
embedding = []
vgg16 = vgg16.to(device)
for x in range(0, len(inputImages)):
    input = transformations(inputImages[x])  # pre-processing
    input = torch.unsqueeze(input, 0)
    input = input.to(device)
    embedding.append(vgg16(input))
The code is interrupted at the last line with the CUDA out of memory error.
The output that you have generated, vgg16(input), is still on the GPU. That is because the output is normally kept there for computing the loss afterwards. To avoid having your outputs accumulate in CUDA and eat up your GPU memory, move them to the CPU using .cpu().numpy(). If that throws an error, you may also have to call .detach() first to detach the tensor from the computation graph.
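As a minimal sketch, here is the loop from the question with that change applied (plus torch.no_grad(), since gradients are not needed for feature extraction; variable names follow the question):
embedding = []
vgg16 = vgg16.to(device)
vgg16.eval()
with torch.no_grad():                                # don't build a computation graph
    for x in range(len(inputImages)):
        input = transformations(inputImages[x])      # pre-processing
        input = torch.unsqueeze(input, 0).to(device)
        output = vgg16(input)
        embedding.append(output.detach().cpu().numpy())  # move the result off the GPU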

How to avoid "CUDA out of memory" in PyTorch

I think it's a pretty common message for PyTorch users with low GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU X; X GiB total capacity; X GiB already allocated; X MiB free; X cached)
I tried to process an image by loading each layer to GPU and then loading it back:
for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()
But it doesn't seem to be very effective. I'm wondering whether there are any tips and tricks to train large deep learning models while using little GPU memory.
Although
import torch
torch.cuda.empty_cache()
provides a good way of clearing the occupied CUDA memory, and we can also manually clear variables that are no longer in use with
import gc
del variables
gc.collect()
the error might still appear again after using these commands, because PyTorch doesn't actually free the memory; it only clears the references to the memory occupied by the variables.
So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (although sometimes not a very feasible one).
Another way to get a deeper insight into the allocation of memory on the GPU is to use:
torch.cuda.memory_summary(device=None, abbreviated=False)
where both arguments are optional. This gives a readable summary of memory allocation and lets you figure out the reason CUDA is running out of memory, then restart the kernel to avoid the error from happening again (just like I did in my case).
Passing the data iteratively might help, but changing the size of your network's layers or breaking them down can also prove effective (sometimes the model itself occupies a significant amount of memory, for example when doing transfer learning).
Just reduce the batch size, and it will work.
While I was training, it gave the following error:
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB
total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB
reserved in total by PyTorch)
And I was using a batch size of 32. So I just changed it to 15 and it worked for me.
Send the batches to CUDA iteratively, and use small batch sizes. Don't send all your data to CUDA at once at the beginning. Rather, do it as follows:
for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        # blablabla
You can also use dtypes that use less memory. For instance, torch.float16 or torch.half.
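For example, a minimal sketch of running inference in half precision (model and images are placeholder names; for training, mixed precision as recommended further below is usually safer than a blanket cast):
model = model.half().to("cuda")                 # store the weights as float16
with torch.no_grad():
    output = model(images.half().to("cuda"))    # inputs must use the same dtype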
Try not to drag your grads too far.
I got the same error when I tried to sum up loss in all batches.
loss = self.criterion(pred, label)
total_loss += loss
Then I used loss.item() instead of loss, which requires grads, and that solved the problem:
loss = self.criterion(pred, label)
total_loss += loss.item()
The solution below is credited to yuval reina in the Kaggle question:
This error is related to the GPU memory and not the general memory => the #cjinny comment might not work.
Do you use TensorFlow/Keras or Pytorch?
Try using a smaller batch size.
If you use Keras, Try to decrease some of the hidden layer sizes.
If you use Pytorch:
do you keep all the training data on the GPU all the time?
make sure you don't drag the grads too far
check the sizes of your hidden layers
Most things are covered already, but I'll still add a little.
If torch gives an error such as "Tried to allocate 2 MiB", it is a misleading message. Actually, CUDA has run out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 does not work (which happens when you train NLP models with massive sequences), try to pass less data; this will help you confirm that your GPU does not have enough memory to train the model.
Also, garbage collection and cache clearing have to be done again if you want to re-train the model.
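A minimal sketch of that cleanup step (assuming model and optimizer are the objects still holding GPU memory):
import gc
import torch

del model, optimizer         # drop the Python references to the GPU tensors
gc.collect()                 # collect the now-unreferenced objects
torch.cuda.empty_cache()     # return cached blocks to the GPU driver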
Follow these steps:
Reduce the train, val, and test data
Reduce the batch size (e.g. 16 or 32)
Reduce the number of model parameters (e.g. less than a million)
In my case, when I was training the Common Voice dataset in Kaggle kernels, the same error was raised. I dealt with it by reducing the training dataset to 20,000 samples, the batch size to 16, and the model parameters to 112K.
If you are done training and just want to test with an image, make sure to add a with torch.no_grad() and m.eval() at the beginning:
with torch.no_grad():
    for m in self.children():
        m.cuda()
        m.eval()
        x = m(x)
        m.cpu()
        torch.cuda.empty_cache()
This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model, I didn't need to save all the gradients, and they were consuming all of the GPU's memory.
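A minimal sketch of what that fix looks like with the transformers library (the model name and the sentences list are placeholders, and pooling the [CLS] token is just one choice):
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to("cuda").eval()

sentences = ["a placeholder sentence", "another one"]
with torch.no_grad():                               # no gradients are stored
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt").to("cuda")
    embeddings = model(**batch).last_hidden_state[:, 0].cpu()   # [CLS] vectors, moved off the GPU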
There are ways to avoid it, but it certainly depends on your GPU memory size:
Load the data onto the GPU while unpacking it iteratively:
for features, labels in batch:
    features, labels = features.to(device), labels.to(device)
Use FP16 or other lower-precision float dtypes.
Try reducing the batch size if you run out of memory.
Use the .detach() method to detach tensors that don't need gradients, so their graph memory can be freed.
If all of the above are used properly, the PyTorch library is already highly optimized and efficient.
Implementation:
Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a small size.
Technically:
Most networks are over-parameterized, which means they are too large for the learning task. So finding an appropriate network structure can help:
a. Compact your network with techniques like model compression, network pruning and quantization (a pruning sketch follows this list).
b. Directly use a more compact network structure like MobileNet v1/2/3.
c. Network architecture search (NAS).
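For point a, here is a small sketch of unstructured magnitude pruning with PyTorch's built-in utilities (model is a placeholder and the 30% ratio is arbitrary; note that unstructured pruning only zeroes weights rather than shrinking the tensors, so the memory benefit comes with sparse storage or a follow-up compression step):
import torch
import torch.nn.utils.prune as prune

# prune 30% of the smallest-magnitude weights in every Linear layer
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # bake the pruning mask into the weight tensor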
I had the same error but fixed it by resizing my images from ~600 to 100 using these lines:
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor()
])
Although this seems bizarre, what I found is that many sessions keep running in the background in Colab even if we factory reset the runtime or close the tab. I solved this by clicking on "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and I was good to go.
I would recommend using mixed-precision training with PyTorch. It can make training much faster and consume less memory.
Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
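A minimal sketch of a mixed-precision training loop with torch.cuda.amp (model, optimizer, criterion and train_loader are placeholders):
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    with autocast():                        # run the forward pass in float16 where it is safe
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()           # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)                  # unscale the gradients and apply the optimizer step
    scaler.update()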
There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila
pip install koila
in your code, simply wrap the input with lazy:
from koila import lazy
input = lazy(input, batch=0)
As long as you don't cross a batch size of 32, you will be fine. Just remember to refresh or restart the runtime; otherwise, even if you reduce the batch size, you will encounter the same error.
I set my batch size to 16; it reduces zero gradients from occurring during my training, and the model matches the true function much better than with a batch size of 4 or 8, which causes the training loss to fluctuate.
I met the same error; my GPU is a GTX 1650 with 4 GB of video memory and 16 GB of RAM. It worked for me when I reduced the batch_size to 3.
Hope this can help you.
I faced the same problem and resolved it by downgrading the PyTorch version from 1.10.1 to 1.8.1 with CUDA 11.3.
In my case, I am using an RTX 3060 GPU, which works only with CUDA version 11.3 or above, and when I installed CUDA 11.3 it came with PyTorch 1.10.1. So I downgraded the PyTorch version, and now it is working fine.
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
You can also check by reducing the train batch size.
If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
The best way would be to lower the batch size. Usually it works. Otherwise try this:
import gc
del variable #delete unnecessary variables
gc.collect()

Understanding trained neural network memory usage

Background
I have a single-layer, 256-hidden-unit RNN that I've trained with Keras and that I now want to deploy. Ideally, I would like to deploy multiple instances of this RNN onto a GPU. However, at this point, when I load the model with keras.models.load_model(), it seems to be using 11 GB of my available 12 GB of GPU memory.
Questions
Why is my network, which is quite small, taking up so much memory? I only want to predict, not train. Am I loading the model the wrong way?
Is there some way I can generally understand the mapping of my RNN structure to the amount of GPU memory it will use?
Given this understanding, how do I reduce the amount of memory consumed by my RNN?
Current Understanding
My current estimate of how much memory my network should use is given by the number of parameters:
256 input weights
256 output weights
256x256 recurrent weights
256 hidden units
256 hidden unit biases
Total: 32 bits/parameter x (4 x 256 + 256 x 256) parameters ≈ 2.1e6 bits (roughly 0.27 MB)
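As a quick check of that arithmetic (a sketch added for reference only):
params = 4 * 256 + 256 * 256          # input, output, bias and hidden-unit terms + recurrent weights
bits = 32 * params                    # float32 parameters
print(params, bits, bits / 8 / 1e6)   # 66560 parameters, ~2.1e6 bits, ~0.27 MB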
This is significantly less than what I'm currently seeing. So my hypothesis is that Keras thinks I'm still training my model and is therefore trying to cache batch error sizes. But how else am I supposed to load my model?
No, it's just a strategy of GPU memory usage. Keras is generally based on TensorFlow, and TensorFlow by default maps all your free GPU memory in order to avoid dynamic memory allocation, regardless of how much memory you will really use.
You can configure it like below:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3 # or any valid options.
set_session(tf.Session(config=config))
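As a follow-up note, with the same TF1/standalone-Keras session API used above, you can instead let TensorFlow grow its allocation on demand rather than reserving a fixed fraction:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory only as it is actually needed
set_session(tf.Session(config=config))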

What is the right way to manage memory in Theano for training sets that cannot fit in RAM?

TL;DR:
How do I give more data to a Theano function without taking more memory?
The problem I'm having is that training my ML algorithm on the GPU with Theano causes the GPU to eventually run out of memory. I took a slight departure from the tutorial because my dataset is too big to read entirely into memory (this must be an issue for video algorithms too, right?), so rather than using an index input and update scheme, I just pass the Theano function the ndarrays directly.
Let me give an example of what I mean. In the Logistic Regression tutorial in Theano it says to do something along the lines of:
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
This requires train_set_x and train_set_y to be loaded into memory, and the tutorial uses a SharedVariable to store the complete dataset.
OK, for me the dataset is huge (many, many gigabytes), which means it cannot all be loaded into memory at once, so I modified mine to take the data directly:
train_model = theano.function(
    inputs=[input, classes],
    outputs=cost,
    updates=updates
)
and then I do something that looks vaguely like this:
for count, data in enumerate(extractor):
    observations, labels = data
    batch_cost = train_model(observations, labels)
    logger.debug("Generation %d: %f cost", count, batch_cost)
I think I may be fundamentally misunderstanding how to properly hand data to the GPU without some nasty Python garbage-collection dirtiness. It seems like this is occupying more and more memory inside the model, because after training for a (large) number of batches, I get an error like this:
Error when tring to find the memory information on the GPU: initialization error
Error freeing device pointer 0x500c88000 (initialization error). Driver report 0 bytes free and 0 bytes total
CudaNdarray_uninit: error freeing self->devdata. (self=0x10cbbd170, self->devata=0x500c88000)
Exception MemoryError: 'error freeing device pointer 0x500c88000 (initialization error)' in 'garbage collection' ignored
Fatal Python error: unexpected exception during garbage collection
How do I give more data to a Theano function, without taking up more memory?
If the dataset does not fit in memory, the idea is to take a portion of it and load it each time you need it.
If your data does not fit in GPU memory, as seen in the classic Lasagne tutorial, you can iterate over portions of the dataset, called minibatches.
Then, if your data does not fit in your RAM either, you need to load each minibatch as you need it. The best way to do that is to have a separate process load the next minibatch (CPU working) while you are analysing the current one (GPU working), as in the sketch below.
You can take inspiration from AlexNet:
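A rough sketch of that producer/consumer pattern in plain Python (extractor, train_model and logger follow the question; the queue size is arbitrary, and a thread is used here for simplicity even though a separate process avoids the GIL):
import queue
import threading

batch_queue = queue.Queue(maxsize=4)        # keep a few minibatches ready ahead of the GPU

def producer():
    # CPU: read and preprocess upcoming minibatches while the GPU trains
    for data in extractor:
        batch_queue.put(data)
    batch_queue.put(None)                   # sentinel: no more data

threading.Thread(target=producer, daemon=True).start()

count = 0
while True:
    data = batch_queue.get()
    if data is None:
        break
    observations, labels = data
    batch_cost = train_model(observations, labels)   # GPU: train on the current minibatch
    logger.debug("Generation %d: %f cost", count, batch_cost)
    count += 1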
