Obtain a set of embeddings from a pretrained model - vgg16 pytorch - python

For a certain project I am trying to store the 1 x 4096 embeddings (the output right before the final layer) of around 6000 images in a pkl file. To do so, I am iterating over the 6000 images with a modified VGG16 model in Google Colab. But it fails with the error 'CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.90 GiB total capacity; 14.86 GiB already allocated; 1.88 MiB free; 342.26 MiB cached)'.
I have used the same dataset, split into train and test sets, for training and validating my model, and that runs fine. I am wondering why obtaining and storing the embeddings alone is becoming such a heavy task in Colab.
Is there any other way I can obtain the embeddings and store them in a pkl file, other than the code below?
embedding = []
vgg16 = vgg16.to(device)
for x in range(len(inputImages)):
    input = transformations(inputImages[x])  # pre-processing
    input = torch.unsqueeze(input, 0)
    input = input.to(device)
    embedding.append(vgg16(input))
The code is interrupted at the last line with the CUDA out of memory error.

The output you generate with vgg16(input) still lives in CUDA memory, because during training that output would normally be used to calculate the loss. To avoid having every output stored on the GPU and eating up its memory, move it to the CPU using .cpu().numpy(). If that throws an error, you might have to use .detach() as well to detach the tensor from the computation graph.
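For example, a minimal sketch of the loop above with each output detached and moved off the GPU before it is stored, plus a pickle dump at the end (the file name embeddings.pkl is just a placeholder; inputImages, transformations and device are the names from the question):

import pickle
import torch

embedding = []
vgg16 = vgg16.to(device)
vgg16.eval()                      # inference only; no dropout / batch-norm updates

with torch.no_grad():             # don't build the autograd graph at all
    for img in inputImages:
        x = transformations(img)              # pre-processing
        x = torch.unsqueeze(x, 0).to(device)
        out = vgg16(x)                        # 1 x 4096 tensor, still on the GPU
        embedding.append(out.detach().cpu().numpy())  # GPU copy can now be freed

with open("embeddings.pkl", "wb") as f:
    pickle.dump(embedding, f)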

Related

Getting CUDA error when trying to train MBART Model

from transformers import MBart50TokenizerFast
from transformers import MBartForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="", tgt_lang="")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

batch_size = 8
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)
trainer.train()
RuntimeError: CUDA out of memory. Tried to allocate 978.00 MiB (GPU 0; 15.74 GiB total capacity; 13.76 GiB already allocated; 351.00 MiB free; 14.02 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have recently started working in NLP and was trying to train the MBART model on my data set, but every time I set it up for training I get a CUDA error. I have tried decreasing the batch size as well as killing all processes on the GPU to prevent this error, but I cannot seem to figure out a solution. Would anyone have an idea of how I could fix this and train the model?
The data set I am using has approximately 2 million sentences, but that didn't lead to a problem when I tried other models, so I have no idea why this is occurring. Any help would be much appreciated.
The GPU I am using is NVIDIA Quadro RTX 5000.
There are a few things that you can try in order to reduce the memory footprint and avoid OOM issues:
Gradient accumulation: With gradient accumulation, gradients are computed over several smaller micro-batches rather than all at once for the full batch. To use it, set per_device_train_batch_size to a size that actually fits into memory and set gradient_accumulation_steps to original_batch_size / per_device_train_batch_size. For example, assuming your GPU can take at most a batch size of 2 (and ideally you want to max it out) and you intend to train with an effective batch size of 8, this is how you should set up your training arguments so that it fits into memory:
batch_size = 8                    # effective batch size you want to train with
gradient_accumulation_steps = 4   # each micro-batch is then 8 / 4 = 2, which fits into memory
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_train_batch_size=batch_size // gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size // gradient_accumulation_steps,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")
Gradient checkpointing: This approach saves memory by storing only a selection of the activations during the forward pass and recomputing the rest during the backward pass. To use it, set gradient_checkpointing to True (see the sketch below).
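For reference, a rough sketch of what that looks like with the training arguments from the question; recent versions of transformers expose gradient_checkpointing as a flag on Seq2SeqTrainingArguments, while on older versions you may have to call model.gradient_checkpointing_enable() instead:

args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_checkpointing=True,   # recompute activations in the backward pass to save memory
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none")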
I highly recommend reading the Performance and scalability section of the transformers documentation to understand the pros and cons of the aforementioned approaches better, and to familiarize yourself with other techniques as well such as mixed precision training and optimizer usage.

Using Pre-trained Transformer Model to Expand Short Sentences to Long Sentences

I need to build a model that can expand multiple short sentences into multiple long sentences. I was thinking of using a pre-trained Transformer model, much like paragraph or text summarization, except that in this case I swap the input and output values. I tried this with t5-base on Google Colab using a really minimal dataset (around 10 rows of data); the idea was to see whether it works at all, regardless of the output. But I always get errors like the one below:
RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0;
11.17 GiB total capacity; 10.29 GiB already allocated; 237.81 MiB free; 10.49 GiB reserved in total by PyTorch)
I interpret this error as meaning either I did something wrong or my idea does not work. Is there anyone who can suggest how to do this?
Please advise
I reduced the batch size and it solved the problem.

Not understanding CUDA resources and keep running out of memory

I think I'm running PyTorch.
I am new to Python and trying to use it to experiment with convolutional neural networks and process larger images. But I keep running into this error, even if I request smaller image outputs. I just signed up for Colab Pro; while it is certainly faster, it still errors out with CUDA. I would reallocate memory if I knew how, but I don't. Is there any other way to access/manage GPU memory?
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line
255, in backward torch.autograd.backward(self, gradient, retain_graph,
create_graph, inputs=inputs) File
"/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py",
line 149, in backward allow_unreachable=True, accumulate_grad=True) #
allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to
allocate 114.00 MiB (GPU 0; 15.78 GiB total capacity; 13.27 GiB
already allocated; 4.75 MiB free; 14.43 GiB reserved in total by
PyTorch) VGG-19 Architecture Detected Successfully loaded
models/vgg19-d01eb7cb.pth conv1_1: 64 3 3 3 conv1_2: 64 64 3 3
conv2_1: 128 64 3 3 conv2_2:
I have shown below some ways to manage GPU memory in PyTorch, but often these are not the recommended ways to deal with CUDA errors like yours.
The reason you get this error has nothing to do with the size of your output but with the size of your input. Either you have images coming into your network that are way too big, in which case you may need to use transforms.Resize(), or your batch size is way too big, so you are asking for a huge parallel computation and need to lower that number in the DataLoader.
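For example, a sketch with placeholder dataset/loader names (not the asker's code) showing both knobs:

import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # shrink images before they reach the network
    transforms.ToTensor(),
])

# "dataset" stands in for whatever Dataset you are training on (built with the transform above)
loader = DataLoader(dataset, batch_size=8, shuffle=True)  # lower batch_size further if OOM persists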
A tensor can be removed from GPU memory like this:
a = torch.tensor(1, device="cuda")
del a
# Though not suggested and not really needed to be called explicitly
torch.cuda.empty_cache()
A tensor is allocated on CUDA memory simply by moving it to the device:
a = torch.tensor(1)
a = a.cuda()
# OR
device = torch.device("cuda")
a = a.to(device)
Sarthak Jain

MemoryError: Unable to allocate 5.62 GiB for an array with shape (16384, 30720, 3) and data type float32 When training StyleGan2

Training on tensorflow 1.15, python3.7.
I am currently training StyleGAN2 on a custom dataset consisting of 30,000 images, each 256 by 256. StyleGAN2 creates several tfrecord files, each storing the dataset at a resolution of 2^x; for context, the 8th tfrecord (storing the 256x256 images) is 5 GB, while the rest are much smaller (all less than a GB).
My current setup is a P100 with 16 GB of VRAM, 32 GB of RAM and an abundance of storage. I also have 2 vCPUs (training on GCP).
I am running into the error mentioned above. Initially my RAM was 13 GB; after seeing the exact same error multiple times, I iteratively upped it to an eventual 32 GB.
Any and all "pointers" would be helpful (Notice the pun on pointers haha)
OKAY, I SOLVED IT. There was an issue with the .pkl file that I was using for transfer learning. Use a pickle file that contains a model whose discriminator starts with an input layer matching your image shape (e.g. 256x256).

How to avoid "CUDA out of memory" in PyTorch

I think it's a pretty common message for PyTorch users with low GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate 😊 MiB (GPU 😊; 😊 GiB total capacity; 😊 GiB already allocated; 😊 MiB free; 😊 cached)
I tried to process an image by loading each layer to GPU and then loading it back:
for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()
But it doesn't seem to be very effective. I'm wondering whether there are any tips and tricks to train large deep learning models while using little GPU memory.
Although
import torch
torch.cuda.empty_cache()
provides a good alternative for clearing the occupied CUDA memory, and we can also manually clear the variables that are no longer in use with:
import gc
del variables
gc.collect()
But even after using these commands, the error might appear again, because PyTorch doesn't actually clear the memory; it clears the reference to the memory occupied by the variables.
So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (but sometimes not a very feasible one).
Another way to get a deeper insight into the allocation of GPU memory is to use:
torch.cuda.memory_summary(device=None, abbreviated=False)
where both arguments are optional. This gives a readable summary of memory allocation and allows you to figure out why CUDA is running out of memory and restart the kernel to avoid the error from happening again (just like I did in my case).
Passing the data iteratively might help, but changing the size of the layers of your network or breaking them down would also prove effective (the model itself can sometimes occupy significant memory, for example when doing transfer learning).
Just reduce the batch size, and it will work.
While I was training, it gave the following error:
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB
total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB
reserved in total by PyTorch)
I was using a batch size of 32, so I just changed it to 15 and it worked for me.
Send the batches to CUDA iteratively, and make small batch sizes. Don't send all your data to CUDA at once in the beginning. Rather, do it as follows:
for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        # blablabla
You can also use dtypes that use less memory. For instance, torch.float16 or torch.half.
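For instance, a rough half-precision inference sketch (model and images are placeholders; this assumes your model is numerically stable in FP16, otherwise prefer the mixed-precision approach mentioned further down):

model = model.half().cuda()      # weights stored as float16
images = images.half().cuda()    # inputs must match the weight dtype
with torch.no_grad():
    outputs = model(images)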
Try not to drag your grads too far.
I got the same error when I tried to sum up the loss across all batches.
loss = self.criterion(pred, label)
total_loss += loss
Then I used loss.item() instead of loss, which requires grads, and that solved the problem:
loss = self.criterion(pred, label)
total_loss += loss.item()
The solution below is credited to yuval reina in the kaggle question
This error is related to the GPU memory and not the general memory => #cjinny comment might not work.
Do you use TensorFlow/Keras or Pytorch?
Try using a smaller batch size.
If you use Keras, try to decrease some of the hidden layer sizes.
If you use Pytorch:
do you keep all the training data on the GPU all the time?
make sure you don't drag the grads too far
check the sizes of your hidden layers
Most things are covered above, but I will still add a little.
If torch gives an error such as "Tried to allocate 2 MiB", it is a misleading message: CUDA has actually run out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 does not work (which happens when you train NLP models with massive sequences), try to pass in less data; this will help you confirm that your GPU does not have enough memory to train the model.
Also, the garbage collection and cache-clearing steps have to be done again if you want to re-train the model.
Follow these steps:
Reduce train,val,test data
Reduce batch size {eg. 16 or 32}
Reduce number of model parameters {eg. less than million}
In my case, when I was training the Common Voice dataset in Kaggle kernels, the same error was raised. I dealt with it by reducing the training dataset to 20,000 samples, the batch size to 16, and the model parameters to 112K.
If you are done training and just want to test with an image, make sure to add a with torch.no_grad() and m.eval() at the beginning:
with torch.no_grad():
    for m in self.children():
        m.cuda()
        m.eval()
        x = m(x)
        m.cpu()
        torch.cuda.empty_cache()
This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model I didn't need all the gradients, and they were consuming all of the GPU's memory.
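As an illustration, a minimal sketch of that pattern with the transformers library (the model name and sentences are just an example, not the original code):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").cuda().eval()

sentences = ["A short example sentence.", "Another one."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to("cuda")

with torch.no_grad():                 # no graph is built, so nothing piles up on the GPU
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0].cpu().numpy()   # [CLS] vectors, moved to the CPU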
There are ways to avoid it, but it certainly depends on your GPU memory size:
Load the data onto the GPU as you unpack it iteratively:
for features, labels in batch:
    features, labels = features.to(device), labels.to(device)
Use FP16 (half precision) or at most single-precision float dtypes.
Try reducing the batch size if you run out of memory.
Use the .detach() method on tensors that are no longer needed, so they don't keep the computation graph (and its GPU memory) alive.
If all of the above are used properly, the PyTorch library is already highly optimized and efficient.
Implementation:
Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a small image size.
Technically:
Most networks are over-parameterized, which means they are too large for the learning task. So finding an appropriate network structure can help:
a. Compact your network with techniques like model compression, network pruning and quantization.
b. Directly use a more compact network structure like MobileNetV1/2/3 (see the sketch after this list).
c. Network architecture search (NAS).
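As a sketch of option (b), swapping in a small torchvision backbone (the weights argument shown here is for newer torchvision releases; older ones use pretrained=True instead):

import torchvision.models as models

# MobileNetV3-Small has only a few million parameters, far fewer than VGG-style networks
model = models.mobilenet_v3_small(weights="IMAGENET1K_V1").cuda()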
I had the same error but fixed it by resizing my images from ~600 to 100 pixels using these lines:
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor()
])
Although this seems bizarre, what I found is that there are many sessions running in the background in Colab, even if we factory reset the runtime or close the tab. I conquered this by clicking on "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and I was good to go.
I would recommend using mixed precision training with PyTorch. It can make training way faster and consume less memory.
Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
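A bare-bones sketch of the native torch.cuda.amp API (model, optimizer, criterion and train_loader are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in mixed precision
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()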
There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila
pip install koila
in your code, simply wrap the input with lazy:
from koila import lazy
input = lazy(input, batch=0)
As long as you don't cross a batch size of 32, you will be fine. Just remember to refresh or restart the runtime, or else even if you reduce the batch size you will encounter the same error.
I set my batch size to 16; it reduces the zero gradients occurring during my training, and the model matches the true function much better than with a batch size of 4 or 8, which causes the training loss to fluctuate.
I met the same error; my GPU is a GTX 1650 with 4 GB of video memory, and I have 16 GB of RAM. It worked for me when I reduced the batch_size to 3.
Hope this can help you.
I faced the same problem and resolved it by downgrading the PyTorch version from 1.10.1 to 1.8.1, with CUDA 11.3.
In my case, I am using an RTX 3060 GPU, which works only with CUDA version 11.3 or above, and when I installed CUDA 11.3, it came with PyTorch 1.10.1. So I downgraded the PyTorch version, and now it is working fine.
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
You can also check by reducing the training batch size.
If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
The best way would be to lower the batch size. Usually that works. Otherwise try this:
import gc
del variable #delete unnecessary variables
gc.collect()
