I want to train a large object detection model using TF2, preferably the EfficientDet D7 network. On my Tesla P100 card, which has 16 GB of memory, I am running into an "out of memory" exception, i.e. not enough memory can be allocated on the graphics card.
So I am wondering what my options are in this case. Is it correct that if I had multiple GPUs, the TF model would be split so that it fills the memory of both cards? So in my case, with a second Tesla card that also has 16 GB, would I have 32 GB in total during training? If so, would that also be true for a cloud provider, where I could use multiple GPUs?
Moreover, if I am wrong and it would not work to split a model for multiple GPUs during training, what other approach would work in order to train a large network that does not fit into my GPU memory?
PS: I know that I could reduce the batch_size to 1, but unfortunately that still does not solve my issue for the really large models ...
You can use multiple GPUs on GCP (Google Cloud Platform) at least; I'm not too sure about other cloud providers. And yes, once you do that, you can train with a larger batch size (the exact number depends on the GPU, its memory, and how many GPUs you have running in your VM).
You can check this link for the list of all GPUs available on GCP.
If you're using the Object Detection API, you can check this post regarding training on multiple GPUs.
Alternatively, if you want to stay with a single GPU, one clever trick is gradient accumulation, which lets you virtually increase your batch size without using much extra GPU memory; it is discussed in this post.
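To make the gradient accumulation idea concrete, here is a minimal, framework-free sketch in plain Python. It assumes a hypothetical one-parameter model y = w * x with a mean-squared-error loss (neither comes from the question); the function names are made up for illustration. It shows that summing scaled micro-batch gradients gives the same update as one large batch, while only one micro-batch has to be processed at a time.

```python
# Gradient accumulation sketch on a toy model y = w * x with MSE loss.
# The model, data, and function names are illustrative assumptions.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-sized micro-batches"
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated with a true weight of 2.0
w = 0.5

full = mse_grad(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # four micro-batches of 2
print(full, accum)

# One optimizer step using the accumulated gradient.
learning_rate = 0.01
w_new = w - learning_rate * accum
```

Note that accumulation only helps you reach a larger effective batch size. It does not reduce the memory needed for the model's weights, so it will not help if a batch size of 1 already fails.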
I am training a model with Keras that consists of a Huggingface RoBERTa model as a backbone, with downstream tasks of span prediction and binary prediction for text.
I have been training the model regularly with datasets under 2 GB in size, which has worked fine. The dataset has grown in recent weeks and is now around 2.3 GB, which puts it over the 2 GB Google protobuf hard limit. This makes it impossible to train the model on TPUs with Keras using numpy tensors without a generator, because TensorFlow uses Google protobuf to buffer the tensors for the TPUs, and trying to serve all the data without a generator fails. If I use a dataset under 2 GB in size, everything works fine. TPUs don't support Keras generators yet, so I was looking into using the tf.data.Dataset API instead.
After seeing this question I adopted code from this gist trying to get this to work, resulting in the following code:
import tensorflow as tf

def tfdata_generator(x, y, is_training, batch_size=384):
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    if is_training:
        dataset = dataset.shuffle(1000)  # shuffle only the training data
    dataset = dataset.map(map_fn)  # map_fn is defined elsewhere (not shown)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
The model is created and compiled for TPU use as before which has never caused any problems and then I create the generators and call the fit function:
train_gen = tfdata_generator(x_train, y_train, is_training=True)
model.fit(
    train_gen,
    steps_per_epoch=10000,
    epochs=1,
)
This results in the following error:
FetchOutputs node : not found [Op:AutoShardDataset]
Edit: Colab with bare-minimum code and a dummy dataset. Unfortunately, because of Colab RAM restrictions, building a dummy dataset exceeding 2 GB in size crashes the notebook. Still, it shows code that runs and works on CPU/TPU with a smaller dataset.
This code does, however, work on a CPU. I can't find any further information on this error online, and I haven't been able to find more detailed information on how to use TPUs with Keras when serving training data through generators. I have looked into TFRecords a bit, but I also find the documentation on TPUs lacking. All help appreciated!
For numpy tensors, 2 GB seems to be a hard limit for TPU training (as of now).
I see two workarounds that you could use.
Write your tf.data to a GCS bucket as TFRecord/CSV using TFRecordWriter and let the TPU read the training data from that bucket.
Use the tf.data service for your input pipeline. It's a relatively new service that lets you run your data pipeline on separate workers. For details on how to run it, please see running_the_tfdata_service.
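For the first workaround, it usually makes sense to split the data over several files rather than writing one huge file. The sketch below only plans the split and does not call TensorFlow at all. The function name, the example counts, and the 200 MB per-shard cap are assumptions chosen for illustration, not values from the question.

```python
import math

def plan_shards(num_examples, bytes_per_example, max_shard_bytes):
    """Return (start, end) index ranges so that each shard stays under the cap."""
    per_shard = max(1, max_shard_bytes // bytes_per_example)
    num_shards = math.ceil(num_examples / per_shard)
    return [
        (i * per_shard, min((i + 1) * per_shard, num_examples))
        for i in range(num_shards)
    ]

# Hypothetical dataset: 1,000,000 examples of about 2,300 bytes each (about 2.3 GB).
shards = plan_shards(
    num_examples=1_000_000,
    bytes_per_example=2_300,
    max_shard_bytes=200_000_000,
)
for index, (start, end) in enumerate(shards):
    # Each range would be written to its own file, e.g. with TFRecordWriter.
    print(f"shard {index:02d}: examples [{start}, {end})")
```

Each index range can then be written to its own file in the bucket and read back as a list of files by the input pipeline.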
I tried to train an image classifier using TensorFlow. I used the data API to load the dataset, and I used dataset caching to speed up the training process. While trying to train the model, I got stuck on an error called Resource Exhausted. I tried changing the batch size, but even after trying different batch sizes such as 32, 64, and 128, I could not overcome this problem.
I have also tried removing some layers, but I could not fix this error.
Check your batch_size and decrease it. It seems to be overwhelming the available memory.
I think it's a pretty common message for PyTorch users with low GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU X; X GiB total capacity; X GiB already allocated; X MiB free; X cached)
I tried to process an image by loading each layer to GPU and then loading it back:
for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()
But it doesn't seem to be very effective. I'm wondering whether there are any tips and tricks for training large deep learning models while using little GPU memory.
Although
import torch
torch.cuda.empty_cache()
provides a good way to clear the occupied CUDA memory, and we can also manually clear variables that are no longer in use with
import gc
del variables
gc.collect()
the error might still appear again after using these commands, because PyTorch doesn't actually clear the memory; it only clears the reference to the memory occupied by the variables.
So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (but sometimes not a very feasible one).
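Finding the largest batch size that fits can be done by halving until a step succeeds. The sketch below is plain Python with a simulated device. `try_step` and `fake_step` are hypothetical stand-ins for a real training step, and the memory numbers are made up. With a real framework, memory may not be fully released after a failed attempt, which is why restarting the kernel between attempts can still be necessary.

```python
# Back-off search for a batch size that fits in memory.
# `try_step` is a hypothetical callable that runs one training step and
# raises MemoryError when the batch does not fit; a real framework raises
# its own error type (for example a RuntimeError mentioning "out of memory").

def find_batch_size(try_step, start=128, minimum=1):
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("even the minimum batch size does not fit")

# Simulated device: pretend each sample costs 300 MiB and 4096 MiB are free.
def fake_step(batch_size, per_sample_mib=300, free_mib=4096):
    if batch_size * per_sample_mib > free_mib:
        raise MemoryError(f"batch of {batch_size} needs too much memory")

best = find_batch_size(fake_step)
print(best)
```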
Another way to get deeper insight into the allocation of GPU memory is to use:
torch.cuda.memory_summary(device=None, abbreviated=False)
where both arguments are optional. This gives a readable summary of memory allocation and lets you figure out why CUDA is running out of memory, so you can restart the kernel and avoid the error happening again (just like I did in my case).
Passing the data iteratively might help, but changing the size of your network's layers or breaking them down would also prove effective (sometimes the model itself occupies a significant amount of memory, for example when doing transfer learning).
Just reduce the batch size, and it will work.
While I was training, I got the following error:
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB
total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB
reserved in total by PyTorch)
I was using a batch size of 32, so I just changed it to 15 and it worked for me.
Send the batches to CUDA iteratively, and make small batch sizes. Don't send all your data to CUDA at once in the beginning. Rather, do it as follows:
for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        # ... forward pass, loss, and optimizer step go here
You can also use dtypes that use less memory, for instance torch.float16 (torch.half is an alias for the same dtype).
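As a rough back-of-the-envelope check, the size of a dense tensor is just its element count times the bytes per element. The sketch below uses a hypothetical batch shape that is not taken from the question. Real training also needs memory for activations, gradients, and allocator overhead, so treat the numbers as a lower bound.

```python
# Rough memory footprint of one dense tensor for different dtypes.
# The shape below is illustrative only.

BYTES_PER_ELEMENT = {"float64": 8, "float32": 4, "float16": 2}

def tensor_mib(shape, dtype):
    """Size in MiB of a dense tensor with the given shape and dtype."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype] / 2**20

batch = (32, 3, 512, 512)  # hypothetical batch of 32 RGB images at 512x512
for dtype in BYTES_PER_ELEMENT:
    print(dtype, tensor_mib(batch, dtype), "MiB")
```

Going from float32 to float16 halves the footprint of every tensor stored in that dtype.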
Try not to drag your grads too far.
I got the same error when I tried to sum up the loss over all batches:
loss = self.criterion(pred, label)
total_loss += loss
I then used loss.item() instead of loss, which requires grads, and that solved the problem:
loss = self.criterion(pred, label)
total_loss += loss.item()
The solution below is credited to yuval reina in the Kaggle question.
This error is related to the GPU memory and not the general memory => #cjinny comment might not work.
Do you use TensorFlow/Keras or PyTorch?
Try using a smaller batch size.
If you use Keras, try to decrease some of the hidden layer sizes.
If you use PyTorch:
Do you keep all the training data on the GPU all the time?
Make sure you don't drag the grads too far.
Check the sizes of your hidden layers.
Most things are covered, but I will add a little.
If torch gives an error such as "Tried to allocate 2 MiB", the message is misleading. In fact, CUDA has run out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 does not work (which happens when you train NLP models with massive sequences), try passing less data; this will help you confirm that your GPU does not have enough memory to train the model.
Also, garbage collection and cache clearing have to be done again if you want to retrain the model.
Follow these steps:
Reduce the train, validation, and test data.
Reduce the batch size (e.g. 16 or 32).
Reduce the number of model parameters (e.g. fewer than a million).
In my case, the same error was raised when I trained on the Common Voice dataset in Kaggle kernels. I dealt with it by reducing the training dataset to 20,000 samples, the batch size to 16, and the model parameters to 112K.
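To judge how much the parameter count matters, a rough estimate can help. The sketch below assumes float32 training with an Adam-style optimizer, which keeps about four values per parameter (the weight, its gradient, and two moment estimates). That assumption is mine, not the answer's, and the estimate deliberately ignores activations, which often dominate.

```python
# Rough size of the persistent training state (weights, gradients, and
# optimizer moments). Assumes float32 and an Adam-style optimizer;
# activation memory is NOT included.

def training_state_mib(num_params, bytes_per_value=4, copies=4):
    """Approximate MiB needed for weights + gradients + two optimizer moments."""
    return num_params * bytes_per_value * copies / 2**20

for name, n in [
    ("112K parameters", 112_000),
    ("1M parameters", 1_000_000),
    ("100M parameters", 100_000_000),
]:
    print(name, round(training_state_mib(n), 1), "MiB")
```

For small models like the 112K-parameter one above, this state takes only a few MiB, so in such cases the batch size and input size are the more likely causes of the memory pressure.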
If you are done training and just want to test with an image, make sure to add a with torch.no_grad() and m.eval() at the beginning:
with torch.no_grad():
    for m in self.children():
        m.cuda()
        m.eval()
        x = m(x)
        m.cpu()
        torch.cuda.empty_cache()
This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model, I didn't need to save all the gradients, and they were consuming all of the GPU's memory.
There are ways to avoid it, but it certainly depends on your GPU memory size:
Load the data onto the GPU while unpacking it iteratively:
for features, labels in batch:
    features, labels = features.to(device), labels.to(device)
Use FP16 (half-precision) or single-precision float dtypes rather than double precision.
Try reducing the batch size if you run out of memory.
Use the .detach() method on tensors whose gradient history is no longer needed, so that the computation graph they reference can be freed.
If all of the above are used properly, the PyTorch library is already highly optimized and efficient.
Implementation:
Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a smaller image size.
Technically:
Most networks are over-parameterized, which means they are too large for their learning tasks. So finding an appropriate network structure can help:
a. Compact your network with techniques like model compression, network pruning, and quantization.
b. Directly use a more compact network structure like MobileNetV1/V2/V3.
c. Use neural architecture search (NAS).
I had the same error but fixed it by resizing my images from ~600 to 100 using these lines:
import torchvision.transforms as transforms
transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor()
])
Although this seems bizarre, what I found is that many sessions keep running in the background on Colab even if we factory reset the runtime or close the tab. I solved this by clicking "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and was good to go.
I would recommend using mixed precision training with PyTorch. It can make training way faster and consume less memory.
Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila
pip install koila
in your code, simply wrap the input with lazy:
from koila import lazy
input = lazy(input, batch=0)
As long as you don't cross a batch size of 32, you will be fine. Just remember to refresh or restart runtime or else even if you reduce the batch size, you will encounter the same error.
I set my batch size to 16; it reduces the zero gradients occurring during my training, and the model matches the true function much better than with a batch size of 4 or 8, which causes the training loss to fluctuate.
I met the same error. My GPU is a GTX 1650 with 4 GB of video memory, and I have 16 GB of RAM. It worked for me when I reduced the batch_size to 3.
Hope this can help you.
I faced the same problem and resolved it by downgrading the PyTorch version from 1.10.1 to 1.8.1 with CUDA 11.3.
In my case, I am using an RTX 3060 GPU, which works only with CUDA version 11.3 or above, and when I installed CUDA 11.3, it came with PyTorch 1.10.1. So I downgraded the PyTorch version, and now it is working fine.
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
You can also try reducing the training batch size.
If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
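The reason this works so well is that the pixel count grows with the square of the side length. As a quick sanity check (assuming square images, and assuming that activation memory in convolutional layers scales roughly with the number of pixels, which is a simplification):

```python
def pixel_ratio(old_side, new_side):
    """How many times fewer pixels a square image has after resizing."""
    return (old_side * old_side) / (new_side * new_side)

print(pixel_ratio(512, 256))  # 4.0: halving the side length quarters the pixels
```

The model's weights do not shrink with the input size, so the total saving is usually smaller than this ratio suggests.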
The best way would be to lower the batch size; usually that works. Otherwise try this:
import gc
del variable #delete unnecessary variables
gc.collect()
I'm trying to train a model (implementation of a research paper) on K80 GPU with 12GB memory available for training. The dataset is about 23 GB and after data extraction, it shrinks to 12GB for the training script.
At about the 4640th step (max_steps being 500,000), I receive the following error saying Resource Exhausted, and the script stops soon after that.
The memory usage at the beginning of the script is:
I went through a lot of similar questions and found that reducing the batch-size might help but I have reduced the batch-size to 50 and the error persists. Is there any other solution except switching to a more powerful GPU?
This does not look like a GPU out-of-memory (OOM) error; it looks more like you ran out of space on your local drive while saving your model's checkpoint.
Are you sure that you have enough space on your disk, and that the folder you save to doesn't have a quota?
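A quick way to check the free space from Python is `shutil.disk_usage` from the standard library. The helper names, the checkpoint size, and the safety factor below are illustrative assumptions. A per-user or per-folder quota may not show up in this number, so it is worth checking that separately.

```python
import shutil

def free_gib(path="."):
    """Return the free space in GiB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30

def has_room_for_checkpoint(path, checkpoint_gib, safety_factor=2.0):
    """True if free space exceeds the checkpoint size times a safety margin."""
    return free_gib(path) >= checkpoint_gib * safety_factor

# Check the directory the training script saves checkpoints to.
print(f"free space: {free_gib('.'):.1f} GiB")
print(has_room_for_checkpoint(".", checkpoint_gib=1.5))
```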