I am researching the machine specs needed for DETR training.
However, I only have a GeForce GTX 1660 SUPER and I got an "out of memory" error. Could you please let me know what machine specs are needed to complete DETR training?
Please help me with my research.
DETR (https://github.com/facebookresearch/detr)
You are getting the out-of-memory error because your GPU memory isn't sufficient to hold the batch size you chose. Try running the code with the minimum possible batch size, see how much memory it consumes, then increase the batch size slightly and check the increase in memory consumption again. This way you will be able to estimate how much GPU memory you need to run the training with your actual batch size.
I had the same issue; I switched to a machine with larger GPU memory (around 24 GB) and then everything worked fine!
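To make that batch-size probing concrete, here is a minimal PyTorch sketch of how you could measure peak GPU memory per batch size. The ResNet-50 model and the dummy 224x224 inputs are placeholders for illustration, not the actual DETR training setup:

import torch
import torchvision

# Placeholder model and inputs, not the real DETR pipeline.
model = torchvision.models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()

for batch_size in (1, 2, 4, 8):
    torch.cuda.reset_peak_memory_stats()
    images = torch.randn(batch_size, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (batch_size,), device="cuda")
    loss = criterion(model(images), labels)
    loss.backward()                      # include the backward pass, since training needs it
    model.zero_grad(set_to_none=True)    # drop gradients before the next measurement
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"batch_size={batch_size}: peak {peak_gib:.2f} GiB")

The growth in peak memory from one batch size to the next gives you a rough per-sample cost you can extrapolate from.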
Assuming a Numpy array X_train of shape (4559552, 13, 22), the following code:
train_dataset = tf.data.Dataset \
.from_tensor_slices((X_train, y_train)) \
.shuffle(buffer_size=len(X_train) // 10) \
.batch(batch_size)
works fine exactly once. When I re-run it (after slight modifications to X_train), it triggers an InternalError because the GPU runs out of memory:
2021-12-19 15:36:58.460497: W tensorflow/core/common_runtime/bfc_allocator.cc:457]
Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.71GiB requested by op _EagerConst
It seems that the first time, it finds 100% free GPU memory so all works fine, but the subsequent times, the GPU memory is already almost full and hence the error.
From what I understand, it seems that simply clearing GPU memory from the old train_dataset would be sufficient to solve the problem, but I couldn't find any way to achieve this in TensorFlow. Currently the only way to re-assign the dataset is to kill the Python kernel and re-run everything from start.
Is there a way to avoid re-starting the Python kernel from scratch and instead free the GPU memory so that the new dataset can be loaded into it?
The dataset doesn't need the full GPU memory, so I would consider switching to TFRecords a non-ideal solution here (it comes with additional complications).
Try limiting how much GPU memory TensorFlow grabs up front by enabling memory growth, as shown here:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
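If what you actually want is a hard cap rather than on-demand growth, TensorFlow can also create a logical device with a fixed memory_limit. This is a sketch: the 4096 MB value is just an example, and it must run before the GPU is first used:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow's usage of the first GPU at roughly 4 GB (example value).
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])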
I think I'm running PyTorch, not TensorFlow.
I am new to Python, and I'm experimenting with convolutional neural networks to process larger images. But I keep running into this error, even when I request smaller image outputs. I just signed up for Colab Pro. While it is certainly faster, it still errors out with a CUDA out-of-memory error. I would reallocate memory if I knew how, but I don't. Is there any other way to access/manage GPU memory?
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line
255, in backward torch.autograd.backward(self, gradient, retain_graph,
create_graph, inputs=inputs) File
"/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py",
line 149, in backward allow_unreachable=True, accumulate_grad=True) #
allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to
allocate 114.00 MiB (GPU 0; 15.78 GiB total capacity; 13.27 GiB
already allocated; 4.75 MiB free; 14.43 GiB reserved in total by
PyTorch) VGG-19 Architecture Detected Successfully loaded
models/vgg19-d01eb7cb.pth conv1_1: 64 3 3 3 conv1_2: 64 64 3 3
conv2_1: 128 64 3 3 conv2_2:
Below are some ways to manage GPU memory in PyTorch, though these are often not the recommended ways to deal with CUDA errors like yours.
The reason you get this error has nothing to do with the size of your output but with the size of your input. Either the images coming into your network are way too big, in which case you may need to use transforms.Resize(), or your batch size is way too big, so you are asking for a huge parallel computation and need to lower that number in the DataLoader (see the sketch below).
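Both fixes might look like the sketch below; the dataset path, image size, and batch size are placeholder values, not something taken from your code:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # shrink images before they reach the network
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/train", transform=transform)  # placeholder path
loader = DataLoader(dataset, batch_size=8, shuffle=True)  # lower batch_size if you still hit OOM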
A tensor can be removed from GPU memory like this:
a = torch.tensor(1, device="cuda")
del a
# Though not suggested and not really needed to be called explicitly
torch.cuda.empty_cache()
A tensor is allocated in CUDA memory simply by moving it to the device:
a = torch.tensor(1)
a = a.cuda()
# OR
device = torch.device("cuda")
a = a.to(device)
Sarthak Jain
I'm just playing around with PyTorch and I'm wondering why it consumes so much of my GPU memory.
I'm using CUDA 10.0 with PyTorch 1.2.0 and torchvision 0.4.0.
import torch
gpu = torch.device("cuda")
x = torch.ones(int(4e8), device=gpu)
y = torch.ones(int(1e5), device=gpu)
Running the above code I get the error:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 2.00 GiB total capacity; 1.49 GiB already allocated; 0 bytes free; 0 bytes cached)
So, does PyTorch need ~500 MB of GPU memory as overhead? Or what is the problem here?
More information and testing done by xymeng on GitHub can be found in the given link.
Referencing xymeng's words:
PyTorch has its own cuda kernels. From my measurement the cuda runtime allocates ~1GB memory for them. If you compile pytorch with cudnn enabled the total memory usage is 1GB + 750M + others = 2GB+
Note that this is just my speculation as there is no official documentation about this. What puzzles me is that the cuda runtime allocates much more memory than the actual code size (they are approx. linearly correlated. If I remove half of pytorch's kernels the memory usage is also reduced by half). I suspect either the kernel binaries have been compressed or they have to be post-processed by the runtime.
It seems to fit your situation.
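To verify this on your own machine, you can compare what PyTorch's allocator tracks against the total reported by nvidia-smi. A small sketch, assuming a CUDA-capable machine with enough free memory for the ~1.6 GB tensor:

import torch

gpu = torch.device("cuda")
x = torch.ones(int(4e8), device=gpu)  # ~1.6 GB of float32 data

# Memory held by tensors, as tracked by PyTorch's caching allocator:
print(torch.cuda.memory_allocated(gpu) / 1024 ** 2, "MiB allocated by tensors")
# nvidia-smi will report several hundred MiB more than this number; the difference
# is mostly the CUDA context and PyTorch's kernel images, not your data.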
I successfully trained the network but got this error during validation:
RuntimeError: CUDA error: out of memory
The best way is to find the process occupying GPU memory and kill it:
Find the PID of the Python process with:
nvidia-smi
Copy the PID and kill it with:
sudo kill -9 pid
1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under a torch.no_grad() block:
with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)
    ...
The code above doesn't build the autograd graph, so it uses far less GPU memory.
2. If you use the += operator in your code, it can keep accumulating the gradient graph. In that case, you need to use float() as described on the following page:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Even though the docs suggest float(), in my case item() also worked:
entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()
3. If you use a for loop in your training code, data can be retained until the entire loop ends. So, in that case, you can explicitly delete variables after performing optimizer.step():
for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1, intermediate_variable2, ...
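Putting the three points together, a validation loop might look like the sketch below. The net, val_loader, and loss_function names are placeholders, not your actual code:

import torch

def validate(net, val_loader, loss_function, device="cuda"):
    net.eval()
    total_loss = 0.0
    with torch.no_grad():                  # point 1: no autograd graph is built
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = net(inputs)
            loss = loss_function(outputs, labels)
            total_loss += loss.item()      # point 2: accumulate a Python float, not a tensor
            del outputs, loss              # point 3: free intermediates each iteration
    return total_loss / len(val_loader)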
The error occurs because you ran out of memory on your GPU.
One way to solve it is to reduce the batch size until your code runs without this error.
I had the same issue and this code worked for me :
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
It might be for a number of reasons that I try to report in the following list:
Modules parameters: check the number of dimensions for your modules. Linear layers that transform a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) will require a matrix whose size is (1000, 1000).
RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a big number of steps. Usually, you fix a given number of decoding steps that is reasonable for your dataset.
Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
Batch size: incrementally increase your batch size until you go out of memory. It's a common trick that even famous libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
In addition, I would recommend you have a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
I am a PyTorch user. In my case, the cause of this error message was actually not GPU memory, but a version mismatch between PyTorch and CUDA.
Check whether the cause is really your GPU memory with the code below.
import torch
foo = torch.tensor([1,2,3])
foo = foo.to('cuda')
If an error still occurs for the above code, it will be better to re-install PyTorch according to your CUDA version. (In my case, this solved the problem.)
Pytorch install link
A similar case will happen also for Tensorflow/Keras.
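For the PyTorch side, you can quickly print which versions you actually have; the values in the comments are only examples:

import torch

print(torch.__version__)          # PyTorch build, e.g. 1.10.0+cu113
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # False often points to a driver/toolkit mismatch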
If you are getting this error in Google Colab use this code:
import torch
torch.cuda.empty_cache()
In my experience, this is not a typical CUDA OOM Error caused by PyTorch trying to allocate more memory on the GPU than you currently have.
The giveaway is the distinct lack of the following text in the error message.
Tried to allocate xxx GiB (GPU Y; XXX GiB total capacity; yyy MiB already allocated; zzz GiB free; aaa MiB reserved in total by PyTorch)
In my experience, this is an Nvidia driver issue. A reboot has always solved the issue for me, but there are times when a reboot is not possible.
One alternative to rebooting is to kill all Nvidia processes and reload the drivers manually. I always refer to the unaccepted answer of this question written by Comzyh when performing the driver cycle. Hope this helps anyone trapped in this situation.
If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N where N is the size of the batch.
My dedicated GPU is limited to 2GB of memory, using bs=8 in the following example worked in my situation:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
Problem solved by the following code:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
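Note that this only takes effect if it runs before CUDA is initialized. A quick sanity check, using the same example indices 2 and 3:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # must be set before the first CUDA call

import torch
print(torch.cuda.device_count())  # should now report only the two selected GPUs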
Not sure if this'll help you or not, but this is what solved the issue for me:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
Nothing else in this thread helped.
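If you prefer to set it from Python instead of the shell, here is a sketch that assumes it runs before any CUDA allocation:

import os
# Must be set before PyTorch initializes CUDA (ideally before the first allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
x = torch.ones(1, device="cuda")  # the caching allocator now uses the configured split size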
I faced the same issue with my computer. All you have to do is customize your configuration file to match your computer's specifications. It turns out my computer handles image sizes below 600 x 600, and when I adjusted that in the configuration file, the program ran smoothly.