Why does PyTorch need much more memory than it should? - python

I'm just playing around with PyTorch and I'm wondering why it consumes so much of my GPU's memory.
I'm using CUDA 10.0 with PyTorch 1.2.0 and torchvision 0.4.0.
import torch
gpu = torch.device("cuda")
x = torch.ones(int(4e8), device=gpu)
y = torch.ones(int(1e5), device=gpu)
Running the above code I get the error:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 2.00 GiB total capacity; 1.49 GiB already allocated; 0 bytes free; 0 bytes cached)
So, does PyTorch need ~500 MB of GPU memory as overhead? Or what is the problem here?

More information and testing done by xymeng on GitHub can be seen at the given link.
Quoting xymeng:
PyTorch has its own cuda kernels. From my measurement the cuda runtime allocates ~1GB memory for them. If you compile pytorch with cudnn enabled the total memory usage is 1GB + 750M + others = 2GB+
Note that this is just my speculation as there is no official documentation about this. What puzzles me is that the cuda runtime allocates much more memory than the actual code size (they are approx. linearly correlated. If I remove half of pytorch's kernels the memory usage is also reduced by half). I suspect either the kernel binaries have been compressed or they have to be post-processed by the runtime.
That seems to match your situation.
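For context, 4e8 float32 values alone are about 1.49 GiB, which matches the "1.49 GiB already allocated" in the error, so the remaining ~0.5 GiB on a 2 GiB card is consistent with that context/kernel overhead. As a quick way to see the split yourself, here is a minimal sketch (not from the original answer; note that memory_reserved() was called memory_cached() in older releases such as 1.2):
import torch

gpu = torch.device("cuda")
x = torch.ones(int(1e8), device=gpu)  # ~0.37 GiB of float32 data

allocated = torch.cuda.memory_allocated(gpu)  # memory actually held by tensors
reserved = torch.cuda.memory_reserved(gpu)    # memory reserved by PyTorch's caching allocator
print(f"allocated: {allocated / 2**30:.2f} GiB, reserved: {reserved / 2**30:.2f} GiB")
# Whatever nvidia-smi reports beyond "reserved" is the CUDA context,
# kernel images, etc. -- the overhead discussed above.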

Related

CUDA out of memory error with a batch size of 1 even after emptying cuda cache

I'm training a huggingface xlnet-large-cased model with the following specs:
args = TrainingArguments(
    f"xlnet-large-finetuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
and by calling this code:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
I reduced the batch size to 1, emptied cuda cache and deleted all the variables in gc but I still get this error: RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 15.78 GiB total capacity; 14.31 GiB already allocated; 2.75 MiB free; 14.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there any way I could resolve this without having to acquire more GPU credits?
There is a technique called "mixed precision": the idea is to run computations in float16 instead of float32 to speed up training and reduce memory use; see the details of mixed precision.
In some repositories you will see "automatic mixed precision" implemented with the apex package. However, with recent versions of PyTorch you can use it directly through torch.cuda.amp by wrapping the forward computation in autocast() and controlling the loss and gradient scaling with a GradScaler.
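For illustration, a minimal training-step sketch with torch.cuda.amp (model, optimizer, loss_fn and train_loader are placeholders, not from the question):
import torch

scaler = torch.cuda.amp.GradScaler()
for batch, labels in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales gradients, then steps the optimizer
    scaler.update()
With the Hugging Face Trainer used above, the same effect is typically enabled by passing fp16=True to TrainingArguments.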

Allocator ran out of memory - how to clear GPU memory from TensorFlow dataset?

Assuming a Numpy array X_train of shape (4559552, 13, 22), the following code:
train_dataset = tf.data.Dataset \
.from_tensor_slices((X_train, y_train)) \
.shuffle(buffer_size=len(X_train) // 10) \
.batch(batch_size)
works fine exactly once. When I re-run it (after slight modifications to X_train), it triggers an InternalError because the GPU is out of memory:
2021-12-19 15:36:58.460497: W tensorflow/core/common_runtime/bfc_allocator.cc:457]
Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.71GiB requested by op _EagerConst
It seems that the first time, it finds 100% free GPU memory so all works fine, but the subsequent times, the GPU memory is already almost full and hence the error.
From what I understand, it seems that simply clearing GPU memory from the old train_dataset would be sufficient to solve the problem, but I couldn't find any way to achieve this in TensorFlow. Currently the only way to re-assign the dataset is to kill the Python kernel and re-run everything from start.
Is there a way to avoid re-starting the Python kernel from scratch and instead free the GPU memory so that the new dataset can be loaded into it?
The dataset doesn't need the full GPU memory, so I would consider switching to a TFRecord-based pipeline only as a non-ideal workaround here (it comes with additional complications).
Try limiting TensorFlow's GPU memory use, for example by enabling memory growth so it does not grab all of the memory up front, as shown here:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
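If a hard cap is preferred instead of growth, TensorFlow can also be given a fixed memory budget via a logical device. A sketch, where the 4096 MB limit is an arbitrary assumption:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Must run before the GPU is initialized (i.e. before any op touches it)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])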

GPU Runtime error on Geforce Nvidia MX130

I am trying to fine-tune a spaCy NER model using BERT.
#Train the data
!python -m spacy train -g 0 config_spacy_bert.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
The batch size in the config file is 2, and I am getting this error:
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 1.96 GiB total capacity; 958.13 MiB already allocated; 11.25 MiB free; 978.00 MiB reserved in total by PyTorch)
How can I resolve this error?
You don't have enough GPU memory. You need to get a bigger GPU, or not use BERT, or use a smaller model.
The recommended GPU memory size with spaCy is 10GB; you can sometimes make do with 8GB or slightly less, but it looks like you only have 2GB, which is just not enough.

Not understanding CUDA resources and keep running out of memory

I think I'm running PyTorch.
I am new to Python and trying to use it to experiment with convolutional neural networks and process larger images. But I keep running into this error, even if I request smaller image outputs. I just signed up for Colab Pro. While it is certainly faster, it still errors out with CUDA out of memory. I would reallocate memory if I knew how, but I don't. Is there any other way to access/manage GPU memory?
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line
255, in backward torch.autograd.backward(self, gradient, retain_graph,
create_graph, inputs=inputs) File
"/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py",
line 149, in backward allow_unreachable=True, accumulate_grad=True) #
allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to
allocate 114.00 MiB (GPU 0; 15.78 GiB total capacity; 13.27 GiB
already allocated; 4.75 MiB free; 14.43 GiB reserved in total by
PyTorch) VGG-19 Architecture Detected Successfully loaded
models/vgg19-d01eb7cb.pth conv1_1: 64 3 3 3 conv1_2: 64 64 3 3
conv2_1: 128 64 3 3 conv2_2:
I have shown below some ways to manage GPU memory in PyTorch, though these are often not the recommended ways to deal with CUDA errors like yours.
The reason you get this error has nothing to do with the size of your output but with the size of your input. Either the images coming into your network are far too big, in which case you may need to use transforms.Resize(), or your batch size is way too big, so you are asking for a huge parallel computation and need to lower that number in the dataloader.
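As an illustration of those two knobs (a sketch with placeholder sizes and paths, not the asker's actual pipeline):
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # shrink images before they reach the network
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("path/to/images", transform=transform)
# A smaller batch size directly lowers per-step GPU memory use
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)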
A tensor can be removed from GPU memory as follows:
a = torch.tensor(1)
del a
# Though not suggested and not really needed to be called explicitly
torch.cuda.empty_cache()
To allocate a tensor in CUDA memory, simply move the tensor to the device:
a = torch.tensor(1)
a = a.cuda()
# OR
device = torch.device("cuda")
a = a.to(device)
Sarthak Jain

why "RuntimeError CUDA out of memory" in testing?

The same model ran fine for training with batch-size=5. I had reduced the batch size from 80 to 5 during training because of the same error. I am using a GPU with 11 GB of memory instead of the Titan X (12 GB memory) used by the author in the actual experiment.
However, now in testing, which only has batch-size=1, it is not running.
The issue is in the I-frame model's testing phase; the other two models have successfully produced results in testing.
Following is my testing command:
time python test.py --arch resnet152 --data-name ucf101 --representation iframe --data-root data/ucf101/mpeg4_videos --test-list data/datalists/ucf101_split1_test.txt --weights ucf101_iframe_model_iframe_model_best.pth.tar --save-scores iframe_score_file
I have used nvidia-smi to make sure nothing else is running on the GPU.
Following is the actual error message:
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 10.92 GiB total capacity; 10.12 GiB already allocated; 245.50 MiB free; 21.69 MiB cached)
What could be the issue and how can it be fixed?
EDIT: By removing the following two lines from test.py, it starts running without a memory issue, but it is taking ages to process:
net = torch.nn.DataParallel(net.cuda(devices[0]), device_ids=devices)
net.eval()
Yes, the above lines are for GPU-based parallel processing.
Still, is there a solution to my problem?
I suggest that you check your test code first.
You can try:
with torch.no_grad():
It will reduce memory consumption for computations that would otherwise have requires_grad=True.
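A test loop wrapped this way might look like the following sketch (net, test_loader and devices are placeholders based on the question):
net.eval()                     # keep eval mode for correct BatchNorm/Dropout behaviour
with torch.no_grad():          # no autograd graph is built, so activations are freed early
    for inputs in test_loader:
        inputs = inputs.cuda(devices[0])
        scores = net(inputs)
        # ... accumulate or save the scores ...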
Original answer (you can try it if you have a bigger GPU):
Maybe the model itself and its parameters take up a lot of memory.
You can try batch-size=1 on the Titan X GPU you used before and check whether GPU memory usage exceeds 11 GB. If so, the GPU you are using now (11 GB memory) may not be suitable for this work.
I have run this model/testing on a GPU with memory up to 8 GB by adding the following flag to the testing command given in the question:
--test-crops 1
