I am trying to fine-tune a spaCy NER model using BERT.
#Train the data
!python -m spacy train -g 0 config_spacy_bert.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
The batch size in the config file is 2, and I am getting this error:
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 1.96 GiB total capacity; 958.13 MiB already allocated; 11.25 MiB free; 978.00 MiB reserved in total by PyTorch)
How can I fix this error?
You don't have enough GPU memory. You need to get a bigger GPU, not use BERT at all, or use a smaller transformer model.
The recommended GPU memory size with spaCy is 10GB; you can sometimes make do with 8GB or slightly less, but it looks like you only have 2GB, which is just not enough.
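If switching hardware is not an option, the usual knobs are a smaller transformer and smaller batching settings under [training.batcher]. Below is a hedged sketch of the kind of edit you could try in config_spacy_bert.cfg, assuming the default spacy-transformers layout (your generated config may name things slightly differently); distilbert-base-uncased is just an example of a smaller checkpoint, and prajjwal1/bert-tiny is smaller still:
[components.transformer.model]
# swap bert-base-uncased for a smaller pretrained checkpoint
name = "distilbert-base-uncased"
Even with these changes, 2 GB of GPU memory will likely remain too tight for transformer training.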
I'm training a huggingface xlnet-large-cased model with the following specs:
args = TrainingArguments(
    f"xlnet-large-finetuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
and by calling this code:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
I reduced the batch size to 1, emptied the CUDA cache, and deleted all the variables with gc, but I still get this error:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 15.78 GiB total capacity; 14.31 GiB already allocated; 2.75 MiB free; 14.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there any way I could resolve this without having to acquire more GPU credits?
There is a method called "mixed precision": the idea is to carry out most of the computation in float16 instead of float32, which speeds up training and reduces memory use (see the mixed-precision documentation for details).
In some repositories you will see "automatic mixed precision" implemented via the apex package. However, with recent versions of PyTorch you can use it directly through torch.cuda.amp: wrap the forward computation in autocast() and control gradient and loss scaling with a GradScaler, as in the sketch below.
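A minimal sketch of the pattern, using a stand-in linear model and random data so it runs on its own; your real model, data loader, and optimizer go in their places:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")
model = nn.Linear(128, 2).to(device)                # stand-in for your real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()

for step in range(10):
    inputs = torch.randn(4, 128, device=device)     # stand-in batch
    labels = torch.randint(0, 2, (4,), device=device)
    optimizer.zero_grad()
    with autocast():                                # forward pass runs in float16 where safe
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()                   # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                          # unscales gradients, then steps the optimizer
    scaler.update()                                 # adjusts the scale factor for the next step
Since you are already using the Hugging Face Trainer, passing fp16=True to TrainingArguments enables the same mechanism without writing the loop yourself.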
I think I'm running PyTorch.
I am new to Python and am experimenting with convolutional neural networks and processing larger images. But I keep running into this error, even if I request smaller image outputs. I just signed up for Colab Pro; while it is certainly faster, it still errors out with CUDA out of memory. I would reallocate memory if I knew how, but I don't. Is there any other way to access/manage GPU memory?
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line
255, in backward torch.autograd.backward(self, gradient, retain_graph,
create_graph, inputs=inputs) File
"/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py",
line 149, in backward allow_unreachable=True, accumulate_grad=True) #
allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to
allocate 114.00 MiB (GPU 0; 15.78 GiB total capacity; 13.27 GiB
already allocated; 4.75 MiB free; 14.43 GiB reserved in total by
PyTorch) VGG-19 Architecture Detected Successfully loaded
models/vgg19-d01eb7cb.pth conv1_1: 64 3 3 3 conv1_2: 64 64 3 3
conv2_1: 128 64 3 3 conv2_2:
Below I have shown some ways to manage GPU memory in PyTorch, but these are often not the recommended ways to deal with CUDA errors like yours.
The reason you get this error has nothing to do with the size of your output but with the size of your input. Either the images coming into your network are far too big, in which case you may need transforms.Resize() (see the sketch below), or your batch size is far too big, so you are asking for a huge parallel computation and need to lower that number in the DataLoader.
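For the first case, a minimal sketch of shrinking inputs with torchvision transforms before they reach the network (assumes torchvision and PIL are installed; the 224x224 target size is only an example, not a recommendation):
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # downscale large images before they hit the GPU
    transforms.ToTensor(),
])
img = Image.new("RGB", (4000, 3000))  # stand-in for a large input image
x = preprocess(img).unsqueeze(0)      # shape: (1, 3, 224, 224)
For the second case, simply pass a smaller batch_size to your DataLoader.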
A tensor can be removed from GPU memory like this:
a = torch.tensor(1, device="cuda")  # the tensor has to live on the GPU for this to matter
del a
# Not usually needed, and not really recommended to call explicitly
torch.cuda.empty_cache()
A tensor can be allocated in CUDA memory simply by moving it to the device:
a = torch.tensor(1)
a = a.cuda()
# OR
device = torch.device("cuda")
a = a.to(device)
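If you want to see the effect of these calls, torch.cuda.memory_allocated() reports the memory currently held by live tensors; a small sketch (the exact numbers will vary by device):
import torch

a = torch.ones(1000, 1000, device="cuda")                        # roughly 4 MB of float32
print(torch.cuda.memory_allocated() / 1024**2, "MiB in tensors")
del a
torch.cuda.empty_cache()                                         # returns cached blocks to the driver
print(torch.cuda.memory_allocated() / 1024**2, "MiB in tensors")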
Can anybody explain the meaning of this common problem in PyTorch?
Model: EfficientDet-D4
GPU: RTX 2080Ti
Batch size: 2
CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 11.00 GiB total capacity; 8.32 GiB already allocated; 2.59 MiB free; 8.37 GiB reserved in total by PyTorch)
Anyway, I think the model and GPU are not important here, and I know the solution should be to reduce the batch size, turn off gradients while validating, etc. But I just want to know what the 8.32 GiB means: I have 11 GiB, so why can it not allocate 14.00 MiB more?
Addition: I watched nvidia-smi while training with batch size = 1; it took 9.5 GiB on my GPU.
I got the answer from ptrblck in the PyTorch community forum, where I described my question in more detail than here.
Please check the answer there.
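For what it's worth, in these error messages "already allocated" counts memory held by live tensors, "reserved" also includes the caching allocator's free pool, and the rest of the card is taken by the CUDA context, other processes, and fragmentation, which is why a small allocation can still fail. A quick sketch of how to print the full breakdown with PyTorch's own reporting:
import torch

torch.ones(1, device="cuda")        # force CUDA initialisation
print(torch.cuda.memory_summary())  # allocated / reserved / free statistics per memory pool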
I'm just playing around with PyTorch, and I'm wondering why it consumes so much of my GPU memory.
I'm using CUDA 10.0 with PyTorch 1.2.0 and torchvision 0.4.0.
import torch
gpu = torch.device("cuda")
x = torch.ones(int(4e8), device=gpu)  # 4e8 float32 values, roughly 1.49 GiB
y = torch.ones(int(1e5), device=gpu)  # this tiny allocation is the one that fails
Running the above code I get the error:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 2.00 GiB total capacity; 1.49 GiB already allocated; 0 bytes free; 0 bytes cached)
So does PyTorch need ~500 MB of GPU memory as overhead? Or what is the problem here?
More information and testing done by xymeng on GitHub can be seen at the linked issue.
Referencing xymeng's words:
PyTorch has its own cuda kernels. From my measurement the cuda runtime allocates ~1GB memory for them. If you compile pytorch with cudnn enabled the total memory usage is 1GB + 750M + others = 2GB+
Note that this is just my speculation as there is no official documentation about this. What puzzles me is that the cuda runtime allocates much more memory than the actual code size (they are approx. linearly correlated. If I remove half of pytorch's kernels the memory usage is also reduced by half). I suspect either the kernel binaries have been compressed or they have to be post-processed by the runtime.
This seems to fit your situation.
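A quick back-of-the-envelope check of the numbers in your error message supports this: the first tensor alone accounts for essentially all of the "already allocated" figure, leaving roughly half a gigabyte for PyTorch's kernels and the CUDA context on a 2 GiB card.
elements = int(4e8)                           # size of x in the question
bytes_per_float32 = 4
print(elements * bytes_per_float32 / 2**30)   # ~1.49 GiB, matching "1.49 GiB already allocated"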
The same model trained fine with batch-size=5; I had reduced the batch size from 80 to 5 during training because of this same error. I am using a GPU with 11 GB of memory instead of the Titan X (12 GB) used by the author in the actual experiment.
However, testing, which uses only batch-size=1, is now not running.
The issue is in the I-frame model's testing phase; the other two models have successfully produced results for testing.
Following is my testing command:
time python test.py --arch resnet152 --data-name ucf101 --representation iframe --data-root data/ucf101/mpeg4_videos --test-list data/datalists/ucf101_split1_test.txt --weights ucf101_iframe_model_iframe_model_best.pth.tar --save-scores iframe_score_file
I have used nvidia-smi to make sure nothing else is running on the GPU.
Following is the actual error message:
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 10.92 GiB total capacity; 10.12 GiB already allocated; 245.50 MiB free; 21.69 MiB cached)
What could be the issue, and how can it be fixed?
EDIT: After removing the following two lines from test.py, it starts running without a memory issue, but it is taking ages to process:
net = torch.nn.DataParallel(net.cuda(devices[0]), device_ids=devices)
net.eval()
Yes, the above lines are for GPU-based parallel processing. Still, is there a solution to my problem?
I suggest checking your test code first.
You can try:
with torch.no_grad():
It will reduce memory consumption for computations that would otherwise have requires_grad=True.
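A minimal, self-contained sketch of the pattern (the linear layer and random batch are stand-ins for the real network and data in test.py):
import torch
import torch.nn as nn

net = nn.Linear(10, 2).cuda()    # stand-in for the real model
net.eval()                       # disable dropout / batch-norm updates
with torch.no_grad():            # no autograd graph is stored, so activations are freed immediately
    for _ in range(3):
        batch = torch.randn(1, 10, device="cuda")   # stand-in for a test batch
        scores = net(batch)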
Original Answer (you can try it if you have a bigger GPU):
Maybe the model itself and its parameters take up a lot of memory.
You can try batch-size=1 on the Titan X GPU you used before and watch whether GPU memory usage exceeds 11 GB. If so, the GPU you are using now (11 GB of memory) may not be suitable for this work.
I have run this model/testing on a GPU with up to 8 GB of memory by adding the following flag to the testing command given in the question:
--test-crops 1