RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED using pytorch

I am trying to run a simple PyTorch sample. It works fine on the CPU, but on the GPU I get this error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 263, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 260, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The code I am trying to run is the following:
import torch
from torch import nn
m = nn.Conv1d(16, 33, 3, stride=2)
m=m.to('cuda')
input = torch.randn(20, 16, 50)
input=input.to('cuda')
output = m(input)
I am running this code in an NVIDIA Docker container with CUDA 10.2, and my GPU is an RTX 2070.

There is some discussion regarding this here. I had the same issue, but using CUDA 11.1 resolved it for me.
This is the exact pip command:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
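Since this answer hinges on which CUDA build the installed wheel targets, it can help to print what PyTorch itself reports before reinstalling anything. The following is a minimal sketch; the helper name describe_torch_build is made up for illustration and is not part of PyTorch.

```python
import torch


def describe_torch_build():
    """Collect the versions PyTorch reports for itself, CUDA, and cuDNN."""
    info = {
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If cuda_build does not match what the driver and GPU support, a version mismatch like the one described here is a plausible suspect.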

In my case it actually had nothing to do with the PyTorch/CUDA/cuDNN version. PyTorch initializes cuDNN lazily, the first time a convolution is executed. However, in my case there was not enough GPU memory left to initialize cuDNN because PyTorch itself already held the entire memory in its internal cache. One can release the cache manually with "torch.cuda.empty_cache()" right before the first convolution is executed. A cleaner solution is to force cuDNN initialization at the beginning by doing a mock convolution:
import torch

def force_cudnn_initialization():
    # Run one throwaway convolution so cuDNN is initialized up front.
    s = 32
    dev = torch.device('cuda')
    torch.nn.functional.conv2d(torch.zeros(s, s, s, s, device=dev), torch.zeros(s, s, s, s, device=dev))
Calling the above function at the very beginning of the program solved the problem for me.
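To check whether memory pressure is really the cause before relying on the workaround above, one can look at how much device memory is free and how much PyTorch's caching allocator is holding. This is a sketch only: it assumes a PyTorch version recent enough to provide torch.cuda.mem_get_info, and gpu_memory_report is an illustrative name.

```python
import torch


def gpu_memory_report(device=0):
    """Return free/total device memory and PyTorch's reserved amount, in MiB.

    Returns None when no CUDA device is usable.
    """
    if not torch.cuda.is_available():
        return None
    free_b, total_b = torch.cuda.mem_get_info(device)
    mib = 1024 ** 2
    return {
        "free_mib": free_b / mib,
        "total_mib": total_b / mib,
        # Memory held by PyTorch's caching allocator (in use or cached).
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
    }


if __name__ == "__main__":
    print(gpu_memory_report())
```

A small free_mib together with a large reserved_mib would be consistent with the situation described in this answer; a small free_mib with little reserved memory points to another process instead.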

I am also using CUDA 10.2. I had the exact same error when upgrading torch and torchvision to the latest version (torch-1.8.0 and torchvision-0.9.0). Which version are you using?
I guess this is not the best solution, but downgrading to torch-1.7.1 and torchvision-0.8.2 works just fine.

In my case this error occurred when computing the loss.
I used a mixed BCE-Dice loss.
It turned out that my output was linear instead of sigmoid.
I then applied a sigmoid to the predictions, as shown below, and it worked fine.
output = torch.nn.Sigmoid()(output)
loss = criterion1(output, target)
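The reason this matters is that a binary cross-entropy loss such as nn.BCELoss expects probabilities in [0, 1]. The sketch below runs on the CPU with made-up tensors (the poster's criterion1 is not reproduced here) and compares applying the sigmoid explicitly with using nn.BCEWithLogitsLoss, which applies it internally.

```python
import torch
from torch import nn

logits = torch.tensor([[2.5], [-1.0], [0.3]])  # raw (linear) model outputs
target = torch.tensor([[1.0], [0.0], [1.0]])

# BCELoss expects probabilities, so raw logits outside [0, 1] are a misuse.
try:
    nn.BCELoss()(logits, target)
    print("BCELoss accepted raw logits")
except RuntimeError as err:
    print("BCELoss rejected raw logits:", err)

# Option 1: squash to probabilities first, as in the answer above.
loss_sigmoid = nn.BCELoss()(torch.sigmoid(logits), target)

# Option 2: let the loss apply the sigmoid internally.
loss_logits = nn.BCEWithLogitsLoss()(logits, target)

print(loss_sigmoid.item(), loss_logits.item())
```

Both options should give the same value up to floating-point error; the second is generally the more numerically stable choice when the model outputs logits.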

I had the same issue when I was training yolov7 with a chess dataset. By reducing batch size from 8 to 4, the issue was solved.

In my case, I had to kill the existing processes on the GPU. Use nvidia-smi to check which processes are running, then use killall -9 python3 (or whichever process you want) to kill them. Once the memory has been freed, run your program again.
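The check with nvidia-smi can also be scripted. The sketch below lists the processes holding GPU memory so the right PID can be picked before killing anything. The helper names are illustrative, and the query fields (pid, process_name, used_memory) are written from memory and should be confirmed with nvidia-smi --help-query-compute-apps on your system.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, name, used = parts
        # The memory column is not always a number on every setup.
        used_mib = int(used) if used.isdigit() else None
        rows.append((int(pid), name, used_mib))
    return rows


def list_gpu_processes():
    """Return processes holding GPU memory ([] if nvidia-smi is missing or fails)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for pid, name, used_mib in list_gpu_processes():
        print(f"PID {pid}  {name}  {used_mib} MiB")
```

Killing a specific PID found this way is safer than killall -9 python3, which also terminates unrelated Python processes owned by the same user.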

In my case, I had an array indexing operation where the index was out of bounds. CUDA did not tell me that. I was running inference on a neural network, so I moved it to the CPU instead of the GPU. The logs were much more informative after that. When debugging this error, switch to the CPU first and you will know what to do.
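As a small illustration of why the CPU run is more informative, the sketch below replays an out-of-bounds index on CPU copies of the tensors and captures the exception. The helper first_error_on_cpu and the tensors are made up for this example.

```python
import torch


def first_error_on_cpu(fn, *tensors):
    """Run fn on CPU copies of the tensors and return the exception, if any."""
    cpu_tensors = [t.detach().cpu() for t in tensors]
    try:
        fn(*cpu_tensors)
    except (IndexError, RuntimeError) as err:
        return err
    return None


table = torch.arange(12.0).reshape(4, 3)  # 4 rows
bad_index = torch.tensor([1, 7])          # 7 is out of bounds for 4 rows

err = first_error_on_cpu(lambda t, i: t[i], table, bad_index)
print(type(err).__name__, err)
```

On the CPU the message names the offending index and dimension. If the code has to stay on the GPU, setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which usually brings the reported error closer to the line that caused it.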

Related

MemoryError Precedes BrokenPipeError While Training CNN

Generally for running pip with no cache we use --no-cache-dir, like
pip install pytorch --no-cache-dir.
I downloaded a CNN model I want to use from github.
The first two lines of execution
python generate_dataset.py --is_train=True --use_phase=True --chip_size=100 --patch_size=94 --use_phase=True --dataset=soc
python generate_dataset.py --is_train=False --use_phase=True --chip_size=128 --patch_size=128 --use_phase=True --dataset=soc
executed successfully. But while running
python train.py --config_name=config/AConvNet-SOC.json
it gives a MemoryError.
The publisher of the above repository is using 32 GB of RAM and an 11 GB GPU, but I have 8 GB of RAM and an 8 GB GPU.
Here is what I have done:
I thought of running it without the cache, like this:
python train.py --config_name=config/AConvNet-SOC.json --no-cache-dir
But it throws the error below:
FATAL Flags parsing error: Unknown command line flag 'no-cache-dir' Pass --helpshort or --helpfull to see help on flags.
I think this is because the no-cache-dir argument is not defined in the script via absl.flags. Does Python support running without a cache directory?
I am able to work around it by decreasing the number of epochs and batch_size, but I want to run it for the full number of epochs.
Using zero_grad() in PyTorch zeroes the gradients for every minibatch, so that the GPU won't run out of memory. But it is already used in the code I am running, in _base.py. Is there any way I can leverage this further?
How can I resolve this?

PyTorch | loss.backward() -> Missing XLA configuration

The loss is computed from a target model built with PyTorch (not TensorFlow). When I run the backward pass with the code below, I get the following error message.
loss.backward()
(Forward propagation can be calculated without problems.)
terminate called after throwing an instance of 'std::runtime_error'
what(): tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
Aborted
pytorch(1.12.0+cu102)
torchvision(0.13.0+cu102) <- target model contains pre-trained CNN model which can be installed from torchvision.models
google-compute-engine
GPU (NVIDIA Tesla T4 x 1, 11.6) <- The code worked in the environment where GPU (11.2) was installed, but it does not work in the current environment. / In the current environment, the same error occurs even if the GPU is not used and the CPU is used.
TPU is not installed (I don't want to use TPU, but GPU)
The code is working locally and was also working on other GPU environments as mentioned above. It stopped working when the environment was updated.
Please help me.

I solved this problem with the following command:
$ pip uninstall torch_xla
This error seemed to be caused by pytorch-ignite and torch_xla.
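To see whether torch_xla (or pytorch-ignite) is present in the environment before uninstalling anything, a small check like the following may help. It is only a sketch: is_installed is an illustrative helper, and ignite is assumed to be the import name of pytorch-ignite.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("torch", "torch_xla", "ignite"):
        status = "installed" if is_installed(name) else "not found"
        print(f"{name}: {status}")
```

If torch_xla shows up on a machine that has no TPU, removing it as described above is a reasonable first thing to try.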

Training MaskRCNN on custom data issue

I am trying to train Mask RCNN on a custom dataset of floorplans. I am following this article on Medium to do this: https://medium.com/analytics-vidhya/a-simple-guide-to-maskrcnn-custom-dataset-implementation-27f7eab381f2 .
After having some issues with annotation formats and packages I got around to training the model. However, I stumbled upon the following error code:
Traceback (most recent call last):
File "custom.py", line 391, in <module>
train(model)
File "custom.py", line 222, in train
layers='heads')
File "C:...\Custom_MaskRCNN-master\mrcnn\model.py", line 2356, in train
self.compile(learning_rate, self.config.LEARNING_MOMENTUM)
File "C:...\Custom_MaskRCNN-master\mrcnn\model.py", line 2201, in compile
self.keras_model.add_metric(loss, name)
AttributeError: 'Model' object has no attribute 'add_metric'
I could not find anything about this error and was hoping someone could help out or give me an indication on how to fix this.

As we can see, the requirements file does not specify exact TF and Keras versions, only a lower limit.
#requirements.txt
numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython[all]
When your env was created, the most recent versions of TensorFlow and Keras would have been installed. The 'add_metric' method might be deprecated or moved to another class in the latest version that got installed, as there have been major version updates to these frameworks. Please note that the author of the repo associated with this article has not updated it in the last two years. Even the author of the original repo, on which this repo is based, has not yet updated it (Original repo: https://github.com/matterport/Mask_RCNN). It is very likely that you're going to face more errors once the current one is solved.
One way to solve this issue would be to downgrade TF and Keras versions (tensorflow to 1.3.0, keras to 2.0.8 may resolve it).
The best course of action would be to port the code using the official conversion tools provided by TensorFlow to convert the TF1.x code to TF2.x or to use a repo in which the code has already been converted.
MaskRCNN Repo with Updated TF and Keras: https://github.com/ahmedfgad/Mask-RCNN-TF2
Hope that helps! Cheers :)

Change that line as follows:
FROM:
self.keras_model.add_metric(loss, name)
TO:
self.keras_model.metrics_tensors.append(loss)
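If the same code has to run against more than one Keras version, the two variants can be wrapped in a small compatibility helper. This is a sketch under the assumption that a model exposes either add_metric or a metrics_tensors list, as the two lines above suggest; register_loss_metric is a hypothetical name and not part of Keras or the Mask RCNN repo.

```python
def register_loss_metric(keras_model, loss, name):
    """Attach a loss tensor as a metric, using whichever API the model exposes."""
    if hasattr(keras_model, "add_metric"):
        # The call from the traceback works as written on this Keras version.
        keras_model.add_metric(loss, name)
    elif hasattr(keras_model, "metrics_tensors"):
        # Otherwise fall back to the list used in the replacement line above.
        keras_model.metrics_tensors.append(loss)
    else:
        raise AttributeError("model exposes neither add_metric nor metrics_tensors")
```

In model.py this would be called as register_loss_metric(self.keras_model, loss, name) in place of the failing line.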

Tensorflow: failed to create session in server

I developed a model in Keras and trained it quite a few times. Once I forcefully stopped the training of the model and since then I am getting the following error:
Traceback (most recent call last):
File "inception_resnet.py", line 246, in <module>
callbacks=[checkpoint, saveEpochNumber]) ##
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
class_weight=class_weight)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
outputs = self.train_function(ins)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
session = get_session()
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
_SESSION = tf.Session(config=config)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
So the error is actually
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
And most probably, the GPU memory is still occupied. I can't even create a simple tensorflow session.
I have seen an answer here, but when I execute the following command in terminal
export CUDA_VISIBLE_DEVICES=''
training of the model gets started without GPU acceleration.
Also, since I am training my model on a server to which I have no root access, I can't restart the server or clear the GPU memory as root. What is the solution now?

I found the solution in a comment of this question.
nvidia-smi -q
This gives a list of all the processes (and their PIDs) occupying GPU memory. I killed them one by one by using
kill -9 PID
Now everything is running smoothly again.

I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116,
and I also faced the same issue.
In my case this was caused by an incompatible cudatoolkit version:
conda install tensorflow-gpu
installed cudatoolkit 9.3.0 with cudnn 7.3.x. However, after going through the answers here and comparing with my other virtual environment, where I use PyTorch with the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version.
conda install cudatoolkit==9.0.0
This installed cudatoolkit 9.0.0 and cudnn 7.3.0 from cuda 9.0_0 build. After this I was able to create tensorflow session with GPU.
Now, coming to the options for killing jobs:
If GPU memory is occupied by other jobs, then killing them one by one, as suggested by Preetam saha arko, will free up the GPU and may allow you to create a tf session with the GPU (provided that compatibility issues have already been resolved).
To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID with nvidia-smi, and set the CUDA visible device to an available GPU ID (0 in this example):
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
tf.Session will then create a session on the specified GPU device.
Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID with nvidia-smi, and set the CUDA visible device to an empty value:
import os
os.environ["CUDA_VISIBLE_DEVICES"]=''
tf.Session will then create a session on the CPU.
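The two cases above can be folded into one helper. Note that the variable needs to be set before the framework initializes CUDA, so in practice before importing tensorflow. This is a sketch; select_devices is an illustrative name.

```python
import os


def select_devices(device_ids):
    """Restrict CUDA to the given GPU indices; an empty list hides all GPUs."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this before `import tensorflow` so the setting is seen when CUDA starts up.
select_devices([0])    # use only GPU 0
# select_devices([])   # hide every GPU and fall back to the CPU
```

Setting the variable after a session has already been created in the same process has no effect on that process, which is why the earlier session has to be killed first.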

I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server it ran fine, but while training the model in a Jupyter notebook I got the following error:
InternalError: Failed to create session
Reason: I was running multiple Jupyter notebooks on the same GPU (all of them using TensorFlow), so the Slurm server would not allow a new TensorFlow session to be created.
The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.
Below is the error log from the Jupyter notebook:
Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600

Optimization failure in theano

I am using Fedora with the Anaconda Python environment. I have a 960m nvidia gpu, for which I have installed the required drivers and the CUDA toolkit. But when I try to run the theano tests, I end up getting the following error (in a huge error output):
EE.EEEERROR (theano.gof.opt): Optimization failure due to: constant_folding
ERROR (theano.gof.opt): node: DimShuffle{x}(TensorConstant{2})
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
I was trying to compile a simple function y when I first saw the error. Searching for a solution led me to find that a lot of people had the same problem with the test function, but without any definite solutions. I followed the Theano documentation and set $CUDA_ROOT to my CUDA root folder, but to no avail.
I'm using Theano version 0.8.2 and NumPy 1.11.1, both from the conda repos. It seems like a GPU issue. But if the GPU has problems, shouldn't it fall back to the CPU?
Any help would be highly appreciated. Thanks!
