Tensorflow: failed to create session in server - python

I developed a model in Keras and trained it quite a few times. At one point I forcefully stopped a training run, and since then I have been getting the following error:
Traceback (most recent call last):
File "inception_resnet.py", line 246, in <module>
callbacks=[checkpoint, saveEpochNumber]) ##
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
class_weight=class_weight)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
outputs = self.train_function(ins)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
session = get_session()
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
_SESSION = tf.Session(config=config)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
So the error is actually:
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
And most probably, the GPU memory is still occupied. I can't even create a simple tensorflow session.
I have seen an answer here, but when I execute the following command in the terminal,
export CUDA_VISIBLE_DEVICES=''
training of the model starts, but without GPU acceleration.
Also, I am training my model on a server to which I have no root access, so I can't restart the server or clear the GPU memory as root. What is the solution now?

I found the solution in a comment on this question.
nvidia-smi -q
This gives a list of all the processes (and their PIDs) occupying GPU memory. I killed them one by one by using
kill -9 PID
Now everything is running smoothly again.
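For reference, here is a minimal Python sketch of the same idea (not part of the original answer); it assumes your driver's nvidia-smi supports the --query-compute-apps flag and that you own the listed processes:
import os
import signal
import subprocess

# Ask nvidia-smi for the PIDs currently holding GPU memory
out = subprocess.check_output(
    ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader']
).decode()

for line in out.strip().splitlines():
    pid = int(line.split(',')[0])
    if pid != os.getpid():            # don't kill ourselves
        os.kill(pid, signal.SIGKILL)  # same effect as `kill -9 PID`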

I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116, and I faced the same issue.
In my case this was caused by an incompatible cudatoolkit version:
conda install tensorflow-gpu
installed cudatoolkit 9.3.0 with cudnn 7.3.x. However, after going through the answers here and referring to my other virtual environment, where I use PyTorch with the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version.
conda install cudatoolkit==9.0.0
This installed cudatoolkit 9.0.0 and cudnn 7.3.0 from the cuda 9.0_0 build. After this I was able to create a TensorFlow session with the GPU.
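As a quick sanity check (a sketch for the TF1-style API used in this question), you can confirm that the GPU is visible again after fixing the toolkit:
import tensorflow as tf
from tensorflow.python.client import device_lib

# A '/device:GPU:0' entry should now show up in the device list
print(device_lib.list_local_devices())

# Creating a session should no longer raise "Failed to create session"
sess = tf.Session()
sess.close()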
Now, coming to the options for killing jobs:
If you have GPU memory occupied by other jobs, then killing them one by one as suggested by @Preetam saha arko will free up the GPU and may allow you to create a tf session with the GPU (provided the compatibility issues are already resolved).
To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID with nvidia-smi, and set the CUDA visible devices to an available GPU ID (0 in this example):
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
Then tf.Session() will create a session on the specified GPU device.
Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID with nvidia-smi, and set the CUDA visible devices to empty:
import os
os.environ["CUDA_VISIBLE_DEVICES"]=''
Then tf.Session() will create a session on the CPU.
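Putting this together, a minimal sketch (not from the original answer) showing the ordering that matters, namely that the environment variable is set before TensorFlow initializes the GPU:
import os

# '0' pins the session to GPU 0, '' falls back to CPU only
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

import tensorflow as tf

sess = tf.Session()   # created on GPU 0 if it is free, otherwise this will fail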

I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server it would run fine, but while training the model in a Jupyter notebook I would get the following error:
InternalError: Failed to create session
Reason: I was running multiple Jupyter notebooks on the same GPU (all of them using TensorFlow), so the Slurm server would refuse to create a new TensorFlow session.
The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.
Below is the error from the Jupyter notebook log:
Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600
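If you really do need a couple of notebooks on the same GPU, one workaround (a sketch using the TF1-style ConfigProto, with the fraction chosen arbitrarily) is to cap how much memory each session may claim:
import tensorflow as tf

config = tf.ConfigProto()
# Cap each notebook at roughly a third of the card so several sessions can coexist
config.gpu_options.per_process_gpu_memory_fraction = 0.3
sess = tf.Session(config=config)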

Related

PyTorch | loss.backward() -> Missing XLA configuration

The loss is calculated from a target model created using PyTorch (not TensorFlow), and when backpropagating I run the code below and get the following error message.
loss.backward()
(Forward propagation can be calculated without problems.)
terminate called after throwing an instance of 'std::runtime_error'
what(): tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
Aborted
- pytorch (1.12.0+cu102)
- torchvision (0.13.0+cu102) <- the target model contains a pre-trained CNN model that can be installed from torchvision.models
- google-compute-engine
- GPU (NVIDIA Tesla T4 x 1, 11.6) <- The code worked in an environment where GPU (11.2) was installed, but it does not work in the current environment. In the current environment, the same error occurs even if the CPU is used instead of the GPU.
- TPU is not installed (I don't want to use a TPU, but the GPU)
The code is working locally and was also working on other GPU environments as mentioned above. It stopped working when the environment was updated.
Please help me...
I solved this problem with the following command:
$ pip uninstall torch_xla
This error seemed to be caused by pytorch-ignite and torch_xla.
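If you want to confirm that torch_xla is actually present before uninstalling, a quick check (just a sketch):
import importlib.util

# Prints a module spec if torch_xla is installed (and may be hijacking the
# backward pass); prints None if it is absent.
print(importlib.util.find_spec("torch_xla"))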

PyTorch CUDA GPU not utilized properly

I am trying to train a PyTorch model on my local machine. It has two GPUs: a built-in Intel GPU and an NVIDIA GeForce 930MX.
The second is NVIDIA and thus should be used with CUDA. In fact, if I check torch.cuda.device_count() it returns 1, and torch.cuda.get_device_name() returns NVIDIA GeForce 930MX. When I run the script, however, the usage of the built-in Intel GPU goes up to 100% and then the program crashes with:
OSError: [WinError 1450] Insufficient system resources exist to complete the requested service
The usage (as seen from the task manager) of the targeted GPU (NVIDIA) remains at 0%, so it has not been used.
Which configuration steps might I have messed up, and what would you propose in order to run PyTorch on the proper GPU?
*Using the LTS versions of torch and CUDA as of the day of posting the question.
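Not a confirmed fix for this question, but one thing worth verifying is that the model and tensors are explicitly moved onto the CUDA device; a minimal sketch (model and inputs are placeholders for your own objects):
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name(0))   # should print 'NVIDIA GeForce 930MX'

model = model.to(device)     # placeholder: your nn.Module
inputs = inputs.to(device)   # placeholder: your input batch
outputs = model(inputs)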

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED using pytorch

I am trying to run a simple PyTorch sample code. It works fine using the CPU, but when using the GPU I get this error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 263, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 260, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The code I am trying to run is the following:
import torch
from torch import nn
m = nn.Conv1d(16, 33, 3, stride=2)
m=m.to('cuda')
input = torch.randn(20, 16, 50)
input=input.to('cuda')
output = m(input)
I am running this code in an NVIDIA Docker container with CUDA version 10.2, and my GPU is an RTX 2070.
There is some discussion regarding this here. I had the same issue, but using CUDA 11.1 resolved it for me.
This is the exact pip command:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
In my case it actually had nothing to do with the PyTorch/CUDA/cuDNN version. PyTorch initializes cuDNN lazily whenever a convolution is executed for the first time. However, in my case there was not enough GPU memory left to initialize cuDNN because PyTorch itself already held the entire memory in its internal cache. One can release the cache manually with torch.cuda.empty_cache() right before the first convolution that is executed. A cleaner solution is to force cuDNN initialization at the beginning by doing a mock convolution:
import torch

def force_cudnn_initialization():
    # Run a throwaway convolution so cuDNN is initialized while GPU memory is still available
    s = 32
    dev = torch.device('cuda')
    torch.nn.functional.conv2d(torch.zeros(s, s, s, s, device=dev), torch.zeros(s, s, s, s, device=dev))
Calling the above function at the very beginning of the program solved the problem for me.
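For completeness, the cache-clearing alternative mentioned above would look roughly like this (a sketch; it only releases cached, unused blocks, not memory held by live tensors):
import torch

torch.cuda.empty_cache()   # free PyTorch's cached GPU memory before cuDNN initializes
output = m(input)          # `m` and `input` as in the snippet from the question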
I am also using CUDA 10.2. I had the exact same error when upgrading torch and torchvision to the latest versions (torch-1.8.0 and torchvision-0.9.0). Which versions are you using?
I guess this is not the best solution, but downgrading to torch-1.7.1 and torchvision-0.8.2 works just fine.
In my case this error occurred when trying to compute the loss.
I used a mixed BCE-Dice loss.
It turned out that my output was linear instead of sigmoid.
I then used the sigmoid predictions as below and it worked fine.
output = torch.nn.Sigmoid()(output)
loss = criterion1(output, target)
I had the same issue when I was training yolov7 with a chess dataset. Reducing the batch size from 8 to 4 solved the issue.
In my case, I killed the existing processes on the GPU. Use nvidia-smi to check which processes are running, use killall -9 python3 (or whichever process you want) to kill them, and then run your own process once the space has been freed up.
In my case, I had an array indexing operation where the index was out of bounds, and CUDA did not tell me that. I was running inference on a neural network, so I moved to the CPU instead of the GPU. The error messages were much more informative after that. If you see this error, switch to the CPU first for debugging and you will know what to do.
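A minimal sketch of that debugging approach (model and batch are placeholders for your own objects):
import torch

# Re-run the failing forward pass on the CPU: index errors then surface
# with a readable message instead of an opaque cuDNN/CUDA error.
model_cpu = model.to("cpu")
with torch.no_grad():
    out = model_cpu(batch.to("cpu"))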

How to release GPU resources in keras in ipython in spyder?

When I typically run a Python script from the command line, for example python test.py, the GPU memory is released just after the script finishes.
In this test.py script, I simply load a Keras model to evaluate and predict some data. There is no training in it.
However, if I open Spyder and run this script there, the results appear in the IPython console, but when I then type nvidia-smi on the command line, the GPU memory is not released.
What I tried is closing the IPython kernel and starting a new one, but then all my other variables are lost. Is there a decent way to release the GPU memory after model.evaluate(x, y) from Spyder?
Screenshots: nvidia-smi output before and after running the script from Spyder.
Normally, the TensorFlow backend reserves all of the memory on the GPU. It may not actually use all of it, but the memory is kept from being used by other programs until the TensorFlow backend process is terminated. So in nvidia-smi you will see that the memory is not released, even though TensorFlow has released it internally within its own framework.
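If the goal is mainly to stop the backend from reserving the whole card up front, a sketch for the TF1-era Keras used here (run it before loading the model); note this only limits what gets reserved, it does not hand already-allocated memory back until the kernel exits:
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True     # allocate GPU memory on demand
K.set_session(tf.Session(config=config))   # make Keras use this session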

Get sensor values (like Temperature of GPU and CPU) and fan speeds of Windows 10 PC

I've been trying to get a Python script to show the temperatures of the CPU, GPU and other available sensors in my hardware, but I haven't found anything useful.
I tried using WMI to get those values, but my processor is apparently not supported.
The code I used was:
import wmi
w = wmi.WMI(namespace="root\wmi")
temperature_info = w.MSAcpi_ThermalZoneTemperature()[0]
print temperature_info.CurrentTemperature
which I got from another Stack Overflow thread, but I get the following error:
Traceback (most recent call last):
File "C:/Users/Joe/Desktop/test.py", line 3, in <module>
temperature_info = w.MSAcpi_ThermalZoneTemperature()[0]
File "C:\Python27\lib\site-packages\wmi.py", line 819, in query
handle_com_error ()
File "C:\Python27\lib\site-packages\wmi.py", line 241, in handle_com_error
raise klass (com_error=err)
x_wmi: <x_wmi: Unexpected COM Error (-2147217396, 'OLE error 0x8004100c', None, None)>
which, according to Microsoft Support, means Not Supported (0x8004100C)
I have tried running the command-line version of this code in a cmd.exe window run as an administrator, but I got the same error.
Is there any other way to access CPU and GPU temperatures?
PS: My OS is Windows 10 and my CPU is an AMD FX-8350. I am unsure whether my OS or my CPU is at fault for this error.
Here is a way to get your GPU temperature.
Use the nvidia-smi tool.
This is an .exe file located in "C:\Program Files\NVIDIA Corporation\NVSMI".
In the command prompt, just enter:
cd C:\Program Files\NVIDIA Corporation\NVSMI
then type:
nvidia-smi
This will display the usual nvidia-smi output table, where you can see the GPU temperature.
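If you want to read that temperature from Python rather than from the command prompt, a small sketch that shells out to nvidia-smi (this assumes the NVSMI folder above is on your PATH; otherwise pass the full path to the executable):
import subprocess

# --query-gpu=temperature.gpu asks nvidia-smi for just the temperature in Celsius
temp = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"]
).decode().strip()
print("GPU temperature: %s C" % temp)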
Coming to the CPU and fan speed values, Microsoft apparently does not have built-in functionality to expose these values to the user, but you can try third-party applications like MSI Afterburner. Microsoft strictly warns against this, though, as it might affect performance.
