I'm trying to run EnyaHermite's PyTorch implementation of PicassoNet-II (https://github.com/EnyaHermite/Picasso) on an Ubuntu 18.04.6 LTS GPU cluster, and I encounter the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorInvalidAddressSpace: operation not supported on global/shared address space
The framework calls a few C++/CUDA extensions from its main Python script, and one of them is this decimate_gpu.cu file (https://github.com/EnyaHermite/Picasso/blob/main/pytorch/picasso/mesh/modules/source/decimate_gpu.cu).
I cannot debug the file since I'm running it on a GPU cluster; I only know the crash happens because of this file.
I've only seen one similar issue (here: https://forums.developer.nvidia.com/t/invalidaddressspace-when-using-pointer-from-continuation-callable-parameters/184951/7). The issue in that post was an incorrect definition of a callable, so changing __global__ to __device__ made it work.
I'm not sure whether this error is similar, but I have no idea how to fix it.
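The only extra signal I can think of getting on the cluster is to force synchronous kernel launches before CUDA is initialized, so the traceback points closer to the failing op instead of at a later free inside thrust. A minimal sketch of that idea (the decimation call itself is just a placeholder for however the Picasso module is invoked):

import os

# Make kernel launches synchronous so the error is reported at the
# offending launch instead of at a later cudaFree inside thrust.
# This must be set before CUDA is initialized, i.e. before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))

# ... then run the smallest mesh/decimation call that still reproduces
# the crash, so the synchronous traceback names the kernel at fault.

If the cluster allows it, running the same script under compute-sanitizer (e.g. compute-sanitizer python my_script.py, where my_script.py is whatever entry point reproduces the crash) should also report the first invalid address-space access inside decimate_gpu.cu.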
Best,
Bjonze
I'm using Anaconda to run my Transformers project locally in Google Colab.
I've created a new environment (tf_gpu) and installed (supposedly) everything I need.
And everything works fine, but when I try to simply import pytorch, this error appears:
[WinError 206] The filename or extension is too long: 'C:\\Users\\34662\\anaconda3\\envs\\tf_gpu\\lib\\site-packages\\torch\\lib'
The path is clearly not long enough to trigger this error.
My Python version is 3.8 and my GPU is an Nvidia GeForce GTX 1650, so it shouldn't be a GPU problem.
Does anybody know why this happens?
Any help is good at this point; I don't know how to solve this.
Here I leave a screenshot of the complete error message
Thank you in advance.
Your problem is that the error is not actually a path-too-long error; it is a file-not-found error, which means that PyTorch is not correctly installed.
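A quick way to confirm that (a minimal check, using the path from your error message verbatim) is to see whether that torch\lib directory actually exists and contains the native DLLs the import needs:

import os

# Path taken verbatim from the error message; adjust it if your env differs.
lib_dir = r"C:\Users\34662\anaconda3\envs\tf_gpu\lib\site-packages\torch\lib"
print("exists:", os.path.isdir(lib_dir))
if os.path.isdir(lib_dir):
    files = os.listdir(lib_dir)
    print(len(files), "files, e.g.", files[:5])

If the directory is missing or nearly empty, uninstall PyTorch inside the tf_gpu environment (pip uninstall torch) and reinstall it with the command from the official PyTorch install selector.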
This is some weird behaviour that's been bugging me, and I'm hoping someone has run into a similar situation.
Basically my application starts in Nvidia Docker2 and shows the "no CUDA-capable device is detected" error until I add a torch.cuda.is_available() call, after which it magically starts working again.
I've only managed to gather that it's not a race condition in anything I can control, and calling other CUDA runtime commands doesn't have the same effect.
So now I need to add that line somewhere before I run my PyTorch application, and I'm baffled as to why.
edit 1:
torch.cuda.init() also works, so the question is really why torch.cuda.is_available() makes a difference.
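For reference, this is roughly the shape of the workaround, a minimal sketch of an entry point with the extra call at the top:

import torch

# Touch the CUDA runtime once at startup; without this the container
# later reports "no CUDA-capable device is detected".
assert torch.cuda.is_available(), "no CUDA-capable device is detected"
# torch.cuda.init() at this point has the same effect.

# ... rest of the application: build the model, call .cuda(), run training, etc.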
Error
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv2d_1/Conv2D (defined at C:\Users\Rajshree\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_808]
I'm running two programs. Both use this setup and perform the similar task of recognizing human expressions. The only difference lies in the CNN model they use. One works perfectly fine, so what could be the problem with the other?
This could be due to a multitude of things. Are you running the two programs at the same time? Are you running this on a GPU? If so, it could be that one is already using the GPU, and the other finds that the GPU is already in use so it throws an error.
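If the two scripts do share a GPU, one common mitigation is to stop TensorFlow from reserving (almost) all GPU memory up front. A sketch, assuming a TF 2.x-style API (which the tensorflow_core path in your trace suggests is available):

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all in the first
# process, so two programs can coexist on one card. Running out of memory
# while cuDNN initializes commonly surfaces as this "failed to initialize"
# error. This must run before any op touches the GPU.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)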
I'm trying to run a TensorFlow Python script in a Google Cloud VM instance with GPU enabled. I have followed the process for installing the GPU driver, CUDA, cuDNN and TensorFlow. However, whenever I try to run my program (which runs fine on a supercomputing cluster), I keep getting:
undefined symbol: cudnnCreate
I have added the following to my ~/.bashrc:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64"
export CUDA_HOME="/usr/local/cuda-8.0"
export PATH="$PATH:/usr/local/cuda-8.0/bin"
but it still does not work and produces the same error.
Answering my own question: the issue was not that the library was not installed; the installed library was the wrong version, hence it could not be found. In this case it was cuDNN 5.0. However, even after installing the right version it still didn't work due to incompatibilities between the versions of the driver, CUDA and cuDNN. I solved all of these issues by reinstalling everything, including the driver, taking the TensorFlow library requirements into account.
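For anyone hitting this later, a quick sanity check (assuming libcudnn.so is resolvable via LD_LIBRARY_PATH; you may need the full soname such as libcudnn.so.5) is to ask the library itself which version the loader actually picks up:

import ctypes

# Load whatever libcudnn the dynamic loader resolves and print its version.
# A mismatch with the version TensorFlow was built against shows up as
# load-time errors such as "undefined symbol: cudnnCreate".
libcudnn = ctypes.CDLL("libcudnn.so")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", libcudnn.cudnnGetVersion())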
I am trying to run a Keras script on an AWS instance. While the script runs fine on my own computer (Python 2.7, CPU only), it causes an error on AWS. I have installed the latest version of Theano, and other scripts (e.g. the MNIST tutorial) do not give errors. The script that is causing the issue is a standard Keras tutorial script (https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py). The error is copied below (apologies, there might be a better way to capture errors straight from the command line). Any help much appreciated.
First page of error message:
End of error message (I have not copied in the entire stack of keras/layers errors).
Somehow you're passing a symbolic value for the border_mode parameter. If this works fine on CPU but not on GPU then, for some reason, the CPU version of the code supports symbolic border modes but the GPU version does not.
If you can, change the border_mode parameter value to be a Python literal instead of a Theano symbolic variable.
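For example, with the Keras 1.x API that the cifar10_cnn.py example used at the time, the first convolution layer would look something like this, with border_mode given as a plain string rather than anything symbolic:

from keras.models import Sequential
from keras.layers import Convolution2D

# border_mode passed as a Python string literal, not a Theano variable
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=(3, 32, 32)))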