jax woes (on an NVIDIA DGX box, no less)

I am trying to run jax on an nvidia dgx box, but am failing miserably, thus:
>>> import jax
>>> import jax.numpy as jnp
>>> x = jnp.arange(10)
2021-10-25 13:00:05.863667: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2021-10-25 13:00:05.864713: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:435] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted (core dumped)
Any suggestions would be much appreciated.

This means that your CUDA installation is not configured correctly, and can generally be fixed by ensuring that the CUDA toolkit binaries (including ptxas) are present in your $PATH. See https://github.com/google/jax/discussions/6843 and https://github.com/google/jax/issues/7239 for responses to users reporting similar issues.
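A minimal sketch for checking whether ptxas is reachable before importing jax. The function name find_ptxas and the fallback directory /usr/local/cuda/bin are assumptions for illustration; adjust them to wherever your toolkit actually lives.

```python
import os
import shutil


def find_ptxas(extra_dirs=("/usr/local/cuda/bin",)):
    """Return a path to ptxas, or None if it cannot be found."""
    # Look on the current PATH first; this is what XLA's error is complaining about.
    found = shutil.which("ptxas")
    if found:
        return found
    # Fall back to conventional toolkit locations (assumed, not guaranteed).
    for directory in extra_dirs:
        candidate = os.path.join(directory, "ptxas")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_ptxas()
    if path:
        print("ptxas found at", path)
    else:
        print("ptxas not found; add the CUDA toolkit bin directory to PATH")
```

If this only finds ptxas through the fallback directory, the toolkit is installed but its bin directory is not on PATH.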

To fix this you need the NVIDIA driver, CUDA and cuDNN installed correctly. The risky command here is sudo apt install nvidia-cuda-toolkit; avoid it if you have already installed those three.
The way that works for me:
Install the NVIDIA driver: follow this, and pick the proper version. On Ubuntu you can try sudo ubuntu-drivers devices.
Install CUDA: to find which CUDA version works for you, run nvidia-smi; the header at the top of its output shows the compatible CUDA version. Then go to the NVIDIA CUDA archive and follow the instructions there.
At this step you should be able to see a cuda folder when you run ls /usr/local. If you also want to install the headers, the NVIDIA installation guide for CUDA has a useful command.
Install cuDNN, which means copying some files into the /usr/local/cuda directory; the NVIDIA cuDNN guide describes the best way.
As the last step, you need to refer to the CUDA path (/usr/local/cuda if you followed the steps above). For example, if you use Docker you need to mount it like here. Avoid installing nvidia-cuda-toolkit, since it would remove your previous installation; instead you can install it in a conda environment with conda install -c nvidia cuda-nvcc, which doesn't interfere with your CUDA installation.
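One way to "refer to the CUDA path" without editing shell startup files is to put the toolkit's bin directory at the front of PATH inside the Python process, before jax is imported. This is only a sketch: the helper name with_cuda_on_path is made up here, /usr/local/cuda is the location assumed in the steps above, and whether this alone is enough depends on your installation.

```python
import os


def with_cuda_on_path(env, cuda_home="/usr/local/cuda"):
    """Return a copy of env whose PATH starts with <cuda_home>/bin."""
    cuda_bin = os.path.join(cuda_home, "bin")
    new_env = dict(env)
    # Split the existing PATH and drop empty entries.
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep) if p]
    # Only prepend if the directory is not already listed.
    if cuda_bin not in parts:
        parts.insert(0, cuda_bin)
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env


if __name__ == "__main__":
    # Update the current process so that child processes (such as ptxas)
    # are looked up with the new PATH; do this before importing jax.
    os.environ.update(with_cuda_on_path(os.environ))
    print(os.environ["PATH"].split(os.pathsep)[0])
```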

Related

Do I need to install CUDA driver for tensorflow-gpu manually if I install tf through conda

I followed this tutorial and installed tf-gpu using conda (https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/) and it worked, because I am seeing "...gpu:0" in my printed log. I already had the CUDA driver installed before the installation, so I am not sure whether it is needed.
It seems to me that conda install tensorflow-gpu comes with the CUDA toolkit, cuDNN, etc. I was wondering whether installing the CUDA driver is a required step. Another post I found didn't mention the driver either (https://towardsdatascience.com/tensorflow-gpu-installation-made-easy-use-conda-instead-of-pip-52e5249374bc). But the official GPU guide says it's required, so I am confused. I am doing this on Windows 10.

In my experience you do not need to install cuda or cudnn. Just your graphics driver is enough.
But depending on your system it might not be optimized. For that you would need to compile tensorflow from scratch and optimize it for your system.

Depends on the machine you are running on. For example, you can configure a Google Deep Learning VM to install the NVIDIA driver on startup.
If the driver is not installed, then follow the Tensorflow instructions on how to install the NVIDIA driver. Here are the instructions for Linux. Note that you only need to install the driver, and not the toolkit.
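Since both answers come down to "the driver must be there", here is a rough way to check for it from Python. It assumes the driver ships the nvidia-smi tool and that it is on PATH; the function name nvidia_driver_visible is hypothetical.

```python
import shutil
import subprocess


def nvidia_driver_visible(command="nvidia-smi"):
    """Return True if `command` exists on PATH and exits successfully."""
    exe = shutil.which(command)
    if exe is None:
        return False
    try:
        # nvidia-smi talks to the driver; a non-zero exit usually means
        # the driver is missing or not loaded.
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("driver visible:", nvidia_driver_visible())
```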

dlib not using CUDA

I installed dlib using pip. My graphics card supports CUDA, but dlib is not using the GPU when it runs.
I'm working on Ubuntu 18.04.
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0] on linux
>>> import dlib
>>> dlib.DLIB_USE_CUDA
False
I have also installed the NVIDIA CUDA compiler driver, but it is still not working.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
Can anyone help me get it working?

I had similar issues. In my case I was missing the cuDNN library, which prevented dlib from compiling with CUDA support, although I had the CUDA compiler and other drivers installed.
The next part is to download dlib from this repo.
Then run this command to install dlib with CUDA and AVX instructions; you do not need to compile it manually with CMake and a makefile:
python setup.py install --yes USE_AVX_INSTRUCTIONS --yes DLIB_USE_CUDA
The important part now is to read the log and check whether the build can actually find CUDA and cuDNN and can use the CUDA compiler to compile the test project. These are the important lines:
-- Found CUDA: /usr/local/cuda/bin/ (found suitable version "8.0", minimum required is "7.5")
-- Looking for cuDNN install...
-- Found cuDNN: /usr/local/cuda/lib64/libcudnn.so
-- Building a CUDA test project to see if your compiler is compatible with CUDA...
The second problem I was facing was related to CMake versions. The latest version had some known problems with cuda and dlib, so I had to install CMake 3.12.3 in order to make it work.
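If the build log is long, the check described above can be scripted. This is a sketch that only looks for the three markers quoted in this answer; other dlib versions may word these lines differently, and the function name summarize_dlib_build_log is made up for illustration.

```python
def summarize_dlib_build_log(log_text):
    """Report which CUDA-related markers appear in a dlib build log."""
    # Prefixes copied from the log lines quoted in the answer above.
    markers = {
        "cuda_found": "-- Found CUDA:",
        "cudnn_found": "-- Found cuDNN:",
        "test_project_started": "-- Building a CUDA test project",
    }
    lines = [line.strip() for line in log_text.splitlines()]
    return {
        key: any(line.startswith(prefix) for line in lines)
        for key, prefix in markers.items()
    }


if __name__ == "__main__":
    sample = """
    -- Found CUDA: /usr/local/cuda/bin/ (found suitable version "8.0", minimum required is "7.5")
    -- Looking for cuDNN install...
    -- Building a CUDA test project to see if your compiler is compatible with CUDA...
    """
    # In this sample the cuDNN line is missing, which matches the first problem described.
    for key, present in summarize_dlib_build_log(sample).items():
        print(key, present)
```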

On Windows, two different problems can lead to this, plus a third that I am less sure about:
You don't have a CUDA installation or a cuDNN installation.
You installed both libraries but didn't set the environment variables. This is especially true for a conda install of both libraries: conda installs them but doesn't set up environment variables, since the whole point of conda is not to set them globally.
This is something I'm unsure about, but it might be the fix: the name of the environment variable is CUDA_PATH_xxxx, not the CUDA_PATH given in the installation instructions on the NVIDIA website.
Try the third one if the first two corrections didn't work. My CUDA version was 10.1 at the time.

We had the exact same issue where the CUDA drivers were installed properly but the dlib.DLIB_USE_CUDA flag was 'False'.
Installing dlib via 'pip3 install -v dlib' shows that it was picking up a different version of the C++ compiler that is not compatible.
Installing Visual Studio 14 2015 solved this issue for us.
One thing to note is that we got the message that dlib WILL use cuda when we tried to install using the command 'python setup.py install' from the source code, but the dlib.DLIB_USE_CUDA flag was still set to False.

Tensorflow: ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory

I recently installed tensorflow-gpu using pip, but when I import it I get the following error:
ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
I have gone through all the Stack Overflow answers related to this issue, but none of them worked for me.
libcudnn.so.7 is present in both of the following directories: /usr/local/cuda/lib64 and /usr/local/cuda-9.0/lib64.
I have also added the following paths to my .bashrc file:
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64\${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Please help me resolve this.

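You are setting LD_LIBRARY_PATH the wrong way: the backslash before ${LD_LIBRARY_PATH...} in your export lines escapes the $, so the shell does not expand it as intended and the resulting entry is malformed. I would recommend doing it this way (which is more or less the standard):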
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
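To see whether the variable really ends up pointing at the library, something like the following can help. It is a sketch: find_shared_library is a made-up helper, and it only inspects LD_LIBRARY_PATH, whereas the dynamic loader also consults the ldconfig cache, so an empty result here is a hint rather than proof.

```python
import os


def find_shared_library(name, search_path=None):
    """Return every existing <dir>/<name> for the directories on a colon-separated path."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in search_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # A malformed entry (for example one containing a literal "${")
        # simply will not exist, so it never matches.
        if os.path.exists(candidate):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    print(find_shared_library("libcudnn.so.7"))
```

Run it in a fresh shell after sourcing .bashrc; an empty list suggests the export lines are not doing what you expect.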

You might need to download and install NVIDIA cuDNN.
Download it from https://developer.nvidia.com/rdp/cudnn-download
(You have to register an account to download it if you don't have one.) The runtime version is usually more stable than the developer version.

Reinstalling cuDNN 7.0.5 fixed this for me (make sure you pick the right version from the link below).
You'll need to log in to your NVIDIA developer account to access the link. (If you don't have an NVIDIA account, creating one is straightforward.)
https://developer.nvidia.com/rdp/cudnn-archive
Installation instructions for cuDNN:
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
But I then encountered the following error:
Loaded runtime CuDNN library: 7.0.5 but source was compiled with: 7.4.2. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Therefore, I had to download and install the right cuDNN version once again. I used the information from the above error message and installed cuDNN 7.4.2, which fixed all the errors, and everything worked fine.
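The rule stated in that error message can be written down as a small check. This is my reading of the message text only, not code taken from TensorFlow, and the function names are made up.

```python
def parse_version(text):
    """Turn a dotted version such as '7.4.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_is_compatible(loaded, compiled):
    """Apply the rule from the error message to two version strings."""
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    # The major version always has to match.
    if loaded_major != compiled_major:
        return False
    # For cuDNN 7.0 or later, a higher minor version at runtime is accepted.
    if compiled_major >= 7:
        return loaded_minor >= compiled_minor
    # Before 7.0, major and minor both have to match.
    return loaded_minor == compiled_minor


if __name__ == "__main__":
    # The combination from the error above: runtime 7.0.5, compiled against 7.4.2.
    print(cudnn_runtime_is_compatible("7.0.5", "7.4.2"))
```

With the versions from the error above (7.0.5 loaded, 7.4.2 compiled) the check fails because the runtime minor version is lower, which is why installing 7.4.2 resolved it.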
Good Luck!

Add the following paths to your .bashrc file:
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

FWIW, if anyone is interested, I created a shell script which installs different CUDA versions on Debian and which can easily be ported to Ubuntu:

The reason is that some libraries are missing.
Try installing
sudo apt install libcudnn7

Error with tensorFlow

I have a problem with TensorFlow. I'm trying to install it with GPU support on my Manjaro Linux machine with a GTX 1060.
When I try to import TensorFlow in Python with:
import tensorflow as tf
I get this error:
{...} ImportError: libcublas.so.8.0: cannot open shared object file: No such file or directory {...}
I installed tensorflow-gpu with pip: sudo pip install tensorflow-gpu
When I tried to install cuda-8.0 (with pacaur -Syu cuda-8.0), I got an error after a very long loading time. Now when I try to install it, I get this:
Errors occurred, no packages were upgraded
This happens even though cuda-8.0 is not in my pacaur package list and nothing indicates that it is being reinstalled.
I installed Keras with: sudo pip install Keras
I installed cuDNN with: pacaur -Syu cudnn
I installed my NVIDIA driver with (if I remember correctly): pacaur -Syu nvidia

I am not familiar with Manjaro. Assuming you want to install TensorFlow 1.4, the order would be:
Install the latest NVIDIA driver (version 384.xx or higher). Check its status in a terminal with nvidia-smi.
Install CUDA 8.0 without the GPU driver (as you have already installed it in step 1).
Add PATH=/usr/local/cuda-8.0/bin to the environment (in Ubuntu it's /etc/environment).
Add the driver and CUDA paths to LD_LIBRARY_PATH. In Ubuntu, this is done by adding export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/lib/nvidia-384:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} to /etc/bash.bashrc. At this point, you should be able to check the CUDA version with nvcc --version.
Copy the cuDNN files somewhere and add that path to LD_LIBRARY_PATH. cuDNN needs no installation.
Install TensorFlow 1.4.
If you want to install other versions of TensorFlow, you first need to check the supported versions of CUDA and cuDNN.
Hope this helps.
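To make the nvcc --version check in the steps above less error-prone, the release number can be pulled out of the output programmatically. A sketch, assuming the output contains a "release X.Y" fragment as in typical nvcc output; parse_nvcc_release is a hypothetical helper name.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from the text printed by nvcc --version, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc is not on PATH")
    else:
        text = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
        # For the TensorFlow 1.4 setup described above, this should report (8, 0).
        print(parse_nvcc_release(text))
```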

Tensorflow 0.7.1 with Cuda Toolkit 7.5 and cuDNN 7.0

I recently tried to upgrade my TensorFlow installation from 0.6 to 0.7.1 (Ubuntu 15.10, Python 2.7) because it is described as compatible with more up-to-date CUDA libraries. Everything works well, including the simple test from the TensorFlow getting started page. However, I'm not able to use cuDNN. When running a program that uses cuDNN, I first get a warning
"Unable to load cuDNN DSO"
and later the program crashes with
I tensorflow/core/common_runtime/gpu/gpu_device.cc:717] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:73] Allocating 3.30GiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:83] GPU 0 memory begins at 0x704a80000 extends to 0x7d80c8000
F tensorflow/stream_executor/cuda/cuda_dnn.cc:204] could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
The files I downloaded for the Cuda Installation were
cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
and
cudnn-7.0-linux-x64-v4.0-prod.tgz
I followed the instructions on the Tensorflow getting started page with the exception of using cuDNN 7.0 instead of 6.5. $LD_LIBRARY_PATH is
"/usr/local/cuda/lib64"
I have no clue why cudnnCreate is not found. Is there somebody who has successfully installed this configuration and can give me advice?

I got the same error when I forgot to set the LD_LIBRARY_PATH and CUDA_HOME environment variables:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
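After setting the variables, you can ask the dynamic loader directly whether it finds a cuDNN library that exports cudnnCreate, which is the symbol the crash complains about. This is a sketch using ctypes; the helper name library_exports and the library name libcudnn.so are assumptions, and your installation may only provide a versioned file name.

```python
import ctypes


def library_exports(library, symbol):
    """Return True if `library` can be loaded and exports `symbol`."""
    try:
        # Uses the normal loader search (LD_LIBRARY_PATH, ldconfig cache, ...).
        handle = ctypes.CDLL(library)
    except OSError:
        return False
    # ctypes raises AttributeError for a missing symbol, which hasattr maps to False.
    return hasattr(handle, symbol)


if __name__ == "__main__":
    print(library_exports("libcudnn.so", "cudnnCreate"))
```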

I am following these instructions to install TensorFlow on Arch Linux:
https://github.com/ddigiorg/AI-TensorFlow/blob/master/install/install-TF_2016-02-27.md
It seems you need cuDNN v2 or above, which you can get by registering for their Accelerated Computing Developer Program; registration usually takes 2 days:
https://developer.nvidia.com/accelerated-computing-developer
UPDATE: It seems you already have cuDNNv2

The link sent by jorgemf (thank you) describes a Python 3.5 installation and I almost switched to Python 3.5.
My last attempt with my present installation was to again copy the cuDNN libraries to /usr/local/cuda/lib64.
And it worked! So the problem is solved, although I still don't know why I had it.

Fix for Windows 10 users:
Download the cuDNN v5.1 Library for Windows 10 from the CUDA site, registering if necessary.
Copy cudnn64_5.dll (cuda\bin\cudnn64_5.dll) from that zip archive into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\, assuming C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0 is your install path for the CUDA toolkit.

Ubuntu 14.04 && cudnnV5.0 && Cuda7.5
I got the same error and solved it another way.
Following the official get-started page, I installed cuDNN with the commands below, which basically just copy the files into the CUDA directory:
https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#optional-install-cuda-gpus-on-linux
tar xvzf cudnn-7.5-linux-x64-v5.1-ga.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
But after doing this, if we use the ll command to list the files in /usr/local/cuda/lib64 and compare them with the original files
ll
it seems that the soft links were broken by the copy.
So I deleted them and recreated them manually, like this:
sudo rm libcudnn.so.5 libcudnn.so
sudo ln -sf libcudnn.so.5 libcudnn.so
sudo ln -sf libcudnn.so.5.1.3 libcudnn.so.5
after that, execute
sudo ldconfig /usr/local/cuda/lib64
and it finally worked!
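One likely explanation is that a plain cp follows symbolic links, so libcudnn.so and libcudnn.so.5 end up as full copies instead of links. A quick way to inspect the directory from Python is sketched below; classify_library_entries is a made-up helper, and /usr/local/cuda/lib64 is the directory used in this answer.

```python
import os


def classify_library_entries(directory, prefix="libcudnn"):
    """Map each matching name to 'symlink', 'broken symlink' or 'not a symlink'."""
    report = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith(prefix):
            continue
        path = os.path.join(directory, name)
        if os.path.islink(path):
            # os.path.exists follows the link, so False means a dangling link.
            report[name] = "symlink" if os.path.exists(path) else "broken symlink"
        else:
            report[name] = "not a symlink"
    return report


if __name__ == "__main__":
    target = "/usr/local/cuda/lib64"
    if os.path.isdir(target):
        for name, kind in classify_library_entries(target).items():
            print(name, "->", kind)
```

In a healthy install only the fully versioned file (libcudnn.so.5.1.3 here) should show up as "not a symlink".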
