So I am trying to compile TensorFlow from the source (using a clone from their git repo from 2019-01-31). I installed Bazel from their shell script (https://github.com/bazelbuild/bazel/releases/download/0.21.0/bazel-0.21.0-installer-linux-x86_64.sh).
I executed ./configure in the tensorflow code and provided the default settings except for adding my machine specific -m options (-mavx2 -mfma) and pointing python to the correct python3 location (/usr/bin/py3). I then ran the following command as per the tensorflow instructions:
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_framework.so //tensorflow:libtensorflow.so
Now that continues to run and run, I haven't seen it complete yet (though I am limited to letting it run for a maximum of about 10 hours). It produces a ton of INFO: warnings regarding signed and unsigned integers and control reaching the end of non-void functions. None of these appear fatal. Compilation continues to tick with the two numbers continuing to grow ('[N,NNN / X,XXX] 4 actions running') and files ticking through 'Compiling'.
The machine is an EC2 instance with ~16GiB of RAM, CPU is 'Intel(R) Xeon(R) CPU E5-2686 v4 # 2.30GHz' with I believe 4-cores, plenty of HDD space (although compilation seems to eat QUITE a bit, > 1GiB)
Any ideas on what's going on here?
Unfortunately, some programs can take a long time to compile. A couple of hours of compilation is not strange for tensorflow on your setup.
There are reports of it taking 50 minutes on a considerably faster machine
A solution to this problem is to use pre-compiled binaries that are available with pip, instructions can be found here: https://www.tensorflow.org/install/pip.html
Basically you can do this:
pip install tensorflow
If you require a specific older version, like 1.15, you can do this:
pip install tensorflow==1.15
For gpu support you add -gpu to the package name, like this:
pip install tensorflow-gpu
And:
pip install tensorflow-gpu==1.15
Related
Generally for running pip with no cache we use --no-cache-dir, like
pip install pytorch --no-cache-dir.
I downloaded a CNN model I want to use from github.
The first two lines of execution
python generate_dataset.py --is_train=True --use_phase=True --chip_size=100 --patch_size=94 --use_phase=True --dataset=soc
python generate_dataset.py --is_train=False --use_phase=True --chip_size=128 --patch_size=128 --use_phase=True --dataset=soc
executed succesfully. But while running
python train.py --config_name=config/AConvNet-SOC.json
It is giving MemoryError.
The publisher of above repository is using 32GB RAM and 11 GB GPU. But I have 8 GB RAM and 8GB GPU.
Here is what I have done:
I thought of running it without cache. like,
python train.py --config_name=config/AConvNet-SOC.json --no-cache-dir
But it is throwing below error
FATAL Flags parsing error: Unknown command line flag 'no-cache-dir' Pass --helpshort or --helpfull to see help on flags.
I think it is because no-cache-dir argument is not defined in it by using absl.flags. Does python supports using no chache directory implementation
I am able to solve it by decreasing the number of epochs and batch_size. But I want to run it for full epochs.
Using zeo_grad() of Pytorch makes the gradients zero for every minibatch, so that GPU won't run out of memory. But it is already used in the code I am running in _base.py. Is there anyway I can leverage more of this.
How to resolve this.
I try to run Tensorflow with GPU support (GTX 1660 SUPER).
I created an enviroment using anaconda, than installed cudatoolkit (version 11.0.221) and tensorflow-gpu (version 2.4.1). Afterwards, I downloaded cuDNN (version 8.0.4), and copied all files from cuDNN's bin folder to my environment's bin folder at anaconda3\envs\<env name>\Library\bin.
In my script, I've set the memory limit to my GPU's memory using tf.config.experimental.set_memory_growth.
When I run the script (which uses convolutional algorithms), I get a warning that says Couldn't invoke ptxas.exe --version which comes after an Call to CreateProcess failed. Error code: 2 error.
After the launch failure, I get: Relying on driver to perform ptx compilation. Modify $PATH to customize ptxas location.
I've already tried switching to cuDNN version 8.1.1.
How I fix this?
I got a new fix for this.
First I tried using tensorflow=2.3, cudnn=7.6.5 and cudatoolkit=10.1 as mentioned in previous answers. However, every time I put a model to train, the process was going stale and the training seemed to be stuck in epoch 1.
I then managed to include ptxas in my conda environment by running conda install -c nvidia cuda-nvcc The packages I am using are:
tensorflow=2.9, cudnn=8.1.0, cudatoolkit=11.2.2, cuda-nvcc=11.7.99 and python=3.9
I am running everything on windows 10 flawlessly now.
For the benefit of community adding #Zuk Levinson comment
Solves the issue by using
tensorflow=2.3, cudnn=7.6.5 and cudatoolkit=10.1
I have two Windows computers, one with and one without a GPU.
I would like to deploy the same python script on both (TensorFlow 1.8 Object Detection), without changing the packaged version of TensorFlow. In other words, I want to run tensorflow-gpu on a CPU.
In the event where my script cannot detect nvcuda.dll, I've tried using a Session config to disable the GPU, like so:
config = tf.ConfigProto(
device_count = {'GPU': 0}
)
and:
with tf.Session(graph=detection_graph, config=config) as sess:
However, this is not sufficient, as TensorFlow still returns the error:
ImportError: Could not find 'nvcuda.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable.
Typically it is installed in 'C:\Windows\System32'. If it is not present, ensure that you have a CUDA-capable GPU with the correct driver installed.
Is there any way to disable checking for a GPU/CUDA entirely and default to CPU?
EDIT: I have read the year-old answer regarding tensorflow-gpu==1.0 on Linux posted here, which suggests this is impossible. I'm interested to know if this is still how tensorflow-gpu is compiled, 9 versions later.
In development, I have been using the gpu-accelerated tensorflow
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl
I am attempting to deploy my trained model along with an application binary for my users. I compile using PyInstaller (3.3.dev0+f0df2d2bb) on python 3.5.2 to create my application into a binary for my users.
For deployment, I install the cpu version, https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp35-cp35m-linux_x86_64.whl
However, upon successful compilation, I run my program and receive the infamous tensorflow cuda error:
tensorflow.python.framework.errors_impl.NotFoundError:
tensorflow/contrib/util/tensorflow/contrib/cudnn_rnn/python/ops/_cudnn_rnn_ops.so:
cannot open shared object file: No such file or directory
why is it looking for cuda when I've only got the cpu version installed? (Let alone the fact that I'm still on my development machine with cuda, so it should find it anyway. I can use tensorflow-gpu/cuda fine in uncompiled scripts. But this is irrelevant because deployment machines won't have cuda)
My first thought was that somehow I'm importing the wrong tensorflow, but I've not only used pip uninstall tensorflow-gpu but then I also went to delete the tensorflow-gpu in /usr/local/lib/python3.5/dist-packages/
Any ideas what could be happening? Maybe I need to start using a virtual-env..
I'm trying to run a tensorflow python script in a google cloud vm instance with GPU enabled. I have followed the process for installing GPU drivers, cuda, cudnn and tensorflow. However whenever I try to run my program (which runs fine in a super computing cluster) I keep getting:
undefined symbol: cudnnCreate
I have added the next to my ~/.bashrc
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/usr/local/cuda-8.0/lib64"
export CUDA_HOME="/usr/local/cuda-8.0"
export PATH="$PATH:/usr/local/cuda-8.0/bin"
but still it does not work and produces the same error
Answering my own question: The issue was not that the library was not installed, the library installed was the wrong version hence it could not find it. In this case it was cudnn 5.0. However even after installing the right version it still didn't work due to incompatibilities between versions of driver, CUDA and cudnn. I solved all this issues by re-installing everything including the driver taking into account tensorflow libraries requisites.