How to debug Tensorflow segmentation fault in model.fit()? - python

I am trying to run the Keras MNIST example using tensorflow-gpu with a GeForce RTX 2080. My environment is Anaconda on a Linux system.
I am running the unmodified example from a command line python session. I get the following output:
Using TensorFlow backend.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
conv2d_1/random_uniform/RandomUniform: (RandomUniform):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/sub: (Sub):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/mul: (Mul):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform: (Add):
/job:localhost/replica:0/task:0/device:GPU:0
[...]
The last lines I receive are:
training/Adadelta/Const_31: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_46/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_47/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Segmentation fault (core dumped)
From reading around I assumed this might be a memory problem and added these lines to prevent the GPU from running out of memory:
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # cap GPU memory usage at 30%
K.tensorflow_backend.set_session(tf.Session(config=config))
Checking with the nvidia-smi tool (watch -n1 nvidia-smi), I can confirm that the GPU is actually used (in this run no per_process_gpu_memory_fraction limit was set).
I suspect a version incompatibility somewhere between CUDA, Keras and TensorFlow to be the issue, but I don't know how to debug this.
What debugging measures are available to get to the bottom of this? What other issues might be the reason for this segfault?
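One generic measure is Python's built-in faulthandler module, which dumps the Python-level traceback when the process receives SIGSEGV (this is a standard tool, not something specific to this setup; the script name below is just the Keras example's):

import faulthandler
faulthandler.enable()  # print the Python traceback if the process segfaults

# ... rest of the Keras MNIST example follows, including model.fit(...)

Alternatively, the script can be started with python -X faulthandler mnist_cnn.py.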
EDIT: I experimented further and replacing the model with this code works fine:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
However once I introduce a convolution layer like so
model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
    # keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
then I again get the aforementioned segfault.
All packages have been installed through Anaconda. I have installed:
conda 4.5.11
python 3.6.6
keras-gpu 2.2.4
tensorflow 1.12.0
tensorflow-gpu 1.12.0
cudnn 7.2.1
cudatoolkit 9.2
EDIT: I tried the same code in a non-Anaconda environment and it works flawlessly. I would prefer to use Anaconda, though, to avoid system updates breaking things.

Building TensorFlow from source (r1.13) fixed the Conv2D segmentation fault.
Follow the Build from Source guide.
my GPU : RTX 2070
Ubuntu 16.04
Python 3.5.2
Nvidia Driver 410.78
CUDA - 10.0.130
cuDNN-10.0 - 7.4.2.24
TensorRT-5.0.0
Compute Capability: 7.5
Build : tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64
Download prebuilt from https://github.com/tensorflow/tensorflow/issues/22706

I had the exact same problem on a very similar system to Francois's, but using an RTX 2070, on which I could reliably reproduce the segmentation fault when conv2d was executed on the GPU. My setup:
Ubuntu: 18.04
GPU: RTX 2070
CUDA: 10
cudnn: 7
conda with python 3.6
I finally solved it by building tensorflow from source into a new conda environment. For a fantastic guide see e.g. the following link:
https://gist.github.com/Brainiarc7/6d6c3f23ea057775b72c52817759b25c
This is basically like any other build-tensorflow-from-source guide and consisted in my case of the following steps:
installing Bazel
cloning tensorflow from git and running ./configure
running the appropriate bazel build command (see link for details)
Some minor issues came up during the build, one of which was solved by installing 3 packages manually, using:
pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0
which I found out using this answer here:
Error Compiling Tensorflow From Source - No module named 'keras_applications'
conv2d now works like a charm when using the gpu!
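A quick, generic way to sanity-check that Conv2D really runs on the GPU (a TF 1.x style sketch, not part of the original build steps):

import numpy as np
import tensorflow as tf

# Build a single Conv2D op and log where it gets placed
x = tf.constant(np.random.rand(1, 28, 28, 1), dtype=tf.float32)
y = tf.layers.conv2d(x, filters=32, kernel_size=3)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y).shape)  # expect (1, 26, 26, 32)

If the placement log shows the Conv2D ops on /device:GPU:0 and no segfault occurs, the build is fine.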
However, since all this took a fairly long time (building from source takes over an hour, not counting the search for the solution on the internet), I recommend making a backup of the system after you get it working, e.g. using Timeshift or any other program that you like.

I had the same Conv2D problem with:
Ubuntu 18.04
Graphic card: GeForce RTX 2080
CUDA: cuda_10.0.130_410
CUDNN: cudnn-10.0-linux-x64-v7.4.2
conda with Python 3.6
The best advice was from this link: https://github.com/tensorflow/tensorflow/issues/24383
So a fix should come with TensorFlow 1.13.
In the meantime, using the TensorFlow 1.13 nightly build (Dec 26, 2018) and using tensorflow.keras instead of keras solved the issue.
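A minimal sketch of that import switch (the layer stack below is only illustrative):

# Use the Keras bundled with TensorFlow instead of the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])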

Related

How to make tensorflow use the GPU?

I'm working with Python and I would like to use TensorFlow with my GTX 2080 Ti, but TensorFlow is using only the CPU.
When I ask for the devices on my computer, it always returns an empty list:
In [3]: tf.config.list_physical_devices('GPU')
Out[3]: []
I tried this post: How do I use TensorFlow GPU?
but I don't use CUDA and the tensorflow-gpu package seems outdated.
I also tried this well-done tutorial https://www.youtube.com/watch?v=hHWkvEcDBO0 without success.
I installed the card drivers, CUDA and cuDNN but I still get the same issue.
I also uninstalled tensorflow and keras and installed them again without success.
I don't know how to find what is missing or whether I did something wrong.
Python 3.10
TensorFlow version: 2.11.0
Cuda version: 11.2
cudNN: 8.1
This code tells me that CUDA support is not built in:
from tensorflow.python.platform import build_info as tf_build_info
print(tf_build_info.build_info)
OrderedDict([('is_cuda_build', False), ('is_rocm_build', False), ('is_tensorrt_build', False), ('msvcp_dll_names', 'msvcp140.dll,msvcp140_1.dll')])
Starting with TensorFlow 2.11, GPU support was dropped for native Windows. You can optionally use the DirectML plugin.
Tutorial here

TensorFlow warning: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA [duplicate]

I have recently installed tensorflow (Windows CPU version) and received the following message:
Successfully installed tensorflow-1.4.0 tensorflow-tensorboard-0.4.0rc2
Then when I tried to run
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
sess.run(hello)
'Hello, TensorFlow!'
a = tf.constant(10)
b = tf.constant(32)
sess.run(a + b)
42
sess.close()
(which I found through https://github.com/tensorflow/tensorflow)
I received the following message:
2017-11-02 01:56:21.698935: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
But when I ran
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
it ran as it should and output Hello, TensorFlow!, which indicates that the installation was indeed successful, but there is something else that is wrong.
Do you know what the problem is and how to fix it?
What is this warning about?
Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. From Wikipedia:
Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.
In particular, AVX introduces fused multiply-accumulate (FMA) operations, which speed up linear algebra computation, namely dot-product, matrix multiply, convolution, etc. Almost every machine-learning training involves a great deal of these operations, hence will be faster on a CPU that supports AVX and FMA (up to 300%). The warning states that your CPU does support AVX (hooray!).
I'd like to stress here: it's all about CPU only.
Why isn't it used then?
Because the TensorFlow default distribution is built without CPU extensions, such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (the ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions the CPU is a lot slower than a GPU, and medium- and large-scale machine-learning training is expected to be performed on a GPU.
What should you do?
If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by
# Just disables the warning, doesn't take advantage of AVX/FMA to run faster
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
... or by setting export TF_CPP_MIN_LOG_LEVEL=2 if you're on Unix. Tensorflow is working fine anyway, but you won't see these annoying warnings.
If you don't have a GPU and want to utilize CPU as much as possible, you should build tensorflow from the source optimized for your CPU with AVX, AVX2, and FMA enabled if your CPU supports them. It's been discussed in this question and also this GitHub issue. Tensorflow uses an ad-hoc build system called bazel and building it is not that trivial, but is certainly doable. After this, not only will the warning disappear, tensorflow performance should also improve.
Update the tensorflow binary for your CPU & OS using this command
pip install --ignore-installed --upgrade "Download URL"
The download url of the whl file can be found here
https://github.com/lakshayg/tensorflow-build
CPU optimization with GPU
There are performance gains you can get by installing TensorFlow from the source even if you have a GPU and use it for training and inference. The reason is that some TF operations only have CPU implementation and cannot run on your GPU.
Also, there are some performance enhancement tips that make good use of your CPU. TensorFlow's performance guide recommends the following:
Placing input pipeline operations on the CPU can significantly improve performance. Utilizing the CPU for the input pipeline frees the GPU to focus on training.
For best performance, you should write your code to utilize your CPU and GPU to work in tandem, and not dump it all on your GPU if you have one.
Having your TensorFlow binaries optimized for your CPU could pay off hours of saved running time and you have to do it once.
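A rough sketch of keeping the input pipeline on the CPU with tf.data (the arrays are placeholders just to make the sketch self-contained):

import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real training set
features = np.random.rand(1000, 28, 28, 1).astype("float32")
labels = np.random.randint(0, 10, size=1000)

# Pin the input pipeline to the CPU so the GPU stays free for training
with tf.device('/cpu:0'):
    dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
               .shuffle(1024)
               .batch(32)
               .prefetch(1))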
For Windows, you can check the official Intel MKL optimization for TensorFlow wheels that are compiled with AVX2. This solution speeds up my inference ~x3.
conda install tensorflow-mkl
For Windows (thanks to the owner fo40225), go here: https://github.com/fo40225/tensorflow-windows-wheel to fetch the URL for your environment based on the combination of "tf + python + cpu_instruction_extension". Then use this cmd to install:
pip install --ignore-installed --upgrade "URL"
If you encounter the "File is not a zip file" error, download the .whl to your local computer, and use this cmd to install:
pip install --ignore-installed --upgrade /path/target.whl
If you use the pip version of TensorFlow, it means it's already compiled and you are just installing it. Basically you install TensorFlow-GPU, but if you download it from the repository and try to build it yourself, you should build it with CPU AVX support. If you ignore this, you will get the warning every time you run on the CPU. You can take a look at these too:
Proper way to compile Tensorflow with SSE4.2 and AVX
What is AVX Cpu support in tensorflow
The easiest way that I found to fix this is to uninstall everything then install a specific version of tensorflow-gpu:
uninstall tensorflow:
pip uninstall tensorflow
uninstall tensorflow-gpu: (make sure to run this even if you are not sure if you installed it)
pip uninstall tensorflow-gpu
Install specific tensorflow-gpu version:
pip install tensorflow-gpu==2.0.0
pip install tensorflow_hub
pip install tensorflow_datasets
You can check if this worked by adding the following code into a python file:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub Version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
Run the file and then the output should be something like this:
Version: 2.0.0
Eager mode: True
Hub Version: 0.7.0
GPU is available
Hope this helps
What worked for me, though, is this library: https://pypi.org/project/silence-tensorflow/
Install this library and do as instructed on the page; it works like a charm!
Try using Anaconda. I had the same error. The only other option was to build TensorFlow from source, which takes a long time. I tried using conda and it worked.
Create a new environment in Anaconda, then:
conda install -c conda-forge tensorflow
Then, it worked.
He once provided the list (it was deleted by someone), but see the answer:
Download packages list
Output:
F:\temp\Python>python test_tf_logics_.py
[0, 0, 26, 12, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]
[ 0 0 0 26 12 0 0 0 2 0 0 0 0 0 0 0 0]
[ 0 0 26 12 0 0 0 2 0 0 0 0 0 0 0 0 0]
2022-03-23 15:47:05.516025: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-23 15:47:06.161476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
[0 0 2 0 0 0 0 7 0 0 0 0 0 0 0 0 0]
...
As the message says, your CPU supports instructions that the TensorFlow binary was not compiled to use. This should not be an issue with the CPU version of TensorFlow, since it does not perform AVX (Advanced Vector Extensions) instructions.
However, it seems that TensorFlow uses AVX instructions in some parts of the code, and the message is just a warning; you can safely ignore it.
You can compile your own version of TensorFlow with AVX instructions.

CPU usage is more than GPU usage on Keras, Tensorflow

I wanted to train an image classification CNN and use Keras for it.
The image dimensions are 300x300x3.
I have trained a CNN with 2M parameters; I used Keras's MobileNet for transfer learning, froze the last 63 layers, and added dense layers at the bottom; the last layer has 2 units and softmax activation.
To make predictions, I load the h5 file and use OpenCV video capture to get video frames; for each frame I use model.predict(img_array).
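A rough reconstruction of the setup described above (the frozen-layer handling and the head sizes are illustrative, taken only from this description):

from tensorflow.keras.applications import MobileNet
from tensorflow.keras import layers, models

# MobileNet base for transfer learning with a small dense head ending in a 2-unit softmax
base = MobileNet(input_shape=(300, 300, 3), include_top=False, weights='imagenet')
for layer in base.layers[-63:]:   # "freeze last 63 layers", as described
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),   # illustrative head size
    layers.Dense(2, activation='softmax'),
])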
When I look at the Task Manager of Windows 10, I see that the Python script uses 80% of my processor but only 2% of the GPU. This CPU usage causes lags on my laptop.
How can I reduce the CPU usage and force Keras to do its computations on the GPU?
I have an Nvidia RTX 2060 4GB and an Intel Core i7-9750H in my laptop.
Tensorflow 2.1 and Keras 2.3.1
OpenCV 4.1
I have tried the following, but actually nothing changes:
tf.config.threading.set_inter_op_parallelism_threads(12)
tf.config.threading.set_intra_op_parallelism_threads(12)

with tf.device('/gpu:0'):
    model.predict(img_array)
Best regards.
Edit:
I reduced the CPU usage to 20% by declaring the steps parameter in the predict method.
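A hedged sketch of that change (the model path and frame shape are placeholders):

import numpy as np
from tensorflow.keras.models import load_model

model = load_model('model.h5')              # placeholder path to the trained model
img_array = np.random.rand(1, 300, 300, 3)  # stand-in for a preprocessed video frame
preds = model.predict(img_array, steps=1)   # limit predict to a single step per call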
Please check your pip list or conda list.
Sometimes, we mistakenly install both tensorflow and tensorflow-gpu.
If you have both, the system will automatically go for tensorflow, which is the CPU one.
If that is the case, DELETE "tensorflow", keeping only "tensorflow-gpu".
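A quick, generic sanity check of which build is active (standard TF test utilities, nothing specific to this answer):

import tensorflow as tf

print(tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())  # deprecated in newer TF, but fine for 1.x / early 2.x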
If you do not see tensorflow-gpu in the first place, try installing it on conda using the following commands:
conda create -n [EnvironmentName] python=3.6
conda activate [EnvironmentName]
conda install -c conda-forge tensorflow-gpu==1.14
It will assess which versions (CUDA, cuDNN, etc.) you require and download and install them directly into your environment. Then run your Python file from this environment. Good luck ^_^

pixel-cnn (tensorflow-gpu) not recognising GPU

I'm trying to run the pixel-cnn neural network available on github. Following the instructions in README.md I run the following code in cmd:
train.py -i ./data_dir/ -o ./save_dir -g 1
I'm using one GPU and created the two folders ./data_dir and ./save_dir within the same directory as train.py for loading and saving the data. When doing so I get the following error message:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation model_1/ones: node model_1/ones (defined at \OneDrive - MNG\Matura Arbeit\Projects\pixel-cnn-master\pixel_cnn_pp\model.py:36) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device. The requested device appears to be a GPU, but CUDA is not enabled.
It seems that TensorFlow doesn't recognise the GPU, but when checking the devices available to TensorFlow (as described here) both my CPU and GPU show up as "/device:CPU:0" and "/device:GPU:0". Also, when running other programs with tensorflow-gpu they work perfectly fine.
(Edited:) I am using an Anaconda environment (Windows 10) with tensorflow-gpu==1.14.0. The GPU I'm using is a GTX 1050 Ti with Max-Q Design and driver version 436.30. As for CUDA, I'm pretty sure I have installed version 10.0, as shown by nvcc --version, although when running nvidia-smi it shows that CUDA version 10.1 is installed.
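For reference, the device check referred to above is typically along these lines (a standard snippet, not copied from the original post):

from tensorflow.python.client import device_lib

# Lists every device TensorFlow can see; here both /device:CPU:0 and /device:GPU:0 showed up
print(device_lib.list_local_devices())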

Can I run Keras model on gpu?

I'm running a Keras model with a submission deadline of 36 hours. If I train my model on the CPU it will take approx. 50 hours; is there a way to run Keras on the GPU?
I'm using the TensorFlow backend and running it in my Jupyter notebook, without Anaconda installed.
Yes, you can run Keras models on the GPU. A few things you will have to check first:
your system has a GPU (Nvidia, as AMD doesn't work yet)
you have installed the GPU version of tensorflow
you have installed CUDA (installation instructions)
verify that tensorflow is running with the GPU (check if the GPU is working):
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
for TF > v2.0
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
(Thanks to @nbro and @Ferro for pointing this out in the comments.)
OR
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
output will be something like this:
[
  name: "/cpu:0" device_type: "CPU",
  name: "/gpu:0" device_type: "GPU"
]
Once all this is done, your model will run on the GPU.
To check if keras (>=2.1.1) is using the GPU:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
All the best.
2.0 Compatible Answer: While the above-mentioned answer explains in detail how to use the GPU with a Keras model, I want to explain how it can be done for TensorFlow version 2.0.
To know how many GPUs are available, we can use the below code:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
To find out which devices your operations and tensors are assigned to,
put tf.debugging.set_log_device_placement(True) as the first statement of your program.
Enabling device placement logging causes any Tensor allocations or operations to be printed. For example, running the below code:
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
gives the output shown below:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
For more information, refer to this link.
Sure. I suppose that you have already installed TensorFlow for GPU.
You need to add the following block after importing keras. I am working on a machine which has a 56-core CPU and a GPU.
import keras
import tensorflow as tf
config = tf.ConfigProto( device_count = {'GPU': 1 , 'CPU': 56} )
sess = tf.Session(config=config)
keras.backend.set_session(sess)
Of course, this usage enforces my machine's maximum limits. You can decrease the CPU and GPU consumption values.
Of course. If you are running on the TensorFlow or CNTK backends, your code will run on your GPU devices by default. But with the Theano backend, you can use the following
Theano flags:
"THEANO_FLAGS=device=gpu,floatX=float32 python my_keras_script.py"
I'm using Anaconda on Windows 10, with a GTX 1660 Super. I first installed the CUDA environment following this step-by-step guide. However there is now a keras-gpu metapackage available on Anaconda which apparently doesn't require installing the CUDA and cuDNN libraries beforehand (mine were already installed anyway).
This is what worked for me to create a dedicated environment named keras_gpu:
# need to downgrade from tensorflow 2.1 for my particular setup
conda create --name keras_gpu keras-gpu=2.3.1 tensorflow-gpu=2.0
To add to @johncasey's answer, but for TensorFlow 2.0, adding this block works for me:
import tensorflow as tf
from tensorflow.python.keras import backend as K
# adjust values to your needs
config = tf.compat.v1.ConfigProto( device_count = {'GPU': 1 , 'CPU': 8} )
sess = tf.compat.v1.Session(config=config)
K.set_session(sess)
This post solved the set_session error I got: you need to use the keras backend from the tensorflow path instead of keras itself.
Using TensorFlow 2.5, building on @MonkeyBack's answer:
conda create --name keras_gpu keras-gpu tensorflow-gpu
# should show GPU is available
python -c "import tensorflow as tf;print('GPUs Available:', tf.config.list_physical_devices('GPU'))"
See if your script is running on the GPU in Task Manager. If not, suspect that your CUDA version is not the right one for the tensorflow version you are using, as the other answers already suggested.
Additionally, a proper cuDNN library for the CUDA version is required to run the GPU with tensorflow. Download/extract it from here and put the DLL (e.g., cudnn64_7.dll) into the CUDA bin folder (e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin).
