TensorFlow GPU: No Performance increase in HelloWorld code - python

Background:
I am a Python Developer new to TensorFlow.
System Spec:
i5-7200U CPU @ 2.50GHz × 4
GeForce 940MX 4GB
Ubuntu 18
I am running TensorFlow in Docker (installing the CUDA stack directly seemed too complicated and time-consuming, and I may have messed something up).
Basically I am running a kind of HelloWorld program on both GPU and CPU to see what difference it makes, and to my surprise there is hardly any!
docker-compose.yml
version: '2.3'
services:
  tensorflow:
    # image: tensorflow/tensorflow:latest-gpu-py3
    image: tensorflow/tensorflow:latest-py3
    runtime: nvidia
    volumes:
      - ./:/notebooks/TensorTest1
    ports:
      - 8888:8888
When I run with image: tensorflow/tensorflow:latest-py3 I get approx 5 seconds.
root@e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result: [3. 3. 3. ... 3. 3. 3.]
When I run with image: tensorflow/tensorflow:latest-gpu-py3 I again get approx 5 seconds.
root@baf68fc71921:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result: [3. 3. 3. ... 3. 3. 3.]
My code (hello1.py)
import tensorflow as tf
import time

with tf.Session():
    start_time = time.time()
    input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
    input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
    output = tf.add(input1, input2)
    result = output.eval()
    duration = time.time() - start_time
    print("TIME:", duration)
    print("result: ", result)
Am I doing something wrong here? Based on the printed logs it seems to be using the GPU correctly.
I also followed the steps at Can I measure the execution time of individual operations with TensorFlow? and got this:
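(For reference, the tracing approach from that question is roughly the following; a minimal sketch of the TF 1.x RunMetadata/Timeline API, not my exact script, and the output file name is arbitrary.)
from tensorflow.python.client import timeline
import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    out = tf.add(tf.constant([1.0, 1.0]), tf.constant([2.0, 2.0]))
    sess.run(out, options=run_options, run_metadata=run_metadata)
    # Write a chrome://tracing-compatible trace of per-op execution times
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())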

A GPU is an "external" processor; there is overhead involved in compiling a program for it, running it, sending it data, and retrieving the results. GPUs also have different performance trade-offs from CPUs. While GPUs are frequently faster for large and complex number-crunching tasks, your "hello world" is too simple. It doesn't do very much with each data item between loading it and saving it (just a pairwise addition), and it doesn't do very much at all: a few million additions is nothing. That makes any setup/teardown overhead relatively more noticeable. So while the GPU is slower for this program, it is still likely to be faster for more useful programs.
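To see the GPU pull ahead, the amount of arithmetic per byte transferred has to go up. As a rough illustration (a sketch, not a benchmark run on your hardware; the matrix size and repeat count are arbitrary choices), repeated large matrix multiplications should show a clear difference between the CPU and GPU images:
import time
import tensorflow as tf

with tf.Session() as sess:
    a = tf.random_normal([4000, 4000])
    b = tf.random_normal([4000, 4000])
    c = tf.matmul(a, b)
    sess.run(c)  # warm-up run, so one-time setup cost is not timed
    start = time.time()
    for _ in range(10):
        sess.run(c)
    print("TIME:", time.time() - start)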

Related

Tensorflow 2.0 GPU Internal memory out of error

I am trying to run my Python script using a remote server's GPU that is shared by other users.
The script throws an out-of-memory error even before it reaches the model training section.
The server has 3 GPUs; however, I am only using a single GPU that is not being used by other processes, so I have set CUDA_VISIBLE_DEVICES to "0", the GPU not in use.
This is the error that I am getting:
2020-04-15 15:22:01.870082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-15 15:22:01.870161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-15 15:22:02.748227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 15:22:02.748273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-15 15:22:02.748283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-15 15:22:02.749326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 58 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 15:22:02.768792: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op RandomUniform in device /job:localhost/replica:0/task:0/device:GPU:0
2020-04-15 15:22:03.335483: F ./tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory
Aborted (core dumped)
I don't understand why I am running out of memory. There is at least 16 GB of free memory on this GPU and the model has not even started training yet.
I appreciate everyone's help.
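For reference, CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes the CUDA runtime. A minimal sketch of the usual pattern (illustrative only, not necessarily how the actual script sets it; "0" is the physical index as reported by nvidia-smi):
import os
# Must be set before TensorFlow initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # import only after the variable is set
print(tf.config.experimental.list_physical_devices("GPU"))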

Keras (tensorflow) finds GPU, but only runs on cpu w/ Cuda 10.1

Many questions have already been posted about this, but none of them really answers mine, or there is a small difference from what I came across.
I'm on Ubuntu 18.04 and installed Keras following the default instructions, with CUDA 10.1 and tensorflow-gpu.
When running something, TensorFlow detects that I have a GPU, but when I check CPU vs GPU usage it still only seems to run on the CPU. I came across this thread and ran that script. It confirms what I was guessing, that it can't use my GPU for some reason:
2019-09-19 21:05:57.730197: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-19 21:05:57.730247: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-09-19 21:05:57.730281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-19 21:05:57.730303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-19 21:05:57.730317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-19 21:05:57.922335: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
When listing the devices it says:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 57580461479478464
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6376288845656491190
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 17409275481256463364
physical_device_desc: "device: XLA_CPU device"
]
But halfway through the logs, TensorFlow outputs this:
2019-09-19 20:44:32.676537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 860M major: 5 minor: 0 memoryClockRate(GHz): 1.0195
pciBusID: 0000:01:00.0
./deviceQuery outputs this:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2004 MBytes (2101870592 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
Does anyone know why TensorFlow can't use my GPU, or how to make it available?
Thanks in advance!
It's because of CUDA 10.1. You need to downgrade to CUDA 10.0.
Here's a similar solution.
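Once you are on CUDA 10.0, a quick sanity check (a sketch using the TF 1.x API that matches your logs) is to confirm that a real GPU device is registered, not just the XLA_GPU placeholder:
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())  # True once libcudart/libcudnn load correctly
print([d.name for d in device_lib.list_local_devices()])  # expect '/device:GPU:0', not only XLA_GPU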

Keras seems to hang after call to fit_generator

I am trying to fit the Keras implementation of the SqueezeDet model to a new dataset. After making the appropriate changes to my config file, I tried to run the train script, but it seems to hang after the call to fit_generator(), as I get the following output:
/anaconda/envs/py35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Number of images: 536
Number of epochs: 100
Number of batches: 53
Batch size: 10
2018-07-04 14:18:49.711606: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-04 14:18:54.080912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 52a9:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-04 14:18:54.080958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 14:18:54.333214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 14:18:54.333270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 14:18:54.333290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 14:18:54.333559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 52a9:00:00.0, compute capability: 3.7)
Learning rate: 0.01
Weights initialized by name from ../main/model/imagenet.h5
Using single GPU
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Epoch 1/100
And then nothing happens, even if I leave it alone for a day. The call that it seems to freeze on is:
squeeze.model.fit_generator(train_generator, epochs=EPOCHS, verbose=1,
                            steps_per_epoch=nbatches_train, callbacks=cb)
Where the parameters are:
train_generator = generator_from_data_path(img_names, gt_names, config=cfg)
EPOCHS = 100
nbatches_train = 53
callbacks = [# TensorBoard object, ReduceLROnPlateau object, ModelCheckpoint object #]
My versions:
Python 3.5.4 :: Anaconda custom (64-bit)
tensorflow-gpu : 1.8.0
tensorflow : 1.8.0
Keras : 2.2.0
Converting the conversation in the comments into an answer.
The culprit was train_generator.
I looked into the sources of model.fit_generator in Keras some time ago. It just retrieves some data from the generator and submits it to the backend, nothing magical :)
So my hypothesis was that it cannot retrieve data from the generator because the generator does not generate anything.
@Barker has confirmed it, stating that a call to next(train_generator) hangs.
I personally have moved to keras.utils.Sequence, which supports indexing and length and is much more convenient than ordinary generators (see the sketch below), though this note is not related to the current problem.
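For illustration, a minimal Sequence looks roughly like this (a sketch assuming hypothetical in-memory arrays x_set/y_set, using the standalone Keras import from your version list):
import math
import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(math.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one batch by index, instead of relying on next() on an open-ended generator
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return np.asarray(self.x[batch]), np.asarray(self.y[batch])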

TensorFlow not running on GPU(keras with TF backend)

I have followed the steps for installing TensorFlow with GPU support and have made sure that the machine I'm using has a compatible GPU, but it still seems that TensorFlow isn't running properly on my machine. I have a program that trains a Keras sequential model (with Python 2.7) on a large amount of data using a TensorFlow backend, and the output while training is the following:
2018-04-17 00:35:13.837040: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:35:14.042784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:35:14.043143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-17 00:35:14.043186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-17 00:35:16.374355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:35:16.374397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-17 00:35:16.374405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-17 00:35:16.380956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
I don't really understand what these logs mean; however, I ran this job simultaneously on a device that has only a CPU, and the time it took to complete the training jobs was identical. Can anyone tell me how to make my training job run on a GPU? Thanks in advance!
You might consider explicitly specifying the GPU your program should run on, which takes only a small piece of code.
import tensorflow as tf

# Device 0 is the only GPU shown in the logs above
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)
If not, I recommend using Anaconda 3 to create a TensorFlow-GPU virtual environment, which generally defaults to the GPU version.

OOM when allocating tensor with shape[1,144,144,144,128]

The computer I use has a GTX 1080 with 8 GB of GPU memory, and my computer has 32 GB of RAM, but the array data might be too large for me to store; the computer tells me the resource is exhausted. Is there any way to solve this problem, or to estimate the GPU memory I would need for such a large numpy array, so I can buy a better computer for the calculation? By the way, the batch_size I use is 1, so I have already reduced the memory use to the minimum. Alternatively, I could reduce the rows, columns, and height of my numpy array; I think that would affect the resolution of my results, which would be okay.
If anyone can answer my question, thanks.
The tensor you are using is big, but not that big for an 8 GB GPU. 144 * 144 * 144 * 128 is ~380 million elements, so even with 32-bit items it requires about 1.5 GB. I have a GeForce GTX 1070 with 8 GB (the same size as yours), and here is my TensorFlow experiment:
import numpy as np
import tensorflow as tf

X = tf.placeholder(dtype=tf.int32, shape=(1, 144, 144, 144, 128))
init = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(init)
    value = session.run([X], feed_dict={X: np.zeros(shape=(1, 144, 144, 144, 128))})
    print np.array(value).shape
The output:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7465
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 4.14GiB
2017-08-17 20:05:54.312424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-08-17 20:05:54.312430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-08-17 20:05:54.312444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
(1, 1, 144, 144, 144, 128)
Note that free memory is much lower than 8 GB because I use two UHD monitors. So this might be the first cause in your case: other processes can consume a lot of GPU resources.
Next, you didn't provide your neural network architecture, but if you are using, for instance, a deep convolutional neural network, note that the first layers consume a lot of memory for parameters and gradients. You might want to read this helpful page for details. If this is the case, you might need to plug in another GPU and split the graph across all available GPUs (here's how you can do it). There are 12 GB GPUs available from NVIDIA.
Finally, you can always consider reducing the floating-point precision, tf.float64 -> tf.float32 -> tf.float16, for all your variables. This can save up to 4x memory, which is sometimes just enough to run on a GPU.
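To make the arithmetic concrete, here is the size estimate for the tensor above at each precision (a quick sketch in plain Python, no TensorFlow needed):
elements = 1 * 144 * 144 * 144 * 128          # ~382 million values
for name, nbytes in [("float64", 8), ("float32", 4), ("float16", 2)]:
    print("%s: %.2f GiB" % (name, elements * nbytes / float(2 ** 30)))
# float64: 2.85 GiB, float32: 1.42 GiB, float16: 0.71 GiB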
