I have followed the steps for installing TensorFlow with GPU support and have made sure that the machine I'm using has a GPU that's compatible, but it still seems that TensorFlow isn't running properly on my machine. I have a program that trains a Keras sequential model (with Python 2.7) on a large amount of data using a TensorFlow backend, and the output while training is the following:
2018-04-17 00:35:13.837040: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:35:14.042784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:35:14.043143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-17 00:35:14.043186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-17 00:35:16.374355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:35:16.374397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-17 00:35:16.374405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-17 00:35:16.380956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
I don't really understand what these logs mean; however, I ran the same job on a machine that has only a CPU, and the time it took to complete the training job was identical. Can anyone tell me how to make my training job actually run on the GPU? Thanks in advance!
You might consider explicitly pinning your program to a GPU, which only takes a small piece of code:
import tensorflow as tf

# Pin these ops to the first GPU (device 0, as reported in your log above).
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# log_device_placement=True prints which device each op actually ran on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
If that doesn't work, I recommend using anaconda3 to create a dedicated TensorFlow-GPU virtual environment, which generally defaults to the GPU build.
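For reference, a quick diagnostic sketch (using the TF 1.x device_lib utility) to confirm which devices the backend has actually registered:
from tensorflow.python.client import device_lib

# Lists the CPU plus any GPU devices TensorFlow registered at startup;
# if no GPU entry appears here, Keras silently falls back to the CPU.
print(device_lib.list_local_devices())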
I am trying to run my Python script using a remote server's GPU that is shared by other users.
The script throws an out-of-memory error even before it reaches the model-training section. The server has 3 GPUs; however, I am only using a single GPU that is not being used by other processes, so I have set "CUDA_VISIBLE_DEVICES" to "0", the GPU not in use.
This is the error that I am getting:
2020-04-15 15:22:01.870082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-15 15:22:01.870161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-15 15:22:02.748227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 15:22:02.748273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-15 15:22:02.748283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-15 15:22:02.749326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 58 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 15:22:02.768792: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op RandomUniform in device /job:localhost/replica:0/task:0/device:GPU:0
2020-04-15 15:22:03.335483: F ./tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory
Aborted (core dumped)
I don't understand why I am running out of memory. The GPU has at least 16 GB of RAM, and the model has not even started training yet.
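(For reference, one common way to pin a process to a single GPU from Python is sketched below; the environment variable must be set before TensorFlow is first imported or it has no effect. The list_physical_devices check assumes a TF 2.x install, which the eager-execution log above suggests.)
import os

# Must run before the first import of tensorflow, otherwise TensorFlow has
# already enumerated every GPU on the machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
print(tf.config.experimental.list_physical_devices("GPU"))  # should list exactly one GPU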
Appreciate everyone's help
Many questions about this have already been posted, but none of them really answers mine, or there is a small difference from what I came across.
I'm on Ubuntu 18.04 and installed Keras following the default instructions, with CUDA 10.1 and tensorflow-gpu.
When running something, TensorFlow detects that I have a GPU, but when I check CPU vs GPU usage, it still only seems to run on the CPU. I came across this thread and ran that script. It confirms what I was guessing: that it can't use my GPU for some reason:
2019-09-19 21:05:57.730197: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-19 21:05:57.730247: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-09-19 21:05:57.730281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-19 21:05:57.730303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-19 21:05:57.730317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-19 21:05:57.922335: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
When listing the devices it says:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 57580461479478464
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6376288845656491190
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 17409275481256463364
physical_device_desc: "device: XLA_CPU device"
]
But halfway through the logs, TensorFlow outputs this:
2019-09-19 20:44:32.676537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 860M major: 5 minor: 0 memoryClockRate(GHz): 1.0195
pciBusID: 0000:01:00.0
./deviceQuery outputs this:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2004 MBytes (2101870592 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
Does anyone know why TensorFlow can't find my GPU, or how to make it available?
Thanks in advance!
It's because of CUDA 10.1: the prebuilt tensorflow-gpu packages of that generation were built against CUDA 10.0, which is why the GPU libraries can't be dlopen'd. You need to downgrade to CUDA 10.0.
Here's a similar solution.
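Once the installed CUDA toolkit matches the TensorFlow build, a quick sanity check (using the TF 1.x test helper) is:
import tensorflow as tf

# True only if the CUDA libraries loaded successfully and a GPU device was registered.
print(tf.test.is_gpu_available())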
Background:
I am a Python Developer new to TensorFlow.
System Spec:
i5-7200U CPU @ 2.50GHz × 4
GeForce 940MX 4GB
Ubuntu 18
I am running TensorFlow on Docker (I found installing the CUDA stack too complicated and long, and maybe I messed something up).
Basically I am running a kind of HelloWorld code on the GPU and the CPU and checking what kind of difference it makes, and to my surprise there is hardly any!
docker-compose.yml
version: '2.3'
services:
  tensorflow:
    # image: tensorflow/tensorflow:latest-gpu-py3
    image: tensorflow/tensorflow:latest-py3
    runtime: nvidia
    volumes:
      - ./:/notebooks/TensorTest1
    ports:
      - 8888:8888
When I run with image: tensorflow/tensorflow:latest-py3 I get approx 5 seconds.
root@e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result: [3. 3. 3. ... 3. 3. 3.]
When I run with image: tensorflow/tensorflow:latest-gpu-py3 I again get approx 5 seconds.
root@baf68fc71921:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result: [3. 3. 3. ... 3. 3. 3.]
My Code
import tensorflow as tf
import time

with tf.Session():
    start_time = time.time()
    input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
    input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
    output = tf.add(input1, input2)
    result = output.eval()
    duration = time.time() - start_time
    print("TIME:", duration)
    print("result: ", result)
Am I doing something wrong here? Based on the prints, it seems to be using the GPU correctly.
I followed the steps at "Can I measure the execution time of individual operations with TensorFlow?" and got this:
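(The technique from that answer boils down to collecting run metadata and dumping a Chrome trace. A rough sketch, assuming the output tensor from the code above and an explicit session, using the TF 1.x timeline API:)
from tensorflow.python.client import timeline

sess = tf.Session()
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
result = sess.run(output, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace file with per-op timing and device placement;
# open it in chrome://tracing to see whether ops ran on /gpu:0 or /cpu:0.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())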
A GPU is an "external" processor, there's overhead involved in compiling a program for it, running it, sending it data, and retrieving the results. GPUs also have different performance tradeoffs from CPUs. While GPUs are frequently faster for large and complex number-crunching tasks, your "hello world" is too simple. It doesn't do very much with each data item between loading it and saving it (just pairwise addition), and it doesn't do very much at all — a million operations is nothing. That makes any setup/teardown overhead relatively more noticeable. So while the GPU is slower for this program it's still likely to be faster for more useful programs.
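To actually see the gap, the benchmark needs far more arithmetic per byte moved. A rough sketch under the same TF 1.x API, using a large matrix multiply and a warm-up run so that graph setup and memory allocation are not counted:
import time
import tensorflow as tf

n = 4000
with tf.Session() as sess:
    a = tf.random_normal([n, n])
    b = tf.random_normal([n, n])
    c = tf.matmul(a, b)  # roughly 128 GFLOPs of work per run
    sess.run(c)          # warm-up: graph setup, memory allocation
    start = time.time()
    sess.run(c)
    print("matmul time:", time.time() - start)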
I am trying to fit the Keras implementation of the SqueezeDet model to a new dataset. After making the appropriate changes to my config file, I tried to run the train script, but it seems to hang after the call to fit_generator(). I get the following output:
/anaconda/envs/py35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Number of images: 536
Number of epochs: 100
Number of batches: 53
Batch size: 10
2018-07-04 14:18:49.711606: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-04 14:18:54.080912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 52a9:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-04 14:18:54.080958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 14:18:54.333214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 14:18:54.333270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 14:18:54.333290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 14:18:54.333559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 52a9:00:00.0, compute capability: 3.7)
Learning rate: 0.01
Weights initialized by name from ../main/model/imagenet.h5
Using single GPU
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Epoch 1/100
And then nothing happens, even if I leave it alone for a day. The call that it seems to freeze on is:
squeeze.model.fit_generator(train_generator, epochs=EPOCHS, verbose=1,
                            steps_per_epoch=nbatches_train, callbacks=cb)
Where the parameters are:
train_generator = generator_from_data_path(img_names, gt_names, config=cfg)
EPOCHS = 100
nbatches_train = 53
callbacks = [# TensorBoard object, ReduceLROnPlateau object, ModelCheckpoint object #]
My versions:
Python 3.5.4 :: Anaconda custom (64-bit)
tensorflow-gpu : 1.8.0
tensorflow : 1.8.0
Keras : 2.2.0
Converting the conversation in the comments into an answer.
The culprit was train_generator.
I looked into the source of model.fit_generator in Keras some time ago. It just retrieves some data from the generator and submits it to the backend; nothing magical :)
So, my hypothesis was that it cannot retrieve data from the generator because the generator does not generate anything.
@Barker confirmed this, stating that a call to next(train_generator) hangs.
I personally have moved to keras.utils.Sequence, which supports indexing and length and is much more convenient than ordinary generators, though that note is not related to the current problem.
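For illustration, a minimal Sequence sketch; load_image and load_labels are hypothetical placeholders for the project's own loading code:
import numpy as np
from keras.utils import Sequence

class DetectionSequence(Sequence):
    def __init__(self, img_names, gt_names, batch_size):
        self.img_names = img_names
        self.gt_names = gt_names
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; with a Sequence, Keras can infer steps_per_epoch from this.
        return int(np.ceil(len(self.img_names) / float(self.batch_size)))

    def __getitem__(self, idx):
        lo, hi = idx * self.batch_size, (idx + 1) * self.batch_size
        x = np.stack([load_image(p) for p in self.img_names[lo:hi]])    # placeholder I/O
        y = np.stack([load_labels(p) for p in self.gt_names[lo:hi]])    # placeholder I/O
        return x, y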
How do I interpret the TensorFlow output for building and executing computational graphs on GPGPUs?
Given the following command that executes an arbitrary TensorFlow script using the Python API:
python3 tensorflow_test.py > out
The first part, stream_executor, seems like it's loading dependencies.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
What is a NUMA node?
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I assume this is where it finds the available GPU:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
Some GPU initialization? What is DMA?
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
Why does it throw an error (the E line)?
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Great answer to what the pool_allocator does: https://stackoverflow.com/a/35166985/4233809
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863
About NUMA -- https://software.intel.com/en-us/articles/optimizing-applications-for-numa
Roughly speaking, if you have a dual-socket CPU, each socket has its own memory and has to access the other processor's memory through a slower QPI link. So each CPU+memory pair is a NUMA node.
Potentially you could treat two different NUMA nodes as two different devices and structure your network to optimize for the different within-node/between-node bandwidth.
However, I don't think there's enough wiring in TF to do this right now. The detection doesn't work either -- I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.
DMA = Direct Memory Access. You could potentially copy things from one GPU to another GPU without involving the CPU (i.e., through NVLink). NVLink integration isn't there yet.
As for the error: TensorFlow tries to allocate close to the GPU's maximum memory, so it sounds like some of your GPU memory has already been allocated to something else and the allocation failed.
You can do something like the following to avoid allocating so much memory:
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
config.operation_timeout_in_ms = 15000                    # terminate on long hangs
sess = tf.InteractiveSession("", config=config)
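A common alternative with the same TF 1.x API is to let the allocator grow GPU memory on demand instead of reserving a fixed fraction:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory lazily, as needed
sess = tf.Session(config=config)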
successfully opened CUDA library xxx locally means that the library was loaded, but it does not mean that it will be used.
successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero means that your kernel does not have NUMA support. You can read about NUMA here and here.
Found device 0 with properties: means you have 1 GPU which you can use; it lists the properties of this GPU.
DMA is direct memory access. More information on Wikipedia.
failed to allocate 11.15G: the error clearly explains why this happened, but it is hard to tell why you need so much memory without looking at the code.
The pool_allocator messages are explained in this answer.