TensorFlow 2.0 GPU internal out-of-memory error - python

I am trying to run my Python script using a remote server's GPU that is shared by other users.
The script throws an out-of-memory error even before it reaches the model-training section.
The server has 3 GPUs; however, I am only using a single GPU that is not being used by other processes. Hence I have set "CUDA_VISIBLE_DEVICES" to "0", the GPU not in use.
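For reference, this is roughly how I set it (the variable must be set before TensorFlow initializes CUDA):
import os

# Must happen before TensorFlow creates its CUDA context
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf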
This is the error that I am getting.
2020-04-15 15:22:01.870082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-15 15:22:01.870161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-15 15:22:02.748227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 15:22:02.748273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-15 15:22:02.748283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-15 15:22:02.749326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 58 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 15:22:02.768792: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op RandomUniform in device /job:localhost/replica:0/task:0/device:GPU:0
2020-04-15 15:22:03.335483: F ./tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory
Aborted (core dumped)
I don't understand why I am running out of memory. There are at least 16 GB of RAM free on this GPU, and the model has not even started training yet (though the log above says the device was created with only 58 MB of memory).
I'd appreciate everyone's help.

Related

How do I convert a TensorFlow model into a TensorRT-optimized model using trt.TrtGraphConverterV2 (or other suggestion)?

I am stuck with a problem regarding TensorRT and TensorFlow.
I am using an NVIDIA Jetson Nano and I am trying to convert simple TensorFlow models into TensorRT-optimized models.
I am using tensorflow 2.1.0 and python 3.6.9.
I tried to use this code sample from the NVIDIA guide:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverterV2(input_saved_model_dir=input_saved_model_dir)
converter.convert()
converter.save(output_saved_model_dir)
To test this, I took a simple example from the TensorFlow website. To convert the model into a TensorRT model, I save it as a SavedModel and then load it into the trt.TrtGraphConverterV2 function:
#https://www.tensorflow.org/tutorials/quickstart/beginner
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import os
#mnist = tf.keras.datasets.mnist
#(x_train, y_train), (x_test, y_test) = mnist.load_data()
#x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    #tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
# create paths to save models
model_name = "simpleModel"
pb_model = os.path.join(os.path.dirname(os.path.abspath(__file__)),(model_name+"_pb"))
trt_model = os.path.join(os.path.dirname(os.path.abspath(__file__)),(model_name+"_trt"))
if not os.path.exists(pb_model):
    os.mkdir(pb_model)
if not os.path.exists(trt_model):
    os.mkdir(trt_model)
tf.saved_model.save(model, pb_model)
# https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#usage-example
print("\nconverting to trt-model")
converter = trt.TrtGraphConverterV2(input_saved_model_dir=pb_model)
print("\nconverter.convert")
converter.convert()
print("\nconverter.save")
converter.save(trt_model)
print("trt-model saved under: ",trt_model)
When I run this code it saves the TRT-optimized model, but the model cannot be used. When I load the model and try model.summary(), for example, it tells me:
Traceback (most recent call last):
File "/home/al/Code/Benchmark_70x70/test-load-pb.py", line 45, in <module>
model.summary()
AttributeError: '_UserObject' object has no attribute 'summary'
This is the complete output of the converter script:
2020-04-01 20:38:07.395780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:11.837436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-01 20:38:11.879775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-01 20:38:17.015440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-01 20:38:17.054065: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.061718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 23.84GiB/s
2020-04-01 20:38:17.061853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:17.061989: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-01 20:38:17.145546: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-01 20:38:17.252192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-01 20:38:17.368195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-01 20:38:17.433245: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-01 20:38:17.433451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-01 20:38:17.433761: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.434112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.434418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-01 20:38:17.483529: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-04-01 20:38:17.504302: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x13e7b0f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-01 20:38:17.504407: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-01 20:38:17.713898: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.714293: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x13de1210 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-01 20:38:17.714758: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2020-04-01 20:38:17.715405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.715650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 23.84GiB/s
2020-04-01 20:38:17.715796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:17.715941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-01 20:38:17.716057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-01 20:38:17.716174: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-01 20:38:17.716252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-01 20:38:17.716311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-01 20:38:17.716418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-01 20:38:17.716687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.716994: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:17.717111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-01 20:38:17.736625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:30.190208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-01 20:38:30.315240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-01 20:38:30.315482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-01 20:38:30.832895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:31.002925: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:31.005861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2020-04-01 20:38:34.803674: W tensorflow/python/util/util.cc:319] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1786: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
converting to trt-model
2020-04-01 20:38:37.808143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
converter.convert
2020-04-01 20:38:39.618691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:39.618842: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-04-01 20:38:39.619224: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-04-01 20:38:39.712117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:39.712437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 23.84GiB/s
2020-04-01 20:38:39.712594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:39.744930: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-01 20:38:40.056630: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-01 20:38:40.153461: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-01 20:38:40.176047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-01 20:38:40.214052: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-01 20:38:40.231552: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-01 20:38:40.231927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.232253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.232388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-01 20:38:40.232538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-01 20:38:40.232587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-01 20:38:40.232618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-01 20:38:40.232890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.233546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.233761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2020-04-01 20:38:40.579950: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:841] Optimization results for grappler item: graph_to_optimize
2020-04-01 20:38:40.580104: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] function_optimizer: Graph size after: 26 nodes (19), 43 edges (36), time = 179.825ms.
2020-04-01 20:38:40.580157: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] function_optimizer: function_optimizer did nothing. time = 0.152ms.
2020-04-01 20:38:40.941994: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.942217: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-04-01 20:38:40.942412: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-04-01 20:38:40.943756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.943916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 23.84GiB/s
2020-04-01 20:38:40.944010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-01 20:38:40.944073: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-01 20:38:40.944148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-01 20:38:40.944209: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-01 20:38:40.944266: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-01 20:38:40.944320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-01 20:38:40.944372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-01 20:38:40.944572: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.944816: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.944911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-01 20:38:40.944993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-01 20:38:40.945031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-01 20:38:40.945059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-01 20:38:40.945283: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.945569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-04-01 20:38:40.945714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2020-04-01 20:38:41.037807: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 6 ops of 3 different types in the graph that are not converted to TensorRT: Identity, NoOp, Placeholder, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).
2020-04-01 20:38:41.043736: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:636] Number of TensorRT candidate segments: 1
2020-04-01 20:38:41.046312: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:737] Replaced segment 0 consisting of 12 nodes by TRTEngineOp_0.
2020-04-01 20:38:41.073078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:841] Optimization results for grappler item: tf_graph
2020-04-01 20:38:41.073159: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 22 nodes (-4), 35 edges (-8), time = 14.454ms.
2020-04-01 20:38:41.073188: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] layout: Graph size after: 22 nodes (0), 35 edges (0), time = 20.565ms.
2020-04-01 20:38:41.073214: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 22 nodes (0), 35 edges (0), time = 5.644ms.
2020-04-01 20:38:41.073238: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] TensorRTOptimizer: Graph size after: 11 nodes (-11), 14 edges (-21), time = 28.58ms.
2020-04-01 20:38:41.073265: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 11 nodes (0), 14 edges (0), time = 2.904ms.
2020-04-01 20:38:41.073289: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:841] Optimization results for grappler item: TRTEngineOp_0_native_segment
2020-04-01 20:38:41.073312: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 14 nodes (0), 15 edges (0), time = 2.875ms.
2020-04-01 20:38:41.073335: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] layout: Graph size after: 14 nodes (0), 15 edges (0), time = 2.389ms.
2020-04-01 20:38:41.073358: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 14 nodes (0), 15 edges (0), time = 2.834ms.
2020-04-01 20:38:41.073382: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] TensorRTOptimizer: Graph size after: 14 nodes (0), 15 edges (0), time = 0.218ms.
2020-04-01 20:38:41.073405: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:843] constant_folding: Graph size after: 14 nodes (0), 15 edges (0), time = 5.268ms.
converter.save
2020-04-01 20:38:46.730260: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at trt_engine_resource_ops.cc:183 : Not found: Container TF-TRT does not exist. (Could not find resource: TF-TRT/TRTEngineOp_0)
trt-model saved under: /home/al/Code/Benchmark_70x70/simpleModel_trt
Thank you very much for the response. It contains everything I need.
To test the converter script, I ran the code in colab and it worked fine, so I guess I need to check my environment for errors.
Regarding the model.summary() issue:
As you correctly pointed out, it seems that methods from the Keras API are removed when converting the model. I especially needed the model.predict() method to use the new model for prediction. Luckily, there are other ways to run inference. In addition to the one you posted, I found the one described in this tutorial and used it.
I summarized the whole example and explanations in this notebook
# train_images and class_names come from the dataset loaded earlier in the notebook,
# where matplotlib.pyplot is also imported as plt
loaded = tf.saved_model.load('./model_trt')  # loading the converted model
print("The signature keys are: ", list(loaded.signatures.keys()))
infer = loaded.signatures["serving_default"]
im_select = 0  # choose the train image you want to classify
# Here the image classification happens; 'LastLayer' is the name of the last layer we defined in the beginning
labeling = infer(tf.constant(train_images[im_select], dtype=float))['LastLayer']
# Display result
print("Image ", im_select, " is classified as a ", class_names[int(tf.argmax(labeling, axis=1))])
plt.imshow(train_images[im_select])
It seems that the conversion has been successful.
I have tried using both .pb files, the one from Keras and the one from TensorRT.
Below is the sample code:
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

mnist = tf.keras.datasets.mnist

saved_model_loaded = tf.saved_model.load(
    'path to trt converted model')  # path to keras .pb or TensorRT .pb
#for layer in saved_model_loaded.keras_api.layers:
graph_func = saved_model_loaded.signatures['serving_default']
frozen_func = convert_variables_to_constants_v2(graph_func)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# convert to tensors
input_tensors = tf.cast(x_test, dtype=tf.float32)
output = frozen_func(input_tensors[:1])[0].numpy()
print(output)
Note: I have tried both models, the one from Keras and the one from TensorRT, and the result is the same.
Regarding the model.summary() error, it seems that once the model is converted, some of the methods like .summary() are removed.
But you can use TensorBoard as an alternative if you want to inspect the graph of the TensorRT-converted model.
Below is the sample code:
import tensorflow as tf
%load_ext tensorboard
from tensorflow.python.summary import summary

def import_to_tensorboard(model_dir, log_dir):
    """View an imported protobuf model (`.pb` file) as a graph in TensorBoard.

    Args:
        model_dir: The location of the protobuf (`.pb`) model to visualize.
        log_dir: The location for the TensorBoard log to begin visualization from.

    Usage:
        Call this function with your model location and desired log directory.
        Launch TensorBoard by pointing it to the log directory.
        View your imported `.pb` model as a graph.
    """
    with tf.compat.v1.Session(graph=tf.Graph()) as sess:
        tf.compat.v1.saved_model.loader.load(
            sess, [tf.compat.v1.saved_model.tag_constants.SERVING], model_dir)
        pb_visual_writer = summary.FileWriter(log_dir)
        pb_visual_writer.add_graph(sess.graph)
        print("Model Imported. Visualize by running: "
              "tensorboard --logdir={}".format(log_dir))
Call the function:
import_to_tensorboard('path to trt model', '/logs/')
Open TensorBoard:
%tensorboard --logdir='path to logs'
Let me know if this helps.
Steps to convert a TensorFlow model to a TensorRT model:
Load the model (.h5 or .hdf5) using model.load_weights(.h5_file_dir).
Save the model using tf.saved_model.save(your_model, destn_dir).
It will save the model in .pb format with assets and variables folders; keep those as they are.
Use a Linux machine to convert the .pb model to TensorRT.
While converting, remember to give the path to the folder where the .pb file and the other folders (assets and variables) exist, then start converting.
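A minimal sketch of those steps (the paths are placeholders, and tf.keras.models.load_model is used here so the sketch is self-contained):
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Step 1: load the Keras model (placeholder path)
model = tf.keras.models.load_model("model.h5")
# Step 2: export as a SavedModel; this writes a .pb file plus assets/ and variables/
tf.saved_model.save(model, "saved_model_dir")
# Steps 3-4 (on the Linux machine): point the converter at the SavedModel *directory*,
# not at the .pb file itself
converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model_dir")
converter.convert()
converter.save("trt_model_dir")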

Keras (TensorFlow) finds GPU, but only runs on CPU w/ CUDA 10.1

Many questions about this have already been posted, but none of them really answers mine, or there is a small difference from what I came across.
I'm on Ubuntu 18.04 and installed Keras following the default instructions, with CUDA 10.1 AND tensorflow-gpu.
When running something, TensorFlow detects that I have a GPU, but when I check CPU vs. GPU usage, it still only seems to run on the CPU. I came across this thread and ran that script. It confirms what I was guessing, that it can't use my GPU for some reason:
2019-09-19 21:05:57.730197: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-19 21:05:57.730247: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-09-19 21:05:57.730281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-19 21:05:57.730303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-19 21:05:57.730317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-19 21:05:57.922335: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
When listing the devices it says:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 57580461479478464
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6376288845656491190
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 17409275481256463364
physical_device_desc: "device: XLA_CPU device"
]
But halfway through the logs, TensorFlow outputs this:
2019-09-19 20:44:32.676537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 860M major: 5 minor: 0 memoryClockRate(GHz): 1.0195
pciBusID: 0000:01:00.0
./deviceQuery outputs this:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2004 MBytes (2101870592 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
Does anyone know why TensorFlow can't find my GPU, or how to make it available?
Thanks in advance!
It's because of CUDA 10.1. You need to downgrade to CUDA 10.0.
Here's a similar solution.
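A quick way to confirm the fix after downgrading (a minimal check; once the CUDA libraries load, this should print True and the "Cannot dlopen some GPU libraries" warning should disappear):
import tensorflow as tf

# True only if a CUDA-enabled GPU device is registered
print(tf.test.is_gpu_available())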

TensorFlow GPU: No Performance increase in HelloWorld code

Background:
I am a Python Developer new to TensorFlow.
System Spec:
i5-7200U CPU @ 2.50GHz × 4
GeForce 940MX 4GB
Ubuntu 18
I am running TensorFlow on Docker (I found installing the CUDA stuff too complicated and long; maybe I messed something up).
Basically, I am running a kind of HelloWorld code on the GPU and the CPU to check what kind of difference it makes, and to my surprise there is hardly any!
docker-compose.yml
version: '2.3'
services:
  tensorflow:
    # image: tensorflow/tensorflow:latest-gpu-py3
    image: tensorflow/tensorflow:latest-py3
    runtime: nvidia
    volumes:
      - ./:/notebooks/TensorTest1
    ports:
      - 8888:8888
When I run with image: tensorflow/tensorflow:latest-py3 I get approx 5 seconds.
root@e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result: [3. 3. 3. ... 3. 3. 3.]
When I run with image: tensorflow/tensorflow:latest-gpu-py3, I again get approx 5 seconds.
root@baf68fc71921:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result: [3. 3. 3. ... 3. 3. 3.]
My code:
import tensorflow as tf
import time

with tf.Session():
    start_time = time.time()
    input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
    input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
    output = tf.add(input1, input2)
    result = output.eval()
    duration = time.time() - start_time
    print("TIME:", duration)
    print("result: ", result)
Am I doing something wrong here? Based on the prints, it seems to be using the GPU correctly.
I followed the steps at "Can I measure the execution time of individual operations with TensorFlow?" and I got this
A GPU is an "external" processor, there's overhead involved in compiling a program for it, running it, sending it data, and retrieving the results. GPUs also have different performance tradeoffs from CPUs. While GPUs are frequently faster for large and complex number-crunching tasks, your "hello world" is too simple. It doesn't do very much with each data item between loading it and saving it (just pairwise addition), and it doesn't do very much at all — a million operations is nothing. That makes any setup/teardown overhead relatively more noticeable. So while the GPU is slower for this program it's still likely to be faster for more useful programs.
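To make the setup overhead amortize, give the GPU more arithmetic per element; a sketch under the same TF 1.x setup as the question (the matrix sizes and iteration count are arbitrary):
import time
import tensorflow as tf

with tf.Session() as sess:
    # A large matmul does far more arithmetic per element than pairwise addition
    a = tf.random_normal([4000, 4000])
    b = tf.random_normal([4000, 4000])
    c = tf.matmul(a, b)
    sess.run(c)  # warm-up run absorbs one-time setup cost
    start_time = time.time()
    for _ in range(50):
        sess.run(c)
    print("TIME:", time.time() - start_time)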

TensorFlow not running on GPU (Keras with TF backend)

I have followed the steps for installing TensorFlow with GPU support and have made sure that the machine I'm using has a GPU that's compatible, but it still seems that TensorFlow isn't running properly on my machine. I have a program that trains a Keras sequential model (with Python 2.7) on a large amount of data using a TensorFlow backend, and the output while training is the following:
2018-04-17 00:35:13.837040: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:35:14.042784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:35:14.043143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-17 00:35:14.043186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-17 00:35:16.374355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:35:16.374397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-17 00:35:16.374405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-17 00:35:16.380956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
I don't really understand what these logs mean; however, I ran this job simultaneously on a device that just has a CPU, and the times it took to complete the training jobs were identical. Can anyone tell me how to make my training job run on a GPU? Thanks in advance!
You might consider explicitly specifying a GPU for your program to run on, which is a simple piece of code:
import tensorflow as tf

with tf.device('/gpu:0'):  # your log shows a single GPU, device 0
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)
If that doesn't work, I recommend using Anaconda3 to create a TensorFlow-GPU virtual environment, which generally defaults to the GPU version.

How to interpret TensorFlow output?

How do I interpret the TensorFlow output for building and executing computational graphs on GPGPUs?
Consider the following command, which executes an arbitrary TensorFlow script using the Python API:
python3 tensorflow_test.py > out
The first part, stream_executor, seems like it's loading dependencies.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
What is a NUMA node?
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I assume this is when it finds the available GPU:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
Some GPU initialization? What is DMA?
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
Why does it throw an error (E)?
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Great answer to what the pool_allocator does: https://stackoverflow.com/a/35166985/4233809
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863
About NUMA -- https://software.intel.com/en-us/articles/optimizing-applications-for-numa
Roughly speaking, if you have a dual-socket CPU, each socket has its own memory and must access the other processor's memory through a slower QPI link. So each CPU+memory pair is a NUMA node.
Potentially you could treat two different NUMA nodes as two different devices and structure your network to optimize for the different within-node/between-node bandwidths.
However, I don't think there's enough wiring in TF to do this right now. The detection doesn't work either -- I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.
DMA = Direct Memory Access. You could potentially copy things from one GPU to another GPU without involving the CPU (i.e., through NVLink). NVLink integration isn't there yet.
As for the error, TensorFlow tries to allocate close to the GPU's maximum memory, so it sounds like some of your GPU memory has already been allocated to something else and the allocation failed.
You can do something like the following to avoid allocating so much memory:
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
config.operation_timeout_in_ms = 15000  # terminate on long hangs
sess = tf.InteractiveSession("", config=config)
successfully opened CUDA library xxx locally means that the library was loaded, but it does not mean that it will be used.
successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero means that your kernel does not have NUMA support. You can read about NUMA here and here.
Found device 0 with properties: you have 1 GPU which you can use. It lists the properties of this GPU.
DMA is direct memory access. More information on Wikipedia.
failed to allocate 11.15G: the error clearly explains why this happened, but it is hard to tell why you need so much memory without looking at the code.
pool allocator messages are explained in this answer
