I am trying to fit the Keras implementation of the SqueezeDet model to a new dataset. After making the appropriate changes to my config file, I tried to run the train script, but it seems to hang after the call to fit_generator(). As I get the following output:
/anaconda/envs/py35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Number of images: 536
Number of epochs: 100
Number of batches: 53
Batch size: 10
2018-07-04 14:18:49.711606: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-04 14:18:54.080912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 52a9:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-04 14:18:54.080958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 14:18:54.333214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 14:18:54.333270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 14:18:54.333290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 14:18:54.333559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 52a9:00:00.0, compute capability: 3.7)
Learning rate: 0.01
Weights initialized by name from ../main/model/imagenet.h5
Using single GPU
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Epoch 1/100
And then nothing happens even if it leave it alone for a day. The call that it seems to freeze on is:
squeeze.model.fit_generator(train_generator, epochs=EPOCHS, verbose=1,
steps_per_epoch=nbatches_train, callbacks=cb)
Where the parameters are:
train_generator = generator_from_data_path(img_names, gt_names, config=cfg)
EPOCHS = 100
nbatches_train = 53
callbacks = [# TensorBoard object, ReduceLROnPlateau object, ModelCheckpoint object #]
My versions:
Python 3.5.4 :: Anaconda custom (64-bit)
tensorflow-gpu : 1.8.0
tensorflow : 1.8.0
Keras : 2.2.0
Formatting conversation in comments to answer.
The culprit was train_generator.
I have looked into sources of model.fit_generator in Keras some time ago. It just retrieves some data from the generator and submits it to the backend, nothing magical :)
So, my hypothesis was that it cannot retrieve data from the generator because the generator does not generate anything.
#Barker has confirmed it, stating that call to next(train_generator) hangs.
I personally have moved to keras.utils.Sequence that supports indexing and length and is much more convenient than ordinary generators. Though this note is not related to the current problem.
Related
I'm using kaggle TPU to train a tensorflow CycleGAN model. Everything is fine after training starts, but training freezes randomly after a few models. RAM has not exploded during training according to kaggle.
I've met with warnings during training as such:
2022-11-28 07:22:58.323282: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1669620178.323159560","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0","grpc_status":3}
Epoch 5/200
When I'm configuring the TPUs I've warnings as:
2022-11-28 13:56:35.038036: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.040789: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2022-11-28 13:56:35.040821: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-11-28 13:56:35.040850: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (06e37d3ac4e4): /proc/driver/nvidia/version does not exist
2022-11-28 13:56:35.043518: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-28 13:56:35.044759: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.079672: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.079743: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.098707: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.098760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.101231: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:30020
Tensorflow version is 2.4.1, other configs I haven't touched. My model.fit function looks like such:
history = gan_model.fit(gan_ds,
epochs=EPOCHS,
callbacks=[GANMonitor()],
steps_per_epoch=(max(n_monet_samples, n_photo_samples)//BATCH_SIZE),
verbose=2,
workers=0).history
Most parts of the code comes from a kaggle tutorial, but I've changed the model architecture. Is there a way to solve this issue?
🙏
I've tried configuring it to verbose=1 and saw that training freezes on a random step in the middle of an epoch. The number of epochs I'm able to go through seems to be depending on the model architecture and batchsize, so I think there's some issue with memory?
I tried to run below two tutorials on v3-8 and I encountered similar warnings in both the runs.
https://www.kaggle.com/code/philculliton/a-simple-petals-tf-2-2-notebook
https://www.kaggle.com/code/amyjang/monet-cyclegan-tutorial
But they didn't break the training.
Could you please check if the original tutorial code runs for a significant number of epochs? If yes, you might need to review your changes to the model architecture.
Also, if batch_size is affecting the number of training epochs, then most probably it's an Out of Memory error. Try reducing the batch_size preferably to a factor of 128 per core and see if the run completes.
More resources -
How improper batch_size can lead to OOM - https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies
Profiling guide - https://cloud.google.com/tpu/docs/cloud-tpu-tools
Feel free to explore our in-depth guides on TPUs with excellent tutorials - https://cloud.google.com/tpu/docs/intro-to-tpu
I am training an English-Vietnamese NMT model using fairseq.
fairseq tells that it is training the model on 1 GPU. However, when I check the GPU, It seems not to be used and the training process is very slow.
screenshot: GPU usage
Training on 63k sentences corpus: an epoch takes about 1 hours. (model: fconv)
Training on 233k sentences corpus: an epoch takes about 4 hours. (model: transformer)
screenshot: console log
My GPU is NVIDIA GeForce GTX 1050 and the CUDA version is 10.2.
Am I successfully training the model on GPU?
Glad to see your solutions/suggestions.
Background:
I am a Python Developer new to TensorFlow.
System Spec:
i5-7200U CPU # 2.50GHz × 4
GeForce 940MX 4GB
Ubuntu 18
I am running TensorFlow on Docker (found installing cuda stuff too complicated, and long, maybe i messed up something)
Basically I am running a kind of HelloWorld code on GPU and CPU and checking what kind of difference will it have and to my surprise there is hardly any!
docker-compose.yml
version: '2.3'
services:
tensorflow:
# image: tensorflow/tensorflow:latest-gpu-py3
image: tensorflow/tensorflow:latest-py3
runtime: nvidia
volumes:
- ./:/notebooks/TensorTest1
ports:
- 8888:8888
When I run with image: tensorflow/tensorflow:latest-py3 I get approx 5 seconds.
root#e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result: [3. 3. 3. ... 3. 3. 3.]
when I run with image: tensorflow/tensorflow:latest-gpu-py3 I again get approx 5 seconds.
root#baf68fc71921:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result: [3. 3. 3. ... 3. 3. 3.]
My Code
import tensorflow as tf
import time
with tf.Session():
start_time = time.time()
input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
output = tf.add(input1, input2)
result = output.eval()
duration = time.time() - start_time
print("TIME:", duration)
print("result: ", result)
Am I doing something wrong here? Based on prints it seems to be using GPU correctly
Followed these steps at Can I measure the execution time of individual operations with TensorFlow? and I got this
A GPU is an "external" processor, there's overhead involved in compiling a program for it, running it, sending it data, and retrieving the results. GPUs also have different performance tradeoffs from CPUs. While GPUs are frequently faster for large and complex number-crunching tasks, your "hello world" is too simple. It doesn't do very much with each data item between loading it and saving it (just pairwise addition), and it doesn't do very much at all — a million operations is nothing. That makes any setup/teardown overhead relatively more noticeable. So while the GPU is slower for this program it's still likely to be faster for more useful programs.
The computer I use is 1080 which has a GPU memory of 8 GB, the memory of my computer is 32 GB, but the array data might be to large for me to restore, the computer tells me resource exhausted. if there is anyway to solve this problem or evaluate the GPU memory i need for such a large numpy array so I can buy a better computer for calculate.by the way the batch_size I use is 1 so i have reduce the memory to the minimal, or i should consider to reduce the raw column and the height of my numpy array, and I think that would effect the resolution of my results that would be okay.
If anyone can answer my question. thanks
The tensor itself you are using is big, but not that big for a 8Gb GPU. 144 * 144 * 144 * 128 is ~380 million, so even with 32-bit items it requires 1.5GiB. I have a GeForce GTX 1070 with 8Gb (same size as you) and here's my Tensorflow experiment:
import numpy as np
import tensorflow as tf
X = tf.placeholder(dtype=tf.int32, shape=(1, 144, 144, 144, 128))
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
value = session.run([X], feed_dict={X: np.zeros(shape=(1, 144, 144, 144, 128))})
print np.array(value).shape
The output:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7465
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 4.14GiB
2017-08-17 20:05:54.312424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-08-17 20:05:54.312430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-08-17 20:05:54.312444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
(1, 1, 144, 144, 144, 128)
Note that free memory is much lower than 8Gb, because I use 2 UHD monitors. So this might be the first cause in your case: other processes can consume a lot of GPU resources.
Next, you didn't provide your Neural Network architecture, but if you are using, for instance, Deep Convolutional Neural Networks, note that the first layers are consuming a lot of memory for parameters and gradients. You might want to read this helpful page for details. If this is the case, you might need to plug in another GPU and split the graph across all available GPUs (here's how you can do it). There are 12Gb memory GPUs available from NVidia.
Finally, you can always consider reducing the floating precision tf.float64 -> tf.float32 -> tf.float16 for all your variables. This can save 8x memory, which sometimes is just enough to run on a GPU.
I followed this tutoriel to export my own trained tensorflow model to c++ and I got errors when I call freeze_graph
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
...
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/Const_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: save/Const_1 = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/device:GPU:0"]()]]
Caused by op u'save/Const_1', defined at:
...
GPU:0 is detected and usable by Tensorflow, so I don't understand from where the error comes from.
Any idea ?
The error means op save/Const_1 is trying to get placed on GPU, and there's no GPU implementation of that node. In fact Const nodes are CPU only and are stored as part of Graph object, so it can't be placed on GPU. One work-around is to run with allow_soft_placement=True, or to open the pbtxt file and manually remove the device line for that node