I am trying to parallelize a model with an embedding layer on TensorFlow 2.4.1, but it throws the following error:
InvalidArgumentError: Cannot assign a device for operation sequential/emb_layer/embedding_lookup/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node sequential/emb_layer/embedding_lookup/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceSparseApplyAdagradV2: CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
sequential_emb_layer_embedding_lookup_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adagrad_adagrad_update_update_0_resourcesparseapplyadagradv2_accum (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
sequential/emb_layer/embedding_lookup/ReadVariableOp (ReadVariableOp)
sequential/emb_layer/embedding_lookup/axis (Const)
sequential/emb_layer/embedding_lookup (GatherV2)
gradient_tape/sequential/emb_layer/embedding_lookup/Shape (Const)
gradient_tape/sequential/emb_layer/embedding_lookup/Cast (Cast)
Adagrad/Adagrad/update/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0
[[{{node sequential/emb_layer/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_631]
I simplified the model to a basic one to make the problem reproducible:
import tensorflow as tf
central_storage_strategy = tf.distribute.MirroredStrategy()
with central_storage_strategy.scope():
    user_model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10, 2, name="emb_layer")
    ])
    user_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1), loss="mse")
user_model.fit([1], [[1, 2]], epochs=3)
Any help will be highly appreciated. Thanks !
I finally figured out the problem, in case anyone is looking for an answer.
TensorFlow does not have a complete GPU implementation of the Adagrad optimizer as of now. The ResourceSparseApplyAdagradV2 operation, which the embedding layer's sparse update relies on, only has a CPU kernel (see "ResourceSparseApplyAdagradV2: CPU" in the debug info), so it fails when the variable is placed on the GPU. Adagrad therefore cannot be used with an embedding layer under data-parallel strategies. Using Adam or RMSprop works fine.
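For reference, here is a minimal sketch of the same reproduction script with Adam swapped in for Adagrad. It assumes NumPy arrays for the inputs (the original passed Python lists) and renames the strategy variable; everything else follows the script in the question. It is meant as an illustration of the workaround, not a tuned setup.

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy uses whatever devices are visible (a single one is fine).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    user_model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10, 2, name="emb_layer")
    ])
    # Adam replaces Adagrad; this is the only functional change.
    user_model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")

# One sample: embedding index 1 with a 2-dimensional target.
x = np.array([1])
y = np.array([[1.0, 2.0]])
user_model.fit(x, y, epochs=3, verbose=0)
```

Passing "rmsprop" as the optimizer should work the same way, according to the finding above.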
I'm using a Kaggle TPU to train a TensorFlow CycleGAN model. Everything is fine when training starts, but training freezes randomly after a few epochs. According to Kaggle, RAM usage has not exploded during training.
I see warnings like this during training:
2022-11-28 07:22:58.323282: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1669620178.323159560","description":"Error received from peer ipv4:","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0","grpc_status":3}
Epoch 5/200
When I configure the TPUs, I get these warnings:
2022-11-28 13:56:35.038036: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.040789: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2022-11-28 13:56:35.040821: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-11-28 13:56:35.040850: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (06e37d3ac4e4): /proc/driver/nvidia/version does not exist
2022-11-28 13:56:35.043518: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-28 13:56:35.044759: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.079672: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 ->}
2022-11-28 13:56:35.079743: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.098707: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 ->}
2022-11-28 13:56:35.098760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.101231: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:30020
The TensorFlow version is 2.4.1; I haven't touched any other configuration. My model.fit call looks like this:
history = gan_model.fit(gan_ds,
steps_per_epoch=(max(n_monet_samples, n_photo_samples)//BATCH_SIZE),
Most of the code comes from a Kaggle tutorial, but I've changed the model architecture. Is there a way to solve this issue?
I've tried setting verbose=1 and saw that training freezes on a random step in the middle of an epoch. The number of epochs I'm able to get through seems to depend on the model architecture and batch size, so I suspect a memory issue.
I ran the two tutorials below on a v3-8 and encountered similar warnings in both runs, but they didn't break the training.
Could you please check if the original tutorial code runs for a significant number of epochs? If yes, you might need to review your changes to the model architecture.
Also, if batch_size affects the number of epochs you can train for, it is most probably an out-of-memory error. Try reducing the batch_size, preferably to a factor of 128 per core, and see if the run completes.
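As a rough illustration of the batch-size arithmetic, here is a minimal sketch. It assumes a v3-8 with 8 cores and reads "a factor of 128 per core" as per-core sizes that divide 128. The sample counts and helper names are hypothetical, chosen only for the example; in real code the replica count would come from the distribution strategy.

```python
NUM_CORES = 8  # assumption: a v3-8 exposes 8 TPU cores

# Hypothetical dataset sizes, for illustration only.
n_monet_samples = 300
n_photo_samples = 7038


def global_batch_size(per_core_batch, num_cores=NUM_CORES):
    """Global batch passed to the dataset = per-core batch x number of cores."""
    return per_core_batch * num_cores


def steps_per_epoch(global_batch):
    """Same formula as the steps_per_epoch argument in the question."""
    return max(n_monet_samples, n_photo_samples) // global_batch


# Try progressively smaller per-core batches (all of them divide 128).
for per_core in (128, 64, 32, 16, 8):
    gbs = global_batch_size(per_core)
    print(f"per core: {per_core:4d}  global: {gbs:5d}  steps/epoch: {steps_per_epoch(gbs)}")
```

Stepping down this list until the run completes is one way to check whether memory is the limiting factor.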
More resources -
How improper batch_size can lead to OOM - https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies
Profiling guide - https://cloud.google.com/tpu/docs/cloud-tpu-tools
Feel free to explore our in-depth guides on TPUs with excellent tutorials - https://cloud.google.com/tpu/docs/intro-to-tpu
I have been stuck trying to train my PyTorch model on a GPU. The model works perfectly on the CPU, though. I am using Google Colab's GPU resources for CUDA.
I know that in order to run a model on the GPU, the model, the input features and the target all need to be on the 'cuda' device.
But no matter what I do in my code, I keep getting one of these errors:
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Here is my notebook:
It would be really helpful if someone could tell me exactly which variables need to be moved using .to('cuda').
Additionally, explanations/suggestions for ensuring that this does not recur in the future would be highly appreciated. Thank you !
Your self.hidden is a tuple of torch.Tensors stored as a plain attribute. PyTorch doesn't automatically move this kind of tensor to the GPU when .to(device) is invoked on your model; only parameters and registered buffers are moved.
You can either:
Implement your own to(self, type, device) method for your BiLSTM_CRF class. (Not recommended).
Make self.hidden a registered buffer. This way all methods of nn.Module such as .to(), .float(), etc. will also be applied to self.hidden.
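A minimal sketch of the registered-buffer option follows. It is not the BiLSTM_CRF class from the question: TinyLSTM, its dimensions, and the buffer names h0 and c0 are made up for illustration. Since a tuple cannot be registered as a single buffer, the sketch registers the hidden and cell states separately and rebuilds the tuple in forward.

```python
import torch
import torch.nn as nn


class TinyLSTM(nn.Module):
    """Hypothetical stand-in for a model that keeps an initial LSTM state."""

    def __init__(self, input_dim=4, hidden_dim=8):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Buffers follow the module through .to(), .cuda(), .float(), etc.
        self.register_buffer("h0", torch.zeros(1, 1, hidden_dim))
        self.register_buffer("c0", torch.zeros(1, 1, hidden_dim))

    def forward(self, x):
        batch = x.size(0)
        # Broadcast the stored state to the current batch size.
        h0 = self.h0.expand(-1, batch, -1).contiguous()
        c0 = self.c0.expand(-1, batch, -1).contiguous()
        out, _ = self.lstm(x, (h0, c0))
        return out


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyLSTM().to(device)             # moves parameters and buffers
x = torch.randn(2, 5, 4, device=device)   # inputs must be on the same device
out = model(x)
```

An alternative with the same effect is to create the hidden state inside forward using device=x.device, so it always matches the input.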
First you have to configure the device you want to use; if you are on the GPU, change it to the CPU, and vice versa.
I want to train a network using multiple GPUs (2x NVIDIA RTX A6000) on a Windows 11 machine.
I tried copying the multi-GPU and distributed training code from https://keras.io/guides/distributed_training/
However, I see that GPU 0 is utilized just fine, while GPU 1 is only utilized a little.
Here is a picture of the utilization: (image: GPUs utilization)
When I enable memory growth with
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)
I can even see huge gaps in the utilization of GPU 1, as seen in: (image: GPUs utilization)
This means that for several epochs the second GPU was not utilized at all.
The only differences between the code in the example and my code are that I set epochs to 20 and that I use:
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
I do this because running without HierarchicalCopyAllReduce() results in an error:
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
Increasing the batch size to 512 seems to help a lot, and the second GPU is utilized. (image: GPUs utilization using 512 batch size)
I also tried running the code with strategy.experimental_distribute_dataset, again with a batch size of 512 since that size utilized both GPUs well. However, doing so leaves the second GPU unused, as seen in the picture below.
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
train_dataset = strategy.experimental_distribute_dataset(train_dataset)
val_dataset = strategy.experimental_distribute_dataset(val_dataset)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)
#model.fit(train_dataset, epochs=20, validation_data=val_dataset)
model.fit(train_dataset, epochs=20, validation_data=val_dataset, steps_per_epoch=98, validation_steps=98)
And again I see that the GPU utilization vanishes. (image: GPUs utilization using experimental_distribute_dataset)
My question is:
Why is the second GPU hardly utilized? Isn't the batch split equally between the GPUs, i.e. if the batch size is 128, each GPU receives 64? I assumed that the same model runs on both GPUs and each gets half the batch to process, after which the reduce happens.
If the batch is split this way, wouldn't both GPUs be similarly utilized even with a small batch size?
Also, why does distributing the dataset using the strategy make utilization worse?
I'm trying to train a model on TensorFlow (v1.9.0 on Python 2) with the Adadelta optimizer on a GPU. It shows the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'embedding_matrix_de/read': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices:
UnsortedSegmentSum: GPU CPU
Unique: GPU CPU
Shape: GPU CPU
StridedSlice: GPU CPU
GatherV2: GPU CPU
SparseApplyAdadelta: CPU
Const: GPU CPU
Identity: CPU
VariableV2: GPU CPU
Colocation members and user-requested devices:
embedding_matrix_de (VariableV2)
embedding_matrix_de/read (Identity)
embedding_lookup/axis (Const)
embedding_lookup (GatherV2)
gradients/embedding_lookup_grad/Shape (Const)
gradients/embedding_lookup_grad/ToInt32 (Cast)
embedding_matrix_de/Adadelta (VariableV2)
embedding_matrix_de/Adadelta_1 (VariableV2)
Adadelta/update_embedding_matrix_de/Unique (Unique)
Adadelta/update_embedding_matrix_de/Shape (Shape)
Adadelta/update_embedding_matrix_de/strided_slice/stack (Const)
Adadelta/update_embedding_matrix_de/strided_slice/stack_1 (Const)
Adadelta/update_embedding_matrix_de/strided_slice/stack_2 (Const)
Adadelta/update_embedding_matrix_de/strided_slice (StridedSlice)
Adadelta/update_embedding_matrix_de/UnsortedSegmentSum (UnsortedSegmentSum)
Adadelta/update_embedding_matrix_de/SparseApplyAdadelta (SparseApplyAdadelta)
[[Node: embedding_matrix_de/read = Identity[T=DT_FLOAT, _class=["loc:#embedding_matrix_de"]](embedding_matrix_de)]]
When I replace Adadelta with Adam, there are no issues. Some pieces of the code are given below.
embedding_matrix_decode = tf.get_variable(
shape=[trainVocabSize, embedding_size],
optimizer = tf.train.AdadeltaOptimizer()
I encountered the same issue with TensorFlow 2.1.1. The Adadelta optimizer seems to have no support on either GPU or TPU.
I followed this tutorial to export my own trained TensorFlow model to C++, and I get errors when I call freeze_graph:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/Const_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: save/Const_1 = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/device:GPU:0"]()]]
Caused by op u'save/Const_1', defined at:
GPU:0 is detected and usable by TensorFlow, so I don't understand where the error comes from.
Any ideas?
The error means the op save/Const_1 is being placed on the GPU, but there is no GPU implementation for that node. Here the Const node holds a string (dtype=DT_STRING) and, as the debug info shows (Const: CPU), only a CPU kernel is available for it; its value is stored as part of the Graph object, so it can't be placed on the GPU. One workaround is to run with allow_soft_placement=True; another is to open the pbtxt file and manually remove the device line for that node.
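Here is a minimal sketch of the allow_soft_placement workaround. It assumes a TensorFlow 2.x install that exposes the TF1 API through tf.compat.v1 (on TensorFlow 1.x the equivalents are tf.ConfigProto and tf.Session), and it uses a toy string constant pinned to the GPU rather than the saver node from the question.

```python
import tensorflow as tf

# Soft placement lets TensorFlow fall back to a supported device (here the
# CPU) when the requested device has no kernel for the op.
config = tf.compat.v1.ConfigProto(allow_soft_placement=True)

with tf.Graph().as_default():
    # Explicitly pin a string constant to the GPU, mimicking save/Const_1.
    with tf.device("/device:GPU:0"):
        name = tf.constant("model")
    with tf.compat.v1.Session(config=config) as sess:
        result = sess.run(name)

print(result)
```

How the config reaches the session that freeze_graph creates depends on how the script is invoked, so removing the device line from the pbtxt may be the simpler route in that case.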