I'm using a Kaggle TPU to train a TensorFlow CycleGAN model. Everything is fine once training starts, but training freezes randomly after a few epochs. According to Kaggle, RAM has not exploded during training.
During training I get warnings such as:
2022-11-28 07:22:58.323282: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1669620178.323159560","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0","grpc_status":3}
Epoch 5/200
When I configure the TPU I get warnings like:
2022-11-28 13:56:35.038036: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.040789: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2022-11-28 13:56:35.040821: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-11-28 13:56:35.040850: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (06e37d3ac4e4): /proc/driver/nvidia/version does not exist
2022-11-28 13:56:35.043518: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-28 13:56:35.044759: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.079672: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.079743: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.098707: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.098760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.101231: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:30020
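For reference, the TPU setup in the Kaggle tutorials this code is based on typically looks roughly like the sketch below (this assumes the standard TPUClusterResolver flow; the exact cell may differ):

import tensorflow as tf

# Detect the TPU and build a TPUStrategy; fall back to the default strategy on CPU/GPU.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # auto-detected on Kaggle
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

print("Number of replicas:", strategy.num_replicas_in_sync)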
The TensorFlow version is 2.4.1; I haven't touched other configs. My model.fit call looks like this:
history = gan_model.fit(gan_ds,
                        epochs=EPOCHS,
                        callbacks=[GANMonitor()],
                        steps_per_epoch=(max(n_monet_samples, n_photo_samples) // BATCH_SIZE),
                        verbose=2,
                        workers=0).history
Most of the code comes from a Kaggle tutorial, but I've changed the model architecture. Is there a way to solve this issue? 🙏
I've tried setting verbose=1 and saw that training freezes at a random step in the middle of an epoch. The number of epochs I can get through seems to depend on the model architecture and batch size, so I think there is some memory issue?
I ran the two tutorials below on a v3-8 and encountered similar warnings in both runs:
https://www.kaggle.com/code/philculliton/a-simple-petals-tf-2-2-notebook
https://www.kaggle.com/code/amyjang/monet-cyclegan-tutorial
But the warnings didn't break training.
Could you please check if the original tutorial code runs for a significant number of epochs? If yes, you might need to review your changes to the model architecture.
Also, if batch_size is affecting how many epochs you get through, it is most probably an out-of-memory error. Try reducing the batch_size (preferably to a factor of 128 per core) and see if the run completes.
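As a rough sketch of what that could look like in code (the per-replica value is only an illustration, and strategy and gan_ds are assumed from your notebook):

# A v3-8 has 8 replicas; scale a per-replica batch by the replica count.
BATCH_SIZE_PER_REPLICA = 32   # illustrative; lower it further if OOM persists
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
# drop_remainder keeps batch shapes static, which TPU/XLA compilation prefers.
gan_ds = gan_ds.batch(BATCH_SIZE, drop_remainder=True)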
More resources -
How improper batch_size can lead to OOM - https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies
Profiling guide - https://cloud.google.com/tpu/docs/cloud-tpu-tools
Feel free to explore our in-depth guides on TPUs with excellent tutorials - https://cloud.google.com/tpu/docs/intro-to-tpu
Related
I am running a deep learning script, but I am not an expert. My task is to train on multiple GPUs; however, I have trouble specifying which GPUs to use. Here are the steps where I get confused.
I set multiple GPUs with:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 2
When I start creating the model, I receive this error, which I did not get when using only ONE GPU:
tf.random.set_seed(42)
model_unet = binary_unet(256,256,6)
ResourceExhaustedError: OOM when allocating tensor with shape[3,3,64,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TruncatedNormal]
I thought I could set one GPU first (for step 2) by setting CUDA_VISIBLE_DEVICES to the ONE GPU I want, and then specify multiple GPUs (after step 2) by setting CUDA_VISIBLE_DEVICES to multiple GPUs. But then TensorFlow couldn't recognize the multiple GPUs; for example:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 1
Note that the number of available GPUs becomes 1. This sticks around unless I restart the kernel and clear the output. Actually, I need to restart the kernel and clear the output between steps 1 and 2 as well, to make sure only 1 GPU is visible so that step 2 doesn't fail. But I can't just restart and clear everything, because I need the previous outputs to run the epochs.
I believe some potential solutions are: 1) make step 2 (creating a U-Net model) run with multiple GPUs; 2) somehow clear TensorFlow's device state without restarting the kernel, so that I can create the model with 1 GPU but train the data/run epochs with multiple GPUs. But I have no idea how to do either. Could someone help?
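For reference, a rough sketch of what I imagine option 1 would look like: pin the visible devices once before TensorFlow is imported, then build the model inside a MirroredStrategy scope (the compile arguments below are just placeholders). I'm not sure this is the right approach:

import os
# Must run before TensorFlow initializes the GPUs, so before importing it.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # mirrors variables across both visible GPUs
print("# of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    tf.random.set_seed(42)
    model_unet = binary_unet(256, 256, 6)
    model_unet.compile(optimizer="adam", loss="binary_crossentropy")  # placeholders

# model_unet.fit(train_dataset, epochs=..., batch_size=...)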
I want to train a network using multiple GPUs (2x NVIDIA RTX A6000) on a Windows 11 machine.
I tried copying the Multi-GPU and distributed training code from https://keras.io/guides/distributed_training/
However, I see that GPU 0 is utilized just fine, but GPU 1 is only utilized a little.
Here is a picture of the utilization:
[screenshot: GPU utilization]
While using the following:
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)
I can even see huge gaps in the utilization of GPU 1, as seen in:
[screenshot: GPU utilization]
Meaning that for several epochs the second GPU was not utilized at all.
The only difference between the code in the example and my code is that I set epochs to 20, and I use:
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
Since running without HierarchicalCopyAllReduce() results in an error:
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
Increasing the batch size to 512 seems to help a lot and the second GPU is utilized: [screenshot: GPU utilization using 512 batch size]
I also tried running the code with strategy.experimental_distribute_dataset, again with batch size 512 since that batch size utilized both GPUs well; however, doing so leaves the second GPU unused, as seen in the picture below.
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
train_dataset = strategy.experimental_distribute_dataset(train_dataset)
val_dataset = strategy.experimental_distribute_dataset(val_dataset)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)
#model.fit(train_dataset, epochs=20, validation_data=val_dataset)
model.fit(train_dataset, epochs=20, validation_data=val_dataset, steps_per_epoch=98, validation_steps=98)
And again I see that the GPU utilization vanished: [screenshot: GPU utilization using experimental_distribute_dataset]
My question is:
Why is the second GPU hardly utilized? Isn't the batch split equally between the GPUs, i.e. if the batch size is 128, one GPU receives 64 and the other GPU also receives 64? I assumed that the same model runs on both GPUs and that each gets half the batch to process, after which the reduction happens.
If the batch were split that way, wouldn't both GPUs be similarly utilized even with a small batch size?
Also, why does distributing the dataset using the strategy make utilization worse?
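For reference, my understanding of how the batch sizes relate under MirroredStrategy is sketched below (the numbers are illustrative); please correct me if this is wrong:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

GLOBAL_BATCH_SIZE = 512
# With 2 GPUs each replica should see GLOBAL_BATCH_SIZE / 2 = 256 examples per step.
print("Per-replica batch:", GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync)

# experimental_distribute_dataset expects the dataset to be batched with the
# global batch size; the strategy then splits each batch across the replicas.
# train_dataset = train_dataset.batch(GLOBAL_BATCH_SIZE)
# train_dist = strategy.experimental_distribute_dataset(train_dataset)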
I am trying to parallelize a model with an embedding layer on TensorFlow 2.4.1, but it is throwing the following error:
InvalidArgumentError: Cannot assign a device for operation sequential/emb_layer/embedding_lookup/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node sequential/emb_layer/embedding_lookup/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU XLA_CPU XLA_GPU
Cast: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
ResourceSparseApplyAdagradV2: CPU
_Arg: GPU CPU XLA_CPU XLA_GPU
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
Colocation members, user-requested devices, and framework assigned devices, if any:
sequential_emb_layer_embedding_lookup_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adagrad_adagrad_update_update_0_resourcesparseapplyadagradv2_accum (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
sequential/emb_layer/embedding_lookup/ReadVariableOp (ReadVariableOp)
sequential/emb_layer/embedding_lookup/axis (Const)
sequential/emb_layer/embedding_lookup (GatherV2)
gradient_tape/sequential/emb_layer/embedding_lookup/Shape (Const)
gradient_tape/sequential/emb_layer/embedding_lookup/Cast (Cast)
Adagrad/Adagrad/update/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0
[[{{node sequential/emb_layer/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_631]
I simplified the model to a basic one to make it reproducible:
import tensorflow as tf

central_storage_strategy = tf.distribute.MirroredStrategy()
with central_storage_strategy.scope():
    user_model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10, 2, name="emb_layer")
    ])
    user_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1), loss="mse")
user_model.fit([1], [[1, 2]], epochs=3)
Any help would be highly appreciated. Thanks!
So I finally figured out the problem, in case anyone is looking for an answer.
TensorFlow does not have a complete GPU implementation of the Adagrad optimizer as of now. The ResourceSparseApplyAdagradV2 operation, which is integral to the embedding layer's update, errors out on GPU, so Adagrad cannot be used with an embedding layer under data-parallel strategies. Using Adam or RMSprop works fine.
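A minimal sketch of that workaround, applied to the reproduction above (only the optimizer changes):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    user_model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10, 2, name="emb_layer")
    ])
    # Adam avoids the CPU-only ResourceSparseApplyAdagradV2 op that the
    # embedding update hits under Adagrad.
    user_model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
user_model.fit([1], [[1, 2]], epochs=3)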
I'm a noob when it comes to Python and machine learning. I'm trying to run two different projects that have to do with something called Deep Image Matting:
https://github.com/Joker316701882/Deep-Image-Matting with Tensorflow
https://github.com/huochaitiantang/pytorch-deep-image-matting with Pytorch
I'm just trying to run the tests in these projects, but I run into various problems. Can I run these on a machine without a GPU? I thought a GPU is only for speeding up processing, and I'm only interested in seeing these run before getting a machine with a GPU.
I apologize in advance, as I know I'm a total noob at this.
When I try the TensorFlow project:
I get an error on the line gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = args.gpu_fraction), probably because I was on TF2 and this requires TF1.
After I downgraded to TF1, when I try to run the test I get:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
and
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'MaxPoolWithArgmax' with these attrs. Registered devices: [CPU], Registered kernels:
<no registered kernels>
and now I'm stuck because I have no clue what this means.
When I try the PyTorch project:
First I get this error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
So I added map_location=torch.device('cpu') when the model is loaded, but now I get RuntimeError: Error(s) in loading state_dict for VGG16:
size mismatch for conv6_1.weight: copying a param with shape torch.Size([512, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]). And I'm stuck again
Can someone help?
Thank you in advance!
For the PyTorch one, there were two problems, and it looks like you've solved the first one on your own with map_location. The second problem is that the weights in your checkpoint and the weights in your model don't have the same shape! Taking a quick detour to the GitHub repo, let's visit net.py in core and take a look at lines 26 to 28:
# model released before 2019.09.09 should use kernel_size=1 & padding=0
# self.conv6_1 = nn.Conv2d(512, 512, kernel_size=1, padding=0,bias=True)
self.conv6_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1,bias=True)
I'm guessing the checkpoint is loading weights where conv6_1 has a kernel size of 1 rather than 3, like the commented-out line of code. So try uncommenting the line with kernel_size=1 and commenting out the line with kernel_size=3.
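If you want to confirm which variant your checkpoint was trained with before editing net.py, something along these lines should work (the path is a placeholder; depending on how the checkpoint was saved, you may need to unwrap a nested state_dict first):

import torch

# Load onto the CPU so this also works on a machine without a GPU.
state = torch.load("path/to/checkpoint.pth", map_location=torch.device("cpu"))
# [512, 512, 1, 1] -> older weights: use kernel_size=1, padding=0
# [512, 512, 3, 3] -> matches the current kernel_size=3, padding=1 definition
print(state["conv6_1.weight"].shape)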
I was trying to enable multi-GPU training for my model. The code is as follows:
with tf.device('/cpu:0'):
    # creating a model
    multi_gpu_model = keras.utils.multi_gpu_model(model, gpus=2)
    multi_gpu_model.compile(loss='cosine_proximity', optimizer='nadam', metrics=['accuracy'])

try:
    multi_gpu_model.fit_generator(
        sequential_generator('/home/jindal/notebooks/witter_dataset_sequences_20m', batch_size, total_lines),
        steps_per_epoch=steps_per_epoch, epochs=epochs, callbacks=[WeightsSaver(model, 200)]
    )
except Exception as e:
    print(e)
Now when I try to run it, I get an error as follows:
creating a partition for /device:CPU:3 which doesn't exist in the list of available devices. Available devices: /device:CPU:0,/device:XLA_CPU:0,/device:XLA_GPU:0,/device:GPU:0,/device:GPU:1,/device:GPU:2,/device:GPU:3.
I think that I have read the documentation correctly. My Keras package is up to date. I have opened an issue on GitHub as well. What is it that I am doing wrong?
Note: I am using a Jupyter Notebook and have 4 GPUs available, each with 12 GB of RAM. I have limited the Jupyter Notebook to use only 2 GPUs (GPUs 2 and 3) with the commands:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=2, 3
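For what it's worth, I also considered replacing the deprecated keras.utils.multi_gpu_model with tf.distribute.MirroredStrategy, roughly as sketched below (reusing the names from my code; I haven't verified that this avoids the error):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses the GPUs left visible by CUDA_VISIBLE_DEVICES

with strategy.scope():
    model = build_model()  # hypothetical stand-in for however the original model is built
    # loss/optimizer strings kept from my code; on newer TF the loss may need to be 'cosine_similarity'
    model.compile(loss='cosine_proximity', optimizer='nadam', metrics=['accuracy'])

model.fit(
    sequential_generator('/home/jindal/notebooks/witter_dataset_sequences_20m', batch_size, total_lines),
    steps_per_epoch=steps_per_epoch, epochs=epochs, callbacks=[WeightsSaver(model, 200)]
)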