I want to train a network using multiple GPUs (2x NVIDIA RTX A6000) on a Windows 11 machine.
I tried copying the Multi-GPU and distributed training code from https://keras.io/guides/distributed_training/
However, I see that GPU 0 is utilized just fine, but GPU 1 is only utilized a little bit.
Here is a picture of the utilization:
GPUs utilization
When I enable memory growth with
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)
I can even see huge gaps in the utilization of GPU 1, as seen in:
GPUs utilization.
This means that for several epochs the second GPU was not utilized at all.
The only difference between the code in the example and my code is that I set epochs to 20, and I use:
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
Since running without HierarchicalCopyAllReduce() results in an error:
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
Increasing the batch size to 512 seems to help a lot, and the second GPU is utilized:
GPUs utilization using 512 batch size
I also tried running the code with strategy.experimental_distribute_dataset, again with a batch size of 512 since this batch size utilized both GPUs well. However, doing so leaves the second GPU unused, as seen in the picture below.
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
train_dataset = strategy.experimental_distribute_dataset(train_dataset)
val_dataset = strategy.experimental_distribute_dataset(val_dataset)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)
#model.fit(train_dataset, epochs=20, validation_data=val_dataset)
model.fit(train_dataset, epochs=20, validation_data=val_dataset, steps_per_epoch=98, validation_steps=98)
And again I see that the GPU utilization vanished:
GPUs utilization using experimental_distribute_dataset
My question is:
Why is the second GPU hardly utilized? Isn't the batch split equally between the GPUs, i.e. if the batch size is 128, one GPU receives 64 and the other GPU also 64? I assumed that the same model is run on both GPUs and that each gets half the batch to process, after which the reduce happens.
If the batch were split this way, wouldn't both GPUs be similarly utilized even with a small batch size?
Also, why does distributing the dataset using the strategy make utilization worse?
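The even split assumed in the question can be written out as plain arithmetic. This is only a sketch of that assumption, not TensorFlow's implementation: `per_replica_batch` and `steps_per_epoch` are hypothetical helpers, and the figure of 50,000 training samples is an assumption based on the dataset split used in the Keras guide.

```python
import math


def per_replica_batch(global_batch_size, num_replicas):
    """Samples each replica handles per step, assuming an even split."""
    # Assumes global_batch_size is divisible by num_replicas.
    return global_batch_size // num_replicas


def steps_per_epoch(num_samples, global_batch_size):
    """Batches needed to see every sample once (the last one may be partial)."""
    return math.ceil(num_samples / global_batch_size)


NUM_TRAIN_SAMPLES = 50_000  # assumption, not stated in the question
NUM_GPUS = 2

for global_batch in (128, 512):
    print(
        global_batch,
        per_replica_batch(global_batch, NUM_GPUS),
        steps_per_epoch(NUM_TRAIN_SAMPLES, global_batch),
    )
```

Under these assumptions a batch size of 512 gives 98 steps, which matches the steps_per_epoch=98 passed to model.fit above. A smaller per-replica batch means less GPU work per step, so fixed per-step costs such as the input pipeline and the all-reduce may take a larger share of the time.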
I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have ~30 million samples that add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+reduction servers) and 256 batch size per replica, the performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be slow like this.
Q: I understand that this is not a well-formed question, but what are some possible reasons why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be slow like this, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU counts
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit, did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)
def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # training was tested to be slow with or without it.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features
dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)
return dataset.prefetch(tf.data.AUTOTUNE)
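As a rough sanity check, the figures quoted above (about 30 million samples, 128 GPUs, 256 samples per replica, about 1 hour per epoch) can be turned into a per-step and per-GPU throughput. This is a back-of-the-envelope sketch, assuming one replica per GPU and a global batch equal to replicas times the per-replica batch. The constant names are hypothetical.

```python
NUM_SAMPLES = 30_000_000   # "~30 million samples"
NUM_GPUS = 128             # assumed to be one replica per GPU
PER_REPLICA_BATCH = 256
EPOCH_SECONDS = 3600       # "about 1 hour per epoch"

# Global batch consumed by all replicas in one training step.
global_batch = NUM_GPUS * PER_REPLICA_BATCH

# Full batches per epoch; the partial one is dropped (drop_final_batch=True).
steps = NUM_SAMPLES // global_batch

seconds_per_step = EPOCH_SECONDS / steps
samples_per_gpu_per_second = NUM_SAMPLES / EPOCH_SECONDS / NUM_GPUS

print(f"global batch:          {global_batch}")
print(f"steps per epoch:       {steps}")
print(f"seconds per step:      {seconds_per_step:.2f}")
print(f"samples / GPU / sec:   {samples_per_gpu_per_second:.1f}")
```

Under these assumptions each step takes roughly 4 seconds and each GPU sees only about 65 samples per second. That would be consistent with the GPUs mostly waiting on something other than compute, such as the input pipeline or cross-worker communication, though this is an inference from the numbers and not a measurement. Note also that TensorFlow by default reserves nearly all GPU memory up front, so ~100% memory utilization does not by itself indicate that the GPU is busy.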
I'm playing with this colab locally with an 8 GB RTX 3070 on Fedora 35 and TensorFlow 2.4.0:
https://github.com/tensorflow/similarity/blob/master/examples/kaggle.ipynb
I get the same errors consistently on Windows with an NVIDIA GTX 1050 Ti (4 GB).
I tried to isolate the origin of the OOM error in model.fit, and it seems linked to the preprocessing phase in which I resize the validation set. GPU VRAM is allocated during this phase.
# load validation image in memory
x_test = []
with tf.device('/CPU:0'):  # solution to avoid occupation of GPU memory and OOM in model.fit
    for p in tqdm(x_test_p):
        img = tf.io.read_file(p)
        img = tf.io.decode_image(img, dtype=tf.dtypes.float32)
        img = tf.image.resize_with_pad(img, IMG_SIZE, IMG_SIZE)
        # if grayscale, convert to rgb
        if tf.shape(img)[2] != 3:
            img = tf.image.grayscale_to_rgb(img)
        x_test.append(img)
If I reduce the validation set size a lot, model.fit will succeed.
If I process the whole validation set on the CPU instead of the GPU, model.fit will succeed.
If I don't pass the whole preprocessed validation set to model.fit, the OOM error will still be present. That's why I suspect the problem is the preprocessing alone occupying GPU memory.
Is it possible that the preprocessed data is loaded into VRAM and not released, limiting the GPU memory left for model.fit?
The problem is somewhat similar to this question, from which I took the idea of preprocessing the validation set on the CPU:
Keras OOM for data validation using GPU
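To get a feel for how much memory a fully decoded validation set can take, here is a small estimate. It is a sketch only: `float32_image_bytes` is a hypothetical helper, and both `IMG_SIZE = 224` and the count of 10,000 validation images are assumed values, since the question does not state them.

```python
def float32_image_bytes(height, width, channels=3):
    """Bytes needed for one decoded image stored as float32 (4 bytes per value)."""
    return height * width * channels * 4


IMG_SIZE = 224            # assumption: the real IMG_SIZE is not given
NUM_VAL_IMAGES = 10_000   # assumption: the real validation set size is not given

total_bytes = NUM_VAL_IMAGES * float32_image_bytes(IMG_SIZE, IMG_SIZE)
print(f"{total_bytes / 1024**3:.2f} GiB for {NUM_VAL_IMAGES} images")
```

With these assumed values the tensors alone come to about 5.6 GiB, which would leave little room for the model on an 8 GB card and none on a 4 GB one if they were held in GPU memory.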
I'm wondering what is the correct way to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API? I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model, they suggest explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this with under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up with all 8 cores at about 70% usage each, and my GPU memory is maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create model on CPU and use tf.device context to train the model on the GPU?
Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
TensorFlow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
GPU utilization also depends on batch_size, so it may change as batch_size varies.
You can also compare your current results (time taken and utilization) with a model built as in Example 3 of the multi_gpu_model documentation.
Also, the linked page states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
You should see better performance and GPU utilization with tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
I would recommend going through the "Custom training with tf.distribute.Strategy" tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops by training a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.
I have the following code segment:
model.fit(x=train_x, y=train_y, batch_size=32, epochs=10, verbose=2, validation_data=(val_x, val_y), initial_epoch=0)
print(model.evaluate(test_x, test_y))
My GPU will still work with a batch size of 1024. However, this will severely penalize the frequency with which the model updates. Is it possible to load the images in groups of 1024 to the GPU but adjust the weights for the model every 32 images?
My intention is to improve performance by reducing the number of times the GPU has to fetch data from main memory since there is high latency involved with this operation. My question is similar to this one: How can you load all batch data into GPU memory in Keras (Theano backend)?
However, I am not necessarily trying to load all my data to the GPU at once, as the dataset is too large.
Thank you!
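The loop structure being asked about can be sketched in plain Python, without any Keras calls: move data in large chunks, then update on small mini-batches within each chunk. `iter_chunks` and `count_updates` are hypothetical helpers used only to count transfers and updates, and the sample count of 4096 is an arbitrary illustration.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def count_updates(num_samples, transfer_size=1024, update_size=32):
    """Count host-to-device transfers and weight updates for one epoch."""
    samples = list(range(num_samples))
    transfers = 0
    updates = 0
    for chunk in iter_chunks(samples, transfer_size):
        transfers += 1  # one copy to the GPU per large chunk
        for _mini_batch in iter_chunks(chunk, update_size):
            updates += 1  # one weight update per mini-batch
    return transfers, updates


print(count_updates(4096))
```

For 4096 samples this gives 4 transfers and 128 updates, compared with 128 transfers if every mini-batch of 32 were copied separately. Whether Keras actually keeps a 1024-sample chunk resident on the device between updates is not something this sketch establishes.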
I am using TensorFlow and feed my data to the network using FIFO queues. In the end my code uses tf.train.shuffle_batch(..) to generate the batches that I feed to my network. Suppose I set batch_size = 4 and I also have, let's say, 4 GPUs. Would the actual batch size be 16? How does the number of GPUs impact the batch size, and what is the benefit of having more than one GPU?
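One way to reason about this, assuming a data-parallel setup in which each GPU dequeues its own batch of 4 and the per-GPU gradients are averaged before the update (the question does not confirm this setup), is the toy calculation below. The per-sample "gradients" are made-up scalars and `shard` is a hypothetical helper.

```python
from statistics import fmean


def shard(values, num_shards):
    """Split values into num_shards equal consecutive pieces."""
    size = len(values) // num_shards
    return [values[i * size:(i + 1) * size] for i in range(num_shards)]


BATCH_PER_GPU = 4
NUM_GPUS = 4

# Toy per-sample gradients: one scalar per sample, 16 samples in total.
per_sample_grads = [float(i) for i in range(BATCH_PER_GPU * NUM_GPUS)]

# Each GPU averages the gradients of its own 4 samples...
per_gpu_means = [fmean(s) for s in shard(per_sample_grads, NUM_GPUS)]
# ...and the per-GPU results are averaged again before the weight update.
averaged_across_gpus = fmean(per_gpu_means)

# The same quantity computed as one batch of 16 on a single device.
single_batch_mean = fmean(per_sample_grads)

print(averaged_across_gpus, single_batch_mean)
```

In this setup the update is the same as for a single batch of 16, so the effective batch size is batch_size times the number of GPUs, and the benefit is that the 16 samples are processed in parallel. If instead a single batch of 4 were dequeued and split across the GPUs, the effective batch size would stay at 4.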