Is it possible to use a smaller minibatch size and still fully utilize a large GPU? My GPU can handle massive minibatch sizes (5000+) very quickly, but with lower accuracy. I've read about people sending smaller minibatches to multiple GPUs in parallel, but I'm wondering whether it is possible to send several smaller minibatches to the same GPU in parallel.
So, for example, training 16 minibatches of 64 in parallel instead of training one minibatch of 1024 at a time. Or, alternatively, training 16 distinct models at the same time with small batches and aggregating the results.
If you're just using a standard library like TensorFlow, this shouldn't be the kind of problem you need to worry about: it will try to utilize your GPU as much as possible in the backend. If you get lower accuracy during training, look for the problem in your code and data before looking elsewhere.
FWIW, you could try to optimize your GPU performance a little more by following TensorFlow's guide on optimizing performance on a single GPU.
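One of the first things that guide looks at is whether the input pipeline keeps the GPU fed. A minimal sketch of that idea, with a placeholder model and random data (none of these names come from the question):

import tensorflow as tf

# Placeholder data and model; substitute your own.
x_train = tf.random.normal((10000, 32))
y_train = tf.random.uniform((10000,), maxval=10, dtype=tf.int32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# A tf.data pipeline that shuffles, batches, and prefetches so the GPU
# is not waiting on input between steps.
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10000)
           .batch(64)                    # a "small" minibatch, as in the question
           .prefetch(tf.data.AUTOTUNE))  # overlap input prep with GPU compute

model.fit(dataset, epochs=5)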
Related
I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around ~30 million samples, which add up to about 55 GB of raw TF proto files.
Training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (plus reduction servers) and a per-replica batch size of 256, performance is about 1 hour per epoch.
My dashboard reports that my GPUs are on <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be this slow.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be this slow, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit; did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)
def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)
return dataset.prefetch(tf.data.AUTOTUNE)
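Given the <1% GPU utilization, one quick check is whether the input pipeline alone can deliver batches fast enough. A minimal sketch, assuming dataset holds the value returned by the factory above (benchmark_dataset is just an illustrative helper, not part of the setup):

import time

def benchmark_dataset(dataset, num_batches=100):
    """Pull batches from the input pipeline without running the model,
    to see how fast the pipeline is on its own."""
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches / elapsed:.1f} batches/sec from the input pipeline alone")

# If this rate is far below what 128 A100 replicas could consume,
# the bottleneck is data reading/parsing, not the model.
benchmark_dataset(dataset)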
If I have a single GPU with 8GB RAM and I have a TensorFlow model (excluding training/validation data) that is 10GB, can TensorFlow train the model?
If yes, how does TensorFlow do this?
Notes:
I'm not looking for distributed GPU training. I want to know about the single-GPU case.
I'm not concerned about the training/validation data sizes.
No, you cannot train a model larger than your GPU's memory (there may be some workarounds I'm not aware of, but in general it is not advised). In fact you would need even more memory than the parameters alone, because your GPU has to keep the parameters along with their gradients for each step in order to do backpropagation.
Not to mention the smaller batch size this would require, since there is less space left for the data itself.
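As a rough back-of-the-envelope illustration of why (assuming float32 parameters and an Adam-style optimizer; the numbers are illustrative only, not from the question):

# Rough memory estimate for training a "10 GB" float32 model with Adam.
bytes_per_param = 4                          # float32
model_bytes = 10 * 1024**3                   # ~10 GB of parameters
num_params = model_bytes // bytes_per_param  # ~2.7 billion parameters

weights    = model_bytes                     # the parameters themselves
gradients  = model_bytes                     # one gradient per parameter
adam_state = 2 * model_bytes                 # Adam keeps two moment tensors per parameter

total_gb = (weights + gradients + adam_state) / 1024**3
print(f"~{total_gb:.0f} GB needed before activations or data")  # ~40 GB vs an 8 GB card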
I am running the GPT-2 code for the large model (774M). It is used to generate text samples through interactive_conditional_samples.py, link: here
I've given it an input file containing prompts which are automatically selected to generate output. This output is also automatically copied into a file. In short, I'm not training it; I'm using the model to generate text.
Also, I'm using a single GPU.
The problem I'm facing is that the code is not fully utilizing the GPU.
Using the nvidia-smi command, I was able to see the image below:
https://imgur.com/CqANNdB
It depends on your application. It is not unusual to have low GPU utilization when the batch_size is small. Try increasing the batch_size for more GPU utilization.
In your case, you have set batch_size=1 in your program. Increase the batch_size to a larger number and verify the GPU utilization.
Let me explain using MNIST size networks. They are tiny and it's hard to achieve high GPU (or CPU) efficiency for them. You will get higher computational efficiency with larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples total to get to target accuracy. So it's a trade-off. For tiny character models, the statistical efficiency drops off very quickly after a batch_size=100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
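To make the computational-efficiency half of that trade-off concrete, here is a small self-contained timing sketch; it uses a generic Keras MLP on random data, not the GPT-2 code from the question:

import time
import tensorflow as tf

# A small stand-in model on random data; the point is only to compare
# throughput (examples/sec) across batch sizes, not to train anything useful.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = tf.random.normal((8192, 784))
y = tf.random.uniform((8192,), maxval=10, dtype=tf.int32)

for batch_size in (1, 32, 256, 2048):
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:5d}: {x.shape[0] / elapsed:8,.0f} examples/sec")

# Larger batches usually raise examples/sec (computational efficiency),
# even though very large batches can need more total examples to reach
# the same accuracy during training (statistical efficiency).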
Hope this answers your question. Happy Learning.
I use multiple GPUs to train a model with PyTorch. One GPU uses more memory than the others, causing an "out-of-memory" error. Why would one GPU use more memory? Is it possible to make the usage more balanced? Are there other ways to reduce memory usage (e.g. deleting variables that will not be used anymore)? The batch size is already 1. Thanks.
DataParallel splits the batch and sends each split to a different GPU. Each GPU has a copy of the model, the forward pass is computed independently on each GPU, and then the outputs of every GPU are gathered back onto one GPU instead of the loss being computed independently on each GPU. That gathering onto a single device is why one GPU ends up using more memory than the others.
If you want to mitigate this issue, you can include the loss computation in the DataParallel module, as sketched below.
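A minimal sketch of that idea; ModelWithLoss, base_model, and the dummy tensors are illustrative stand-ins, not anything from the question:

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wraps a model and its loss so DataParallel computes the loss on
    each GPU and only the small per-replica losses are gathered."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Return a per-replica loss instead of the full outputs, so large
        # output tensors are never copied back to GPU 0.
        return self.criterion(outputs, targets)

base_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
wrapped = nn.DataParallel(ModelWithLoss(base_model, nn.CrossEntropyLoss()).cuda())

inputs = torch.randn(8, 128).cuda()
targets = torch.randint(0, 10, (8,)).cuda()

loss = wrapped(inputs, targets).mean()  # mean over the per-GPU losses
loss.backward()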
If doing this is still an issue, then you might want model parallelism instead of data parallelism: move different parts of your model to different GPUs using .cuda(gpu_id). This is useful when the weights of your model are pretty large.
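And a minimal sketch of the model-parallel alternative, splitting an illustrative two-part network across cuda:0 and cuda:1 (layer sizes are placeholders):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Splits a network across two GPUs: the first half lives on cuda:0,
    the second half on cuda:1, and activations move between them."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).cuda(0)
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).cuda(1)

    def forward(self, x):
        x = self.part1(x.cuda(0))
        return self.part2(x.cuda(1))  # move the activation to the second GPU

model = TwoGPUModel()
out = model(torch.randn(4, 1024))
loss = out.sum()
loss.backward()  # autograd handles the cross-GPU transfers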
I have been testing with a word2vec model. For some reason this word2vec model doesn't use the GPU much. My performance is roughly 1 epoch every 30 seconds on a dataset of ~2000 samples.
This doesn't seem normal. There are researchers that have gigabytes of training data, and I doubt they are waiting for months for the training to finish.
My GPU is a GTX 970. The memory usage is around 10% (note that I have a few other programs open too).
The problem might be the batches themselves, although I am not sure.
Basically, I run a method at the start of training that builds a list of batches, and then while training I iterate over the entries in that list.
This is roughly how I do this.
Is my approach wrong? (I would guess that it's not suitable for huge datasets)
batch_method(batch_size=x)  # I tested with different sizes, from 2 to 512; all seem to train fine.

for epoch in range(self.epochs_num):
    for batch in self.batch_list:
        for input, target in batch:
            ...
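For what it's worth, the innermost for input, target in batch loop is the kind of per-sample Python iteration that usually keeps a GPU idle. A hedged sketch of the batched alternative, written PyTorch-style since the question doesn't say which framework is in use; everything here (the model, optimizer, vocab_size, and batch_list) is a placeholder stand-in:

import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs on its own; substitute your own
# word2vec model, data, and hyperparameters.
vocab_size, embed_dim = 1000, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size)).to(device)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# batch_list plays the role of self.batch_list: a list of batches, each
# batch being a list of (input_word_id, target_word_id) pairs.
batch_list = [[(i % vocab_size, (i + 1) % vocab_size) for i in range(512)]
              for _ in range(10)]

for epoch in range(5):
    for batch in batch_list:
        # Convert the whole batch to tensors once, instead of looping
        # over individual (input, target) pairs in Python.
        inputs = torch.tensor([p[0] for p in batch], device=device)
        targets = torch.tensor([p[1] for p in batch], device=device)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)  # one forward pass per batch
        loss.backward()
        optimizer.step()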