Slow training of a BERT model with Hugging Face - Python

I am training a binary classifier using a BERT model implemented in the Hugging Face library:
training_args = TrainingArguments(
    "deleted_tweets_trainer",
    num_train_epochs=1,
    # logging_steps=100,
    evaluation_strategy='steps',
    remove_unused_columns=True,
)
I am using a Colab TPU, but training still takes a very long time: about 38 hours for roughly 60k cleaned tweets.
Is there any way to optimise the training?

You are currently evaluating every 500 steps and have a training and eval batch size of 8.
Depending on your current memory consumption, you can increase the batch sizes (the eval batch size much more than the train one, since training consumes more memory):
per_device_train_batch_size
per_device_eval_batch_size
If it fits your use case, you can also increase the number of steps between evaluations:
eval_steps
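
For example, a sketch of what the adjusted arguments could look like (the batch sizes and eval_steps below are illustrative placeholders to tune to your available memory, not values from the question):

from transformers import TrainingArguments

training_args = TrainingArguments(
    "deleted_tweets_trainer",
    num_train_epochs=1,
    evaluation_strategy='steps',
    eval_steps=2000,                  # evaluate less often than the default 500 steps
    per_device_train_batch_size=32,   # up from the default of 8, if memory allows
    per_device_eval_batch_size=64,    # eval can usually take a larger batch than training
    remove_unused_columns=True,
)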

Related

Trouble Understanding ResNet Implementation

I'm having trouble understanding and replicating the original implementation of ResNet on the CIFAR-10 dataset, as described in the paper "Deep Residual Learning for Image Recognition". Specifically, I have a few questions about the following passage:
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
How can I implement the training and learning rate scheduling in PyTorch? Since 45000 isn't divisible by 128, should I drop the last 72 images every epoch? Also, since the 32k, 48k, and 64k milestones don't fall on a whole number of epochs, should I round them to the nearest epochs? Or is there a way to change the learning rate and terminate training in the middle of an epoch?
If anyone could point me in the right direction, I would greatly appreciate it. I'm new to deep learning, so thank you for your help and understanding.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
When running two GPUs on the same machine, the batch is split between the GPUs, as you've said. The gradients produced by the two GPUs are transferred, averaged, and applied on one of the GPUs, or possibly on the CPU.
Here's more info: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
I encourage everyone to think in terms of iterations rather than epochs. Each iteration equates to a single weight update, which is much more relevant to model convergence than an epoch is. If you think in epochs, you have to adjust the number of epochs of training every time you try a different batch size; this isn't the case if you think in terms of iterations (aka training steps, or weight updates). But your formula for computing the number of epochs is correct.
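For reference, the conversion works out like this (45k is the paper's train split, 128 the mini-batch size):

train_size = 45_000    # CIFAR-10 train split from the paper's 45k/5k split
batch_size = 128
iterations = 64_000

iters_per_epoch = train_size / batch_size   # ~351.6 weight updates per epoch
epochs = iterations / iters_per_epoch       # ~182.04 epochs
print(round(iters_per_epoch, 1), round(epochs, 2))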
How can I implement the training and learning rate scheduling in PyTorch?
I think the PyTorch Lightning forum post below answers the question; it looks like this was added to Lightning (sorry for a non-authoritative answer here, I'm more familiar with TensorFlow):
https://forums.pytorchlightning.ai/t/training-for-a-set-number-of-iterations-without-setting-epochs/178
https://github.com/Lightning-AI/lightning/pull/5687
You can also just use epochs, of course; adjusting the learning rate doesn't have to happen at exactly the same point as the paper describes, and getting as close as you reasonably can, within rounding error, will work just fine.
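If you want to stay in plain PyTorch rather than Lightning, one option is to drive the schedule by iteration count using MultiStepLR and step the scheduler once per weight update. A minimal sketch, assuming model, train_loader, and criterion already exist:

import torch
from torch.optim.lr_scheduler import MultiStepLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Milestones are in iterations, because the scheduler is stepped per weight update.
scheduler = MultiStepLR(optimizer, milestones=[32_000, 48_000], gamma=0.1)

iteration, max_iterations = 0, 64_000
while iteration < max_iterations:
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()                 # per-iteration step, so 32k/48k land exactly
        iteration += 1
        if iteration >= max_iterations:  # stop mid-epoch at 64k iterations
            break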

Are deep and wide autoencoder trainings just slow or is there something wrong here?

I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around 30 million samples, which add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+reduction servers) and 256 batch size per replica, the performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think it's perhaps just supposed to be this slow.
Q: I understand this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be this slow, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit, did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
file_pattern,
tfxio.TensorFlowDatasetOptions(
batch_size=batch_size,
drop_final_batch=True,
num_epochs=1,
shuffle=True,
shuffle_buffer_size=1000000,
prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
reader_num_threads=tf.data.experimental.AUTOTUNE,
parser_num_threads=tf.data.experimental.AUTOTUNE,
sloppy_ordering=True),
schema=tf_transform_output.transformed_metadata.schema)
def _apply_preprocessing(x):
# preprocessing_model is a just the input layer + one hot encode
# tested to be slow with or without this.
preprocessed_features = preprocessing_model(x)
return preprocessed_features, preprocessed_features
dataset = dataset.map(_apply_preprocessing,
num_parallel_calls=tf.data.AUTOTUNE,
deterministic=False)
return dataset.prefetch(tf.data.AUTOTUNE)
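
One quick check for the <1% GPU utilization / ~100% memory pattern is to time the input pipeline by itself: if iterating batches without running the model is already slow, the bottleneck is on the tf.data side rather than in the autoencoder. A rough sketch (make_dataset here stands for the function containing the code above):

import time
import tensorflow as tf

def benchmark(dataset, num_batches=200):
    # Pull batches from the pipeline alone, with no model attached,
    # to see how many batches per second the input side can deliver.
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches / elapsed:.1f} batches/sec from the input pipeline")

# benchmark(make_dataset(...))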

Keras model taking too long to train

So I have the following model for sentiment analysis (using pre-trained word embeddings):
As visible, I have a pre-trained embedding matrix and only about 500k trainable parameters, so why does it take an eternity to train this model? The batch size is 128 and the number of epochs is 25, and the ETA for the first epoch is about 10 minutes; I haven't even completed it.
Just to mention, I am not using CUDA or anything; I don't think I have a GPU-enabled TensorFlow build. I'm willing to do anything to increase the speed. I have TensorFlow 2.1.0.
And here's the answer: "I am not using CUDA or anything." Training on the CPU is much slower than on a GPU. If you don't have a high-performance video card, you can use services such as Google Colab or Kaggle.
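A quick way to check whether your TensorFlow build actually sees a GPU (if the list prints empty, training is running on the CPU):

import tensorflow as tf

# Prints an empty list if TensorFlow can only see the CPU.
print(tf.config.list_physical_devices("GPU"))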

Keras + TensorFlow Model

I'm currently creating a model, and while creating it I came up with some questions. Does training the same model with the same data multiple times lead to better precision on those objects, since you are training it every time? And what could be the issue when an object sometimes gets 90% precision, but when I re-run it, it gets lower precision or doesn't even predict the right object? Is it because of TensorFlow running on the GPU?
I will guess that you are doing image recognition and that you want to identify images (objects) using a neural network made with Keras. You should train it once, but during training you will do several epochs, meaning the algorithm adapts the weights in several rounds (epochs). In each round it goes over all training images. Once trained, you can use the model to identify images/objects.
You can evaluate the accuracy of your trained model over the same training set, but it is better to use a different set (see train_test_split from sklearn for instance).
Training is a stochastic process, meaning that every time you train your network the end result will be different. Hence, you will get different accuracies. The stochasticity comes, for instance, from different initial weights or from using stochastic gradient descent methods.
The question does not appear to have anything to do with Keras or TensorFlow, but with a basic understanding of how neural networks work. There is no connection to running TensorFlow on the GPU. You will also not get better precision by training with the same objects. If you train your model on a dataset for a very long time (many epochs), you might run into overfitting. Then the accuracy of your model on this training dataset will be very high, but the model will have low accuracy on other datasets.
A common technique is to split your data into training and validation datasets, then train your model using EarlyStopping. This trains on the training dataset, calculates the loss against the validation dataset, and keeps training until no further improvement is seen. You can set a patience parameter to wait for X epochs without improvement before stopping training (and optionally save the best model).
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
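A minimal sketch of that setup (the patience value, epoch count, and the model/data variables are placeholders):

from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 5 epochs and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,            # an upper bound; EarlyStopping usually ends sooner
          callbacks=[early_stop])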
Another trick is image augmentation with ImageDataGenerator, which will generate synthetic data for you (rotations, shifts, mirror images, brightness adjustments, noise, etc.). This can effectively increase the amount of data you have to train with, thus reducing overfitting.
https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
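For illustration, a generator with a few of those transformations might look like this (the parameter values are just examples, not recommendations):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each batch drawn from the generator is randomly rotated, shifted,
# flipped, and brightness-adjusted on the fly.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True,
                             brightness_range=(0.8, 1.2))

# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=...)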

How can one quickly verify that a CNN actually learns?

I tried to build a CNN from scratch based on the LeNet architecture from this article.
I implemented backprop and am now trying to train it on the MNIST dataset using SGD with a batch size of 16. I want a quick way to verify that learning is going well and that there are no bugs. For this, I visualize the loss for every 100th batch, but it takes too long on my laptop and I don't see an overall trend (the loss fluctuates downwards but occasionally jumps back up, so I am not sure). Could anyone suggest a proven way to check that the CNN works well without waiting many hours of training?
MNIST consists of 60k training images of 28×28 pixels, so training a CNN with a batch size of 16 means 3,750 forward passes per epoch.
Also take into account that you are using LeNet, which is not a very deep model.
I would suggest you do the following:
Check your PC specifications, such as RAM, processor, GPU, etc.
Try to train your model on a cloud service such as Google Colab, Kaggle, or others.
Try a batch size of 128 or 64.
Try to normalize your image dataset before training (see the sketch below).
Keep in mind that training speed also depends on the machine learning framework you are using, such as TensorFlow, PyTorch, etc.
I hope this helps.
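
To illustrate the normalization suggestion, a minimal sketch for MNIST (using Keras only to fetch the dataset; your from-scratch CNN would consume the arrays directly):

import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Pixels arrive as uint8 values in [0, 255]; scaling them to [0, 1] floats
# usually makes SGD converge faster and more stably.
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0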
