I have been testing with a word2vec model. For some reason it doesn't use the GPU much. My performance is roughly 1 epoch every 30 seconds on a dataset of ~2000 samples.
This doesn't seem normal. Researchers train on gigabytes of data, and I doubt they wait months for training to finish.
My GPU is a GTX 970. Its memory usage is around 10% (note that I have a few other programs open too).
The problem might be the batches themselves, although I am not sure.
Basically I run a batching method once at the start of training, and then while training I iterate over the entries of the resulting list.
This is roughly how I do it.
Is my approach wrong? (I would guess that it's not suitable for huge datasets)
self.batch_list = batch_method(batch_size=x)  # I tested with different sizes, from 2 to 512; all seem to train fine.
for epoch in range(self.epochs_num):
    for batch in self.batch_list:
        for input, target in batch:
            ...
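For comparison, here is a minimal sketch (not my actual code) of the same loop structure written with a PyTorch DataLoader, where each batch is moved to the GPU and processed as one tensor rather than sample by sample; the dataset, dimensions, and model below are made-up placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up toy data standing in for (input, target) word-index pairs.
inputs = torch.randint(0, 1000, (2000,))
targets = torch.randint(0, 1000, (2000,))
dataset = TensorDataset(inputs, targets)

# The DataLoader batches and shuffles; pin_memory speeds up host-to-GPU copies.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    pin_memory=torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 100),
    torch.nn.Linear(100, 1000),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for batch_inputs, batch_targets in loader:
        # Move the whole batch to the GPU at once instead of sample by sample.
        batch_inputs = batch_inputs.to(device, non_blocking=True)
        batch_targets = batch_targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(batch_inputs), batch_targets)
        loss.backward()
        optimizer.step()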
We are developing a prediction model using DeepChem's GCNModel.
Model training and performance verification proceeded without problems, but we found that prediction takes a lot of time.
We are trying to run prediction on a total of 1 million data points, and the parameters we use are as follows.
model = GCNModel(n_tasks=1, mode='regression', number_atom_features=32, learning_rate=0.0001, dropout=0.2, batch_size=32, device=device, model_dir=model_path)
I changed the batch size to try to improve performance, and found that prediction was faster when the batch size was decreased than when it was increased.
All models had the same GPU memory usage.
Common sense suggests that the larger the batch size, the faster prediction should be. Can you tell me why it works in reverse here?
We would be grateful if you could also let us know how we can further improve the prediction time.
Let's clarify some definitions first.
Epoch
The number of times your model and learning algorithm walk through your entire dataset (complete passes).
Batch size
The number of samples (individual rows of your training data) processed before the model's internal parameters are updated. So your batch size is somewhere between 1 and len(training_data).
Generally, a larger batch size gives a more accurate (less noisy) estimate of the gradient over the training data, at the cost of more work per update:
Epoch ↑ Batch Size ↑ Accuracy ↑ Speed ↓
So the short answer to the question is: a larger batch size takes more memory and more computation per update, and therefore takes longer to learn.
Here is a link with more details: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
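To make these definitions concrete, here is a tiny worked example (the numbers are arbitrary):

# How many weight updates one epoch produces for a given batch size.
n_samples = 2000
batch_size = 32
epochs = 10

updates_per_epoch = -(-n_samples // batch_size)  # ceiling division -> 63
total_updates = updates_per_epoch * epochs       # 630 updates overall
print(updates_per_epoch, total_updates)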
There are two components that determine the speed:
Your batch size and model size
Your CPU/GPU power in spawning and processing batches
The two need to be balanced. For example, if your model finishes predicting the current batch but the next batch has not yet been spawned, you will notice a brief drop in GPU utilization. Sadly there are no built-in metrics that directly report this balance - try using time.time() to benchmark your model's prediction as well as the dataloader speed.
However, I don't think that's worth the effort; you can simply keep decreasing the batch size until there is no further improvement - that's where to stop.
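As a rough sketch of that time.time() benchmarking idea (the two callables are placeholders for your own dataloader and your model's prediction call; nothing here is specific to DeepChem):

import time

def benchmark(next_batch, predict, n_batches=50):
    # next_batch() should return one batch of inputs; predict(batch) should
    # run the model on it. Both are stand-ins for your own code.
    load_time, predict_time = 0.0, 0.0
    for _ in range(n_batches):
        t0 = time.time()
        batch = next_batch()
        t1 = time.time()
        predict(batch)
        t2 = time.time()
        load_time += t1 - t0
        predict_time += t2 - t1
    print(f"data loading: {load_time:.2f}s, prediction: {predict_time:.2f}s")

If loading dominates, the GPU is being starved by the dataloader; if prediction dominates, batch size and model size are the knobs to turn.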
I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around 30 million samples, which add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+ reduction servers) and a batch size of 256 per replica, performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization. This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be this slow.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be this slow, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit, did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)

def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encode.
    # Tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)

return dataset.prefetch(tf.data.AUTOTUNE)
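One additional check worth sketching here (not something from my original setup, just an illustrative debugging step) is to time the input pipeline on its own, with no model in the loop, and compare its batches per second against what model.fit achieves:

import time
import tensorflow as tf

def time_input_pipeline(dataset: tf.data.Dataset, n_batches: int = 200):
    # Iterate the dataset alone; if this is far slower than the GPU can
    # consume batches, the bottleneck is the input pipeline, not the model.
    start = time.time()
    for _ in dataset.take(n_batches):
        pass
    elapsed = time.time() - start
    print(f"{n_batches} batches in {elapsed:.1f}s "
          f"({n_batches / elapsed:.1f} batches/s)")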
Is it possible to use a smaller minibatch size and still fully utilize a large GPU? My GPU can handle massive minibatch sizes (5000+) very quickly, but with lower accuracy. I've read about people sending smaller minibatches to multiple GPUs in parallel, but I'm wondering if it is possible to send several smaller minibatches to the same GPU in parallel.
So, for example, training 16 minibatches of 64 in parallel instead of one minibatch of 1024 at a time. Or, alternatively, training 16 distinct models at the same time with small batches and aggregating the results.
If you are just using a standard library like TensorFlow, this shouldn't be the type of problem you need to consider; the backend will already try to utilize your GPU as much as possible. If you get lower accuracy during training, try to find the problem in your code and data before looking at other approaches.
FWIW, you could try to optimize your GPU performance a little more by following TensorFlow's guide on optimizing performance on a single GPU.
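As a small illustration of what that guide recommends (assuming TF 2.x; the dataset and model names below are placeholders): keep the input pipeline ahead of the GPU with prefetch, and profile a few training steps to see where the time actually goes.

import tensorflow as tf

def prepare(dataset: tf.data.Dataset, batch_size: int) -> tf.data.Dataset:
    # Overlap input preparation with training so the GPU is not left waiting.
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Profile training steps 5-15 of the first epoch and inspect them in TensorBoard.
profiler_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(5, 15))
# model.fit(prepare(train_ds, 1024), epochs=10, callbacks=[profiler_cb])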
I am running the GPT-2 code for the large model (774M). I am using it to generate text samples through interactive_conditional_samples.py (link: here).
So I've given it an input file containing prompts, which are automatically selected to generate output. The output is also automatically copied into a file. In short, I'm not training the model; I'm using it to generate text.
Also, I'm using a single GPU.
The problem I'm facing is that the code is not utilizing the GPU fully.
Using the nvidia-smi command, I was able to see the output below:
https://imgur.com/CqANNdB
It depends on your application. It is not unusual to have low GPU utilization when the batch_size is small. Try increasing the batch_size for more GPU utilization.
In your case, you have set batch_size=1 in your program. Increase the batch_size to a larger number and verify the GPU utilization.
Let me explain using MNIST-sized networks. They are tiny, and it's hard to achieve high GPU (or CPU) efficiency with them. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models, statistical efficiency drops off very quickly above batch_size=100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
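As a toy illustration of that computational-efficiency point (this is not the GPT-2 code, just a micro-benchmark comparing one batched call against many single-sample calls):

import time
import tensorflow as tf

w = tf.random.normal([1024, 1024])
x = tf.random.normal([64, 1024])

# Warm-up so one-time kernel setup doesn't skew the numbers.
_ = tf.matmul(x, w).numpy()

t0 = time.time()
for i in range(64):                              # 64 calls, batch size 1
    _ = tf.matmul(x[i:i + 1], w).numpy()
t1 = time.time()
_ = tf.matmul(x, w).numpy()                      # 1 call, batch size 64
t2 = time.time()
print(f"one sample at a time: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")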
Hope this answers your question. Happy Learning.
I need to understand how the number of epochs/iterations affects the training of a deep learning model.
I am training an NER model with spaCy 2.1.3. My documents are very long, so I cannot train on more than 200 documents per iteration. So basically I do:
documents 0 to 200 -> 20 epochs
documents 201 to 400 -> 20 epochs
and so on.
Maybe it is a stupid question, but should the number of epochs for the next batches be the same as for the first 0-200? That is, if I chose 20 epochs, must I train the following batches with 20 epochs too?
Thanks
"I need to understand how the epochs/iterations affect the training of a deep learning model" - nobody is sure about that one. You may overfit after a certain number of epochs, so you should check your accuracy (or other metrics) on a validation dataset. Techniques like early stopping are often employed to combat this.
"so I cannot train more than 200 documents per iteration" - do you mean a batch of examples? If so, it should be smaller (200 is too much information in a single update and too costly). A batch size of 32 is commonly used for textual data, sometimes up to 64. Batch sizes are often made smaller the more epochs you train, in order to settle into a minimum better (or to escape saddle points).
Furthermore, you should use Python generators so you can iterate over data larger than your RAM capacity.
Last but not least, each example is usually trained on once per epoch. Different approaches (say oversampling or undersampling) are sometimes used, but usually when your class distribution is imbalanced (say 10% of examples belong to class0 and 90% to class1) or the neural network has problems with a specific class (though that requires a more carefully thought-out approach).
The common practice is to train on each batch of documents for only 1 epoch. Training on the same subset of data for 20 epochs can lead to overfitting, which harms your model's performance.
To better understand how the number of epochs trained on each batch affects your performance, you can do a grid search and compare the results.
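A minimal skeleton of such a grid search might look like this; train_on_chunks and evaluate are hypothetical placeholders for your own spaCy training loop and validation scoring:

def grid_search(train_on_chunks, evaluate, epoch_options=(1, 5, 10, 20)):
    # Train once per candidate epoch count and score each model on held-out data.
    scores = {}
    for n_epochs in epoch_options:
        model = train_on_chunks(epochs_per_chunk=n_epochs)  # hypothetical trainer
        scores[n_epochs] = evaluate(model)                  # e.g. F1 on a dev set
    best = max(scores, key=scores.get)
    return best, scores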