I found it surprising that I could not find any resources online on how to dynamically adapt the GPU batch size without halting training.
The idea is the following:
1) Have a training script that is (almost) agnostic to the GPU in use. The batch size will adjust dynamically without user intervention or the need for tuning.
2) Still be able to specify the desired training batch size, even if it is too big to fit in the biggest known GPU.
For instance, let's say I want to train a model using a batch size of 4096 images, each image 1024x1024. Let's also say that I have access to a server with different NVIDIA GPUs, but I don't know which one will be assigned to me in advance. (Or that everybody wants to use the biggest GPU and I am left waiting a long time before it is my turn.)
I want my training script to find the max batch size (let's say it is 32 images per GPU batch), and only update the optimizer when all 4096 images have been processed (one training batch = 128 GPU batches).
There are different ways of solving this problem. But if specifying a GPU that can do the job, or using multiple GPUs, is not an option, then it is handy to dynamically adapt the GPU batch size.
I prepared this repo with an illustrative training example in PyTorch (it should work similarly in TensorFlow).
In the code below, the try/except is used to try different GPU batch sizes without halting training. When the batch size becomes too large, it is downsized and the adaptation is turned off. Please check the repo for the implementation details and possible bug fixes.
It also implements a technique called batch spoofing, which performs a number of forward passes before doing the backpropagation. In PyTorch it only requires relocating optimizer.zero_grad() and optimizer.step() so that they run once per training batch instead of once per GPU batch.
import torch
import torchvision
import torch.optim as optim
import torch.nn as nn

# Example of how to use it with PyTorch
if __name__ == "__main__":

    # #############################################################
    # 1) Initialize the dataset, model, optimizer and loss as usual.
    # Initialize a fake dataset
    trainset = torchvision.datasets.FakeData(size=1_000_000,
                                             image_size=(3, 224, 224),
                                             num_classes=1000,
                                             transform=torchvision.transforms.ToTensor())

    # initialize the model, loss and SGD-based optimizer
    resnet = torchvision.models.resnet152(pretrained=True,
                                          progress=True).cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(resnet.parameters(), lr=0.01)

    continue_training = True  # criterion to stop the training

    # #############################################################
    # 2) Set parameters for the adaptive batch size
    adapt = True  # while this is True, the algorithm will perform batch adaptation
    gpu_batch_size = 2  # initial GPU batch size; it can be very small
    train_batch_size = 2048  # the desired training batch size

    # Modified training loop to allow for adaptive batch size
    while continue_training:

        # #############################################################
        # 3) Initialize dataloader and batch spoofing parameter
        # The dataloader has to be reinitialized for each new batch size.
        trainloader = torch.utils.data.DataLoader(trainset,
                                                  batch_size=int(gpu_batch_size),
                                                  shuffle=True)

        # Number of repetitions for batch spoofing
        repeat = max(1, int(train_batch_size / gpu_batch_size))

        try:  # This makes sure that training is not halted when the batch size is too large
            # #############################################################
            # 4) Epoch loop with batch spoofing
            optimizer.zero_grad()  # done before the loop because of batch spoofing

            for i, (x, y) in enumerate(trainloader):
                x, y = x.cuda(), y.cuda().long()
                y_pred = resnet(x)
                loss = criterion(y_pred, y)
                loss.backward()

                # batch spoofing: gradients accumulate across GPU batches and the
                # optimizer only steps once per full training batch
                if (i + 1) % repeat == 0:
                    optimizer.step()
                    optimizer.zero_grad()

                # #############################################################
                # 5) Adapt batch size while no RuntimeError is raised.
                # Increase the batch size and get out of the loop
                if adapt:
                    gpu_batch_size *= 2
                    break

                # Stopping criterion for training
                if i > 100:
                    continue_training = False

        # #############################################################
        # 6) After the largest batch size is found, training proceeds with that fixed batch size.
        # CUDA out of memory raises a RuntimeError the moment the batch size becomes too large.
        except RuntimeError as run_error:
            gpu_batch_size /= 2  # fall back to the biggest batch size that works in memory
            adapt = False  # turn off the batch adaptation
            torch.cuda.empty_cache()  # release the memory held by the failed attempt

            # Number of repetitions for batch spoofing
            repeat = max(1, int(train_batch_size / gpu_batch_size))

            # Manual check whether the RuntimeError was caused by CUDA running out of memory
            # or by something else.
            print(f"---\nRuntimeError: \n{run_error}\n---\n Is it a CUDA error?")
If you have code that can do something similar in TensorFlow, Caffe, or other frameworks, please share!
There is a very similar question that uses a random sampler for the job.
I will just add another option: DataLoader has a collate_fn that you could use for altering the batch size.
collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
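For reference, here is a minimal sketch of what such a collate_fn could look like; the toy dataset and the plain stacking logic are just placeholders, not from the question.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset of (image, label) pairs.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

def my_collate(samples):
    # `samples` is a list of (image, label) tuples whose length equals the
    # DataLoader's batch_size; here we simply stack them into batch tensors,
    # but this hook is also where you could subsample, pad, or regroup them.
    images = torch.stack([s[0] for s in samples])
    labels = torch.stack([s[1] for s in samples])
    return images, labels

loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=my_collate)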
Related
During training I can see that the GPUs are both active and processing the data. Then, once training is complete, I see the GPU activity drop to 0 and the CPU usage drops a little too. Could this have something to do with the way I am generating training data?
I have custom data generators feeding the model:
train_df, val_df = np.split(dataframe, [int(.8 * len(dataframe.index))])
trainbatches = math.ceil(len(train_df.index) / batchsize)
valbatches = math.ceil(len(val_df.index) / 1024)

train_gen1 = MixedGenerator(train_df, batchsize, scaler)  # tensor Sequence
enqueued_train_gen = queue_generator(train_gen1)
val_gen1 = MixedGenerator(val_df, 1024, scaler)
enqueued_val_gen = queue_generator(val_gen1)

strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
print('numdata shape', (numdata.shape[1],))

with strategy.scope():
    model = build_mixedinput_model(ir_imgs_shape=(120, 160, 1),
                                   num_data_shape=(numdata.shape[1],),
                                   opt=optimizer,
                                   lossmet=lossmet,
                                   dropout=dropout)

history = model.fit(enqueued_train_gen,
                    steps_per_epoch=trainbatches,
                    validation_data=enqueued_val_gen,
                    validation_steps=valbatches,
                    verbose=1,
                    epochs=epochs,
                    callbacks=clbks)
The training generator feeds batches of a predetermined size, but the validation generator feeds batches of size 1024. This is because I am training on small batches but I want the validation to go faster. My understanding is that this should work, since the batch size shouldn't matter for validation.
Is this normal behavior, and is there a better practice I could be using? The validation still takes place, though it takes some time and, as I said, does not make use of the GPUs.
Thanks in advance
UPDATE:
This only happens in the first epoch of each run; the subsequent epochs do validation extremely quickly. I am still curious why the GPUs and CPU are not engaged during the validation process, though.
The first epoch takes more time because the code is being compiled into a graph. This includes the model function, the loss function, validation metrics, and others. The compilation process can take some time, but it is only done once. That is why you see faster results after the first epoch.
Here are the lines in the tensorflow.keras source that compile the training step and the validation step. Those lines essentially call tf.function on the train and validation steps.
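As a rough illustration of that compile-then-reuse behavior (the toy train_step below is made up, not taken from the question's model), the first call to a tf.function is traced and compiled, while later calls reuse the graph:

import time
import tensorflow as tf

@tf.function
def train_step(x):
    # Stand-in for a real training step: some arbitrary tensor math.
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((1024, 1024))

start = time.time()
train_step(x)  # first call: traced and compiled into a graph
print("first call :", time.time() - start)

start = time.time()
train_step(x)  # subsequent calls reuse the compiled graph
print("second call:", time.time() - start)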
Why does my model with a batch norm layer behave differently on every run, when the model without the batch norm layer performs the same? In my model, the random seed has been set by:
np.random.seed(args.seed)
torch.manual_seed(args.seed)
random.seed(args.seed)
torch.cuda.manual_seed_all(args.seed)
os.environ['PYTHONHASHSEED'] = str(args.seed)
After removing the batch norm layer and keeping the other settings the same, my model produces the same results across runs.
It appears that you need to also include the line:
torch.backends.cudnn.deterministic = True
In order to force the CUDA portion of the batch norm algorithm in PyTorch to be deterministic. Here are more details.
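Putting it together with the seeding from the question, a typical setup looks roughly like this (the helper name is mine, and torch.backends.cudnn.benchmark = False is an extra, commonly recommended flag):

import os
import random
import numpy as np
import torch

def set_deterministic(seed):
    # Seed every RNG the training run touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # Force cuDNN (used by batch norm and convolutions) to pick
    # deterministic kernels instead of the fastest available ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False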
I am building a vanilla DQN model to play the OpenAI gym Cartpole game.
However, in the training step where I feed in the state as input and the target Q values as the labels: if I use model.fit(x=states, y=target_q), it works fine and the agent can eventually play the game well; but if I use model.train_on_batch(x=states, y=target_q), the loss won't decrease and the model will not play the game any better than a random policy.
I wonder what the difference is between fit and train_on_batch? To my understanding, fit calls train_on_batch with a batch size of 32 under the hood, which should make no difference, since specifying the batch size to equal the actual data size I feed in makes no difference.
The full code is here if more contextual information is needed to answer this question: https://github.com/ultronify/cartpole-tf
model.fit will train 1 or more epochs. That means it will train multiple batches. model.train_on_batch, as the name implies, trains only one batch.
To give a concrete example, imagine you are training a model on 10 images. Let's say your batch size is 2. model.fit will train on all 10 images, so it will update the gradients 5 times. (You can specify multiple epochs, so it iterates over your dataset multiple times.) model.train_on_batch will perform one update of the gradients, as you only give the model one batch. You would give model.train_on_batch two images if your batch size is 2.
And if we assume that model.fit calls model.train_on_batch under the hood (though I don't think it does), then model.train_on_batch would be called multiple times, likely in a loop. Here's pseudocode to explain.
def fit(x, y, batch_size, epochs=1):
    for epoch in range(epochs):
        for batch_x, batch_y in batch(x, y, batch_size):
            model.train_on_batch(batch_x, batch_y)
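For the 10-image example above, the two calls would look roughly like this (assuming x and y are NumPy arrays with 10 samples each):

model.fit(x, y, batch_size=2, epochs=1)  # iterates over all 10 images: 5 gradient updates
model.train_on_batch(x[:2], y[:2])       # exactly one gradient update on one batch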
I'm training a simple fully convolutional network with batch norm. I saved a checkpoint right after initialization. Then I restored it and ran the training again (with the same hyperparameters). However, I got different results from the two training procedures. All my seeds (Python, NumPy, and TF) were set identically at the beginning of the two runs.
What could possibly be the reason for the mismatch?
Setting seeds in the file header will lead to different outcomes, since initialization will consume some of the random values before you get to training.
So you should set the seeds after performing initialization. The initialization can use any seed you want, including another fixed seed, but then you have to reset it again for training.
Here is some very high-level pseudo-code, which assumes you have functions to build, initialize, and train the model, and also functions to save and load the checkpoints.
def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
# Make a checkpoint.
set_seeds(INIT_SEED)
model = build_model()
model = initialize_model(model)
save_checkpoint(model)
# Or load a checkpoint.
model = load_checkpoint()
# At this point seeds need to be identical.
# So re-fix the seeds for training, then proceed.
set_seeds(TRAIN_SEED)
trained = train_model(model)
I am very new to ML with big data, and I have played with the generic Keras convolutional examples for dog/cat classification before; however, when applying a similar approach to my set of images, I run into memory issues.
My dataset consists of very long images that are 10048 x 1687 pixels in size. To circumvent the memory issues, I am using a batch size of 1, feeding one image at a time to the model.
The model has two convolutional layers, each followed by max-pooling, which together make the flattened layer roughly 290,000 inputs right before the fully-connected layer.
Immediately after running, however, memory usage chokes at its limit (8 GB).
So my questions are the following:
1) What is the best approach to process computations of such size in Python locally (no cloud utilization)? Are there additional Python libraries that I need to utilize?
Check out what yield does in Python and the idea of generators. You do not need to load all of your data at the beginning. You should make your batch_size just small enough that you do not get memory errors.
Your generator can look like this:
def generator(fileobj, labels, batch_size, memory_one_pic=1024):
    start = 0
    end = start + batch_size
    while True:
        X_batch = fileobj.read(memory_one_pic * batch_size)
        y_batch = labels[start:end]
        start += batch_size
        end += batch_size
        if not X_batch:
            break
        if start >= amount_of_datasets:  # amount_of_datasets: total number of samples, defined elsewhere
            start = 0
            end = batch_size
        yield (X_batch, y_batch)
...later when you already have your architecture ready...
train_generator = generator(open('traindata.csv','rb'), labels, batch_size)
train_steps = amount_of_datasets//batch_size + 1
model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs)
You should also read about batch normalization, which basically helps the network learn faster and with better accuracy.
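If it helps, here is a minimal sketch of where a BatchNormalization layer usually goes; this toy architecture is just for illustration, not the question's model.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(224, 224, 3)),
    layers.BatchNormalization(),   # normalize activations before the nonlinearity
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])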
While using fit_generator(), you should also set the max_q_size parameter. It is set to 10 by default, which means you're loading 10 batches while using only 1 (since fit_generator() was designed to stream data from outside sources that can be delayed, like a network, not to save memory). I'd recommend setting max_q_size=1 for your purposes.
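Building on the fit_generator() call from the generator answer above, that might look like this (note that older Keras versions call the argument max_q_size, while newer ones renamed it to max_queue_size):

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs,
                    max_q_size=1)  # keep only one prefetched batch in the queue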