Validation loss not moving with MLP in Regression

The input features are plain floating-point numbers, for example:
tensor([0.2153, 0.2190, 0.0685, 0.2127, 0.2145, 0.1260, 0.1480, 0.1483, 0.1489,
0.1400, 0.1906, 0.1876, 0.1900, 0.1925, 0.0149, 0.1857, 0.1871, 0.2715,
0.1887, 0.1804, 0.1656, 0.1665, 0.1137, 0.1668, 0.1168, 0.0278, 0.1170,
0.1189, 0.1163, 0.2337, 0.2319, 0.2315, 0.2325, 0.0519, 0.0594, 0.0603,
0.0586, 0.0067, 0.0624, 0.2691, 0.0617, 0.2790, 0.2805, 0.2848, 0.2454,
0.1268, 0.2483, 0.2454, 0.2475], device='cuda:0')
The expected output is a single real number, e.g.
tensor(-34.8500, device='cuda:0')
Full code on https://www.kaggle.com/alvations/pytorch-mlp-regression
I've tried creating a simple 2 layer network with:
class MLP(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(MLP, self).__init__()
        self.linear = nn.Linear(input_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, output_size)

    def forward(self, inputs, hidden=None, dropout=0.5):
        inputs = F.dropout(inputs, dropout) # Drop-in.
        # First Layer.
        output = F.relu(self.linear(inputs))
        # Matrix manipulation magic.
        batch_size, sequence_len, hidden_size = output.shape
        # Technically, linear layer takes a 2-D matrix as input, so more manipulation...
        output = output.contiguous().view(batch_size * sequence_len, hidden_size)
        # Apply dropout.
        output = F.dropout(output, dropout)
        # Put it through the classifier
        # And reshape it to [batch_size x sequence_len x vocab_size]
        output = self.classifier(output).view(batch_size, sequence_len, -1)
        return output
And training as such:
# Training routine.
def train(num_epochs, dataloader, valid_dataset, model, criterion, optimizer):
    losses = []
    valid_losses = []
    learning_rates = []
    plt.ion()
    x_valid, y_valid = valid_dataset
    for _e in range(num_epochs):
        for batch in tqdm(dataloader):
            # Zero gradient.
            optimizer.zero_grad()
            #print(batch)
            this_x = torch.tensor(batch['x'].view(len(batch['x']), 1, -1)).to(device)
            this_y = torch.tensor(batch['y'].view(len(batch['y']), 1, 1)).to(device)
            # Feed forward.
            output = model(this_x)
            prediction, _ = torch.max(output, dim=1)
            loss = criterion(prediction, this_y.view(len(batch['y']), -1))
            loss.backward()
            optimizer.step()
            losses.append(torch.sqrt(loss.float()).data)
            with torch.no_grad():
                # Zero gradient.
                optimizer.zero_grad()
                output = model(x_valid.view(len(x_valid), 1, -1))
                prediction, _ = torch.max(output, dim=1)
                loss = criterion(prediction, y_valid.view(len(y_valid), -1))
                valid_losses.append(torch.sqrt(loss.float()).data)
            clear_output(wait=True)
            plt.plot(losses, label='Train')
            plt.plot(valid_losses, label='Valid')
            plt.legend()
            plt.pause(0.05)
I have tuned several hyperparameters, but the model doesn't seem to train well: the validation loss doesn't move at all, e.g. with
hyperparams = Hyperparams(input_size=train_dataset.x.shape[1],
output_size=1,
hidden_size=150,
loss_func=nn.MSELoss,
learning_rate=1e-8,
optimizer=optim.Adam,
batch_size=500)
And its loss curve: (image not included)
Any idea what's wrong with the network?
Am I training the regression model with the wrong loss? Or I've just not yet found the right hyperparameters?
Or am I validating the model wrongly?

From the code you provided, it is hard to say why the validation loss is constant, but I see several problems in your code.
Why do you validate after every training mini-batch? Instead, validate the model after training for one complete epoch (one full pass over the training set). The skeleton should look like this:
for _e in range(num_epochs):
    for batch in tqdm(train_dataloader):
        # training code
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            # validation code
    # plot your loss values
Likewise, plot once per epoch rather than after every mini-batch.
Did you check whether the model parameters are actually updated after optimizer.step() during training? How many validation examples do you have? Why don't you use mini-batches during validation?
Why do you call optimizer.zero_grad() during validation? It serves no purpose there, because validation performs no optimization step.
You should call model.eval() during validation to turn off dropout. See the PyTorch documentation for the .train() and .eval() methods.
The learning rate is set to 1e-8; isn't that too small? Why not use Adam's default learning rate (1e-3)?
The following points require some reasoning.
Why are you using such a large batch size? What is your training dataset size?
You can plot the MSELoss directly instead of taking its square root.
My suggestion: start from an existing, working MLP example in PyTorch rather than writing everything from scratch at this stage. It will save you a lot of trouble.
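A minimal sketch of the per-epoch structure described in this answer, assuming synthetic data in place of the Kaggle dataset (the tensors, layer sizes, and the run_epoch helper are illustrative and not taken from the original notebook). One detail worth noting: F.dropout defaults to training=True, so the functional dropout calls in the question's forward() stay active even after model.eval() unless training=self.training is passed. The sketch uses the nn.Dropout module instead, which follows the model's mode automatically.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),  # module form: switched off by model.eval()
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        return self.net(x)

def run_epoch(model, loader, criterion, optimizer=None):
    """One pass over loader: trains if an optimizer is given, else only evaluates."""
    training = optimizer is not None
    model.train(training)  # model.train(False) is equivalent to model.eval()
    total_loss, n = 0.0, 0
    with torch.set_grad_enabled(training):
        for x, y in loader:
            loss = criterion(model(x), y)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * len(x)
            n += len(x)
    return total_loss / n  # mean loss per example

# Synthetic stand-in data (assumption): 49 features, one real-valued target.
x = torch.rand(600, 49)
y = x.sum(dim=1, keepdim=True)
train_loader = DataLoader(TensorDataset(x[:500], y[:500]), batch_size=50, shuffle=True)
valid_loader = DataLoader(TensorDataset(x[500:], y[500:]), batch_size=50)

model = MLP(49, 150, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())  # default learning rate, 1e-3

train_losses, valid_losses = [], []
for epoch in range(5):
    train_losses.append(run_epoch(model, train_loader, criterion, optimizer))
    valid_losses.append(run_epoch(model, valid_loader, criterion))  # once per epoch
```

Each list ends up with one entry per epoch, so the two curves can be plotted on the same axis without further bookkeeping.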

Related

How to do inference on a test dataset too large for RAM?

I'm training a network to classify audio. First I extract log-mel spectrograms from my audio data, save these in arrays and train my network on them. At each epoch I run inference on my test data to get an accuracy estimate.
My training dataset is 24GB and my test dataset is 6GB. Both are too large for the RAM. I found that I could extract the log-mel spectrograms from my training data before running the network, save each minibatch in a pickle file, then load these one by one during training.
However, I use .eval() to get the accuracy from my whole test data at once. This worked when I used smaller datasets, as there was no need to split my data into chunks using different pickle files. I'm now trying to figure out how to run the .eval() line, or an equivalent, so that it provides the accuracy for the whole test dataset rather than for the smaller chunks I've split it into. Is there a way to get the overall accuracy for my test data using pickle files or another method?
Here is the key component of code at the end where I think this can be done:
correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1))
test_accuracy = tf.reduce_mean(tf.cast(correct, 'float')) #changes correct to type: float
test_accuracy1 = test_accuracy.eval({features_input:X_test, labels_input:y_test})
test_accuracy_scores.append(test_accuracy1)
print('Test accuracy:', test_accuracy1)
Here is my entire codeblock for the network:
### Train NN, output results
r"""This uses the VGGish model definition within a larger model which adds two
layers on top, and then trains this larger model.
We input log-mel spectrograms (X_train) calculated above with associated labels
(y_train), and feed the batches into the model. Once the model is trained, it
is then executed on the test log-mel spectrograms (X_test) and the accuracy is output.
Alongside .csv file with the predictions for each 0.96s chunk and their true
class is also output for the test data. Column1 = the logit for the first class,
Column2 = the logit for the second class etc. The final column is the true class.
"""
num_min_batches = len(os.listdir(pickle_files_dir))/2
os.chdir(scripts_directory)

def main(X):
    with tf.Graph().as_default(), tf.Session() as sess:
        # Define VGGish.
        embeddings = vggish_slim.define_vggish_slim(training=FLAGS.train_vggish)
        # Define a shallow classification model and associated training ops on top
        # of VGGish.
        with tf.variable_scope('mymodel'):
            # Add a fully connected layer with 100 units. Add an activation function
            # to the embeddings since they are pre-activation.
            num_units = 100
            fc = slim.fully_connected(tf.nn.relu(embeddings), num_units)
            # Add a classifier layer at the end, consisting of parallel logistic
            # classifiers, one per class. This allows for multi-class tasks.
            logits = slim.fully_connected(
                fc, _NUM_CLASSES, activation_fn=None, scope='logits')
            tf.sigmoid(logits, name='prediction')
            linear_out= slim.fully_connected(
                fc, _NUM_CLASSES, activation_fn=None, scope='linear_out')
            logits = tf.sigmoid(linear_out, name='logits')
            # Add training ops.
            with tf.variable_scope('train'):
                global_step = tf.train.create_global_step()
                # Labels are assumed to be fed as a batch multi-hot vectors, with
                # a 1 in the position of each positive class label, and 0 elsewhere.
                labels_input = tf.placeholder(
                    tf.float32, shape=(None, _NUM_CLASSES), name='labels')
                # Cross-entropy label loss.
                xent = tf.nn.sigmoid_cross_entropy_with_logits(
                    logits=logits, labels=labels_input, name='xent')
                loss = tf.reduce_mean(xent, name='loss_op')
                tf.summary.scalar('loss', loss)
                # We use the same optimizer and hyperparameters as used to train VGGish.
                optimizer = tf.train.AdamOptimizer(
                    learning_rate=vggish_params.LEARNING_RATE,
                    epsilon=vggish_params.ADAM_EPSILON)
                train_op = optimizer.minimize(loss, global_step=global_step)
        # Initialize all variables in the model, and then load the pre-trained
        # VGGish checkpoint.
        sess.run(tf.global_variables_initializer())
        vggish_slim.load_vggish_slim_checkpoint(sess, FLAGS.checkpoint)
        # The training loop.
        features_input = sess.graph.get_tensor_by_name(
            vggish_params.INPUT_TENSOR_NAME)
        validation_accuracy_scores = []
        test_accuracy_scores = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            i=0
            while i < num_min_batches:
                #print('mini batch'+str(i))
                X_pickle_file = pickle_files_dir + 'X_train_mini_batch_' + str(i)
                with open(X_pickle_file, "rb") as fp: # Unpickling
                    batch_x = pickle.load(fp)
                y_pickle_file = pickle_files_dir + 'y_train_mini_batch_' + str(i)
                with open(y_pickle_file, "rb") as fp: # Unpickling
                    batch_y = pickle.load(fp)
                _, c = sess.run([train_op, loss], feed_dict={features_input: batch_x, labels_input: batch_y})
                epoch_loss += c
                i+=1
            #print no. of epochs and loss
            print('Epoch', epoch+1, 'completed out of', num_epochs,', loss:',epoch_loss)
            #note this adds a small computational cost
            correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1))
            test_accuracy = tf.reduce_mean(tf.cast(correct, 'float')) #changes correct to type: float
            test_accuracy1 = test_accuracy.eval({features_input:X_test, labels_input:y_test})
            test_accuracy_scores.append(test_accuracy1)
            print('Test accuracy:', test_accuracy1)

if __name__ == '__main__':
    tf.app.run()
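One possible way to get a single accuracy figure over a test set that only fits in memory in pieces is to count correct predictions per chunk and divide by the total at the end. The sketch below is framework-agnostic plain Python and assumes the per-chunk logits have already been computed (in the code above that would presumably mean evaluating logits once for each unpickled test chunk). The helper names and the toy numbers are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def count_correct(logits, labels):
    """Number of examples whose predicted class matches the one-hot label."""
    return sum(argmax(p) == argmax(t) for p, t in zip(logits, labels))

def accuracy_over_chunks(chunks):
    """chunks yields (logits, labels) pairs, one test chunk at a time."""
    correct = total = 0
    for logits, labels in chunks:
        correct += count_correct(logits, labels)
        total += len(labels)
    return correct / total

# Two toy chunks of unequal size (made-up numbers).
chunk_a = ([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],   # predicted classes 0, 1, 0
           [[1, 0], [0, 1], [0, 1]])               # true classes      0, 1, 1
chunk_b = ([[0.4, 0.6]],                           # predicted class 1
           [[0, 1]])                               # true class      1

overall = accuracy_over_chunks([chunk_a, chunk_b])
print('Test accuracy:', overall)
```

Averaging the two per-chunk accuracies instead would weight the single-example chunk as heavily as the three-example one, which is why the counts are accumulated rather than the ratios.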

Pytorch error when launching two distinct backward

I am building a simple autoencoder followed by an MLP neural network. The autoencoder part runs without any problem:
# ---- Prepare training set ----
x_data = train_set_categorical.drop(["churn"], axis=1).to_numpy()
labels = train_set_categorical.loc[:, "churn"].to_numpy()
dataset = TensorDataset(torch.Tensor(x_data), torch.Tensor(labels) )
loader = DataLoader(dataset, batch_size=127)
# ---- Model Initialization ----
model = AE()
# Validation using MSE Loss function
loss_function = nn.MSELoss()
# Using an Adam Optimizer with lr = 0.1
optimizer = torch.optim.Adam(model.parameters(),
                             lr = 1e-1,
                             weight_decay = 1e-8)
epochs = 50
outputs = []
losses = []
for epoch in range(epochs):
    for (image, _) in loader:
        # Output of Autoencoder
        embbeding, reconstructed = model(image)
        # Calculating the loss function
        loss = loss_function(reconstructed, image)
        # The gradients are set to zero,
        # then the gradient is computed and stored.
        # .step() performs parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Storing the losses in a list for plotting
        losses.append(loss)
        if epoch == 49:
            outputs.append(embbeding)
But then I feed the output of the autoencoder to an MLP, and this is where things start to fail:
class Feedforward(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.neural = torch.nn.Sequential(
            torch.nn.Linear(33, 260),
            torch.nn.ReLU(),
            torch.nn.Linear(260, 450),
            torch.nn.ReLU(),
            torch.nn.Linear(450, 260),
            torch.nn.ReLU(),
            torch.nn.Linear(260, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        outcome = self.neural(x.float())
        return outcome

modelz = Feedforward()
criterion = torch.nn.BCELoss()
opt = torch.optim.Adam(modelz.parameters(), lr = 0.01)
modelz.train()
epoch = 20
for epoch in range(epoch):
    opt.zero_grad()
    # Forward pass
    y_pred = modelz(x_train)
    # Compute Loss
    loss_2 = criterion(y_pred.squeeze(), torch.tensor(y_train).to(torch.float32))
    #print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
    # Backward pass
    loss_2.backward()
    opt.step()
I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.
Of course, I have tried adding retain_graph=True to both backward calls, or only to the first one, but it does not solve the problem. If I run the two pieces of code independently of one another, each works; run in sequence, they fail, and I don't know why.

You should be able to disconnect the output of the auto-encoder from the auto-encoder's computation graph by calling embbeding.detach() before appending it to outputs.
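To illustrate why detaching helps, here is a small sketch with stand-in layers (the encoder, decoder, and classifier below are hypothetical placeholders for the AE and Feedforward models, and the data is random). After the auto-encoder's backward() call its graph is freed, so any later loss that still depends on the un-detached embedding would try to backpropagate through that freed graph.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder = nn.Linear(8, 3)      # stand-in for the auto-encoder's encoder
decoder = nn.Linear(3, 8)      # stand-in for its decoder
classifier = nn.Linear(3, 1)   # stand-in for the MLP

x = torch.rand(16, 8)
embedding = encoder(x)
recon_loss = nn.functional.mse_loss(decoder(embedding), x)
recon_loss.backward()          # the auto-encoder graph is freed after this call

# Same values as the embedding, but no link back to the freed graph.
features = embedding.detach()

target = torch.randint(0, 2, (16,)).float()
opt = torch.optim.Adam(classifier.parameters(), lr=0.01)
for _ in range(3):
    opt.zero_grad()
    pred = torch.sigmoid(classifier(features)).squeeze(1)
    loss = nn.functional.binary_cross_entropy(pred, target)
    loss.backward()            # only touches the classifier's own graph
    opt.step()
```

Because features carries no grad_fn, each backward() in the second loop stops at the classifier's input instead of reaching into the auto-encoder.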

Gradient Accumulation with Custom model.fit in TF.Keras?

Please add a brief comment with your thoughts so that I can improve my question. Thank you. :-)
I'm trying to train a tf.keras model with Gradient Accumulation (GA). I don't want to write a custom training loop; instead I want to customize the .fit() method by overriding train_step. Is it possible? How can this be accomplished? The reason is that we want to keep the benefit of Keras built-in functionality such as fit and callbacks, which a custom training loop gives up, while still being able to override train_step for purposes such as GA.
Also, I know the pros of using GA, but what are the major cons? Why is it an optional feature rather than the framework's default?
# overriding train step
# my attempt
# it's not appropriately implemented
# and need to fix
class CustomTrainStep(keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [
            tf.zeros_like(this_var) for this_var in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(
                y, y_pred, regularization_losses=self.losses
            )
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [
            (acum_grad+grad) for acum_grad, grad in \
            zip(self.gradient_accumulation, gradients)
        ]
        accum_gradient = [
            this_grad/batch_size for this_grad in accum_gradient
        ]
        # apply accumulated gradients
        self.optimizer.apply_gradients(
            zip(accum_gradient, self.trainable_variables)
        )
        # TODO: reset self.gradient_accumulation
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
Please, run and check with the following toy setup.
# Model
size = 32
input = keras.Input(shape=(size,size,3))
efnet = keras.applications.DenseNet121(
weights=None,
include_top = False,
input_tensor = input
)
base_maps = keras.layers.GlobalAveragePooling2D()(efnet.output)
base_maps = keras.layers.Dense(
units=10, activation='softmax',
name='primary'
)(base_maps)
custom_model = CustomTrainStep(
n_gradients=10, inputs=[input], outputs=[base_maps]
)
# bind all
custom_model.compile(
loss = keras.losses.CategoricalCrossentropy(),
metrics = ['accuracy'],
optimizer = keras.optimizers.Adam()
)
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size,size]) # if we want to resize
y_train = tf.one_hot(y_train , depth=10)
# customized fit
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)
Update
I've found that others have also tried to achieve this and ended up with the same issue. One of them has a workaround, but it's too messy and I think there should be a better approach.
Update 2
The accepted answer (by Mr.For Example) is fine and works well with a single strategy. I would now like to start a second bounty to extend it to support multi-GPU, TPU, and mixed-precision training. There are some complications.

Yes, it is possible to customize the .fit() method by overriding train_step without a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation:
import tensorflow as tf

class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False) for v in self.trainable_variables]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)
        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
        # If n_acum_step reach the n_gradients then we apply accumulated gradients to update the variables otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))
        # reset
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))
# Model
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])
# bind all
custom_model.compile(
loss = tf.keras.losses.CategoricalCrossentropy(),
metrics = ['accuracy'],
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3) )
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train , depth=10)
# customized fit
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose = 1)
Outputs:
Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
Pros:
Gradient accumulation is a mechanism to split the batch of samples —
used for training a neural network — into several mini-batches of
samples that will be run sequentially
GA calculates the loss and gradients after each mini-batch but, instead of updating the model parameters, it waits and accumulates the gradients over consecutive batches. This lets it overcome memory constraints, i.e. train with the memory footprint of a small batch while behaving as if a large batch size were used.
Example: If you run a gradient accumulation with steps of 5 and batch
size of 4 images, it serves almost the same purpose of running with a
batch size of 20 images.
We could also parallelize training when using GA, i.e. aggregate gradients from multiple machines.
Things to consider:
This technique works so well that it is widely used. There are a few things to consider before using it, though I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.
If your machine already has sufficient memory for a batch size that is large enough, there is no need to use it: it is well known that too large a batch size leads to poor generalization, and training will certainly run slower if you use GA to reach a batch size that your machine's memory can already handle.
Reference:
What is Gradient Accumulation in Deep Learning?
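The "5 steps of 4 images serves almost the same purpose as a batch of 20" claim can be checked with plain arithmetic. The sketch below uses a hypothetical one-parameter model (prediction w * x, mean-squared-error loss) and made-up data; it is an illustration of the idea, not the Keras implementation shown earlier in this answer.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [float(i) for i in range(1, 21)]   # 20 samples
ys = [3.0 * x for x in xs]

# Gradient of the loss over the full batch of 20.
full = grad_mse(w, xs, ys)

# Gradient accumulation: 5 steps with mini-batches of 4, no weight update in between.
steps, size = 5, 4
accumulated = 0.0
for k in range(steps):
    lo = k * size
    accumulated += grad_mse(w, xs[lo:lo + size], ys[lo:lo + size])

averaged = accumulated / steps

print('full batch:', full)
print('accumulated sum:', accumulated)
print('accumulated mean:', averaged)
```

With equal-sized mini-batches and a mean-reduced loss, the averaged accumulated gradient matches the full-batch gradient, while the plain sum is larger by a factor of the number of steps. Whether to divide by the step count before applying the gradients is therefore a choice worth making explicitly.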

Thanks to @Mr.For Example for his convenient answer.
In my experience, Gradient Accumulation does not speed up training, since we still run n_gradients forward passes and compute all the gradients, but it can speed up the convergence of the model. I also found that the mixed-precision technique can be really helpful here.
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)  # TF >= 2.4; older releases used tf.keras.mixed_precision.experimental.set_policy(policy)
Here is a complete gist.

How to evaluate a single image in PyTorch model?

I used this code to train a model:
def train(model, epochs):
    for epoch in range(epochs):
        for idx, batch in enumerate(train_loader):
            x, bndbox = batch # unpack batch
            pred_bndbox = model(x)# forward pass
            #print('label:', bndbox, 'prediction:', pred_bndbox)
            loss = criterion(pred_bndbox, bndbox) # compute loss for this batch
            optimiser.zero_grad()# zero gradients of optimiser
            loss.backward() # backward pass (find rate of change of loss with respect to model parameters)
            optimiser.step()# take optimisation step
            print('Epoch:', epoch, 'Batch:', idx, 'Loss:', loss.item())
            writer.add_scalar('DETECTION Loss/Train', loss, epoch*len(train_loader) + idx) # write loss to a graph

train(cnn, epochs)
torch.save(cnn.state_dict(), str(time.time()))# save model

def visualise(model, n):
    model.eval()
    for idx, batch in enumerate(test_loader):
        x, y = batch
        pred_bndbox = model(x)
        S40dataset.show(batch, pred_bndbox=pred_bndbox)
        if idx == n:
            break
How do I evaluate the model prediction on a single image to check the operation of the neural network?

You can use:
model.eval() # turn the model to evaluate mode
with torch.no_grad(): # does not calculate gradient
    class_index = model(single_image).argmax() #gets the prediction for the image's class
This code stores the network's prediction, as a class index, in the class_index variable. The image you want to examine must be stored in single_image with the shape the model expects.
Hope that helps.
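As a sketch of what "the right shape" usually means: a model trained on batches expects a leading batch dimension, so a single image of shape (channels, height, width) is typically wrapped with unsqueeze(0). The tiny network below is a hypothetical stand-in for the trained cnn, and the image is random data. Since the question's model predicts bounding boxes rather than classes, the sketch keeps the raw output instead of taking argmax().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a trained CNN; substitute your own model.
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 4),   # e.g. four bounding-box coordinates
)

single_image = torch.rand(3, 64, 64)   # one image: (channels, height, width)

model.eval()                           # evaluation mode (affects dropout, batch norm)
with torch.no_grad():                  # no gradients needed for inference
    batch = single_image.unsqueeze(0)        # add batch dimension -> (1, 3, 64, 64)
    prediction = model(batch).squeeze(0)     # drop it again -> (4,)

print(prediction.shape)
```

For a classifier, calling prediction.argmax() on the result would give the class index, as in the answer above.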

calculate perplexity in pytorch

I've just trained an LSTM language model using pytorch. The main body of the class is this:
class LM(nn.Module):
    def __init__(self, n_vocab,
                 seq_size,
                 embedding_size,
                 lstm_size,
                 pretrained_embed):
        super(LM, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        self.embedding = nn.Embedding.from_pretrained(pretrained_embed, freeze = True)
        self.lstm = nn.LSTM(embedding_size,
                            lstm_size,
                            batch_first=True)
        self.fc = nn.Linear(lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state
Now I want to write a function which calculates how good a sentence is, based on the trained language model (some score like perplexity, etc.).
I'm a bit confused and don't know how I should calculate this. A similar example would be of great use.

With cross-entropy loss, perplexity is simply the exponential of the loss, so you can apply torch.exp() to it.
(PyTorch's cross-entropy is computed with the natural logarithm, so the exponential function is its matching inverse.)
Here is a dummy example:
import torch
import torch.nn.functional as F
num_classes = 10
batch_size = 1
# your model outputs / logits
output = torch.rand(batch_size, num_classes)
# your targets
target = torch.randint(num_classes, (batch_size,))
# getting loss using cross entropy
loss = F.cross_entropy(output, target)
# calculating perplexity
perplexity = torch.exp(loss)
print('Loss:', loss, 'PP:', perplexity)
In my case the output is:
Loss: tensor(2.7935) PP: tensor(16.3376)
Just be aware that if you want the per-word perplexity, you need the per-word loss as well.
Here is a neat example of a language model that also computes the perplexity from the output and might be interesting to look at:
https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/language_model/main.py#L30-L50
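To make the per-word point concrete, here is a small plain-Python sketch of sentence perplexity as the exponential of the mean negative log-probability of the observed tokens. The probabilities are made up; with the LM above they would presumably come from a softmax over the logits at each position, picking the probability assigned to the actual next word. The perplexity helper is illustrative, not part of any library.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads probability uniformly over 10 words at every step.
uniform = perplexity([0.1] * 5)

# A model that is fairly sure of each observed word.
confident = perplexity([0.9, 0.8, 0.95])

print('uniform:', uniform)
print('confident:', confident)
```

The uniform case comes out at about 10, matching the intuition that perplexity is the effective number of equally likely choices per word; a lower value means the model finds the sentence more plausible.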
