UPDATE: To solve this, I kept the checkpoint structure the same but wrote a custom train_step function (with the help of the repo linked in the accepted answer of the question linked below) which computes the gradients and calls apply_gradients, rather than compiling the models and using train_on_batch. This lets the full GAN state be restored. Sadly, with this method I'm fairly sure the dropout layers no longer work, as the discriminator is able to discriminate perfectly very early in the training, which prevents the model from training properly. Nevertheless, the original problem is solved.
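For reference, a minimal sketch of what such a custom train step might look like (this is only a sketch, not the exact code: it assumes the generator, discriminator, g_optimizer and d_optimizer defined in the original code below, and the same real ≈ 0 / fake ≈ 1 label convention used there):

bce = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(real_batch, noise):
    # Discriminator update: real data labelled 0, generated data labelled 1,
    # matching the label convention used in the training loop below.
    with tf.GradientTape() as d_tape:
        fake_batch = generator(noise, training=True)
        d_real = discriminator(real_batch, training=True)
        d_fake = discriminator(fake_batch, training=True)
        d_loss = 0.5 * (bce(tf.zeros_like(d_real), d_real) +
                        bce(tf.ones_like(d_fake), d_fake))
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    d_optimizer.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # Generator update: push the discriminator's output for generated data towards 0 ("real").
    with tf.GradientTape() as g_tape:
        fake_batch = generator(noise, training=True)
        d_fake = discriminator(fake_batch, training=False)
        g_loss = bce(tf.zeros_like(d_fake), d_fake)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    g_optimizer.apply_gradients(zip(g_grads, generator.trainable_variables))

    return d_loss, g_loss

Because only the generator, discriminator and the two optimizers are used (no compiled gan model), everything that matters is captured by the checkpoint.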
Original:
I am currently training a GAN in keras and trying to make it so that I can save the model and resume training later. Ordinarily in keras you'd simply use model.save(), however for a GAN if the discriminator and GAN (combined generator and discriminator, with discriminator weights not trainable) models are saved and loaded separately then the link between them is broken and the GAN will not function as expected. Someone asked a similar question here, How to save and resume training a GAN with multiple model parts with Tensorflow 2/ Keras, and was told to use tf.train.Checkpoint instead to save the full model at once as a checkpoint.
I've tried implementing this as follows:
def train(epochs, batch_size):
    checkpoint = tf.train.Checkpoint(g_optimizer=g_optimizer,
                                     d_optimizer=d_optimizer,
                                     generator=generator,
                                     discriminator=discriminator,
                                     gan=gan)
    ckpt_manager = tf.train.CheckpointManager(checkpoint, 'checkpoints', max_to_keep=3)
    if ckpt_manager.latest_checkpoint:
        checkpoint.restore(ckpt_manager.latest_checkpoint)
        discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
        i = Input(shape=(None, latent_dims))
        lcs = generator(i)
        discriminator.trainable = False
        valid = discriminator(lcs)
        gan = Model(i, valid)
        gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)

    for epoch in range(epochs):
        # train discriminator...
        # train generator...
        ckpt_manager.save()
where g_optimizer, d_optimizer are just tf.keras.optimizers.Adam objects and generator, discriminator and gan are tf.keras.Model objects.
When I use this approach, the link between the gan model and the discriminator is preserved after loading in the checkpoint. The training works normally at first, but after I stop and then resume training using the checkpoint the discriminator loss starts massively increasing and the generated data becomes nonsensical.
Recompiling the models after loading the checkpoint like this was the only way I could think of to keep using the last state of the optimizers, but clearly something isn't right - rather than resuming the training from where it left off, this approach massively disrupts it.
Have I used tf.train.Checkpoint incorrectly for what I'm trying to do? Please let me know if there's any more information you need to be able to address the question.
Edit: I have added the full code by request.
Here is the code that creates the models in the first place and then trains them. In this setup the models are compiled when first created, and compiled again if resuming from a checkpoint so that the latest optimizer state is used. I appreciate it's odd to compile twice, but I couldn't think of another way to use the latest optimizer state from the checkpoint - if there's a better way I'm very happy to change it. Note that the unusual GRU-based GAN is because I'm testing out generating variable-length time series. There's a lot of data-specific stuff in there but hopefully on the whole it makes sense. train_df is just a pandas DataFrame containing all the training data.
def build_generator():
    input = Input(shape=(None, latent_dims))
    gru1 = GRU(100, activation='relu', return_sequences=True)(input)
    gru2 = GRU(100, activation='relu', return_sequences=True)(gru1)
    output = GRU(9, return_sequences=True, activation='sigmoid')(gru2)
    model = Model(input, output)
    return model

def build_discriminator():
    input = Input(shape=(None, 9))
    gru1 = GRU(100, return_sequences=True)(input)
    gru2 = GRU(100, return_sequences=True)(gru1)
    output = GRU(1, activation='sigmoid')(gru2)
    model = Model(input, output)
    return model
d_optimizer = opt.Adam(learning_rate=lr)
g_optimizer = opt.Adam(learning_rate=lr)
# Build discriminator
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
# Build generator
generator = build_generator()
# Build combined model
i = Input(shape=(None, latent_dims))
lcs = generator(i)
discriminator.trainable = False
valid = discriminator(lcs)
gan = Model(i, valid)
gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)
def train(epochs, batch_size=1):  # Only works with batch size of 1 currently
    sne = train_df.sn.unique()
    n_batches = int(len(sne) / batch_size)
    rng = np.random.default_rng(123)

    checkpoint = tf.train.Checkpoint(g_optimizer=g_optimizer,
                                     d_optimizer=d_optimizer,
                                     generator=generator,
                                     discriminator=discriminator,
                                     gan=gan)
    ckpt_manager = tf.train.CheckpointManager(checkpoint, 'checkpoints', max_to_keep=3)
    if ckpt_manager.latest_checkpoint:
        checkpoint.restore(ckpt_manager.latest_checkpoint)
        discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
        i = Input(shape=(None, latent_dims))
        lcs = generator(i)
        discriminator.trainable = False
        valid = discriminator(lcs)
        gan = Model(i, valid)
        gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)

    for epoch in range(epochs):
        rng.shuffle(sne)
        g_losses, d_losses = [], []
        for batch in range(n_batches):
            real = np.random.uniform(0.0, 0.1, (batch_size, 1))  # Used instead of np.zeros to avoid zero gradients
            fake = np.random.uniform(0.9, 1.0, (batch_size, 1))  # Used instead of np.ones to avoid zero gradients

            # Select real data
            sn = sne[batch]
            sndf = train_df[train_df.sn == sn]
            X = sndf[['g_t', 'r_t', 'i_t', 'z_t', 'g', 'r', 'i', 'z', 'g_err', 'r_err', 'i_err', 'z_err']].values
            X = X.reshape((1, *X.shape))

            noise = rand.normal(size=(batch_size, latent_dims))
            noise = np.reshape(noise, (batch_size, 1, latent_dims))
            noise = np.repeat(noise, X.shape[1], 1)
            gen_lcs = generator.predict(noise)

            # Train discriminator
            d_loss_real = discriminator.train_on_batch(X, real)
            d_loss_fake = discriminator.train_on_batch(gen_lcs, fake)
            d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

            # Train generator
            noise = rand.normal(size=(2 * batch_size, latent_dims))
            noise = np.reshape(noise, (2 * batch_size, 1, latent_dims))
            noise = np.repeat(noise, X.shape[1], 1)
            gen_labels = np.zeros((2 * batch_size, 1))
            g_loss = gan.train_on_batch(noise, gen_labels)

            g_losses.append(g_loss)
            d_losses.append(d_loss)

        ckpt_manager.save()
        full_g_loss = np.mean(g_losses)
        full_d_loss = np.mean(d_losses)
        print(f'{epoch + 1}/{epochs} g_loss={full_g_loss}, d_loss={full_d_loss}')

train()
If you have the following checkpoint structure, your model should work properly:
checkpoint_dir = 'checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(generator_opt=generator_opt,
                                 discriminator_opt=discriminator_opt,
                                 gan_opt=gan_opt,
                                 generator=generator,
                                 discriminator=discriminator,
                                 GAN=GAN)
ckpt_manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)
if ckpt_manager.latest_checkpoint:
    checkpoint.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')
Note that the GAN model has its own optimizer. And then in your training loop, just save checkpoints at certain intervals, for example every 10 epochs.
for epoch in range(epochs):
    ...
    ...
    ...
    if epoch % 10 == 0:
        ckpt_manager.save()
Related
I'm training a network to classify audio. First I extract logmel-spectrograms from my audio data, save these in arrays and train my network using these. At each epoch I inference on my test data to get an accuracy estimate.
My training dataset is 24GB and test dataset is 6GB. Both are too large for the RAM. I found that I could extract the logmel-specs from my training data before running the network, save each minibatch in a pickle file, then load these one by one during training.
However, I use .eval() to get the accuracy over my whole test data at once. This worked when I used smaller datasets, as there was no need to split the data into chunks in different pickle files. However, I'm now trying to figure out how to run the .eval() line (or an equivalent) so that it gives the accuracy for the whole test dataset, rather than for the smaller chunks I've split it into. Is there a way I can get overall accuracy for my test data using pickle files, or another method?
Here is the key component of code at the end where I think this can be done:
correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1))
test_accuracy = tf.reduce_mean(tf.cast(correct, 'float')) #changes correct to type: float
test_accuracy1 = test_accuracy.eval({features_input:X_test, labels_input:y_test})
test_accuracy_scores.append(test_accuracy1)
print('Test accuracy:', test_accuracy1)
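One possible approach (only a sketch: it assumes the test data is pickled into chunks the same way as the training data, and the file names and num_test_batches below are made up for illustration) is to count correct predictions per chunk and divide by the total number of test examples:

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1))
num_correct = tf.reduce_sum(tf.cast(correct, 'float'))  # count of correct predictions per chunk

total_correct = 0.0
total_examples = 0
for j in range(num_test_batches):  # num_test_batches is an assumption
    with open(pickle_files_dir + 'X_test_mini_batch_' + str(j), 'rb') as fp:
        chunk_x = pickle.load(fp)
    with open(pickle_files_dir + 'y_test_mini_batch_' + str(j), 'rb') as fp:
        chunk_y = pickle.load(fp)
    total_correct += num_correct.eval({features_input: chunk_x, labels_input: chunk_y})
    total_examples += len(chunk_y)

test_accuracy1 = total_correct / total_examples  # weighted by chunk size, so chunks can differ in length
test_accuracy_scores.append(test_accuracy1)
print('Test accuracy:', test_accuracy1)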
Here is my entire codeblock for the network:
### Train NN, output results
r"""This uses the VGGish model definition within a larger model which adds two
layers on top, and then trains this larger model.

We input log-mel spectrograms (X_train) calculated above with associated labels
(y_train), and feed the batches into the model. Once the model is trained, it
is then executed on the test log-mel spectrograms (X_test) and the accuracy is output.
Alongside this, a .csv file with the predictions for each 0.96s chunk and their true
class is also output for the test data. Column1 = the logit for the first class,
Column2 = the logit for the second class etc. The final column is the true class.
"""
num_min_batches = len(os.listdir(pickle_files_dir))/2
os.chdir(scripts_directory)

def main(X):
    with tf.Graph().as_default(), tf.Session() as sess:
        # Define VGGish.
        embeddings = vggish_slim.define_vggish_slim(training=FLAGS.train_vggish)

        # Define a shallow classification model and associated training ops on top
        # of VGGish.
        with tf.variable_scope('mymodel'):
            # Add a fully connected layer with 100 units. Add an activation function
            # to the embeddings since they are pre-activation.
            num_units = 100
            fc = slim.fully_connected(tf.nn.relu(embeddings), num_units)

            # Add a classifier layer at the end, consisting of parallel logistic
            # classifiers, one per class. This allows for multi-class tasks.
            logits = slim.fully_connected(
                fc, _NUM_CLASSES, activation_fn=None, scope='logits')
            tf.sigmoid(logits, name='prediction')
            linear_out = slim.fully_connected(
                fc, _NUM_CLASSES, activation_fn=None, scope='linear_out')
            logits = tf.sigmoid(linear_out, name='logits')

        # Add training ops.
        with tf.variable_scope('train'):
            global_step = tf.train.create_global_step()

            # Labels are assumed to be fed as a batch multi-hot vectors, with
            # a 1 in the position of each positive class label, and 0 elsewhere.
            labels_input = tf.placeholder(
                tf.float32, shape=(None, _NUM_CLASSES), name='labels')

            # Cross-entropy label loss.
            xent = tf.nn.sigmoid_cross_entropy_with_logits(
                logits=logits, labels=labels_input, name='xent')
            loss = tf.reduce_mean(xent, name='loss_op')
            tf.summary.scalar('loss', loss)

            # We use the same optimizer and hyperparameters as used to train VGGish.
            optimizer = tf.train.AdamOptimizer(
                learning_rate=vggish_params.LEARNING_RATE,
                epsilon=vggish_params.ADAM_EPSILON)
            train_op = optimizer.minimize(loss, global_step=global_step)

        # Initialize all variables in the model, and then load the pre-trained
        # VGGish checkpoint.
        sess.run(tf.global_variables_initializer())
        vggish_slim.load_vggish_slim_checkpoint(sess, FLAGS.checkpoint)

        # The training loop.
        features_input = sess.graph.get_tensor_by_name(
            vggish_params.INPUT_TENSOR_NAME)

        validation_accuracy_scores = []
        test_accuracy_scores = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            i = 0
            while i < num_min_batches:
                # print('mini batch' + str(i))
                X_pickle_file = pickle_files_dir + 'X_train_mini_batch_' + str(i)
                with open(X_pickle_file, "rb") as fp:  # Unpickling
                    batch_x = pickle.load(fp)
                y_pickle_file = pickle_files_dir + 'y_train_mini_batch_' + str(i)
                with open(y_pickle_file, "rb") as fp:  # Unpickling
                    batch_y = pickle.load(fp)
                _, c = sess.run([train_op, loss], feed_dict={features_input: batch_x, labels_input: batch_y})
                epoch_loss += c
                i += 1

            # print no. of epochs and loss
            print('Epoch', epoch + 1, 'completed out of', num_epochs, ', loss:', epoch_loss)

            # note this adds a small computational cost
            correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1))
            test_accuracy = tf.reduce_mean(tf.cast(correct, 'float'))  # changes correct to type: float
            test_accuracy1 = test_accuracy.eval({features_input: X_test, labels_input: y_test})
            test_accuracy_scores.append(test_accuracy1)
            print('Test accuracy:', test_accuracy1)

if __name__ == '__main__':
    tf.app.run()
I am building a simple autoencoder followed by an MLP neural net. Regarding the autoencoder, I am not running into any problems.
# ---- Prepare training set ----
x_data = train_set_categorical.drop(["churn"], axis=1).to_numpy()
labels = train_set_categorical.loc[:, "churn"].to_numpy()

dataset = TensorDataset(torch.Tensor(x_data), torch.Tensor(labels))
loader = DataLoader(dataset, batch_size=127)

# ---- Model Initialization ----
model = AE()

# Validation using MSE Loss function
loss_function = nn.MSELoss()

# Using an Adam Optimizer with lr = 0.1
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-1,
                             weight_decay=1e-8)

epochs = 50
outputs = []
losses = []
for epoch in range(epochs):
    for (image, _) in loader:
        # Output of Autoencoder
        embbeding, reconstructed = model(image)

        # Calculating the loss function
        loss = loss_function(reconstructed, image)

        # The gradients are set to zero,
        # then the gradient is computed and stored.
        # .step() performs parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Storing the losses in a list for plotting
        losses.append(loss)

        if epoch == 49:
            outputs.append(embbeding)
But then I am feeding an MLP with the outcome of the autoencoder, and this is where things start to fail:
class Feedforward(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.neural = torch.nn.Sequential(
            torch.nn.Linear(33, 260),
            torch.nn.ReLU(),
            torch.nn.Linear(260, 450),
            torch.nn.ReLU(),
            torch.nn.Linear(450, 260),
            torch.nn.ReLU(),
            torch.nn.Linear(260, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        outcome = self.neural(x.float())
        return outcome

modelz = Feedforward()
criterion = torch.nn.BCELoss()
opt = torch.optim.Adam(modelz.parameters(), lr=0.01)

modelz.train()
epoch = 20
for epoch in range(epoch):
    opt.zero_grad()
    # Forward pass
    y_pred = modelz(x_train)
    # Compute Loss
    loss_2 = criterion(y_pred.squeeze(), torch.tensor(y_train).to(torch.float32))
    #print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
    # Backward pass
    loss_2.backward()
    opt.step()
I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.
Of course I have tried adding retain_graph=True to both backward calls, or only the first one, but it does not seem to solve the problem. If I run the two pieces of code independently of each other they work, but run in sequence they do not, and I don't know why.
You should be able to disconnect the output of the auto-encoder from the model by calling embbeding.detach(), before appending it to outputs.
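Roughly, that would look like this in the training loop above (just a sketch):

if epoch == 49:
    # Detach the embedding from the autoencoder's graph before storing it, so the
    # MLP's backward pass does not try to traverse the (already freed) autoencoder graph.
    outputs.append(embbeding.detach())

The stored tensors can then be assembled into the MLP's input however you like, e.g. by concatenating them, since they no longer carry any autograd history.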
I am trying to create a dataset for audio recognition with a simple Keras sequential model.
This is the function I am using to create the model:
def dnn_model(input_shape, output_shape):
    model = keras.Sequential()
    model.add(keras.Input(input_shape))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(output_shape, activation="softmax"))
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['acc'])
    model.summary()
    return model
And I am generating my training data with this generator function:
def generator(x_dirs, y_dirs, hmm, sampling_rate, parameters):
    window_size_samples = tools.sec_to_samples(parameters['window_size'], sampling_rate)
    window_size_samples = 2**tools.next_pow2(window_size_samples)
    hop_size_samples = tools.sec_to_samples(parameters['hop_size'], sampling_rate)

    for i in range(len(x_dirs)):
        features = fe.compute_features_with_context(x_dirs[i], **parameters)
        praat = tools.praat_file_to_target(y_dirs[i],
                                           sampling_rate,
                                           window_size_samples,
                                           hop_size_samples,
                                           hmm)
        yield features, praat
The variables x_dirs and y_dirs contain lists of paths to the label and audio files. In total I have 8623 files to train my model with. This is how I train my model:
def train_model(model, model_dir, x_dirs, y_dirs, hmm, sampling_rate, parameters, steps_per_epoch=10, epochs=10):
    model.fit((generator(x_dirs, y_dirs, hmm, sampling_rate, parameters)),
              epochs=epochs,
              batch_size=steps_per_epoch)
    return model
My problem now is that if I pass all 8623 files it will use all 8623 files to train the model in the first epoch and complain after the first epoch that it needs steps_per_epoch * epochs batches to train the model.
I tested this with only 10 of the 8623 files using a sliced list, but then TensorFlow complains that 100 batches are needed.
So how do I make my generator yield data so that this works properly? I always thought that steps_per_epoch just limits the amount of data received per epoch.
The fit function is going to exhaust your generator; that is to say, once it has yielded all your 8623 batches, it won't be able to yield any more.
You want to solve the issue like this:
def generator(x_dirs, y_dirs, hmm, sampling_rate, parameters, epochs=1):
    for epoch in range(epochs):  # or while True:
        window_size_samples = tools.sec_to_samples(parameters['window_size'], sampling_rate)
        window_size_samples = 2**tools.next_pow2(window_size_samples)
        hop_size_samples = tools.sec_to_samples(parameters['hop_size'], sampling_rate)

        for i in range(len(x_dirs)):
            features = fe.compute_features_with_context(x_dirs[i], **parameters)
            praat = tools.praat_file_to_target(y_dirs[i],
                                               sampling_rate,
                                               window_size_samples,
                                               hop_size_samples,
                                               hmm)
            yield features, praat
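With that change, the generator can be passed to fit together with matching steps_per_epoch and epochs values, for example (a sketch, assuming one file per step):

model.fit(generator(x_dirs, y_dirs, hmm, sampling_rate, parameters, epochs=epochs),
          steps_per_epoch=len(x_dirs),  # one yielded (features, praat) pair per step
          epochs=epochs)

This way the generator yields exactly steps_per_epoch * epochs batches, which is what fit expects.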
I am confused by the .trainable attribute of tf.keras.Model in the implementation of a GAN.
Given the following code snippet (taken from this repo):
class GAN():
    def __init__(self):
        ...
        # Build and compile the discriminator
        self.discriminator = self.build_discriminator()
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=optimizer,
                                   metrics=['accuracy'])

        # Build the generator
        self.generator = self.build_generator()

        # The generator takes noise as input and generates imgs
        z = Input(shape=(self.latent_dim,))
        img = self.generator(z)

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated images as input and determines validity
        validity = self.discriminator(img)

        # The combined model (stacked generator and discriminator)
        # Trains the generator to fool the discriminator
        self.combined = Model(z, validity)
        self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)

    def build_generator(self):
        ...
        return Model(noise, img)

    def build_discriminator(self):
        ...
        return Model(img, validity)

    def train(self, epochs, batch_size=128, sample_interval=50):
        # Load the dataset
        (X_train, _), (_, _) = mnist.load_data()

        # Adversarial ground truths
        valid = np.ones((batch_size, 1))
        fake = np.zeros((batch_size, 1))

        for epoch in range(epochs):
            # ---------------------
            #  Train Discriminator
            # ---------------------

            # Select a random batch of images
            idx = np.random.randint(0, X_train.shape[0], batch_size)
            imgs = X_train[idx]

            noise = np.random.normal(0, 1, (batch_size, self.latent_dim))

            # Generate a batch of new images
            gen_imgs = self.generator.predict(noise)

            # Train the discriminator
            d_loss_real = self.discriminator.train_on_batch(imgs, valid)
            d_loss_fake = self.discriminator.train_on_batch(gen_imgs, fake)
            d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

            # ---------------------
            #  Train Generator
            # ---------------------

            noise = np.random.normal(0, 1, (batch_size, self.latent_dim))

            # Train the generator (to have the discriminator label samples as valid)
            g_loss = self.combined.train_on_batch(noise, valid)
during the definition of the model self.combined, the discriminator is set to self.discriminator.trainable = False but never switched back to trainable.
Still, during the training loop the weights of the discriminator will change in these lines:
# Train the discriminator
d_loss_real = self.discriminator.train_on_batch(imgs, valid)
d_loss_fake = self.discriminator.train_on_batch(gen_imgs, fake)
d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
and will stay constant during:
# Train the generator (to have the discriminator label samples as valid)
g_loss = self.combined.train_on_batch(noise, valid)
which I wouldn't expect.
Of course this is the correct (iterative) way to train a GAN, but I don't understand why we don't have to set self.discriminator.trainable = True again before we can do some training on the discriminator.
It would be nice if someone had an explanation for this; I guess it is a crucial point for understanding GAN training.
It's usually a good idea to check the issues (both open and closed) when you have a question about code in a GitHub repo. This issue explains why the flag is set to False. It says:
Since self.discriminator.trainable = False is set after the discriminator is compiled, it will not affect the training of the discriminator. However since it is set before the combined model is compiled the discriminator layers will be frozen when the combined model is trained.
It also talks about freezing Keras layers.
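In other words, the trainable flag is read when a model is compiled, so each compiled model keeps the state it saw at compile time. A minimal sketch of the pattern, using the same names as the snippet above:

# Compiled while trainable=True, so train_on_batch on the discriminator updates its weights.
self.discriminator.compile(loss='binary_crossentropy', optimizer=optimizer)

# Frozen only from the point of view of the combined model, which is compiled afterwards.
self.discriminator.trainable = False
self.combined = Model(z, self.discriminator(self.generator(z)))
self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)

# self.discriminator.train_on_batch(...)  -> discriminator weights change
# self.combined.train_on_batch(...)       -> only generator weights change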
Given input features as such, just raw numbers:
tensor([0.2153, 0.2190, 0.0685, 0.2127, 0.2145, 0.1260, 0.1480, 0.1483, 0.1489,
0.1400, 0.1906, 0.1876, 0.1900, 0.1925, 0.0149, 0.1857, 0.1871, 0.2715,
0.1887, 0.1804, 0.1656, 0.1665, 0.1137, 0.1668, 0.1168, 0.0278, 0.1170,
0.1189, 0.1163, 0.2337, 0.2319, 0.2315, 0.2325, 0.0519, 0.0594, 0.0603,
0.0586, 0.0067, 0.0624, 0.2691, 0.0617, 0.2790, 0.2805, 0.2848, 0.2454,
0.1268, 0.2483, 0.2454, 0.2475], device='cuda:0')
And the expected output is a single real number output, e.g.
tensor(-34.8500, device='cuda:0')
Full code on https://www.kaggle.com/alvations/pytorch-mlp-regression
I've tried creating a simple 2 layer network with:
class MLP(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(MLP, self).__init__()
        self.linear = nn.Linear(input_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, output_size)

    def forward(self, inputs, hidden=None, dropout=0.5):
        inputs = F.dropout(inputs, dropout)  # Drop-in.
        # First Layer.
        output = F.relu(self.linear(inputs))
        # Matrix manipulation magic.
        batch_size, sequence_len, hidden_size = output.shape
        # Technically, linear layer takes a 2-D matrix as input, so more manipulation...
        output = output.contiguous().view(batch_size * sequence_len, hidden_size)
        # Apply dropout.
        output = F.dropout(output, dropout)
        # Put it through the classifier
        # And reshape it to [batch_size x sequence_len x vocab_size]
        output = self.classifier(output).view(batch_size, sequence_len, -1)
        return output
And training as such:
# Training routine.
def train(num_epochs, dataloader, valid_dataset, model, criterion, optimizer):
    losses = []
    valid_losses = []
    learning_rates = []
    plt.ion()
    x_valid, y_valid = valid_dataset
    for _e in range(num_epochs):
        for batch in tqdm(dataloader):
            # Zero gradient.
            optimizer.zero_grad()
            #print(batch)
            this_x = torch.tensor(batch['x'].view(len(batch['x']), 1, -1)).to(device)
            this_y = torch.tensor(batch['y'].view(len(batch['y']), 1, 1)).to(device)

            # Feed forward.
            output = model(this_x)
            prediction, _ = torch.max(output, dim=1)
            loss = criterion(prediction, this_y.view(len(batch['y']), -1))
            loss.backward()
            optimizer.step()
            losses.append(torch.sqrt(loss.float()).data)

            with torch.no_grad():
                # Zero gradient.
                optimizer.zero_grad()
                output = model(x_valid.view(len(x_valid), 1, -1))
                prediction, _ = torch.max(output, dim=1)
                loss = criterion(prediction, y_valid.view(len(y_valid), -1))
                valid_losses.append(torch.sqrt(loss.float()).data)

            clear_output(wait=True)
            plt.plot(losses, label='Train')
            plt.plot(valid_losses, label='Valid')
            plt.legend()
            plt.pause(0.05)
Even after tuning several hyperparameters, it looks like the model doesn't train well - the validation loss doesn't move at all, e.g. with:
hyperparams = Hyperparams(input_size=train_dataset.x.shape[1],
                          output_size=1,
                          hidden_size=150,
                          loss_func=nn.MSELoss,
                          learning_rate=1e-8,
                          optimizer=optim.Adam,
                          batch_size=500)
And its loss curve:
Any idea what's wrong with the network?
Am I training the regression model with the wrong loss? Or I've just not yet found the right hyperparameters?
Or am I validating the model wrongly?
From the code you provided, it is tough to say why the validation loss is constant but I see several problems in your code.
Why do you validate for each training mini-batch? Instead, you should validate your model after you do the training for one complete epoch (iterating over your full dataset once). So, the skeleton should be like:
for _e in range(num_epochs):
    for batch in tqdm(train_dataloader):
        # training code
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            # validation code
    # plot your loss values
Also, you can plot after each epoch, not after each mini-batch training.
Did you check whether the model parameters are getting updated after optimizer.step() during training? How many validation examples do you have? Why don't you use mini-batch computation during validation?
Why do you do: optimizer.zero_grad() during validation? It doesn't make sense because, during validation, you are not going to do anything related to optimization.
You should use model.eval() during validation to turn off the dropouts, and switch back with model.train() before the next training epoch. See the PyTorch documentation to learn about the .train() and .eval() methods; a rough sketch combining these points is given at the end of this answer.
The learning rate is set to 1e-8, isn't it too small? Why don't you use the default learning rate for Adam (1e-3)?
The following requires some reasoning.
Why are you using such a large batch size? What is your training dataset size?
You can directly plot the MSELoss, instead of taking the square root.
My suggestion would be: use some existing resources for MLP in PyTorch. Don't do it from scratch if you do not have sufficient knowledge at this point. It would make you suffer a lot.
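Putting the validation-related points above together, the loop could look roughly like this (only a sketch: names follow the question's code, and valid_dataloader is an assumed mini-batch loader for the validation set):

losses, valid_losses = [], []
for _e in range(num_epochs):
    model.train()                      # dropout active during training
    for batch in tqdm(dataloader):
        optimizer.zero_grad()
        this_x = batch['x'].view(len(batch['x']), 1, -1).to(device)
        this_y = batch['y'].view(len(batch['y']), 1, 1).to(device)
        output = model(this_x)
        prediction, _ = torch.max(output, dim=1)
        loss = criterion(prediction, this_y.view(len(batch['y']), -1))
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    model.eval()                       # dropout switched off for validation
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            x_v = batch['x'].view(len(batch['x']), 1, -1).to(device)
            y_v = batch['y'].view(len(batch['y']), 1, 1).to(device)
            output = model(x_v)
            prediction, _ = torch.max(output, dim=1)
            valid_losses.append(criterion(prediction, y_v.view(len(batch['y']), -1)).item())

    # plot losses / valid_losses once per epoch, e.g. with plt.plot as in the question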