Related
I am building a simple autoencoder followed by an MLP neural nets. Regarging the autoencoder I am not running into any problem
# ---- Prepare training set ----
x_data = train_set_categorical.drop(["churn"], axis=1).to_numpy()
labels = train_set_categorical.loc[:, "churn"].to_numpy()
dataset = TensorDataset(torch.Tensor(x_data), torch.Tensor(labels) )
loader = DataLoader(dataset, batch_size=127)
# ---- Model Initialization ----
model = AE()
# Validation using MSE Loss function
loss_function = nn.MSELoss()
# Using an Adam Optimizer with lr = 0.1
optimizer = torch.optim.Adam(model.parameters(),
lr = 1e-1,
weight_decay = 1e-8)
epochs = 50
outputs = []
losses = []
for epoch in range(epochs):
for (image, _) in loader:
# Output of Autoencoder
embbeding, reconstructed = model(image)
# Calculating the loss function
loss = loss_function(reconstructed, image)
# The gradients are set to zero,
# the the gradient is computed and stored.
# .step() performs parameter update
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Storing the losses in a list for plotting
losses.append(loss)
if epoch == 49:
outputs.append(embbeding)
But then I am feeding an MLP with the outcome of the autoencoder and this is where things starts to fail
class Feedforward(torch.nn.Module):
def __init__(self):
super().__init__()
self.neural = torch.nn.Sequential(
torch.nn.Linear(33, 260),
torch.nn.ReLU(),
torch.nn.Linear(260, 450),
torch.nn.ReLU(),
torch.nn.Linear(450, 260),
torch.nn.ReLU(),
torch.nn.Linear(260, 1),
torch.nn.Sigmoid()
)
def forward(self, x):
outcome = self.neural(x.float())
return outcome
modelz = Feedforward()
criterion = torch.nn.BCELoss()
opt = torch.optim.Adam(modelz.parameters(), lr = 0.01)
modelz.train()
epoch = 20
for epoch in range(epoch):
opt.zero_grad()
# Forward pass
y_pred = modelz(x_train)
# Compute Loss
loss_2 = criterion(y_pred.squeeze(), torch.tensor(y_train).to(torch.float32))
#print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
# Backward pass
loss_2.backward()
opt.step()
I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.
Of course I have tried to add "retain_graph=True" to both backwards or only the first one but it does not seem to solve the problem. If I launch both code independently from another It works but as a sequence I don't know why but it is not.
You should be able to disconnect the output of the auto-encoder from the model by calling embbeding.detach(), before appending it to outputs.
UPDATE: To solve this, I kept the checkpoint structure the same but wrote a custom train_step function, with the help of the repo linked in the accepted answer of the question linked below, which calculated the gradients and used apply_weights rather than compiling the model and using train_on_batch. This lets the full GAN state be restored. Sadly, with this method I'm fairly sure the dropout layers no longer work as the discriminator is able to work perfectly very early in the training which prevents the model from training properly. Nevertheless, the original problem is solved.
Original:
I am currently training a GAN in keras and trying to make it so that I can save the model and resume training later. Ordinarily in keras you'd simply use model.save(), however for a GAN if the discriminator and GAN (combined generator and discriminator, with discriminator weights not trainable) models are saved and loaded separately then the link between them is broken and the GAN will not function as expected. Someone asked a similar question here, How to save and resume training a GAN with multiple model parts with Tensorflow 2/ Keras, and was told to use tf.train.Checkpoint instead to save the full model at once as a checkpoint.
I've tried implementing this as follows:
def train(epochs, batch_size):
checkpoint = tf.train.Checkpoint(g_optimizer=g_optimizer,
d_optimizer=d_optimizer,
generator=generator,
discriminator=discriminator,
gan=gan
)
ckpt_manager = tf.train.CheckpointManager(checkpoint, 'checkpoints', max_to_keep=3)
if ckpt_manager.latest_checkpoint:
checkpoint.restore(ckpt_manager.latest_checkpoint)
discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
i = Input(shape=(None, latent_dims))
lcs = generator(i)
discriminator.trainable = False
valid = discriminator(lcs)
gan = Model(i, valid)
gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)
for epoch in epochs:
#train discriminator...
#train generator...
ckpt_manager.save()
where g_optimizer, d_optimizer are just tf.keras.optimizers.Adam objects and generator, discriminator and gan are tf.keras.Model objects.
When I use this approach, the link between the gan model and the discriminator is preserved after loading in the checkpoint. The training works normally at first, but after I stop and then resume training using the checkpoint the discriminator loss starts massively increasing and the generated data becomes nonsensical.
Recompiling the models are loading the checkpoint like this was only way I could think of doing it which uses the last state of the optimizer, but clearly something isn't right - rather than resuming the training from where it was, this approach is massively disrupting the training.
Have I used tf.train.Checkpoint incorrectly for what I'm trying to do? Please let me know if there's any more information you need to be able to address the question.
Edit, have added full code by request:
Here is the code that creates the models in the first place and then trains them, in this setup the models are compiled initially when first created, and then compiled again if resuming from a checkpoint using the latest optimizer state. I appreciate it's weird to compile twice but I couldn't think of another way to use the latest optimizer state from the checkpoint, if there's a better way I'm very happy to change it. Note, the unusual GRU-based GAN is because I'm testing out being able to generate variable length time-series. There's a lot of data specific stuff in there but hopefully on the whole it makes sense. train_df is just a pandas DataFrame containing all the training data
def build_generator():
input = Input(shape=(None, latent_dims))
gru1 = GRU(100, activation='relu', return_sequences=True)(input)
gru2 = GRU(100, activation='relu', return_sequences=True (gru1)
output = GRU(9, return_sequences=True, activation='sigmoid')(gru2)
model = Model(input, output)
return model
def build_discriminator():
input = Input(shape=(None, 9))
gru1 = GRU(100, return_sequences=True)(input)
gru2 = GRU(100, return_sequences=True)(gru1)
output = GRU(1, activation='sigmoid')(gru2)
model = Model(input, output)
return model
d_optimizer = opt.Adam(learning_rate=lr)
g_optimizer = opt.Adam(learning_rate=lr)
# Build discriminator
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
# Build generator
generator = build_generator()
# Build combined model
i = Input(shape=(None, latent_dims))
lcs = generator(i)
discriminator.trainable = False
valid = discriminator(lcs)
gan = Model(i, valid)
gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)
def train(epochs, batch_size=1): #Only works with batch size of 1 currently
sne = train_df.sn.unique()
n_batches = int(len(sne) / batch_size)
rng = np.random.default_rng(123)
checkpoint = tf.train.Checkpoint(g_optimizer=g_optimizer,
d_optimizer=d_optimizer,
generator=generator,
discriminator=discriminator,
gan=gan
)
ckpt_manager = tf.train.CheckpointManager(checkpoint, 'checkpoints', max_to_keep=3)
if ckpt_manager.latest_checkpoint:
checkpoint.restore(ckpt_manager.latest_checkpoint)
discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
i = Input(shape=(None, latent_dims))
lcs = generator(i)
discriminator.trainable = False
valid = discriminator(lcs)
gan = Model(i, valid)
gan.compile(loss='binary_crossentropy', optimizer=g_optimizer)
for epoch in range(epochs):
rng.shuffle(sne)
g_losses, d_losses = [], []
for batch in range(n_batches):
real = np.random.uniform(0.0, 0.1, (batch_size, 1)) # Used instead of np.zeros to avoid zero gradients
fake = np.random.uniform(0.9, 1.0, (batch_size, 1)) # Used instead of np.ones to avoid zero gradients
# Select real data
sn = sne[batch]
sndf = train_df[train_df.sn == sn]
X = sndf[['g_t', 'r_t', 'i_t', 'z_t', 'g', 'r', 'i', 'z', 'g_err', 'r_err', 'i_err', 'z_err']].values
X = X.reshape((1, *X.shape))
noise = rand.normal(size=(batch_size, latent_dims))
noise = np.reshape(noise, (batch_size, 1, latent_dims))
noise = np.repeat(noise, X.shape[1], 1)
gen_lcs = generator.predict(noise)
# Train discriminator
d_loss_real = discriminator.train_on_batch(X, real)
d_loss_fake = discriminator.train_on_batch(gen_lcs, fake)
d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
# Train generator
noise = rand.normal(size=(2 * batch_size, latent_dims))
noise = np.reshape(noise, (2 * batch_size, 1, latent_dims))
noise = np.repeat(noise, X.shape[1], 1)
gen_labels = np.zeros((2 * batch_size, 1))
g_loss = gan.train_on_batch(noise, gen_labels)
g_losses.append(g_loss)
d_losses.append(d_loss)
ckpt_manager.save()
full_g_loss = np.mean(g_losses)
full_d_loss = np.mean(d_losses)
print(f'{epoch + 1}/{epochs} g_loss={full_g_loss}, d_loss={full_d_loss})
train()
If you have the following checkpoint structure, your model should work properly:
checkpoint_dir = 'checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(generator_opt=generator_opt,
discriminator_opt=discriminator_opt,
gan_opt=gan_opt,
generator=generator,
discriminator=discriminator,
GAN = GAN
)
ckpt_manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)
if ckpt_manager.latest_checkpoint:
checkpoint.restore(ckpt_manager.latest_checkpoint)
print ('Latest checkpoint restored!!')
Note that the GAN model has its own optimizer. And then in your training loop, just save checkpoints at certain intervals, for example every 10 epochs.
for epoch in range(epochs):
...
...
...
if epoch%10 == 0:
ckpt_manager.save()
I have the below code for a binary classification and it works fine but i would like to modify the nn.Sequential parameters and add an BiLSTM layer. I have the below code:
class BertClassifier(nn.Module):
def __init__(self, freeze_bert=False):
super(BertClassifier, self).__init__()
# Specify hidden size of BERT, hidden size of our classifier, and number of labels
D_in, H, D_out = 768, 50, 2
# Instantiate BERT model
self.bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
# Instantiate an one-layer feed-forward classifier
self.classifier = nn.Sequential(nn.Linear(D_in, H),nn.ReLU(),nn.Linear(H, D_out))
# Freeze the BERT model
if freeze_bert:
for param in self.bert.parameters():
param.requires_grad = False
def forward(self, input_ids, attention_mask):
# Feed input to BERT
outputs = self.bert(input_ids=input_ids,attention_mask=attention_mask)
# Extract the last hidden state of the token `[CLS]` for classification task
last_hidden_state_cls = outputs[0][:, 0, :]
# Feed input to classifier to compute logits
logits = self.classifier(last_hidden_state_cls)
return logits
I have tried to modify the sequential like this self.classifier = nn.Sequential(nn.LSTM(D_in, H, batch_first=True, bidirectional=True),nn.ReLU(),nn.Linear(H, D_out)) but then it throws the error RuntimeError: input must have 3 dimensions, got 2 on line logits = self.classifier(last_hidden_state_cls). I found that I can use nn.ModuleDict instead of nn.Sequential and i made the below :
self.classifier = nn.ModuleDict({
'lstm': nn.LSTM(input_size=D_in, hidden_size=H,batch_first=True, bidirectional=True ),
'linear': nn.Linear(in_features=H,out_features=D_out)})
But now I'm having issues computing the forward function with this. Can someone advice how i can properly modify the forward function?
Update: I also installed CUDA and now when I run the code it returns the error CUDA out of memory. Tried to allocate 16.00 MiB and I tried to lower the batch size but that doesn't fix the problem. I also tried the below but didn't resolved either. Any advice, please?
import torch, gc
gc.collect()
torch.cuda.empty_cache()
Update with the code:
MAX_LEN = 64
# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32
VALID_BATCH_SIZE = 4
file1 = open('MH.txt', 'r')
list_com = []
list_label = []
for line in file1:
possible_labels = 'positive|negative'
label = re.findall(possible_labels, line)
line = re.sub(possible_labels, ' ', line)
line = re.sub('\n', ' ', line)
list_com.append(line)
list_label.append(label[0])
list_tuples = list(zip(list_com, list_label))
file1.close()
labels = ['positive', 'negative']
df = pd.DataFrame(list_tuples, columns=['text', 'label'])
df['label'] = df['label'].map({'positive': 1, 'negative': 0})
for i in range(0,len(df['label'])):
list_label[i] = df['label'][i]
#print(df)
#print(df['label'].value_counts())
X = df.text.values
y = df.label.values
X_train, X_val, y_train, y_val =\
train_test_split(X, y, test_size=0.1, random_state=2020)
def text_preprocessing(text):
# Remove '#name'
text = re.sub(r'(#.*?)[\s]', ' ', text)
# Replace '&' with '&'
text = re.sub(r'&', '&', text)
# Remove trailing whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
input_ids = []
attention_masks = []
for sent in data:
encoded_sent = tokenizer.encode_plus(
text=text_preprocessing(sent), # Preprocess sentence
add_special_tokens=True, # Add `[CLS]` and `[SEP]`
max_length=MAX_LEN, # Max length to truncate/pad
pad_to_max_length=True, # Pad sentence to max length
# return_tensors='pt', # Return PyTorch tensor
return_attention_mask=True # Return attention mask
)
# Add the outputs to the lists
input_ids.append(encoded_sent.get('input_ids'))
attention_masks.append(encoded_sent.get('attention_mask'))
# Convert lists to tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
return input_ids, attention_masks
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)
# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)
# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
# Create the BertClassfier class
class BertClassifier(nn.Module):
"""Bert Model for Classification Tasks."""
def __init__(self, freeze_bert=False):
"""
#param bert: a BertModel object
#param classifier: a torch.nn.Module classifier
#param freeze_bert (bool): Set `False` to fine-tune the BERT model
"""
super(BertClassifier, self).__init__()
# Specify hidden size of BERT, hidden size of our classifier, and number of labels
D_in, H, D_out = 768, 50, 2
# Instantiate BERT model
self.bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
# Instantiate an one-layer feed-forward classifier
self.classifier = nn.ModuleDict({
'lstm': nn.LSTM(input_size=D_in, hidden_size=H, batch_first=True, bidirectional=True),
'linear': nn.Linear(in_features=H, out_features=D_out)})
# Freeze the BERT model
if freeze_bert:
for param in self.bert.parameters():
param.requires_grad = False
def forward(self, input_ids, attention_mask):
outputs = self.bert(input_ids=input_ids,attention_mask=attention_mask)
sequence_output = outputs[0]
sequence_output, _ = self.lstm(sequence_output)
linear_output = self.linear(sequence_output[:, -1])
return linear_output
def initialize_model(epochs=4):
# Instantiate Bert Classifier
bert_classifier = BertClassifier(freeze_bert=False)
print(bert_classifier)
# Tell PyTorch to run the model on GPU
bert_classifier.to(device)
# Create the optimizer
optimizer = AdamW(bert_classifier.parameters(), lr=5e-5)
# Total number of training steps
total_steps = len(train_dataloader) * epochs
# Set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=0,num_training_steps=total_steps)
return bert_classifier, optimizer, scheduler
# Specify loss function
loss_fn = nn.CrossEntropyLoss()
def set_seed(seed_value=42):
"""Set seed for reproducibility."""
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
"""Train the BertClassifier model."""
# Start training loop
print("Start training...\n")
for epoch_i in range(epochs):
# Print the header of the result table
print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
print("-" * 70)
# Measure the elapsed time of each epoch
t0_epoch, t0_batch = time.time(), time.time()
# Reset tracking variables at the beginning of each epoch
total_loss, batch_loss, batch_counts = 0, 0, 0
# Put the model into the training mode
model.train()
# For each batch of training data...
for step, batch in enumerate(train_dataloader):
batch_counts += 1
# Load batch to GPU
b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
# Zero out any previously calculated gradients
model.zero_grad()
# Perform a forward pass. This will return logits.
logits = model(b_input_ids, b_attn_mask)
# Compute loss and accumulate the loss values
loss = loss_fn(logits, b_labels)
batch_loss += loss.item()
total_loss += loss.item()
# Perform a backward pass to calculate gradients
loss.backward()
# Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Update parameters and the learning rate
optimizer.step()
scheduler.step()
# Print the loss values and time elapsed for every 20 batches
if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
# Calculate time elapsed for 20 batches
time_elapsed = time.time() - t0_batch
# Print training results
print(
f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")
# Reset batch tracking variables
batch_loss, batch_counts = 0, 0
t0_batch = time.time()
# Calculate the average loss over the entire training data
avg_train_loss = total_loss / len(train_dataloader)
print("-" * 70)
#Evaluation
if evaluation == True:
# After the completion of each training epoch, measure the model's performance
# on our validation set.
val_loss, val_accuracy = evaluate(model, val_dataloader)
# Print performance over the entire training data
time_elapsed = time.time() - t0_epoch
print(
f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
print("-" * 70)
print("\n")
print("Training complete!")
def evaluate(model, val_dataloader):
"""After the completion of each training epoch, measure the model's performance
on our validation set.
"""
# Put the model into the evaluation mode. The dropout layers are disabled during
# the test time.
model.eval()
# Tracking variables
val_accuracy = []
val_loss = []
# For each batch in our validation set...
for batch in val_dataloader:
# Load batch to GPU
b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
# Compute logits
with torch.no_grad():
logits = model(b_input_ids, b_attn_mask)
# Compute loss
loss = loss_fn(logits, b_labels)
val_loss.append(loss.item())
# Get the predictions
preds = torch.argmax(logits, dim=1).flatten()
# Calculate the accuracy rate
accuracy = (preds == b_labels).cpu().numpy().mean() * 100
val_accuracy.append(accuracy)
# Compute the average accuracy and loss over the validation set.
val_loss = np.mean(val_loss)
val_accuracy = np.mean(val_accuracy)
return val_loss, val_accuracy
def accuracy(probs, y_true):
"""
- Print AUC and accuracy on the test set
#params probs (np.array): an array of predicted probabilities with shape (len(y_true), 2)
#params y_true (np.array): an array of the true values with shape (len(y_true),)
fpr, tpr, threshold = roc_curve(y_true, preds)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.4f}')
"""
preds = probs[:, 1]
# Get accuracy over the test set
y_pred = np.where(preds >= 0.5, 1, 0)
accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
def bert_predict(model, test_dataloader):
"""Perform a forward pass on the trained BERT model to predict probabilities on the test set."""
# Put the model into the evaluation mode. The dropout layers are disabled during the test time.
model.eval()
all_logits = []
# For each batch in our test set...
for batch in test_dataloader:
# Load batch to GPU
b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]
# Compute logits
with torch.no_grad():
logits = model(b_input_ids, b_attn_mask)
all_logits.append(logits)
# Concatenate logits from each batch
all_logits = torch.cat(all_logits, dim=0)
# Apply softmax to calculate probabilities
probs = F.softmax(all_logits, dim=1).cpu().numpy()
return probs
set_seed(42) # Set seed for reproducibility
bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
# start training
train(bert_classifier, train_dataloader, val_dataloader, epochs=3, evaluation=True)
# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, val_dataloader)
# Evaluate the Bert classifier
accuracy(probs, y_val)
Given input features as such, just raw numbers:
tensor([0.2153, 0.2190, 0.0685, 0.2127, 0.2145, 0.1260, 0.1480, 0.1483, 0.1489,
0.1400, 0.1906, 0.1876, 0.1900, 0.1925, 0.0149, 0.1857, 0.1871, 0.2715,
0.1887, 0.1804, 0.1656, 0.1665, 0.1137, 0.1668, 0.1168, 0.0278, 0.1170,
0.1189, 0.1163, 0.2337, 0.2319, 0.2315, 0.2325, 0.0519, 0.0594, 0.0603,
0.0586, 0.0067, 0.0624, 0.2691, 0.0617, 0.2790, 0.2805, 0.2848, 0.2454,
0.1268, 0.2483, 0.2454, 0.2475], device='cuda:0')
And the expected output is a single real number output, e.g.
tensor(-34.8500, device='cuda:0')
Full code on https://www.kaggle.com/alvations/pytorch-mlp-regression
I've tried creating a simple 2 layer network with:
class MLP(nn.Module):
def __init__(self, input_size, output_size, hidden_size):
super(MLP, self).__init__()
self.linear = nn.Linear(input_size, hidden_size)
self.classifier = nn.Linear(hidden_size, output_size)
def forward(self, inputs, hidden=None, dropout=0.5):
inputs = F.dropout(inputs, dropout) # Drop-in.
# First Layer.
output = F.relu(self.linear(inputs))
# Matrix manipulation magic.
batch_size, sequence_len, hidden_size = output.shape
# Technically, linear layer takes a 2-D matrix as input, so more manipulation...
output = output.contiguous().view(batch_size * sequence_len, hidden_size)
# Apply dropout.
output = F.dropout(output, dropout)
# Put it through the classifier
# And reshape it to [batch_size x sequence_len x vocab_size]
output = self.classifier(output).view(batch_size, sequence_len, -1)
return output
And training as such:
# Training routine.
def train(num_epochs, dataloader, valid_dataset, model, criterion, optimizer):
losses = []
valid_losses = []
learning_rates = []
plt.ion()
x_valid, y_valid = valid_dataset
for _e in range(num_epochs):
for batch in tqdm(dataloader):
# Zero gradient.
optimizer.zero_grad()
#print(batch)
this_x = torch.tensor(batch['x'].view(len(batch['x']), 1, -1)).to(device)
this_y = torch.tensor(batch['y'].view(len(batch['y']), 1, 1)).to(device)
# Feed forward.
output = model(this_x)
prediction, _ = torch.max(output, dim=1)
loss = criterion(prediction, this_y.view(len(batch['y']), -1))
loss.backward()
optimizer.step()
losses.append(torch.sqrt(loss.float()).data)
with torch.no_grad():
# Zero gradient.
optimizer.zero_grad()
output = model(x_valid.view(len(x_valid), 1, -1))
prediction, _ = torch.max(output, dim=1)
loss = criterion(prediction, y_valid.view(len(y_valid), -1))
valid_losses.append(torch.sqrt(loss.float()).data)
clear_output(wait=True)
plt.plot(losses, label='Train')
plt.plot(valid_losses, label='Valid')
plt.legend()
plt.pause(0.05)
Tuning several hyperparameters, it looks like the model doesn't train well, the validation loss doesn't move at all e.g.
hyperparams = Hyperparams(input_size=train_dataset.x.shape[1],
output_size=1,
hidden_size=150,
loss_func=nn.MSELoss,
learning_rate=1e-8,
optimizer=optim.Adam,
batch_size=500)
And it's loss curve:
Any idea what's wrong with the network?
Am I training the regression model with the wrong loss? Or I've just not yet found the right hyperparameters?
Or am I validating the model wrongly?
From the code you provided, it is tough to say why the validation loss is constant but I see several problems in your code.
Why do you validate for each training mini-batch? Instead, you should validate your model after you do the training for one complete epoch (iterating over your full dataset once). So, the skeleton should be like:
for _e in range(num_epochs):
for batch in tqdm(train_dataloader):
# training code
with torch.no_grad():
for batch in tqdm(valid_dataloader):
# validation code
# plot your loss values
Also, you can plot after each epoch, not after each mini-batch training.
Did you check whether the model parameters are getting updated after optimizer.step() during training? How many validation examples do you have? Why don't you use mini-batch computation during validation?
Why do you do: optimizer.zero_grad() during validation? It doesn't make sense because, during validation, you are not going to do anything related to optimization.
You should use model.eval() during validation to turn off the dropouts. See PyTorch documentation to learn about .train() and .eval() methods.
The learning rate is set to 1e-8, isn't it too small? Why don't you use the default learning rate for Adam (1e-3)?
The following requires some reasoning.
Why are you using such a large batch size? What is your training dataset size?
You can directly plot the MSELoss, instead of taking the square root.
My suggestion would be: use some existing resources for MLP in PyTorch. Don't do it from scratch if you do not have sufficient knowledge at this point. It would make you suffer a lot.
Hi I am trying to modify the mnist example to match it to my dataset. I only try to use the mlp example and it gives a strange error.
Tha dataset is a matrix with 2100 rows and 17 columns, and the output should be one of the 16 possible classes. The error seems happening in the secon phase of the training. The model is build correctly (log info confirmed).
Here is the error log:
ValueError: y_i value out of bounds
Apply node that caused the error:
CrossentropySoftmaxArgmax1HotWithBias(Dot22.0, b, targets)
Toposort index: 33
Inputs types: [TensorType(float64, matrix), TensorType(float64, vector), >TensorType(int32, vector)]
Inputs shapes: [(100, 17), (17,), (100,)]
Inputs strides: [(136, 8), (8,), (4,)]
Inputs values: ['not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{acc_dtype=float64}(CrossentropySoftmaxArgmax1HotWithBias.0)], [CrossentropySoftmax1HotWithBiasDx(Assert{msg='sm and dy do not have the same shape.'}.0, CrossentropySoftmaxArgmax1HotWithBias.1, targets)], []]
HINT: Re-running with most Theano optimization disabled could give you a >back-trace of when this node was created. This can be done with by >setting the Theano flag 'optimizer=fast_compile'. If that does not work, >Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
Here is the code:
def build_mlp(input_var=None):
l_in = lasagne.layers.InputLayer(shape=(None, 16),
input_var=input_var)
# Apply 20% dropout to the input data:
l_in_drop = lasagne.layers.DropoutLayer(l_in, p=0.2)
# Add a fully-connected layer of 800 units, using the linear rectifier, and
# initializing weights with Glorot's scheme (which is the default anyway):
l_hid1 = lasagne.layers.DenseLayer(
l_in_drop, num_units=10,
nonlinearity=lasagne.nonlinearities.rectify,
W=lasagne.init.GlorotUniform())
# We'll now add dropout of 50%:
l_hid1_drop = lasagne.layers.DropoutLayer(l_hid1, p=0.5)
# Another 800-unit layer:
l_hid2 = lasagne.layers.DenseLayer(
l_hid1_drop, num_units=10,
nonlinearity=lasagne.nonlinearities.rectify)
# 50% dropout again:
l_hid2_drop = lasagne.layers.DropoutLayer(l_hid2, p=0.5)
# Finally, we'll add the fully-connected output layer, of 10 softmax units:
l_out = lasagne.layers.DenseLayer(
l_hid2_drop, num_units=17,
nonlinearity=lasagne.nonlinearities.softmax)
# Each layer is linked to its incoming layer(s), so we only need to pass
# the output layer to give access to a network in Lasagne:
return l_out
def main(model='mlp', num_epochs=300):
# Load the dataset
print("Loading data...")
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
# Prepare Theano variables for inputs and targets
input_var = T.matrix('inputs')
target_var = T.ivector('targets')
# Create neural network model (depending on first command line parameter)
print("Building model and compiling functions...")
if model == 'cnn':
network = build_cnn(input_var)
elif model == 'mlp':
network = build_mlp(input_var)
elif model == 'lstm':
network = build_lstm(input_var)
else:
print("Unrecognized model type %r." % model)
# Create a loss expression for training, i.e., a scalar objective we want
# to minimize (for our multi-class problem, it is the cross-entropy loss):
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var)
loss = loss.mean()
# We could add some weight decay as well here, see lasagne.regularization.
# Create update expressions for training, i.e., how to modify the
# parameters at each training step. Here, we'll use Stochastic Gradient
# Descent (SGD) with Nesterov momentum, but Lasagne offers plenty more.
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(
loss, params, learning_rate=0.01, momentum=0.9)
# Create a loss expression for validation/testing. The crucial difference
# here is that we do a deterministic forward pass through the network,
# disabling dropout layers.
test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_loss = lasagne.objectives.categorical_crossentropy(test_prediction,
target_var)
test_loss = test_loss.mean()
# As a bonus, also create an expression for the classification accuracy:
test_acc = T.mean(T.eq(T.argmax(test_prediction, axis=1), target_var),
dtype=theano.config.floatX)
# Compile a function performing a training step on a mini-batch (by giving
# the updates dictionary) and returning the corresponding training loss:
train_fn = theano.function([input_var, target_var], loss, updates=updates)
# Compile a second function computing the validation loss and accuracy:
val_fn = theano.function([input_var, target_var], [test_loss, test_acc])
# Finally, launch the training loop.
print("Starting training...")
# We iterate over epochs:
for epoch in range(num_epochs):
# In each epoch, we do a full pass over the training data:
train_err = 0
train_batches = 0
start_time = time.time()
for batch in iterate_minibatches(X_train, y_train, 100, shuffle=True):
inputs, targets = batch
train_err += train_fn(inputs, targets)
train_batches += 1
# And a full pass over the validation data:
val_err = 0
val_acc = 0
val_batches = 0
for batch in iterate_minibatches(X_val, y_val, 100, shuffle=False):
inputs, targets = batch
err, acc = val_fn(inputs, targets)
val_err += err
val_acc += acc
val_batches += 1
# Then we print the results for this epoch:
print("Epoch {} of {} took {:.3f}s".format(
epoch + 1, num_epochs, time.time() - start_time))
print(" training loss:\t\t{:.6f}".format(train_err / train_batches))
print(" validation loss:\t\t{:.6f}".format(val_err / val_batches))
print(" validation accuracy:\t\t{:.2f} %".format(
val_acc / val_batches * 100))
# After training, we compute and print the test error:
test_err = 0
test_acc = 0
test_batches = 0
for batch in iterate_minibatches(X_test, y_test, 100, shuffle=False):
inputs, targets = batch
err, acc = val_fn(inputs, targets)
test_err += err
test_acc += acc
test_batches += 1
print("Final results:")
print(" test loss:\t\t\t{:.6f}".format(test_err / test_batches))
print(" test accuracy:\t\t{:.2f} %".format(
test_acc / test_batches * 100))
I Figured out the problem:
my dataset does not have an output for every target, becouse it is too small! There are 17 target outputs but my dataset has only 16 different outputs, and it is missing examples of the 17th output.
In order to resolve this problem, just change the softmax with rectify,
from this:
l_out = lasagne.layers.DenseLayer(
l_hid2_drop, num_units=17,
nonlinearity=lasagne.nonlinearities.softmax)
to this:
l_out = lasagne.layers.DenseLayer(
l_hid2_drop, num_units=17,
nonlinearity=lasagne.nonlinearities.rectify)