Out of Memory When Performing Gradient Accumulation in Tensorflow

Out of Memory When Performing Gradient Accumulation in Tensorflow - python

I am trying to implement gradient accumulation for a twitter sentiment analysis model using HuggingFace's BERT model. However, when I go to implement gradient accumulation with a batch size of 64, I get the dreaded "OOM" error. Oddly enough, when I go to run my same model with a batch size of 64 and not using gradient accumulation, it trains right through. Does anyone know why this is and/or if my code is wrong?
batch_size = 32
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
vocabulary = tokenizer.get_vocab()
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
# data preprocessing
tweets_pos = pd.read_csv('C:/1_Tweets.csv', sep=',', names = ['Tweet', 'Sentiment'])
tweets_neg = pd.read_csv('C:/0_Tweets.csv', sep=',', names = ['Tweet', 'Sentiment'])
data = pd.concat([tweets_pos, tweets_neg], axis=0)
data = data.sample(frac=1)
all_tweets = data['Tweet'].to_list()
all_sentiment = data['Sentiment'].to_list()
training_tweets = all_tweets[0:512]
training_labels = all_sentiment[0:512]
# create dataset
def create_dataset(tweets, labels):
inputs_ids_list = []
token_type_ids_list = []
attention_mask_list = []
label_list = []
for i in range(len(tweets)):
encoded = tokenizer.encode_plus(tweets[i], max_length = 512, pad_to_max_length=True, return_attention_mask=True, add_special_tokens=True)
inputs_ids_list.append(encoded['input_ids'])
token_type_ids_list.append(encoded['token_type_ids'])
attention_mask_list.append(encoded['attention_mask'])
label_list.append([labels[i]])
ids_and_mask = {'input_ids':inputs_ids_list, 'token_type_ids':token_type_ids_list,'attention_mask':attention_mask_list}
return tf.data.Dataset.from_tensor_slices((ids_and_mask, label_list))
# create dataset of batch_size = 32
train_dataset = create_dataset(training_tweets, training_labels).batch(batch_size)
# Accumulate Gradients
num_epochs = 1
for i in range(num_epochs):
print(f'Epoch: {i + 1}')
total_loss = 0
# get trainable variables
train_vars = model.trainable_variables
accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]
for (batch, (tweets, labels)) in enumerate(train_dataset):
labels = tf.dtypes.cast(labels, tf.float32)
with tf.GradientTape() as tape:
prediction = model(tweets, training=True)
prediction = tf.dtypes.cast(prediction, tf.float32)
loss_value = loss(y_true=labels, y_pred=prediction)
total_loss += loss_value
# get gradients of this tape
gradients = tape.gradient(loss_value, train_vars)
# Accumulate the gradients
accum_gradient = [(acum_grad+grad) for acum_grad, grad in zip(accum_gradient, gradients)]
# average gradients and apply the optimization step
accum_gradient = [this_grad/batch_size for this_grad in accum_gradient]
optimizer.apply_gradients(zip(accum_gradient,train_vars))
epoch_loss = total_loss / batch_size
print(f'Epoch loss: {epoch_loss}')

I know I'm a bit late for this, but on your question, you are already answering it.
It holds the graph in the memory to correctly calculate the gradient when accumulating gradients. In other words, when accumulating, you have every forward that you have already done in your memory. When not accumulating and using a batch size of 64, TensorFlow flushes the graph after back propping it.
This scenario is why you can train with this batch size of 64 but not accumulate with 64. I don't know why you are trying to accumulate it. However, maybe downsizing your batch size a little if you need it to accumulate.

Related

How to add BiLSTM on top of BERT from Huggingface + CUDA out of memory. Tried to allocate 16.00 MiB

I have the below code for a binary classification and it works fine but i would like to modify the nn.Sequential parameters and add an BiLSTM layer. I have the below code:
class BertClassifier(nn.Module):
def __init__(self, freeze_bert=False):
super(BertClassifier, self).__init__()
# Specify hidden size of BERT, hidden size of our classifier, and number of labels
D_in, H, D_out = 768, 50, 2
# Instantiate BERT model
self.bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
# Instantiate an one-layer feed-forward classifier
self.classifier = nn.Sequential(nn.Linear(D_in, H),nn.ReLU(),nn.Linear(H, D_out))
# Freeze the BERT model
if freeze_bert:
for param in self.bert.parameters():
param.requires_grad = False
def forward(self, input_ids, attention_mask):
# Feed input to BERT
outputs = self.bert(input_ids=input_ids,attention_mask=attention_mask)
# Extract the last hidden state of the token `[CLS]` for classification task
last_hidden_state_cls = outputs[0][:, 0, :]
# Feed input to classifier to compute logits
logits = self.classifier(last_hidden_state_cls)
return logits
I have tried to modify the sequential like this self.classifier = nn.Sequential(nn.LSTM(D_in, H, batch_first=True, bidirectional=True),nn.ReLU(),nn.Linear(H, D_out)) but then it throws the error RuntimeError: input must have 3 dimensions, got 2 on line logits = self.classifier(last_hidden_state_cls). I found that I can use nn.ModuleDict instead of nn.Sequential and i made the below :
self.classifier = nn.ModuleDict({
'lstm': nn.LSTM(input_size=D_in, hidden_size=H,batch_first=True, bidirectional=True ),
'linear': nn.Linear(in_features=H,out_features=D_out)})
But now I'm having issues computing the forward function with this. Can someone advice how i can properly modify the forward function?
Update: I also installed CUDA and now when I run the code it returns the error CUDA out of memory. Tried to allocate 16.00 MiB and I tried to lower the batch size but that doesn't fix the problem. I also tried the below but didn't resolved either. Any advice, please?
import torch, gc
gc.collect()
torch.cuda.empty_cache()
Update with the code:
MAX_LEN = 64
# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32
VALID_BATCH_SIZE = 4
file1 = open('MH.txt', 'r')
list_com = []
list_label = []
for line in file1:
possible_labels = 'positive|negative'
label = re.findall(possible_labels, line)
line = re.sub(possible_labels, ' ', line)
line = re.sub('\n', ' ', line)
list_com.append(line)
list_label.append(label[0])
list_tuples = list(zip(list_com, list_label))
file1.close()
labels = ['positive', 'negative']
df = pd.DataFrame(list_tuples, columns=['text', 'label'])
df['label'] = df['label'].map({'positive': 1, 'negative': 0})
for i in range(0,len(df['label'])):
list_label[i] = df['label'][i]
#print(df)
#print(df['label'].value_counts())
X = df.text.values
y = df.label.values
X_train, X_val, y_train, y_val =\
train_test_split(X, y, test_size=0.1, random_state=2020)
def text_preprocessing(text):
# Remove '#name'
text = re.sub(r'(#.*?)[\s]', ' ', text)
# Replace '&' with '&'
text = re.sub(r'&', '&', text)
# Remove trailing whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
input_ids = []
attention_masks = []
for sent in data:
encoded_sent = tokenizer.encode_plus(
text=text_preprocessing(sent), # Preprocess sentence
add_special_tokens=True, # Add `[CLS]` and `[SEP]`
max_length=MAX_LEN, # Max length to truncate/pad
pad_to_max_length=True, # Pad sentence to max length
# return_tensors='pt', # Return PyTorch tensor
return_attention_mask=True # Return attention mask
)
# Add the outputs to the lists
input_ids.append(encoded_sent.get('input_ids'))
attention_masks.append(encoded_sent.get('attention_mask'))
# Convert lists to tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
return input_ids, attention_masks
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)
# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)
# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
# Create the BertClassfier class
class BertClassifier(nn.Module):
"""Bert Model for Classification Tasks."""
def __init__(self, freeze_bert=False):
"""
#param bert: a BertModel object
#param classifier: a torch.nn.Module classifier
#param freeze_bert (bool): Set `False` to fine-tune the BERT model
"""
super(BertClassifier, self).__init__()
# Specify hidden size of BERT, hidden size of our classifier, and number of labels
D_in, H, D_out = 768, 50, 2
# Instantiate BERT model
self.bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
# Instantiate an one-layer feed-forward classifier
self.classifier = nn.ModuleDict({
'lstm': nn.LSTM(input_size=D_in, hidden_size=H, batch_first=True, bidirectional=True),
'linear': nn.Linear(in_features=H, out_features=D_out)})
# Freeze the BERT model
if freeze_bert:
for param in self.bert.parameters():
param.requires_grad = False
def forward(self, input_ids, attention_mask):
outputs = self.bert(input_ids=input_ids,attention_mask=attention_mask)
sequence_output = outputs[0]
sequence_output, _ = self.lstm(sequence_output)
linear_output = self.linear(sequence_output[:, -1])
return linear_output
def initialize_model(epochs=4):
# Instantiate Bert Classifier
bert_classifier = BertClassifier(freeze_bert=False)
print(bert_classifier)
# Tell PyTorch to run the model on GPU
bert_classifier.to(device)
# Create the optimizer
optimizer = AdamW(bert_classifier.parameters(), lr=5e-5)
# Total number of training steps
total_steps = len(train_dataloader) * epochs
# Set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=0,num_training_steps=total_steps)
return bert_classifier, optimizer, scheduler
# Specify loss function
loss_fn = nn.CrossEntropyLoss()
def set_seed(seed_value=42):
"""Set seed for reproducibility."""
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
"""Train the BertClassifier model."""
# Start training loop
print("Start training...\n")
for epoch_i in range(epochs):
# Print the header of the result table
print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
print("-" * 70)
# Measure the elapsed time of each epoch
t0_epoch, t0_batch = time.time(), time.time()
# Reset tracking variables at the beginning of each epoch
total_loss, batch_loss, batch_counts = 0, 0, 0
# Put the model into the training mode
model.train()
# For each batch of training data...
for step, batch in enumerate(train_dataloader):
batch_counts += 1
# Load batch to GPU
b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
# Zero out any previously calculated gradients
model.zero_grad()
# Perform a forward pass. This will return logits.
logits = model(b_input_ids, b_attn_mask)
# Compute loss and accumulate the loss values
loss = loss_fn(logits, b_labels)
batch_loss += loss.item()
total_loss += loss.item()
# Perform a backward pass to calculate gradients
loss.backward()
# Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Update parameters and the learning rate
optimizer.step()
scheduler.step()
# Print the loss values and time elapsed for every 20 batches
if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
# Calculate time elapsed for 20 batches
time_elapsed = time.time() - t0_batch
# Print training results
print(
f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")
# Reset batch tracking variables
batch_loss, batch_counts = 0, 0
t0_batch = time.time()
# Calculate the average loss over the entire training data
avg_train_loss = total_loss / len(train_dataloader)
print("-" * 70)
#Evaluation
if evaluation == True:
# After the completion of each training epoch, measure the model's performance
# on our validation set.
val_loss, val_accuracy = evaluate(model, val_dataloader)
# Print performance over the entire training data
time_elapsed = time.time() - t0_epoch
print(
f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
print("-" * 70)
print("\n")
print("Training complete!")
def evaluate(model, val_dataloader):
"""After the completion of each training epoch, measure the model's performance
on our validation set.
"""
# Put the model into the evaluation mode. The dropout layers are disabled during
# the test time.
model.eval()
# Tracking variables
val_accuracy = []
val_loss = []
# For each batch in our validation set...
for batch in val_dataloader:
# Load batch to GPU
b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
# Compute logits
with torch.no_grad():
logits = model(b_input_ids, b_attn_mask)
# Compute loss
loss = loss_fn(logits, b_labels)
val_loss.append(loss.item())
# Get the predictions
preds = torch.argmax(logits, dim=1).flatten()
# Calculate the accuracy rate
accuracy = (preds == b_labels).cpu().numpy().mean() * 100
val_accuracy.append(accuracy)
# Compute the average accuracy and loss over the validation set.
val_loss = np.mean(val_loss)
val_accuracy = np.mean(val_accuracy)
return val_loss, val_accuracy
def accuracy(probs, y_true):
"""
- Print AUC and accuracy on the test set
#params probs (np.array): an array of predicted probabilities with shape (len(y_true), 2)
#params y_true (np.array): an array of the true values with shape (len(y_true),)
fpr, tpr, threshold = roc_curve(y_true, preds)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.4f}')
"""
preds = probs[:, 1]
# Get accuracy over the test set
y_pred = np.where(preds >= 0.5, 1, 0)
accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
def bert_predict(model, test_dataloader):
"""Perform a forward pass on the trained BERT model to predict probabilities on the test set."""
# Put the model into the evaluation mode. The dropout layers are disabled during the test time.
model.eval()
all_logits = []
# For each batch in our test set...
for batch in test_dataloader:
# Load batch to GPU
b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]
# Compute logits
with torch.no_grad():
logits = model(b_input_ids, b_attn_mask)
all_logits.append(logits)
# Concatenate logits from each batch
all_logits = torch.cat(all_logits, dim=0)
# Apply softmax to calculate probabilities
probs = F.softmax(all_logits, dim=1).cpu().numpy()
return probs
set_seed(42) # Set seed for reproducibility
bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
# start training
train(bert_classifier, train_dataloader, val_dataloader, epochs=3, evaluation=True)
# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, val_dataloader)
# Evaluate the Bert classifier
accuracy(probs, y_val)

the Training is stopped for my PyTorch model

I am creating a model to identify names of items in RNN (LITMS)
i get data , then transform data then create Baches then create Model then create train function Correctly but the training is stop here(not working):
this this my code
for e in range(epochs):
# initialize hidden state
h = net.init_hidden(batch_size)
for x, y in get_batches(data, batch_size, seq_length):
print ("the login the loob get_batches is succressfuly")
counter += 1
# One-hot encode our data and make them Torch tensors
x = one_hot_encode(x, n_chars)
inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
if(train_on_gpu):
inputs, targets = inputs.cuda(), targets.cuda()
# Creating new variables for the hidden state, otherwise
# we'd backprop through the entire training history
h = tuple([each.data for each in h])
# zero accumulated gradients
net.zero_grad()
# get the output from the model
output, h = net(inputs, h)
# calculate the loss and perform backprop
loss = criterion(output, targets.view(batch_size*seq_length))
loss.backward()
# `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
nn.utils.clip_grad_norm_(net.parameters(), clip)
opt.step()
# loss stats
if counter % print_every == 0:
# Get validation loss
val_h = net.init_hidden(batch_size)
val_losses = []
net.eval()
for x, y in get_batches(val_data, batch_size, seq_length):
# One-hot encode our data and make them Torch tensors
x = one_hot_encode(x, n_chars)
x, y = torch.from_numpy(x), torch.from_numpy(y)
# Creating new variables for the hidden state, otherwise
# we'd backprop through the entire training history
val_h = tuple([each.data for each in val_h])
inputs, targets = x, y
if(train_on_gpu):
inputs, targets = inputs.cuda(), targets.cuda()
output, val_h = net(inputs, val_h)
val_loss = criterion(output, targets.view(batch_size*seq_length))
val_losses.append(val_loss.item())
net.train() # reset to train mode after iterationg through validation data
print("Epoch: {}/{}...".format(e+1, epochs),
"Step: {}...".format(counter),
"Loss: {:.4f}...".format(loss.item()),
"Val Loss: {:.4f}".format(np.mean(val_losses)))
I don't know why.
some times i got this error after many trials
please help me if you can
You can find the notebook file here.

you duplicate this step
remove it then try again

How to implement contractive autoencoder in Pytorch?

I'm trying to create a contractive autoencoder in Pytorch. I found this thread and tried according to that. This is the snippet I wrote based on the mentioned thread:
import datetime
import numpy as np
import torch
import torchvision
from torchvision import datasets, transforms
from torchvision.utils import save_image, make_grid
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
%matplotlib inline
dataset_train = datasets.MNIST(root='MNIST',
train=True,
transform = transforms.ToTensor(),
download=True)
dataset_test = datasets.MNIST(root='MNIST',
train=False,
transform = transforms.ToTensor(),
download=True)
batch_size = 128
num_workers = 2
dataloader_train = torch.utils.data.DataLoader(dataset_train,
batch_size = batch_size,
shuffle=True,
num_workers = num_workers,
pin_memory=True)
dataloader_test = torch.utils.data.DataLoader(dataset_test,
batch_size = batch_size,
num_workers = num_workers,
pin_memory=True)
def view_images(imgs, labels, rows = 4, cols =11):
imgs = imgs.detach().cpu().numpy().transpose(0,2,3,1)
fig = plt.figure(figsize=(8,4))
for i in range(imgs.shape[0]):
ax = fig.add_subplot(rows, cols, i+1, xticks=[], yticks=[])
ax.imshow(imgs[i].squeeze(), cmap='Greys_r')
ax.set_title(labels[i].item())
# now let's view some
imgs, labels = next(iter(dataloader_train))
view_images(imgs, labels,13,10)
class Contractive_AutoEncoder(nn.Module):
def __init__(self):
super().__init__()
self.encoder = nn.Linear(784, 512)
self.decoder = nn.Linear(512, 784)
def forward(self, input):
# flatten the input
shape = input.shape
input = input.view(input.size(0), -1)
output_e = F.relu(self.encoder(input))
output = F.sigmoid(self.decoder(output_e))
output = output.view(*shape)
return output_e, output
def loss_function(output_e, outputs, imgs, device):
output_e.backward(torch.ones(output_e.size()).to(device), retain_graph=True)
criterion = nn.MSELoss()
assert outputs.shape == imgs.shape ,f'outputs.shape : {outputs.shape} != imgs.shape : {imgs.shape}'
imgs.grad.requires_grad = True
loss1 = criterion(outputs, imgs)
print(imgs.grad)
loss2 = torch.mean(pow(imgs.grad,2))
loss = loss1 + loss2
return loss
epochs = 50
interval = 2000
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Contractive_AutoEncoder().to(device)
optimizer = optim.Adam(model.parameters(), lr =0.001)
for e in range(epochs):
for i, (imgs, labels) in enumerate(dataloader_train):
imgs = imgs.to(device)
labels = labels.to(device)
outputs_e, outputs = model(imgs)
loss = loss_function(outputs_e, outputs, imgs,device)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i%interval:
print('')
print(f'epoch/epoechs: {e}/{epochs} loss : {loss.item():.4f} ')
For the sake of brevity I just used one layer for the encoder and the decoder. It should work regardless of number of layers in either of them obviously!
But the catch here is, aside from the fact that I don't know if this is the correct way of doing this, (calculating gradients with respect to the input), I get an error which makes the former solution wrong/not applicable.
That is:
imgs.grad.requires_grad = True
produces the error :
AttributeError : 'NoneType' object has no attribute 'requires_grad'
I also tried the second method suggested in that thread which is as follows:
class Contractive_Encoder(nn.Module):
def __init__(self):
super().__init__()
self.encoder = nn.Linear(784, 512)
def forward(self, input):
# flatten the input
input = input.view(input.size(0), -1)
output_e = F.relu(self.encoder(input))
return output_e
class Contractive_Decoder(nn.Module):
def __init__(self):
super().__init__()
self.decoder = nn.Linear(512, 784)
def forward(self, input):
# flatten the input
output = F.sigmoid(self.decoder(input))
output = output.view(-1,1,28,28)
return output
epochs = 50
interval = 2000
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_enc = Contractive_Encoder().to(device)
model_dec = Contractive_Decoder().to(device)
optimizer = optim.Adam([{"params":model_enc.parameters()},
{"params":model_dec.parameters()}], lr =0.001)
optimizer_cond = optim.Adam(model_enc.parameters(), lr = 0.001)
criterion = nn.MSELoss()
for e in range(epochs):
for i, (imgs, labels) in enumerate(dataloader_train):
imgs = imgs.to(device)
labels = labels.to(device)
outputs_e = model_enc(imgs)
outputs = model_dec(outputs_e)
loss_rec = criterion(outputs, imgs)
optimizer.zero_grad()
loss_rec.backward()
optimizer.step()
imgs.requires_grad_(True)
y = model_enc(imgs)
optimizer_cond.zero_grad()
y.backward(torch.ones(imgs.view(-1,28*28).size()))
imgs.grad.requires_grad = True
loss = torch.mean([pow(imgs.grad,2)])
optimizer_cond.zero_grad()
loss.backward()
optimizer_cond.step()
if i%interval:
print('')
print(f'epoch/epoechs: {e}/{epochs} loss : {loss.item():.4f} ')
but I face the error :
RuntimeError: invalid gradient at index 0 - got [128, 784] but expected shape compatible with [128, 512]
How should I go about this in Pytorch?

Summary
The final implementation for contractive loss that I wrote is as follows:
def loss_function(output_e, outputs, imgs, lamda = 1e-4, device=torch.device('cuda')):
criterion = nn.MSELoss()
assert outputs.shape == imgs.shape ,f'outputs.shape : {outputs.shape} != imgs.shape : {imgs.shape}'
loss1 = criterion(outputs, imgs)
output_e.backward(torch.ones(outputs_e.size()).to(device), retain_graph=True)
# Frobenious norm, the square root of sum of all elements (square value)
# in a jacobian matrix
loss2 = torch.sqrt(torch.sum(torch.pow(imgs.grad,2)))
imgs.grad.data.zero_()
loss = loss1 + (lamda*loss2)
return loss
and inside training loop you need to do:
for e in range(epochs):
for i, (imgs, labels) in enumerate(dataloader_train):
imgs = imgs.to(device)
labels = labels.to(device)
imgs.retain_grad()
imgs.requires_grad_(True)
outputs_e, outputs = model(imgs)
loss = loss_function(outputs_e, outputs, imgs, lam,device)
imgs.requires_grad_(False)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f'epoch/epochs: {e}/{epochs} loss: {loss.item():.4f}')
Full explanation
As it turns out and rightfully #akshayk07 pointed out in the comments, the implementation found in Pytorch forum was wrong in multiple places. The notable thing, being it wasn't implementing the actual contractive loss that was introduced in Contractive Auto-Encoders:Explicit Invariance During Feature Extraction paper! and also aside from that, the implementation wouldn't work at all for obvious reasons that will be explained in a moment.
The changes are obvious so I try to explain what's going on here. First of all note that imgs is not a leaf node, so the gradients would not be retained in the image .grad attribute.
In order to retain gradients for non leaf nodes, you should use retain_graph(). grad is only populated for leaf Tensors. Also imgs.retain_grad() should be called before doing forward() as it will instruct the autograd to store grads into non-leaf nodes.
Update
Thanks to #Michael for pointing out that the correct calculation of Frobenius Norm is actually (from ScienceDirect):
the square root of the sum of the squares of all the matrix entries
and not
the the square root of the sum of the absolute values of all the
matrix entries as explained here

In PyTorch 1.5.0, a high level torch.autograd.functional.jacobian API is added. This should make the contractive objective easier to implement for an arbitrary encoder. For torch>=v1.5.0, the contractive loss would look like this:
contractive_loss = torch.norm(torch.autograd.functional.jacobian(self.encoder, imgs, create_graph=True))
The create_graph argument makes the jacobian differentiable.

The main challenge in implementing the contractive autoencoder is in calculating the Frobenius norm of the Jacobian, which is the gradient of the code or bottleneck layer (vector) with respect to the input layer (vector). This is the regularization term in the loss function. Fortunately, you have done the hard work in solving this for me. Thank you! You are using MSE loss for the first term. Cross entropy loss is sometimes used instead. It's worth considering. I think you are almost there with the Frobenius norm, except that you need to take the square root of the sum of the squares of the Jacobian, where you are calculating the square root of the sum of the absolute values. Here's how I'd define the loss function (sorry I changed notation a little to keep myself straight):
def cae_loss_fcn(code, img_out, img_in, lamda=1e-4, device=torch.device('cuda')):
# First term in the loss function, for ensuring representational fidelity
criterion=nn.MSELoss()
assert img_out.shape == img_in.shape, f'img_out.shape : {img_out.shape} != img_in.shape : {img_in.shape}'
loss1 = criterion(img_out, img_in)
# Second term in the loss function, for enforcing contraction of representation
code.backward(torch.ones(code.size()).to(device), retain_graph=True)
# Frobenius norm of Jacobian of code with respect to input image
loss2 = torch.sqrt(torch.sum(torch.pow(img_in.grad, 2))) # THE CORRECTION
img_in.grad.data.zero_()
# Total loss, the sum of the two loss terms, with weight applied to second term
loss = loss1 + (lamda*loss2)
return loss

Why does this RNN in tensorflow not learn?

I am trying to train an RNN without using the RNN API in tensorflow (2) in Python 3.7, so the code is very basic. Something is going really wrong, but I'm not sure what it is.
As a reference, I am using a dataset from this tensorflow tutorial so I know what the error should roughly converge to. My RNN code is the following. What it is trying to do is use the previous 20 timesteps to predict the value of a series at the 21st timestep. I am training in batches of size 256.
While there is a decrease in loss over time, the ceiling is approximately 10x what it is if I follow the tutorial approach. Could it be some problem with the backpropagation through time?
state_size = 20 #dimensionality of the network
BATCH_SIZE = 256
#define recurrent weights and biases. W has 1 more dimension that the state
#dimension as also processes the inputs
W = tf.Variable(np.random.rand(state_size+1, state_size), dtype=tf.float32)
b = tf.Variable(np.zeros((1,state_size)), dtype=tf.float32)
#weights and biases for the output
W2 = tf.Variable(np.random.rand(state_size, 1),dtype=tf.float32)
b2 = tf.Variable(np.zeros((1,1)), dtype=tf.float32)
init_state = tf.Variable(np.random.normal(size=[BATCH_SIZE,state_size]),dtype='float32')
optimizer = tf.keras.optimizers.Adam(1e-3)
losses = []
for epoch in range(20):
with tf.GradientTape() as tape:
loss = 0
for batch_idx in range(200):
current_state = init_state
batchx = x_train_uni[batch_idx*BATCH_SIZE:(batch_idx+1)*BATCH_SIZE].swapaxes(0,1)
batchy = y_train_uni[batch_idx*BATCH_SIZE:(batch_idx+1)*BATCH_SIZE]
#forward pass through the timesteps
for x in batchx:
inst = tf.concat([current_state,x],1) #concatenate state and inputs for that timepoint
current_state = tf.tanh(tf.matmul(inst, W) + b) #
#predict using the hidden state after the full forward pass
pred = tf.matmul(current_state,W2) + b2
loss += tf.reduce_mean(tf.abs(batchy-pred))
#get gradients with respect to parameters
gradients = tape.gradient(loss, [W,b,W2,b2])
#apply gradients
optimizer.apply_gradients(zip(gradients, [W,b,W2,b2]))
losses.append(loss)
print(loss)

Matrix Factorization with PyTorch using GPU

The below code does run, but it's very slow as it's using a for loops. At my university, servers with GPU resources are available. Likewise, I'd like to understand how to use batches to train the model more effectively.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MatrixFactorization(torch.nn.Module):
def __init__(self, n_items=len(movie_ids), n_factors=300):
super().__init__()
self.vectors = nn.Embedding(n_items, n_factors,sparse=True)
def forward(self, i,j):
feat_i = self.vectors(i)
feat_j = self.vectors(j)
result = (feat_i * feat_j).sum(-1)
return result
model = MatrixFactorization(n_items= len(movie_ids),n_factors=300)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
epochs = 100
for epoch in range(epochs):
loss = 0
for r,c in zip(r_index, c_index):
i = torch.LongTensor([int(r)])
j = torch.LongTensor([int(c)])
rating = torch.FloatTensor([Xij[i, j]])
# predict
prediction = model(i, j)
loss += loss_fn(prediction, rating)
# Reset the gradients to 0
optimizer.zero_grad()
# backpropagate
loss.backward()
# update weights
optimizer.step()
print(loss)
I've tried the below alteration but it produced a warning. I'm not sure why my target sizes are mismatched, but that appears to be the cause of the issue.
epochs = 50
for epoch in range(epochs):
loss = 0
# predict
i = torch.LongTensor(r_index)
j = torch.LongTensor(c_index)
ratings = Xij[i, j]
prediction = model(i, j)
loss += loss_fn(prediction, rating)
# Reset the gradients to 0
optimizer.zero_grad()
# backpropagate
loss.backward()
# update weights
optimizer.step()
print(loss)
And the warning (not sure where I went wrong):
/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py:431: UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([5931640])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)

There is a typo in your second code snippet,
loss += loss_fn(prediction, ratings) # instead of rating

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Out of Memory When Performing Gradient Accumulation in Tensorflow - python

Related

How to add BiLSTM on top of BERT from Huggingface + CUDA out of memory. Tried to allocate 16.00 MiB

the Training is stopped for my PyTorch model

How to implement contractive autoencoder in Pytorch?

Why does this RNN in tensorflow not learn?

Matrix Factorization with PyTorch using GPU

Categories

Resources