I was following the tutorial on CoderzColumn to implement an LSTM for text classification using PyTorch. I tried to apply the implementation to the BBC News dataset from Kaggle, but it heavily overfits, achieving a maximum accuracy of about 60%.
See the train/loss curve for example:
Is there any advice (I am quite new to RNNs/LSTMs) on how to adapt the model to prevent such heavy overfitting?
The model is taken from the above tutorial and looks kind of like this:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab, target_classes, embed_len=50, hidden_dim=75, n_layers=1):
        super(LSTMClassifier, self).__init__()
        self.n_layers = n_layers
        self.embed_len = embed_len
        self.hidden_dim = hidden_dim
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        # self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, dropout=0.2, num_layers=n_layers, batch_first=True)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        # randomly initialized hidden and cell states, as in the tutorial
        hidden, carry = torch.randn(self.n_layers, len(X_batch), self.hidden_dim), torch.randn(self.n_layers, len(X_batch), self.hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        # classify from the last timestep's output
        return self.fc(output[:, -1])
I would be really thankful for any advice on how to adapt the version from the tutorial to use it more effectively on other datasets.
Have you tried adding an nn.Dropout layer before the self.fc?
Check what p = 0.1 / 0.2 / 0.3 will do.
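For instance, here is a minimal sketch of that change applied to the classifier above (the dropout probability p=0.2 is just a starting point to tune, and the random initial state from the tutorial is replaced by the default zero state):

import torch
import torch.nn as nn

class LSTMClassifierWithDropout(nn.Module):
    def __init__(self, vocab, target_classes, embed_len=50, hidden_dim=75, n_layers=1, p=0.2):
        super().__init__()
        self.embedding_layer = nn.Embedding(len(vocab), embed_len)
        self.lstm = nn.LSTM(embed_len, hidden_dim, num_layers=n_layers, batch_first=True)
        self.dropout = nn.Dropout(p=p)  # dropout on the last LSTM output
        self.fc = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        output, _ = self.lstm(embeddings)  # zero initial state by default
        return self.fc(self.dropout(output[:, -1]))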
Another thing you can do is to add regularisation to your training via weight_decay parameter:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
Use small values first, and increase by 10 times, see which will get you the best result.
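An illustrative sweep over those values (the loop and the value grid are just a suggestion, not a prescribed recipe):

# Try weight_decay values a decade apart and compare validation accuracy;
# re-create the model for each run so the settings are compared fairly.
for wd in (1e-6, 1e-5, 1e-4, 1e-3):
    model = LSTMClassifier(vocab, target_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=wd)
    # ... train as usual and record the validation accuracy for this wd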
Also, it goes without saying: make sure there are no test data points in the train set, and make sure you did not forget to shuffle your train set:
train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
Related
I am trying to train an RNN model to classify sentences into 4 classes, but it doesn't seem to work. I tried to overfit 4 examples (blue line), which worked, but even as few as 8 examples (red line) is not working, let alone the whole dataset.
I tried different learning rates and sizes of hidden_size and embedding_size, but it doesn't seem to help. What am I missing? I know that if the model is not able to overfit a small batch, its capacity should be increased, but in this case increasing capacity has no effect.
The architecture is as follows:
class RNN(nn.Module):
    def __init__(self, embedding_size=256, hidden_size=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_size, 0)  # padding_idx=0
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x = [batch_size, sequence_length]
        x = self.embedding(x)  # x = [batch_size, sequence_length, embedding_size]
        _, h_n = self.rnn(x)   # h_n = [1, batch_size, hidden_size]
        h_n = h_n.squeeze(0)
        out = self.fc(h_n)     # out = [batch_size, num_classes]
        return out
Input data is tokenized sentences, padded with 0 to the longest sentence in the batch, so one sample would be, for example: [2784, 9544, 1321, 120, 0, 0]. The data comes from the AG_NEWS dataset in torchtext.datasets.
The training code:
model = RNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

model.train()
for epoch in range(NUM_EPOCHS):
    epoch_losses = []
    correct_predictions = []
    for batch_idx, (labels, texts) in enumerate(train_loader):
        scores = model(texts)
        loss = criterion(scores, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_losses.append(loss.item())
        correct = (scores.max(1).indices == labels).sum()
        correct_predictions.append(correct)
    epoch_avg_loss = sum(epoch_losses) / len(epoch_losses)
    # divide by the dataset size, not by the size of the last batch
    epoch_avg_accuracy = float(sum(correct_predictions)) / len(train_loader.dataset)
The issue was due to the vanishing gradient: a vanilla nn.RNN struggles to propagate gradients through longer sequences, which is why even a handful of examples could not be overfit. The usual remedy is a gated recurrent cell such as nn.LSTM or nn.GRU.
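As a sketch of that drop-in change (not necessarily the poster's exact fix; vocab is assumed in scope as in the question's code), only the recurrent layer and the unpacking of its state need to change:

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, embedding_size=256, hidden_size=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_size, 0)  # padding_idx=0
        # gated cell instead of a vanilla RNN to mitigate vanishing gradients
        self.rnn = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        _, (h_n, _) = self.rnn(x)  # an LSTM also returns a cell state
        return self.fc(h_n.squeeze(0))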
I've just trained an LSTM language model using PyTorch. The main body of the class is this:
class LM(nn.Module):
    def __init__(self, n_vocab,
                 seq_size,
                 embedding_size,
                 lstm_size,
                 pretrained_embed):
        super(LM, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        self.embedding = nn.Embedding.from_pretrained(pretrained_embed, freeze=True)
        self.lstm = nn.LSTM(embedding_size,
                            lstm_size,
                            batch_first=True)
        self.fc = nn.Linear(lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state
Now I want to write a function which calculates how good a sentence is, based on the trained language model (some score like perplexity, etc.).
I'm a bit confused and I don't know how I should calculate this. A similar sample would be of great use.
When using cross-entropy loss, you can just use the exponential function torch.exp() to calculate the perplexity from your loss.
(PyTorch's cross-entropy is computed with the natural logarithm, so torch.exp() inverts it.)
So here is just some dummy example:
import torch
import torch.nn.functional as F
num_classes = 10
batch_size = 1
# your model outputs / logits
output = torch.rand(batch_size, num_classes)
# your targets
target = torch.randint(num_classes, (batch_size,))
# getting loss using cross entropy
loss = F.cross_entropy(output, target)
# calculating perplexity
perplexity = torch.exp(loss)
print('Loss:', loss, 'PP:', perplexity)
In my case the output is:
Loss: tensor(2.7935) PP: tensor(16.3376)
Just be aware that if you want the per-word perplexity, you need the per-word loss as well.
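Applied to the LM class above, a minimal sketch for scoring a whole sentence could look like this (the tokenization producing token_ids is an assumption and not shown):

import torch
import torch.nn.functional as F

# Sketch: per-word perplexity of one sentence under the LM class above.
# `model` is a trained LM; `token_ids` is a 1-D LongTensor of the
# sentence's token indices.
def sentence_perplexity(model, token_ids):
    model.eval()
    with torch.no_grad():
        inputs = token_ids[:-1].unsqueeze(0)   # each token predicts the next one
        targets = token_ids[1:].unsqueeze(0)
        logits, _ = model(inputs, None)        # [1, seq_len - 1, n_vocab]
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1))  # mean per-word loss
    return torch.exp(loss).item()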
Here is a neat example for a language model that might be interesting to look at that also computes the perplexity from the output:
https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/language_model/main.py#L30-L50
Given input features as such, just raw numbers:
tensor([0.2153, 0.2190, 0.0685, 0.2127, 0.2145, 0.1260, 0.1480, 0.1483, 0.1489,
0.1400, 0.1906, 0.1876, 0.1900, 0.1925, 0.0149, 0.1857, 0.1871, 0.2715,
0.1887, 0.1804, 0.1656, 0.1665, 0.1137, 0.1668, 0.1168, 0.0278, 0.1170,
0.1189, 0.1163, 0.2337, 0.2319, 0.2315, 0.2325, 0.0519, 0.0594, 0.0603,
0.0586, 0.0067, 0.0624, 0.2691, 0.0617, 0.2790, 0.2805, 0.2848, 0.2454,
0.1268, 0.2483, 0.2454, 0.2475], device='cuda:0')
And the expected output is a single real number output, e.g.
tensor(-34.8500, device='cuda:0')
Full code on https://www.kaggle.com/alvations/pytorch-mlp-regression
I've tried creating a simple 2-layer network with:
class MLP(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(MLP, self).__init__()
        self.linear = nn.Linear(input_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, output_size)

    def forward(self, inputs, hidden=None, dropout=0.5):
        inputs = F.dropout(inputs, dropout)  # Drop-in.
        # First Layer.
        output = F.relu(self.linear(inputs))
        # Matrix manipulation magic.
        batch_size, sequence_len, hidden_size = output.shape
        # Technically, linear layer takes a 2-D matrix as input, so more manipulation...
        output = output.contiguous().view(batch_size * sequence_len, hidden_size)
        # Apply dropout.
        output = F.dropout(output, dropout)
        # Put it through the classifier
        # And reshape it to [batch_size x sequence_len x vocab_size]
        output = self.classifier(output).view(batch_size, sequence_len, -1)
        return output
And training as such:
# Training routine.
def train(num_epochs, dataloader, valid_dataset, model, criterion, optimizer):
    losses = []
    valid_losses = []
    learning_rates = []
    plt.ion()
    x_valid, y_valid = valid_dataset
    for _e in range(num_epochs):
        for batch in tqdm(dataloader):
            # Zero gradient.
            optimizer.zero_grad()
            #print(batch)
            this_x = torch.tensor(batch['x'].view(len(batch['x']), 1, -1)).to(device)
            this_y = torch.tensor(batch['y'].view(len(batch['y']), 1, 1)).to(device)

            # Feed forward.
            output = model(this_x)
            prediction, _ = torch.max(output, dim=1)
            loss = criterion(prediction, this_y.view(len(batch['y']), -1))
            loss.backward()
            optimizer.step()
            losses.append(torch.sqrt(loss.float()).data)

            with torch.no_grad():
                # Zero gradient.
                optimizer.zero_grad()
                output = model(x_valid.view(len(x_valid), 1, -1))
                prediction, _ = torch.max(output, dim=1)
                loss = criterion(prediction, y_valid.view(len(y_valid), -1))
                valid_losses.append(torch.sqrt(loss.float()).data)

            clear_output(wait=True)
            plt.plot(losses, label='Train')
            plt.plot(valid_losses, label='Valid')
            plt.legend()
            plt.pause(0.05)
Tuning several hyperparameters, it looks like the model doesn't train well; the validation loss doesn't move at all, e.g.
hyperparams = Hyperparams(input_size=train_dataset.x.shape[1],
                          output_size=1,
                          hidden_size=150,
                          loss_func=nn.MSELoss,
                          learning_rate=1e-8,
                          optimizer=optim.Adam,
                          batch_size=500)
And its loss curve:
Any idea what's wrong with the network?
Am I training the regression model with the wrong loss? Or I've just not yet found the right hyperparameters?
Or am I validating the model wrongly?
From the code you provided, it is tough to say why the validation loss is constant but I see several problems in your code.
Why do you validate for each training mini-batch? Instead, you should validate your model after you do the training for one complete epoch (iterating over your full dataset once). So, the skeleton should be like:
for _e in range(num_epochs):
    for batch in tqdm(train_dataloader):
        # training code
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            # validation code
    # plot your loss values
Also, you should plot after each epoch, not after each training mini-batch.
Did you check whether the model parameters are getting updated after optimizer.step() during training? How many validation examples do you have? Why don't you use mini-batch computation during validation?
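For the first of those checks, here is an illustrative snippet (reusing the names from your training loop) to verify that the weights actually change after one step:

# Illustrative check: snapshot the parameters, run one training step,
# and see whether anything changed.
before = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
output = model(this_x)
prediction, _ = torch.max(output, dim=1)
loss = criterion(prediction, this_y.view(len(batch['y']), -1))
loss.backward()
optimizer.step()

updated = any((b != p.detach()).any().item()
              for b, p in zip(before, model.parameters()))
print('parameters updated:', updated)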
Why do you do: optimizer.zero_grad() during validation? It doesn't make sense because, during validation, you are not going to do anything related to optimization.
You should use model.eval() during validation to turn off the dropouts; note that the functional F.dropout in your forward only respects this if you pass training=self.training, since its training argument defaults to True. See the PyTorch documentation to learn about the .train() and .eval() methods.
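A minimal sketch of the intended validation pattern:

model.eval()            # switch to evaluation mode (disables dropout)
with torch.no_grad():   # no gradients needed during validation
    for batch in tqdm(valid_dataloader):
        ...             # validation code
model.train()           # back to training mode for the next epoch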
The learning rate is set to 1e-8, isn't it too small? Why don't you use the default learning rate for Adam (1e-3)?
The following points require some reasoning about your setup.
Why are you using such a large batch size? What is your training dataset size?
You can directly plot the MSELoss, instead of taking the square root.
My suggestion would be: use some existing resources for MLP in PyTorch. Don't do it from scratch if you do not have sufficient knowledge at this point. It would make you suffer a lot.
Can we input a continuous video, containing sequences of both positive and negative classes, to train an LSTM, given several thousand such videos?
My overall objective is to mark videos in real time with particular scenes (e.g. if I have frames 0-100 and frames 30-60 contain some yoga scenes, I need to mark them).
Right now the approach I'm following is to split the video into two categories, positive sequences and negative sequences, and train the LSTM (on top of a Mobnet CNN, with the FC replaced by LSTM layers).
But somehow this does not give any improvement compared to Mobnet alone when we run evaluation on non-split videos.
Both Mobnet and the LSTM are trained separately: I save the output of Mobnet (FC removed) to numpy arrays and then read these arrays back to train the LSTM.
Here is the sample of code used for this approach:
epochs = 250
batch_size = 128

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        in_size = 1024
        classes_no = 2
        hidden_size = 512
        layer_no = 2
        self.lstm = nn.LSTM(in_size, hidden_size, layer_no, batch_first=True)
        self.linear = nn.Linear(hidden_size, classes_no)

    def forward(self, input_seq):
        output_seq, _ = self.lstm(input_seq)
        last_output = output_seq[:, -1]
        class_predictions = self.linear(last_output)
        return class_predictions

def nploader(npfile):
    a = np.load(npfile)
    return a

def train():
    npdataloader = torchvision.datasets.DatasetFolder('./featrs/',
        nploader, ['npy'], transform=None, target_transform=None)
    data_loader = torch.utils.data.DataLoader(npdataloader,
                                              batch_size=batch_size,
                                              shuffle=False,
                                              num_workers=1)
    model = Model().cuda()
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.8)
    model.train()
    for epoch in range(0, epochs):
        for input_seq, target in data_loader:
            optimizer.zero_grad()
            output = model(input_seq.cuda())
            err = loss(output.cuda(), target.cuda())
            err.backward()
            optimizer.step()
        scheduler.step()
    torch.save(model.state_dict(), 'lstm.ckpt')
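For reference, the feature-extraction step described above (saving the backbone's outputs to numpy arrays) might look roughly like this sketch; the backbone handle, the frames tensor, and the './featrs/<class>/<video>.npy' layout are assumptions, not the poster's actual code:

import numpy as np
import torch

# Hypothetical sketch: run one video's frames through the CNN backbone
# (FC head removed) and save the per-frame feature sequence as a .npy
# file, so that DatasetFolder/nploader above can read it back.
def save_video_features(backbone, frames, out_path):
    backbone.eval()
    with torch.no_grad():
        feats = backbone(frames.cuda())     # [num_frames, 1024]
    np.save(out_path, feats.cpu().numpy())  # later loaded by nploader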
I was going through this tutorial. I have a question about the following class code:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax()

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))
This code was taken from here, where it was mentioned that:
Since the state of the network is held in the graph and not in the layers, you can simply create an nn.Linear and reuse it over and over again for the recurrence.
What I don't understand is how one can just increase the input feature size of nn.Linear and say it is an RNN. What am I missing here?
The network is recurrent, because you evaluate multiple timesteps in the example.
The following code is also taken from the PyTorch tutorial you linked to.
loss_fn = nn.MSELoss()

batch_size = 10
TIMESTEPS = 5

# Create some fake data
batch = torch.randn(batch_size, 50)
hidden = torch.zeros(batch_size, 20)
target = torch.zeros(batch_size, 10)

loss = 0
for t in range(TIMESTEPS):
    # yes! you can reuse the same network several times,
    # sum up the losses, and call backward!
    hidden, output = rnn(batch, hidden)
    loss += loss_fn(output, target)
loss.backward()
So the network itself is not recurrent, but in this loop you use it as a recurrent network by feeding the hidden state of the previous forward step together with your batch-input multiple times.
You could also use it non-recurrently by just backpropagating the loss at every step and ignoring the hidden state, as sketched below.
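A sketch of that non-recurrent variant, using the same fake data as above:

# Non-recurrent usage: reset the state and backpropagate at every step,
# so no information is carried across timesteps.
for t in range(TIMESTEPS):
    hidden = torch.zeros(batch_size, 20)  # fresh state each step
    hidden, output = rnn(batch, hidden)
    loss = loss_fn(output, target)
    loss.backward()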
Since the state of the network is held in the graph and not in the layers, you can simply create an nn.Linear and reuse it over and over again for the recurrence.
This means that the information needed to compute the gradient is not held in the model itself, so you can append multiple evaluations of the module to the graph and then backpropagate through the full graph.
This is described in the previous paragraphs of the tutorial.