I am trying to convert PyTorch code for a CNN -> LSTM cascaded model to PyTorch Lightning.
There are two nn.Module classes in the PyTorch code, one for CNN (encoder) and one for LSTM (decoder); the last hidden layer of CNN is used as input to LSTM.
So, after conversion to PyTorch Lightning, there are two pl.LightningModule classes. I want to know how to populate the required methods in these two classes viz. forward(), training_step(), train_dataloader() and configure_optimizers().
Here are the loss and optimizer definitions in PyTorch; the optimizer uses parameters from both the encoder and decoder models:
loss_criterion = nn.CrossEntropyLoss()
parameters = list(decoder_model.parameters()) + list(encoder_model.linear_layer.parameters()) + list(encoder_model.batch_norm.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.001)
Here is the relevant PyTorch code from the nested training loop; the encoder output is fed as input to the decoder, and the loss is then calculated from the decoder output:
feats = encoder_model(imgs)
outputs = decoder_model(feats, caps, lens)
loss = loss_criterion(outputs, tgts)
decoder_model.zero_grad()
encoder_model.zero_grad()
loss.backward()
optimizer.step()
A possible solution is to have a "wrapper" model that instantiates and uses the CNN and LSTM models. The CNN and LSTM models don't need to subclass pl.LightningModule in this case; they can remain plain nn.Module classes. An example structure would be as follows:
class WrapperModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder_model = EncoderModel()
        self.decoder_model = DecoderModel()
        self.loss_criterion = nn.CrossEntropyLoss()

    def configure_optimizers(self):
        # <initialize optimizers> (see the sketch below)
        ...

    def forward(self, imgs, caps, lens):
        feats = self.encoder_model(imgs)
        outputs = self.decoder_model(feats, caps, lens)
        return outputs

    def training_step(self, batch, batch_idx):
        imgs, caps, lens, tgts = batch
        outputs = self.forward(imgs, caps, lens)
        loss = self.loss_criterion(outputs, tgts)
        return loss
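A hedged sketch of how configure_optimizers() and train_dataloader() could be filled in for this wrapper, mirroring the original optimizer definition; the encoder attribute names linear_layer and batch_norm come from the question, while self.caption_dataset is a placeholder (an assumption) for whatever Dataset yields (imgs, caps, lens, tgts) batches. These methods would live in the WrapperModel class above:

    def configure_optimizers(self):
        # combine the decoder parameters with the encoder layers that should be trained,
        # exactly as in the plain-PyTorch optimizer definition above
        params = (list(self.decoder_model.parameters())
                  + list(self.encoder_model.linear_layer.parameters())
                  + list(self.encoder_model.batch_norm.parameters()))
        return torch.optim.Adam(params, lr=0.001)

    def train_dataloader(self):
        # self.caption_dataset would be set in __init__ (or passed in); placeholder name here
        return torch.utils.data.DataLoader(self.caption_dataset, batch_size=32, shuffle=True)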
I was trying out a simple LSTM use case from PyTorch, with the following model.
class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(batch_first=True, input_size=embedding_dim, num_layers=1,
                            hidden_size=hidden_dim, bidirectional=True)
        self.linear = nn.Linear(hidden_dim * 2, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):  # x is NxD, padded to the same length with 0s within the N-sized batch
        x = self.embedding(x)
        output, (final_hidden_state, final_cell_state) = self.lstm(x)
        x = self.linear(output[:, -1, :])
        x = self.sigmoid(x)
        return x
It is a binary classification task, using BCELoss (combined with the sigmoid output layer). Unfortunately, the loss is stuck at 0.6969 (i.e. it is not learning anything).
I've tried feeding final_hidden_state and output[:,0,:] into the linear layer instead, but so far no dice.
Everything else (optimizer, loss criterion, train loop, val loop) already works because I tried the exact same setup with a basic NN using nn.Embedding, nn.Linear, and nn.Sigmoid only, and could get to good loss decrease and high accuracy. In the SimpleLSTM, the only thing I added is the nn.LSTM.
Typically the final_hidden_state is passed to the linear layer, not output; use it (see the sketch below).
Add 1-2 more linear layers after the LSTM.
Try a lower learning rate, especially when the embeddings are not pre-trained.
Better yet, try loading pre-trained embeddings.
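A minimal sketch of the first suggestion, assuming the SimpleLSTM dimensions above: for a single-layer bidirectional LSTM, final_hidden_state has shape (2, N, hidden_dim), so the two directions are concatenated to match the hidden_dim*2 linear layer. This would replace the forward method:

    def forward(self, x):  # requires `import torch` for torch.cat
        x = self.embedding(x)
        output, (final_hidden_state, final_cell_state) = self.lstm(x)
        # final_hidden_state: (num_layers * num_directions, N, hidden_dim) -> here (2, N, hidden_dim)
        hidden = torch.cat((final_hidden_state[0], final_hidden_state[1]), dim=1)  # (N, hidden_dim * 2)
        x = self.linear(hidden)
        return self.sigmoid(x)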
I want to add two separate layers on top of one layer (or a pre-trained model). Is that possible to do using PyTorch?
Yes, when defining your model's forward function, you can specify how the inputs should be passed through the layers.
For example:
def forward(self, X):
    X = self.common_layer(X)
    X = self.activation_fn(X)
    Xa = self.layer_a(X)
    Xb = self.layer_b(X)
    # now combine the outputs of the parallel layers however you please,
    # e.g. by concatenating along the feature dimension
    return self.combining_layer(torch.cat([Xa, Xb], dim=1))
Where forward is a member of MyNet:
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # define common_layer, activation_fn, layer_a, layer_b, and combining_layer here
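A minimal, self-contained sketch of what those definitions could look like; the layer types and sizes here are arbitrary assumptions for illustration, not taken from the question:

import torch
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.common_layer = nn.Linear(64, 32)    # shared trunk (sizes are illustrative)
        self.activation_fn = nn.ReLU()
        self.layer_a = nn.Linear(32, 16)         # first parallel head
        self.layer_b = nn.Linear(32, 16)         # second parallel head
        self.combining_layer = nn.Linear(32, 1)  # consumes the concatenated heads (16 + 16)

    def forward(self, X):
        X = self.activation_fn(self.common_layer(X))
        Xa = self.layer_a(X)
        Xb = self.layer_b(X)
        return self.combining_layer(torch.cat([Xa, Xb], dim=1))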
Training
The model should be trained just like any other PyTorch model.
Because the combined output depends on both layer_a and layer_b, computing the gradient of the loss will optimize the parameters of both layers. The layers are updated separately, since the parameters of each layer object are independent.
For example:
model = MyNet()
...
optimizer.zero_grad()
predictions = model(input_batch)
loss = my_loss_fn(predictions, ground_truth)
loss.backward()
optimizer.step()
I'm using PyTorch with the base pretrained BERT model to classify sentences for hate speech.
I want to implement a Bi-LSTM layer that takes as input all outputs of the last transformer encoder of the BERT model, as a new model (a class that extends nn.Module), and I got confused with the nn.LSTM parameters.
I tokenized the data using
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=int(data['class'].nunique()),output_attentions=False,output_hidden_states=False)
My data-set has 2 columns: class(label), sentence.
Can someone help me with this?
Thank you in advance.
Edit:
Also, after processing the input in the Bi-LSTM, the network sends the final hidden state to a fully connected network that performs classification using the softmax activation function. How can I do that?
You can do it as follows:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        ### New layers:
        self.lstm = nn.LSTM(768, 256, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(256 * 2, <number_of_classes>)

    def forward(self, ids, mask):
        # return_dict=False so the call returns a (sequence_output, pooled_output) tuple
        # on recent transformers versions as well
        sequence_output, pooled_output = self.bert(ids, attention_mask=mask, return_dict=False)
        # sequence_output has shape (batch_size, sequence_length, 768)
        lstm_output, (h, c) = self.lstm(sequence_output)
        # concatenate the forward direction's last step with the backward direction's first step
        hidden = torch.cat((lstm_output[:, -1, :256], lstm_output[:, 0, 256:]), dim=-1)
        linear_output = self.linear(hidden.view(-1, 256 * 2))  ### classify from the final Bi-LSTM state
        return linear_output

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = CustomBERTModel()
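Regarding the follow-up about the softmax classifier: the linear layer above already produces the class logits, so for training you would typically apply nn.CrossEntropyLoss to them directly, since it applies log-softmax internally. A hedged usage sketch, assuming <number_of_classes> has been set and using a made-up two-sentence batch as a placeholder:

# hypothetical batch of raw sentences and integer class labels
sentences = ["example sentence one", "example sentence two"]
labels = torch.tensor([0, 1])

encoding = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
logits = model(encoding["input_ids"], encoding["attention_mask"])

criterion = nn.CrossEntropyLoss()  # softmax + negative log-likelihood in one step
loss = criterion(logits, labels)
loss.backward()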
I want to add a dense layer on top of the bare BERT Model transformer that outputs raw hidden-states, and then fine-tune the resulting model. Specifically, I am using this base model. This is what the model should do:
Encode the sentence (a vector with 768 elements for each token of the sentence)
Keep only the first vector (related to the first token)
Add a dense layer on top of this vector, to get the desired transformation
So far, I have successfully encoded the sentences:
from sklearn.neural_network import MLPRegressor
import torch
from transformers import AutoModel, AutoTokenizer
# List of strings
sentences = [...]
# List of numbers
labels = [...]
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
# 2D array, one line per sentence containing the embedding of the first token
encoded_sentences = torch.stack([model(**tokenizer(s, return_tensors='pt'))[0][0][0]
for s in sentences]).detach().numpy()
regr = MLPRegressor()
regr.fit(encoded_sentences, labels)
In this way I can train a neural network by feeding it with the encoded sentences. However, this approach clearly does not fine tune the base BERT model. Can anybody help me? How can I build a model (possibly in pytorch or using the Huggingface library) that can be entirely fine tuned?
There are two ways to do it. Since you are looking to fine-tune the model for a downstream task similar to classification, you can directly use:
the BertForSequenceClassification class, which performs fine-tuning of a logistic regression layer on top of the 768-dimensional output.
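A minimal sketch of that first option; the Italian checkpoint comes from the question, while num_labels=3 and the sample sentence/label are assumptions for illustration (on recent transformers versions the output exposes .loss and .logits):

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = BertForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-cased", num_labels=3)  # 3 classes, matching the example below

inputs = tokenizer(["una frase di esempio"], return_tensors="pt", padding=True)
outputs = model(**inputs, labels=torch.tensor([0]))
loss, logits = outputs.loss, outputs.logits  # backpropagating loss fine-tunes all of BERT plus the head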
Alternatively, you can define a custom module that creates a BERT model based on the pre-trained weights and adds layers on top of it:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = nn.Linear(768, 256)
        self.linear2 = nn.Linear(256, 3)  ## 3 is the number of classes in this example

    def forward(self, ids, mask):
        # return_dict=False so the call returns a (sequence_output, pooled_output) tuple
        # on recent transformers versions as well
        sequence_output, pooled_output = self.bert(ids, attention_mask=mask, return_dict=False)
        # sequence_output has shape (batch_size, sequence_length, 768)
        linear1_output = self.linear1(sequence_output[:, 0, :].view(-1, 768))  ## extract the 1st token's embeddings
        linear2_output = self.linear2(linear1_output)
        return linear2_output
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = CustomBERTModel()  # You can pass parameters if required to have a more flexible model
model.to(torch.device("cpu"))  ## can be "cuda" for GPU

criterion = nn.CrossEntropyLoss()  ## expects raw logits; if required define your own criterion
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

for epoch in range(epochs):
    for batch in data_loader:  ## If you have a DataLoader() object to get the data.
        data = batch[0]
        targets = batch[1]  ## assuming that the data loader returns a tuple of data and its targets

        optimizer.zero_grad()
        encoding = tokenizer.batch_encode_plus(data, return_tensors='pt', padding=True,
                                               truncation=True, max_length=50, add_special_tokens=True)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, targets)  ## CrossEntropyLoss applies log-softmax internally
        loss.backward()
        optimizer.step()
If you want to tune the BERT model itself you will need to modify the parameters of the model. To do this you will most likely want to do your work with PyTorch. Here is some rough pseudo-code to illustrate:
from torch.optim import SGD

model = ...  # whatever model you are using
parameters = model.parameters()  # or some more specific set of parameters
optimizer = SGD(parameters, lr=.01)  # or whatever optimizer you want
optimizer.zero_grad()  # boiler-platy pytorch function

inputs = ...  # whatever the appropriate input for your task is
labels = ...  # whatever the appropriate label for your task is
loss = model(**inputs, labels=labels)[0]  # usually the loss is the first item returned
loss.backward()   # calculates gradient
optimizer.step()  # runs optimization algorithm
I've left out all the relevant details because they are quite tedious and specific to your particular task. Huggingface has a nice article walking through this in more detail here, and you will definitely want to refer to the PyTorch documentation as you use any PyTorch features. I highly recommend the PyTorch blitz before trying to do anything serious with it.
For anyone using Tensorflow/ Keras the equivalent of Ashwin's answer would be:
from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModel
class CustomBERTModel(keras.Model):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = TFAutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = keras.layers.Dense(256)
        self.linear2 = keras.layers.Dense(3)  ## 3 is the number of classes in this example

    def call(self, inputs, training=False):
        # call expects only one positional argument, so you have to pass in a tuple and unpack.
        # The next parameter is a special reserved training parameter.
        ids, mask = inputs
        sequence_output = self.bert(ids, mask, training=training).last_hidden_state
        # sequence_output has shape (batch_size, sequence_length, 768)
        linear1_output = self.linear1(sequence_output[:, 0, :])  ## extract the 1st token's embeddings
        linear2_output = self.linear2(linear1_output)
        return linear2_output

model = CustomBERTModel()
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
ipts = tokenizer("Some input sequence", return_tensors="tf")
test = model((ipts["input_ids"], ipts["attention_mask"]))
Then to train the model you can make a custom training loop using GradientTape.
You can verify that the additional layers are also trainable with model.trainable_weights. You can access the weights of individual layers too; e.g. model.trainable_weights[-1].numpy() returns the last layer's bias vector. [Note that the Dense layers will only appear after the first time the call method is executed.]
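A hedged sketch of such a GradientTape training step, reusing the model and ipts defined above; the optimizer, loss function, and the single-example labels tensor are assumptions for illustration:

import tensorflow as tf

optimizer = keras.optimizers.Adam(learning_rate=2e-5)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# hypothetical integer class id for the single example in ipts
labels = tf.constant([0])

with tf.GradientTape() as tape:
    logits = model((ipts["input_ids"], ipts["attention_mask"]), training=True)
    loss = loss_fn(labels, logits)

grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))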
I am building a hurricane track predictor using satellite data. I have a many-to-many output in a multilayer LSTM model, with input and output arrays following the structure [samples[time[features]]]. The input and output features are the coordinates of the hurricane, WS, and other dimensions.
The problem is that the error stops reducing and, as a consequence, the model always predicts a constant. After reading several posts, I standardized the data and removed some unnecessary layers, but the model still always predicts the same output.
I think the model is big enough, and the activation functions make sense, given that the outputs are all within [-1, 1].
So my question is: what am I doing wrong?
The model is the following:
# imports assumed from the standard tf.keras API; adjust to your Keras/TensorFlow setup
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Reshape
from tensorflow.keras.callbacks import EarlyStopping

class Stacked_LSTM():
    def __init__(self, training_inputs, training_outputs, n_steps_in, n_steps_out,
                 n_features_in, n_features_out, metrics, optimizer, epochs):
        self.training_inputs = training_inputs
        self.training_outputs = training_outputs
        self.epochs = epochs
        self.n_steps_in = n_steps_in
        self.n_steps_out = n_steps_out
        self.n_features_in = n_features_in
        self.n_features_out = n_features_out
        self.metrics = metrics
        self.optimizer = optimizer

        self.stop = EarlyStopping(monitor='loss', min_delta=0.000000000001, patience=30)

        self.model = Sequential()
        self.model.add(LSTM(360, activation='tanh', return_sequences=True,
                            input_shape=(self.n_steps_in, self.n_features_in,)))  # kernel_regularizer=regularizers.l2(0.001) was not a good idea
        self.model.add(layers.Dropout(0.1))
        self.model.add(LSTM(360, activation='tanh'))
        self.model.add(layers.Dropout(0.1))
        self.model.add(Dense(self.n_features_out * self.n_steps_out))
        self.model.add(Reshape((self.n_steps_out, self.n_features_out)))

        self.model.compile(optimizer=self.optimizer, loss='mae', metrics=[metrics])

    def fit(self):
        return self.model.fit(self.training_inputs, self.training_outputs,
                              callbacks=[self.stop], epochs=self.epochs)

    def predict(self, inputs):
        return self.model.predict(inputs)
Notes
1) In this particular problem, the time-series data is not "continuous", because each time series belongs to a particular hurricane. I have therefore adapted the training and test samples of the time series to each hurricane. The implication of this is that I cannot use stateful=True in my layers, because it would then mean that the model doesn't distinguish between the different hurricanes (if my understanding is correct).
2) No image data, so no convolutional model is needed.
A few suggestions, based on my experience:
4 layers of LSTM is too much; stick to two, maximum three.
Don't use relu as the activation for LSTMs.
Do not use BatchNormalization for time-series data.
Other than these, I'd also suggest removing the dense layers between the two LSTM layers.
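A minimal sketch of an architecture following these suggestions (two tanh LSTM layers, no BatchNormalization, no intermediate dense layers); the layer width of 128 and the 'adam' optimizer are arbitrary assumptions, not taken from the question:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Reshape

def build_model(n_steps_in, n_features_in, n_steps_out, n_features_out):
    model = Sequential()
    model.add(LSTM(128, activation='tanh', return_sequences=True,
                   input_shape=(n_steps_in, n_features_in)))
    model.add(LSTM(128, activation='tanh'))
    model.add(Dense(n_steps_out * n_features_out))
    model.add(Reshape((n_steps_out, n_features_out)))
    model.compile(optimizer='adam', loss='mae')
    return model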