Generation of sequences from latent space [pytorch] - python

I have several questions about best practice in using recurrent networks in pytorch for generation of sequences.
The first one, if I want to build decoder net should I use nn.GRU (or nn.LSTM) instead nn.LSTMCell (nn.GRUCell)? From my experience, if I work with LSTMCell the speed of calculations is drammatically lower (up to 100 times) than if I use nn.LSTM. Maybe it is related with cudnn optimisation for LSTM (and GRU) module? Is any way to speedup LSTMCell calculations?
I try to build an autoencoder, that accepts sequences of variable length. My autoencoder looks like:
class SimpleAutoencoder(nn.Module):
def init(self, input_size, hidden_size, n_layers=3):
super(SimpleAutoencoder, self).init()
self.n_layers = n_layers
self.hidden_size = hidden_size
self.gru_encoder = nn.GRU(input_size, hidden_size,n_layers,batch_first=True)
self.gru_decoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
self.h2o = nn.Linear(hidden_size,input_size) # Hidden to output
def encode(self, input):
output, hidden = self.gru_encoder(input, None)
return output, hidden
def decode(self, input, hidden):
output,hidden = self.gru_decoder(input,hidden)
return output,hidden
def h2o_apply(self,input):
return self.h2o(input)
My training loop looks like:
one_hot_batch = list(map(lambda x:Variable(torch.FloatTensor(x)),one_hot_batch))
packed_one_hot_batch = pack_padded_sequence(pad_sequence(one_hot_batch,batch_first=True).cuda(),batch_lens, batch_first=True)
_, latent = vae.encode(packed_one_hot_batch)
outputs, = vae.decode(packed_one_hot_batch,latent)
packed = pad_packed_sequence(outputs,batch_first=True)
for string,length,index in zip(*packed,range(batch_size)):
decoded_string_without_sos_symbol = vae.h2o_apply(string[1:length])
loss += criterion(decoded_string_without_sos_symbol,real_strings_batch[index][1:])
loss /= len(batch)
The training in such manner, as I can understand, is teacher force. Because at the decoding stage the network feeds the real inputs (outputs,_ = vae.decode(packed_one_hot_batch,latent)). But, for my task it leads to the situation when, in the test stage, network can generate sequences very well only if I use the real symbols (as in training mode), but if I feed the output of the previous step, the network generates rubbish (just infinite repetition of one specific symbol).
I tried another one approach. I generated “fake” inputs( just ones), to make the model generate only from the hidden state.
one_hot_batch_fake = list(map(lambda x:torch.ones_like(x).cuda(),one_hot_batch))
packed_one_hot_batch_fake = pack_padded_sequence(pad_sequence(one_hot_batch_fake, batch_first=True).cuda(), batch_lens, batch_first=True)
_, latent = vae.encode(packed_one_hot_batch)
outputs, = vae.decode(packed_one_hot_batch_fake,latent)
packed = pad_packed_sequence(outputs,batch_first=True)
It works, but very inefficiently, the quality of reconstruction is very low. So the second question, what is the right way to generate sequences from latent representation?
I suppose, that good idea is to apply teacher forcing with some probability, but for that, how one can use nn.GRU layer so the output of the previous step should be the input for the next step?

Related

How to Increase the Scope of Images a Neural Network Can Recognize?

I am working on an image recognition neural network with Pytorch. My goal is to take pictures of handwritten math equations, process them, and use the neural network to recognize each element. I've reached the point where I am able to separate every variable, number, or symbol from the equation, and everything is ready to be sent through the neural network. I've trained my network to recognize numbers quite well (this part was quite easy), but now I want to expand the scope of the neural network to recognizing letters as well as numbers. I loaded handwritten letters along with the numbers into tensors, shuffled the elements, and put them into batches. No matter how I vary my learning rate, my architecture (hidden layers and the number of neurons per layer), or my batch size I cannot get the neural network to recognize letters.
Here is my network architecture and the feed-forward function (you can see I experimented with the number of hidden layers):
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
inputNeurons, hiddenNeurons, outputNeurons = 784, 700, 36
# Create tensors for the weights
self.layerOne = nn.Linear(inputNeurons, hiddenNeurons)
self.layerTwo = nn.Linear(hiddenNeurons, hiddenNeurons)
self.layerThree = nn.Linear(hiddenNeurons, outputNeurons)
#self.layerFour = nn.Linear(hiddenNeurons, outputNeurons)
#self.layerFive = nn.Linear(hiddenNeurons, outputNeurons)
# Create function for Forward propagation
def Forward(self, input):
# Begin Forward propagation
input = torch.sigmoid(self.layerOne(torch.sigmoid(input)))
input = torch.sigmoid(self.layerTwo(input))
input = torch.sigmoid(self.layerThree(input))
#input = torch.sigmoid(self.layerFour(input))
#input = torch.sigmoid(self.layerFive(input))
return input
And this is the training code block (the data is shuffled in a dataloader, the ground truths are shuffled in the same order, batch size is 10, total number letter and number data points is 244800):
neuralNet = NeuralNetwork()
params = list(neuralNet.parameters())
criterion = nn.MSELoss()
print(neuralNet)
dataSet = next(iter(imageDataLoader))
groundTruth = next(iter(groundTruthsDataLoader))
for i in range(15):
for k in range(24480):
neuralNet.zero_grad()
prediction = neuralNet.Forward(dataSet)
loss = criterion(prediction, groundTruth)
loss.backward()
for layer in range(len(params)):
# Updating the weights of the neural network
params[layer].data.sub_(params[layer].grad.data * learningRate)
Thanks for the help in advance!
First thing i would recommend is writing a clean Pytorch code
For eg.
if i see your NeuralNetwork class it should have forward method (f in lower case),
so that you wont call it using prediction = neuralNet.Forward(dataSet). Reason being your hooks from neural network does not get dispatched if you use prediction = neuralNet.Forward(dataSet).
For more details refer this link
Second thing is : Since your dataset is not balance.....try to use undersampling / oversampling methods, which will be very helpful in your case.

LSTM Autoencoder problems

TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
def __init__(self, n_features, latent_size):
super(SeqEncoderLSTM, self).__init__()
self.lstm = nn.LSTM(
n_features,
latent_size,
batch_first=True)
def forward(self, x):
_, hs = self.lstm(x)
return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
def __init__(self, emb_size, n_features):
super(SeqDecoderLSTM, self).__init__()
self.cell = nn.LSTMCell(n_features, emb_size)
self.dense = nn.Linear(emb_size, n_features)
def forward(self, hs_0, seq_len):
x = torch.tensor([])
# Final hidden and cell state from encoder
hs_i, cs_i = hs_0
# reconstruct first element with encoder output
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
# reconstruct remaining elements
for i in range(1, seq_len):
hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
def __init__(self, n_features, emb_size):
super(LSTMEncoderDecoder, self).__init__()
self.n_features = n_features
self.hidden_size = emb_size
self.encoder = SeqEncoderLSTM(n_features, emb_size)
self.decoder = SeqDecoderLSTM(emb_size, n_features)
def forward(self, x):
seq_len = x.shape[1]
hs = self.encoder(x)
hs = tuple([h.squeeze(0) for h in hs])
out = self.decoder(hs, seq_len)
return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Training model on {device}')
model = model.to(device)
opt = optimizer(model.parameters(), lr)
train_loss = []
valid_loss = []
for e in tqdm(range(epochs)):
running_tl = 0
running_vl = 0
for x in trainload:
x = x.to(device).float()
opt.zero_grad()
x_hat = model(x)
if reverse:
x = torch.flip(x, [1])
loss = criterion(x_hat, x)
loss.backward()
opt.step()
running_tl += loss.item()
if testload is not None:
model.eval()
with torch.no_grad():
for x in testload:
x = x.to(device).float()
loss = criterion(model(x), x)
running_vl += loss.item()
valid_loss.append(running_vl / len(testload))
model.train()
train_loss.append(running_tl / len(trainload))
return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
def __init__(self, data, window, n_features, overlap=0):
super().__init__()
if isinstance(data, (np.ndarray)):
data = torch.tensor(data)
elif isinstance(data, (pd.Series, pd.DataFrame)):
data = torch.tensor(data.copy().to_numpy())
else:
raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
self.n_features = n_features
self.seqs = torch.split(data, window)
def __len__(self):
return len(self.seqs)
def __getitem__(self, idx):
try:
return self.seqs[idx].view(-1, self.n_features)
except TypeError:
raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or now long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
def get_data(subtract: bool = False):
# (1, 14, 5)
input_tensor = torch.tensor(
[
[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
]
).unsqueeze(0)
if subtract:
initial_values = input_tensor[:, 0, :]
input_tensor -= torch.roll(input_tensor, 1, 1)
input_tensor[:, 0, :] = initial_values
return input_tensor
if __name__ == "__main__":
torch.manual_seed(0)
HIDDEN_SIZE = 10
SUBTRACT = False
input_tensor = get_data(SUBTRACT)
model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
for i in range(1000):
outputs = model(input_tensor)
loss = criterion(outputs, input_tensor)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"{i}: {loss}")
if loss < 1e-4:
break
# Plotting
sns.lineplot(data=outputs.detach().numpy().squeeze())
sns.lineplot(data=input_tensor.detach().numpy().squeeze())
plt.show()
What it does:
get_data either works on the data your provided if subtract=False or (if subtract=True) it subtracts value of the previous timestep from the current timestep
Rest of the code optimizes the model until 1e-4 loss reached (so we can compare how model's capacity and it's increase helps and what happens when we use the difference of timesteps instead of timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually use difference of timesteps instead of timesteps (or some other transformation, see here for more info about that). In other cases, neural network will try to simply... copy output from the previous step (as that's the easiest thing to do). Some minima will be found this way and going out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from previous timestep; neural network has to learn how the function actually varies
Use larger model (for the whole dataset you should try something like 300 I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs, in this way you can get info from forward and backward pass of LSTM (not to confuse with backprop!). This also should boost your score
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Difference removes the urge of the neural network to base it's predictions on the past timestep too much (by simply getting last value and maybe changing it a little)
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved which makes the thing harder (see here for similar case). In case of LSTMs there are non-linearites, that's one point.
Another one is that we are accumulating timesteps into single encoder state. So essentially we would have to accumulate timesteps identities into a single hidden and cell states which is highly unlikely.
One last point, depending on the length of sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), hence even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What the neural network would do (what is the easiest here)? It would, probably, discard this 1 or smaller change as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? Whole neural network loss is in the [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.

Difference between these implementations of LSTM Autoencoder?

Specifically what spurred this question is the return_sequence argument of TensorFlow's version of an LSTM layer.
The docs say:
Boolean. Whether to return the last output. in the output sequence,
or the full sequence. Default: False.
I've seen some implementations, especially autoencoders that use this argument to strip everything but the last element in the output sequence as the output of the 'encoder' half of the autoencoder.
Below are three different implementations. I'd like to understand the reasons behind the differences, as the seem like very large differences but all call themselves the same thing.
Example 1 (TensorFlow):
This implementation strips away all outputs of the LSTM except the last element of the sequence, and then repeats that element some number of times to reconstruct the sequence:
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
# Decoder below
model.add(RepeatVector(n_out))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
When looking at implementations of autoencoders in PyTorch, I don't see authors doing this. Instead they use the entire output of the LSTM for the encoder (sometimes followed by a dense layer and sometimes not).
Example 1 (PyTorch):
This implementation trains an embedding BEFORE an LSTM layer is applied... It seems to almost defeat the idea of an LSTM based auto-encoder... The sequence is already encoded by the time it hits the LSTM layer.
class EncoderLSTM(nn.Module):
def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
super(EncoderLSTM, self).__init__()
self.hidden_size = hidden_size
self.n_layers = n_layers
self.embedding = nn.Embedding(input_size, hidden_size)
self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)
def forward(self, inputs, hidden):
# Embed input words
embedded = self.embedding(inputs)
# Pass the embedded word vectors into LSTM and return all outputs
output, hidden = self.lstm(embedded, hidden)
return output, hidden
Example 2 (PyTorch):
This example encoder first expands the input with one LSTM layer, then does its compression via a second LSTM layer with a smaller number of hidden nodes. Besides the expansion, this seems in line with this paper I found: https://arxiv.org/pdf/1607.00148.pdf
However, in this implementation's decoder, there is no final dense layer. The decoding happens through a second lstm layer that expands the encoding back to the same dimension as the original input. See it here. This is not in line with the paper (although I don't know if the paper is authoritative or not).
class Encoder(nn.Module):
def __init__(self, seq_len, n_features, embedding_dim=64):
super(Encoder, self).__init__()
self.seq_len, self.n_features = seq_len, n_features
self.embedding_dim, self.hidden_dim = embedding_dim, 2 * embedding_dim
self.rnn1 = nn.LSTM(
input_size=n_features,
hidden_size=self.hidden_dim,
num_layers=1,
batch_first=True
)
self.rnn2 = nn.LSTM(
input_size=self.hidden_dim,
hidden_size=embedding_dim,
num_layers=1,
batch_first=True
)
def forward(self, x):
x = x.reshape((1, self.seq_len, self.n_features))
x, (_, _) = self.rnn1(x)
x, (hidden_n, _) = self.rnn2(x)
return hidden_n.reshape((self.n_features, self.embedding_dim))
Question:
I'm wondering about this discrepancy in implementations. The difference seems quite large. Are all of these valid ways to accomplish the same thing? Or are some of these mis-guided attempts at a "real" LSTM autoencoder?
There is no official or correct way of designing the architecture of an LSTM based autoencoder... The only specifics the name provides is that the model should be an Autoencoder and that it should use an LSTM layer somewhere.
The implementations you found are each different and unique on their own even though they could be used for the same task.
Let's describe them:
TF implementation:
It assumes the input has only one channel, meaning that each element in the sequence is just a number and that this is already preprocessed.
The default behaviour of the LSTM layer in Keras/TF is to output only the last output of the LSTM, you could set it to output all the output steps with the return_sequences parameter.
In this case the input data has been shrank to (batch_size, LSTM_units)
Consider that the last output of an LSTM is of course a function of the previous outputs (specifically if it is a stateful LSTM)
It applies a Dense(1) in the last layer in order to get the same shape as the input.
PyTorch 1:
They apply an embedding to the input before it is fed to the LSTM.
This is standard practice and it helps for example to transform each input element to a vector form (see word2vec for example where in a text sequence, each word that isn't a vector is mapped into a vector space). It is only a preprocessing step so that the data has a more meaningful form.
This does not defeat the idea of the LSTM autoencoder, because the embedding is applied independently to each element of the input sequence, so it is not encoded when it enters the LSTM layer.
PyTorch 2:
In this case the input shape is not (seq_len, 1) as in the first TF example, so the decoder doesn't need a dense after. The author used a number of units in the LSTM layer equal to the input shape.
In the end you choose the architecture of your model depending on the data you want to train on, specifically: the nature (text, audio, images), the input shape, the amount of data you have and so on...

Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNN's with a personnal project on PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (idea mainly from this article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ but I wanted to do most of the stuff myself).
My idea is this : I take a batch of B input sequences of size n (np array of n integers), one hot encode them and pass them through my network composed of several LSTM layers, one fully connected layers and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following :
class Model(nn.Module):
def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
super().__init__()
self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first = True, dropout = dropout_prob)
self.dropout = nn.Dropout(dropout_prob)
self.fully_connected = nn.Linear(num_units, one_hot_length)
self.softmax = nn.Softmax(dim = 1)
# dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit
def forward_pass(self, input_seq, hc_states):
output, hc_states = self.LSTM (input_seq, hc_states)
output = output.view(-1, self.num_units)
output = self.fully_connected(output)
# I simply comment out the next line when I run the network without the softmax layer
output = self.softmax(output)
return output, hc_states
one_hot_length is the size of my character dictionnary (~200, also the size of a one hot encoded vector)
num_units is the number of hidden units in a LSTM cell, num_layers the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows :
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input_seq, num_classes = one_hot_length).float().
for state in hc_states:
state.detach_()
optimizer.zero_grad()
output, states = net.forward_pass(input, hc_states)
loss = nn.CrossEntropyLoss(output, target)
loss.backward()
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
optimizer.step()
With hc_states a tuple with the hidden states tensor and the cell states tensor, input, is a tensor of size (B,n,one_hot_length), target is (B,n).
I'm training on a really small dataset (sentences in a .txt of ~400Ko) just to tune my code, and did 4 different runs with different parameters and each time the outcome was the same : the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without.
I don't think it is an issue with tensors shapes as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual is to put a softmax unit at the end to get "probabilities" of each character to appear, but clearly this isn't right.
Any ideas to help me ?
I'm also fairly new to Pytorch and RNN so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me and thanks in advance.

LSTM timesteps in Sonnet

I'm currently trying to learn Sonnet.
My network (incomplete, the question is based on this):
class Model(snt.AbstractModule):
def __init__(self, name="LSTMNetwork"):
super(Model, self).__init__(name=name)
with self._enter_variable_scope():
self.l1 = snt.LSTM(100)
self.l2 = snt.LSTM(100)
self.out = snt.LSTM(10)
def _build(self, inputs):
# 'inputs' is of shape (batch_size, input_length)
# I need it to be of shape (batch_size, sequence_length, input_length)
l1_state = self.l1.initialize_state(np.shape(inputs)[0]) # init with batch_size
l2_state = self.l2.initialize_state(np.shape(inputs)[0]) # init with batch_size
out_state = self.out.initialize_state(np.shape(inputs)[0])
l1_out, l1_state = self.l1(inputs, l1_state)
l1_out = tf.tanh(l1_out)
l2_out, l2_state = self.l2(l1_out, l2_state)
l2_out = tf.tanh(l2_out)
output, out_state = self.out(l2_out, out_state)
output = tf.sigmoid(output)
return output, out_state
In other frameworks (eg. Keras), LSTM inputs are of the form (batch_size, sequence_length, input_length).
However, the Sonnet documentation states that the input to Sonnet's LSTM is of the form (batch_size, input_length).
How do I use them for sequential input?
So far, I've tried using a for loop inside _build, iterating over each timestep, but that gives seemingly random outputs.
I've tried the same architecture in Keras, which runs without any issues.
I'm executing in eager mode, using GradientTape for training.
We generally wrote the RNNs in Sonnet to work on a single timestep basis, as for Reinforcement Learning you often need to run one timestep to pick an action, and without that action you can't get the next observation (and the next input timestep) from the environment. It's easy to unroll a single timestep module over a sequence using tf.nn.dynamic_rnn (see below). We also have a wrapper which takes care of composing several RNN cores per timestep, which I believe is what you're looking to do. This has the advantage that the DeepCore object supports the start state methods required for dynamic_rnn, so it's API compatibe with LSTM or any other single-timestep module.
What you want to do should be achievable like this:
# Create a single-timestep RNN module by composing recurrent modules and
# non-recurrent ops.
model = snt.DeepRNN([
snt.LSTM(100),
tf.tanh,
snt.LSTM(100),
tf.tanh,
snt.LSTM(100),
tf.sigmoid
], skip_connections=False)
batch_size = 2
sequence_length = 3
input_size = 4
single_timestep_input = tf.random_uniform([batch_size, input_size])
sequence_input = tf.random_uniform([batch_size, sequence_length, input_size])
# Run the module on a single timestep
single_timestep_output, next_state = model(
single_timestep_input, model.initial_state(batch_size=batch_size))
# Unroll the module on a full sequence
sequence_output, final_state = tf.nn.dynamic_rnn(
core, sequence_input, dtype=tf.float32)
A few things to note - if you haven't already please have a look at the RNN example in the repository, as this shows a full graph mode training procedure setup around a fairly similar model.
Secondly, if you do end up needing to implement a more complex module that DeepRNN allows for, it's important to thread the recurrent state in and out of the module. In your example you're making the input state internally, and l1_state and l2_state as output are effectively discarded, so this can't be properly trained. If DeepRNN wasn't available, your model would look like this:
class LSTMNetwork(snt.RNNCore): # Note we inherit from the RNN-specific subclass
def __init__(self, name="LSTMNetwork"):
super(Model, self).__init__(name=name)
with self._enter_variable_scope():
self.l1 = snt.LSTM(100)
self.l2 = snt.LSTM(100)
self.out = snt.LSTM(10)
def initial_state(self, batch_size):
return (self.l1.initial_state(batch_size),
self.l2.initial_state(batch_size),
self.out.initial_state(batch_size))
def _build(self, inputs, prev_state):
# separate the components of prev_state
l1_prev_state, l2_prev_state, out_prev_state = prev_state
l1_out, l1_next_state = self.l1(inputs, l1_prev_state)
l1_out = tf.tanh(l1_out)
l2_out, l2_next_state = self.l2(l1_out, l2_prev_state)
l2_out = tf.tanh(l2_out)
output, out_next_state = self.out(l2_out, out_prev_state)
# Output state of LSTMNetwork contains the output states of inner modules.
full_output_state = (l1_next_state, l2_next_state, out_next_state)
return tf.sigmoid(output), full_output_state
Finally, if you're using eager mode I would strongly encourage you to have a look at Sonnet 2 - it's a complete rewrite for TF 2 / Eager mode. It's not backwards compatible, but all the same kinds of module compositions are possible. Sonnet 1 was written primarily for Graph mode TF, and while it does work with Eager mode you'll probably encounter some things that aren't very convenient.
We worked closely with the TensorFlow team to make sure that TF 2 & Sonnet 2 work nicely together, so please have a look: (https://github.com/deepmind/sonnet/tree/v2). Sonnet 2 should be considered alpha, and is being actively developed, so we don't have loads of examples yet, but more will be added in the near future.

Categories

Resources