LSTM autoencoder for anomaly detection

LSTM autoencoder for anomaly detection - python

I'm testing out different implementation of LSTM autoencoder on anomaly detection on 2D input.
My question is not about the code itself but about understanding the underlying behavior of each network.
Both implementation have the same number of units (16). Model 2 is a "typical" seq to seq autoencoder with the last sequence of the encoder repeated "n" time to match the input of the decoder.
I'd like to understand why Model 1 seem to easily over-perform Model 2 and why Model 2 isn't able to do better than the mean ?
Model 1:
class LSTM_Detector(Model):
def __init__(self, flight_len, param_len, hidden_state=16):
super(LSTM_Detector, self).__init__()
self.input_dim = (flight_len, param_len)
self.units = hidden_state
self.encoder = layers.LSTM(self.units,
return_state=True,
return_sequences=True,
activation="tanh",
name='encoder',
input_shape=self.input_dim)
self.decoder = layers.LSTM(self.units,
return_sequences=True,
activation="tanh",
name="decoder",
input_shape=(self.input_dim[0],self.units))
self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
def call(self, x):
output, hs, cs = self.encoder(x)
encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
decoded = self.decoder(output, initial_state=encoded_state)
output_decoder = self.dense(decoded)
return output_decoder
Model 2:
class Seq2Seq_Detector(Model):
def __init__(self, flight_len, param_len, hidden_state=16):
super(Seq2Seq_Detector, self).__init__()
self.input_dim = (flight_len, param_len)
self.units = hidden_state
self.encoder = layers.LSTM(self.units,
return_state=True,
return_sequences=False,
activation="tanh",
name='encoder',
input_shape=self.input_dim)
self.repeat = layers.RepeatVector(self.input_dim[0])
self.decoder = layers.LSTM(self.units,
return_sequences=True,
activation="tanh",
name="decoder",
input_shape=(self.input_dim[0],self.units))
self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
def call(self, x):
output, hs, cs = self.encoder(x)
encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
repeated_vec = self.repeat(output)
decoded = self.decoder(repeated_vec, initial_state=encoded_state)
output_decoder = self.dense(decoded)
return output_decoder
I fitted this 2 models for 200 Epochs on a sample of data (89, 1500, 77) each input being a 2D aray of (1500, 77). And the test data (10,1500,77). Both model had only 16 units.
Here or the results of the autoencoder on one features of the test data.
Results Model 1: (black line is truth, red in reconstructed image)
Results Model 2:
I understand the second one is more restrictive since all the information from the input sequence is compressed into one step, but I'm still surprise that it's barely able to do better than predict the average.
On the other hand, I feel Model 1 tends to be more "influenced" by new data without giving back the input. see example below of Model 1 having a flat line as input :
PS : I know it's not a lot of data for that kind of model, I have much more available but at this stage I'm just experimenting and trying to build my understanding.
PS 2 : Neither models overfitted their data and the training and validation curve are almost text book like.
Is anyone able to explain why there is such a gap in term of behavior ?
Thank you

In model 1, each point of 77 features is compressed and decompressed this way: 77->16->16->77 plus some info from the previous steps. It seems that replacing LSTMs with just TimeDistributed(Dense(...)) may also work in this case, but cannot say for sure as I don't know the data. The third image may become better.
What predicts model 2 usually happens when there is no useful signal in the input and the best thing model can do (well, optimize to do) is just to predict the mean target value of the training set.
In model 2 you have:
...
self.encoder = layers.LSTM(self.units,
return_state=True,
return_sequences=False,
...
and then
self.repeat = layers.RepeatVector(self.input_dim[0])
So, in fact, when it does
repeated_vec = self.repeat(output)
decoded = self.decoder(repeated_vec, initial_state=encoded_state)
it just takes only one last output from the encoder (which in this case represents the last step of 1500), copies it 1500 times (input_dim[0]), and tries to predict all 1500 values from the information about a couple of last ones. Here is where the model loses most of the useful signal. It does not have enough/any information about the input, and the best thing it can learn in order to minimize the loss function (which I suppose in this case is MSE or MAE) is to predict the mean value for each of the features.
Also, a seq to seq model usually passes a prediction of a decoder step as an input to the next decoder step, in the current case, it is always the same value.
TL;DR 1) seq-to-seq is not the best model for this case; 2) due to the bottleneck it cannot really learn to do anything better than just to predict the mean value for each feature.

Related

LSTM Autoencoder problems

TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
def __init__(self, n_features, latent_size):
super(SeqEncoderLSTM, self).__init__()
self.lstm = nn.LSTM(
n_features,
latent_size,
batch_first=True)
def forward(self, x):
_, hs = self.lstm(x)
return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
def __init__(self, emb_size, n_features):
super(SeqDecoderLSTM, self).__init__()
self.cell = nn.LSTMCell(n_features, emb_size)
self.dense = nn.Linear(emb_size, n_features)
def forward(self, hs_0, seq_len):
x = torch.tensor([])
# Final hidden and cell state from encoder
hs_i, cs_i = hs_0
# reconstruct first element with encoder output
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
# reconstruct remaining elements
for i in range(1, seq_len):
hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
def __init__(self, n_features, emb_size):
super(LSTMEncoderDecoder, self).__init__()
self.n_features = n_features
self.hidden_size = emb_size
self.encoder = SeqEncoderLSTM(n_features, emb_size)
self.decoder = SeqDecoderLSTM(emb_size, n_features)
def forward(self, x):
seq_len = x.shape[1]
hs = self.encoder(x)
hs = tuple([h.squeeze(0) for h in hs])
out = self.decoder(hs, seq_len)
return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Training model on {device}')
model = model.to(device)
opt = optimizer(model.parameters(), lr)
train_loss = []
valid_loss = []
for e in tqdm(range(epochs)):
running_tl = 0
running_vl = 0
for x in trainload:
x = x.to(device).float()
opt.zero_grad()
x_hat = model(x)
if reverse:
x = torch.flip(x, [1])
loss = criterion(x_hat, x)
loss.backward()
opt.step()
running_tl += loss.item()
if testload is not None:
model.eval()
with torch.no_grad():
for x in testload:
x = x.to(device).float()
loss = criterion(model(x), x)
running_vl += loss.item()
valid_loss.append(running_vl / len(testload))
model.train()
train_loss.append(running_tl / len(trainload))
return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
def __init__(self, data, window, n_features, overlap=0):
super().__init__()
if isinstance(data, (np.ndarray)):
data = torch.tensor(data)
elif isinstance(data, (pd.Series, pd.DataFrame)):
data = torch.tensor(data.copy().to_numpy())
else:
raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
self.n_features = n_features
self.seqs = torch.split(data, window)
def __len__(self):
return len(self.seqs)
def __getitem__(self, idx):
try:
return self.seqs[idx].view(-1, self.n_features)
except TypeError:
raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or now long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?

Okay, after some debugging I think I know the reasons.
TLDR
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
def get_data(subtract: bool = False):
# (1, 14, 5)
input_tensor = torch.tensor(
[
[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
]
).unsqueeze(0)
if subtract:
initial_values = input_tensor[:, 0, :]
input_tensor -= torch.roll(input_tensor, 1, 1)
input_tensor[:, 0, :] = initial_values
return input_tensor
if __name__ == "__main__":
torch.manual_seed(0)
HIDDEN_SIZE = 10
SUBTRACT = False
input_tensor = get_data(SUBTRACT)
model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
for i in range(1000):
outputs = model(input_tensor)
loss = criterion(outputs, input_tensor)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"{i}: {loss}")
if loss < 1e-4:
break
# Plotting
sns.lineplot(data=outputs.detach().numpy().squeeze())
sns.lineplot(data=input_tensor.detach().numpy().squeeze())
plt.show()
What it does:
get_data either works on the data your provided if subtract=False or (if subtract=True) it subtracts value of the previous timestep from the current timestep
Rest of the code optimizes the model until 1e-4 loss reached (so we can compare how model's capacity and it's increase helps and what happens when we use the difference of timesteps instead of timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually use difference of timesteps instead of timesteps (or some other transformation, see here for more info about that). In other cases, neural network will try to simply... copy output from the previous step (as that's the easiest thing to do). Some minima will be found this way and going out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from previous timestep; neural network has to learn how the function actually varies
Use larger model (for the whole dataset you should try something like 300 I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs, in this way you can get info from forward and backward pass of LSTM (not to confuse with backprop!). This also should boost your score
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Difference removes the urge of the neural network to base it's predictions on the past timestep too much (by simply getting last value and maybe changing it a little)
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved which makes the thing harder (see here for similar case). In case of LSTMs there are non-linearites, that's one point.
Another one is that we are accumulating timesteps into single encoder state. So essentially we would have to accumulate timesteps identities into a single hidden and cell states which is highly unlikely.
One last point, depending on the length of sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), hence even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What the neural network would do (what is the easiest here)? It would, probably, discard this 1 or smaller change as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? Whole neural network loss is in the [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.

Difference between these implementations of LSTM Autoencoder?

Specifically what spurred this question is the return_sequence argument of TensorFlow's version of an LSTM layer.
The docs say:
Boolean. Whether to return the last output. in the output sequence,
or the full sequence. Default: False.
I've seen some implementations, especially autoencoders that use this argument to strip everything but the last element in the output sequence as the output of the 'encoder' half of the autoencoder.
Below are three different implementations. I'd like to understand the reasons behind the differences, as the seem like very large differences but all call themselves the same thing.
Example 1 (TensorFlow):
This implementation strips away all outputs of the LSTM except the last element of the sequence, and then repeats that element some number of times to reconstruct the sequence:
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
# Decoder below
model.add(RepeatVector(n_out))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
When looking at implementations of autoencoders in PyTorch, I don't see authors doing this. Instead they use the entire output of the LSTM for the encoder (sometimes followed by a dense layer and sometimes not).
Example 1 (PyTorch):
This implementation trains an embedding BEFORE an LSTM layer is applied... It seems to almost defeat the idea of an LSTM based auto-encoder... The sequence is already encoded by the time it hits the LSTM layer.
class EncoderLSTM(nn.Module):
def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
super(EncoderLSTM, self).__init__()
self.hidden_size = hidden_size
self.n_layers = n_layers
self.embedding = nn.Embedding(input_size, hidden_size)
self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)
def forward(self, inputs, hidden):
# Embed input words
embedded = self.embedding(inputs)
# Pass the embedded word vectors into LSTM and return all outputs
output, hidden = self.lstm(embedded, hidden)
return output, hidden
Example 2 (PyTorch):
This example encoder first expands the input with one LSTM layer, then does its compression via a second LSTM layer with a smaller number of hidden nodes. Besides the expansion, this seems in line with this paper I found: https://arxiv.org/pdf/1607.00148.pdf
However, in this implementation's decoder, there is no final dense layer. The decoding happens through a second lstm layer that expands the encoding back to the same dimension as the original input. See it here. This is not in line with the paper (although I don't know if the paper is authoritative or not).
class Encoder(nn.Module):
def __init__(self, seq_len, n_features, embedding_dim=64):
super(Encoder, self).__init__()
self.seq_len, self.n_features = seq_len, n_features
self.embedding_dim, self.hidden_dim = embedding_dim, 2 * embedding_dim
self.rnn1 = nn.LSTM(
input_size=n_features,
hidden_size=self.hidden_dim,
num_layers=1,
batch_first=True
)
self.rnn2 = nn.LSTM(
input_size=self.hidden_dim,
hidden_size=embedding_dim,
num_layers=1,
batch_first=True
)
def forward(self, x):
x = x.reshape((1, self.seq_len, self.n_features))
x, (_, _) = self.rnn1(x)
x, (hidden_n, _) = self.rnn2(x)
return hidden_n.reshape((self.n_features, self.embedding_dim))
Question:
I'm wondering about this discrepancy in implementations. The difference seems quite large. Are all of these valid ways to accomplish the same thing? Or are some of these mis-guided attempts at a "real" LSTM autoencoder?

There is no official or correct way of designing the architecture of an LSTM based autoencoder... The only specifics the name provides is that the model should be an Autoencoder and that it should use an LSTM layer somewhere.
The implementations you found are each different and unique on their own even though they could be used for the same task.
Let's describe them:
TF implementation:
It assumes the input has only one channel, meaning that each element in the sequence is just a number and that this is already preprocessed.
The default behaviour of the LSTM layer in Keras/TF is to output only the last output of the LSTM, you could set it to output all the output steps with the return_sequences parameter.
In this case the input data has been shrank to (batch_size, LSTM_units)
Consider that the last output of an LSTM is of course a function of the previous outputs (specifically if it is a stateful LSTM)
It applies a Dense(1) in the last layer in order to get the same shape as the input.
PyTorch 1:
They apply an embedding to the input before it is fed to the LSTM.
This is standard practice and it helps for example to transform each input element to a vector form (see word2vec for example where in a text sequence, each word that isn't a vector is mapped into a vector space). It is only a preprocessing step so that the data has a more meaningful form.
This does not defeat the idea of the LSTM autoencoder, because the embedding is applied independently to each element of the input sequence, so it is not encoded when it enters the LSTM layer.
PyTorch 2:
In this case the input shape is not (seq_len, 1) as in the first TF example, so the decoder doesn't need a dense after. The author used a number of units in the LSTM layer equal to the input shape.
In the end you choose the architecture of your model depending on the data you want to train on, specifically: the nature (text, audio, images), the input shape, the amount of data you have and so on...

Time Series Forecasting model with LSTM in Tensorflow predicts a constant

I am building a hurricane track predictor using satellite data. I have a multiple to many output in a multilayer LSTM model, with input and output arrays following the structure [samples[time[features]]]. I have as features of inputs and outputs the coordinates of the hurricane, WS, and other dimensions.
The problem is that the error reduction, and as a consequence, the model predicts always a constant. After reading several posts, I standardized the data, removed some unnecessary layers, but still, the model always predicts the same output.
I think the model is big enough, activation functions make sense, given that the outputs are all within [-1;1].
So my questions are : What am I doing wrong ?
The model is the following:
class Stacked_LSTM():
def __init__(self, training_inputs, training_outputs, n_steps_in, n_steps_out, n_features_in, n_features_out, metrics, optimizer, epochs):
self.training_inputs = training_inputs
self.training_outputs = training_outputs
self.epochs = epochs
self.n_steps_in = n_steps_in
self.n_steps_out = n_steps_out
self.n_features_in = n_features_in
self.n_features_out = n_features_out
self.metrics = metrics
self.optimizer = optimizer
self.stop = EarlyStopping(monitor='loss', min_delta=0.000000000001, patience=30)
self.model = Sequential()
self.model.add(LSTM(360, activation='tanh', return_sequences=True, input_shape=(self.n_steps_in, self.n_features_in,))) #, kernel_regularizer=regularizers.l2(0.001), not a good idea
self.model.add(layers.Dropout(0.1))
self.model.add(LSTM(360, activation='tanh'))
self.model.add(layers.Dropout(0.1))
self.model.add(Dense(self.n_features_out*self.n_steps_out))
self.model.add(Reshape((self.n_steps_out, self.n_features_out)))
self.model.compile(optimizer=self.optimizer, loss='mae', metrics=[metrics])
def fit(self):
return self.model.fit(self.training_inputs, self.training_outputs, callbacks=[self.stop], epochs=self.epochs)
def predict(self, input):
return self.model.predict(input)
Notes
1) In this particular problem, the time series data is not "continuous", because one time serie belongs to a particular hurricane. I have therefore adapted the training and test samples of the time series to each hurricane. The implication of this is that I cannot use the function stateful=True in my layers because it would then mean that the model doesn't makes any difference between the different hurricanes (if my understanding is correct).
2) No image data, so no convolutionnal model needed.

Few suggestions, based on my experience:
4 layers of LSTM is too much. Stick to two, maximum three.
Don't use relu as activations for LSTMs.
Do not use BatchNormalization for time-series.
Other than these, I'd also suggest removing the dense layers between two LSTM layers.

Generation of sequences from latent space [pytorch]

I have several questions about best practice in using recurrent networks in pytorch for generation of sequences.
The first one, if I want to build decoder net should I use nn.GRU (or nn.LSTM) instead nn.LSTMCell (nn.GRUCell)? From my experience, if I work with LSTMCell the speed of calculations is drammatically lower (up to 100 times) than if I use nn.LSTM. Maybe it is related with cudnn optimisation for LSTM (and GRU) module? Is any way to speedup LSTMCell calculations?
I try to build an autoencoder, that accepts sequences of variable length. My autoencoder looks like:
class SimpleAutoencoder(nn.Module):
def init(self, input_size, hidden_size, n_layers=3):
super(SimpleAutoencoder, self).init()
self.n_layers = n_layers
self.hidden_size = hidden_size
self.gru_encoder = nn.GRU(input_size, hidden_size,n_layers,batch_first=True)
self.gru_decoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
self.h2o = nn.Linear(hidden_size,input_size) # Hidden to output
def encode(self, input):
output, hidden = self.gru_encoder(input, None)
return output, hidden
def decode(self, input, hidden):
output,hidden = self.gru_decoder(input,hidden)
return output,hidden
def h2o_apply(self,input):
return self.h2o(input)
My training loop looks like:
one_hot_batch = list(map(lambda x:Variable(torch.FloatTensor(x)),one_hot_batch))
packed_one_hot_batch = pack_padded_sequence(pad_sequence(one_hot_batch,batch_first=True).cuda(),batch_lens, batch_first=True)
_, latent = vae.encode(packed_one_hot_batch)
outputs, = vae.decode(packed_one_hot_batch,latent)
packed = pad_packed_sequence(outputs,batch_first=True)
for string,length,index in zip(*packed,range(batch_size)):
decoded_string_without_sos_symbol = vae.h2o_apply(string[1:length])
loss += criterion(decoded_string_without_sos_symbol,real_strings_batch[index][1:])
loss /= len(batch)
The training in such manner, as I can understand, is teacher force. Because at the decoding stage the network feeds the real inputs (outputs,_ = vae.decode(packed_one_hot_batch,latent)). But, for my task it leads to the situation when, in the test stage, network can generate sequences very well only if I use the real symbols (as in training mode), but if I feed the output of the previous step, the network generates rubbish (just infinite repetition of one specific symbol).
I tried another one approach. I generated “fake” inputs( just ones), to make the model generate only from the hidden state.
one_hot_batch_fake = list(map(lambda x:torch.ones_like(x).cuda(),one_hot_batch))
packed_one_hot_batch_fake = pack_padded_sequence(pad_sequence(one_hot_batch_fake, batch_first=True).cuda(), batch_lens, batch_first=True)
_, latent = vae.encode(packed_one_hot_batch)
outputs, = vae.decode(packed_one_hot_batch_fake,latent)
packed = pad_packed_sequence(outputs,batch_first=True)
It works, but very inefficiently, the quality of reconstruction is very low. So the second question, what is the right way to generate sequences from latent representation?
I suppose, that good idea is to apply teacher forcing with some probability, but for that, how one can use nn.GRU layer so the output of the previous step should be the input for the next step?

Load saved checkpoint and predict not producing same results as in training

I'm training based on a sample code I found on the Internet. The accuracy in testing is at 92% and the checkpoints are saved in a directory. In parallel (the training is running for 3 days now) I want to create my prediction code so I can learn more instead of just waiting.
This is my third day of deep learning so I probably don't know what I'm doing. Here's how I'm trying to predict:
Instantiate the model using the same code as in training
Load the last checkpoint
Try to predict
The code works but the results are nowhere near 90%.
Here's how I create the model:
INPUT_LAYERS = 2
OUTPUT_LAYERS = 2
AMOUNT_OF_DROPOUT = 0.3
HIDDEN_SIZE = 700
INITIALIZATION = "he_normal" # : Gaussian initialization scaled by fan_in (He et al., 2014)
CHARS = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .")
def generate_model(output_len, chars=None):
"""Generate the model"""
print('Build model...')
chars = chars or CHARS
model = Sequential()
# "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
# note: in a situation where your input sequences have a variable length,
# use input_shape=(None, nb_feature).
for layer_number in range(INPUT_LAYERS):
model.add(recurrent.LSTM(HIDDEN_SIZE, input_shape=(None, len(chars)), init=INITIALIZATION,
return_sequences=layer_number + 1 < INPUT_LAYERS))
model.add(Dropout(AMOUNT_OF_DROPOUT))
# For the decoder's input, we repeat the encoded input for each time step
model.add(RepeatVector(output_len))
# The decoder RNN could be multiple layers stacked or a single layer
for _ in range(OUTPUT_LAYERS):
model.add(recurrent.LSTM(HIDDEN_SIZE, return_sequences=True, init=INITIALIZATION))
model.add(Dropout(AMOUNT_OF_DROPOUT))
# For each of step of the output sequence, decide which character should be chosen
model.add(TimeDistributed(Dense(len(chars), init=INITIALIZATION)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
In a separate file predict.py I import this method to create my model and try to predict:
...import code
model = generate_model(len(question), dataset['chars'])
model.load_weights('models/weights.204-0.20.hdf5')
def decode(pred):
return character_table.decode(pred, calc_argmax=False)
x = np.zeros((1, len(question), len(dataset['chars'])))
for t, char in enumerate(question):
x[0, t, character_table.char_indices[char]] = 1.
preds = model.predict_classes([x], verbose=0)[0]
print("======================================")
print(decode(preds))
I don't know what the problem is. I have about 90 checkpoints in my directory and I'm loading the last one based on accuracy. All of them saved by a ModelCheckpoint:
checkpoint = ModelCheckpoint(MODEL_CHECKPOINT_DIRECTORYNAME + '/' + MODEL_CHECKPOINT_FILENAME,
save_best_only=True)
I'm stuck. What am I doing wrong?

In the repo you provided, the training and validation sentences are inverted before being fed into the model (as commonly done in seq2seq learning).
dataset = DataSet(DATASET_FILENAME)
As you can see, the default value for inverted is True, and the questions are inverted.
class DataSet(object):
def __init__(self, dataset_filename, test_set_fraction=0.1, inverted=True):
self.inverted = inverted
...
question = question[::-1] if self.inverted else question
questions.append(question)
You can try to invert the sentences during prediction. Specifically,
x = np.zeros((1, len(question), len(dataset['chars'])))
for t, char in enumerate(question):
x[0, len(question) - t - 1, character_table.char_indices[char]] = 1.

When you generate model in your predict.py file:
model = generate_model(len(question), dataset['chars'])
is your first parameter the same as in your training file? Or is the question length dynamic? If so, you are generating different model, thus your saved checkpoint doesn't work.

It can be the dimentionality of arrays/df passed not matching what is expected by the functions you call. When a single dimention is expected by the called method, try ravel on what you expect to be a single dimention

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

LSTM autoencoder for anomaly detection - python

Related

LSTM Autoencoder problems

Difference between these implementations of LSTM Autoencoder?

Time Series Forecasting model with LSTM in Tensorflow predicts a constant

Generation of sequences from latent space [pytorch]

Load saved checkpoint and predict not producing same results as in training

Categories

Resources