How to Build a Simple LSTM Model for Anomaly Detection - python

I want to create an LSTM model in PyTorch that will be used for anomaly detection, but I'm having trouble understanding the details of doing so.
Note: my training data consists of sets with 16 features over 80 time steps. Here is what I've written for the model below:
class AutoEncoder(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim):
        super(AutoEncoder, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.fc1 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
        out = self.fc1(out[:, -1, :])
        out = self.fc2(out)
        return out

input_dim = 16
hidden_dim = 8
layer_dim = 2

model = AutoEncoder(input_dim, hidden_dim, layer_dim)
I don't think I've built the model correctly. How does it know that I'm feeding it 80 time steps of data? How will the autoencoder reconstruct those 80 time steps of data?
I'm having a hard time understanding the material online. What would the final layer have to be?

If you check the PyTorch LSTM documentation, you will see that the LSTM equations are applied to each time step of your sequence. nn.LSTM infers the seq_len dimension from the input itself, so you do not need to tell it the number of time steps.
At the moment, the line
out = self.fc1(out[:, -1, :])
is selecting only the final hidden state (the one corresponding to time step 80), which is then projected onto a space of size input_dim.
To output a sequence of length 80, you need an output for each hidden state. All 80 hidden states are stacked in out, so you can simply use
out = self.fc1(out)
out = self.fc2(out)
I would also note that if you want two fully connected layers after encoding into your hidden state, you should put a non-linearity between them; otherwise they are equivalent to a single layer, just with more parameters.
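Putting this together, a minimal sketch of the corrected model (with ReLU as one possible choice of non-linearity) might look like:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        out, _ = self.lstm(x)            # (batch, seq_len, hidden_dim); zero initial state by default
        out = torch.relu(self.fc1(out))  # apply the heads to every time step, not just the last
        return self.fc2(out)             # (batch, seq_len, input_dim)

model = AutoEncoder(input_dim=16, hidden_dim=8, layer_dim=2)
x = torch.randn(4, 80, 16)               # 4 sequences of 80 time steps with 16 features
assert model(x).shape == x.shape         # the reconstruction matches the input shape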

Related

How does a multi-layer GRU/LSTM work in PyTorch

I'm trying to understand exactly how the calculations are performed in the PyTorch GRU class. I'm having some trouble reading the GRU PyTorch documentation and the LSTM TorchScript documentation with its code implementation.
The GRU documentation states:
In a multilayer GRU, the input x_t^(l) of the l-th layer (l >= 2) is the hidden state h_t^(l-1) of the previous layer multiplied by dropout δ_t^(l-1), where each δ_t^(l-1) is a Bernoulli random variable which is 0 with probability dropout.
So essentially, given a sequence, each time point should be passed through all the layers on each loop iteration, as in this implementation.
Meanwhile, the LSTM code implementation is:
def script_lstm(input_size, hidden_size, num_layers, bias=True,
                batch_first=False, dropout=False, bidirectional=False):
    '''Returns a ScriptModule that mimics a PyTorch native LSTM.'''
    # The following are not implemented.
    assert bias
    assert not batch_first

    if bidirectional:
        stack_type = StackedLSTM2
        layer_type = BidirLSTMLayer
        dirs = 2
    elif dropout:
        stack_type = StackedLSTMWithDropout
        layer_type = LSTMLayer
        dirs = 1
    else:
        stack_type = StackedLSTM
        layer_type = LSTMLayer
        dirs = 1
    return stack_type(num_layers, layer_type,
                      first_layer_args=[LSTMCell, input_size, hidden_size],
                      other_layer_args=[LSTMCell, hidden_size * dirs,
                                        hidden_size])
class LSTMCell(jit.ScriptModule):
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.weight_ih = Parameter(torch.randn(4 * hidden_size, input_size))
        self.weight_hh = Parameter(torch.randn(4 * hidden_size, hidden_size))
        self.bias_ih = Parameter(torch.randn(4 * hidden_size))
        self.bias_hh = Parameter(torch.randn(4 * hidden_size))

    @jit.script_method
    def forward(self, input: Tensor, state: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
        hx, cx = state
        gates = (torch.mm(input, self.weight_ih.t()) + self.bias_ih +
                 torch.mm(hx, self.weight_hh.t()) + self.bias_hh)
        ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)

        ingate = torch.sigmoid(ingate)
        forgetgate = torch.sigmoid(forgetgate)
        cellgate = torch.tanh(cellgate)
        outgate = torch.sigmoid(outgate)

        cy = (forgetgate * cx) + (ingate * cellgate)
        hy = outgate * torch.tanh(cy)

        return hy, (hy, cy)
class LSTMLayer(jit.ScriptModule):
    def __init__(self, cell, *cell_args):
        super(LSTMLayer, self).__init__()
        self.cell = cell(*cell_args)

    @jit.script_method
    def forward(self, input: Tensor, state: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
        inputs = input.unbind(0)
        outputs = torch.jit.annotate(List[Tensor], [])
        for i in range(len(inputs)):
            out, state = self.cell(inputs[i], state)
            outputs += [out]
        return torch.stack(outputs), state
def init_stacked_lstm(num_layers, layer, first_layer_args, other_layer_args):
    layers = [layer(*first_layer_args)] + [layer(*other_layer_args)
                                           for _ in range(num_layers - 1)]
    return nn.ModuleList(layers)
class StackedLSTM(jit.ScriptModule):
    __constants__ = ['layers']  # Necessary for iterating through self.layers

    def __init__(self, num_layers, layer, first_layer_args, other_layer_args):
        super(StackedLSTM, self).__init__()
        self.layers = init_stacked_lstm(num_layers, layer, first_layer_args,
                                        other_layer_args)

    @jit.script_method
    def forward(self, input: Tensor, states: List[Tuple[Tensor, Tensor]]) -> Tuple[Tensor, List[Tuple[Tensor, Tensor]]]:
        # List[LSTMState]: One state per layer
        output_states = jit.annotate(List[Tuple[Tensor, Tensor]], [])
        output = input
        # XXX: enumerate https://github.com/pytorch/pytorch/issues/14471
        i = 0
        for rnn_layer in self.layers:
            state = states[i]
            output, out_state = rnn_layer(output, state)
            output_states += [out_state]
            i += 1
        return output, output_states
So in this case, each layer runs its own for loop over the sequence and passes the resulting sequence tensor to the next layer.
So my question is: Which is the correct way to implement a multi-layer GRU?
I think you are misunderstanding the definition. The approach you see in the LSTM code, where each layer passes an entire sequence on to the next, is the standard approach for stacked RNNs, at least for sequence-to-sequence models. It's equivalent to RNN(RNN(input)).
It's also what the PyTorch GRU definition is saying, albeit in a somewhat roundabout way. The definition says that for the N-th GRU layer, the input is the hidden state (read: output) of the (N-1)-th GRU layer. Now, in theory, we could run the inputs one at a time through all the layers and collect the outputs. Or we can run the entire sequence through each layer and only keep the last output sequence. The second approach should be faster, because it allows the calculations to be vectorized more efficiently.
Further, if you look at the link you sent with the two different GRU models, you'll see that the results are equivalent whether you run the inputs through each layer one at a time using GRUCells, or use full GRU layers.
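As a quick check of that equivalence, here is a small sketch of my own (not from the linked post): it runs both orderings with shared weights on a 2-layer GRU without dropout and verifies that they agree:

import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=6, num_layers=2)
x = torch.randn(5, 3, 4)    # (seq_len, batch, input_size)
h0 = torch.zeros(2, 3, 6)   # one initial state per layer

# Layer-by-layer ordering: each layer consumes the full output sequence of the previous one
out, hn = gru(x, h0)

# Build per-layer cells and copy the GRU weights into them
cells = [nn.GRUCell(4, 6), nn.GRUCell(6, 6)]
with torch.no_grad():
    for l, cell in enumerate(cells):
        cell.weight_ih.copy_(getattr(gru, f'weight_ih_l{l}'))
        cell.weight_hh.copy_(getattr(gru, f'weight_hh_l{l}'))
        cell.bias_ih.copy_(getattr(gru, f'bias_ih_l{l}'))
        cell.bias_hh.copy_(getattr(gru, f'bias_hh_l{l}'))

# Timestep-by-timestep ordering: push each time point through all layers before moving on
h = [h0[l] for l in range(2)]
outputs = []
for t in range(x.size(0)):
    inp = x[t]
    for l, cell in enumerate(cells):
        h[l] = cell(inp, h[l])
        inp = h[l]
    outputs.append(inp)
out2 = torch.stack(outputs)

print(torch.allclose(out, out2, atol=1e-6))  # True: without dropout, the two orderings agree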
In the PyTorch GRU documentation you will find the num_layers argument, which lets you specify the number of stacked GRU layers.
Here is how the GRU layers are applied in practice:
>>> rnn = nn.GRU(input_size = 10, hidden_size = 20, num_layers = 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
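Here output has shape (5, 3, 20), the last layer's hidden state at every time step, while hn has shape (2, 3, 20), the final hidden state of each of the two layers.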

Overfitting LSTM pytorch

I was following the tutorial on CoderzColumn to implement an LSTM for text classification using PyTorch. I tried to apply the implementation to the bbc-news dataset from Kaggle; however, it heavily overfits, achieving a max accuracy of about 60%.
See the train/validation loss curve, for example:
Is there any advice (I am quite new to RNNs/LSTMs) on how to adapt the model to prevent this heavy overfitting?
The model is taken from the above tutorial and looks like this:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab, target_classes, embed_len=50, hidden_dim=75, n_layers=1):
        super(LSTMClassifier, self).__init__()
        self.n_layers = n_layers
        self.embed_len = embed_len
        self.hidden_dim = hidden_dim
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        # self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, dropout=0.2, num_layers=n_layers, batch_first=True)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(self.n_layers, len(X_batch), self.hidden_dim), torch.randn(self.n_layers, len(X_batch), self.hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.fc(output[:, -1])
I would be really thankful for any advice on how to adapt the version in the tutorial to use it more effectively on other datasets.
Have you tried adding an nn.Dropout layer before self.fc?
Check what p = 0.1 / 0.2 / 0.3 will do, as in the sketch below.
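A minimal sketch of the change, reusing the LSTMClassifier from the question (note the zero initial state: the torch.randn initialization above injects fresh noise on every forward pass):

import torch.nn as nn

class LSTMClassifierWithDropout(LSTMClassifier):
    def __init__(self, vocab, target_classes, p=0.2, **kwargs):
        super().__init__(vocab, target_classes, **kwargs)
        self.dropout = nn.Dropout(p=p)   # try p = 0.1 / 0.2 / 0.3

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        output, _ = self.lstm(embeddings)            # defaults to a zero initial state
        return self.fc(self.dropout(output[:, -1]))  # dropout right before the classifier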
Another thing you can do is add regularisation to your training via the weight_decay parameter:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
Start with small values and increase by a factor of 10 at a time, and see which gets you the best result.
Also, it goes without saying: make sure there are no test data points in the train set, and make sure you did not forget to shuffle your train set:
train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)

PyTorch RNN-BiLSTM sentiment analysis low accuracy

I'm using PyTorch with a training set of movie reviews each labeled positive or negative. Every review is truncated or padded to be 60 words and I have a batch size of 32. This 60x32 Tensor is fed to an embedding layer with an embedding dim of 100 resulting in a 60x32x100 Tensor. Then I use the unpadded lengths of each review to pack the embedding output, and feed that to a BiLSTM layer with hidden dim = 256.
I then pad it back, apply a transformation (to try to get the last hidden state for the forward and backward directions) and feed the transformation to a Linear layer which is 512x1. Here is my module; I pass the final output through a sigmoid, not shown here.
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()
        self.el = nn.Embedding(vocab_size, embedding_dim)
        print('vocab size is ', vocab_size)
        print('embedding dim is ', embedding_dim)
        self.hidden_dim = hidden_dim
        self.num_layers = n_layers  # 2
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers, dropout=dropout, bidirectional=bidirectional)
        # Have an output layer for outputting a single output value
        self.linear = nn.Linear(2*hidden_dim, output_dim)

    def init_hidden(self):
        # note: the attribute is self.num_layers, as named in __init__
        return (torch.zeros(self.num_layers*2, 32, self.hidden_dim).to(device),
                torch.zeros(self.num_layers*2, 32, self.hidden_dim).to(device))

    def forward(self, text, text_lengths):
        print('input text size ', text.size())
        embedded = self.el(text)
        print('embedded size ', embedded.size())
        packed_seq = torch.nn.utils.rnn.pack_padded_sequence(embedded, lengths=text_lengths, enforce_sorted=False)
        packed_out, (ht, ct) = self.lstm(packed_seq, None)
        out_rnn, out_lengths = torch.nn.utils.rnn.pad_packed_sequence(packed_out)
        print('padded lstm out ', out_rnn.size())
        #out_rnn = out_rnn[-1] #this works
        #out_rnn = torch.cat((out_rnn[-1, :, :self.hidden_dim], out_rnn[0, :, self.hidden_dim:]), dim=1) # this works
        out_rnn = torch.cat((ht[-1], ht[0]), dim=1) #this works
        #out_rnn = out_rnn[:, -1, :] #doesn't work maybe should
        print('attempt to get last hidden ', out_rnn.size())
        linear_out = self.linear(out_rnn)
        print('after linear ', linear_out.size())
        return linear_out
I've tried 3 different transformations to get the dimensions correct for the linear layer
out_rnn = out_rnn[-1] #this works
out_rnn = torch.cat((out_rnn[-1, :, :self.hidden_dim], out_rnn[0, :, self.hidden_dim:]), dim=1) # this works
out_rnn = torch.cat((ht[-1], ht[0]), dim=1) #this works
These all produce an output like this
input text size torch.Size([60, 32])
embedded size torch.Size([60, 32, 100])
padded lstm out torch.Size([36, 32, 512])
attempt to get last hidden torch.Size([32, 512])
after linear torch.Size([32, 1])
I would expect the padded lstm out to be [60, 32, 512] but it is always less than 60 in the first dimension.
I'm training for 10 epochs with optim.SGD and nn.BCEWithLogitsLoss(). My training accuracy is always around 52% and my test accuracy is always around 50%, so the model is doing no better than random guessing. I'm sure that my data is being handled correctly in my torchtext.data.Dataset. Am I forwarding my tensors along incorrectly?
I have tried using batch_first=True in my lstm, packed_seq function, and pad_packed_seq function and that breaks my transformations before feeding to the linear layer.
Update
I added the init_hidden method and have tried without the pack/pad sequence methods and still get the same results
I changed my optimizer from SGD to Adam and changed layers from 2 to 1 and my model started to learn, getting accuracies > 75%
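For reference, a sketch of those changes, with names reused from the question. Two side notes that may explain the observations above: pad_packed_sequence pads only to the longest sequence in the batch by default, which is why the padded output was shorter than 60 (passing total_length=60 restores the full length); and with a single layer, ht[0] and ht[-1] are exactly the forward and backward final states, so the concatenation in forward() picks up the right pair, whereas with two layers it mixed the first layer's forward state with the last layer's backward state.

# Adam instead of SGD (assumes `model` is the RNN defined in the question)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Optional, inside forward(): pad the output back to the full 60 steps
out_rnn, out_lengths = torch.nn.utils.rnn.pad_packed_sequence(packed_out, total_length=60)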

Find input that maximises output of a neural network using Keras and TensorFlow

I have used Keras and TensorFlow to classify the Fashion MNIST dataset, following this tutorial.
It uses AdamOptimizer to find the model parameter values that minimize the loss function of the network. The input to the network is a 2-D tensor with shape [28, 28], and the output is a 1-D tensor with shape [10], which is the result of a softmax function.
Once the network has been trained, I want to use the optimizer for another task: finding an input that maximizes one of the elements of the output tensor. How can this be done? Is it possible using Keras, or does one have to use a lower-level API?
Since the input is not unique for a given output, it would be even better if we could impose some constraints on the values the input can take.
The trained model has the following format
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
I feel you would want to backprop with respect to the input while freezing all the weights of your model. What you could do is:
Add a dense layer after the input layer with the same dimensions as the input, and set it as trainable.
Freeze all the other layers of your model (except the one you added).
As input, feed an identity matrix and train your model based on whatever output you desire.
This article and this post might be able to help you if you want to backprop based on the input instead. They're a bit different from what you are aiming for, but you can get the intuition.
It is very similar to the way the filters of a convolutional network are visualized: we do gradient ascent optimization in input space to maximize the response of a particular filter.
Here is how to do it: after training is finished, first we need to specify the output and define a loss function that we want to maximize:
from keras import backend as K
output_class = 0 # the index of the output class we want to maximize
output = model.layers[-1].output
loss = K.mean(output[:,output_class]) # get the average activation of our desired class over the batch
Next, we need to take the gradient of the loss we have defined above with respect to the input layer:
grads = K.gradients(loss, model.input)[0] # the output of `gradients` is a list, just take the first (and only) element
grads = K.l2_normalize(grads) # normalize the gradients to help have a smooth optimization process
Next, we need to define a backend function that takes the initial input image and gives the values of loss and gradients as outputs, so that we can use it in the next step to implement the optimization process:
func = K.function([model.input], [loss, grads])
Finally, we implement the gradient ascent optimization process:
import numpy as np

input_img = np.random.random((1, 28, 28)) # define an initial random image
lr = 1.  # learning rate used for gradient updates
max_iter = 50  # number of gradient update iterations

for i in range(max_iter):
    loss_val, grads_val = func([input_img])
    input_img += grads_val * lr  # update the image based on gradients
Note that, after this process is finished, to display the image you may need to make sure that all the values in the image are in the range [0, 255] (or [0,1]).
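For instance, a quick sketch of one way to rescale and display the result (assuming matplotlib is available):

import matplotlib.pyplot as plt

img = input_img[0]
img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # rescale to [0, 1]
plt.imshow(img, cmap='gray')
plt.show()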
After the hints Saket Kumar Singh gave in his answer, I wrote the following, which seems to solve the question.
I created two custom layers. Maybe Keras already offers some classes that are equivalent to them.
The first one is a trainable input:
class MyInputLayer(keras.layers.Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyInputLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel',
                                      shape=self.output_dim,
                                      initializer='uniform',
                                      trainable=True)
        super(MyInputLayer, self).build(input_shape)

    def call(self, x):
        return self.kernel

    def compute_output_shape(self, input_shape):
        return self.output_dim
The second one gets the probability of the label of interest:
class MySelectionLayer(keras.layers.Layer):
    def __init__(self, position, **kwargs):
        self.position = position
        self.output_dim = 1
        super(MySelectionLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        super(MySelectionLayer, self).build(input_shape)

    def call(self, x):
        mask = np.array([False] * x.shape[-1])
        mask[self.position] = True
        return tf.boolean_mask(x, mask, axis=1)

    def compute_output_shape(self, input_shape):
        return self.output_dim
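As an aside, since Keras may already offer an equivalent: the selection step could likely also be done with a built-in Lambda layer instead of a custom class, e.g. keras.layers.Lambda(lambda x: x[:, class_index:class_index + 1]), though I have not tried that variant.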
I used them in this way:
# Build the model
layer_flatten = keras.layers.Flatten(input_shape=(28, 28))
layerDense1 = keras.layers.Dense(128, activation=tf.nn.relu)
layerDense2 = keras.layers.Dense(10, activation=tf.nn.softmax)
model = keras.Sequential([
    layer_flatten,
    layerDense1,
    layerDense2
])

# Compile the model
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
# ...

# Freeze the model
layerDense1.trainable = False
layerDense2.trainable = False

# Build another model
class_index = 7
layerInput = MyInputLayer((1, 784))
layerSelection = MySelectionLayer(class_index)
model_extended = keras.Sequential([
    layerInput,
    layerDense1,
    layerDense2,
    layerSelection
])

# Compile it
model_extended.compile(optimizer=tf.train.AdamOptimizer(),
                       loss='mean_absolute_error')

# Train it
dummyInput = np.ones((1, 1))
target = np.ones((1, 1))
model_extended.fit(dummyInput, target, epochs=300)

# Retrieve the weights of layerInput
layerInput.get_weights()[0]
Interesting. Maybe a solution would be to feed all your data to the network and, for each sample, save the output layer after the softmax.
This way, for 3 classes, where you want to find the best input for class 1, you are looking for outputs whose first component is high, for example [1 0 0].
Indeed, the output represents the probability, or the confidence of the network, that the sample belongs to each class.
Funny coincidence, I was just working on the same "problem". I'm interested in the direction of adversarial training etc. What I did was insert a LocallyConnected2D layer after the input and then train with data which is all ones and has the class of interest as the target.
As the model I use
batch_size = 64
num_classes = 10
epochs = 20
input_shape = (28, 28, 1)

inp = tf.keras.layers.Input(shape=input_shape)
conv1 = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_initializer='he_normal')(inp)
pool1 = tf.keras.layers.MaxPool2D((2, 2))(conv1)
drop1 = tf.keras.layers.Dropout(0.20)(pool1)
flat = tf.keras.layers.Flatten()(drop1)
fc1 = tf.keras.layers.Dense(128, activation='relu')(flat)
norm1 = tf.keras.layers.BatchNormalization()(fc1)
dropfc1 = tf.keras.layers.Dropout(0.25)(norm1)
out = tf.keras.layers.Dense(num_classes, activation='softmax')(dropfc1)

model = tf.keras.models.Model(inputs=inp, outputs=out)
model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer=tf.keras.optimizers.RMSprop(),
              metrics=['accuracy'])
model.summary()
After training, I insert the new layer:
def insert_intermediate_layer_in_keras(model, new_layer, before_layer_id):
    layers = [l for l in model.layers]

    if before_layer_id == 0:
        x = new_layer
    else:
        x = layers[0].output
    for i in range(1, len(layers)):
        if i == before_layer_id:
            x = new_layer(x)
            x = layers[i](x)
        else:
            x = layers[i](x)

    new_model = tf.keras.models.Model(inputs=layers[0].input, outputs=x)
    return new_model

def fix_model(model):
    for l in model.layers:
        l.trainable = False

fix_model(model)
new_layer = tf.keras.layers.LocallyConnected2D(1, kernel_size=(1, 1),
                                               activation='linear',
                                               kernel_initializer='he_normal',
                                               use_bias=False)
new_model = insert_intermediate_layer_in_keras(model, new_layer, 1)

new_model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.RMSprop(),
                  metrics=['accuracy'])
And finally I rerun training with my fake data:
X_fake = np.ones((60000, 28, 28, 1))
# print(Y_test.shape)  # leftover debug line; Y_test is not defined in this snippet
y_fake = np.ones((60000))
Y_fake = tf.keras.utils.to_categorical(y_fake, num_classes)

new_model.fit(X_fake, Y_fake, epochs=100)
weights = new_layer.get_weights()[0]

imshow(weights.reshape(28, 28))
plt.show()
Results are not yet satisfying, but I'm confident in the approach, and I guess I need to play around with the optimizer.

Building a recurrent neural network with a feed-forward network in PyTorch

I was going through this tutorial. I have a question about the following class code:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))
This code was taken from here, where it is mentioned that
Since the state of the network is held in the graph and not in the layers, you can simply create an nn.Linear and reuse it over and over again for the recurrence.
What I don't understand is how one can just increase the input feature size of nn.Linear and say it is an RNN. What am I missing here?
The network is recurrent because you evaluate multiple time steps in the example.
The following code is also taken from the PyTorch tutorial you linked to.
loss_fn = nn.MSELoss()

batch_size = 10
TIMESTEPS = 5

# Create some fake data
batch = torch.randn(batch_size, 50)
hidden = torch.zeros(batch_size, 20)
target = torch.zeros(batch_size, 10)

loss = 0
for t in range(TIMESTEPS):
    # yes! you can reuse the same network several times,
    # sum up the losses, and call backward!
    hidden, output = rnn(batch, hidden)
    loss += loss_fn(output, target)
loss.backward()
So the network itself is not recurrent, but in this loop you use it as a recurrent network, by feeding the hidden state of the previous forward step together with your batch input multiple times.
You could also use it non-recurrently, by backpropagating the loss at every step and ignoring the hidden state, as in the sketch below.
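A sketch of that variant, reusing the names from the snippet above:

for t in range(TIMESTEPS):
    hidden = torch.zeros(batch_size, 20)  # fresh state every step: no recurrence
    hidden, output = rnn(batch, hidden)
    loss = loss_fn(output, target)
    loss.backward()                       # backprop each step independently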
Since the state of the network is held in the graph and not in the layers, you can simply create an nn.Linear and reuse it over and over again for the recurrence.
This means that the information needed to compute the gradient is not held in the model itself, so you can append multiple evaluations of the module to the graph and then backpropagate through the full graph.
This is described in the previous paragraphs of the tutorial.
