Predicting Fibonacci Using LSTM RNN - python

New to neural nets so please correct my syntax.
I'm trying to create an LSTM RNN that will predict the Fibonacci sequence. When I run the code below, the loss remains incredibly high (around 35339663592701874176).
Why does the shape of the input have to be (batch_size, timesteps, input_dim)? In my example I have 100 data entries, so that'd be my batch_size, and the Fibonacci sequence takes in 2 inputs, so that'd be input_dim, but what would timesteps be in this case? 1?
Shouldn't the units of the LSTM be 1? If I'm understanding correctly, the "units" are just the number of hidden state nodes in the LSTM. So in theory, each of the 2 inputs would have a "1" coefficient weight towards that hidden state after training.
Would an RNN be a suitable model for this problem? When I've looked online, most people like to use the Fibonacci sequence as an example to explain how RNNs work.
Thanks for the help!
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Create Training Data
xs = [[[1, 1]]]
ys = []
i = 0
while i < 100:
    ys.append([xs[i][0][0] + xs[i][0][1]])
    xs.append([[xs[i][0][1], ys[len(ys) - 1][0]]])
    i = i + 1
del xs[len(xs)-1]
xs = np.array(xs, dtype=float)
ys = np.array(ys, dtype=float)
# Create Model
model = keras.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 2)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=[ 'accuracy' ])
# Train
model.fit(xs, ys, epochs=100000)

You can't feed a NN data where some of the values are 10^21 times as large as some of the others and expect it to work; it just doesn't happen.
You're not doing anything here that actually calls for an LSTM (or any RNN): you're not actually using the time dimension, and you're basically just trying to learn addition. Maybe you meant to do something different (like input digits as a sequence, or have the output run for multiple timesteps and give you several values of the sequence), but that's not what you're doing, and it's unclear what you want.
The number of units is your memory/processing capacity. Each unit of an RNN is able to receive values from all of the units in the previous timestep. One unit alone can't do anything interesting, especially with no layer before it to preprocess the data.
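To illustrate the point, here is a minimal sketch (not the original Fibonacci setup) showing that addition of inputs kept in a sensible range is learnable by a single Dense layer, with no recurrence at all; the dataset size and training settings are arbitrary choices for the example:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=(1000, 2))    # two addends, kept in [0, 1]
ys = xs.sum(axis=1, keepdims=True)        # target: their sum

model = keras.Sequential([layers.Dense(1, input_shape=(2,))])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(xs, ys, epochs=200, verbose=0)

print(model.predict(np.array([[0.3, 0.4]])))  # should be close to 0.7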


Why is this ML model giving me zero accuracy?

I am trying to train a network on the Swiss Roll dataset with three features X = [x1, x2, x3] for the classification task. There are four classes with labels 1, 2, 3, 4, and the vector y contains the labels for all the data.
A row in the X matrix looks like this:
-5.2146470e+00 7.0879738e+00 6.7292474e+00
The shape of X is (100, 3), and the shape of y is (100,).
I want to use Radial Basis Functions to train this model. I have used the custom RBFLayer from this StackOverflow answer (also see this explanation) to build the RBFLayer. I want to use a couple of Keras Dense layers to build the network for classification.
What I have tried so far
I have used a Dense layer for the first layer, followed by the custom RBFLayer, and two other Dense layers. Here's the code:
model = Sequential()
model.add(Dense(100, input_dim=3))
# number of units = 10, gamma = 0.05
model.add(RBFLayer(10,0.05))
model.add(Dense(15, activation='relu'))
model.add(Dense(1, activation='softmax'))
This model gives me zero accuracy. I think there is something wrong with the model architecture, but I can't figure out what the issue is.
Also, I thought the number of units in the last Dense layer should match the number of classes, which is 4 in this case. But when I set the number of units to 4 in the last layer, I get the following error:
ValueError: Shapes (None, 1) and (None, 4) are incompatible
Can you help me with this model architecture?
I faced the same issue while practicing multi-class classification, where I had 7 features and the model classified into 7 classes. Encoding the labels fixed the issue.
First, import the LabelEncoder class from sklearn and to_categorical from tensorflow:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
Then, initialize a LabelEncoder object and transform your labels before fitting and training the model:
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)
y = to_categorical(y)
Note that you have to use np.argmax to get the actual predicted class. In my case, the prediction is stored in a variable called res:
res = np.argmax(res, axis=None, out=None)
You can get your actual predicted class after this line. Hope this solves your problem.
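Putting this together for the question above, here is a hedged sketch (reusing the question's X, y, and its custom RBFLayer, which are not redefined here); with one-hot labels, the final Dense layer needs 4 units and a categorical_crossentropy loss:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# y contains labels 1..4; encode them to 0..3, then one-hot encode
encoder = LabelEncoder()
y_onehot = to_categorical(encoder.fit_transform(y))   # shape (100, 4)

model = Sequential()
model.add(Dense(100, input_dim=3))
model.add(RBFLayer(10, 0.05))               # custom layer from the question
model.add(Dense(15, activation='relu'))
model.add(Dense(4, activation='softmax'))   # 4 units = 4 classes
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y_onehot, epochs=100)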
There are four classes with labels 1, 2, 3, 4, and the vector y contains the labels for all the data.
The simplest way to debug input/output mismatches is to print the shapes of the inputs and outputs for a single batch and compare them with what the model expects.
The RBF layer should not be the problem, because the output is taken from the last Dense layer rather than the RBF layer.
In a classification problem, the number of nodes in the last layer must equal the number of classes; in regression, the last layer often has a single node.
Roughly, as pseudocode (check the Keras docs for the exact syntax):
print(input.shape)
# compare it with
print(model.input_shape)
# then at the output
print(output.shape)
# compare it with
print(model.predict(input).shape)

Network loss stalls where it should fall to zero quickly

I have a neural network with 30 input nodes, 1 hidden node, and 1 output node. I am training it on a dataset where the inputs are 30-dimensional vectors with entries between -1 and 1, and the targets are the 2nd entry of these vectors.
I expect the network to train and learn to output the 2nd entry of the input vector quickly, since this is as simple as driving the weights connecting the input nodes to the hidden node to zero, except for the one on the 2nd entry.
However, the loss stalls quickly at approximately 0.168. I'd expect it to quickly go to zero, which is the case if the targets are just 0.
The following code showcases the problem with a randomised dataset.
import numpy as np
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow as tf
np.random.seed(123)
dataSize = 100000
xdata = np.zeros((dataSize, 30))
ydata = np.zeros((dataSize))
for i in range(dataSize):
    vec = (np.random.rand(30) * 2) - 1
    xdata[i] = vec
    ydata[i] = vec[1]
model = models.Sequential()
model.add(layers.Dense(1, activation="relu", input_shape=(30, )))
model.add(layers.Dense(1, activation="sigmoid"))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
lossObject = tf.keras.losses.MeanSquaredError()
model.compile(optimizer=optimizer, loss=lossObject)
model.fit(xdata, ydata, epochs=200, batch_size=32)
I have tried multiple different optimizers, loss functions, batch sizes, dataset sizes and learning rates, however the result is always the loss stalling at a relatively high value.
Why is this happening? I am not interested in responses asking why I am doing this. I am new to neural networks and I need to understand why this is happening before I can continue with my original task.
Thank you in advance.
Your targets are between -1 and 1, but a sigmoid output activation limits outputs to [0, 1], making it impossible to achieve zero loss if any targets happen to be < 0 (which is very likely with a large dataset). You could fix this by using tanh as the output activation instead, which maps to [-1, 1], or simply by using no activation in the output layer, which should be fine in this case. When you fix all targets to 0, this is obviously not an issue, and (almost) zero loss can be achieved.
As a general lesson: Always make sure your output activation makes sense with regards to your target data. At the very least, the value ranges should be identical -- although this might not be a sufficient condition for a good output activation.
As a second point: Having a single node with relu activation is also a bad idea. If the input to relu is < 0, the output will be 0, and the gradient will be, as well. In this case, no learning is possible and incorrect outputs for some data points may never be corrected.
It is generally not a problem if some units are 0 some of the time because the gradient can flow through other paths, but with only one unit, this will likely cripple learning as well. I would recommend that you either use more units in the hidden layer, or use a different activation function.
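A minimal sketch of these suggestions (a linear output, a wider hidden layer, and tanh instead of relu), reusing the imports and data from the question's snippet above; the layer width of 16 is an arbitrary choice:
model = models.Sequential()
model.add(layers.Dense(16, activation="tanh", input_shape=(30,)))   # more units, no dead-relu issue
model.add(layers.Dense(1))                                          # linear output covers [-1, 1]
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.MeanSquaredError())
model.fit(xdata, ydata, epochs=20, batch_size=32)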

How to use Multivariate time-series prediction with Keras, when multiple samples are used

As the title states, I am doing multivariate time-series prediction. I have some experience with this situation and was able to successfully set up and train a working model in TF Keras.
However, I did not know the 'proper' way to handle having multiple unrelated time-series samples. I have about 8000 unique sample 'blocks' with anywhere from 800 time steps to 30,000 time steps per sample. Of course I couldn't concatenate them all into one single time series because the first points of sample 2 are not related in time with the last points of sample 1.
Thus my solution was to fit each sample individually in a loop (at great inefficiency).
My new idea is: can/should I pad the start of each sample with empty time steps equal to the RNN's look-back length, and then concatenate the padded samples into one time series? This would mean that the first time steps have look-back data consisting mostly of 0's, which sounds like another 'hack' for my problem and not the right way to do it.
The main challenge is in 800 vs. 30,000 timesteps, but nothing you can't do.
Model design: group sequences into chunks - for example, 30 sequences of 800-to-900 timesteps, padded, then 60 sequences of 900-to-1000, etc. - don't have to be contiguous (i.e. next can be 1200-to-1500)
Input shape: (samples, timesteps, channels) - or equivalently, (sequences, timesteps, features)
Layers: Conv1D and/or RNNs - e.g. GRU, LSTM. Each can handle variable timesteps
Concatenation: don't do it. If each of your sequences is independent, then each must be fed along dimension 0 in Keras - the batch or samples dimension. If they are dependent, e.g. a multivariate timeseries, like many channels in a signal - then feed them along the channels dimension (dim 2). But never concatenate along the timesteps dimension, as it implies causal continuity where none exists.
Stateful RNNs: can help in processing long sequences - info on how they work here
RNN capability: is limited w.r.t. long sequences, and 800 is already in danger zone even for LSTMs; I'd suggest dimensionality reduction via either autoencoders or CNNs w/ strides > 1 at input, then feeding their outputs to RNNs.
RNN training: is difficult. Long train times, hyperparameter sensitivity, vanishing gradients - but, with proper regularization, they can be powerful. More info here
Zero-padding: before/after/both - debatable, you can read about it, but probably steer clear of "both", as learning to ignore padding is easier with one locality; I personally use "before"
RNN variant: use CuDNNLSTM or CuDNNGRU whenever possible, as they are 10x faster
Note: "samples" above, and in machine learning, refers to independent examples / observations, rather than measured signal datapoints (which would be referred to as timesteps).
Below is a minimal code for what a timeseries-suited model would look like:
from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np
def make_data(batch_shape):  # dummy data
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))

def make_model(batch_shape):  # example model
    ipt = Input(batch_shape=batch_shape)
    x   = Conv1D(filters=16, kernel_size=10, strides=2, padding='valid')(ipt)
    x   = LSTM(units=16)(x)
    out = Dense(1, activation='sigmoid')(x)  # assuming binary classification

    model = Model(ipt, out)
    model.compile(Adam(lr=1e-3), 'binary_crossentropy')
    return model
batch_shape = (32, 100, 16) # 32 samples, 100 timesteps, 16 channels
x, y = make_data(batch_shape)
model = make_model(batch_shape)
model.train_on_batch(x, y)

Keras RNN Regression Input dimensions and architecture

I have recently constructed a CNN in Keras (with Tensorflow as this backend) that takes stellar spectra as an input and predicts three stellar parameters as outputs: Temperature, Surface Gravity, and Metallicity. I am now trying to create an RNN that does the same thing in order to compare the two models.
After searching through examples and forums I haven't come across many applications that are similar enough to my project. I have tried implementing a simple RNN to see if I can come up with sensible results, but no luck so far: the networks don't seem to be learning at all.
I could really use some guidance to get me started. Specifically:
Is an RNN an appropriate network for this type of problem?
What is the correct input shape for the model? I know this depends on the architecture of the network, so I guess that my next question would be: what is a simple architecture to start with that is capable of computing regression predictions?
My input data is such that I have m=50,000 spectra each with n=7000 data points, and L=3 output labels that I am trying to learn. I also have test sets and cross-validation sets with the same n & L dimensions.
When structuring my input data as (m,n,1) and my output targets as (m,L) and using the following architecture, the loss doesn't seem to decrease.
n=7000
L=3
## train_X.shape = (50000, n, 1)
## train_Y.shape = (50000, L)
## cv_X.shape = (10000, n, 1)
## cv_Y.shape = (10000, L)
batch_size=32
lstm_layers = [16, 32]
input_shape = (None, n, 1)
model = Sequential([
    InputLayer(batch_input_shape=input_shape),
    LSTM(lstm_layers[0], return_sequences=True, dropout_W=0.2, dropout_U=0.2),
    LSTM(lstm_layers[1], return_sequences=False),
    Dense(L),
    Activation('linear')
])
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(train_X, train_Y, batch_size=batch_size, nb_epoch=20,
          validation_data=(cv_X, cv_Y), verbose=2)
I have also tried changing my input shape to (m, 1, n) and still haven't had any success. I am not looking for an optimal network, just something that trains and then I can take it from there. My input data isn't a time series, but there are relationships between one section of the spectrum and the previous section, so is there a way I can structure each spectrum into a 2D array that will allow an RNN to learn stellar parameters from the spectra?
First you set
train_X.shape = (50000, n, 1)
Then you write
input_shape = (None, 1, n)
Why don't you try
input_shape = (None, n, 1) ?
It would make more sense for your RNN to receive a sequence of n timesteps and 1 value per time step than the other way around.
Does it help? :)
EDIT:
Alright, after re-reading, here are my two cents on your questions: the LSTM isn't a good idea.
1) Because there is no "temporal" information, there isn't a "direction" in the spectrum information. The LSTM is good at capturing a changing state of the world, for example. It's not the best at combining information at the beginning of your spectrum with information at the end. It will "read" starting from the beginning, and that information will fade as the state is updated. You could try a bidirectional LSTM to counter the fact that there is "no direction". However, go to the second point.
2) 7000 timesteps is way too much for an LSTM to work with. When it trains, at the backpropagation step, the LSTM is unrolled and the information has to go through "7000 layers" (not actually 7000, because they share the same weights). This is very, very difficult to train. I would limit the LSTM to 100 steps at most (from my experience).
Otherwise your input shape is correct :)
Have you tried a deep fully connected network?! I believe this would be more efficient.
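For reference, a hedged sketch of what such a fully connected baseline could look like, reusing the question's train_X of shape (50000, 7000, 1) and train_Y of shape (50000, 3); the layer widths are arbitrary guesses and modern tf.keras syntax is assumed:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

model = Sequential([
    Flatten(input_shape=(7000, 1)),   # treat each spectrum as a flat vector
    Dense(256, activation='relu'),
    Dense(64, activation='relu'),
    Dense(3)                          # linear output for the 3 stellar parameters
])
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(train_X, train_Y, batch_size=32, epochs=20,
          validation_data=(cv_X, cv_Y))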

Understanding Keras LSTMs

I am trying to reconcile my understanding of LSTMs, as explained in this post by Christopher Olah, with their implementation in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is:
The reshaping of the data series into [samples, time steps, features] and,
The stateful LSTMs
Let's concentrate on the above two questions with reference to the code pasted below:
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(100):
    model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
Note: create_dataset takes a sequence of length N and returns an array of length N - look_back, each element of which is a sequence of length look_back.
What is Time Steps and Features?
As can be seen, trainX is a 3-D array with time steps and features being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many-to-one case, where the number of pink boxes is 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes are considered)?
Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?
Stateful LSTMs
Does a stateful LSTM mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one, and the memory is reset between the training runs, so what was the point of saying that it was stateful? I'm guessing this is related to the fact that training data is not shuffled, but I'm not sure how.
Any thoughts?
Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Edit 1:
A bit confused about @van's comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Especially noting the second diagram (batch_size was arbitrarily chosen):
Edit 2:
For people who have done Udacity's deep learning course and still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169
Update:
It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot
Update2:
I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU
As a complement to the accepted answer, this answer shows keras behaviors and how to achieve each picture.
General Keras behavior
The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):
In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.
For this example:
We have N oil tanks
We spent 5 hours taking measures hourly (time steps)
We measured two features:
Pressure P
Temperature T
Our input array should then be something shaped as (N,5,2):
[ Step1 Step2 Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
....
Tank N: [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
]
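As a quick check, a small sketch (with random placeholder values) builds an array of exactly this shape:
import numpy as np

N, steps, features = 4, 5, 2               # 4 tanks, 5 hourly measurements, P and T
data = np.random.rand(N, steps, features)  # placeholder values
print(data.shape)                          # (4, 5, 2) -> (tanks, timesteps, features)
print(data[0, 2])                          # [P, T] of tank A at step 3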
Inputs for sliding windows
Often, LSTM layers are supposed to process entire sequences. Dividing them into windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.
In windows, each window is part of a long original sequence, but by Keras they will be seen each as an independent sequence:
[ Step1 Step2 Step3 Step4 Step5
Window A: [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window B: [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window C: [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
....
]
Notice that in this case, you initially have only one sequence, but you're dividing it into many sequences to create windows.
The concept of "what is a sequence" is abstract. The important parts are:
you can have batches with many individual sequences
what makes the sequences be sequences is that they evolve in steps (usually time steps)
Achieving each case with "single layers"
Achieving standard many to many:
You can achieve many to many with a simple LSTM layer, using return_sequences=True:
outputs = LSTM(units, return_sequences=True)(inputs)
#output_shape -> (batch_size, steps, units)
Achieving many to one:
Using the exact same layer, keras will do the exact same internal preprocessing, but when you use return_sequences=False (or simply ignore this argument), keras will automatically discard the steps previous to the last:
outputs = LSTM(units)(inputs)
#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned
Achieving one to many
Now, this is not supported by keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:
Create a constant multi-step input by repeating a tensor
Use stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)
One to many with repeat vector
In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:
outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)
#output_shape -> (batch_size, steps, units)
Understanding stateful = True
Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit your computer's memory at once)
Stateful allows us to input "parts" of the sequences in stages. The difference is:
In stateful=False, the second batch contains whole new sequences, independent from the first batch
In stateful=True, the second batch continues the first batch, extending the same sequences.
It's like dividing the sequences in windows too, with these two main differences:
these windows do not overlap!
stateful=True will see these windows connected as a single long sequence
In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).
Sequence 1 in batch 2 will continue sequence 1 in batch 1.
Sequence 2 in batch 2 will continue sequence 2 in batch 1.
Sequence n in batch 2 will continue sequence n in batch 1.
Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:
BATCH 1 BATCH 2
[ Step1 Step2 | [ Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], | [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], | [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
.... |
Tank N: [[Pn1,Tn1], [Pn2,Tn2], | [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
] ]
Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).
You can have any number of batches, indefinitely. (For having variable lengths in each batch, use input_shape=(None, features).)
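Here is a hedged sketch of this batch-continuation idea, with made-up shapes (3 tanks, 2 features) and a throwaway model; note that stateful layers need a fixed batch size via batch_input_shape:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_tanks, features = 3, 2
model = Sequential([
    LSTM(8, stateful=True, batch_input_shape=(n_tanks, None, features)),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

batch_1 = np.random.rand(n_tanks, 2, features)  # steps 1-2 of each tank
batch_2 = np.random.rand(n_tanks, 3, features)  # steps 3-5, same tanks

model.reset_states()       # starting fresh sequences
model.predict(batch_1)     # processes steps 1-2 and keeps the state
model.predict(batch_2)     # interpreted as steps 3-5 continuing the same tanks
model.reset_states()       # done with these sequences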
One to many with stateful=True
For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.
Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.
Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use this is the next "many to many" case.
Layer:
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True, #just to keep a nice output shape even with length 1
               input_shape=(None,features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Now, we're going to need a manual loop for predictions:
input_data = someDataWithShape((batch, 1, features))
#important, we're starting new sequences, not continuing old ones:
model.reset_states()
output_sequence = []
last_step = input_data
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
Many to many with stateful=True
Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.
We're using the same method as in the "one to many" above, with the difference that:
we will use the sequence itself to be the target data, one step ahead
we know part of the sequence (so we discard this part of the results).
Layer (same as above):
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,
               input_shape=(None,features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Training:
We are going to train our model to predict the next step of the sequences:
totalSequences = someSequencesShaped((batch, steps, features))
#batch size is usually 1 in these cases (often you have only one Tank in the example)
X = totalSequences[:,:-1] #the entire known sequence, except the last step
Y = totalSequences[:,1:] #one step ahead of X
#loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X, Y)
Predicting:
The first stage of our predicting involves "adjusting the states". That's why we're going to predict the entire sequence again, even if we already know this part of it:
model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step
Now we go to the loop as in the one to many case. But don't reset states here! We want the model to know which step of the sequence it is at (and it knows it's at the first new step because of the prediction we just made above).
output_sequence = [firstNewStep]
last_step = firstNewStep
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
This approach was used in these answers and file:
Predicting a multiple forward time step of a time series using LSTM
how to use the Keras model to forecast for future dates or events?
https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
Achieving complex configurations
In all examples above, I showed the behavior of "one layer".
You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.
One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:
Encoder:
inputs = Input((steps,features))
#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)
#many to one layer:
outputs = LSTM(hidden3)(outputs)
encoder = Model(inputs,outputs)
Decoder:
Using the "repeat" method;
inputs = Input((hidden3,))
#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)
#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)
#last layer
outputs = LSTM(features,return_sequences=True)(outputs)
decoder = Model(inputs,outputs)
Autoencoder:
inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
Train with fit(X,X)
Additional explanations
If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`
First of all, you chose great tutorials (1, 2) to start with.
What time steps means: time_steps == 3 in X.shape (describing the data shape) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.
many to many vs. many to one: In keras, there is a return_sequences parameter when you initialize LSTM, GRU, or SimpleRNN. When return_sequences is False (the default), then it is many to one, as shown in the picture. Its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, then it is many to many. Its return shape is (batch_size, time_step, hidden_unit_length).
Does the features argument become relevant: The features argument means "how big is your red box", i.e. the input dimension at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with feature==8.
Stateful: You can look up the source code. When initializing the state, if stateful==True, then the state from the last training batch will be used as the initial state; otherwise a new state will be generated. I haven't turned stateful on yet. However, I disagree that batch_size can only be 1 when stateful==True.
Currently, you generate your data from collected data. Imagine your stock information is coming in as a stream: rather than waiting for a day to collect all of it sequentially, you would like to generate input data online while training/predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size==400.
When you have return_sequences=True in the last RNN layer, you cannot use a simple Dense layer; use TimeDistributed instead.
Here is an example piece of code that might help others.
words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name="input")

# Build a matrix of size vocabularySize x EmbeddingDimension
# where each row corresponds to a "word embedding" vector.
# This layer will replace each word-id with a word-vector of size EmbeddingDimension.
embeddings = keras.layers.embeddings.Embedding(self.vocabularySize, self.EmbeddingDimension,
                                               name="embeddings")(words)

# Pass the word-vectors to the GRU layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.GRU(512, return_sequences=True,
                                input_shape=(self.maxSequenceLength,
                                             self.EmbeddingDimension),
                                name="rnn")(embeddings)
hiddenStates2 = keras.layers.GRU(128, return_sequences=True,
                                 input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
                                 name="rnn2")(hiddenStates)

denseOutput = TimeDistributed(keras.layers.Dense(self.vocabularySize),
                              name="linear")(hiddenStates2)
predictions = TimeDistributed(keras.layers.Activation("softmax"),
                              name="softmax")(denseOutput)

# Build the computational graph by specifying the input and output of the network.
model = keras.models.Model(input=words, output=predictions)
# model.compile(loss='kullback_leibler_divergence', ...)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(lr=0.009,
                                              beta_1=0.9,
                                              beta_2=0.999,
                                              epsilon=None,
                                              decay=0.01,
                                              amsgrad=False))
Refer to this blog for more details: Animated RNN, LSTM and GRU.
The figure below gives you a better view of an LSTM. It's an LSTM cell.
As you can see, X has 3 features (green circles), so the input of this cell is a vector of dimension 3, and the hidden state has 2 units (red circles), so the output of this cell (and also the cell state) is a vector of dimension 2.
An example of one LSTM layer with 3 timesteps (3 LSTM cells) is shown in the figure below:
A model can have multiple LSTM layers.
Now I use Daniel Möller's example again for better understanding:
We have 10 oil tanks. For each of them we measure 2 features (temperature and pressure) once per hour, 5 times.
Now the parameters are:
batch_size = number of samples used in one forward/backward pass (default=32) --> for example, if you have 1000 samples and you set batch_size to 100, the model will take 10 iterations to pass all of the samples once through the network (1 epoch). The higher the batch size, the more memory you'll need. Because the number of samples in this example is low, we set batch_size equal to all of the samples: 10.
timesteps = 5
features = 2
units = a positive integer that determines the dimensionality of the hidden state and cell state, i.e. the size of the vector passed to the next LSTM cell. It can be chosen arbitrarily or empirically based on the features and timesteps. Using more units gives more capacity (and more computation time), but may also cause overfitting.
input_shape = (batch_size, timesteps, features) = (10,5,2)
output_shape (checked in the sketch below):
(batch_size, timesteps, units) if return_sequences=True
(batch_size, units) if return_sequences=False
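A short, hedged sketch checking these shapes with random placeholder data (10 tanks, 5 steps, 2 features; 3 units chosen arbitrarily):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

x = np.random.rand(10, 5, 2)   # (batch_size, timesteps, features)

many_to_one = Sequential([LSTM(3, input_shape=(5, 2))])
print(many_to_one.predict(x).shape)   # (10, 3): (batch_size, units)

many_to_many = Sequential([LSTM(3, return_sequences=True, input_shape=(5, 2))])
print(many_to_many.predict(x).shape)  # (10, 5, 3): (batch_size, timesteps, units)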
