I am currently trying to understand the meaning of outputs and states of the tf.nn.rnn function in tensorflow:
outputs, states = tf.nn.rnn(lstm_cell, x, dtype=tf.float32)
through the LSTM MNIST tutorial.
Indeed, with respect to the following tutorial, Understanding LSTM, I am wondering what these variables correspond to.
In my opinion, outputs correspond to the hidden state (denoted as h_t in the previous link) but I am not sure.
Thus, I understand that outputs is a list of time_steps tensors of shape (batch_size, n_hidden). But why is states a list of 2 tensors of shape (batch_size, n_hidden)? Is it just the cell state for the last time step?
This seems like an old question. Sorry if you figured it out already.
The outputs are the values coming out of the "top" of your LSTM units in the diagram; that is, the output of your LSTM layer as a whole.
As you suggest in your question, the hidden states are the temporal states that are carried from one time step to the next. They are the C_t and h_t values that move along the time axis as output/input of the unrolled LSTM layer. That is why you get 2 tensors, each of shape (batch_size, n_hidden).
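To make the two return values concrete, here is a minimal sketch using the old API from the question (tf.nn.rnn was later renamed tf.nn.static_rnn); the sizes are illustrative assumptions, not from the original post:

import tensorflow as tf

batch_size, time_steps, n_input, n_hidden = 32, 10, 8, 64  # hypothetical sizes

# tf.nn.rnn expects a list of `time_steps` tensors of shape (batch_size, n_input)
x = [tf.placeholder(tf.float32, [batch_size, n_input]) for _ in range(time_steps)]

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, state_is_tuple=True)
outputs, states = tf.nn.rnn(lstm_cell, x, dtype=tf.float32)

# outputs: a list of `time_steps` tensors, each (batch_size, n_hidden) -- h_t per step
# states:  the final step's LSTMStateTuple(c, h), i.e. the 2 tensors of
#          shape (batch_size, n_hidden) from the question
print(len(outputs))        # time_steps
print(states.c, states.h)  # final cell state C_t and final hidden state h_t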
If there are 10 features and 1 output class (sigmoid activation) with a regression objective:
If I use only 5 neurons in my first dense hidden layer: will the first error be calculated solely based on half of the training feature set? Isn't it imperative to match the number of features with the number of neurons in hidden layer #1 so that the model can see all the features at once? Otherwise it's not getting the whole picture? The first forward-propagation iteration would use 5 out of 10 features and get the error value (and train during backprop; assume batch gradient descent). Then the 2nd forward-propagation iteration would see the remaining 5 out of 10 features with updated weights and hopefully arrive at a smaller error. BUT it's only seeing half the features at a time!
Conversely, if I have a convolutional 2D layer of 64 neurons, and my training shape is (100, 28, 28, 1) (pictures of cats and dogs in greyscale), will each of the 64 neurons see a different 28x28 vector? No, right? Because it can only send one example through the forward propagation at a time. So then only a single picture (cat or dog) should be spanned across the 64 neurons? Why would you want that, since each neuron in that layer has the same filter, stride, padding, and activation function? When you define a Conv2D layer... the parameters of each neuron are the same. So is only a part of the training example going into each neuron? Why have 64 neurons, for example? Just have one neuron, use a filter on it, and pass it along to a second hidden layer with another filter with different parameters!
Please explain the flaws in my logic. Thanks so much.
EDIT: I just realized that for Conv2D you flatten the training data so it becomes a 1D vector, and so a 28x28 image would mean having an input conv2d layer of 784 neurons. But I am still confused about the dense neural network (paragraph #1 above).
What is your "first" layer?
Normally you have an input layer as the first layer, which does not contain any weights.
The shape of the input layer must match the shape of your feature data.
So basically, when you train a model with 10 features but only have an input layer of shape (None, 5) (where None stands for the batch_size), TensorFlow will raise an exception, because it needs data for all inputs in the correct shape.
So what you said is just not going to happen. If you only feed 5 features, the next 5 features won't be fed into the net in the next iteration; instead, the next sample will be sent to the model (let's say no exception is thrown). So again, only the first 5 features of that next sample would be used.
What you can do instead is use an input layer as the first layer with the correct shape for your features. Then as the second layer, you can use any size you like: 1, 10, or 100 dense neurons, it is up to you (and what works well, of course). The shape of the output must in turn match the shape of your label data. A minimal sketch follows below.
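To illustrate, here is a minimal Keras sketch (the shapes and layer sizes are hypothetical, chosen to match the question): the input layer must carry all 10 features, while the hidden width is a free choice:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),         # must match the 10 features
    layers.Dense(5, activation='relu'),  # hidden width is a free choice
    layers.Dense(1)                      # single output
])
model.compile(loss='mse', optimizer='adam')

X = np.random.rand(100, 10)  # every sample carries all 10 features at once
y = np.random.rand(100, 1)
model.fit(X, y, epochs=1, verbose=0)

# Feeding 5-feature data does not "use half the features"; it raises an error:
# model.fit(np.random.rand(100, 5), y)  # ValueError: incompatible input shape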
I hope this makes it clearer.
I am trying to understanding concatenating of layers in tensorflow keras.
Below I have drawn what I think is the concatenation of 2 RNN layers and the output [picture omitted for clarity].
Here I am trying to concatenate two RNN layers. One layer has longitudinal data [integer-valued] for patients over one time sequence, and the other layer has details of the same patients over another time sequence, with categorical input.
I don't want these two different time sequences to be mixed up, since this is medical data, so I am trying this approach. But before that, I want to be sure that what I have drawn is what concatenating two layers means.
Below is my code. It appears to work well, but I want to confirm that what I drew and what is implemented are consistent.
# create a SimpleRNN over the first input sequence
import tensorflow as tf
from tensorflow.keras import layers, initializers
from tensorflow.keras.layers import Input

first_input = Input(shape=(4, 7), dtype='float32')
simpleRNN1 = layers.SimpleRNN(units=25, bias_initializer=initializers.RandomNormal(stddev=0.0001),
                              activation="relu", kernel_initializer="random_uniform")(first_input)

# another RNN layer over the second input sequence
second_input = Input(shape=(16, 1), dtype='float32')
simpleRNN2 = layers.SimpleRNN(units=25, bias_initializer=initializers.RandomNormal(stddev=0.0001),
                              activation="relu", kernel_initializer="random_uniform")(second_input)

# concatenate the two layers, stack dense layers on top
concat_lay = tf.keras.layers.Concatenate()([simpleRNN1, simpleRNN2])
dens_lay = layers.Dense(64, activation='relu')(concat_lay)
dens_lay = layers.Dense(32, activation='relu')(dens_lay)
dens_lay = layers.Dense(1, activation='sigmoid')(dens_lay)

model = tf.keras.Model(inputs=[first_input, second_input], outputs=[dens_lay])
# `lr` is not a valid compile() argument; set the learning rate on the optimizer instead
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy"])
model.summary()
Concatenation means 'chaining together' or 'unification' here, making a union of two entities.
I think your problem is addressed in https://datascience.stackexchange.com/questions/29634/how-to-combine-categorical-and-continuous-input-features-for-neural-network-trai (How to combine categorical and continuous input features for neural network training).
If you have biomedical data, e.g. ECG, as the continuous data and diagnoses as the categorical data, I would consider ensemble learning the best ansatz.
What the best solution is here depends on the details of your problem ...
Building an ensemble of two neural nets is described in https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/
Yes, what you have implemented is correct (it matches the diagram). To be exact, it is doing the following. Here, the blue nodes denote Inputs/Outputs and N denotes None (this is the batch dimension).
But just to add few notes,
I am assuming that you want to mix the two outputs of the RNNs at the first Dense layer (with 64 units) because after that, there's no telling which input is which.
When you use Concatenate, as a rule of thumb, specify the axis you need to concatenate on (it defaults to -1). Here, as you have two inputs of shape (None, 25) and (None, 25), axis=-1 works out fine. But it is always good to be explicit; otherwise you might end up with bizarre results when you are implementing complex models.
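A small sketch of that rule of thumb (the shapes are chosen to match the model above, with a batch of 2 standing in for None):

import tensorflow as tf

a = tf.zeros((2, 25))  # stands in for the (None, 25) output of simpleRNN1
b = tf.zeros((2, 25))  # stands in for the (None, 25) output of simpleRNN2
c = tf.keras.layers.Concatenate(axis=-1)([a, b])
print(c.shape)  # (2, 50) -- the two feature vectors are joined end to end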
I have medical longitudinal data on which I am doing research.
To start with, I am working with a sample of 4000 rows with 3 time steps (3 columns) of a bone size, corresponding to the size of the bone measured in 3 different months.
I am done with the basic model. Now I want to be sure if my understanding of the model is correct.
from tensorflow.keras import Sequential, layers

model = Sequential()
model.add(layers.SimpleRNN(units=10, input_shape=(3, 1), use_bias=True,
                           bias_initializer='zeros', activation="relu",
                           kernel_initializer="random_uniform"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='sgd')
model.summary()
model.fit(trainX, train_op, epochs=100, batch_size=50, verbose=2)  # trainX: (samples, 3, 1); train_op: binary labels
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
Following are a few doubts I have around this model:
Here return_sequences is False, so shouldn't I get only the last output from the RNN layer? Why is the output from the RNN layer of shape (None, 10)? I assumed it would be (sample_size, 1).
Also, the logic mentioned below is flawed, but I need to resolve it:
units corresponds to output units. Initially my guess was that since there are 3 time steps there would have to be 3 output units, but I was surprised that even if I set units to 128, 10, or 1, the model worked. How and why does this happen? This question, together with the one above, confuses me even more.
input_shape corresponds to [sample size, number of time steps, features]. Here, I am measuring 1 bone size over 3 time periods. Is my understanding correct when I say the input shape is (sample_size, 3, 1)? Moreover, I am confused about how numpy represents a 3D array. It seems that to get the required dimensions I would need to input it as #features, observations/sample_size, timesteps. Do I have to reshape my inputs according to how numpy represents 3D arrays, or should I leave them as they are?
Moreover, how can I build a model if I have different sets of features measured over different time frames, or with varying numbers of time steps? How can I incorporate these into the above model?
Yes, you get the last output, which is a 10-dimensional vector, not a one-dimensional vector, so getting shape (samples, 10) is correct.
Number of units has nothing to do with timesteps; the number of timesteps is how many times the neurons are applied recurrently, so it is orthogonal to the number of features or units (a quick sketch follows below).
Yes, the shape of your inputs should be (samples, 3, 1) and the input_shape should be (3, 1); all of this is correct in your code. I am not sure what you mean by "how numpy represents 3d array": the shape is unambiguous, and numpy does not do any modifications to input shapes.
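A quick sketch to see the units point (hypothetical zero data): units sets the output width, independently of the 3 timesteps:

import numpy as np
from tensorflow.keras import Sequential, layers

for units in (1, 10, 128):
    m = Sequential([layers.SimpleRNN(units=units, input_shape=(3, 1))])
    out = m.predict(np.zeros((5, 3, 1)), verbose=0)  # 5 samples, 3 timesteps, 1 feature
    print(units, out.shape)  # (5, 1), (5, 10), (5, 128)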
I am working with LSTM for my time series forecasting problem. I have the following network:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(300, activation='tanh', input_shape=(20, 1)))  # the parameter is `units`, not `units_size`
model.add(Dense(20))
My forecasting problem is to forecast the next 20 time steps looking back at the last 20 time steps. So, for each iteration, I have an input like (x_t-20, ..., x_t) and forecast the next (x_t+1, ..., x_t+20). For the hidden layer, I use 300 hidden units.
As an LSTM is different from a simple feed-forward neural network, I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells with 300 units each? How is the output generated from these cells? As I describe above, I have 20 time steps to predict; are all of these steps generated from the last LSTM cell? I have no idea. Can someone give a general diagram of this kind of network structure?
Regarding these questions,
I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells with 300 units each? How is the output generated from these cells?
It is simpler to consider the LSTM layer you have defined as a single block. This diagram is heavily borrowed from Francois Chollet's Deep Learning with Python book:
In your model, the input shape is defined as (20, 1), so you have 20 time steps of size 1. For a moment, consider that the output Dense layer is not present.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential()
model.add(LSTM(300, input_shape=(20, 1)))
model.summary()
# lstm_7 (LSTM)                (None, 300)               362400
The output shape of the LSTM layer is (None, 300), which means each input sequence produces an output vector of size 300.
import numpy as np

output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)
# (1, 300)
input (1,20,1) => batch size = 1, time-steps = 20, input-feature-size = 1.
output (1, 300) => batch size = 1, output-feature-size = 300
Keras ran the LSTM recurrently over the 20 time steps and generated an output of size 300. In the diagram above, this is Output t+19.
Now, if you add the Dense layer after the LSTM, the output will be of size 20, which is straightforward: the Dense layer maps the 300-dimensional LSTM output onto the 20 forecast steps.
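A self-contained sketch of that last step (the same model as above, with the Dense head added back):

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([LSTM(300, input_shape=(20, 1)), Dense(20)])
output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)  # (1, 20) -- one 20-step forecast per input sequence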
To understand LSTMs, I'd recommend first spending a few minutes to understand 'plain vanilla' RNNs, as LSTMs are just a more complex version of those. I'll try to describe what's happening in your network as if it were a basic RNN.
You are training a single set of weights that are repeatedly used for each time step (t-20,...,t). The first weight (let's say W1) is for inputs. One by one, each of x_t-20,...,x_t is multiplied by W1, then a non-linear activation function is applied - same as any NN forward pass.
The difference with RNNs is the presence of a separate 'state' (note: not a trained weight), that can start off random or zero, and carries information about your sequence across time steps. There's another weight for the state (W2). So starting at the first time step t-20, the initial state is multiplied by W2 and an activation function applied.
So at timestep t-20 we have the output from W1 (on inputs) and W2 (on state). We can combine these outputs at each timestep, and use it to generate the state to pass to the next timestep, i.e. t-19. Because the state has to be calculated at each timestep and passed to the next, these calculations have to happen sequentially starting from t-20. To generate our desired output, we can take each output state across all timesteps - or only take the output at the final timestep. As return_sequences=False by default in Keras, you are only using the output at the final timestep, which then goes into your dense layer.
The weights W1 and W2 need to have one dimension equal to the dimension of each timestep input x_t-20, ... for the matrix multiplication to work. This dimension is 1 in your case, as each of the 20 inputs is a 1-d vector (or a number), which is multiplied by W1. However, we are free to set the second dimension of the weights as we please: 300 in your case. So W1 is of size 1x300 and is applied 20 times, once for each timestep.
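Here is a minimal numpy sketch of that loop (biases and the LSTM gating omitted; the names W1, W2 and the sizes follow the description above):

import numpy as np

T, n_in, n_units = 20, 1, 300
x = np.random.randn(T, n_in)             # one sequence: 20 timesteps of size 1
W1 = np.random.randn(n_in, n_units)      # input weights
W2 = np.random.randn(n_units, n_units)   # state weights
state = np.zeros(n_units)                # initial state (zeros here)

for t in range(T):                       # must run sequentially across timesteps
    state = np.tanh(x[t] @ W1 + state @ W2)

print(state.shape)  # (300,) -- the final-timestep output (return_sequences=False)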
This lecture will take you through the basic flow diagram of RNNs that I described above, all the way to more advanced stuff which you can skip. This is a famous explanation of LSTMs if you want to make the leap from basic RNNs to LSTMs, which you may not need to do - there are just more complicated weights and states.
I've got a question on Tensorflow LSTM-Implementation. There are currently several implementations in TF, but I use:
cell = tf.contrib.rnn.BasicLSTMCell(n_units)
where n_units is the amount of 'parallel' LSTM-Cells.
Then to get my output I call:
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(cell, x,
initial_state=initial_state, time_major=False)
where (as time_major=False) x is of shape (batch_size, time_steps, input_length)
where batch_size is my batch_size
where time_steps is the amount of timesteps my RNN will go through
where input_length is the length of one of my input vectors (vector fed into the network on one specific timestep on one specific batch)
I expect rnn_outputs to be of shape (batch_size, time_steps, n_units, input_length) as I have not specified another output size.
Documentation of nn.dynamic_rnn tells me that output is of shape (batch_size, input_length, cell.output_size).
The documentation of tf.contrib.rnn.BasicLSTMCell does have a property output_size, which is defaulted to n_units (the amount of LSTM-cells I use).
So does each LSTM cell only output a scalar for every given timestep? I would expect it to output a vector of the length of the input vector. That seems not to be the case from how I understand it right now, so I am confused. Can you tell me whether that is the case, or how I could change it to output a vector of the size of the input vector per single LSTM cell?
I think the primary confusion is about the terminology of the LSTM cell's argument num_units. Unfortunately it does not mean, as the name might suggest, "the number of LSTM cells" (which would then have to equal your number of time steps). It actually corresponds to the number of dimensions of the hidden state (the cell state plus the hidden state vector).
The call to dynamic_rnn() returns a tensor of shape [batch_size, time_steps, output_size], where
(please note this) output_size = num_units, if num_proj = None in the LSTM cell;
whereas output_size = num_proj, if it is defined.
Now, typically, you would extract the last time step's result and project it to the size of your output dimensions using a mat-mul + biases operation manually, or use the num_proj argument of the LSTM cell (note that num_proj is available on tf.contrib.rnn.LSTMCell, not on BasicLSTMCell). A minimal sketch follows below.
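A minimal TF 1.x-style sketch of those shapes (all sizes are hypothetical; the manual projection uses tf.layers.dense as one possible mat-mul + biases):

import tensorflow as tf

batch_size, time_steps, input_length, n_units = 4, 7, 3, 16
x = tf.placeholder(tf.float32, [batch_size, time_steps, input_length])

cell = tf.contrib.rnn.BasicLSTMCell(n_units)
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32, time_major=False)
print(rnn_outputs.shape)  # (4, 7, 16) = (batch_size, time_steps, num_units)

# extract the last time step and project it to the desired output size manually
last = rnn_outputs[:, -1, :]                     # (batch_size, num_units)
projected = tf.layers.dense(last, input_length)  # (batch_size, input_length)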
I have been through the same confusion and had to look really deep to get it cleared. Hope this answer clears some of it.