Replicated Convolutional Bi-Directional LSTM implementation in Keras diverging

This is the model I am trying to replicate (more information in linked paper):
In our models, we adopted one dropout layer between LSTM models and the first fully-connected layer and another dropout layer between the first fully-connected layer and the second fully-connected layer. Their masking probabilities are both set to 0.5.
...
For our proposed CBLSTM, one-layer CNN is firstly designed, whose filter number, filter size and pooling size are set to 150, 10 and 5. Therefore, the shape of the raw sensory sequence is changed from 100 x 12 to 19 x 150 after CNN. Then, a two-layer bi-directional LSTM is built on top of the CNN.
Backward and forward LSTMs share the same layer sizes as [150, 200]. Therefore, the output of the LSTM module is the concatenated vector of the representations learned by backward and forward LSTMs, and its dimensionality is 400. Then, before feeding the representation into the linear regression layer, two fully-connected layers with a size of [500, 600] are adopted. The nonlinearity activation functions in our proposed CBLSTM are all set to ReLu.
Source: Zhao, R., Yan, R., Wang, J., & Mao, K. (2017). Learning to monitor machine health with convolutional bi-directional LSTM networks. Sensors, 17(2), 273.
The input is 630 samples x 100 timesteps x 12 features.
How my model looks at the moment:
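from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Bidirectional, LSTM, Dropout, Dense
model = Sequential()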
model.add(Conv1D(filters=150, kernel_size=10, activation='relu', input_shape=(100,12)))
model.add(MaxPooling1D(pool_size=5, strides=None, padding='valid'))
model.add(Bidirectional(LSTM(150, return_sequences=True), merge_mode='concat'))
model.add(Bidirectional(LSTM(200, return_sequences=False), merge_mode='concat'))
model.add(Dropout(0.5))
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(600, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['mae'])
While the training loss decreases steadily with each epoch, the validation loss does not, and it diverges quite quickly. This suggests that there is a mistake in my model which I have not yet been able to find. Any ideas as to what is wrong?
Side note: I am using the same data as input as the authors.
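One thing that can be checked without training anything is whether the layer shapes match the quoted description. The sketch below is plain Python arithmetic using the usual output-length rules for 'valid' and 'same' padding; the helper name output_length is made up for illustration. It suggests that the Conv1D and MaxPooling1D settings in the question produce 18 time steps rather than the 19 quoted from the paper. Whether that difference is related to the divergence is not something this check can establish.

```python
import math

def output_length(length, window, stride, padding):
    """Output length of a 1D convolution or pooling layer (hypothetical helper)."""
    if padding == "valid":
        return (length - window) // stride + 1
    if padding == "same":
        return math.ceil(length / stride)
    raise ValueError(f"unknown padding: {padding}")

# Conv1D(kernel_size=10) on 100 time steps, default stride 1 and 'valid' padding.
after_conv = output_length(100, window=10, stride=1, padding="valid")

# MaxPooling1D(pool_size=5): strides=None means the stride equals the pool size.
after_pool_valid = output_length(after_conv, window=5, stride=5, padding="valid")
after_pool_same = output_length(after_conv, window=5, stride=5, padding="same")

print(after_conv, after_pool_valid, after_pool_same)  # prints: 91 18 19
```

Under these rules, padding='same' on the pooling layer is one way to arrive at 19 steps. The quoted passage does not say which padding was used, so treat this as a guess rather than the authors' setting.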

Related

Feature Normalization/Standard Scalar in Keras

I am working with a Sequential Keras model and I am trying to figure out the best method for feature scaling.
model = Sequential()
model.add(Masking(mask_value=-50, input_shape=(None,10)))
model.add(LayerNormalization(axis=-1))
model.add(LSTM(100, input_shape=(None,10)))
model.add(Dense(100, activation='relu'))
model.add(Dense(3, activation='softmax'))
print(model.summary())
In line 3, I have a LayerNormalization layer which, according to the documentation, normalizes using the mean and standard deviation. However, I have also come across batch normalization and tf.keras.layers.experimental.preprocessing.Normalization. My question is: is this method similar to scikit-learn's StandardScaler(), or is there another method I could use to scale features within the model?
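To make the difference concrete, here is a small sketch in plain Python (no Keras; the function names are made up for illustration). Scaling the way scikit-learn's StandardScaler does uses one mean and standard deviation per feature, computed across samples. Layer normalization over the last axis uses one mean and standard deviation per sample, computed across that sample's own features. The sketch ignores the epsilon term and the learned scale and offset that the Keras layer has.

```python
from statistics import fmean, pstdev

def standardize_per_feature(rows):
    """Column-wise scaling: each feature gets mean 0 and std 1 across samples."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def normalize_per_sample(rows):
    """Row-wise scaling: each sample gets mean 0 and std 1 across its own features."""
    return [[(x - fmean(row)) / pstdev(row) for x in row] for row in rows]

# Three samples with two features on very different scales (made-up numbers).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]

per_feature = standardize_per_feature(data)
per_sample = normalize_per_sample(data)

print([round(fmean(col), 6) for col in zip(*per_feature)])  # mean of each column
print(per_sample)                                            # each row on its own scale
```

With only two features per row, every row collapses to roughly [-1, 1] under per-sample normalization, so the two approaches clearly do not do the same job. As far as I know, the Normalization preprocessing layer is the one that computes per-feature statistics (through its adapt() method), which makes it the closer match to StandardScaler.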

This should work. It uses an UpSampling layer for a naive 5x5 image-based input:
# define model
model = Sequential()
# define input shape, output enough activations for 128 5x5 feature maps
model.add(Dense(128 * 5 * 5, input_dim=100))
# reshape vector of activations into 128 feature maps with 5x5
model.add(Reshape((5, 5, 128)))
# double each of the 128 feature maps from 5x5 to 10x10
model.add(UpSampling2D())
# fill in detail in the upsampled feature maps and output a single image
model.add(Conv2D(1, (3,3), padding='same'))
# summarize model
model.summary()
But you can use the Conv2DTranspose layer too, which combines the UpSampling2D and Conv2D layers into one layer.
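For intuition about what the upsampling step does to a feature map, here is a plain-Python sketch of nearest-neighbour 2x upsampling, which is how UpSampling2D behaves with its default settings. The helper name is made up for illustration and it handles a single 2D map only.

```python
def upsample_nearest_2x(feature_map):
    """Repeat every row and every column twice (nearest-neighbour upsampling)."""
    upsampled = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(2)]
        upsampled.append(wide_row)
        upsampled.append(list(wide_row))
    return upsampled

small = [[1, 2],
         [3, 4]]
big = upsample_nearest_2x(small)
for row in big:
    print(row)
```

Applied to each of the 128 maps in the model above, a 5x5 input becomes 10x10. The number of maps stays at 128 until the Conv2D layer reduces it to 1.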
A TimeDistributed layer in the case of LSTMs will help. Refer

Is this a valid seq2seq lstm model?

Hello, I am trying to build a seq2seq model to generate some music.
I really don't know much about it, though.
On the internet I have found this model:
def createSeq2Seq():
    # seq2seq model
    # encoder
    model = Sequential()
    model.add(LSTM(input_shape=(None, input_dim), units=num_units, activation='tanh', return_sequences=True))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))
    model.add(LSTM(num_units, activation='tanh'))
    # decoder
    model.add(RepeatVector(y_seq_length))
    num_layers = 2
    for _ in range(num_layers):
        model.add(LSTM(num_units, activation='tanh', return_sequences=True))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))
    model.add(TimeDistributed(Dense(output_dim, activation='softmax')))
    return model
My data is a list of piano rolls. A piano roll is a matrix in which each column is a one-hot encoding over the possible pitches (49 in my case) and represents one time step (0.02 s in my case). The piano roll matrix therefore contains only ones and zeros.
I have prepared my training data by reshaping my piano roll songs (putting them all one after the other) into shape = (something, batchsize, 49). So my input data is all the songs one after the other, separated into blocks whose size is the batch size. My training data is then the same input, delayed by one batch.
The x_seq_length and y_seq_length are equal to the batch_size. Input_dim = 49
My input and output sequences have the same dimension.
Have I made any mistake in my reasoning? Is the seq2seq model I've found correct? What does RepeatVector do?

This is not a seq2seq model. RepeatVector takes the last state of the last encoder LSTM and makes one copy per output token. Then you feed these copies into a "decoder" LSTM, which thus has the same input in every time step.
A proper autoregressive decoder takes its previous outputs as input, i.e., at training time, the input of the decoder is the same as its output, but shifted by one position. This also means that your model misses the embedding layer for the decoder inputs.
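The two kinds of decoder input described above can be contrasted with a small sketch. This is plain Python rather than Keras; the function names, the start token, and the note names are all made up for illustration.

```python
def repeat_vector(vector, n):
    """Mimic what RepeatVector does for one sample: n identical copies."""
    return [list(vector) for _ in range(n)]

def teacher_forcing_inputs(targets, start_token):
    """Decoder inputs for training: the targets shifted right by one step."""
    return [start_token] + targets[:-1]

encoder_state = [0.2, -0.7, 0.5]     # made-up final encoder state
decoder_in_repeat = repeat_vector(encoder_state, 4)

targets = ["C4", "E4", "G4", "C5"]   # made-up output sequence
decoder_in_shifted = teacher_forcing_inputs(targets, start_token="<start>")

print(decoder_in_repeat)    # the same vector at every time step
print(decoder_in_shifted)   # a different input at every time step
```

In the first case the decoder sees no information about what it has already produced; in the second, each step is conditioned on the previous target.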

Understanding of Basic Neural Network Structure

Let's say I want to code this basic neural network structure in Keras, which has 10 units in the input layer and 3 units in the output layer.
Now, if I give Keras an input_shape larger than 10, how will the model adjust to it?
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(10, activation = 'relu', input_shape = (64,)))
model.add(Dense(3, activation = 'sigmoid'))
model.summary()
You see, here input_shape has size 64, but how does that fit a model whose first layer has 10 units? From what I have learned, the size of the input vector should equal the number of units in the input layer.
Or am I not implementing this neural network correctly?

That would not be a problem. The first Dense layer uses a weight matrix of shape (64, 10): your input has 64 features, the first hidden layer has 10 units, and the output layer has 3 units. Seems fine to me.
But your input layer itself has size 64. So what you are getting is a 3-layer network with a hidden layer of 10 units.

If the shape of your input vector is 64, then you really need to have an input layer of size 64. The input layer of a neural network doesn't perform any computations. It just passes the inputs forward to the first hidden layer. The hidden layer, on the other hand, performs the computations for all neurons contained in it (a linear combination of the input vector and the weights, which is then passed to the activation function, ReLU in your case).
In your code, you are building a neural net with 64 input neurons (which again don't perform any computations), 10 neurons in the first (and only) hidden layer and 3 neurons in the output layer.
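The shapes involved can be traced with a minimal forward pass in plain Python. This is only a sketch: the weights are random, the helper names are made up, and the output layer uses the identity instead of the sigmoid from the question to keep it short.

```python
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights has shape (n_inputs, n_units)."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(activation(total))
    return outputs

def relu(z):
    return max(0.0, z)

def random_layer(n_inputs, n_units, rng):
    """Random weights of shape (n_inputs, n_units) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_units)] for _ in range(n_inputs)]
    return weights, [0.0] * n_units

rng = random.Random(0)
x = [rng.random() for _ in range(64)]        # one input vector with 64 features
w1, b1 = random_layer(64, 10, rng)           # 64 inputs -> 10 hidden units
w2, b2 = random_layer(10, 3, rng)            # 10 hidden units -> 3 outputs

hidden = dense(x, w1, b1, relu)
output = dense(hidden, w2, b2, lambda z: z)  # identity here, sigmoid in the question

print(len(x), len(hidden), len(output))      # prints: 64 10 3
```

The 64 input values are never "units" with weights of their own; they are simply the vector that the first Dense layer multiplies by its 64-by-10 weight matrix.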

How many parameters are being optimised over in a simple CNN?

Okay so here's my CNN (simple example from a tutorial) along with some arithmetic to get the total number of free parameters.
We've got a dataset of 28x28 grayscale images (MNIST).
First layer is a 2D convolution using 32 3x3 kernels. Dimensionality of the output is 26x26x32 (kernel stride length was 1 and we have 32 feature maps of 26x26). Running parameter count: 288
Second layer is a 2x2 MaxPool with a 2x2 stride. Dimensionality of the output is 13x13x32, but then we flatten, so we get a vector of length 5408. No extra parameters here.
Third layer is Dense. A 5408x100 matrix. Dimensionality of the output is 100. Running parameter count: 541088
Fourth layer is Dense also. A 100x10 matrix. Dimensionality of the output is 10. Running parameter count: 542088
Then we're supposed to do stochastic gradient descent on a 542088-parameter space!
That feels like a ridiculously big number to me. And this is meant to be the hello world problem of CNNs. Am I missing something fundamental in my understanding of how this is meant to work? Or maybe the number is correct but it's not actually a big deal for a computer to crunch?
In case it helps. Here is how the model was built in Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import SGD

def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    opt = SGD(learning_rate=0.01, momentum=0.9)
    # the keyword is 'metrics', not 'metric'
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
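The tally in the question counts weights only. A short sketch that also counts the bias terms (one per convolution filter and one per Dense unit) is given below; the helper names are made up for illustration. The total should match what model.summary() reports for this model, although the Keras model was not run to produce these numbers.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

conv = conv2d_params(3, 3, in_channels=1, filters=32)  # 28x28x1 -> 26x26x32
flat = 13 * 13 * 32                                     # after 2x2 pooling and Flatten
dense1 = dense_params(flat, 100)
dense2 = dense_params(100, 10)
total = conv + dense1 + dense2

print(conv, flat, dense1, dense2, total)  # prints: 320 5408 540900 1010 542230
```

Nearly all of the parameters sit in the first Dense layer, because it connects every one of the 5408 flattened values to each of its 100 units.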

Input nodes in Keras NN

I am trying to create a neural network based on the iris dataset. I have an input of four dimensions: X = dataset[:,0:4].astype(float). Then, I create a neural network with four nodes.
model = Sequential()
model.add(Dense(4, input_dim=4, init='normal', activation='relu'))
model.add(Dense(3, init='normal', activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
As I understand, I pass each dimension to the separate node. Four dimensions - four nodes. When I create a neural network with 8 input nodes, how does it work? Performance still is the same as with 4 nodes.
model = Sequential()
model.add(Dense(8, input_dim=4, init='normal', activation='relu'))
model.add(Dense(3, init='normal', activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

You have an error on your last activation. Use softmax instead of sigmoid and run again.
replace
model.add(Dense(3, init='normal', activation='sigmoid'))
with
model.add(Dense(3, init='normal', activation='softmax'))
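A quick way to see why softmax suits a three-class output paired with categorical cross-entropy is to compare what the two activations return for the same raw scores. This is a plain-Python sketch with made-up scores, not Keras code.

```python
import math

def sigmoid(logits):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Turn the scores into a probability distribution over the classes."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up raw scores for the three classes

sigmoid_sum = sum(sigmoid(logits))
softmax_sum = sum(softmax(logits))
print(sigmoid_sum, softmax_sum)
```

The softmax outputs sum to 1, so they can be read as class probabilities, whereas the sigmoid outputs are independent of each other and here sum to more than 1.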

To answer your main question of "How does this work?":
From a conceptual standpoint, you are initially creating a fully-connected, or Dense, neural network with 3 layers: an input layer with 4 nodes, a hidden layer with 4 nodes, and an output layer with 3 nodes. Each node in the input layer has a connection to every node in the hidden layer, and same with the hidden to the output layer.
In your second example, you just increased the number of nodes in the hidden layer from 4 to 8. A larger network can be good, as it can be trained to "look" for more things in your data. But too large of a layer and you may overfit; this means the network remembers too much of the training data, when it really just needs a general idea of the training data so it can still recognize slightly different data, which is your testing data.
The reason you may not have seen an increase in performance is likely either overfitting or your activation function; try a function other than relu in your hidden layer. If you don't see any improvement after trying a few different function combinations, you are likely overfitting.
Hope this helps.
