I am doing multiclass classification using an LSTM model.
One sample is 20 frames of data, each frame has 64 infrared
signals, so each 20 × 64 dimension matrix signal is converted into a 1 × 1280 dimension vector (one sample).
There are 1280 nodes in the input layer of LSTM.
Then I need to build the following LSTM model:
the number of nodes in the hidden layer is 640 and each hidden
layer node is connected to a full connection layer with 100 backward nodes, and there is a
ReLU activation layer behind the full connection layer. Finally, the softmax activation
function is used to normalize the data to obtain the output. Additionally, the timesteps of
LSTM are set to 16.
Here is my attempt to build this architecture according to the instructions above:
embedding_vecor_length = 16
model_1 = Sequential()
model_1.add(Embedding(len(X_train), embedding_vecor_length, input_length=1280))
model_1.add(LSTM(640))
model_1.add(Dense(100, activation='relu'))
model_1.add(Dense(4, activation='softmax'))
model_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_1.summary())
model_1.fit(X_train, y_train, epochs=10, batch_size=10)
I am very confused by the hidden layer of the LSTM and the fully connected layer. According to these instructions, should my fully connected layer be inside the LSTM block? And what does "backward nodes" mean? Also, where do we indicate the timesteps of the LSTM? Could somebody explain please? Thank you!
The LSTM is itself a fully connected layer, which just happens to maintain a state between calls.
I'm not sure what the term backwards really means, but from the structure you show it simply looks like a non-linear transformation of the LSTM output.
The timesteps of the LSTM are generally indicated in the shape of the input. I may be wrong, but I think in your case you actually don't want to reshape your input the way you do (by flattening the frames into one 1280-value vector). You have 20 frames with 64 signals each, so really you have 20 timesteps with an input of size 64 at each step.
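A minimal sketch of that reshaping in plain Python. It assumes each 1 × 1280 sample was flattened frame by frame (row-major order), which the question does not state; `unflatten` and the dummy sample are hypothetical names used only for illustration.

```python
FRAMES, SIGNALS = 20, 64  # 20 timesteps, 64 infrared signals per timestep


def unflatten(sample, frames=FRAMES, signals=SIGNALS):
    """Split a flat sample back into `frames` rows of `signals` values each."""
    if len(sample) != frames * signals:
        raise ValueError("expected %d values, got %d" % (frames * signals, len(sample)))
    return [sample[t * signals:(t + 1) * signals] for t in range(frames)]


# Dummy sample (assumption): value i sits at position i, so the layout is easy to inspect.
flat_sample = list(range(FRAMES * SIGNALS))
sequence = unflatten(flat_sample)

print(len(sequence), len(sequence[0]))  # number of timesteps, features per timestep
```

With NumPy arrays the same step would be `X.reshape(-1, 20, 64)`. A Keras LSTM layer expects its input as (batch, timesteps, features), so the first layer would then see an input shape of (20, 64) instead of 1280.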
I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolutional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!
The softmax output layer combined with categorical cross-entropy, the loss function of this network, makes it usable for sentiment analysis, although neither is specific to that task. The softmax layer returns probabilities for the different classes, and the cross-entropy loss measures how far those probabilities are from the true labels. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure if you are using a tokenizer since it is not apparent from the code you provided, but if you are, then yes, it is a bag-of-words model. A bag-of-words model essentially stores how often each word occurs in your text, ignoring word order.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
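The counts above can be reproduced with a few lines of standard-library Python. This is only a sketch: `bag_of_words` is a hypothetical helper, not the tokenizer used in the question, and it splits on letters only and keeps capitalization.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring punctuation and word order."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)


text = "John likes to watch movies. Mary likes movies too."
bow = bag_of_words(text)
print(dict(bow))
```

Each document then becomes a fixed-length vector with one position per vocabulary word, which is consistent with the `input_shape=(max_words,)` of the first layer in the question.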
The network architecture you are using is not convolutional; it is a feedforward model, which connects all units from one layer to all the units in the next, so each unit computes a dot product between the previous layer's outputs and its own weights.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense refers to the fact that all units in the first layer (512) are connected to every unit in the next layer (256), i.e., a total of 512x256 connections between the first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
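As a rough check of these matrix sizes, the parameter count of each Dense layer can be computed by hand: one weight per connection plus one bias per unit. The sketch below assumes `max_words = 10000`, which the question does not state, and `dense_params` is a hypothetical helper. Dropout layers are left out because they have no trainable parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs x n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


max_words = 10000  # assumed value; the question does not state it

layer_sizes = [max_words, 512, 256, 2]
pairs = list(zip(layer_sizes, layer_sizes[1:]))
counts = [dense_params(n_in, n_out) for n_in, n_out in pairs]

for (n_in, n_out), n in zip(pairs, counts):
    print("%d -> %d: %d parameters" % (n_in, n_out, n))
print("total:", sum(counts))
```

These are the same numbers `model.summary()` lists in its "Param #" column for the Dense layers.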
The code contains three Dense layers: two hidden layers with 512 and 256 units, and an output layer with 2 units (one per class, i.e., 0/1). The input itself has max_words features. So if you exclude the input and the output layer, your network has two hidden layers; the Dropout layers only regularize training and add no units or parameters.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor telling the network whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions being used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, the softmax on the output layer has the nice property that its outputs can be interpreted as probabilities: they lie between 0 and 1 and sum to 1. So if the network outputs 0.89 for the positive class of a data point, it means that the model evaluates that data point to be positive with a probability of 0.89.
The usage of sigmoid in the second hidden layer is probably for teaching purposes, since ReLU activation units are generally favored over sigmoid/tanh because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU there.
I have just started to learn CNN on Tensorflow. However, when I train the model the Loss and accuracy don't change.
I am using images of size 128x128x3 and the images are normalized (in [0,1]). And here is the compile call that I am using.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.000001), loss='binary_crossentropy', metrics=['accuracy'])
And here is the summary of my model
I tried the following things but I always have the same values:
Change the learning rate from 0.00000001 to 10
Change the convolution kernel I tried 5x5 and 3x3
I added another fully connected layer and a Conv layer.
update
The layers' weights didn't change after fitting the model. I have the same initial weights.
You could try this,
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
Also, remove the softmax activation from the last layer; a binary classification problem with a single output unit does not need a softmax. Softmax normalizes across the units of a layer, so with only one unit it always outputs exactly 1, whatever the input. The output then never changes, no gradient flows back through it, and the network doesn't train.
Alternatively, you can use a sigmoid activation at the final node. That squashes the output to a value in the range of 0 to 1, so the network weights won't blow up because of a very high loss. In that case the loss should be BinaryCrossentropy with from_logits=False (the default), since the layer already outputs probabilities.
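The softmax behaviour on a single unit is easy to verify without any framework. The sketch below uses plain Python; `softmax`, `sigmoid` and the sample logits are illustrative names and values, not taken from the question's model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function for a single float of moderate size."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [-5.0, -0.3, 0.0, 2.0, 5.0]  # arbitrary pre-activation values

# Softmax over a layer with ONE unit: always 1.0, regardless of the logit.
softmax_single = [softmax([z])[0] for z in logits]
# Sigmoid on the same unit: varies with the logit, so it can carry a signal.
sigmoid_single = [sigmoid(z) for z in logits]

print(softmax_single)
print(sigmoid_single)
```

Because the single-unit softmax is constant, its derivative with respect to the logit is zero, which matches the observation that the weights do not change after fitting.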
I am new to Keras and starting with this code from the TensorFlow tutorial:
# choosing the layers of my model
model = keras.Sequential([  # the Sequential model of the Keras library
    keras.layers.Flatten(input_shape=(28, 28)),  # the input layer
    keras.layers.Dense(128, activation='relu'),  # the hidden layer
    keras.layers.Dense(10)  # output layer; 10 corresponds to the number of classes
])
I wonder what the value 128 is and how it was calculated.
It's not calculated, it's a hyperparameter (a parameter that isn't estimated from the data, but selected by you prior to running the model). It essentially determines the complexity of the model. The more neurons, the more complex relationships it can model in the data.
128 is a hyper parameter which is the number of nodes in your second to last layer.
It isn't calculated, you can change it to whatever you want, try [18,32,64...etc]. The larger you make it the slower your training will be; however your model might be more accurate since there are more nodes to capture the signal of your dataset.
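To make the cost of a wider hidden layer concrete, here is a small sketch that counts the trainable parameters of the tutorial's Flatten -> Dense(hidden) -> Dense(10) model for a few hidden sizes. `mlp_params` is a hypothetical helper, and the list of sizes is arbitrary.

```python
def mlp_params(hidden_units, n_inputs=28 * 28, n_classes=10):
    """Trainable parameters of Flatten -> Dense(hidden_units) -> Dense(n_classes)."""
    hidden = n_inputs * hidden_units + hidden_units   # weights + biases of the hidden layer
    output = hidden_units * n_classes + n_classes     # weights + biases of the output layer
    return hidden + output


for units in (16, 32, 64, 128, 256):
    print(units, mlp_params(units))
```

The count grows roughly linearly with the hidden size, which is why training gets slower as the layer gets wider. Whether the extra capacity improves accuracy depends on the data, so the value is usually chosen by trying a few candidates on a validation set.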
I'm using Keras with the TensorFlow backend to extract features from images with a pre-trained model (VGG16 on ImageNet). From what I can read online, I should get for each image a vector with 4096 features.
I'm using this line to import the model without the last fully connected layer (as I believe I'm supposed to):
applications.vgg16.VGG16(weights='imagenet', include_top=False, pooling='avg')
However, the vector I get in the end only has 512 features. Considering VGG16's architecture:
It looks like I'm actually getting the results from the last max pooling layer (which would be consistent with the Keras documentation).
So am I supposed to get 512 or 4096 features?
According to the Keras documentation, when you set include_top=False the model omits the last 3 fully connected (FC) layers, so a 512-feature vector is what you should expect: pooling='avg' averages the final 512-channel feature map over its spatial dimensions. If you wish to include the last 3 FC layers, set include_top=True. Then you would get a 1000-value prediction (from the softmax layer at the end); the 4096 features you read about are the outputs of the two FC layers that precede that softmax layer.
Try executing:
vggmodel = keras.applications.vgg16.VGG16(weights='imagenet', include_top=False, pooling='avg')
vggmodel.summary()
and
vggmodel = keras.applications.vgg16.VGG16(weights='imagenet', include_top=True)  # pooling only applies when include_top=False
vggmodel.summary()
to get a more comprehensive understanding.
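The 512 can also be derived by hand from the architecture. The sketch below follows the spatial size through VGG16's five convolutional blocks, assuming the default 224x224 input; `vgg16_feature_shapes` is a hypothetical helper written for this illustration, not a Keras function.

```python
def vgg16_feature_shapes(height=224, width=224):
    """Return the (h, w, channels) shape after each of VGG16's five blocks."""
    channels_per_block = [64, 128, 256, 512, 512]
    h, w = height, width
    shapes = []
    for channels in channels_per_block:
        # The 3x3 convolutions use 'same' padding, so only the 2x2 max pool
        # at the end of each block changes the spatial size (it halves it).
        h, w = h // 2, w // 2
        shapes.append((h, w, channels))
    return shapes


shapes = vgg16_feature_shapes()
last_h, last_w, last_c = shapes[-1]

flattened = last_h * last_w * last_c   # size fed to the first FC layer when include_top=True
avg_pooled = last_c                    # size after global average pooling (pooling='avg')

print(shapes[-1], flattened, avg_pooled)
```

So with include_top=False and pooling='avg' each image yields one value per channel of the last block, i.e. 512 features, while the 4096-dimensional vectors only exist inside the fully connected top.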
I am trying to build a Stacked Autoencoder in Keras (tf.keras). By stacked I do not mean deep. All the examples I found for Keras generate e.g. 3 encoder layers and 3 decoder layers, train them together, and call it a day. However, it seems the correct way to train a Stacked Autoencoder (SAE) is the one described in this paper: Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
In short, a SAE should be trained layer-wise as shown in the image below. After layer 1 is trained, it's used as input to train layer 2. The reconstruction loss should be compared with the layer 1 and not the input layer.
And here is where my trouble begins. How to tell Keras which layers to use the loss function on?
Here is what I do. Since the Autoencoder module no longer exists in Keras, I build the first autoencoder, and I set its encoder's weights (trainable = False) in the 1st layer of a second autoencoder with 2 layers in total. Then when I train that, it obviously compares the reconstructed layer out_s2 with the input layer in_s, instead of the layer 1 hid1.
# autoencoder layer 1
in_s = tf.keras.Input(shape=(input_size,))
noise = tf.keras.layers.Dropout(0.1)(in_s)
hid = tf.keras.layers.Dense(nodes[0], activation='relu')(noise)
out_s = tf.keras.layers.Dense(input_size, activation='sigmoid')(hid)
ae_1 = tf.keras.Model(in_s, out_s, name="ae_1")
ae_1.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
# autoencoder layer 2
hid1 = tf.keras.layers.Dense(nodes[0], activation='relu')(in_s)
noise = tf.keras.layers.Dropout(0.1)(hid1)
hid2 = tf.keras.layers.Dense(nodes[1], activation='relu')(noise)
out_s2 = tf.keras.layers.Dense(nodes[0], activation='sigmoid')(hid2)
ae_2 = tf.keras.Model(in_s, out_s2, name="ae_2")
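# layers[0] is the InputLayer in both models; the Dense encoder is ae_1.layers[2] and ae_2.layers[1]
ae_2.layers[1].set_weights(ae_1.layers[2].get_weights())
ae_2.layers[1].trainable = False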
ae_2.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
The solution should be fairly easy, but I can't see it nor find it online. How do I do that in Keras?
It seems like the question is outdated by looking at the comments. But I'll still answer this as the use-case mentioned in this question is not just specific to autoencoders and might be helpful for some other cases.
So, when you say "train the whole network layer by layer", I would rather interpret it as "train small networks with one single layer in a sequence".
Looking at the code posted in this question, it seems that the OP has already built small networks. But both these networks do not consist of one single layer.
The second autoencoder here takes as input the input of the first autoencoder. But it should actually take as input the encoded output of the first autoencoder, i.e., the activations of its hidden layer.
So then, you train the first autoencoder and collect its hidden-layer predictions after it is trained. Then you train the second autoencoder, which takes these predictions of the first autoencoder as input.
Now let's focus on this part: "After layer 1 is trained, it's used as input to train layer 2. The reconstruction loss should be compared with the layer 1 and not the input layer."
Since the network takes as input the output of layer 1 (autoencoder 1 in OP's case), it will be comparing its output with this. The task is achieved.
But to achieve this, you will need to write the model.fit(...) line which is missing in the code provided in the question.
Also, just in case you want the model to calculate the loss on the input layer, you simply replace the y parameter in model.fit(...) with the input of autoencoder 1.
In short, you just need to decouple these autoencoders into tiny networks with one single layer and then train them as you wish. No need to use trainable = False now, or else use it as you wish.
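The decoupled training described above might look like the following sketch. The sizes and the random data are placeholders, not values from the question. One deliberate change from the question's code is labeled here as an assumption: the second autoencoder uses a linear output with an MSE loss, because the ReLU codes it reconstructs are not confined to [0, 1] as binary cross-entropy would expect.

```python
import numpy as np
import tensorflow as tf

input_size, nodes = 32, [16, 8]                          # hypothetical sizes
x = np.random.rand(256, input_size).astype("float32")    # placeholder data in [0, 1]

# Autoencoder 1: input -> hid_1 -> reconstruction of the input.
in_1 = tf.keras.Input(shape=(input_size,))
noise_1 = tf.keras.layers.Dropout(0.1)(in_1)
hid_1 = tf.keras.layers.Dense(nodes[0], activation="relu")(noise_1)
out_1 = tf.keras.layers.Dense(input_size, activation="sigmoid")(hid_1)
ae_1 = tf.keras.Model(in_1, out_1, name="ae_1")
ae_1.compile(optimizer="nadam", loss="binary_crossentropy")
ae_1.fit(x, x, epochs=2, batch_size=32, verbose=0)

# The encoder shares the trained layers of ae_1. Dropout is inactive in
# predict(), so the codes are computed without noise.
encoder_1 = tf.keras.Model(in_1, hid_1, name="encoder_1")
codes_1 = encoder_1.predict(x, verbose=0)

# Autoencoder 2: hid_1 codes -> hid_2 -> reconstruction of the hid_1 codes.
# Both x and y of fit() are codes_1, so the loss is measured against layer 1.
in_2 = tf.keras.Input(shape=(nodes[0],))
noise_2 = tf.keras.layers.Dropout(0.1)(in_2)
hid_2 = tf.keras.layers.Dense(nodes[1], activation="relu")(noise_2)
out_2 = tf.keras.layers.Dense(nodes[0], activation="linear")(hid_2)
ae_2 = tf.keras.Model(in_2, out_2, name="ae_2")
ae_2.compile(optimizer="nadam", loss="mse")
ae_2.fit(codes_1, codes_1, epochs=2, batch_size=32, verbose=0)

recon_codes = ae_2.predict(codes_1, verbose=0)
print(codes_1.shape, recon_codes.shape)
```

Because the second model never sees the first model's layers, there is nothing to freeze; the trained encoders can be chained into one network afterwards if a fine-tuning stage is wanted.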