Train Stacked Autoencoder Correctly

I am trying to build a Stacked Autoencoder in Keras (tf.keras). By stacked I do not mean deep. All the Keras examples I found build, e.g., 3 encoder layers and 3 decoder layers, train them together, and call it a day. However, it seems the correct way to train a Stacked Autoencoder (SAE) is the one described in this paper: Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
In short, an SAE should be trained layer-wise. After layer 1 is trained, it is used as input to train layer 2. The reconstruction loss should be computed against layer 1 and not the input layer.
And here is where my trouble begins. How do I tell Keras which layers to compute the loss on?
Here is what I do. Since the Autoencoder module no longer exists in Keras, I build the first autoencoder, and I copy its encoder's weights (with trainable = False) into the 1st layer of a second autoencoder that has 2 layers in total. Then, when I train that, it obviously compares the reconstructed layer out_s2 with the input layer in_s, instead of with layer 1, hid1.
# autoencoder layer 1
in_s = tf.keras.Input(shape=(input_size,))
noise = tf.keras.layers.Dropout(0.1)(in_s)
hid = tf.keras.layers.Dense(nodes[0], activation='relu')(noise)
out_s = tf.keras.layers.Dense(input_size, activation='sigmoid')(hid)
ae_1 = tf.keras.Model(in_s, out_s, name="ae_1")
ae_1.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
# autoencoder layer 2
hid1 = tf.keras.layers.Dense(nodes[0], activation='relu')(in_s)
noise = tf.keras.layers.Dropout(0.1)(hid1)
hid2 = tf.keras.layers.Dense(nodes[1], activation='relu')(noise)
out_s2 = tf.keras.layers.Dense(nodes[0], activation='sigmoid')(hid2)
ae_2 = tf.keras.Model(in_s, out_s2, name="ae_2")
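# layers[0] is the InputLayer in both models; the encoder Dense is ae_1.layers[2] and ae_2.layers[1]
ae_2.layers[1].set_weights(ae_1.layers[2].get_weights())
ae_2.layers[1].trainable = False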
ae_2.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
The solution should be fairly easy, but I can't see it nor find it online. How do I do that in Keras?

Judging by the comments, the question seems to be outdated. But I'll still answer it, as the use case mentioned here is not specific to autoencoders and might be helpful in other cases.
So, when you say "train the whole network layer by layer", I would rather interpret it as "train small networks with one single layer each, in sequence".
Looking at the code posted in this question, it seems that the OP has already built small networks. But neither of these networks consists of a single layer.
The second autoencoder here takes as input the input of the first autoencoder. But it should actually take as input the output of the first autoencoder.
So you train the first autoencoder and collect its predictions after it is trained. Then you train the second autoencoder, which takes as input the output (predictions) of the first autoencoder.
Now let's focus on this part: "After layer 1 is trained, it's used as input to train layer 2. The reconstruction loss should be compared with the layer 1 and not the input layer."
Since the network takes as input the output of layer 1 (autoencoder 1 in the OP's case), it will compare its own output with this. The task is achieved.
But to achieve this, you will need to write the model.fit(...) line, which is missing from the code provided in the question.
Also, in case you want the model to calculate the loss against the input layer, simply set the y parameter in model.fit(...) to the input of autoencoder 1.
In short, you just need to decouple these autoencoders into tiny networks with one single layer each and then train them as you wish. There is no longer any need for trainable = False, though you can still use it if you wish.
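A minimal sketch of the data flow this implies, assuming that "the output of the first autoencoder" means its hidden-layer encoding (what the question calls layer 1). The names `train_stack`, `fit_autoencoder`, and the toy `fit_truncating_autoencoder` are hypothetical and are not Keras APIs. In Keras, the fit step would be a small one-hidden-layer model trained with `model.fit(x, x)`, and the encode step would be an encoder sub-model's `predict(x)`.

```python
def train_stack(data, layer_sizes, fit_autoencoder):
    """Greedy layer-wise training: each autoencoder is fit on the codes
    produced by the encoder before it.

    fit_autoencoder(inputs, targets, size) is a hypothetical callback that
    trains one single-hidden-layer autoencoder and returns its encode function.
    """
    encoders = []
    current = data
    for size in layer_sizes:
        # Input and target are the same object, so the reconstruction loss of
        # stage k is measured against the output of stage k-1, not the raw data.
        encode = fit_autoencoder(current, current, size)
        encoders.append(encode)
        # The next stage sees the hidden codes, not the reconstruction.
        current = [encode(row) for row in current]
    return encoders, current


def fit_truncating_autoencoder(inputs, targets, size):
    """Toy stand-in for real training: 'encodes' a row by keeping its first
    `size` values. With Keras this would build a small model, call
    model.fit(inputs, targets), and return the encoder's predict step."""
    assert inputs is targets  # each stage reconstructs its own input
    return lambda row: row[:size]


rows = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
encoders, codes = train_stack(rows, [3, 2], fit_truncating_autoencoder)
print(len(encoders), codes)
```

To compute the loss against the original input instead, the target passed to the fit step would be the raw data rather than `current`, which corresponds to changing the y argument of model.fit(...).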

Related

Need help defining a simple neural network

I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolutional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!

Categorical cross-entropy, the loss function for this neural network, makes it usable for sentiment analysis. The softmax output layer returns probabilities for the different classes, and the cross-entropy loss compares them with the true labels. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure whether you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a bag-of-words model. A bag-of-words model essentially stores the words in your text together with how often each one occurs.
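As a rough illustration of what categorical cross-entropy measures, here is a sketch in plain Python for a single example with a one-hot label. The function name and the sample probabilities are made up for illustration and are not taken from the question's data.

```python
import math

def categorical_cross_entropy(true_one_hot, predicted_probs):
    """Cross-entropy for one example: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(p)
                for t, p in zip(true_one_hot, predicted_probs) if t > 0)

# Label "positive" encoded as [0, 1]; two hypothetical softmax outputs.
confident_right = categorical_cross_entropy([0, 1], [0.1, 0.9])
confident_wrong = categorical_cross_entropy([0, 1], [0.9, 0.1])
print(round(confident_right, 3), round(confident_wrong, 3))
```

The loss is small when the predicted probability of the true class is high and grows quickly when the model is confidently wrong.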
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
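The counts above can be reproduced with a few lines of Python. This is only a sketch: `bag_of_words` is a hypothetical helper with a deliberately naive tokenizer (letters only, case preserved), not what a Keras tokenizer does internally.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each word's occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[A-Za-z]+", text))

bow = bag_of_words("John likes to watch movies. Mary likes movies too.")
print(dict(bow))
```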
The network architecture you are using is not convolutional. Rather, it is a feedforward model, which connects all units in one layer to all units in the next; each unit computes a dot product between the previous layer's outputs and its own weights.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense refers to the fact that every unit in one layer is connected to every unit in the next layer, e.g., the 512 units of the first Dense layer and the 256 units of the second give a total of 512x256 connections between them.
Yes, the connections between the 512 units in the first Dense layer and the 256 units in the second, which form a 512x256 matrix of parameters, make it dense. But the term Dense here is more an API name than a semantic description. Similarly, the parameter matrix between the second and third Dense layers is 256x2.
If you exclude the input layer (with max_words units) and the output layer (with 2 units, one per class 0/1), then your network has two hidden layers, with 512 and 256 units.
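The sizes of these parameter matrices can be checked with simple arithmetic. The sketch below counts weights and biases for each Dense layer of the question's model; `dense_param_count` is a hypothetical helper, and `max_words = 10000` is an assumed value because the question does not state it. Dropout layers add no parameters.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

max_words = 10000  # assumed vocabulary size; the question does not give it
layer_sizes = [max_words, 512, 256, 2]  # input, Dense(512), Dense(256), Dense(2)

counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```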
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor telling the network whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions used here provide nonlinearity to the network's computations. In a little more detail, sigmoid has the nice property that its output lies between 0 and 1 and can be interpreted as a probability. So if the network outputs 0.89 for a data point, the model evaluates that data point to be positive with a probability of 0.89. The softmax in the output layer plays the same role for several classes: it normalizes the outputs so that they sum to 1.
The use of sigmoid in the hidden layer is probably for teaching purposes, since ReLU units are generally favored over sigmoid/tanh because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU there.
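For reference, the three activation functions used in the question's model can be written in a few lines of plain Python. This is a sketch for scalar and list inputs, not the Keras implementation, and the sample values are arbitrary.

```python
import math

def relu(z):
    """Pass positive values through, clip negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
probs = softmax([2.0, 0.0])
print(probs, sum(probs))
```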

Keras model for multiclass classification for sentiment analysis with LSTM - how can my model be improved?

I want to predict the number of stars a product gets on Amazon using Keras. I have seen other ways of doing this, but I have used the Universal Sentence Encoder with one-hot encoding (I followed a YouTube tutorial to embed the reviews). Without an LSTM layer, and using the following layers:
model.add(keras.layers.Dense(units=256, input_shape=(X_train.shape[1],), activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(units=128, activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001), metrics=['accuracy'])
I am able to get an accuracy of around 0.55 and a loss of 1, which isn't great. However when I reshape my X_train and X_test data to be 3D input for an LSTM layer and then put it into a model such as:
model.add(keras.layers.Dense(units=256, input_shape=(512, 1), activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.3)))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001), metrics=['accuracy'])
I get an accuracy of around 0.2 which is even worse, with a loss of close to 2.00.
I have no idea whether an LSTM is necessary as I am new to neural networks but I have been trying to do this for my project.
So I am asking: should I stick with the first model without an LSTM, or is there a way of changing the second network with the LSTM to improve on its accuracy of 0.2 while keeping the embedding method I have used?
Thanks for your time!

The reason to choose an LSTM over plain dense layers is that, in language, the relationships between words matter for understanding what a sentence means. A model with only dense layers does not capture this well because it has no connections that can store such information; it predicts from the input as a whole rather than from how the words relate to each other. LSTM stands for Long Short-Term Memory. In short, an LSTM can remember data it has seen earlier in the sequence, which helps it relate different words in the same sentence.
As for how to build your model: first, use the Tokenizer in the TF library to create tokens from your data and convert your texts into integer sequences, then pad them using pad_sequences. Your data is then ready. In your network, the first layer should be an Embedding layer. After it you can use an LSTM (for the reasons above), a Bidirectional LSTM (which learns dependencies both left-to-right and right-to-left and often performs better than a unidirectional LSTM), or Conv1D layers (which model dependencies that fall within the filter length; they have been used successfully, so they are worth a try), followed by a pooling layer (GlobalMaxPooling1D) and then dense layers to produce your predictions.
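The preprocessing steps described above (tokenize, map words to integers, pad) can be sketched in plain Python. This is a simplified stand-in, not the Keras Tokenizer or pad_sequences implementation: the helper names are hypothetical, the tokenizer just splits on whitespace, and indices are assigned by first appearance. The resulting integer matrix is the kind of input an Embedding layer takes.

```python
def build_word_index(texts):
    """Map each word to a positive integer; 0 is reserved for padding."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            index.setdefault(word, len(index) + 1)
    return index

def texts_to_padded_sequences(texts, word_index, max_len):
    """Turn texts into integer sequences, left-padded with 0 to max_len."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]  # keep the last max_len tokens if too long
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

reviews = ["great product", "not great at all"]
word_index = build_word_index(reviews)
padded = texts_to_padded_sequences(reviews, word_index, max_len=3)
print(word_index)
print(padded)
```

Every row has the same length, which is what lets the reviews be stacked into one array of shape (number of reviews, max_len).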

Keras LSTM, is the time_step equal to 1 like transforming the LSTM into a MLP?

I'm a beginner in the field of Deep Learning. I'm trying to use Keras for an LSTM in a regression problem. I would like to build an ANN that can exploit the memory cell between one prediction and the next.
In more detail: I have a neural network (Keras) with 2 hidden LSTM layers and 1 output layer, in a regression context.
The batch_size is 7, the timestep is 1, and I have 5749 samples.
I only want to understand whether using timestep == 1 is the same as using an MLP instead of an LSTM. By time_step, I mean the value used when reshaping the input for the Sequential model in Keras. The output is a single regression value.
I'm not interested in the previous inputs; I'm only interested in the output of the network as information for the next prediction.
Thank you in advance!

You can say so :)
You're right in thinking that you won't have any recurrence anymore.
But internally, there will still be more operations than in regular Dense layers, because an LSTM has more kernels.
But be careful:
If you use stateful=True, it will still be a recurrent LSTM!
If you use initial states properly, you can still make it recurrent.
If you're interested in creating custom operations with the memory/state of the cells, you could try creating your own recurrent cell, taking the LSTMCell code as a template.
Then you'd use that cell in an RNN(CustomCell, ...) layer.
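The difference between resetting and carrying the state can be illustrated with a toy scalar LSTM cell in plain Python. This is a sketch, not Keras code: `lstm_step` is a hypothetical helper, all weights are set to arbitrary values, and biases are zero. The carried-state loop is only an analogy for what stateful=True or explicit initial states achieve.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w=0.5, u=0.5):
    """One step of a scalar LSTM cell with all input weights w,
    all recurrent weights u, and zero biases (toy values)."""
    i = sigmoid(w * x + u * h)    # input gate
    f = sigmoid(w * x + u * h)    # forget gate
    o = sigmoid(w * x + u * h)    # output gate
    g = math.tanh(w * x + u * h)  # candidate cell value
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

inputs = [0.5, 0.5, 0.5]

# Timestep 1 with the state reset for every sample: output depends on x only.
stateless = [lstm_step(x, 0.0, 0.0)[0] for x in inputs]

# State carried from one sample to the next: same x, different outputs.
h, c = 0.0, 0.0
stateful = []
for x in inputs:
    h, c = lstm_step(x, h, c)
    stateful.append(h)

print(stateless)
print(stateful)
```

With the state reset each time, identical inputs give identical outputs, so the cell behaves like a (more expensive) feedforward unit. Once the state is carried over, the output also depends on what came before.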

Transfer Learning, adding Keras LSTM layer, (hot dog, not hot dog using binary cross-entropy)

The training features, shape (1032, 5, 5, 122880), go into an LSTM layer. This produces "ValueError: Input 0 is incompatible with layer lstm_16: expected ndim=3, found ndim=2"
1032 is the number of training samples
5x5x122880 is the training sample's bottlenecked features
model = Sequential()
model.add(CuDNNLSTM(2048, input_shape=train_final_model.shape[:1]))
model.add(Dense(2, activation='sigmoid'))
It seems as though the Keras ValueError is telling me that I should reshape the training bottlenecked features before running them through the LSTM layer. But doing that:
training_bottlenecked_features = np.reshape(train_final_model, (1032,25,122880))
print(training_bottlenecked_features.shape)
final_model.add(LSTM(2,input_shape=training_bottlenecked_features.shape[:1]))
Yields this:
(1032, 25, 122880)
"ValueError: Input 0 is incompatible with layer lstm_23: expected ndim=3, found ndim=2"
I've played around with the input in several other combinations, so I feel I may be missing something fundamental in the Keras LSTM documentation: https://keras.io/layers/recurrent/ Thank you for any insight you may have.

It turns out that the fix in my case involved keras.layers.TimeDistributed. This requires minimally restructuring the training and validation label dimensions (e.g. using np.expand_dims()).
If TimeDistributed is used to wrap the entire Sequential flow, you will likely also need to reshape the training and validation data.
The dialog here is helpful for recalling recurrent network architectural distinctions:
https://github.com/keras-team/keras/issues/1029
Looking back, I wish I had started reading here:
https://keras.io/layers/wrappers/
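Independently of TimeDistributed, it may be worth checking the slice used for input_shape in the question's snippets. Keras expects input_shape to describe a single sample, without the batch dimension, and an LSTM needs that to be (timesteps, features). A quick check in plain Python, using the reshaped feature shape printed in the question:

```python
# (samples, timesteps, features) after the np.reshape in the question
features_shape = (1032, 25, 122880)

batch_only = features_shape[:1]   # what shape[:1] selects
per_sample = features_shape[1:]   # what shape[1:] selects

print(batch_only)
print(per_sample)
```

The slice shape[:1] yields a one-element tuple holding only the number of samples, which describes each sample as 1D and appears consistent with the "expected ndim=3, found ndim=2" message. The slice shape[1:] yields the (timesteps, features) pair.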

Feature Extraction on Neural Networks Using Theano

I have a trained network, which consists of the following layers:
{conv1, pool1, conv2, pool2, conv3, pool3, conv4, pool4, fc5, fc6, output}
where fc means fully connected layers and conv means convolutional layers.
I need to do feature extraction for some images. I am using Lasagne and Theano.
I need to save features from each layer for later analysis. I am a newbie in this language, so I tried to find sample code or tutorials on this (with Theano/Lasagne). However, I could not work out by myself what I should do.
I would appreciate it if someone could guide me on how to implement feature extraction.
Thank you in advance
Edit: I followed comments by Mr/Ms gntoni, here is my code:
feat_all = []
for layer in layer_list:
    feat = np.zeros_like(lasagne.layers.get_output([self.acnn.cnn[layer]], inputs = img, deterministic=True))
    feat[:] = lasagne.layers.get_output([self.acnn.cnn[layer]], inputs = img, deterministic=True)
    feat_all.append(feat)
For my case, I need to save features from each layer. I want to write a function like the one that we have in Caffe:
self.net.blobs['data'].data[0] = img
self.net.forward(end=layer_list[-1])
feat_all = []
for layer in layer_list:
    feat = np.zeros_like(self.net.blobs[layer].data[0])
    feat[:] = self.net.blobs[layer].data[0]
    feat_all.append(feat)
However, my trained model is written with Lasagne and Theano, so I have to implement this in Lasagne.
After writing the Lasagne code above, I get an empty output.
I wonder why, and how I can fix it.
Thank you in advance

A Convolutional Neural Network like yours consists of two parts:
The first is the feature extraction part, which in your case consists of the conv-pool layers {conv1, pool1, conv2, pool2, conv3, pool3, conv4, pool4}.
The second is the classification part. In your network: {fc5, fc6, output}.
During training, the first part tries to obtain the best representation of the input data for the second part to classify.
So if, after training, you disconnect these two parts, the output of the last feature extraction layer (pool4, which follows conv4) gives you the features you want.
These features can be used with a different classifier. In fact, many people take an already trained network (e.g. AlexNet), remove the last classification layers, and use the features with their own classification system.

Keep in mind that in Lasagne the get_output method returns Theano tensors, and you cannot directly use them to compute the features from a numpy array. However, you can define a Theano function and use it to compute the values. In your case:
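import theano
import lasagne
layers = [self.acnn.cnn[layer] for layer in layer_list]
# deterministic=True is an argument of get_output (it disables dropout), not of theano.function
feat_fn = theano.function([input_var],
                          lasagne.layers.get_output(layers, deterministic=True))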
where input_var is the input tensor to your network. The get_output method can accept multiple layers and Theano functions can have multiple outputs, so you can define a single function to extract all the features. Getting the numerical values is then as simple as:
feat_all = feat_fn(img)
