I have just started to learn CNN on Tensorflow. However, when I train the model the Loss and accuracy don't change.
I am using images of size 128x128x3 and the images are normalized (in [0,1]). And here is the compiler that I am using.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.000001), loss='binary_crossentropy', metrics=['accuracy'])
And here is the summary of my model
I tried the following things but I always have the same values:
Change the learning rate from 0.00000001 to 10
Change the convolution kernel I tried 5x5 and 3x3
I added another fully connected layer and a Conv layer.
update
The layers' weights didn't change after fitting the model. I have the same initial weights.
You could try this,
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
Also, remove the softmax activation from the last layer, a binary classification problem does not need a softmax. So, what softmax does in this case is clip the value always to 1 since there is only one probability and thus the network doesn't train. This link might help you understand softmax.
Additionally, you can try using sigmoid activation at the final node. That clips the output to a value in the range of 0 to 1 and the network weights wont blow up because of a very high loss.
Related
I am very new to this and I have several question. I have code snippets of a neural network created python with keras. The model is used for sentiment anaylsis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(train_x, train_y,
batch_size=32,
epochs=5,
verbose=1,
validation_split=0.1,
shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment anaylsis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolusional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!
Categorical Cross Entropy, which is the loss function for this neural network, makes it usable for Sentiment Analysis. Cross Entropy loss returns probabilities for different classes. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure if you are using a tokenizer since it is not apparent from the code you provided but if you are, then yes, it is a Bad of words model. A Bag of words model essentially creates a storage for the word roots you have in your text.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The network architecture you are using is not Convolutional, rather it is a feedforward model, which connects all units from one layer to all the units in the next, providing a dot product of the values from the two layers.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense reflects to the fact that all units in the first layer (512) are connected to every other unit in the next layer, i.e., a total of 512x256 connections between first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
If you exclude the input layer (having 512 units) and output layer (having 2 possible outputs, i.e., 0/1), then your network here has one layer, with 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as being a supervisor to the network indicating it whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions being used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, sigmoid has a nice property that its output can be interpreted as probabilities. So if the network is outputting 0.89 for a data point, then it would mean that the model evaluates that data point to be positive with a probability of 0.89 .
The usage of sigmoid is probably for teaching purposes since ReLU activation units are favored over sigmoid/tanh because of better convergence properties and I don't see a convincing reason to use sigmoid instead of ReLU.
I am new to Keras and starting with this code from tf tutorial :
# choosing the layers of my models
model = keras.Sequential([ # the sequential model of Keras library
keras.layers.Flatten(input_shape=(28, 28)), # the first input layer
keras.layers.Dense(128, activation='relu'),# the hidden layer
keras.layers.Dense(10)# output layers and 10 corresponds to the number of used classes
])
I wonder what the value 128 is? and how it was calculated?
It's not calculated, it's a hyperparameter (a parameter that isn't estimated by the data, but selected by you prior to running the model). It essentially determines the complexity of the model. The more neurons, the more complex relationships it can model in the data.
128 is a hyper parameter which is the number of nodes in your second to last layer.
It isn't calculated, you can change it to whatever you want, try [18,32,64...etc]. The larger you make it the slower your training will be; however your model might be more accurate since there are more nodes to capture the signal of your dataset.
I am studying some machine learning on my own and I am practicing (in Python) with the assignments of the course held by Andrew Ng.
After completing the fourth exercise by hand, I tought to do it in Keras to practice with the library.
In the exercise we have 5000 images of hand written digits, going from 0 to 9. Each image is a 20x20 matrix. The dataset is stored in a matrix X of shape 5000x400 (each image has been 'unrolled') and the labels are stored in a matrix y of shape 5000x10. Each row of y is a hot-one vector.
The exercise asks to implement backpropagation to maximaze the log likelihood, for a simple neural network with one input layer, one hidden layer and one output layer. The hidden layer has 25 neurons and the output layer 10. We use sigmoid as activation for both layers.
My code in Keras is this
model=Sequential()
model.add(Dense(25,input_shape=(400,),use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.add(Dense(10,use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy',optimizer='sgd',metrics=['accuracy'])
model.fit(X, y, batch_size=5000,epochs=100, verbose=1)
Since I want this to be as similar as possible to the assignment I have used the same initial weights as the assignment, the same regularization parameter, the same activations and gradient descent as a optimizer (actually the assignment uses the Truncated Newton Method but I don't think my problem lies here).
I thought I was doing everything correctly but when I train the network I get a 10% accuracy on the training dataset. Even playing a little bit with the parameters the accuracy doesn't change much. To try to understand better the problem I tested it with smaller pieces of the dataset. For instance if I select a subdataset of 100 elements containing x images of zero and 100-x images of one, I get a x% training accuracy. My guess is that the network is optimizing the parameters to recognise only the first digit.
Now my questions are: what I am missing? Why isn't this the right implementation of the neural network described above?
If you are practising on the MNIST dataset, to classify 10 digits, you have 10 classes to predict. Rather than sigmoid, you should use ReLU in the hidden layers ( in your case the first layer ) and use softmax activation on the output layer. Use categorical crossentropy loss function with adam or sgd optimizer.
I try to build a Stacked Autoencoder in Keras (tf.keras). By stacked I do not mean deep. All the examples I found for Keras are generating e.g. 3 encoder layers, 3 decoder layers, they train it and they call it a day. However, it seems the correct way to train a Stacked Autoencoder (SAE) is the one described in this paper: Stacked Denoising Autoencoders: Learning Useful Representations in
a Deep Network with a Local Denoising Criterion
In short, a SAE should be trained layer-wise as shown in the image below. After layer 1 is trained, it's used as input to train layer 2. The reconstruction loss should be compared with the layer 1 and not the input layer.
And here is where my trouble begins. How to tell Keras which layers to use the loss function on?
Here is what I do. Since the Autoencoder module is not existed anymore in Keras, I build the first autoencoder, and I set its encoder's weights (trainable = False) in the 1st layer of a second autoencoder with 2 layers in total. Then when I train that, it obviously compares the reconstructed layer out_s2 with the input layer in_s, instead of the layer 1 hid1.
# autoencoder layer 1
in_s = tf.keras.Input(shape=(input_size,))
noise = tf.keras.layers.Dropout(0.1)(in_s)
hid = tf.keras.layers.Dense(nodes[0], activation='relu')(noise)
out_s = tf.keras.layers.Dense(input_size, activation='sigmoid')(hid)
ae_1 = tf.keras.Model(in_s, out_s, name="ae_1")
ae_1.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
# autoencoder layer 2
hid1 = tf.keras.layers.Dense(nodes[0], activation='relu')(in_s)
noise = tf.keras.layers.Dropout(0.1)(hid1)
hid2 = tf.keras.layers.Dense(nodes[1], activation='relu')(noise)
out_s2 = tf.keras.layers.Dense(nodes[0], activation='sigmoid')(hid2)
ae_2 = tf.keras.Model(in_s, out_s2, name="ae_2")
ae_2.layers[0].set_weights(ae_1.layers[0].get_weights())
ae_2.layers[0].trainable = False
ae_2.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
The solution should be fairly easy, but I can't see it nor find it online. How do I do that in Keras?
It seems like the question is outdated by looking at the comments. But I'll still answer this as the use-case mentioned in this question is not just specific to autoencoders and might be helpful for some other cases.
So, when you say "train the whole network layer by layer", I would rather interpret it as "train small networks with one single layer in a sequence".
Looking at the code posted in this question, it seems that the OP has already built small networks. But both these networks do not consist of one single layer.
The second autoencoder here, takes as input the input of first autoencoder. But, it should actually take as input, the output of first autoencoder.
So then, you train the first autoencoder and collect it's predicitons after it is trained. Then you train the second autoencoder, which takes as input the output (predictions) of first autoencoder.
Now let's focus on this part: "After layer 1 is trained, it's used as input to train layer 2. The reconstruction loss should be compared with the layer 1 and not the input layer."
Since the network takes as input the output of layer 1 (autoencoder 1 in OP's case), it will be comparing it's output with this. The task is achieved.
But to achieve this, you will need to write the model.fit(...) line which is missing in the code provided in the question.
Also, just in case you want the model to calculate loss on input layer, you simply replace the y parameter in model,fit(...) to the input of autoencoder 1.
In short, you just need to decouple these autoencoders into tiny networks with one single layer and then train them as you wish. No need to use trainable = False now, or else use it as you wish.
I'm currently trying to test the differences of behavior between a LSTM RNN and a GRU RNN on the prediction of time series ( classification 1/0 if time series goes up or down). I use the fit_generator method (as in Keras François Chollet book's)
I feed 30 point to the network and the next point has to be classified as up or down. The training set samples are reshuffled while the validation samples are not of course.
If I do not change the default value of the Learning rate ( 10-3 for adam algorithm) then the usual happens, the training set tends to be overfitted after a certain number of epochs, both for LSTM and GRU cells.
LSTM 1 layer 128N
Note that my plots are averaged with 10 simulations ( this way I get rid of the particular weight random initialization )
If I choose lower learning rates, the training and validation accuracy see below the impact of a different learning rate, it appears to me like the training set is not able to overfit anymore (???)
learning rate impact on GRU network
When I compare the LSTM and the GRU, it's even worse, the validation set gets higher accuracy than the training set in the LSTM case. For the GRU case, the curve are close, but the training set is still higher
2 layer LSTM LR=1e-5
2 layer GRU LR=1e-5
Note that this is less emphasized for 1 layer LSTM
1 layer LSTM LR=1e-5
I have tested this with 1 or 2 layers and for a different validation set but results are similar.
extract of my code below:
modelLSTM_2a = Sequential()
modelLSTM_2a.add(LSTM(units=32, input_shape=(None, data.shape[-1]),return_sequences=True))
modelLSTM_2a.add(LSTM(units=32, input_shape=(None, data.shape[-1]),return_sequences=False))
modelLSTM_2a.add(Dense(2))
modelLSTM_2a.add(Activation('softmax'))
adam = keras.optimizers.Adam(lr=1e-5, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
modelLSTM_2a.compile(optimizer= adam, loss='categorical_crossentropy', metrics=['accuracy'])
Would someone have a clue of what may happening?
I'm really puzzled by this behavior, especially by the impact of the learning rate in the LSTM case