Currently I am trying to implement a capsule network using Xifeng Guo's Keras code for capsule nets. I have a dataset of brain tumor images with 98 negatively labeled instances and 155 positively labeled instances. I would like to use the capsnet to predict either positive or negative for a brain tumor on the image. Unfortunately I cannot figure out why it is not going beyond a set accuracy / loss. I have attempted data augmentation to increase the dataset size, with a 50/50 prediction as a result.
I have read the paper 'Capsule Networks against Medical Imaging Data Challenges', where the authors implemented a capsule net on, amongst others, the DIARETDB1 dataset, which comprises only 89 images, and it gets decent predictions even without data augmentation (0.887 F1 score on imbalanced scenario 1). This makes me believe that something may be going wrong in my network. FYI: my images are normalized and cropped.
Any input is appreciated!
%pylab inline
import os
import numpy as np
import tensorflow as tf
import keras
import keras.backend as K
from capsulelayers import CapsuleLayer, PrimaryCap, Length, Mask
from keras import layers, models, optimizers
from keras.applications import vgg16
from keras.layers import Conv2D, MaxPooling2D
K.set_image_data_format('channels_last')
def CapsNet(input_shape, n_class, routings):
    x = layers.Input(shape=input_shape)
    # Layer 1: Just a conventional Conv2D layer
    conv1 = Conv2D(filters=256, kernel_size=9, strides=1, padding='valid', activation='relu', name='conv1')(x)
    # Layer 2: Conv2D layer with `squash` activation, then reshape to [None, num_capsule, dim_capsule]
    primarycaps = PrimaryCap(conv1, dim_capsule=8, n_channels=32, kernel_size=9, strides=2, padding='valid')
    # Layer 3: Capsule layer. Routing algorithm works here.
    digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
                             name='digitcaps')(primarycaps)
    # Layer 4: This is an auxiliary layer to replace each capsule with its length. Just to match the true label's shape.
    # If using tensorflow, this will not be necessary. :)
    out_caps = Length(name='capsnet')(digitcaps) # CAN WE EXCLUDE THIS IN KERAS TOO?
    # Decoder network.
    y = layers.Input(shape=(n_class,))
    masked_by_y = Mask()([digitcaps, y]) # The true label is used to mask the output of capsule layer. For training
    masked = Mask()(digitcaps) # Mask using the capsule with maximal length. For prediction
    # Shared Decoder model in training and prediction
    decoder = models.Sequential(name='decoder')
    decoder.add(layers.Dense(512, activation='relu', input_dim=16*n_class))
    decoder.add(layers.Dense(1024, activation='relu'))
    decoder.add(layers.Dense(np.prod(input_shape), activation='sigmoid'))
    decoder.add(layers.Reshape(target_shape=input_shape, name='out_recon'))
    # Models for training and evaluation (prediction)
    train_model = models.Model([x, y], [out_caps, decoder(masked_by_y)])
    eval_model = models.Model(x, [out_caps, decoder(masked)])
    # manipulate model
    noise = layers.Input(shape=(n_class, 16))
    noised_digitcaps = layers.Add()([digitcaps, noise])
    masked_noised_y = Mask()([noised_digitcaps, y])
    manipulate_model = models.Model([x, y, noise], decoder(masked_noised_y))
    return train_model, eval_model, manipulate_model
def margin_loss(y_true, y_pred):
    """
    Margin loss for Eq.(4). When y_true[i, :] contains not just one `1`, this loss should work too. Not test it.
    :param y_true: [None, n_classes]
    :param y_pred: [None, num_capsule]
    :return: a scalar loss value.
    """
    L = y_true * K.square(K.maximum(0., 0.9 - y_pred)) + \
        0.5 * (1 - y_true) * K.square(K.maximum(0., y_pred - 0.1))
    return K.mean(K.sum(L, 1))
model, eval_model, manipulate_model = CapsNet(input_shape=x_train.shape[1:],
                                              n_class=1,
                                              routings=2)
# compile the model
model.compile(optimizer=optimizers.Adam(lr=3e-3),
              loss=[margin_loss, 'mse'],
              metrics={'capsnet': 'accuracy'})
model.summary()
history = model.fit(
    [x_train, y_train], [y_train, x_train],
    batch_size=16,
    epochs=30,
    validation_data=([x_val, y_val], [y_val, x_val]),
    shuffle=True)
The result is plenty of epochs where neither the accuracy nor the loss really changes:
Epoch 1/30
161/161 [==============================] - 12s 77ms/step - loss: 0.2700 - capsnet_loss: 0.1911 - decoder_loss: 0.0789 - capsnet_acc: 0.5901 - val_loss: 0.2153 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0565 - val_capsnet_acc: 0.6078
Epoch 2/30
161/161 [==============================] - 9s 56ms/step - loss: 0.2046 - capsnet_loss: 0.1560 - decoder_loss: 0.0486 - capsnet_acc: 0.6149 - val_loss: 0.2015 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0427 - val_capsnet_acc: 0.6078
Epoch 3/30
161/161 [==============================] - 9s 56ms/step - loss: 0.1960 - capsnet_loss: 0.1560 - decoder_loss: 0.0401 - capsnet_acc: 0.6149 - val_loss: 0.1982 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0394 - val_capsnet_acc: 0.6078
There are two procedures for obtaining capsules from convolutions: matrix-vector transformation and convolutional vector transformation. Since you have only a small amount of data, the convolutional vector transformation is likely the better choice in this case.
I advise you to add a batch normalization layer after the first convolutional layer and see what it gives.
I had the same problem with training the capsule network on some datasets in which the training process did not converge.
I accidentally reduced the Adam learning rate default parameter from 0.001 to 0.000001 and the problem was solved.
So, I think this parameter plays an important role here.
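To illustrate why the step size alone can decide whether training converges, here is a toy sketch. It uses plain gradient descent on a one-dimensional quadratic, not Adam and not the capsule network, so the function, the starting point, and the two learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = 10 * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 20.0 * w  # derivative of 10 * w**2
        w -= lr * grad   # each step multiplies w by (1 - 20 * lr)
    return w

# With lr = 0.11 the factor is -1.2, so |w| grows on every step.
w_large_lr = gradient_descent(lr=0.11)
# With lr = 0.01 the factor is 0.8, so w shrinks towards the minimum at 0.
w_small_lr = gradient_descent(lr=0.01)

print(abs(w_large_lr) > 1.0, abs(w_small_lr) < 1e-3)
```

On a real network the picture is less clean, but the same mechanism (steps too large to settle into a minimum) is one plausible reason a smaller learning rate helped here.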
Related
I am working on a project to implement CNN-LSTM sentiment analysis. Below is the code
from keras.models import Sequential
from keras import regularizers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, Dropout
from keras.layers import Embedding, Bidirectional, LSTM
from keras.layers import BatchNormalization
model7 = Sequential()
model7.add(Embedding(max_words, 40,input_length=max_len)) #The embedding layer
model7.add(Conv1D(20, 5, activation='relu', kernel_regularizer = regularizers.l2(l = 0.0001), bias_regularizer=regularizers.l2(0.01)))
model7.add(Dropout(0.5))
model7.add(Bidirectional(LSTM(20,dropout=0.5, kernel_regularizer=regularizers.l2(0.01), recurrent_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))))
model7.add(Dense(1,activation='sigmoid'))
model7.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
checkpoint7 = ModelCheckpoint("best_model7.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model7.fit(X_train_padded, y_train, epochs=10,validation_data=(X_test_padded, y_test),callbacks=[checkpoint7])
Even after adding regularizers and dropout, my model has very high validation loss and low accuracy.
Epoch 3: val_accuracy improved from 0.54517 to 0.57010, saving model to best_model7.hdf5
2188/2188 [==============================] - 290s 132ms/step - loss: 0.4241 - accuracy: 0.8301 - val_loss: 0.9713 - val_accuracy: 0.5701
My train and test data:
train: (70000, 7)
test: (30000, 7)
train['sentiment'].value_counts()
1 41044
0 28956
test['sentiment'].value_counts()
1 17591
0 12409
Can anyone please let me know how to reduce overfitting.
Since your code works, I believe that your network is failing silently by 'not learning' a lot from the data. Here's a list of some of the things you can generally check:
Is your textual data properly transformed into numerical data? Is it well represented using TF-IDF, bag of words, or any other method that returns a numerical representation?
I see that you imported batch normalization but do not apply it. Batch norm often helps and, importantly, has a regularizing effect, since the inputs to each layer are normalized using the statistics of the current mini-batch. So maybe remove the L2 regularization from all layers and apply a simple batch norm instead, which may reduce overfitting (also, use it without dropout, since some empirical studies suggest the two should not be combined).
Your embedding output dimension is currently set to 40, that is, each token from a vocabulary that may contain more than 10,000 entries is represented by only 40 numbers. It seems a bit low. Try something more 'standard' such as 128 or 256 instead of 40.
Lastly, you use the Adam optimizer with all the default parameters. However, the learning rate can have a big impact on how the loss evolves during training: it scales the size of each weight update computed from the gradients. The default is learning_rate=0.001. So try the following code with a somewhat larger learning rate (for example 0.01 or even 0.1).
A simple example :
import keras
from keras.models import Sequential
from keras.layers import LSTM, BatchNormalization, Dense
# define model
model = Sequential()
model.add(LSTM(32)) # or CNN
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid')) # binary_crossentropy expects probabilities by default
# define optimizer
optimizer = keras.optimizers.Adam(0.01)
# define loss function
loss = keras.losses.binary_crossentropy
# define metrics to monitor
metric = [keras.metrics.BinaryAccuracy(name='accuracy')] # you can add more
# compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metric)
Final thought: I see that you went for a combination of CNN and LSTM, which has great merit. However, it is always recommended to try a simple MLP network first to establish a baseline score that you can later try to beat. Does a simple MLP with 1 or 2 layers and not a lot of units produce a low accuracy score as well? If it performs better, then maybe the problem is in the implementation or in the hyperparameters you chose for the layers (or it may even be theoretical).
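An even cheaper baseline can be read off the label counts posted in the question: always predicting the majority class. A minimal sketch, where the counts are copied from the test value_counts() output above and the variable names are made up for illustration:

```python
# Label counts copied from test['sentiment'].value_counts() in the question.
test_counts = {1: 17591, 0: 12409}

# Always predicting the most frequent label gives this accuracy for free.
majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / sum(test_counts.values())

reported_val_accuracy = 0.5701  # from the epoch 3 log line in the question
print(majority_label, round(baseline_accuracy, 4))
print(reported_val_accuracy < baseline_accuracy)
```

If the reported validation accuracy sits below this number, as it appears to here, the model is not yet beating the trivial predictor on the test set.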
I hope this answer helps and cheers!
My model is like this:
def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr,state_h,state_c = LSTM(latent_dim,dropout=0.1,return_state = True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr,state_c])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model
model =_get_model((n_steps_in, n_features),latent_dim ,n_steps_out)
history = model.fit(X_train,Y_train)
during training I get:
Epoch 1/2000
1/1 [==============================] - 1s 698ms/step - loss: 0.2338 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1185 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2341 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1181 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 2/2000
1/1 [==============================] - 0s 34ms/step - loss: 0.2328 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1175 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2329 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1169 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 3/2000
1/1 [==============================] - 0s 38ms/step - loss: 0.2316 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1163 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2315 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1155 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
When I look at the history:
print(history.history.keys())
dict_keys(['loss', 'activation_26_loss', 'lstm_151_loss', 'activation_26_accuracy', 'lstm_151_accuracy', 'val_loss', 'val_activation_26_loss', 'val_lstm_151_loss', 'val_activation_26_accuracy', 'val_lstm_151_accuracy'])
which ones are the training loss and training accuracy?
Since there are only 2 outputs, why are there 3 losses (loss, activation_26_loss and lstm_151_loss) BUT 2 accuracies (activation_26_accuracy and lstm_151_accuracy)? What does each loss and each accuracy stand for?
TLDR;
Three losses (2+1): one loss for each individual output, and one total that is the weighted sum of the two. By default each is weighted 1.0, so the total is simply their sum. You can set both the losses and their weights explicitly.
Two accuracies since there are 2 outputs. metrics are just for the user to view and don't affect the neural network.
Detailed explanation;
Let's try to see what you are doing here first. (I am referring to the previous question you asked to get the shapes for the inputs.)
import numpy as np
from tensorflow.keras import layers, Model, utils
def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr,state_h,state_c = layers.LSTM(latent_dim,dropout=0.1,return_state = True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr,state_c]) #<------- One input, 2 outputs
    model.compile(optimizer='adam', loss='mse')
    return model
#Dummy data
X = np.random.random((100,15,5))
y1 = np.random.random((100,4))
y2 = np.random.random((100,7))
model =_get_model((15, 5), 7 , 4)
You are building a supervised model that takes an input of shape (15,5) and outputs 2 things: first a (4,) vector that should contain probability values for the 4 classes, and second a (7,) vector that should contain the cell states of the 7 LSTM units. The loss you are using to train the model to predict both outputs is mse.
Since this is a supervised model, you will have to provide the model with samples of inputs and outputs. If you have 100 samples, then your inputs would be (100,15,5) shaped and your outputs would be (100,4) and (100,7), since you have 2 outputs.
Loss(y_actual, y_pred) is a function that tells the neural network how far is its prediction from the actual value. Based on this, it tells the neural network to update itself (its weights specifically using backpropagation) so that its predictions become closer and closer to actual and thus reduce the Loss.
If the above points are clear then let's look at what this network is doing specifically
Your current model has one input and 2 outputs.
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
Since you have defined mse as loss, both outputs are trained to minimize mse. These are 2 of the 3 losses: activation_26_loss, which is the loss for the output of the final Dense layer (after its activation), and lstm_151_loss, which is the loss for the LSTM cell state. Keras auto-generates these layer names, with a running number, unless you name the layers yourself.
The one named simply loss is the weighted sum of the other 2 losses; with the default weights of 1.0 each, that is just their sum. I'll talk about this more later.
The metrics=['accuracy'] is just a metric for users to track. Since there are 2 outputs, you get 2 different accuracy metrics, one for each output. They don't affect the neural network's training.
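The relationship between the three loss values can be checked against the log in the question. The sketch below assumes the Keras default of weighting every output loss by 1.0 when loss_weights is not given; the helper name total_loss is made up for illustration.

```python
def total_loss(output_losses, loss_weights=None):
    """Combine per-output losses into one scalar as a weighted sum."""
    if loss_weights is None:
        loss_weights = [1.0] * len(output_losses)  # assumed Keras default
    return sum(w * l for w, l in zip(loss_weights, output_losses))

# (activation_26_loss, lstm_151_loss, loss) for the three epochs in the log
logged = [
    (0.1153, 0.1185, 0.2338),
    (0.1153, 0.1175, 0.2328),
    (0.1153, 0.1163, 0.2316),
]
for dense_loss, state_loss, reported in logged:
    combined = total_loss([dense_loss, state_loss])
    print(round(combined, 4), reported)
```

Passing explicit weights, for example total_loss([a, b], [0.4, 0.6]), gives 0.4 * a + 0.6 * b, which is the kind of combination loss_weights asks Keras to compute.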
Now, when working with neural networks, it's important to know which loss to use where. Here is a table describing what loss and activation functions to use for which type of network.
As you can see, it's a good practice to use softmax and categorical_crossentropy for multi-class problems. So let's try to recreate the model with this change. We want each output to have a different loss to minimize.
Also, let's say the first output is more important than the second. We can also tell the model how to weigh the losses so that it prioritizes which loss to focus on more and by how much.
import numpy as np
from tensorflow.keras import layers, Model, utils
def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr,state_h,state_c = layers.LSTM(latent_dim,dropout=0.1,return_state = True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('softmax')(fc_lyr) #<--- Softmax for the first output's activation
    model = Model(inputs, [soft_lyr,state_c])
    model.compile(optimizer='adam',
                  loss=['categorical_crossentropy','mse'], #<--- 2 losses, one for each output
                  loss_weights=[0.4, 0.6]) #<--- 2 loss weights for final loss
    return model
#Dummy data
X = np.random.random((100,15,5))
y1 = np.random.random((100,4))
y2 = np.random.random((100,7))
model =_get_model((15, 5), 7 , 4)
utils.plot_model(model, show_layer_names=False, show_shapes=True)
Here, the final loss (named simply loss) is the weighted sum of the 2 separate losses, with weights 0.4 and 0.6.
Hope this clarifies what you are trying to achieve.
ON A SIDE NOTE: I am curious how you are getting the actual values of the final cell state in order to train the model to predict a cell state. Do let me know if that is your intention. It's not very clear what your final goal is here (as I had also asked on your previous question).
I'm working on a LSTM model in Keras with the goal of next word prediction utilizing BERT word vectors as a part of my inputs for the model.
This is a multi-class categorical problem, and I've done some weird steps to simplify English into clusters of words using BERT and stop-words and k-means, and for my initial practice model I'm using 144 target categories. I plan to up that to about 1000 after working out some kinks.
Here's the architecture of my Keras model:
model = Sequential()
model.add(LSTM(32, input_shape=(SENTENCE_LENGTH, COM_WORDS), dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(COM_WORDS))
model.add(Activation('softmax'))
optimizer = Adam(lr=lr)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.fit(X, y, validation_split=0.05, batch_size=128, epochs=epochs)
My loss starts around 6 and goes down, which isn't unusual as far as I know. I then tried to incorporate class weights, since the model was over-predicting common words like 'the', which is expected. So I used this code to make the weights:
max_count = 0
for word in range(COM_WORDS):
    if Ys.count(word) > max_count:
        max_count = Ys.count(word)
class_weights = {}
for word in range(COM_WORDS):
    class_weights[word] = (max_count - Ys.count(word) + 1)
So my most common y-input would have a value of 1 in the dictionary, and an y-input that is only represented once would be weighted at the count of the most common y-input: around 1 million in this case. Then I added it to my fit() and restarted the model.
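The same weighting scheme can be sketched with collections.Counter, which counts every label in one pass instead of calling Ys.count() inside the loops. The tiny Ys list and the COM_WORDS value below are placeholders for illustration, not the real data.

```python
from collections import Counter

COM_WORDS = 4               # placeholder; the question uses 144
Ys = [0, 0, 0, 0, 1, 1, 2]  # placeholder labels; class 3 never occurs

counts = Counter(Ys)        # one pass over the labels
max_count = max(counts.values())

# Same formula as in the question: the most common class gets weight 1.
class_weights = {word: max_count - counts[word] + 1 for word in range(COM_WORDS)}
print(class_weights)
```

A Counter returns 0 for labels it has never seen, so a class that is absent from Ys ends up with weight max_count + 1, exactly as in the loop version.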
When I run my model with the weights, I get an insanely high loss (this is just a batch of 100,000 of all my inputs being run):
Epoch 1/3
950000/950000 [==============================] - 160s 168us/step - loss: 3014409.5359 - acc: 0.1261 - val_loss: 2808283.0898 - val_acc: 0.1604
The accuracy is fine though! Not too different than when I didn't use weights.
MY QUESTION(s):
Does this high loss matter? Is it just a reflection of my huge weight numbers, or is it indicating something sinister? Are loss numbers relative?
Side question: Should I use a better method to weight my inputs?
Thank you!
Intro and questions:
I'm trying to make a one-class classification convolutional neural network. By one-class I mean I have one image dataset containing about 200 images of Nicolas Cage. By one-class classification I mean: look at an image and predict 1 if Nicolas Cage is contained in the image, and predict 0 if Nicolas Cage is not contained in the image.
I'm definitely a machine learning/deep learning beginner, so I was hoping someone with more knowledge and experience could help guide me in the right direction. Here are my issues and questions right now. My network is performing terribly. I've tried making a few predictions with images of Nicolas Cage and it predicts 0 every single time.
Should I collect more data for this to work? I’m performing data augmentations with a small dataset of 207 images. I was hoping the data augmentations would help the network generalize but I think I was wrong
Should I try tweaking the amount of epochs, step per epoch, val steps, or the optimization algorithm I’m using for gradient descent? I’m using Adam but I was thinking maybe I should try stochastic gradient descent with different learning rates?
Should I add more convolution or dense layers to help my network better generalize and learn?
Should I just stop trying to do one class classification and go to normal binary classification because using a neural network with one class classification is not very feasible? I saw this post here one class classification with keras and it seems like the OP ended up using an isolation forest. So I guess I could try using some convolutional layers and feed into an isolation forest or an SVM? I could not find a lot of info or tutorials about people using isolation forests with one-class image classification.
Dataset:
Here is a screenshot of the dataset I collected using a package called google-images-download. It contains about 200 images of Nicolas Cage. I did two searches to download 500 images. After manually cleaning the images I was down to 200 quality pictures of Nic Cage.
The imports and model:
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Activation
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (200, 200, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Flatten())
classifier.add(Dense(units = 64, activation = 'relu'))
classifier.add(Dropout(0.5))
# output layer
classifier.add(Dense(1))
classifier.add(Activation('sigmoid'))
Compiling and image augmentation
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale = 1./255,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
training_set = train_datagen.flow_from_directory('/Users/ginja/Desktop/Code/Nic_Cage/Small_Dataset/train/',
target_size = (200, 200),
batch_size = 32,
class_mode = "binary")
test_set = test_datagen.flow_from_directory('/Users/ginja/Desktop/Code/Nic_Cage/Small_Dataset/test/',
target_size = (200, 200),
batch_size = 32,
class_mode = "binary")
Fitting the model
history = classifier.fit_generator(training_set,
steps_per_epoch = 1000,
epochs = 25,
validation_data = test_set,
validation_steps = 500)
Epoch 1/25
1000/1000 [==============================] - 1395s 1s/step - loss: 0.0012 - acc: 0.9994 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 2/25
1000/1000 [==============================] - 1350s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 3/25
1000/1000 [==============================] - 1398s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 4/25
1000/1000 [==============================] - 1342s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 5/25
1000/1000 [==============================] - 1327s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 6/25
1000/1000 [==============================] - 1329s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
.
.
.
The model looks like it converges to a loss value of 1.0000e-07 as this doesn't change for the rest of the epochs
[Figure: training and test accuracy]
[Figure: training and test loss]
Making the prediction
from keras.preprocessing import image
import numpy as np
test_image = image.load_img('/Users/ginja/Desktop/Code/Nic_Cage/nic_cage_predict_1.png', target_size = (200, 200))
#test_image.show()
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = classifier.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
    prediction = 'This is Nicolas Cage'
else:
    prediction = 'This is not Nicolas Cage'
print(prediction)
We get 'This is not Nicolas Cage' every single time for the prediction.
I appreciate anyone that takes the time to even read through this and I appreciate any help on any part of this.
If anyone finds this from google I figured it out. I did a couple of things:
I added a dataset of random images to my train and test folders, which basically added a "0" class. These images were labeled "not_nicolas". I downloaded the same number of images as I had in the first dataset, about 200, so I had 200 images of Nicolas Cage and 200 images of random stuff. The random pictures were generated from this link: https://picsum.photos/200/200/?random. I just used a Python script to fetch 200 images. Keep in mind that flow_from_directory reads the folders in alphanumeric order, so the first folder in the directory will be class "0". It took me way too long to figure that out.
import requests
path = "/Users/ginja/Desktop/Code/Nic_Cage/Random_images"
for i in range(200):
    url = "https://picsum.photos/200/200/?random"
    response = requests.get(url)
    # only save the file if the download succeeded
    if response.status_code == 200:
        file_name = 'not_nicolas_{}.jpg'.format(i)
        file_path = path + "/" + file_name
        with open(file_path, 'wb') as f:
            print("saving: " + file_name)
            f.write(response.content)
I changed the optimizer to Stochastic Gradient Descent instead of Adam.
I added shuffle = True as a parameter in the flow_from_directory to shuffle our images to allow our network to generalize better
I now have a training accuracy of 99% and a Test accuracy of 91% and I am able to predict images of Nicolas Cage successfully!
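The folder-ordering point above can be sketched without Keras. This assumes, as described in the answer, that class indices follow the alphanumeric order of the sub-folder names; the folder names and the 0.5 threshold are assumptions for illustration, and the authoritative mapping is whatever training_set.class_indices reports.

```python
def infer_class_indices(folder_names):
    """Assign class indices in alphanumeric order of the folder names."""
    return {name: index for index, name in enumerate(sorted(folder_names))}

def label_from_probability(prob, class_indices, threshold=0.5):
    """Turn a sigmoid output into a folder name by thresholding it."""
    index_to_name = {index: name for name, index in class_indices.items()}
    return index_to_name[1 if prob >= threshold else 0]

# Hypothetical folder names: 'nicolas_cage' sorts before 'not_nicolas',
# so here the Nicolas Cage images would be class 0, not class 1.
class_indices = infer_class_indices(['not_nicolas', 'nicolas_cage'])
print(class_indices)
print(label_from_probability(0.08, class_indices))
```

With this ordering, an output close to 0 means Nicolas Cage, so a check like result[0][0] == 1 would read the prediction the wrong way round. Comparing against a threshold and looking the index up in class_indices avoids both problems.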
Everyone leans towards a binary classification approach. This may be a solution but removes the fundamental design objective which may be to solve it with a one class classifier.
Depending on what you want to achieve with a one-class classifier it can be an ill-conditioned problem.
In my experience, your last point often applies.
As mentioned in https://arxiv.org/pdf/1801.05365.pdf:
In the classical multiple-class classification, features are learned with the objective of maximizing inter-class distances between classes and minimizing intra-class variances within classes [2]. However, in the absence of multiple classes such a discriminative approach is not possible.
It yields a trivial solution. The reason why is explained a bit later:
The reason why this approach ends up yielding a trivial solution is due to the absence of a regularizing term in the loss function that takes into account the discriminative ability of the network. For example, since all class labels are identical, a zero loss can be obtained by making all weights equal to zero. It is true that this is a valid solution in the closed world where only normal chair objects exist. But such a network has zero discriminative ability when abnormal chair objects appear.
Note that the description here is made with regard to attempting to use one-class classifiers to distinguish between different classes. Another useful objective of one-class classifiers is to detect anomalies in, e.g., factory operation signals. This is what I am currently working on. In such cases, knowledge of the various damage states is very hard to obtain. It would be ridiculous to break a machine just to see how it operates when broken so that a decent multinomial classifier can be made. One solution to the problem is described in the following: https://arxiv.org/abs/1912.12502. Note that in this paper, because of the stochastic similarity of the classes, discriminative capacity between classes is achieved as well.
I found that by following the guidelines described there and, especially, removing the last activation function, I got my one-class classifier working and the accuracy no longer came out as 0. Note that in your case you may also want to replace the binary cross-entropy loss, since it requires binary targets to make sense (use RMSE instead).
This method should also work for your case. In that case the network would be capable of determining which photos are numerically further away from the training photo class. In my experience however, it is likely still a hard problem to solve due to the variance contained in the pictures e.g. different background, angles, etc... To that end, the problem I am solving is much easier as there is much more similarity between operating conditions of the same condition stage. To put that into analogy, in my case the training class is more like the same picture with different noise levels and only slight movements of objects.
Treating your problem as supervised problem:
You are solving a face recognition problem. It is a binary classification problem if you want to distinguish between "Nicolas Cage" and any other random image. For binary classification you need a class with label 0, that is, a not-"Nicolas Cage" class.
A very famous example is the Hotdog-Not-Hotdog problem (from Silicon Valley).
These links might help you.
https://towardsdatascience.com/building-the-hotdog-not-hotdog-classifier-from-hbos-silicon-valley-c0cb2317711f
https://github.com/J-Yash/Hotdog-Not-Hotdog/blob/master/Hotdog_classifier_transfer_learning.ipynb
Treating your problem as Unsupervised problem:
Here you can represent each image as an embedding vector. Pass your Nicolas Cage images through a pre-trained FaceNet, which will give you face embeddings, and plot those embeddings to see the relation between the images.
https://paperswithcode.com/paper/facenet-a-unified-embedding-for-face
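Once embeddings are available, comparing them can be as simple as a cosine similarity. The sketch below does not call FaceNet; the three short vectors are made-up stand-ins for real face embeddings (which are typically much longer), and the helper name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: two images of the same face and one unrelated image.
cage_1 = [0.9, 0.1, 0.3]
cage_2 = [0.8, 0.2, 0.35]
other = [-0.2, 0.9, -0.4]

same_face = cosine_similarity(cage_1, cage_2)
different = cosine_similarity(cage_1, other)
print(same_face > different)
```

With real embeddings, images of the same person would be expected to score noticeably higher than unrelated images, and a similarity cut-off could then be chosen from the 200 known Nicolas Cage pictures.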
I have access to a dataframe of 100 persons and how they performed on a certain motion test. The frame contains about 25,000 rows per person, since each person's performance is recorded (approximately) every centisecond (10^-2 s). We want to use this data to predict a binary y-label, that is to say, whether someone has a motor problem or not.
The columns and some values of the dataset are as follows:
'Person_ID', 'time_in_game', 'python_time', 'permutation_game', 'round', 'level', 'times_level_played_before', 'speed', 'costheta', 'y_label', 'gender', 'age_precise', 'ax_f', 'ay_f', 'az_f', 'acc', 'jerk'
1, 0.25, 1.497942e+09, 2, 1, 'level_B', 1, 0.8, 0.4655, 1, [...]
I reduced the dataset to only 480 rows per person, by just using the row at each half of a second.
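As a sketch of that reduction, assuming one row per centisecond so that every 50th row is half a second apart (the helper name and the synthetic 24,000-row list are placeholders, not the real data):

```python
def downsample(rows, step=50):
    """Keep every `step`-th row; at one row per 0.01 s, step=50 means 0.5 s."""
    return rows[::step]

# Placeholder series: 24,000 centisecond samples, i.e. 240 seconds.
rows = list(range(24000))
reduced = downsample(rows)
print(len(reduced))
```

Simple slicing discards the 49 rows in between each kept row; averaging each half-second window would be an alternative if the dropped detail matters.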
Now I want to use a recurrent neural network to predict the binary y_label.
This code extracts the costheta feature used for the input data X and the y-label for output Y.
X = []
Y = []
for ID in person_list:
    person_frame = df.loc[df['Person_ID'] == ID]
    # costheta is a measurement of performance
    coslist = list(person_frame['costheta'])
    # extract y-label (taken from the person's first row)
    score = list(person_frame['y_label'].head(1))[0]
    X.append(coslist)
    Y.append(score)
I split the data into training and testing data using a 0.2 test split. Then I tried to create the RNN with Keras as follows:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
# different_input_values are the set of possible input values
model.add(Embedding(different_input_values, embedding_size, input_length=480))
model.add(LSTM(1000))
# output is binary
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
At last, I began training with this code:
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
batch_size = 64
num_epochs = 100
X_valid, y_valid = X_train[:batch_size], Y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], Y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)
However, the resulting accuracy is really low. Depending on the batch size, it varies between 0.4 and 0.6.
12/12 [==============================] - 13s 1s/step - loss: 0.6921 - acc: 0.7500 - val_loss: 0.7069 - val_acc: 0.4219
My question is, in general, with complicated data like this, how does one efficiently train an RNN? Should one refrain from reducing the data to 480 rows per person and keep it at around 25,000 rows per person? Could multiple metrics, such as acc (acceleration in game) and jerk, lead to a significant accuracy gain? What significant improvements could one consider?
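On the question of using several measurements at once: a Keras LSTM takes input of shape (samples, timesteps, features), so additional per-timestep columns such as acc and jerk would become extra features rather than extra rows. A minimal sketch with NumPy, using random placeholder values in place of the real columns:

```python
import numpy as np

n_persons, n_steps = 100, 480
rng = np.random.default_rng(0)

# Placeholder arrays, one value per person and time step for each measurement.
costheta = rng.random((n_persons, n_steps))
acc = rng.random((n_persons, n_steps))
jerk = rng.random((n_persons, n_steps))

# Stack along a new last axis: (persons, timesteps, features).
X = np.stack([costheta, acc, jerk], axis=-1)
print(X.shape)
```

An array like this could be fed to an LSTM layer with input_shape=(480, 3) directly, without an Embedding layer, since the values are continuous measurements rather than integer token ids.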