Intro and questions:
I'm trying to make a one-class classification convolutional neural network. By one-class I mean I have one image dataset containing about 200 images of Nicolas Cage. By one-class classification I mean: look at an image and predict 1 if Nicolas Cage is contained in the image, and predict 0 if Nicolas Cage is not contained in the image.
I'm definitely a machine learning/deep learning beginner, so I was hoping someone with more knowledge and experience could help guide me in the right direction. Here are my issues and questions right now. My network is performing terribly: I've tried making a few predictions with images of Nicolas Cage and it predicts 0 every single time.
Should I collect more data for this to work? I'm performing data augmentations on a small dataset of 207 images. I was hoping the augmentations would help the network generalize, but I think I was wrong.
Should I try tweaking the number of epochs, steps per epoch, validation steps, or the optimization algorithm I'm using for gradient descent? I'm using Adam, but I was thinking maybe I should try stochastic gradient descent with different learning rates?
Should I add more convolution or dense layers to help my network better generalize and learn?
Should I just stop trying to do one-class classification and switch to normal binary classification, because one-class classification with a neural network is not very feasible? I saw this post, "one class classification with keras", and it seems like the OP ended up using an isolation forest. So I guess I could try using some convolutional layers and feeding the extracted features into an isolation forest or an SVM? I could not find much info or many tutorials about using isolation forests for one-class image classification.
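For reference, a minimal sketch of that hybrid idea: a pretrained VGG16 used as a fixed feature extractor feeding scikit-learn's IsolationForest. The `images` array is an assumption (something you would load yourself), and this is a sketch, not a tested solution:

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.ensemble import IsolationForest

# Pretrained conv base as a fixed feature extractor (no dense head).
extractor = VGG16(weights='imagenet', include_top=False, pooling='avg',
                  input_shape=(200, 200, 3))

# `images` is an assumed numpy array of shape (n, 200, 200, 3), pixels in 0-255.
features = extractor.predict(preprocess_input(images.astype('float32')))

# Fit the isolation forest on features of the single known class.
forest = IsolationForest(contamination=0.1, random_state=42)
forest.fit(features)

# predict() returns 1 for inliers (looks like the training class), -1 for outliers.
print(forest.predict(features[:5]))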
Dataset:
Here is a screenshot of what my dataset looks like. I collected it using a package called google-images-download. It contains about 200 images of Nicolas Cage. I did two searches to download 500 images; after manually cleaning them I was down to 200 quality pictures of Nic Cage.
[Screenshot: dataset]
The imports and model:
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Activation
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (200, 200, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Flatten())
classifier.add(Dense(units = 64, activation = 'relu'))
classifier.add(Dropout(0.5))
# output layer
classifier.add(Dense(1))
classifier.add(Activation('sigmoid'))
Compiling and image augmentation
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale = 1./255,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
training_set = train_datagen.flow_from_directory('/Users/ginja/Desktop/Code/Nic_Cage/Small_Dataset/train/',
target_size = (200, 200),
batch_size = 32,
class_mode = "binary")
test_set = test_datagen.flow_from_directory('/Users/ginja/Desktop/Code/Nic_Cage/Small_Dataset/test/',
target_size = (200, 200),
batch_size = 32,
class_mode = "binary")
Fitting the model
history = classifier.fit_generator(training_set,
steps_per_epoch = 1000,
epochs = 25,
validation_data = test_set,
validation_steps = 500)
Epoch 1/25
1000/1000 [==============================] - 1395s 1s/step - loss: 0.0012 - acc: 0.9994 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 2/25
1000/1000 [==============================] - 1350s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 3/25
1000/1000 [==============================] - 1398s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 4/25
1000/1000 [==============================] - 1342s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 5/25
1000/1000 [==============================] - 1327s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
Epoch 6/25
1000/1000 [==============================] - 1329s 1s/step - loss: 1.0000e-07 - acc: 1.0000 - val_loss: 1.0000e-07 - val_acc: 1.0000
...
The model looks like it converges to a loss value of 1.0000e-07, as this doesn't change for the rest of the epochs.
Training and test accuracy plotted: [accuracy plot]
Training and test loss plotted: [loss plot]
Making the prediction
from keras.preprocessing import image
import numpy as np
test_image = image.load_img('/Users/ginja/Desktop/Code/Nic_Cage/nic_cage_predict_1.png', target_size = (200, 200))
#test_image.show()
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = classifier.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
    prediction = 'This is Nicolas Cage'
else:
    prediction = 'This is not Nicolas Cage'
print(prediction)
We get 'This is not Nicolas Cage' every single time for the prediction.
I appreciate anyone that takes the time to even read through this and I appreciate any help on any part of this.
If anyone finds this from Google: I figured it out. I did a couple of things:
I added a dataset of random images to my train and test folders, essentially adding a "0" class. These images were labeled "not_nicolas". I downloaded the same number of images as in the first dataset, about 200, so I had 200 images of Nicolas Cage and 200 images of random stuff. The random pictures were generated from https://picsum.photos/200/200/?random using a small Python script. Make sure that flow_from_directory reads the folders in alphanumeric order, so the first folder in the directory becomes class "0". That took me way too long to figure out.
import requests

path = "/Users/ginja/Desktop/Code/Nic_Cage/Random_images"

for i in range(200):
    url = "https://picsum.photos/200/200/?random"
    response = requests.get(url)
    if response.status_code == 200:
        file_name = 'not_nicolas_{}.jpg'.format(i)
        file_path = path + "/" + file_name
        with open(file_path, 'wb') as f:
            print("saving: " + file_name)
            f.write(response.content)
I changed the optimizer to stochastic gradient descent instead of Adam.
I passed shuffle = True to flow_from_directory to shuffle the images, which helps the network generalize better.
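For reference, a sketch of what the resulting two-class setup looks like (folder names are illustrative; flow_from_directory assigns labels to folders in alphanumeric order):

train/
    nicolas/        <- ~200 Nic Cage images
    not_nicolas/    <- ~200 random picsum images

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2,
                                   zoom_range=0.2, horizontal_flip=True)
training_set = train_datagen.flow_from_directory('train/',
                                                 target_size=(200, 200),
                                                 batch_size=32,
                                                 class_mode='binary',
                                                 shuffle=True)  # reshuffle every epoch
print(training_set.class_indices)  # {'nicolas': 0, 'not_nicolas': 1}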
I now have a training accuracy of 99% and a test accuracy of 91%, and I am able to predict images of Nicolas Cage successfully!
Everyone leans towards a binary classification approach. That may be a solution, but it removes the fundamental design objective, which may be to solve it with a one-class classifier.
Depending on what you want to achieve with a one-class classifier, it can be an ill-conditioned problem.
In my experience, your last point often applies.
As mentioned in https://arxiv.org/pdf/1801.05365.pdf:
In the classical multiple-class classification, features are learned with the objective of maximizing inter-class distances between classes and minimizing intra-class variances within classes [2]. However, in the absence of multiple classes such a discriminative approach is not possible.
It yields a trivial solution. The reason why is explained a bit later:
The reason why this approach ends up yielding a trivial solution is due to the absence of a regularizing term in the loss function that takes into account the discriminative ability of the network. For example, since all class labels are identical, a zero loss can be obtained by making all weights equal to zero. It is true that this is a valid solution in the closed world where only normal chair objects exist. But such a network has zero discriminative ability when abnormal chair objects appear.
Note that this description concerns attempts to use one-class classifiers to solve for different classes. Another useful objective of one-class classifiers is to detect anomalies in, e.g., factory operation signals, which is what I am currently working on. In such cases, knowledge of the various damage states is very hard to obtain; it would be ridiculous to break a machine just to see how it operates when broken, so that a decent multinomial classifier could be made. One solution to the problem is described in the following: https://arxiv.org/abs/1912.12502. Note that in this paper, because of the stochastic similarity of the classes, discriminative capacity across classes is achieved as well.
I found that by following the described guidelines and, especially, removing the last activation function, I got my one-class classifier working, and the accuracy did not give 0 values. Note that in your case you may also want to drop binary cross-entropy, since that loss requires binary targets to make sense (use RMSE instead).
This method should also work for your case; the network would then be capable of determining which photos are numerically further away from the training photo class. In my experience, however, it is likely still a hard problem to solve, due to the variance contained in the pictures, e.g. different backgrounds, angles, etc. To that end, the problem I am solving is much easier, as there is much more similarity between operating conditions of the same condition stage. To put that into an analogy: in my case the training class is more like the same picture with different noise levels and only slight movements of objects.
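A minimal sketch of that change, assuming a generic Keras model rather than the exact architecture from the paper: drop the final sigmoid so the output is linear, and swap binary cross-entropy for an RMSE-style loss (which Keras does not ship built in):

import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense

def rmse(y_true, y_pred):
    # Root mean squared error; Keras only provides 'mse' out of the box.
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

# Hypothetical head: the input dimension (128) is just a placeholder.
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(128,)))
model.add(Dense(1))  # no final activation, i.e. a linear output

model.compile(optimizer='adam', loss=rmse)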
Treating your problem as a supervised problem:
You are solving a face recognition problem, and it is a binary classification problem if you want to distinguish between "Nicolas Cage" and any other random image. For binary classification you need a class with the 0 label, i.e. a "not Nicolas Cage" class.
A very famous example is the Hotdog/Not-Hotdog problem (Silicon Valley).
These links might help you.
https://towardsdatascience.com/building-the-hotdog-not-hotdog-classifier-from-hbos-silicon-valley-c0cb2317711f
https://github.com/J-Yash/Hotdog-Not-Hotdog/blob/master/Hotdog_classifier_transfer_learning.ipynb
Treating your problem as an unsupervised problem:
Here you can represent each image as an embedding vector. Pass your Nicolas Cage images through a pre-trained FaceNet, which will give you face embeddings, and plot those embeddings to see the relationships between the images.
https://paperswithcode.com/paper/facenet-a-unified-embedding-for-face
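For example, a rough sketch using the third-party keras-facenet package (assumed installed; `faces` is an assumed numpy array of face crops you would prepare yourself):

import numpy as np
from keras_facenet import FaceNet

embedder = FaceNet()
# `faces` is assumed: a numpy array of face crops, shape (n, 160, 160, 3).
embeddings = embedder.embeddings(faces)

# Cosine distance between two faces: small means likely the same person.
a, b = embeddings[0], embeddings[1]
print(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))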
Related
I am trying to train a neural network, supervised, with a big set of data (let's say over 1 million samples).
The NN should solve a regression problem: it takes 4 numerical input values and predicts one numerical value. The values are scaled between 0 and 1.
Each sample in the training set looks like:
Input-set->
[
[0.47860402, 0.31794003, 0.00013333333, 0.00026666667],
[0.47860357, 0.31794018, 0.00013333333, 0.00026666667],
…
[0.47859928, 0.317943, 0.00013333333, 0.00026666667]
]
Output-Set ->
[0.657721
0.65772104
0.6577211
...
0.69796
0.69796
0.69796 ]
As you can see, there are really small changes between the data samples, e.g. in the 6th decimal place.
I am using the following model:
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(50, input_shape=(4,), activation="relu", kernel_initializer='he_uniform'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(25, activation="relu", kernel_initializer='he_uniform'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(1, activation="relu"),
])
opt = keras.optimizers.Adam(lr=0.005)
model.compile(loss='mse', optimizer=opt)
I am struggling to find a good model for this problem; for some predictions I just get a straight line instead of a curve.
My question is: am I using too big a learning rate? Should I use a learning rate like 0.00001 because of the small changes? Or is there another problem? I am not experienced in machine learning, so I hope an expert here can give me some ideas :)
The obvious problem is that you are using relu in the last dense layer. You need to make your last dense layer a linear layer; in other words, if you are doing regression, you don't need a non-linearity like relu in the last dense layer.
Also, you really don't need regularization techniques unless you are over-fitting, but I did not remove them here.
I changed the model to:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(4,), activation="relu", kernel_initializer='he_uniform'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(25, activation="relu", kernel_initializer='he_uniform'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1),  # linear output for regression
])
opt = tf.keras.optimizers.RMSprop()  # you can also use Adam
model.compile(loss='mse', optimizer=opt)
Some epochs:
Epoch 39/100
32/32 [==============================] - 0s 5ms/step - loss: 0.0812 - val_loss: 0.0888
Epoch 40/100
32/32 [==============================] - 0s 5ms/step - loss: 0.0833 - val_loss: 0.0864
Predictions:
y_hat = model.predict(X_test)
y_hat[:10]
array([[0.47433853],
[0.4804499 ],
[0.4659953 ],
[0.4893798 ],
[0.38975602],
[0.53456545],
[0.5105466 ],
[0.45408142],
[0.4651251 ],
[0.5104909 ]], dtype=float32)
My model is like this:
from tensorflow.keras.layers import Input, LSTM, Dense, Activation
from tensorflow.keras.models import Model

def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model
model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)
history = model.fit(X_train,Y_train)
during training I get:
Epoch 1/2000
1/1 [==============================] - 1s 698ms/step - loss: 0.2338 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1185 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2341 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1181 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 2/2000
1/1 [==============================] - 0s 34ms/step - loss: 0.2328 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1175 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2329 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1169 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 3/2000
1/1 [==============================] - 0s 38ms/step - loss: 0.2316 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1163 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2315 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1155 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
when I inspect the history:
print(history.history.keys())
dict_keys(['loss', 'activation_26_loss', 'lstm_151_loss', 'activation_26_accuracy', 'lstm_151_accuracy', 'val_loss', 'val_activation_26_loss', 'val_lstm_151_loss', 'val_activation_26_accuracy', 'val_lstm_151_accuracy'])
Which ones are the training loss and training accuracy?
Since there are only 2 outputs, why are there 3 losses (loss, activation_26_loss and lstm_151_loss) but only 2 accuracies (activation_26_accuracy and lstm_151_accuracy)? What does each loss and each accuracy stand for?
TLDR;
Three losses (2+1): one loss for each individual output, plus a combined loss that is their weighted sum (with default weights they are simply added, as your own logs show: 0.1153 + 0.1185 ≈ 0.2338). You can set both the losses and their weights explicitly.
Two accuracies, since there are 2 outputs. Metrics are just for the user to view and don't affect the neural network's training.
Detailed explanation:
Let's first try to see what you are doing here. (I am referring to the previous question you asked, to get the shapes for the inputs.)
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])  # <------- one input, 2 outputs
    model.compile(optimizer='adam', loss='mse')
    return model

# Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))

model = _get_model((15, 5), 7, 4)
You are building a supervised model that takes an input of shape (15,5) and outputs 2 things: first a (4,) vector that should contain the probability values for the 4 classes, and second a (7,) vector that should contain the cell state of the 7 LSTM units. The loss you are using to train the model to predict both outputs is mse.
Since this is a supervised model, you have to provide the model with samples of inputs and outputs. If you have 100 samples, your inputs would be shaped (100,15,5) and your outputs (100,4) and (100,7), since you have 2 outputs.
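Concretely, with the dummy data above, the fit call would pass one target per output, in the order the outputs were defined (a quick sketch):

# One target per output, matching [soft_lyr, state_c]: (100,4) first, (100,7) second.
history = model.fit(X, [y1, y2], epochs=2, batch_size=16, validation_split=0.2)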
Loss(y_actual, y_pred) is a function that tells the neural network how far its prediction is from the actual value. Based on this, the network updates itself (specifically its weights, via backpropagation) so that its predictions get closer and closer to the actual values, thus reducing the loss.
If the above points are clear, let's look at what this network is doing specifically.
Your current model has one input and 2 outputs.
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
Since you have defined mse as the loss, both outputs are trying to minimize mse. These are 2 of the 3 losses: activation_26_loss, which is the loss for the final Dense layer, and lstm_151_loss, which is the loss on the LSTM cell state. Keras just gives these layers random number-suffixed names unless you name them properly.
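As an aside, a quick sketch (with illustrative names) of naming those layers so the keys become readable:

def _get_named_model(input_shape, latent_dim, num_classes):
    # Naming the LSTM and the final activation makes the keys readable:
    # class_out_loss / encoder_loss instead of activation_26_loss / lstm_151_loss.
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1,
                                             return_state=True, name='encoder')(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu', name='class_out')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam', loss='mse')
    return model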
The loss mentioned is basically the weighted sum of the other 2 losses (by default they are simply summed). I'll talk about this more later.
metrics=['accuracy'] is just for users to track. Since there are 2 outputs, you get 2 different accuracy metrics, one per output. They don't affect the neural network's training.
Now, when working with neural networks, it's important to know which loss to use where. [Table image: which loss and activation functions to use for each type of network.]
As you can see, it's a good practice to use softmax and categorical_crossentropy for multi-class problems. So let's try to recreate the model with this change. We want each output to have a different loss to minimize.
Also, let's say the first output is more important than the second. We can also tell the model how to weigh the losses so that it prioritizes which loss to focus on more and by how much.
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('softmax')(fc_lyr)  # <--- softmax activation for the first output
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam',
                  loss=['categorical_crossentropy', 'mse'],  # <--- 2 losses, one per output
                  loss_weights=[0.4, 0.6])                   # <--- 2 loss weights for the final loss
    return model

# Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))

model = _get_model((15, 5), 7, 4)
utils.plot_model(model, show_layer_names=False, show_shapes=True)
Here, the final loss (named simply loss) is the combination of the 2 separate losses, weighted by 0.4 and 0.6 respectively.
Hope this clarifies what you are trying to achieve.
ON A SIDE NOTE: I am curious how you are getting the actual values for the final cell state, to train the model to predict a cell state. Do let me know if that is your intention; it's not very clear what your final goal here is (something I had asked on your previous question as well).
This is a model I've been using. It takes a pretrained InceptionV3 model and adds some fully connected layers on top of it. The whole thing is made trainable (including the pretrained InceptionV3 layers).
import tensorflow as tf
from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.models import Model
from keras.optimizers import Adam
from keras.utils import multi_gpu_model
from keras import regularizers

with tf.device('/cpu:0'):
    pretrained_model = InceptionV3(weights='imagenet', include_top=False)
    x = pretrained_model.output
    x = GlobalAveragePooling2D(name='gap_final')(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    preds = Dense(len(config.classes), activation='softmax')(x)
    model = Model(inputs=pretrained_model.input, outputs=preds)

parallel_model = multi_gpu_model(model, gpus=16)
parallel_model.compile(optimizer=Adam(lr=0.0005), loss='categorical_crossentropy', metrics=['accuracy'])
I've tried training it with different image augmentation configurations, and no matter what I do the results are always similar to below:
Epoch 1/20
181/181 [====] - 1372s 8s/step - loss: 19.2332 - acc: 0.3330 - val_loss: 8.7765 - val_acc: 0.4747
Epoch 2/20
181/181 [====] - 1379s 8s/step - loss: 4.9885 - acc: 0.5474 - val_loss: 3.5256 - val_acc: 0.4084
Epoch 3/20
181/181 [====] - 1354s 7s/step - loss: 2.0334 - acc: 0.6469 - val_loss: 2.5382 - val_acc: 0.4275
Epoch 4/20
181/181 [====] - 1361s 8s/step - loss: 1.3522 - acc: 0.7117 - val_loss: 2.2028 - val_acc: 0.4741
Epoch 5/20
181/181 [====] - 1356s 7s/step - loss: 1.0838 - acc: 0.7599 - val_loss: 2.3402 - val_acc: 0.4738
From this point on (epoch 5/20), if I let the model train forever, the training loss/accuracy keeps improving while the validation loss/accuracy stagnates at these values.
This is a classification problem with 28 different classes, so a validation accuracy of 0.47 is not that bad given that randomness would give an accuracy of 0.035. However, I don't understand how the training set can be so perfectly fitted while the validation set leaves that much to be desired.
The total dataset is made of 32,000 pretty well-labeled images, all in the same configuration (think facial classification problem). Training takes roughly 27,000 of them and augments them by horizontal flipping and greyscaling (giving a total of 93,000 training images), while validation images are not augmented. From a visual perspective, training and validation images look very similar, and I notice no striking difference between the two sets (before augmenting the training set, obviously).
Classes are slightly unbalanced, but not that much: the biggest class has 2,600 images and the smallest has 610 (the class size distribution is linear between these two extremes).
Note a few things that I've tried that don't alter the results:
dropouts: little impact if I play around with the dropout rates
regularization: using L1 or L2 with different values don't change results much
batch normalisation: with or without, same thing
number of fully-connected layers: one, two or even three (like above), little difference
type of pretrained network: I've tried using VGG16 with similar results
No matter what I do, training metrics always improve significantly, while validation stagnates.
Is it only a problem of "getting more data in", with 32,000 images just being "not enough" for 28 classes, especially for the currently smaller classes (e.g. the one with 610 images), or am I doing something wrong? Should I use a smaller learning rate, although the one currently used is already fairly small?
Is it wrong to augment images from the training set and not from the validation set? I've read that it's standard practice, and it also seems to make sense to be doing so...
Lastly, should I limit the layers being trainable? E.g. should I make only the last 10 or 20 layers trainable instead of the full InceptionV3 network? Although choosing the trainable layers is straightforward for a VGGxx model (it's purely sequential), it seems a bit trickier for Inception. Any recommendation regarding this would be welcome.
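For what it's worth, freezing all but the top of the network would look something like the sketch below, against the model defined above (the cut-off of 20 layers is arbitrary and would need tuning):

# Freeze everything except the last 20 layers, then recompile so the
# trainability change takes effect.
for layer in model.layers[:-20]:
    layer.trainable = False
for layer in model.layers[-20:]:
    layer.trainable = True

parallel_model.compile(optimizer=Adam(lr=0.0005),
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])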
After having tried several models and taken a more thorough look at the data, it seems that the labels are not as clean as I thought, and there is a lot of porosity between the 28 different classes.
Every time the model makes a "wrong" prediction on test data, a careful inspection of the picture makes it apparent that the model was "somehow right" and the labeling was questionable. E.g. think of a face smiling and frowning at the same time: the model could say "happy" or "unhappy" with equal legitimacy, and the "ground truth" labeling would be pretty arbitrary.
So it seems that 45-ish percent accuracy on the validation set is near the top of what any model (or any human) could reach given these porous classes.
The ability of InceptionV3 to reach 85% training accuracy on tens of thousands of images after a single epoch says something about its power to find specific patterns that a human couldn't. As this example indicates, that ability must be balanced with equally strong regularization.
It also means that given a high-quality dataset with little porosity between labels, InceptionV3 should be able to give good results very quickly, e.g. compared to VGG16.
I have recently attempted to build a neural network to predict fluctuations in the prices of individual stocks on the stock market, utilising Keras as the framework for the network and Quandl to retrieve the historic adjusted stock prices. In writing this program I primarily followed the paradigm and information presented in a single tutorial, linked below:
https://www.youtube.com/watch?v=EYnC4ACIt2g&t=2079s
However, the tutorial utilised the sklearn LinearRegression module; I altered the program to utilise Keras, which possesses a greater capability for customisation. The program is displayed below:
import tensorflow as tf
import keras
import numpy as np
import quandl
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
df = quandl.get("WIKI/FB")
df = df[['Adj. Close']]
forecast_out = 1
df['Prediction'] = df[['Adj. Close']].shift(-(forecast_out))
X = np.array(df.drop(['Prediction'], 1))
X = X[:-forecast_out]
y = np.array(df['Prediction'])
y = y[:-forecast_out]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
model = keras.models.Sequential()
model.add(keras.layers.Dense(units = 64, activation = 'relu'))
model.add(keras.layers.Dense(units = 1, activation = 'linear'))
model.compile(loss='mean_absolute_error',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split = 0.2)
x_forecast = np.array(df.drop(['Prediction'], 1))[-forecast_out:]
print(x_forecast)
prediction = model.predict(x_train)
However, upon running the model with the provided testing information via the model.fit() command, I received this display of the loss and accuracy for each epoch:
Train on 940 samples, validate on 236 samples
Epoch 1/5
940/940 [==============================] - 1s 831us/step - loss: 85.4464 - acc: 0.0000e+00 - val_loss: 76.7483 - val_acc: 0.0000e+00
Epoch 2/5
940/940 [==============================] - 0s 51us/step - loss: 65.6871 - acc: 0.0000e+00 - val_loss: 55.4325 - val_acc: 0.0000e+00
Epoch 3/5
940/940 [==============================] - 0s 52us/step - loss: 43.3484 - acc: 0.0000e+00 - val_loss: 30.5538 - val_acc: 0.0000e+00
Epoch 4/5
940/940 [==============================] - 0s 47us/step - loss: 16.5076 - acc: 0.0011 - val_loss: 1.3096 - val_acc: 0.0042
Epoch 5/5
940/940 [==============================] - 0s 47us/step - loss: 2.0529 - acc: 0.0043 - val_loss: 1.1567 - val_acc: 0.0000e+00
<keras.callbacks.History at 0x7ff1dfa19470>
Given my relatively small amount of experience in testing such paradigms, I would like to know whether this accuracy is satisfactory: are the loss and accuracy values indicating that the model is running well? What is the difference between them, and how should one read them? Lastly, how does Keras characterise them? The documentation for the module did not appear to provide sufficient information; however, that may be down to my reading of it. Thank you for your assistance.
You might get better answers about neural networks/ML on CrossValidated, but I can try to help you here.
In general, it is very very difficult to tell if a neural network is running "properly" — hence in my experience, ML development is a very iterative process with trial and error informed by educated statistical/mathematical guesses.
Let's do a high-level overview of the metrics first:
Loss = how far "off" the model's prediction is from your data.
Accuracy = the % of predictions that your model got exactly "right", i.e. where model(x) = y for a particular data point.
Satisfactory "accuracy" is subjective and varies widely by application, model, and data. However, since you are trying to predict stock price, i.e. a continuous variable, you are doing regression, and it doesn't make much sense to use a metric like accuracy. I can tell you are doing regression both from your problem formulation and from the linear activation, which is a strong hint as well.
To see why accuracy doesn't make sense: if I'm predicting house prices based on certain factors, I probably don't care how many predictions I got exactly right, but rather how close my predictions were overall. If my regression model is $1 off on every house price, I still have an accuracy of 0, but I could potentially still have a good model.
Instead, minimizing the loss function is probably a better way to think about it. Think of it this way: overall, you want to fit some function of your input variables that is "close" to the ground-truth output. For linear regression the loss function is least mean squares, essentially the average squared distance of the residuals. Here you use mean absolute error, which is just the average absolute value of the differences. Both loss functions have pros and cons, and I'd encourage you to look into this for your application.
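A tiny numeric illustration of the difference, with one outlier among the residuals (squared-error loss punishes the large error much more than absolute-error loss does):

import numpy as np

residuals = np.array([0.1, 0.1, 0.1, 3.0])   # one large error
print(np.mean(residuals ** 2))               # mean squared error: 2.2575
print(np.mean(np.abs(residuals)))            # mean absolute error: 0.825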
The fact that your error is decreasing is good: it means the function your model approximates is getting closer and closer to the training data (the residuals are decreasing). Your validation loss is also no greater than your training loss, which indicates that you are not overfitting either. I would encourage you to simply keep experimenting.
Currently I am trying to implement a capsule network using Xifeng Guo's Keras code for capsule nets. I have a dataset of brain tumor images with 98 negatively labeled instances and 155 positively labeled instances, and I would like to use the capsule net to predict positive or negative for a brain tumor on each image. Unfortunately I cannot figure out why it does not go beyond a fixed accuracy/loss. I have attempted data augmentation to increase the dataset size, with a 50/50 prediction as a result.
I have read the paper 'Capsule Networks against Medical Imaging Data Challenges', where they did a capsule net implementation on, amongst others, the DIARETDB1 dataset, which comprises only 89 images, and it gets decent predictions even without data augmentation (0.887 F1 score on imbalanced scenario 1). This makes me believe something may be going wrong in the network. FYI: my images are normalized and cropped.
Any input is appreciated!
%pylab inline
import os
import numpy as np
import tensorflow as tf
import keras
import keras.backend as K
from capsulelayers import CapsuleLayer, PrimaryCap, Length, Mask
from keras import layers, models, optimizers
from keras.applications import vgg16
from keras.layers import Conv2D, MaxPooling2D
K.set_image_data_format('channels_last')
def CapsNet(input_shape, n_class, routings):
    x = layers.Input(shape=input_shape)

    # Layer 1: Just a conventional Conv2D layer
    conv1 = Conv2D(filters=256, kernel_size=9, strides=1, padding='valid', activation='relu', name='conv1')(x)

    # Layer 2: Conv2D layer with `squash` activation, then reshape to [None, num_capsule, dim_capsule]
    primarycaps = PrimaryCap(conv1, dim_capsule=8, n_channels=32, kernel_size=9, strides=2, padding='valid')

    # Layer 3: Capsule layer. Routing algorithm works here.
    digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
                             name='digitcaps')(primarycaps)

    # Layer 4: An auxiliary layer that replaces each capsule with its length, just to match the true label's shape.
    # If using tensorflow, this will not be necessary. :)
    out_caps = Length(name='capsnet')(digitcaps)  # CAN WE EXCLUDE THIS IN KERAS TOO?

    # Decoder network.
    y = layers.Input(shape=(n_class,))
    masked_by_y = Mask()([digitcaps, y])  # The true label is used to mask the output of the capsule layer. For training
    masked = Mask()(digitcaps)  # Mask using the capsule with maximal length. For prediction

    # Shared decoder model in training and prediction
    decoder = models.Sequential(name='decoder')
    decoder.add(layers.Dense(512, activation='relu', input_dim=16 * n_class))
    decoder.add(layers.Dense(1024, activation='relu'))
    decoder.add(layers.Dense(np.prod(input_shape), activation='sigmoid'))
    decoder.add(layers.Reshape(target_shape=input_shape, name='out_recon'))

    # Models for training and evaluation (prediction)
    train_model = models.Model([x, y], [out_caps, decoder(masked_by_y)])
    eval_model = models.Model(x, [out_caps, decoder(masked)])

    # Manipulate model
    noise = layers.Input(shape=(n_class, 16))
    noised_digitcaps = layers.Add()([digitcaps, noise])
    masked_noised_y = Mask()([noised_digitcaps, y])
    manipulate_model = models.Model([x, y, noise], decoder(masked_noised_y))

    return train_model, eval_model, manipulate_model

def margin_loss(y_true, y_pred):
    """
    Margin loss for Eq. (4). Should also work when y_true[i, :] contains more than one `1` (not tested).
    :param y_true: [None, n_classes]
    :param y_pred: [None, num_capsule]
    :return: a scalar loss value.
    """
    L = y_true * K.square(K.maximum(0., 0.9 - y_pred)) + \
        0.5 * (1 - y_true) * K.square(K.maximum(0., y_pred - 0.1))
    return K.mean(K.sum(L, 1))

model, eval_model, manipulate_model = CapsNet(input_shape=x_train.shape[1:],
                                              n_class=1,
                                              routings=2)

# Compile the model
model.compile(optimizer=optimizers.Adam(lr=3e-3),
              loss=[margin_loss, 'mse'],
              metrics={'capsnet': 'accuracy'})
model.summary()

history = model.fit(
    [x_train, y_train], [y_train, x_train],
    batch_size=16,
    epochs=30,
    validation_data=([x_val, y_val], [y_val, x_val]),
    shuffle=True)
The result is plenty of epochs where neither the accuracy nor the loss really changes:
Epoch 1/30
161/161 [==============================] - 12s 77ms/step - loss: 0.2700 - capsnet_loss: 0.1911 - decoder_loss: 0.0789 - capsnet_acc: 0.5901 - val_loss: 0.2153 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0565 - val_capsnet_acc: 0.6078
Epoch 2/30
161/161 [==============================] - 9s 56ms/step - loss: 0.2046 - capsnet_loss: 0.1560 - decoder_loss: 0.0486 - capsnet_acc: 0.6149 - val_loss: 0.2015 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0427 - val_capsnet_acc: 0.6078
Epoch 3/30
161/161 [==============================] - 9s 56ms/step - loss: 0.1960 - capsnet_loss: 0.1560 - decoder_loss: 0.0401 - capsnet_acc: 0.6149 - val_loss: 0.1982 - val_capsnet_loss: 0.1588 - val_decoder_loss: 0.0394 - val_capsnet_acc: 0.6078
There exist two vector transformation procedures to obtain capsules from convolutions, namely matrix vector transformation and convolutional vector transformation. Since you have a small amount of data, the convolutional vector transformation is the better choice in this case.
I also advise you to introduce a batch normalization layer right after the first convolutional layer and see what that gives.
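A sketch of that change against the CapsNet definition in the question:

# BatchNormalization right after the first convolutional layer.
from keras.layers import BatchNormalization

conv1 = Conv2D(filters=256, kernel_size=9, strides=1, padding='valid',
               activation='relu', name='conv1')(x)
conv1 = BatchNormalization()(conv1)
primarycaps = PrimaryCap(conv1, dim_capsule=8, n_channels=32,
                         kernel_size=9, strides=2, padding='valid')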
I had the same problem training the capsule network on some datasets: the training process did not converge.
I accidentally reduced Adam's learning rate from its default of 0.001 down to 0.000001, and the problem was solved.
So, I think this parameter plays an important role here.
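Applied to the compile call from the question, that change would just be (a sketch):

model.compile(optimizer=optimizers.Adam(lr=1e-6),   # down from the 1e-3 default
              loss=[margin_loss, 'mse'],
              metrics={'capsnet': 'accuracy'})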