I am trying to process the data obtained from YOLO v5, which is an array of 4 values (posX, posY, SizX, Sizy) per object detected. Now, I know several detections are related, and I want a neural network to find this relation. For every array input, it should return as output a 2x4 matrix, or, flattened, an array of size 8. I am training 4017 samples using Keras Sequential model:
model = Sequential()
model.add(layers.Dense(256, activation="relu", name="layer1"))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(layers.Dense(592, activation="relu", name="layer2"))
model.add(Dense(8))
model.add(Activation('softmax'))
sgd = tf.keras.optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy',
optimizer=sgd,
metrics=['accuracy'])
hist=model.fit(X, y, batch_size=48, epochs=50, validation_split=0.2)
But the results I am getting are not good:
Epoch 20/20
80/80 [==============================] - 1s 10ms/step - loss: 0.5413 - accuracy: 0.9963 - val_loss: 0.5414 - val_accuracy: 0.9937
Where the predictions are:
Input: [0.50070833 0.50070833 0.42683333 0.22983333]
Expected Output: [[0.591 0.50070833 0.04514583 0.25035417]
[0.50070833 0.34475 0.44735417 0.04514583]]
NeuralNetwork Output: [[0.28618604 0.18969838 0.00889739 0.06283922]
[0.18952993 0.09904755 0.15489812 0.00890343]]
Adding/suppresing layers/BatchNormalization/Dropout is not making any difference and changing lossfunction/optimizer is only worsening the result. Do you have any advice or solution to this problem?
Softmax activation at the output is used for classification problems, you do not seem to be doing classification, but regression (since the output is continuous, not discrete).
Then you should change your activation at the output to something appropriate, like linear (general solution) or sigmoid (if the targets are in the [0, 1] range).
Keras Binary Classifier Tutorial Example gives only 50% validation accuracy.
The near 50% accuracy can be gotten from an un-trained classifier itself for binary classification.
This example is straight from https://keras.io/getting-started/sequential-model-guide/
import numpy as np
import tensorflow as tf
from tensorflow_core.python.keras.models import Sequential
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
np.random.seed(10)
# Generate dummy data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((800, 20))
y_test = np.random.randint(2, size=(800, 1))
model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
model.fit(x_train, y_train,
epochs=50,
batch_size=128,
validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, batch_size=128)
Accuracy output.
I tried with multiple trials.
Increased the number of hidden layers
Epoch 50/50 1000/1000 [==============================] - 0s
211us/sample - loss: 0.6905 - accuracy: 0.5410 - val_loss: 0.6959 -
val_accuracy: 0.4812
Could someone help me understand if anything is wrong here?
How to increase the accuracy for this "example" problem presented in the tutorial?
If you train a classifier with random examples, you will always get aprrox. 50% accuracy at validation data here represented by x_test. It is because your training samples get trained with random classes. Also the validation or test set has been assigned to random classes. This is why the random accuracy i.e. 50-50% occurs.
The more epoch you test the training set the more accuracy you will get on training set as an effect of overfitting.
I am trying to train a regression model of a dummy function with 3 variables with fully connected neural nets in Keras and I always get a training loss much higher than the validation loss.
I split the data set in 2/3 for training and 1/3 for validation. I have tried lots of different things:
changing the architecture
adding more neurons
using regularization
using different batch sizes
Still the training error is one order of magnitude higer than the validation error:
Epoch 5995/6000
4020/4020 [==============================] - 0s 78us/step - loss: 1.2446e-04 - mean_squared_error: 1.2446e-04 - val_loss: 1.3953e-05 - val_mean_squared_error: 1.3953e-05
Epoch 5996/6000
4020/4020 [==============================] - 0s 98us/step - loss: 1.2549e-04 - mean_squared_error: 1.2549e-04 - val_loss: 1.5730e-05 - val_mean_squared_error: 1.5730e-05
Epoch 5997/6000
4020/4020 [==============================] - 0s 105us/step - loss: 1.2500e-04 - mean_squared_error: 1.2500e-04 - val_loss: 1.4372e-05 - val_mean_squared_error: 1.4372e-05
Epoch 5998/6000
4020/4020 [==============================] - 0s 96us/step - loss: 1.2500e-04 - mean_squared_error: 1.2500e-04 - val_loss: 1.4151e-05 - val_mean_squared_error: 1.4151e-05
Epoch 5999/6000
4020/4020 [==============================] - 0s 80us/step - loss: 1.2487e-04 - mean_squared_error: 1.2487e-04 - val_loss: 1.4342e-05 - val_mean_squared_error: 1.4342e-05
Epoch 6000/6000
4020/4020 [==============================] - 0s 79us/step - loss: 1.2494e-04 - mean_squared_error: 1.2494e-04 - val_loss: 1.4769e-05 - val_mean_squared_error: 1.4769e-05
This makes no sense, please help!
Edit: this is the full code
I have 6000 training examples
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 26 13:40:03 2018
#author: Michele
"""
#from keras.datasets import reuters
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras import optimizers
import matplotlib.pyplot as plt
import os
import pylab
from keras.constraints import maxnorm
from sklearn.model_selection import train_test_split
from keras import regularizers
from sklearn.preprocessing import MinMaxScaler
import math
from sklearn.metrics import mean_squared_error
import keras
# fix random seed for reproducibility
seed=7
np.random.seed(seed)
dataset = np.loadtxt("BabbaX.csv", delimiter=",")
#split into input (X) and output (Y) variables
#x = dataset.transpose()[:,10:15] #only use power
x = dataset
del(dataset) # delete container
dataset = np.loadtxt("BabbaY.csv", delimiter=",")
#split into input (X) and output (Y) variables
y = dataset.transpose()
del(dataset) # delete container
#scale labels from 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
y = np.reshape(y, (y.shape[0],1))
y = scaler.fit_transform(y)
lenData=x.shape[0]
x=np.transpose(x)
xtrain=x[:,0:round(lenData*0.67)]
ytrain=y[0:round(lenData*0.67),]
xtest=x[:,round(lenData*0.67):round(lenData*1.0)]
ytest=y[round(lenData*0.67):round(lenData*1.0)]
xtrain=np.transpose(xtrain)
xtest=np.transpose(xtest)
l2_lambda = 0.1 #reg factor
#sequential type of model
model = Sequential()
#stacking layers with .add
units=300
#model.add(Dense(units, input_dim=xtest.shape[1], activation='relu', kernel_initializer='normal', kernel_regularizer=regularizers.l2(l2_lambda), kernel_constraint=maxnorm(3)))
model.add(Dense(units, activation='relu', input_dim=xtest.shape[1]))
#model.add(Dropout(0.1))
model.add(Dense(units, activation='relu'))
#model.add(Dropout(0.1))
model.add(Dense(1)) #no activation function should be used for the output layer
rms = optimizers.RMSprop(lr=0.00001, rho=0.9, epsilon=None, decay=0) #It is recommended to leave the parameters
adam = keras.optimizers.Adam(lr=0.00001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=1e-6, amsgrad=False)
#of this optimizer at their default values (except the learning rate, which can be freely tuned).
#adam=keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
#configure learning process with .compile
model.compile(optimizer=adam, loss='mean_squared_error', metrics=['mse'])
# fit the model (iterate on the training data in batches)
history = model.fit(xtrain, ytrain, nb_epoch=1000, batch_size=round(xtest.shape[0]/100),
validation_data=(xtest, ytest), shuffle=True, verbose=2)
#extract weights for each layer
weights = [layer.get_weights() for layer in model.layers]
#evaluate on training data set
valuesTrain=model.predict(xtrain)
#evaluate on test data set
valuesTest=model.predict(xtest)
#invert predictions
valuesTrain = scaler.inverse_transform(valuesTrain)
ytrain = scaler.inverse_transform(ytrain)
valuesTest = scaler.inverse_transform(valuesTest)
ytest = scaler.inverse_transform(ytest)
TL;DR:
When a model is learning well and quickly the validation loss can be lower than the training loss, since the validation happens on the updated model, while the training loss did not have any (no batches) or only some (with batches) of the updates applied.
Okay I think I found out what's happening here. I used the following code to test this.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
np.random.seed(7)
N_DATA = 6000
x = np.random.uniform(-10, 10, (3, N_DATA))
y = x[0] + x[1]**2 + x[2]**3
xtrain = x[:, 0:round(N_DATA*0.67)]
ytrain = y[0:round(N_DATA*0.67)]
xtest = x[:, round(N_DATA*0.67):N_DATA]
ytest = y[round(N_DATA*0.67):N_DATA]
xtrain = np.transpose(xtrain)
xtest = np.transpose(xtest)
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=3))
model.add(Dense(5, activation='relu'))
model.add(Dense(1))
adam = keras.optimizers.Adam()
# configure learning process with .compile
model.compile(optimizer=adam, loss='mean_squared_error', metrics=['mse'])
# fit the model (iterate on the training data in batches)
history = model.fit(xtrain, ytrain, nb_epoch=50,
batch_size=round(N_DATA/100),
validation_data=(xtest, ytest), shuffle=False, verbose=2)
plt.plot(history.history['mean_squared_error'])
plt.plot(history.history['val_loss'])
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
This is essentially the same as your code and replicates the problem, which is not actually a problem. Simply change
history = model.fit(xtrain, ytrain, nb_epoch=50,
batch_size=round(N_DATA/100),
validation_data=(xtest, ytest), shuffle=False, verbose=2)
to
history = model.fit(xtrain, ytrain, nb_epoch=50,
batch_size=round(N_DATA/100),
validation_data=(xtrain, ytrain), shuffle=False, verbose=2)
So instead of validating with your validation data you validate using the training data again, which leads to exactly the same behavior. Weird isn't it? No actually not. What I think is happening is:
The initial mean_squared_error given by Keras on every epoch is the loss before the gradients have been applied, while the validation happens after the gradients have been applied, which makes sense.
With highly stochastic problems for which NNs are usually used you do not see that, because the data varies so much that the updated weights simply are not good enough to describe the validation data, the slight overfitting effect on the training data is still so much stronger that even after updating the weights the validation loss is still higher than the training loss from before. That is only how I think it is though, I might be completely wrong.
One of the reasons that I think is maybe you can increase the size of training data and lower the size of validation data. Then your model will be trained on more samples which may include some complex samples as well and then can be validated on the remaining samples. Try something like train-80% and Validation-20% or any other numbers a little higher than what you used previously.
If you don't want to change the size of training and validation sets, then you can try changing the random seed value to some other number so that you will get a training set with different samples which might be helpful in training the model well.
Check this answer here to get more understanding of the other possible reasons.
Check this link if you want a more detailed explanation with an example. #Michele
If training loss is a little higher or nearer to validation loss, it mean that model is not overfitting.
Efforts are always there to use best out of features to have less overfitting and better validation and test accuracies.
Probable reason that you are always getting train loss higher can be the features and data you are using to train.
Please refer following link and observe the training and validation loss in case of dropout:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
I have for some time gotten pretty bad results using the tool keras, and haven't been suspisous about the tool that much.. But I am beginning to be a bit concerned now.
I tried to see whether it could handle a simple XOR problem, and after 30000 epochs it still haven't solved it...
code:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
import numpy as np
np.random.seed(100)
model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Activation('tanh'))
model.add(Dense(1, input_dim=2))
model.add(Activation('sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, nb_epoch=30000, batch_size=1,verbose=1)
print(model.predict_classes(X))
Here is part of my result:
4/4 [==============================] - 0s - loss: 0.3481
Epoch 29998/30000
4/4 [==============================] - 0s - loss: 0.3481
Epoch 29999/30000
4/4 [==============================] - 0s - loss: 0.3481
Epoch 30000/30000
4/4 [==============================] - 0s - loss: 0.3481
4/4 [==============================] - 0s
[[0]
[1]
[0]
[0]]
Is there something wrong with the tool - or am I doing something wrong??
Version I am using:
MacBook-Pro:~ usr$ python -c "import keras; print keras.__version__"
Using TensorFlow backend.
2.0.3
MacBook-Pro:~ usr$ python -c "import tensorflow as tf; print tf.__version__"
1.0.1
MacBook-Pro:~ usr$ python -c "import numpy as np; print np.__version__"
1.12.0
Updated version:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam, SGD
import numpy as np
#np.random.seed(100)
model = Sequential()
model.add(Dense(units = 2, input_dim=2, activation = 'relu'))
model.add(Dense(units = 1, activation = 'sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer='adam')
print model.summary()
model.fit(X, y, nb_epoch=5000, batch_size=4,verbose=1)
print(model.predict_classes(X))
I cannot add a comment to Daniel's response as I don't have enough reputation, but I believe he's on the right track. While I have not personally tried running the XOR with Keras, here's an article that might be interesting - it analyzes the various regions of local minima for a 2-2-1 network, showing that higher numerical precision would lead to fewer instances of getting stuck on a gradient descent algorithm.
The Local Minima of the Error Surface of the 2-2-1 XOR Network (Ida G. Sprinkhuizen-Kuyper and Egbert J.W. Boers)
On a side note I won't consider using a 2-4-1 network as over-fitting the problem. Having 4 linear cuts on the 0-1 plane (cutting into a 2x2 grid) instead of 2 cuts (cutting the corners off diagonally) just separates the data in a different way, but since we only have 4 data points and no noise in the data, the neural network that uses 4 linear cuts isn't describing "noise" instead of the XOR relationship.
I think it's a "local minimum" in the loss function.
Why?
I have run this same code over and over for a few times, and sometimes it goes right, sometimes it gets stuck into a wrong result. Notice that this code "recreates" the model every time I run it. (If I insist on training a model that found the wrong results, it will simply be kept there forever).
from keras.models import Sequential
from keras.layers import *
import numpy as np
m = Sequential()
m.add(Dense(2,input_dim=2, activation='tanh'))
#m.add(Activation('tanh'))
m.add(Dense(1,activation='sigmoid'))
#m.add(Activation('sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]],'float32')
Y = np.array([[0],[1],[1],[0]],'float32')
m.compile(optimizer='adam',loss='binary_crossentropy')
m.fit(X,Y,batch_size=1,epochs=20000,verbose=0)
print(m.predict(X))
Running this code, I have found some different outputs:
Wrong: [[ 0.00392423], [ 0.99576807], [ 0.50008368], [ 0.50008368]]
Right: [[ 0.08072935], [ 0.95266515], [ 0.95266813], [ 0.09427474]]
What conclusion can we take from it?
The optimizer is not dealing properly with this local minimum. If it gets lucky (a proper weight initialization), it will fall in a good minimum, and bring the right results.
If it gets unlucky (a bad weight initialization), it will fall in a local minimum, without really knowing that there are better places in the loss function, and its learn_rate is simply not big enough to escape this minimum. The small gradient keeps turning around the same point.
If you take the time to study which gradients appear in the wrong case, you will probably see it keeps pointing towards that same point, and increasing the learning rate a little may make it escape the hole.
Intuition makes me think that such very small models have more prominent local minimums.
Instead of just increasing the number of epochs, try using relu for the activation of your hidden layer instead of tanh. Making only that change to the code you provide, I am able to obtain the following result after only 2000 epochs (Theano backend):
import numpy as np
print(np.__version__) #1.11.3
import keras
print(theano.__version__) # 0.9.0
import theano
print(keras.__version__) # 2.0.2
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam, SGD
np.random.seed(100)
model = Sequential()
model.add(Dense(units = 2, input_dim=2, activation = 'relu'))
model.add(Dense(units = 1, activation = 'sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer='adam'
model.fit(X, y, epochs=2000, batch_size=1,verbose=0)
print(model.evaluate(X,y))
print(model.predict_classes(X))
4/4 [==============================] - 0s
0.118175707757
4/4 [==============================] - 0s
[[0]
[1]
[1]
[0]]
It would be easy to conclude that this is due to vanishing gradient problem. However, the simplicity of this network suggest that this isn't the case. Indeed, if I change the optimizer from 'adam' to SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False) (the default values), I can see the following result after 5000 epochs with tanh activation in the hidden layer.
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam, SGD
np.random.seed(100)
model = Sequential()
model.add(Dense(units = 2, input_dim=2, activation = 'tanh'))
model.add(Dense(units = 1, activation = 'sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer=SGD())
model.fit(X, y, epochs=5000, batch_size=1,verbose=0)
print(model.evaluate(X,y))
print(model.predict_classes(X))
4/4 [==============================] - 0s
0.0314897596836
4/4 [==============================] - 0s
[[0]
[1]
[1]
[0]]
Edit: 5/17/17 - Included complete code to enable reproduction
The minimal neuron network architecture required to learn XOR which should be a (2,2,1) network. In fact, if maths shows that the (2,2,1) network can solve the XOR problem, but maths doesn't show that the (2,2,1) network is easy to train. It could sometimes takes a lot of epochs (iterations) or does not converge to the global minimum. That said, I've got easily good results with (2,3,1) or (2,4,1) network architectures.
The problem seems to be related to the existence of many local minima. Look at this 1998 paper, «Learning XOR: exploring the space of a classic problem» by Richard Bland. Furthermore weights initialization with random number between 0.5 and 1.0 helps to converge.
It works fine with Keras or TensorFlow using loss function 'mean_squared_error', sigmoid activation and Adam optimizer. Even with pretty good hyperparameters, I observed that the learned XOR model is trapped in a local minimum about 15% of the time.
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from tensorflow.keras import initializers
import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
def initialize_weights(shape, dtype=None):
return np.random.normal(loc = 0.75, scale = 1e-2, size = shape)
model = Sequential()
model.add(Dense(2,
activation='sigmoid',
kernel_initializer=initialize_weights,
input_dim=2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error',
optimizer='adam',
metrics=['accuracy'])
print("*** Training... ***")
model.fit(X, y, batch_size=4, epochs=10000, verbose=0)
print("*** Training done! ***")
print("*** Model prediction on [[0,0],[0,1],[1,0],[1,1]] ***")
print(model.predict_proba(X))
*** Training... ***
*** Training done! ***
*** Model prediction on [[0,0],[0,1],[1,0],[1,1]] ***
[[0.08662204]
[0.9235283 ]
[0.92356336]
[0.06672956]]