We are trying to build a Keras model that predicts a vector of probabilities from a vector of features. The output values should lie between 0 and 1 and sum to 1, but somehow the output vector consists mostly of zeros and ones. Moreover, while the model is supposed to be training, loss and val_loss stay the same.
Does anyone know what is wrong with our model?
Example of input vector:
(0,4,1444997,0,622,154536,0,2,11,0,5,11,10,32,4.26E-04,0,5,498,11,1,11,0,172,0,4,1,8,150)
Example of expected output vector:
(0.25,0,0,0.083333333,0.583333333,0.083333333)
Example of actual output vector:
(1.000000000000000000e+00,5.556597260531319618e-28,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00)
The code:
# Create first network with Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.advanced_activations import LeakyReLU
from keras import optimizers
import numpy
X = numpy.loadtxt("compiledFeatures.csv", delimiter=",")
Y = numpy.loadtxt("naive_compiledDate.csv", delimiter=",")
# create model
model = Sequential()
model.add(Dense(20, input_dim=28, init='normal', activation='relu'))
model.add(Dense(15, init='normal', activation='relu'))
model.add(Dense(6, init='normal', activation='relu'))
model.add(Dense(6, init='normal', activation='sigmoid'))
# Compile model
model.compile(optimizer = "adam", loss = 'mae')
# Fit the model
model.fit(X, Y, epochs=2000, verbose=2, validation_split = 0.15)
# calculate predictions
predictions = model.predict(X)
To guarantee that the outputs sum to 1, the last activation function should be "softmax".
A frozen loss may be caused by "relu" in this case, where you have so few neurons in each layer: units whose input stays negative output zero and stop receiving gradient. (Improper weight initialization can cause it too.)
I suggest you use "softplus", "tanh" or even "sigmoid" instead of relu.
EDIT:
As @nuric suggested, it's really a good idea to use "categorical_crossentropy" as the loss when you're using "softmax".
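A minimal NumPy sketch of why softmax plus cross-entropy fits this kind of target. The logits are made-up values (not outputs of the model above), the target is the expected output vector from the question, and the helper names are only illustrative:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Standard cross-entropy for soft targets: -sum(y_true * log(y_pred)).
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))

# Hypothetical raw outputs (logits) of a 6-unit final layer.
logits = np.array([1.2, -3.0, -2.5, 0.1, 2.0, 0.0])
# Expected output vector from the question.
target = np.array([0.25, 0.0, 0.0, 0.083333333, 0.583333333, 0.083333333])

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

print("softmax sum:", p_softmax.sum())  # behaves like a distribution
print("sigmoid sum:", p_sigmoid.sum())  # nothing constrains this sum
print("cross-entropy vs target:", categorical_crossentropy(target, p_softmax))
```

With sigmoid each output is squashed independently, so nothing forces the six values to add up to 1; softmax normalizes them jointly.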
I have trained an LSTM model to predict multiple output values.
The predicted values are almost the same for every input even though the loss is low. Why is that, and how can I improve it?
`from keras import backend as K
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers.core import Dense, Dropout, Activation
def create_model():
    model = Sequential()
    model.add(LSTM(50, return_sequences=True, input_shape=(40000, 7)))
    model.add(LSTM(50, return_sequences=True))
    model.add(LSTM(50, return_sequences=False))
    model.add(Dense(25))
    model.add(Dense(2, activation='linear'))
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.summary()
    return model
model = create_model()
model.fit(X_train, Y_train, shuffle=False, verbose=1, epochs=10)
prediction = model.predict(X_test, verbose=0)
print(prediction)
prediction =
[[0.26766795 0.00193274]
[0.2676593 0.00192017]
[0.2676627 0.00193239]
[0.2676644 0.00192784]
[0.26766634 0.00193461]
[0.2676624 0.00192487]
[0.26766685 0.00193129]
[0.26766685 0.00193165]
[0.2676621 0.00193216]
[0.26766127 0.00192624]]
`
Calculating the mean relative error:
`mean_relative_error = tf.reduce_mean(tf.abs((Y_test-prediction)/Y_test))
print(mean_relative_error)`
`mean_relative_error= 1.9220362`
It means the model is just mapping every x to roughly the same y. The relative error tells me that your y values are relatively small, so the absolute difference between y_hat and y is small and the loss looks low even though the predictions barely vary.
To break this symmetry, you should increase the number of LSTM cells and add dropout to them; also make sure to put an L1 regularization term on your Dense layers.
Decrease the number of neurons in each Dense layer and increase the network size; also change your loss from "mean_squared_error" to "mean_absolute_error".
One more thing: use Adagrad with a learning_rate of 1 instead of the Adam optimizer.
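To illustrate how a low loss and a large relative error can coexist, here is a small NumPy sketch. The target values are hypothetical (the real Y_test is not shown in the question); the prediction row is rounded from the output printed above:

```python
import numpy as np

def mean_relative_error(y_true, y_pred):
    # Mean of |y_true - y_pred| / |y_true|; undefined if a target is 0.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical targets with small magnitudes.
y_true = np.array([[0.10, 0.0005],
                   [0.30, 0.0010],
                   [0.50, 0.0040]])
# A model that outputs nearly the same row for every sample.
y_pred = np.tile([0.2677, 0.0019], (3, 1))

mse = float(np.mean((y_true - y_pred) ** 2))
mre = mean_relative_error(y_true, y_pred)

print("MSE:", mse)
print("mean relative error:", mre)
```

Because the targets are small, a near-constant prediction keeps the squared error low, while dividing by the targets exposes how far off each prediction is in relative terms.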
I am unsure how to interpret the default behavior of Keras in the following situation:
My Y (ground truth) was set up using scikit-learn's MultiLabelBinarizer().
Therefore, to give a random example, one row of my y column is one-hot encoded as such:
[0,0,0,1,0,1,0,0,0,0,1].
So I have 11 classes that could be predicted, and more than one can be true; hence the multilabel nature of the problem. There are three labels for this particular sample.
I train the model as I would for a non multilabel problem (business as usual) and I get no errors.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(y_train.shape[1], activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
optimizer=sgd,
metrics=['accuracy',])
model.fit(X_train, y_train,epochs=5,batch_size=2000)
score = model.evaluate(X_test, y_test, batch_size=2000)
score
What does Keras do when it encounters my y_train and sees that it is "multi" one-hot encoded, meaning there is more than one 'one' present in each row of y_train? Basically, does Keras automatically perform multilabel classification? Any differences in the interpretation of the scoring metrics?
In short
Don't use softmax.
Use sigmoid for activation of your output layer.
Use binary_crossentropy for loss function.
Use predict for evaluation.
Why
In softmax, increasing the score for one label lowers all the others (it's a probability distribution). You don't want that when you have multiple labels.
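A small NumPy sketch of that coupling, using made-up logits for four labels (the values and helper names are only for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for 4 labels; labels 0 and 1 are both "likely".
logits = np.array([2.0, 2.0, -1.0, -1.0])
boosted = logits.copy()
boosted[0] += 3.0  # raise the score of label 0 only

print("softmax before:", softmax(logits))
print("softmax after: ", softmax(boosted))  # labels 1..3 all drop
print("sigmoid before:", sigmoid(logits))
print("sigmoid after: ", sigmoid(boosted))  # labels 1..3 unchanged
```

Note also that softmax can never put more than one label above 0.5, whereas sigmoid can mark several labels as present at once.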
Complete Code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.optimizers import SGD
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(y_train.shape[1], activation='sigmoid'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy',
optimizer=sgd)
model.fit(X_train, y_train, epochs=5, batch_size=2000)
preds = model.predict(X_test)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
# score = compare preds and y_test
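The final comment in the code above leaves the comparison open. One possible way to fill it in, sketched with NumPy on made-up arrays (the y_test and probs values below are hypothetical, and which metric is appropriate depends on the problem):

```python
import numpy as np

# Hypothetical ground truth and predicted probabilities (3 samples, 4 labels).
y_test = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
probs = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.3, 0.8, 0.4, 0.7],
                  [0.7, 0.4, 0.1, 0.9]])

# Threshold each label independently at 0.5.
preds = (probs >= 0.5).astype(int)

# Fraction of individual label decisions that are correct.
label_accuracy = float((preds == y_test).mean())
# Fraction of samples where every label is correct (exact match).
subset_accuracy = float((preds == y_test).all(axis=1).mean())

# Micro-averaged precision, recall and F1 over all label decisions.
tp = int(((preds == 1) & (y_test == 1)).sum())
fp = int(((preds == 1) & (y_test == 0)).sum())
fn = int(((preds == 0) & (y_test == 1)).sum())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(label_accuracy, subset_accuracy, precision, recall, f1)
```

Exact-match accuracy is much stricter than per-label accuracy, so the two can differ widely on the same predictions.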
Answer from the Keras documentation
I am quoting from the Keras documentation itself.
They use a Dense output layer with sigmoid activation, which means they also treat multi-label classification as multiple binary classifications with binary cross-entropy loss.
The following model is created in the Keras documentation:
shallow_mlp_model = keras.Sequential(
    [
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
    ]  # More on why "sigmoid" has been used here in a moment.
)
Keras doc link:
https://keras.io/examples/nlp/multi_label_classification/
I've made a multilayer LSTM model that uses regression to predict the next frame's values of the data. The model finishes after 20 epochs. I then get some predictions and compare them to my ground truth values. As you can see in the picture above, the predictions converge to a constant value. I don't know why this happens.
Here is my model so far:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import LSTM, BatchNormalization
from tensorflow.python.keras.initializers import RandomUniform
init = RandomUniform(minval=-0.05, maxval= 0.05)
model = Sequential()
model.add(LSTM(kernel_initializer=init, activation='relu', return_sequences=True, units=800, dropout=0.5, recurrent_dropout=0.2, input_shape=(x_train.shape[1], x_train.shape[2]) ))
model.add(LSTM(kernel_initializer=init, activation='relu', return_sequences=False, units=500, dropout=0.5, recurrent_dropout=0.2 ))
model.add(Dense(1024, activation='linear', kernel_initializer=init))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear', kernel_initializer= 'normal'))
model.compile(loss='mean_squared_error', optimizer='rmsprop' )
model.summary()
EDIT1:
I decreased the number of epochs from 20 to 3. The results are as follows:
By comparing the 2 pictures, I can conclude that as the number of epochs increases, the predictions are more likely to converge to a specific value, which is around -0.1.
So, after trying different numbers of LSTM units and different types of architectures, I realized that the current number of LSTM units causes the model to learn very slowly, and 20 epochs were not sufficient for such a huge model. For each layer, I changed the number of LSTM units to 64, removed the Dense(1024) layer, and increased the number of epochs from 20 to 400. The results were incredibly close to the ground truth values. I should mention that the dataset used in the new model was different from the former one, because I encountered some problems with that dataset. Here is the new model:
from keras.optimizers import RMSprop
from keras.initializers import glorot_uniform, glorot_normal, RandomUniform
init = glorot_normal(seed=None)
init1 = RandomUniform(minval=-0.05, maxval=0.05)
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)
model = Sequential()
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2,
input_shape=(x_train.shape[1], x_train.shape[2]),
return_sequences=True, kernel_initializer=init))
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2,
return_sequences=False, kernel_initializer=init))
model.add(Dense(1, activation='linear', kernel_initializer= init1))
model.compile(loss='mean_squared_error', optimizer=optimizer )
model.summary()
You can see the predictions here:
It's still not the best model, but at least it outperformed the former one.
If you have any further recommendations on how to improve it, they would be greatly appreciated.
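To get a rough sense of how much smaller the second model is, the parameter counts can be estimated by hand. This is only a sketch: the number of input features (n_features) is an assumption, since x_train.shape[2] is not given in the post, and the helper functions are hypothetical names for the usual per-layer formulas:

```python
def lstm_param_count(input_dim, units):
    # An LSTM layer has 4 gates; each gate has an input kernel
    # (input_dim x units), a recurrent kernel (units x units) and a bias (units).
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

n_features = 10  # assumed; not stated in the post

original = (lstm_param_count(n_features, 800)
            + lstm_param_count(800, 500)
            + dense_param_count(500, 1024)
            + 4 * 1024  # BatchNormalization: gamma, beta, moving mean, moving variance
            + dense_param_count(1024, 1))
reduced = (lstm_param_count(n_features, 64)
           + lstm_param_count(64, 64)
           + dense_param_count(64, 1))

print("original model parameters:", original)
print("reduced model parameters: ", reduced)
```

Under this assumption the first model has on the order of millions of parameters and the second on the order of tens of thousands, which is consistent with the first one needing far more than 20 epochs.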
I am trying to create a neural network with Keras (TensorFlow backend).
I have 4 input and 2 output variables:
(image not available)
I want to make predictions on a test set:
(image not available)
This is my code:
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
import numpy
numpy.random.seed(7)
dataset = numpy.loadtxt("trainingsdata.csv", delimiter=";")
X = dataset[:,0:4]
Y = dataset[:,4:6]
model = Sequential()
model.add(Dense(4, input_dim=4, init='uniform', activation='sigmoid'))
model.add(Dense(3, init='uniform', activation='sigmoid'))
model.add(Dense(2, init='uniform', activation='linear'))
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
model.fit(X, Y, epochs=150, batch_size=10, verbose=2)
testset = numpy.loadtxt("testdata.csv", delimiter=";")
Z = testset[:,0:4]
predictions = model.predict(Z)
print(predictions)
When I run the script, the accuracy is 1.000 after every epoch, and I always get the same output for every input:
[-5.83297 68.2967]
[-5.83297 68.2967]
[-5.83297 68.2967]
...
Does anybody have an idea what the fault in my code is?
I suggest you normalize / standardize your data before feeding it to your model and then check if your model starts to learn.
Have a look at scikit-learn's StandardScaler.
And look into this SO thread to learn how to correctly fit_transform your training data and only transform your test data.
There is also this tutorial that makes use of scikit-learn's data preprocessing pipeline: http://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
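The fit-on-train, transform-on-both idea can be sketched in plain NumPy. This mirrors what standardization does (subtract the mean, divide by the standard deviation), without relying on scikit-learn; the arrays are hypothetical and only stand in for the asker's data:

```python
import numpy as np

# Hypothetical data with columns on very different scales.
X_train = np.array([[1.0, 1000.0, 206000.0],
                    [2.0, 1500.0, 150000.0],
                    [3.0,  500.0, 300000.0],
                    [4.0, 2000.0, 100000.0]])
X_test = np.array([[2.5, 1200.0, 180000.0]])

# "fit": compute the statistics on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns

# "transform": apply the same statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled)
print(X_test_scaled)
```

The test set is scaled with the training statistics so that no information from the test data leaks into preprocessing.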
Neural networks have a tough time if the scales of the input variables differ too much from each other. Feeding 10, 1000 and 100000 as inputs to the same layer causes the gradients to be dominated by whichever value is largest. The other values effectively don't provide any information.
One method is to simply rescale the input variables by a constant. For example, you can divide the 206000 by 100000. Try to get all of the variables to around the same number of digits. Large numbers are a bit harder for networks than small numbers.
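One way to see the effect on a sigmoid layer like the first layer in the question is to compare activations and gradients for raw and rescaled inputs. This is a sketch under assumptions: the input rows, the small positive weights, and the scaling constants are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid with respect to its input.
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W = rng.uniform(0.01, 0.05, size=(4, 4))  # hypothetical small positive weights

# Two hypothetical samples with one very large feature.
raw = np.array([[10.0, 1000.0, 206000.0, 5.0],
                [90.0,   10.0, 100000.0, 1.0]])
# Divide each column by a constant chosen by eye, as suggested above.
scaled = raw / np.array([100.0, 1000.0, 100000.0, 10.0])

z_raw = raw @ W
z_scaled = scaled @ W

print("raw activations:   ", sigmoid(z_raw))
print("raw gradients:     ", sigmoid_grad(z_raw))
print("scaled activations:", sigmoid(z_scaled))
print("scaled gradients:  ", sigmoid_grad(z_scaled))
```

With the raw inputs every unit saturates, so both samples produce identical activations and the gradient vanishes; after rescaling, the samples are distinguishable and the gradient is usable.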