I am unsure how to interpret the default behavior of Keras in the following situation:
My Y (ground truth) was set up using scikit-learn's MultiLabelBinarizer().
Therefore, to give a random example, one row of my y matrix is one-hot encoded as follows:
[0,0,0,1,0,1,0,0,0,0,1].
So I have 11 classes that could be predicted, and more than one can be true; hence the multilabel nature of the problem. There are three labels for this particular sample.
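For context, a minimal sketch of how such a row is produced with MultiLabelBinarizer (the label lists below are made up for illustration):
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample is a list of the class indices that apply to it
mlb = MultiLabelBinarizer(classes=list(range(11)))
y = mlb.fit_transform([[3, 5, 10], [0, 2]])
print(y[0])  # -> [0 0 0 1 0 1 0 0 0 0 1]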
I train the model as I would for a non-multilabel problem (business as usual) and I get no errors.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(y_train.shape[1], activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=5, batch_size=2000)
score = model.evaluate(X_test, y_test, batch_size=2000)
score
What does Keras do when it encounters my y_train and sees that it is "multi" one-hot encoded, meaning there is more than one 'one' present in each row of y_train? Basically, does Keras automatically perform multilabel classification? Any differences in the interpretation of the scoring metrics?
In short
Don't use softmax.
Use sigmoid for the activation of your output layer.
Use binary_crossentropy as the loss function.
Use predict for evaluation.
Why
With softmax, increasing the score for one label lowers all the others, because the outputs form a probability distribution that sums to 1. You don't want that when multiple labels can be true at once. With sigmoid, each output is an independent probability between 0 and 1, which is what a multilabel problem needs.
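A tiny NumPy sketch (illustrative numbers only) of the difference:
import numpy as np

logits = np.array([2.0, 1.0, 0.1])

# softmax: the outputs compete and always sum to 1
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax, softmax.sum())   # ~[0.66 0.24 0.10], sum == 1.0

# sigmoid: each output is scored independently, so several can be high at once
sigmoid = 1 / (1 + np.exp(-logits))
print(sigmoid)                  # ~[0.88 0.73 0.52]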
Complete Code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(y_train.shape[1], activation='sigmoid'))  # sigmoid, not softmax

# 'lr' and 'decay' are deprecated/removed in newer Keras; 'learning_rate' is the current name
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy',  # one binary decision per label
              optimizer=sgd)

model.fit(X_train, y_train, epochs=5, batch_size=2000)

# Threshold the independent per-label probabilities at 0.5
preds = model.predict(X_test)
preds[preds >= 0.5] = 1
preds[preds < 0.5] = 0
# score = compare preds and y_test
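To make that last comment concrete, here is one hedged way to score the thresholded predictions with scikit-learn (the metric choice is up to you; these are just common options for multilabel problems):
from sklearn.metrics import f1_score, hamming_loss

# Micro-averaged F1 aggregates decisions over all labels and samples;
# Hamming loss is the fraction of individual label predictions that are wrong.
print("micro F1:", f1_score(y_test, preds, average='micro'))
print("hamming loss:", hamming_loss(y_test, preds))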
Answer from Keras Documentation
I am quoting from the Keras documentation itself.
They use a Dense output layer with sigmoid activation, which means they also treat multi-label classification as a set of binary classifications with binary cross-entropy loss.
The following is the model created in the Keras documentation:
shallow_mlp_model = keras.Sequential(
    [
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
    ]  # More on why "sigmoid" has been used here in a moment.
)
Keras doc link:
https://keras.io/examples/nlp/multi_label_classification/
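For completeness, a compile call consistent with that approach (a sketch in the spirit of the quoted documentation, not copied verbatim from the linked example) would be:
# binary_crossentropy treats each of the vocabulary_size() outputs as an
# independent yes/no decision, matching the sigmoid output layer above
shallow_mlp_model.compile(loss="binary_crossentropy", optimizer="adam")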
Related
I have trained an LSTM model to predict multiple output values.
The predicted values are almost the same even though the loss is low. Why is that, and how can I improve it?
from keras import backend as K
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential                          # was missing in the original snippet
from keras.layers import LSTM, Dense, Dropout, Activation    # LSTM was also missing

def create_model():
    model = Sequential()
    model.add(LSTM(50, return_sequences=True, input_shape=(40000, 7)))
    model.add(LSTM(50, return_sequences=True))
    model.add(LSTM(50, return_sequences=False))
    model.add(Dense(25))
    model.add(Dense(2, activation='linear'))
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.summary()
    return model

model = create_model()
model.fit(X_train, Y_train, shuffle=False, verbose=1, epochs=10)

prediction = model.predict(X_test, verbose=0)
print(prediction)
prediction =
[[0.26766795 0.00193274]
[0.2676593 0.00192017]
[0.2676627 0.00193239]
[0.2676644 0.00192784]
[0.26766634 0.00193461]
[0.2676624 0.00192487]
[0.26766685 0.00193129]
[0.26766685 0.00193165]
[0.2676621 0.00193216]
[0.26766127 0.00192624]]
Calculating the mean relative error:
import tensorflow as tf

mean_relative_error = tf.reduce_mean(tf.abs((Y_test - prediction) / Y_test))
print(mean_relative_error)

mean_relative_error = 1.9220362
It means you are just pushing the values of x as close as possible to y, like a direct x -> y mapping. The relative error tells me that your y values are relatively small, so when you take the mean difference between y_hat and y they look close enough...
To break this symmetry you should increase the number of LSTM cells, add Dropout to them, and also put an L1 regularization term on your Dense layers.
Decrease the number of neurons in each Dense layer while increasing the overall network size, and change your loss from "mean_squared_error" to "mean_absolute_error".
One more thing: use Adagrad with a learning_rate of 1 instead of the Adam optimizer. A sketch of these combined changes is shown below.
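A hedged sketch of those suggestions combined (the cell counts, dropout rate, and L1 factor are placeholder values, not tuned):
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import Adagrad
from keras.regularizers import l1

def create_model():
    model = Sequential()
    # More LSTM cells plus dropout, as suggested above
    model.add(LSTM(128, return_sequences=True, input_shape=(40000, 7)))
    model.add(Dropout(0.2))
    model.add(LSTM(128, return_sequences=False))
    model.add(Dropout(0.2))
    # Smaller Dense layer with an L1 penalty on its weights
    model.add(Dense(16, kernel_regularizer=l1(1e-4)))
    model.add(Dense(2, activation='linear'))
    model.compile(optimizer=Adagrad(learning_rate=1.0), loss='mean_absolute_error')
    return model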
I have this code:
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:,0:8]
y = dataset[:,8]
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=150, batch_size=10)
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
I need to change the output column so that it predicts/learns a score (for instance, 1 to a million) instead of 0 or 1 (sigmoid).
For your case you need to use relu as the activation function in the last (output) layer instead of sigmoid.
The range of relu is [0, inf), and in that case you need to use 'mse' (mean squared error) as your loss.
Conceptually, the problem you are trying to solve is a regression problem; a sketch of the change is shown below.
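A minimal sketch of that change applied to the code above (assuming y now holds the raw scores rather than 0/1 labels):
from keras.models import Sequential
from keras.layers import Dense

# Same network, but the output layer and loss are set up for regression
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='relu'))   # relu keeps predictions in [0, inf)
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(X, y, epochs=150, batch_size=10)  # y holds scores, not 0/1 labels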
I am trying to approximate a function with a Keras model that has only one hidden layer, and whatever I do I can't reach the necessary result.
I'm trying to do it with the following code:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from LABS.ZeroLab import E_Function as dataset5
train_size = 2000
# 2 model and data initializing
(x_train, y_train), (x_test, y_test) = dataset5.load_data(train_size=train_size, show=True)
model = Sequential()
model.add(Dense(50, kernel_initializer='he_uniform', bias_initializer='he_uniform', activation='sigmoid'))
model.add(Dense(1, kernel_initializer='he_uniform', bias_initializer='he_uniform', activation='linear'))
model.compile(optimizer=Adam(), loss='mae', metrics=['mae'])
history = model.fit(x=x_train, y=y_train, batch_size=20, epochs=10000, validation_data=(x_test, y_test), verbose=1)
[Plot: the target function loaded from dataset5]
[Plot: comparison of the model prediction with the testing data]
I tried to fit this network with different optimizers and numbers of neurons (from 50 to 300), but the result was the same.
What should I change?
I found the solution! The main issue was the training data: I forgot to shuffle x_train and y_train before fitting.
I approximated it successfully with 2 hidden layers, but I still can't approximate it with 1 hidden layer. The shuffling fix is sketched below.
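A minimal sketch of that fix (shuffling the training pairs together before fitting; this is an illustration, not the original poster's exact code):
import numpy as np

# Shuffle x_train and y_train with the same permutation so the pairs stay aligned
perm = np.random.permutation(len(x_train))
x_train, y_train = x_train[perm], y_train[perm]

history = model.fit(x=x_train, y=y_train, batch_size=20, epochs=10000,
                    validation_data=(x_test, y_test), verbose=1)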
We are trying to build a Keras model that predicts a vector of probabilities from a vector of features. The output vector should contain probabilities between 0 and 1 that sum to 1, but somehow the output vector consists mostly of zeros and ones. Moreover, while the model is supposed to be training and learning, the loss and val_loss remain the same.
Does anyone know what the problem with our model is?
example of input vector:
(0,4,1444997,0,622,154536,0,2,11,0,5,11,10,32,4.26E-04,0,5,498,11,1,11,0,172,0,4,1,8,150)
example of expected output vector:
(0.25,0,0,0.083333333,0.583333333,0.083333333)
example of real output vector:
(1.000000000000000000e+00,5.556597260531319618e-28,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00)
the code:
# Create first network with Keras
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras import optimizers
import numpy

X = numpy.loadtxt("compiledFeatures.csv", delimiter=",")
Y = numpy.loadtxt("naive_compiledDate.csv", delimiter=",")

# create model ('init' from Keras 1 is 'kernel_initializer' in Keras 2+)
model = Sequential()
model.add(Dense(20, input_dim=28, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='sigmoid'))

# Compile model
model.compile(optimizer="adam", loss='mae')

# Fit the model
model.fit(X, Y, epochs=2000, verbose=2, validation_split=0.15)

# calculate predictions
predictions = model.predict(X)
The last activation function that guarantees the outputs sum to 1 is "softmax".
A frozen loss may be caused by "relu" in this case, where you have so few neurons in each layer (and also by improper weight initialization).
I suggest you use "softplus", "tanh" or even "sigmoid" instead of relu.
EDIT:
As @nuric suggested, it's really a good idea to use "categorical_crossentropy" as the loss when you're using "softmax". A sketch of the combined changes follows.
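Putting those suggestions together, a hedged sketch of the revised model (hidden activations and layer sizes are illustrative; Y must contain the target probability vectors):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20, input_dim=28, kernel_initializer='normal', activation='tanh'))
model.add(Dense(15, kernel_initializer='normal', activation='tanh'))
model.add(Dense(6, activation='softmax'))  # outputs are non-negative and sum to 1

# categorical_crossentropy is the natural pairing for softmax probability targets
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, Y, epochs=2000, verbose=2, validation_split=0.15)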