Using softmax as output function while using binary_crossentropy as loss function? - python

Currently I am training a model for binary classification. I liked the idea of having two probabilities (one for each of the existing classes) which add up to 1. So I used softmax in my output layer and got very high accuracies (up to 99.5%) along with very low losses of 0.007.
While researching a bit, I found that binary crossentropy is the only real choice when training for a two-class classification problem.
Now I am confused about whether I have to use categorical_crossentropy as the loss function when I want to use softmax. Could you help me understand what should be used as the loss function and activation function in a binary classification problem, and why?
Here's my code:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, input_dim=input_dim, activation='sigmoid'))
model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

So, if every object can belong to only one class, then there is no difference between
model.add(Dense(1, activation='sigmoid'))
loss = tf.keras.losses.BinaryCrossentropy()
and
model.add(Dense(2, activation='softmax'))
loss = tf.keras.losses.CategoricalCrossentropy()
As mentioned here, binary crossentropy is just a special case of categorical crossentropy.
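To make the equivalence concrete, here is a minimal pure-Python sketch (no Keras involved; all helper names are made up for illustration). It assumes the two-unit softmax receives the logits [0, z], where z is the logit of the single sigmoid unit, and that the labels are one-hot in the softmax case. The last value additionally assumes that binary crossentropy is averaged over the two outputs, which is how Keras reduces it over the last axis as far as I know.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, p):
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(one_hot, probs):
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

z = 1.7   # arbitrary logit, chosen only for illustration
y = 1     # true class

p = sigmoid(z)                 # Dense(1, activation='sigmoid')
probs = softmax([0.0, z])      # Dense(2, activation='softmax')
one_hot = [1 - y, y]

bce = binary_crossentropy(y, p)
cce = categorical_crossentropy(one_hot, probs)

# Binary crossentropy applied to the two softmax outputs and averaged
# (assumption: mean over the last axis, one-hot labels).
bce_on_softmax = sum(binary_crossentropy(t, q) for t, q in zip(one_hot, probs)) / 2

print(p, probs[1])
print(bce, cce, bce_on_softmax)
```

Under these assumptions all three loss values agree, which would also explain why a two-unit softmax trained with binary_crossentropy on one-hot labels can still work well.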

The loss function depends on the problem type.
For a binary classification problem -> binary_crossentropy
For a multi-class problem -> categorical_crossentropy
For a text classification problem -> MSE loss is calculated.
The activation function also depends on the problem type.
Generally, the relu activation function is used, but for a binary classification problem tanh sometimes performs better.
I wouldn't suggest using sigmoid.
For the optimizer, Adadelta generally performs better.
The reason for the suggestion is the accuracy metric. The aim is to reach high accuracy, therefore your model must be learning. There are no strict rules, but some methods have been proven to perform better.

Related

How To specify model.compile for binary_crossentropy, activation=sigmoid and activation=softmax?

I am trying to figure out how to match activation=sigmoid and activation=softmax with the correct model.compile() loss parameters. Specifically those associated with binary_crossentropy.
I have researched related topics and read the docs. Also I have built a model and got it working with sigmoid but not softmax. And I cannot get it working properly with the "from_logits" parameters.
Specifically, here it says:
Args:
from_logits: Whether output is expected to be a logits tensor.
By default, we consider that output encodes a probability distribution.
This says to me that if you use a sigmoid activation you want "from_logits=True". And for softmax activation you want "from_logits=False" by default. Here I am assuming that sigmoid provides logits and softmax provides a probability distribution.
Next is some code:
model = Sequential()
model.add(LSTM(units=128,
               input_shape=(n_timesteps, n_features),
               return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=64, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=32))
model.add(Dropout(0.3))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
Notice the last line is using the sigmoid activation. Then:
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
This works fine but it is working with the default "from_logits=False" which is expecting a probability distribution.
If I do the following, it fails:
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'],
              from_logits=True) # For 'sigmoid' in above Dense
with this error message:
ValueError: Invalid argument "from_logits" passed to K.function with TensorFlow backend
If I try using the softmax activation as:
model.add(Dense(1, activation='softmax'))
It runs but I get 50% accuracy results. With sigmoid I am getting +99% accuracy. (I am using a very contrived data set to debug my models and would expect very high accuracy. Plus it is a very small data set and will over fit but that is OK for now.)
So I expect that I should be able to use the "from_logits" parameter in the compile function. But it does not recognize that parameter.
Also I would like to know why it works with the sigmoid activation and not the softmax activation and how do I get it working with the softmax activation.
Thank you,
Jon.
To use the from_logits in your loss function, you must pass it into the BinaryCrossentropy object initialization, not in the model compile.
You must change this:
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'],
              from_logits=True)
to this:
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
However, if you are using a softmax or sigmoid in the final layer in the network, you do not need from_logits=True. Softmax and sigmoid output normalized values between [0, 1], which are considered probabilities in this context.
See this question for more information: What is the meaning of the word logits in TensorFlow?
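As a rough illustration of what from_logits changes, here is a small pure-Python sketch. The function names are hypothetical and this is not the Keras implementation; it only assumes that a "logit" is the raw score before the sigmoid, and that from_logits=True means the loss applies the sigmoid itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y_true, p):
    # Expects p in (0, 1), e.g. the output of a sigmoid layer.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    # Numerically stable form of the same loss, applied to the raw score z.
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

z = 2.0       # raw output of a final layer with no activation
y_true = 1

loss_a = bce_from_probability(y_true, sigmoid(z))  # sigmoid layer, from_logits=False
loss_b = bce_from_logit(y_true, z)                 # linear layer, from_logits=True
loss_wrong = bce_from_logit(y_true, sigmoid(z))    # sigmoid layer AND from_logits=True

print(loss_a, loss_b, loss_wrong)
```

The first two setups give the same loss. The third one squashes the score twice and yields a different value, which is why a sigmoid output layer should be paired with the default from_logits=False.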
Now to fix your 50% accuracy issue with softmax, change the following code from this:
model.add(Dense(1, activation='softmax'))
to this:
model.add(Dense(2, activation='softmax')) # number of units = number of classes
Remember that when you are using softmax, you are outputting the probability of the example belonging to each class. For this reason, you need a unit for each possible class, which in a binary classification context will be 2 units.
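The reason a single softmax unit cannot learn anything can be checked with a few lines of plain Python (a sketch with a hand-written softmax, not the Keras one): normalizing a vector with one entry always gives 1.0, whatever the input is.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One output unit: the result is 1.0 for any input score.
single_unit = [softmax([z])[0] for z in (-3.0, 0.0, 4.2)]

# Two output units: the scores compete, so the output depends on the input.
two_units = softmax([-1.0, 2.0])

print(single_unit)
print(two_units)
```

With one unit the model predicts the same class for every example, so on a roughly balanced two-class dataset the accuracy would sit near 50%.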

Cross validation with CNN

I would like to know if my code is doing what I want it to do. To give you some background, I'm implementing a CNN for image classification. I'm trying to use cross validation to compare my different neural network architectures.
Here is the code:
def create_model():
    model = Sequential()
    model.add(Conv2D(24,kernel_size=3,padding='same',activation='relu',
                     input_shape=(96,96,1)))
    model.add(MaxPool2D())
    model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
    model.add(MaxPool2D())
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(12, activation='softmax'))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
model = KerasClassifier(build_fn=create_model, epochs=5, batch_size=20, verbose=1)
# 3-Fold Crossvalidation
kfold = KFold(n_splits=3, shuffle=True, random_state=2019)
results = cross_val_score(model, train_X, train_Y_one_hot, cv=kfold)
model.fit(train_X, train_Y_one_hot,validation_data=(valid_X, valid_label),class_weight=class_weights)
y_pred = model.predict(test_X)
test_eval = model.evaluate(test_X, y_pred, verbose=0)
I found the cross-validation part on the internet, but I have some trouble understanding it.
My questions: 1 => Can I use cross validation to improve my accuracy? For example, I run my neural network 10 times and my model keeps the weights from the run where the best accuracy occurred.
2 => If I understand correctly, in the code above, results runs my CNN 3 times and shows me the accuracy. But when I use model.fit, the model is run only once. Am I right?
Thanks for your help
Not really. Cross-validation is more a way to guard against overfitting and against being misled by abnormal results from a badly split dataset, so that you get a relevant estimate of your model's performance. If you want to tune the hyperparameters of your model, you are better off using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.
When you call cross_val_score, sklearn does a fit and then a predict/evaluate for each train/test split.
Each split gets a new instance of the model,
so you have 1 fit and then 1 predict/evaluate per instance.
Otherwise your cross-validation would not be valid, because each fit would depend on having been fit on the previous split (and maybe on test data!).
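The "one fresh model per split" idea can be sketched without sklearn. Everything below is illustrative: kfold_indices and cross_val_scores are hand-written helpers, not the sklearn functions, and MajorityClassifier is a toy stand-in for the CNN.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    # Shuffle once, then deal the indices into n_splits disjoint folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

class MajorityClassifier:
    """Toy model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def score(self, X, y):
        return sum(1 for t in y if t == self.label_) / len(y)

def cross_val_scores(make_model, X, y, n_splits=3):
    scores = []
    for train_idx, test_idx in kfold_indices(len(X), n_splits):
        model = make_model()  # fresh, untrained model for every split
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(model.score([X[i] for i in test_idx],
                                  [y[i] for i in test_idx]))
    return scores

X = list(range(12))
y = [0] * 8 + [1] * 4
scores = cross_val_scores(MajorityClassifier, X, y, n_splits=3)
print(scores)
```

Because make_model() is called inside the loop, nothing learned on one split leaks into the next one.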
There are two key terms here that you should get familiarized with:
Hyperparameters
Parameters
Hyperparameters control the general architecture of a model. These are what the programmer or data scientist controls. In case of a CNN, this refers to the number of layers, their configurations, activations, optimizers etc. For a simple polynomial regression model this would be the degree of the polynomial.
Parameters refer to the actual values of weights or coefficients that the model ends up with after it solves the optimization using gradient descent or whatever method you use. In a CNN this would be the weights matrix for each layer. For a polynomial regression this would be the coefficients and bias.
Cross validation is used to find the best set of hyperparameters. The best set of parameters is obtained by the optimizer (gradient descent, adam etc) for a given set of hyperparameters and data.
To answer your questions:
You would run cross validation several times, each time with a different hyperparameter configuration (network architecture). That's the only thing you can control. At the end you pick the best architecture based on accuracy. The weights of the model would be different for each fold but finding the best weights is the optimizer's job, not yours.
Yes. In 3-fold CV, the model is trained 3 times and evaluated 3 times. When you call model.fit, the model is trained once on the data you pass in.
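Here is a small sketch of picking a hyperparameter by cross validation, in plain Python. A 1D k-nearest-neighbours classifier stands in for the CNN, and its hyperparameter k plays the role of the network architecture; the data, the candidate values, and the helper names are all invented for the example.

```python
import random
from statistics import mean

def knn_predict(train_x, train_y, x, k):
    # Majority vote among the k training points closest to x.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return 1 if sum(votes) * 2 > len(votes) else 0

def cv_accuracy(xs, ys, k, n_splits=3):
    idx = list(range(len(xs)))
    folds = [idx[i::n_splits] for i in range(n_splits)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_x = [xs[i] for i in idx if i not in held]
        train_y = [ys[i] for i in idx if i not in held]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [1 if x > 5 else 0 for x in xs]

candidates = [1, 3, 5, 15]   # hyperparameter configurations to compare
results = {k: cv_accuracy(xs, ys, k) for k in candidates}
best_k = max(results, key=results.get)
print(results, best_k)
```

Once the best configuration is chosen this way, the usual next step is to train one final model with it on all of the training data.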

Why is binary_crossentropy performing better than categorical_crossentropy for multiclass classification in Keras?

I've seen many similar issues on Stack Overflow, but none of them matches my case.
I have a multiclass classification problem and my labels are mutually exclusive.
Training with binary_crossentropy, due to a typo, resulted in lower loss and higher accuracy. What's interesting here is that, unlike other issues on Stack Overflow, I am printing Keras's "categorical_accuracy". My labels are one-hot encoded.
So, to be exact, my code looks like this:
net = Sequential()
net.add(TimeDistributed(model_A, input_shape=(timesteps,960, 75, 1)))
net.add(LSTM(100))
net.add(Dropout(0.5))
net.add(Dense(100, activation='relu'))
net.add(Dense(len(labels), activation='softmax'))
net.compile(loss='binary_crossentropy', optimizer=adam_opt, metrics=['binary_accuracy', 'categorical_accuracy'])
I also tried training with "categorical_crossentropy" once I noticed the typo, and the results were worse. How can this be explained?
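One thing that may be worth checking, at least for the loss part of the question: the two losses are not on the same scale, so their values cannot be compared directly. The pure-Python sketch below assumes binary crossentropy is averaged over all class outputs (as Keras does over the last axis, to my knowledge); the probabilities are made up. It says nothing about why categorical_accuracy would differ.

```python
import math

def categorical_crossentropy(one_hot, probs):
    # Only the probability of the true class contributes.
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

def mean_binary_crossentropy(one_hot, probs):
    # Every output is treated as its own yes/no problem, then averaged.
    per_output = [-(t * math.log(q) + (1 - t) * math.log(1 - q))
                  for t, q in zip(one_hot, probs)]
    return sum(per_output) / len(per_output)

one_hot = [1, 0, 0]
probs = [0.7, 0.2, 0.1]   # same softmax output scored by both losses

cce = categorical_crossentropy(one_hot, probs)
bce = mean_binary_crossentropy(one_hot, probs)
print(cce, bce)
```

For this example the averaged binary loss comes out smaller than the categorical one even though the prediction is identical.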

Keras - what accuracy metric should be used along with sparse_categorical_crossentropy to compile model

When I had 2 classes, I used binary_crossentropy as the loss to compile the model, like this:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
But right now I have 5 classes and I'm not using one-hot encoded features, so I chose sparse_categorical_crossentropy as the loss. But what should the accuracy metric be? The Keras metrics source code suggests there are multiple accuracy metrics available. I tried:
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['sparse_categorical_accuracy'])
So is this correct, or should I just use categorical_accuracy?
sparse_categorical_accuracy is the correct metric for
sparse_categorical_crossentropy.
But why are you using sparse_categorical_crossentropy? What kind of classes do you have? sparse_categorical_crossentropy is used for integer targets. But if you have a one-hot-encoded target, you should use categorical_crossentropy as the loss function and accuracy or categorical_accuracy for metrics.
UPDATE:
Use the following code for your classification problem:
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
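Note that categorical_crossentropy expects one-hot targets, so integer labels would have to be converted first. The plain-Python sketch below (hand-written helpers, made-up probabilities) shows the conversion and that both label formats lead to the same loss value for the same prediction.

```python
import math

def to_one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(one_hot, probs):
    # Target given as a one-hot vector.
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    # Target given as an integer class index.
    return -math.log(probs[label])

probs = [0.05, 0.10, 0.60, 0.20, 0.05]   # softmax output for 5 classes
label = 2                                # integer target

one_hot = to_one_hot(label, num_classes=5)
dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(label, probs)
print(one_hot, dense_loss, sparse_loss)
```

So the choice between the two comes down to how the targets are stored, not to a different loss.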

Accessing gradient values of keras model outputs with respect to inputs

I made a pretty simple NN model to do some non-linear regressions for me in Keras, as an introduction exercise. I uploaded my Jupyter notebook as a gist here (renders properly on GitHub), which is pretty short and to the point.
It just fits the 1D function y = (x - 5)^2 / 25.
I know that Theano and Tensorflow are, at their core, graph based derivative (gradient) passing frameworks. And utilizing the gradients of loss functions with respect to weights for gradient step-based optimization is the main purpose of that.
But what I'm trying to get a sense of is whether I have access to something that, given a trained model, can approximate derivatives of the output layer with respect to the inputs for me (not the weights or loss function). So for this case, I would want y' = 2(x-5)/25.0 estimated via the network's derivative graph for me for an indicated value of the input x, in the network's currently trained state.
Do I have any options in either the Keras or Theano/TF backend APIs to do this, or do I need to do my own chain ruling somehow with the weights (or maybe adding my own non-trainable "identity" layers or something)? In my notebook, you can see me trying a few approaches based on what I was able to find so far, but without a ton of success.
To make it concrete, I have a working keras model with the structure:
model = Sequential()
# 1d input
model.add(Dense(64, input_dim=1, activation='relu'))
model.add(Activation("linear"))
model.add(Dense(32, activation='relu'))
model.add(Activation("linear"))
model.add(Dense(32, activation='relu'))
# 1d output
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])
model.fit(x, y,
          batch_size=10,
          epochs=25,
          verbose=0,
          validation_data=(x_test, y_test))
I would like to estimate the derivative of output y with respect to input x at, say, x = 0.5.
All of my attempts to extract gradient values based on searching for past answers have led to syntax errors. From a high level point of view, is this a supported feature of Keras, or is any solution going to be backend-specific?
As you mention, Theano and TF are symbolic, so doing a derivative should be quite easy:
import theano
import theano.tensor as T
import keras.backend as K
J = T.grad(model.output[0, 0], model.input)
jacobian = K.function([model.input, K.learning_phase()], [J])
First you compute the symbolic gradient (T.grad) of the output given the input, then you build a function that you can call and does the computation. Note that sometimes this is not that trivial due to shape problems, as you get one derivative for each element in the input.
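A backend-independent way to sanity-check whatever the symbolic gradient returns is a central finite difference. This is a different technique from T.grad (numerical, not symbolic), and the sketch below uses the known target function as a stand-in for the trained network; with a real model you would pass a wrapper around its prediction call instead.

```python
def central_difference(f, x, h=1e-4):
    # Numerical estimate of df/dx at x.
    return (f(x + h) - f(x - h)) / (2.0 * h)

def target_function(x):
    # Stand-in for the trained network: the function it was fitted to.
    return (x - 5.0) ** 2 / 25.0

x0 = 0.5
estimate = central_difference(target_function, x0)
exact = 2.0 * (x0 - 5.0) / 25.0   # analytic derivative from the question

print(estimate, exact)
```

If the gradient from the backend and the finite-difference estimate disagree by much at the same x, the shapes or the indexing of the symbolic gradient are the first thing to check.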
