I am trying to better understand the decision boundary of a binary classifier by getting an explicit set of points on the decision boundary. The approach I am taking is as follows:
I have a simple feedforward neural network with softmax on the final activation layer trained on one-hot encoded data. It returns a vector (1,0) for the first class and (0,1) for the second class.
model1 = keras.Sequential([
layers.Flatten(input_shape=(2,)),
layers.Dense(20, activation='relu'),
layers.Dense(20, activation='relu'),
layers.Dense(20, activation='relu'),
layers.Dense(2, activation='softmax')
])
If the model is very confident that the datapoint is in the first class the vector, before applying the softmax function, will look like (0.9, 0.1). If it is less certain, then the vector before the softmax function will look like (0.6, 0.4). If it is a point on the decision boundary then the vector will look like (0.5, 0.5). If it is certain that our vector is in the second class it would look something like (0.1,0.9) before softmax activation.
In the different cases we can evaluate the difference for a vector (x1,x2) as x1-x2. In the first case this gives us 0.9-0.1=0.8, in the second 0.6-0.4=0.2, in the third 0.5-0.5=0, in the fourth 0.1-0.9=-0.8.
The observation here is that the set of inputs for which this difference is 0 defines the decision boundary of our machine learning problem, which I am trying to understand better.
Is there a way to extract the output of the machine learning model before it passes through the softmax layer to evaluate this difference?
If the above is not possible, can I use transfer learning to achieve the same goal? In particular can I train model1 as above and construct a model2
model2 = keras.Sequential([
...
layers.Dense(2)
])
and transfer the weights from model1 to model2 to evaluate this difference?
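One observation that may help with the question above, offered as a sketch and not as a Keras recipe: for a two-class softmax, the difference of the pre-softmax values can be recovered from the probabilities themselves, because log(p1) - log(p2) = z1 - z2. The helper names `softmax` and `logit_difference` and the example values are illustrative assumptions, not part of Keras or of the original model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_difference(probs):
    """Recover z1 - z2 from two-class softmax probabilities (p1, p2)."""
    p1, p2 = probs
    return math.log(p1) - math.log(p2)

# Hypothetical pre-softmax outputs (logits) for three inputs.
examples = [(2.0, -1.0), (0.3, 0.3), (-1.5, 0.5)]
for z1, z2 in examples:
    p = softmax([z1, z2])
    recovered = logit_difference(p)
    # p1 - p2 and z1 - z2 always share the same sign,
    # so both are zero on exactly the same decision boundary.
    print(f"z1-z2={z1 - z2:+.3f}  p1-p2={p[0] - p[1]:+.3f}  recovered={recovered:+.3f}")
```

If the probabilities saturate to exactly 0 or 1 in float32, the logarithm breaks down; in that case reading the values from a copy of the network whose last layer has no activation is likely the more robust route.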
I've been trying to build an image classifier with CNN. There are 2300 images in my dataset and two categories: men and women. Here's the model I used:
early_stopping = EarlyStopping(min_delta = 0.001, patience = 30, restore_best_weights = True)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:], activation = 'relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:], activation = 'relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Flatten()) # this converts our 3D feature maps to 1D feature vectors
model.add(tf.keras.layers.Dense(64))
model.add(tf.keras.layers.Dense(1, activation='softmax'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
h= model.fit(xtrain, ytrain, validation_data=(xval, yval), batch_size=32, epochs=30, callbacks = [early_stopping], verbose = 0)
The accuracy of this model is 0.501897 and the loss is 7.595693 (the model is stuck on these numbers in every epoch), but if I replace the softmax activation with sigmoid, the accuracy is about 0.98 and the loss 0.06. Why does such a strange thing happen with softmax? All the information I could find said that these two activations are similar and that softmax is even better, but I couldn't find anything about such an abnormality. I'll be glad if someone could explain what the problem is.
Summary of your results:
a) CNN with Softmax activation function -> accuracy ~ 0.50, loss ~ 7.60
b) CNN with Sigmoid activation function -> accuracy ~ 0.98, loss ~ 0.06
TLDR
Update:
Now that I also see you are using only 1 output neuron with Softmax, you will not be able to capture the second class in binary classification. With Softmax you need to define K neurons in the output layer - where K is the number of classes you want to predict. Whereas with Sigmoid: 1 output neuron is sufficient for binary classification.
So in short, this is what should change in your code when using softmax for 2 classes:
#use 2 neurons with softmax
model.add(tf.keras.layers.Dense(2, activation='softmax'))
Additionally:
When doing binary classification, a sigmoid function is more suitable as it is simply computationally more efficient than the more general softmax function (which is normally used for multi-class prediction when you have K>2 classes).
Further Reading:
Some attributes of selected activation functions
If the short answer above is not enough for you, I can share with you some things I've learned from my research about activation functions with NNs in short:
To begin with, let's be clear with the terms activation and activation function
activation (alpha): is the state of a neuron. The state of neurons in hidden or output layers will be quantified by the weighted sum of input signals from a previous layer
activation function f(alpha): is a function that transforms an activation into a neuron signal. It is usually a non-linear and differentiable function, for instance the sigmoid function. Many applications and much research have used the sigmoid function (see Bengio & Courville, 2016, p.67 ff.). Mostly the same activation function is used throughout the neural network, but it is possible to use several (e.g. different ones in different layers).
Now to the effects of activation functions:
The choice of activation function can have an immense impact on the learning of neural networks (as you have seen in your example). Historically it was common to use the sigmoid function, as it was a good function to depict a saturated neuron. Today, especially in CNNs, other activation functions, including piecewise linear ones (like relu), are preferred over the sigmoid function. There are many different functions, just to name some: sigmoid, tanh, relu, prelu, elu, maxout, max, argmax, softmax etc.
Now let's only compare sigmoid, relu/maxout and softmax:
# pseudo code / formula
sigmoid = f(alpha) = 1 / (1 + exp(-alpha))
relu = f(alpha) = max(0,alpha)
maxout = f(alpha) = max(alpha1, alpha2)
softmax = f(alpha_j) = exp(alpha_j) / sum_K(exp(alpha_k))
sigmoid:
in binary classification preferably used for output layer
values can range between [0,1], suitable for a probabilistic interpretation (+)
saturated neurons can eliminate gradient (-)
not zero centered (-)
exp() is computationally expensive (-)
relu:
no saturated neurons in positive regions (+)
computationally less expensive (+)
not zero centered (-)
saturated neurons in negative regions (-)
maxout:
positive attributes of relu (+)
doubles the number of parameters per neuron, normally requires an increased learning effort (-)
softmax:
can be seen as a generalization of the sigmoid function
mainly being used as output activation function in multi-class prediction problems
values range between [0,1], suitable for a probabilistic interpretation (+)
computationally more expensive because of exp() terms (-)
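The formulas compared above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and the sample value `a = 1.7` are made up for the example, and softmax is written with exp() terms in the usual normalized-exponential form. The first print line illustrates the "generalization of sigmoid" remark: a two-way softmax over (a, 0) gives the same value as sigmoid(a).

```python
import math

def sigmoid(a):
    """1 / (1 + exp(-a)), squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    """max(0, a): identity for positive inputs, zero otherwise."""
    return max(0.0, a)

def maxout(a1, a2):
    """Maximum of two activations, each with its own parameters."""
    return max(a1, a2)

def softmax(alphas):
    """exp(alpha_j) / sum_k exp(alpha_k), shifted by the max for stability."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

a = 1.7  # arbitrary example activation
# A two-class softmax over (a, 0) reproduces sigmoid(a).
print(sigmoid(a), softmax([a, 0.0])[0])
# relu clips negatives to zero; maxout picks the larger activation.
print(relu(-2.0), relu(2.0), maxout(-1.0, 0.5))
# Softmax outputs over any number of classes sum to 1.
print(sum(softmax([0.2, -1.0, 3.0])))
```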
Some good references for further reading:
http://cs231n.stanford.edu/2020/syllabus
http://deeplearningbook.org (Goodfellow, Bengio & Courville)
https://arxiv.org/pdf/1811.03378.pdf
https://papers.nips.cc/paper/2018/file/6ecbdd6ec859d284dc13885a37ce8d81-Paper.pdf
The reason why you see those different results is the size of your output layer - it is 1 neuron.
Softmax by definition requires more than 1 output neuron to make sense. A single softmax neuron will always output 1 (look up the formula and think about it). That is why you see ~50% accuracy, since your network always predicts class 1.
Sigmoid doesn't have this problem and can output anything, that's why it trains.
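To make the point about a single softmax neuron concrete, here is a minimal sketch in plain Python (the helper names and the sample inputs are assumptions for illustration, not Keras code). With only one value to normalize, softmax divides exp(z) by itself, so the output is 1 regardless of the input, while sigmoid still varies with z.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary pre-activation values for the single output neuron.
for z in (-5.0, 0.0, 3.2):
    # softmax over one element is always [1.0]; sigmoid depends on z.
    print(z, softmax([z]), round(sigmoid(z), 4))
```

Because the output is constant, no weight update can change the prediction, which is consistent with the loss being stuck at the same value in every epoch.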
If you want to test softmax, you have to make an output neuron for each class and then "one-hot encode" your ytrain and yval (look up one-hot encoding for more explanations). In your case this means: label 0 -> [1, 0], label 1 -> [0, 1]. You can see, the index of the one encodes the class. I'm not sure but in that case I believe you'd use the categorical cross entropy. I was not able to tell conclusively from the docs, but it seems to me that binary cross entropy expects 1 output neuron that's either 0 or 1 (where Sigmoid is the correct activation to use) whereas the categorical cross entropy expects one output neuron for each class, where Softmax makes sense. You could use Sigmoid even for the multioutput case, but it's not common.
So in short, it seems to me that binary xentropy expects the class encoded by the value of the 1 neuron, whereas categorical xentropy expects the class encoded by which output neuron is the most active. (in simplifying terms)
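The label encoding described above (label 0 -> [1, 0], label 1 -> [0, 1]) can be sketched without any library. The function names and the sample labels are hypothetical; in practice Keras provides `tf.keras.utils.to_categorical` for the same purpose.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # the index of the 1 encodes the class
        encoded.append(row)
    return encoded

def decode(rows):
    """Recover the class as the index of the most active output."""
    return [row.index(max(row)) for row in rows]

ytrain = [0, 1, 1, 0]  # hypothetical binary labels
encoded = one_hot(ytrain, 2)
print(encoded)
print(decode(encoded))
```

The `decode` step mirrors how a softmax output is read: the predicted class is the position of the largest output.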
I already trained a neural network with the last layer using sigmoid. If I can not retrain the network with softmax, can I change the final predictions as probability? Now the output of
pred = fin_model.predict_proba(x_train)
is like
array([[0.65247375, 0.45892698],
[0.65919983, 0.4590024 ],
[0.15964866, 0.47771254],
[0.53297156, 0.47564888],
[0.16078213, 0.4779702 ]], dtype=float32)
The sum of each row, e.g. 0.6524+0.4589, is not 1, so the values cannot be probabilities. Is there a way to convert them to probabilities?
The sigmoid function always returns a value between 0 and 1 and is mainly used in binary classification. A sigmoid output close to 0 (<=0.5) marks one class and an output close to 1 (>0.5) marks the other.
To use sigmoid, you need to define final layer as:
model.add(Dense(1, activation='sigmoid'))
However, you can also use softmax activation for binary classification. Softmax converts a vector of values to a probability distribution: each entry of the output vector lies in the range (0, 1) and the entries sum to 1.
It can be declared in final layer as :
model.add(Dense(2, activation='softmax'))
You can get more details on softmax and sigmoid here.
I have a very simple Tensorflow 2 Keras model to do penalized logistic regression on some data. I was hoping to get the probabilties of each class, instead of just the predicted values of [0 or 1].
I think I got what I wanted, but just wanted to make sure that these numbers are what I think they are. I used the model.predict_on_batch() function from Tensorflow.keras, but the documentation just says that this provides a numpy array of predictions. However I believe I am getting probabilities, but I was hoping someone could confirm.
The model code looks like this:
feature_layer = tf.keras.layers.DenseFeatures(features)
model = tf.keras.Sequential([
feature_layer,
layers.Dense(1, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l1(0.01))
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
predictions = model.predict_on_batch(validation_dataset)
print('Predictions for a single batch.')
print(predictions)
So the predictions I am getting look like:
Predictions for a single batch.
tf.Tensor(
[[0.10916319]
[0.14546806]
[0.13057315]
[0.11713684]
[0.16197902]
[0.19613355]
[0.1388464 ]
[0.14122346]
[0.26149303]
[0.12516734]
[0.1388464 ]
[0.14595506]
[0.14595506]]
Now, for predictions in a logistic regression I would expect an array of either 0 or 1, but I am getting floating point values. Moreover, I am getting just a single value per example, when there is actually a probability that the example is a 0 and a probability that the example is a 1. So I would imagine an array of 2 probabilities for each row or example. Of course, Probability(Y = 0) + Probability(Y = 1) = 1, so this might just be some concise representation.
So again, do the values in the array represent the probability that Y = 1 for each example, or something else?
The values represented here:
tf.Tensor(
[[0.10916319]
[0.14546806]
[0.13057315]
[0.11713684]
[0.16197902]
[0.19613355]
[0.1388464 ]
[0.14122346]
[0.26149303]
[0.12516734]
[0.1388464 ]
[0.14595506]
[0.14595506]]
are the probabilities that each example belongs to class 1 (Y = 1).
Since you used sigmoid activation on your last layer, these will
be in the range [0, 1].
Your model is very shallow (few layers) and thus these prediction probabilities are very close between classes. I suggest you add more layers.
Conclusion
To answer your question, these are probabilities but only due to your activation function selection (sigmoid). If you used tanh activation these would be in range [-1,1].
Note that these probabilities are "binary" for each class due to the use of binary_crossentropy loss - aka 10.92% that class 1 is present and 89.08% that it is not, and so on for other classes. If you want the predictions to follow probabilistic rules (sum = 1) then you should consider categorical_crossentropy.
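As a small sketch of the "binary" reading above: with a single sigmoid output p = P(Y = 1), the other class probability is just 1 - p. The helper name `to_two_column` and the 0.5 threshold are assumptions for illustration; the three probabilities are taken from the printed predictions.

```python
def to_two_column(p_class1):
    """Expand P(Y=1) values into [P(Y=0), P(Y=1)] rows."""
    return [[1.0 - p, p] for p in p_class1]

# First three values from the predictions printed above.
p_class1 = [0.10916319, 0.14546806, 0.13057315]

two_col = to_two_column(p_class1)
for row in two_col:
    print(row, "sum =", sum(row))

# Hard 0/1 predictions with an assumed threshold of 0.5.
labels = [1 if p > 0.5 else 0 for p in p_class1]
print(labels)
```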
As part of a project for my studies I want to try and approximate a function f:R^m -> R^n using a Keras neural network (to which I am completely new). The network seems to be learning to some (indeed unsatisfactory) point. But the predictions of the network don't resemble the expected results in the slightest.
I have two numpy-arrays containing the training-data (the m-dimensional input for the function) and the training-labels (the n-dimensional expected output of the function). I use them for training my Keras model (see below), which seems to be learning on the provided data.
inputs = Input(shape=(m,))
hidden = Dense(100, activation='sigmoid')(inputs)
hidden = Dense(80, activation='sigmoid')(hidden)
outputs = Dense(n, activation='softmax')(hidden)
opti = tf.keras.optimizers.Adam(lr=0.001)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=opti,
loss='poisson',
metrics=['accuracy'])
model.fit(training_data, training_labels, verbose = 2, batch_size=32, epochs=30)
When I call the evaluate-method on my model with a set of test-data and a set of test-labels, I get an apparent accuracy of more than 50%. However, when I use the predict method, the predictions of the network do not resemble the expected results in the slightest. For example, the first ten entries of the expected output are:
[0., 0.08193582, 0.13141066, 0.13495408, 0.16852582, 0.2154705 ,
0.30517559, 0.32567417, 0.34073457, 0.37453226]
whereas the first ten entries of the predicted results are:
[3.09514281e-09, 2.20849714e-03, 3.84095078e-03, 4.99367528e-03,
6.06226595e-03, 7.18442770e-03, 8.96730460e-03, 1.03423093e-02, 1.16029680e-02, 1.31887039e-02]
Does this have something to do with the metrics I use? Could the results be normalized by Keras in some intransparent way? Have I just used the wrong kind of model for the problem I want to solve? What does 'accuracy' mean anyway?
Thank you in advance for your help, I am new to neural networks and have been stuck with this issue for several days.
The problem is with this line:
outputs = Dense(n, activation='softmax')(hidden)
We use softmax activation only in classification problems, where we need a probability distribution over the classes as the output of the network. Softmax ensures that the outputs are positive and sum to one (which is true in your case). But I don't think the problem at hand is a classification task: you are just trying to predict continuous target variables, so use a linear activation function instead. So modify the above line to something like this:
outputs = Dense(n, activation='linear')(hidden)
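A quick way to see why softmax cannot fit these targets, as a sketch only: it treats the ten expected values quoted in the question as a complete target vector (an assumption; the real output may be longer) and pretends the last layer's pre-activations already equal the targets. The `softmax` helper is illustrative plain Python, not Keras code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# The ten expected outputs quoted in the question.
targets = [0.0, 0.08193582, 0.13141066, 0.13495408, 0.16852582,
           0.2154705, 0.30517559, 0.32567417, 0.34073457, 0.37453226]

print("sum of targets:", sum(targets))  # well above 1

# Even with ideal pre-activations, softmax rescales everything to sum to 1,
# so it cannot reproduce a target vector whose entries sum to about 2.08.
squashed = softmax(targets)
print("sum after softmax:", sum(squashed))

# A linear (identity) output leaves the values untouched.
linear = list(targets)
print("max abs error, linear :", max(abs(a - b) for a, b in zip(linear, targets)))
print("max abs error, softmax:", max(abs(a - b) for a, b in zip(squashed, targets)))
```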
I am trying to estimate the third band (blue) of an RGB image using convolutional neural networks. My design using Keras is a sequential model with a Convolution2D layer as the input layer, two hidden layers and an output neuron. If I want the loss (RMSE) to be zero, how should I change my model?
My model in Python goes like this:
in_image = skimage.io.imread('test.jpg')[0:50,0:50,:].astype(float)
data = in_image[:,:,0:2]
target = in_image[:,:,2:3]
model1 = keras.models.Sequential()
model1.add(keras.layers.Convolution2D(50,(3,3),strides = (1,1),padding = "same",input_shape=(None,None,2))) #Convolution Layer
model1.add(keras.layers.Dense(50,activation = 'relu')) # Hidden Layer 1
model1.add(keras.layers.Dense(50,activation = 'sigmoid')) # Hidden Layer 2
model1.add(keras.layers.Dense(1)) # Output Layer
adadelta = keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=0.0)
model1.compile(loss='mean_squared_error', optimizer=adadelta) # Compile the model
model1.fit(np.array([data]),np.array([target]),epochs = 5000)
estimated_band = model1.predict(np.array([data]))
Given your problem setup, it looks like you're trying to train a neural network on one image so that it is able to predict the blue channel of an image from the other two channels. Putting aside the usefulness of such an experiment, there are a few important things to get right when training neural networks, including:
1. learning rate
2. weight initialization
3. optimizer
4. model complexity
Yann LeCun's Efficient BackProp is a late-90s paper that discusses numbers 1, 2 and 3. Number 4 rests on the assumption that as the number of free parameters increases, at some point you'll be able to match each parameter to each output.
Note that achieving zero-loss provides no guarantees on generalization nor does it mean that your model will not generalize, as brilliantly described in a paper presented at ICLR.