ResNet family classification layer activation function - python

I am using the ResNet18 pre-trained model which will be used for a simple binary image classification task. However, all the tutorials including PyTorch itself use nn.Linear(num_of_features, classes) for the final fully connected layer. What I fail to understand is where is the activation function for that module? Also what if I want to use sigmoid/softmax how do I go about that?
Thanks for your help in advance, I am kinda new to Pytorch

No you do not use activation in the last layer if your loss function is CrossEntropyLoss because pytorch CrossEntropyLoss loss combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
They do they do that ?
You actually need logits (output of sigmoid) for loss calculation so it is a correct design to not have it as part of forward pass. More over for predictions you don't need logits because argmax(linear(x)) == argmax(softmax(linear(x)) i.e softmax does not change the ordering but only change the magnitudes (squashing function which converts arbitrary value into [0,1] range, but preserves the partial ordering]
If you want to use activation functions to add some sort of non-linearity you normally do that by using a multi-layer NN and having the activation functions in the last but other layers.
Finally, if you are using other loss function like NLLLoss, PoissonNLLLoss, BCELoss then you have to calculates sigmoid yourself. Again on the same note if you are using BCEWithLogitsLoss you don't need to calculate sigmoid again because this loss combines a Sigmoid layer and the BCELoss in one single class.
check the pytorch docs to see how to use the loss.

Usually, no ReLU activation function is used in the last layer. The output of the torch.nn.Linear layer is fed to the softmax function of the cross-entropy loss, e.g., by using torch.nn.CrossEntropyLoss. What you may be looking for is the binary-cross-entropy loss torch.nn.BCELoss.

In the tutorials you would see on the internet, people mostly do multi-class classification, for which they use cross-entropy loss which doesn't require a user defined activation function at the output. It applies the softmax activation itself (actually applying an activation function before the cross-entropy is one of the most common mistakes in PyTorch). However, in your case you have a binary classification problem, for which you need to use binary cross-entropy loss, which doesn't apply any activation function by itself unlike the other one. So you will need to apply sigmoid activation (or any kind of activation that maps the real numbers to the range (0, 1) yourself.

Related

Why does NN training loss flatten?

I implemented a deep learning neural network from scratch without using any python frameworks like tensorflow or keras.
The problem is no matter what i change in my code like adjusting learning rate or changing layers or changing no. of nodes or changing activation functions from sigmoid to relu to leaky relu, i end up with a training loss that starts with 6.98 but always converges to 3.24...
Why is that?
Please review my forward and back prop algorithms.Maybe there's something wrong in that which i couldn't identify.
My hidden layers use leaky relu and final layer uses sigmoid activation.
Im trying to classify the mnist handwritten digits.
code:
#FORWARDPROPAGATION
for i in range(layers-1):
cache["a"+str(i+1)]=lrelu((np.dot(param["w"+str(i+1)],cache["a"+str(i)]))+param["b"+str(i+1)])
cache["a"+str(layers)]=sigmoid((np.dot(param["w"+str(layers)],cache["a"+str(layers-1)]))+param["b"+str(layers)])
yn=cache["a"+str(layers)]
m=X.shape[1]
cost=-np.sum((y*np.log(yn)+(1-y)*np.log(1-yn)))/m
if j%10==0:
print(cost)
costs.append(cost)
#BACKPROPAGATION
grad={"dz"+str(layers):yn-y}
for i in range(layers):
grad["dw"+str(layers-i)]=np.dot(grad["dz"+str(layers-i)],cache["a"+str(layers-i-1)].T)/m
grad["db"+str(layers-i)]=np.sum(grad["dz"+str(layers-i)],1,keepdims=True)/m
if i<layers-1:
grad["dz"+str(layers-i-1)]=np.dot(param["w"+str(layers-i)].T,grad["dz"+str(layers-i)])*lreluDer(cache["a"+str(layers-i-1)])
for i in range(layers):
param["w"+str(i+1)]=param["w"+str(i+1)] - alpha*grad["dw"+str(i+1)]
param["b"+str(i+1)]=param["b"+str(i+1)] - alpha*grad["db"+str(i+1)]
The implementation seems okay. While you could converge to the same value with different models/learning rate/hyper parameters, what's frightening is having the same starting value everytime, 6.98 in your case.
I suspect it has to do with your initialisation. If you're setting all your weights initially to zero, you're not gonna break symmetry. That is explained here and here in adequate detail.

Keras - Multilabel classification with weights

I am trying to classify some CXR images that have multiple labels per sample. From what I understand I have to put a dense layer with sigmoid activations and use the binary crossentropy as my loss function. The issue is that there is a large class imbalance (Many more normals than abnormals). I am curious here is my model sofar:
from keras_applications.resnet_v2 import ResNet50V2
from keras.layers import GlobalAveragePooling2D, Dense
from keras import Sequential
ResNet = Sequential()
ResNet.add(ResNet50V2(input_shape=shape, include_top=False, weights=None,backend=keras.backend,
layers=keras.layers,
models=keras.models,
utils=keras.utils))
ResNet.add(GlobalAveragePooling2D(name='avg_pool'))
ResNet.add(Dense(len(label_counts), activation='sigmoid', name='Final_output'))
As we can see I am using sigmoid to get an output, but I am a bit confused as to how to implement the weights. I think I need to use a custom loss function that uses BCE(use_logits = true). Something like this:
xent = tf.losses.BinaryCrossEntropy(
from_logits=True,
reduction=tf.keras.losses.Reduction.NONE)
loss = tf.reduce_mean(xent(targets, pred) * weights))
So it treats the outputs as logits, but what I am unsure about is the activation of the final output. Do I keep it with the activation of sigmoid, or do I use a linear activation (not activated)? I assume we keep the sigmoid, and just treat it as a logit, but I am unsure as pytorches "torch.nn.BCEWithLogitsLoss" contains a sigmoid layer
EDIT: Found this: https://www.reddit.com/r/tensorflow/comments/dflsgv/binary_cross_entropy_with_from_logits_true/
As per: pgaleone
from_logits=True means that the loss function expects a linear tensor
(the output layer of your network without any activation function but
the identity), so you have to remove the sigmoid, since it will be the
loss function itself to apply the softmax to your network output, and
then to compute the cross-entropy
You actually would not want to use from_logits in multilabel classification.
From the documentation [1]:
logits: Per-label activations, typically a linear output. These activation energies are interpreted as unnormalized log probabilities.
So you are right saying that you don't want to use an activation function when it is set to True.
However, the documentation also says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results
Softmax optimizes for one class, per definition. That's how softmax is designed to work. Since you are doing multilabel classification you should use sigmoid, as you mentioned yourself.
This means that if you want to use sigmoid, you cannot use from_logits because it would apply softmax after sigmoid which is generally not what you want.
The solution is to remove this line:
from_logits=True,
[1] https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits?version=stable

Where to apply batch normalization on standard CNNs

I have the following architecture:
Conv1
Relu1
Pooling1
Conv2
Relu2
Pooling3
FullyConnect1
FullyConnect2
My question is, where do I apply batch normalization? And what would be the best function to do this in TensorFlow?
The original batch-norm paper prescribes using the batch-norm before ReLU activation. But there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
To your second question: in tensorflow, you can use a high-level tf.layers.batch_normalization function, or a low-level tf.nn.batch_normalization.
There's some debate on this question. This Stack Overflow thread and this keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks such as the one discussed on this torch-residual-networks github issue that show BN performing better after the activation layers.
My current opinion (open to being corrected) is that you should do BN after the activation layer, and if you have the budget for it and are trying to squeeze out extra accuracy, try before the activation layer.
So adding Batch Normalization to your CNN would look like this:
Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling3
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization
In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1 gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to
the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015)
recommend the latter. More specifically, XW+b should be replaced by a
normalized version of XW. The bias term should be omitted because it
becomes redundant with the β parameter applied by the batch
normalization reparameterization. The input to a layer is usually the
output of a nonlinear activation function such as the rectified linear
function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear
operations.
In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It's an open debate. I suggest that you test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.

What is the difference between binary crossentropy and binary crossentropy with logits in keras?

In keras backend we have a flag with_logits in K.binary_crossentropy. What is the difference between normal binary crossentropy and binary crossentropy with logits? Suppose I am using a seq2seq model and my output sequence is of type 100111100011101.
What should I use for an recursive LSTM or RNN to learn from this data provided I am giving a similar sequence in the input along with timesteps?
This depends on whether or not you have a sigmoid layer just before the loss function.
If there is a sigmoid layer, it will squeeze the class scores into probabilities, in this case from_logits should be False. The loss function will transform the probabilities into logits, because that's what tf.nn.sigmoid_cross_entropy_with_logits expects.
If the output is already a logit (i.e. the raw score), pass from_logits=True, no transformation will be made.
Both options are possible and the choice depends on your network architecture. By the way if the term logit seems scary, take a look at this question which discusses it in detail.

How to use softmax activation function at the output layer, but relus in the middle layers in TensorFlow?

I have a neural net of 3 hidden layers (so I have 5 layers in total). I want to use Rectified Linear Units at each of the hidden layers, but at the outermost layer I want to apply Softmax on the logits. I want to use the DNNClassifier. I have read the official documentation of the TensorFlow where for setting value of the parameter activation_fn they say:
activation_fn: Activation function applied to each layer. If None, will use tf.nn.relu.
I know I can always write my own model and use any arbitrary combination of the activation functions. But as the DNNClassifier is more concrete, I want to resort to that. So far I have:
classifier = tf.contrib.learn.DNNClassifier(
feature_columns=features_columns,
hidden_units=[10,20,10],
n_classes=3
# , activation_fn:::: I want something like below
# activation_fn = [relu,relu,relu,softmax]
)
Sorry to say, but this is not possible using only one DNNClassifier.
As you show in your example, you can supply an activation_fn
Activation function applied to each layer. If None, will use tf.nn.relu.
But not a seperate one for each layer. To solve your problem, you have to chain this classifier to another layer that does have the tanh actication function.

Categories

Resources