pulling activation function from MLPRegressor object - python

I'm trying to do dimensionality reduction using the approach in: https://www.cs.toronto.edu/~hinton/science.pdf
This creates an autoencoder whose middle layer consists of 2 nodes.
After training, the neural net is "cut in half" and we just forward propagate until we get 2D data. We can plot that, run KNN on it, and do other fun stuff.
I'm using sklearn, and I can get as far as training the network:
from sklearn.neural_network import MLPRegressor
HLS = (4,2,4)
m_ann = MLPRegressor(hidden_layer_sizes=(HLS), max_iter=10**5, activation='relu')
m_ann.fit(X_train,X_train)
and I can get the NN coefficients with m_ann.coefs_ and m_ann.intercepts_
So I can create a loop that does the matrix multiplication and addition manually, then write my own relu function and call that. BUT what I'd like to do is apply whatever activation function my network uses (for example, if a user trains with a linear or sigmoid activation, I don't want them to have to change any code).
Is it possible to get MLPRegressor to apply an activation function to arbitrary data?
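One way to avoid hard-coding relu is to look up the activation by the name stored on the fitted estimator (its `activation` attribute) and forward propagate only as far as the bottleneck. The sketch below is a minimal illustration, not an official sklearn feature: the `ACTIVATIONS` dictionary and the `forward` helper are hypothetical names written here in plain NumPy, and the random `X_train` is a stand-in for the real data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Plain NumPy versions of the hidden-layer activations MLPRegressor accepts,
# keyed by the same strings as the `activation` constructor argument.
# (Hypothetical helper table, not part of sklearn's public API.)
ACTIVATIONS = {
    "identity": lambda z: z,
    "logistic": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
}


def forward(model, X, n_layers=None):
    """Propagate X through the first n_layers weight matrices of a fitted model."""
    act = ACTIVATIONS[model.activation]
    last = len(model.coefs_) - 1
    n_layers = len(model.coefs_) if n_layers is None else n_layers
    out = np.asarray(X, dtype=float)
    for i in range(n_layers):
        out = out @ model.coefs_[i] + model.intercepts_[i]
        # MLPRegressor's output layer is linear; only hidden layers are activated.
        if i < last:
            out = act(out)
    return out


# Toy data standing in for X_train (assumption: 5 input features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))

m_ann = MLPRegressor(hidden_layer_sizes=(4, 2, 4), activation="tanh",
                     max_iter=500, random_state=0)
m_ann.fit(X_train, X_train)

# Two weight matrices take the input to the 2-node bottleneck.
X_2d = forward(m_ann, X_train, n_layers=2)
```

Calling `forward(m_ann, X_train)` with no `n_layers` runs the whole network and should agree with `m_ann.predict(X_train)`, which is a quick way to check that the manual propagation is right before trusting the 2D codes.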

Related

ResNet family classification layer activation function

I am using the ResNet18 pre-trained model which will be used for a simple binary image classification task. However, all the tutorials including PyTorch itself use nn.Linear(num_of_features, classes) for the final fully connected layer. What I fail to understand is where is the activation function for that module? Also what if I want to use sigmoid/softmax how do I go about that?
Thanks for your help in advance, I am kinda new to Pytorch
No, you do not use an activation in the last layer if your loss function is CrossEntropyLoss, because PyTorch's CrossEntropyLoss combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
Why do they do that?
You actually need the logits (the raw outputs of the linear layer, before softmax or sigmoid) for the loss calculation, so it is a correct design not to have the activation as part of the forward pass. Moreover, for predictions you don't need the softmax, because argmax(linear(x)) == argmax(softmax(linear(x))), i.e. softmax does not change the ordering but only changes the magnitudes (it is a squashing function that converts arbitrary values into the [0,1] range but preserves the ordering).
If you want to use activation functions to add some sort of non-linearity, you normally do that by using a multi-layer NN and having the activation functions in all layers but the last.
Finally, if you are using another loss function such as NLLLoss, PoissonNLLLoss, or BCELoss, then you have to apply the activation that the loss expects yourself (for example, log-softmax for NLLLoss and sigmoid for BCELoss). On the same note, if you are using BCEWithLogitsLoss you don't need to calculate the sigmoid again, because this loss combines a Sigmoid layer and the BCELoss in one single class.
check the pytorch docs to see how to use the loss.
Usually, no ReLU activation function is used in the last layer. The output of the torch.nn.Linear layer is fed to the softmax function of the cross-entropy loss, e.g., by using torch.nn.CrossEntropyLoss. What you may be looking for is the binary-cross-entropy loss torch.nn.BCELoss.
In the tutorials you would see on the internet, people mostly do multi-class classification, for which they use cross-entropy loss, which doesn't require a user-defined activation function at the output. It applies the softmax activation itself (actually, applying an activation function before the cross-entropy is one of the most common mistakes in PyTorch). However, in your case you have a binary classification problem, for which you need to use binary cross-entropy loss, which doesn't apply any activation function by itself, unlike the other one. So you will need to apply a sigmoid activation (or any kind of activation that maps the real numbers to the range (0, 1)) yourself.
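The relationships described in these answers can be checked numerically. The following is a small sketch with made-up shapes: `head` and `bin_head` are hypothetical stand-ins for the final `nn.Linear` layer of a network such as ResNet18, and the features and targets are random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the final fully connected layer (8 features, 3 classes).
head = nn.Linear(8, 3)
features = torch.randn(5, 8)
targets = torch.tensor([0, 2, 1, 1, 0])

logits = head(features)  # raw scores, no activation applied

# CrossEntropyLoss applies log-softmax internally...
ce = nn.CrossEntropyLoss()(logits, targets)
# ...so it matches NLLLoss fed with explicitly log-softmaxed scores.
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

# Softmax preserves the ordering, so the predicted class is unchanged.
same_pred = torch.equal(logits.argmax(dim=1),
                        torch.softmax(logits, dim=1).argmax(dim=1))

# Binary case: one logit per sample.
bin_head = nn.Linear(8, 1)
bin_logits = bin_head(features).squeeze(1)
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])

# BCEWithLogitsLoss takes raw logits; BCELoss needs the sigmoid applied first.
bce_with_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_on_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
```

In both pairs the two losses should agree up to floating-point error, which is why the final layer in the tutorials is left without an activation.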

Keras - Multilabel classification with weights

I am trying to classify some CXR images that have multiple labels per sample. From what I understand, I have to put a dense layer with sigmoid activations at the end and use binary cross-entropy as my loss function. The issue is that there is a large class imbalance (many more normals than abnormals). Here is my model so far:
import keras
from keras_applications.resnet_v2 import ResNet50V2
from keras.layers import GlobalAveragePooling2D, Dense
from keras import Sequential
# shape and label_counts are defined elsewhere in my code
ResNet = Sequential()
ResNet.add(ResNet50V2(input_shape=shape, include_top=False, weights=None,
                      backend=keras.backend,
                      layers=keras.layers,
                      models=keras.models,
                      utils=keras.utils))
ResNet.add(GlobalAveragePooling2D(name='avg_pool'))
ResNet.add(Dense(len(label_counts), activation='sigmoid', name='Final_output'))
As we can see, I am using sigmoid to get an output, but I am a bit confused as to how to implement the weights. I think I need to use a custom loss function that uses BCE with from_logits=True. Something like this:
xent = tf.keras.losses.BinaryCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE)
loss = tf.reduce_mean(xent(targets, pred) * weights)
So it treats the outputs as logits, but what I am unsure about is the activation of the final output. Do I keep the sigmoid activation, or do I use a linear activation (not activated)? I assume we keep the sigmoid and just treat it as a logit, but I am unsure, as PyTorch's "torch.nn.BCEWithLogitsLoss" contains a sigmoid layer.
EDIT: Found this: https://www.reddit.com/r/tensorflow/comments/dflsgv/binary_cross_entropy_with_from_logits_true/
As per: pgaleone
from_logits=True means that the loss function expects a linear tensor
(the output layer of your network without any activation function but
the identity), so you have to remove the sigmoid, since it will be the
loss function itself to apply the softmax to your network output, and
then to compute the cross-entropy
You actually would not want to use from_logits in multilabel classification.
From the documentation [1]:
logits: Per-label activations, typically a linear output. These activation energies are interpreted as unnormalized log probabilities.
So you are right saying that you don't want to use an activation function when it is set to True.
However, the documentation also says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results
Softmax optimizes for one class, per definition. That's how softmax is designed to work. Since you are doing multilabel classification you should use sigmoid, as you mentioned yourself.
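This means that if you keep the sigmoid on the output layer, you cannot also use from_logits=True, because the loss would then apply its own activation (a sigmoid, in the case of binary cross-entropy) on top of the sigmoid output, which is generally not what you want.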
The solution is to remove this line:
from_logits=True,
[1] https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits?version=stable
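To make the "apply the sigmoid exactly once" point concrete, here is a small NumPy sketch of a per-label weighted binary cross-entropy. It is an illustration of the arithmetic only, not the Keras implementation; the function names, the toy logits, and the label weights are all assumptions made up for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_bce_from_logits(y_true, logits, label_weights):
    # Numerically stable form of BCE on raw logits z:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    per_label = (np.maximum(logits, 0.0) - logits * y_true
                 + np.log1p(np.exp(-np.abs(logits))))
    return np.mean(per_label * label_weights)


def weighted_bce_from_probs(y_true, probs, label_weights, eps=1e-7):
    # Same loss, but expects probabilities (sigmoid already applied).
    probs = np.clip(probs, eps, 1.0 - eps)
    per_label = -(y_true * np.log(probs) + (1.0 - y_true) * np.log(1.0 - probs))
    return np.mean(per_label * label_weights)


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))               # 4 samples, 3 labels (toy values)
y_true = rng.integers(0, 2, size=(4, 3)).astype(float)
label_weights = np.array([1.0, 5.0, 2.0])      # hypothetical per-label weights

# Linear output + "from logits" loss ...
loss_logits = weighted_bce_from_logits(y_true, logits, label_weights)
# ... equals sigmoid output + loss on probabilities.
loss_probs = weighted_bce_from_probs(y_true, sigmoid(logits), label_weights)
# Sigmoid output fed to the "from logits" loss applies the sigmoid twice.
loss_double = weighted_bce_from_logits(y_true, sigmoid(logits), label_weights)
```

The first two values should match, while `loss_double` differs: either pairing (linear output with a from-logits loss, or sigmoid output with a loss on probabilities) is consistent, but mixing them is not.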

Exercise 4 by Andrew Ng in Keras.

I am studying some machine learning on my own and I am practicing (in Python) with the assignments of the course held by Andrew Ng.
After completing the fourth exercise by hand, I thought I would do it in Keras to practice with the library.
In the exercise we have 5000 images of handwritten digits, going from 0 to 9. Each image is a 20x20 matrix. The dataset is stored in a matrix X of shape 5000x400 (each image has been 'unrolled') and the labels are stored in a matrix y of shape 5000x10. Each row of y is a one-hot vector.
The exercise asks us to implement backpropagation to maximize the log likelihood for a simple neural network with one input layer, one hidden layer and one output layer. The hidden layer has 25 neurons and the output layer 10. We use sigmoid as the activation for both layers.
My code in Keras is this:
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers
model=Sequential()
model.add(Dense(25,input_shape=(400,),use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.add(Dense(10,use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy',optimizer='sgd',metrics=['accuracy'])
model.fit(X, y, batch_size=5000,epochs=100, verbose=1)
Since I want this to be as similar as possible to the assignment, I have used the same initial weights as the assignment, the same regularization parameter, the same activations and gradient descent as an optimizer (actually the assignment uses the Truncated Newton Method, but I don't think my problem lies here).
I thought I was doing everything correctly, but when I train the network I get 10% accuracy on the training dataset. Even when I play a little with the parameters, the accuracy doesn't change much. To understand the problem better, I tested it with smaller pieces of the dataset. For instance, if I select a subset of 100 elements containing x images of zero and 100-x images of one, I get x% training accuracy. My guess is that the network is optimizing the parameters to recognise only the first digit.
Now my questions are: what am I missing? Why isn't this the right implementation of the neural network described above?
If you are practising on the MNIST dataset to classify 10 digits, you have 10 classes to predict. Rather than sigmoid, you should use ReLU in the hidden layers (in your case, the first layer) and use a softmax activation on the output layer. Use the categorical cross-entropy loss function with the adam or sgd optimizer.

Why does my TensorFlow NN model's predicted values have upper limit?

I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activation, and switching from gradient descent to Adam, and adding L2 regularization... the model predicts same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
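The bound above can be checked numerically. This is a minimal NumPy sketch with made-up weights: `kernel`, `bias`, and the random pre-activations are illustrative values, not taken from the asker's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final linear layer with 6 inputs and a single output.
kernel = rng.normal(size=6)
bias = 0.5

# Inputs to the final layer come from a tanh, so they lie in [-1, 1]
# no matter how large the pre-activations get.
pre_activations = rng.normal(scale=3.0, size=(1000, 6))
inputs = np.tanh(pre_activations)

output = bias + inputs @ kernel

# Bounds from the inequality: bias -/+ sum(abs(kernel)).
lower = bias - np.abs(kernel).sum()
upper = bias + np.abs(kernel).sum()

# With relu the same layer is only bounded on one side per weight sign,
# so its outputs can grow with the pre-activations.
relu_output = bias + np.maximum(pre_activations, 0.0) @ kernel

print(lower, output.min(), output.max(), upper)
```

Every value in `output` falls between `lower` and `upper`, which is the hard cut-off seen in the predictions; `relu_output` has no such fixed ceiling for a given set of weights.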
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, so the problem should not be with the method itself but with the execution. @DomJack explained how the output is restricted for a specific set of parameters, but that only matters for anomalous data. In general, training tunes the parameters so that the model predicts the output correctly.
So the first point is about the quality of the training data. Make sure you have enough training data (and that it is split randomly if you split train/test from one dataset). Also, it may be trivial, but make sure you didn't mess up the input/output values in preprocessing.
Another point is about the size of the network. Make sure you use a large enough hidden layer.

How to use softmax activation function at the output layer, but relus in the middle layers in TensorFlow?

I have a neural net of 3 hidden layers (so I have 5 layers in total). I want to use Rectified Linear Units at each of the hidden layers, but at the outermost layer I want to apply Softmax on the logits. I want to use the DNNClassifier. I have read the official documentation of the TensorFlow where for setting value of the parameter activation_fn they say:
activation_fn: Activation function applied to each layer. If None, will use tf.nn.relu.
I know I can always write my own model and use any arbitrary combination of the activation functions. But as the DNNClassifier is more concrete, I want to resort to that. So far I have:
classifier = tf.contrib.learn.DNNClassifier(
feature_columns=features_columns,
hidden_units=[10,20,10],
n_classes=3
# , activation_fn:::: I want something like below
# activation_fn = [relu,relu,relu,softmax]
)
Sorry to say, but this is not possible using only one DNNClassifier.
As you show in your example, you can supply an activation_fn:
Activation function applied to each layer. If None, will use tf.nn.relu.
But not a separate one for each layer. To solve your problem, you have to chain this classifier to another layer that has the activation function you want (softmax, in your case).
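For comparison, if you do write the model yourself, the per-layer idea from the question (`[relu, relu, relu, softmax]`) is straightforward to express. The sketch below uses plain NumPy with random, untrained weights purely to show the structure; the input width of 4 and all names are assumptions for illustration and have nothing to do with the DNNClassifier API.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# 4 input features (assumed), hidden units [10, 20, 10], 3 classes.
layer_sizes = [4, 10, 20, 10, 3]
# One activation per weight matrix: relu on hidden layers, softmax on the output.
activations = [relu, relu, relu, softmax]

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]


def predict_proba(X):
    out = X
    for W, b, act in zip(weights, biases, activations):
        out = act(out @ W + b)
    return out


probs = predict_proba(rng.normal(size=(6, 4)))
```

Each row of `probs` is a probability distribution over the 3 classes, because only the last layer uses softmax.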
