Autoencoder for Tabular Data with Discrete Values - python

I want to use an autoencoder for dimension reduction in Keras. The input is a table with discrete values 0,1,2,3,4 (each of these numbers shows a category) in the columns. Each subject has a label 0/1 to show sick/healthy. Now I have two questions:
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
I don't know if this kind of input variable needs normalization (and if the answer is yes, how?)

Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
The activation in the last layer should be sigmoid, and you should use the binary_crossentropy loss function for training.
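A minimal sketch of that setup (the layer sizes and encoding dimension below are placeholders, not values from the question):
from tensorflow.keras import layers, models

input_dim = 20        # number of (one-hot encoded) input columns; placeholder value
encoding_dim = 5      # size of the bottleneck; placeholder value

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation='relu')(inputs)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)  # sigmoid in the last layer

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
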
I don't know if this kind of input variable needs normalization (and if the answer is yes, how?)
It depends on the nature of those discrete values. As you know, the inputs to a neural network represent the "intensity" of each neuron; higher values mean the neuron is more active. So categorical values as input to a NN only make sense if they map to a continuous range. For example, if excellent=3, good=2, bad=1, terrible=0, it's okay to feed these values to a NN because it makes sense to calculate f(wx+b) (the intensity of the neuron): a value of 1.5 means somewhere between bad and good.
However, if the categorical values are purely nominal, without any relationship between them (for example: apple=1, orange=2, banana=3), it really doesn't make sense to calculate f(wx+b). In this case, what does the value 1.5 mean? For this type of data as input to a NN, you should convert them to a binary (one-hot) encoding. For example, if you have only 3 fruits you can encode them this way:
apple = [1, 0, 0]
orange = [0, 1, 0]
banana = [0, 0, 1]
For this binary conversion, Keras has a utility function: to_categorical.
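For example, one-hot encoding a column of nominal codes before feeding it to the autoencoder (the array below is made up for illustration):
from tensorflow.keras.utils import to_categorical
import numpy as np

fruits = np.array([0, 2, 1, 0])                 # e.g. 0=apple, 1=orange, 2=banana
one_hot = to_categorical(fruits, num_classes=3)
# one_hot is [[1,0,0], [0,0,1], [0,1,0], [1,0,0]]
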

Related

Labels -1, 0 and 1 for classification in Tensorflow

I am trying to write a model that outputs a vector of length N consisting of the labels -1, 0 and 1. Each of the labels depicts one of three decisions for the system participants (wireless devices). So the vector depicts a system state that is then passed on to an optimization problem in the next step. Due to the fixed problem formulation that expects this output vector, a selection of 0, 1 and 2 instead is not possible.
After coming across this tanh function to supply the -1,0 and 1 values:
1.5 * backend.tanh(alpha * x) + 0.5 * (backend.tanh(-(3 / alpha) * x)) from here, I was wondering how exactly this output layer and the penultimate layer can be built to supply this vector of labels {-1,0,1}. I tried using the above function in the output layer of a simple Iris classifier, but this resulted in terrible accuracy compared to the one achieved with 0,1,2 labels and a softmax output layer.
Thanks in advance,
with kind regards,
Yuka
It doesn't seem like the outputs are actually "numerically related", for lack of a better term. Meaning, the labels could just as well be "left", "right", "up". So I think your best bet is to have 3 output nodes in the final layer with a softmax activation function, each of the three nodes representing one of the three labels, and to use a cross-entropy loss function.
If your training data currently has the target as -1/0/1, you should one-hot encode it so that each target is a vector of length 3. So label 0 might be [0,1,0].
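A minimal sketch of that setup (the hidden-layer size and the input dimension are placeholders, not from the question):
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical

y = np.array([-1, 0, 1, 1, -1])                    # original targets
y_onehot = to_categorical(y + 1, num_classes=3)    # shift -1/0/1 to 0/1/2, then one-hot

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(4,)),  # e.g. 4 Iris features
    layers.Dense(3, activation='softmax')                   # one node per label
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# at inference, map the predicted class index back: label = argmax - 1
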

Pytorch - (Categorical) Cross Entropy Loss using one hot encoding and softmax

I'm looking for a cross entropy loss function in Pytorch that is like the CategoricalCrossEntropyLoss in Tensorflow.
My labels are one hot encoded and the predictions are the outputs of a softmax layer. For example (every sample belongs to one class):
targets = [0, 0, 1]
predictions = [0.1, 0.2, 0.7]
I want to compute the (categorical) cross entropy on the softmax values and not take the max values of the predictions as a label and then calculate the cross entropy. Unfortunately, I did not find an appropriate solution, since Pytorch's CrossEntropyLoss is not what I want and its BCELoss is also not exactly what I need (or is it?).
Does anyone know which loss function to use in Pytorch or how to deal with it?
Many thanks in advance!
I thought Tensorflow's CategoricalCrossentropy was equivalent to PyTorch's CrossEntropyLoss, but it seems not. The former takes one-hot-encoded targets, while the latter takes class indices. The difference, however, is this:
torch.nn.CrossEntropyLoss is a combination of torch.nn.LogSoftmax and torch.nn.NLLLoss(), so it expects raw logits, whereas
tf.keras.losses.CategoricalCrossentropy expects probabilities on which the softmax has already been applied.
Your predictions have already been through a softmax, so only the negative log-likelihood needs to be applied. Based on what was discussed here, you could try this:
import torch.nn as nn
import torch.nn.functional as F

class CategoricalCrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y_hat, y):
        # y_hat: softmax probabilities, y: one-hot targets
        return F.nll_loss(y_hat.log(), y.argmax(dim=1))
In the code above, the one-hot target vector y is converted to a class label with torch.Tensor.argmax.
If that's correct, why not just use torch.nn.CrossEntropyLoss in the first place? You would just have to remove the softmax from your model's last layer and convert your one-hot targets to class labels.
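A sketch of that alternative; the logit values here are made up, and the one-hot target is the one from the question:
import torch
import torch.nn as nn

logits = torch.tensor([[1.2, 0.3, 2.5]])      # raw model outputs, no softmax applied
target = torch.tensor([[0., 0., 1.]])          # one-hot target from the question
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, target.argmax(dim=1))   # convert one-hot to class index
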

TensorFlow: Sample Integers from Gumbel Softmax

I am implementing a program to sample integers from a categorical distribution, where each integer is associated with a probability. I need to ensure that this program is differentiable, so that back propagation can be applied. I found tf.contrib.distributions.RelaxedOneHotCategorical which is very close to what I am trying to achieve.
However, the sample method of this class returns a one-hot vector, instead of an integer. How to write a program that is both differentiable and returns an integer/scalar instead of a vector?
The reason that RelaxedOneHotCategorical is actually differentiable is connected to the fact that it returns a softmax vector of floats instead of the argmax int index. If all you want is the index of the maximal element, you might as well use Categorical.
You can do a dot product of the relaxed one hot vector with a vector of [1 2 3 4 ... n]. The result is going to give you the desired scalar.
For instance if your one hot vector is [0 0 0 1], then dot([0 0 0 1],[1 2 3 4]) will give you 4 which is what you are looking for.
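A minimal sketch of that trick (the relaxed sample values below are made up for illustration):
import tensorflow as tf

relaxed_sample = tf.constant([0.01, 0.02, 0.03, 0.94])   # output of a relaxed one-hot sample
index_vector = tf.constant([1., 2., 3., 4.])              # the vector [1 2 3 ... n]
scalar = tf.reduce_sum(relaxed_sample * index_vector)     # ~3.9, approaches 4 as the sample hardens
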
You can't get what you want in a differentiable manner because argmax isn't differentiable, which is why the Gumbel-Softmax distribution was created in the first place. It allows you, for instance, to use the outputs of a language model as inputs to a discriminator in a generative adversarial network, because the activation approaches a one-hot vector as the temperature decreases.
If you simply need to retrieve the maximal element at inference or testing time, you can use tf.math.argmax. But there's no way to do that in a differentiable manner.

Multi-Task Learning: Train a neural network to have different loss functions for the two classes?

I have a neural net with two loss functions: one is binary cross-entropy for the 2 classes, and the other is a regression loss. Now I want the regression loss to be evaluated only for class_2, and to return 0 for class_1, because the regressed feature is meaningless for class_1.
How can I implement such an algorithm in Keras?
Training it separately on only class_1 data doesn't work, because I get a nan loss. Are there more elegant ways to define the loss to be 0 for one half of the dataset and mean_square_loss for the other half?
This is a question that's important in multi-task learning where you have multiple loss functions, a shared neural network structure in the middle, and inputs that may not all be valid for all loss functions.
You can pass in a binary mask (1 or 0 per sample) for each of your loss functions, in the same way that you pass in the labels, and then multiply each loss by its corresponding mask. The derivative of 1·x is just dx, and the derivative of 0·x is 0, so you end up zeroing out the gradient for the appropriate loss function. Virtually all optimizers are additive, meaning they sum the gradients, and adding a zero is a null operation. Your final loss function should be the sum of all your other losses.
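A sketch of that masking idea in Keras, assuming the 0/1 mask is packed into the target tensor alongside the regression target (the names and shapes here are assumptions, not from the question):
import tensorflow as tf
from tensorflow.keras import backend as K

def masked_mse(y_true_and_mask, y_pred):
    # y_true_and_mask packs the regression target and a 0/1 validity mask side by side
    y_true = y_true_and_mask[:, 0:1]
    mask = y_true_and_mask[:, 1:2]      # 1 for class_2 samples, 0 for class_1 samples
    sq_err = K.square(y_true - y_pred) * mask
    # average over the valid samples only; the epsilon avoids division by zero
    return K.sum(sq_err) / (K.sum(mask) + K.epsilon())
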
I don't know much about Keras. Another solution is to change your loss function to use the labels only: L = cross_entropy * (label / (label + 1e-6)). That term will be either almost 0 or almost 1. Close enough for government work and neural networks, at least. This is what I actually used the first time, before I realized it was as simple as multiplying by an array of mask values.
Another solution to this problem is to use tf.where and tf.gather_nd to select only the subset of labels and outputs that you want to compare, and then pass that subset to the appropriate loss function. I've actually switched to using this method rather than multiplying by a mask, but both work.
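A rough sketch of that selection approach (the tensor names are assumptions; is_class2 plays the role of the per-sample validity flag):
import tensorflow as tf

def class2_only_mse(y_true, y_pred, is_class2):
    # is_class2: 0/1 tensor of shape [batch], 1 where the regression target is valid
    idx = tf.where(tf.cast(is_class2, tf.bool))     # indices of the valid samples
    true_sel = tf.gather_nd(y_true, idx)
    pred_sel = tf.gather_nd(y_pred, idx)
    sq_err = tf.square(true_sel - pred_sel)
    # guard against a batch with no class_2 samples (the mean of an empty tensor is NaN)
    n = tf.maximum(tf.cast(tf.size(sq_err), tf.float32), 1.0)
    return tf.reduce_sum(sq_err) / n
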

How to use Softmax Activation function within a Neural Network

My understanding so far: an activation function is applied to a neuron. What goes into the function is the sum of each connected neuron's value times the connecting weight. A single value enters the function, and a single value is returned from it. This understanding works fine with tanh and sigmoid. Now I know how softmax works, that it sums the values, and everything else related to it. What confuses me is that softmax takes an array of numbers, so I start questioning: what are the sources of the numbers that form this array?
The following picture gives more insight into the question.
Softmax works on an entire layer of neurons, and must have all their values to compute each of their outputs.
The softmax function looks like softmax_i(v) = exp(v_i)/sum_j(exp(v_j)), where v would be your neuron values (in your image, [0.82, 1.21, 0.74]), and exp is just exp(x) = e^x. Thus, exp(v_i) would be [2.27, 3.35, 2.096]. Divide each of those values by the sum of the entire vector, and you get [0.29, 0.43, 0.27]. These are the activation outputs of your neurons.
This is useful because the values add up to 1 (forgive the rounding errors in the example above that sum to 0.99... you get the idea), and thus can be interpreted as probabilities, e.g., the probability that an image is one particular class (when it can only belong to one class). That's why the computation needs to know the values of the entire vector of neurons, and can't be computed if you only know the value of a single neuron.
Note that, because of this, you don't usually have another layer after the softmax. Usually, the softmax is applied as the activation on your output layer, not on a middle layer like you show. That said, it's perfectly valid to build a network the way you show; you'll just have another weight layer going to your single output neuron, and you'll no longer have any guarantee about what that output value might be. A more typical architecture would be something like 2 neurons -> 3 neurons (sigmoid) -> 4 neurons (softmax), and now you have the probability that your input falls into one of four classes.
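A quick numerical check of the worked example above:
import numpy as np

v = np.array([0.82, 1.21, 0.74])   # the neuron values from the image
exp_v = np.exp(v)                   # [2.27, 3.35, 2.10]
softmax = exp_v / exp_v.sum()       # [0.29, 0.43, 0.27], sums to 1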
