Custom loss function implementation issue in keras - python

I am implementing a custom loss function in Keras. The output of the model is a 10-dimensional softmax layer. To calculate the loss, I first need to find the index of the neuron firing 1 in y_true and in y_pred, and then take the difference between the two indices. I'm doing the following:
from keras import backend as K

def diff_loss(y_true, y_pred):
    # find the indices of the neurons firing 1
    true_ind = K.tf.argmax(y_true, axis=0)
    pred_ind = K.tf.argmax(y_pred, axis=0)
    # cast to float32 so the difference can be taken
    x = K.tf.cast(true_ind, K.tf.float32)
    y = K.tf.cast(pred_ind, K.tf.float32)
    return K.abs(x - y)
but it raises:

ValueError: None values not supported.
What's the problem here?

This happens because your loss function is not differentiable. argmax returns integer indices, which are piecewise constant with respect to the model's weights, so no gradient can flow through them.
There is simply no solution for this if you want argmax as the result.
An approach to test
Since you're using "softmax", that means that only one class is correct (you don't have two classes at the same time).
And since you want index differences, maybe you could work with a single continuous result (continuous values are differentiable).
Work with only one output ranging from -0.5 to 9.5, and take the classes by rounding the result.
That way, you can have the last layer with only one unit:
lastLayer = Dense(1, activation='sigmoid', ...)  # or another kind of layer if it's not dense
And change the range with a lambda layer:
lambdaLayer = Lambda(lambda x: 10*x - 0.5)
Now your loss can be a simple 'mae' (mean absolute error).
The downside of this attempt is that the 'sigmoid' activation is not evenly distributed across the classes: some classes will be more probable than others. But since it's important to have a bounded output, it seems at first to be the best idea.
This will only work if your classes follow a logical increasing sequence. (I guess they do; otherwise you'd not be trying this kind of loss, right?)
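Putting the pieces together, a minimal sketch of the idea (the input size, hidden layer, and optimizer are made up, not from the question):

import numpy as np
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input(shape=(20,))                   # hypothetical input size
h = Dense(64, activation='relu')(inp)      # hypothetical hidden layer
raw = Dense(1, activation='sigmoid')(h)    # single unit in (0, 1)
out = Lambda(lambda x: 10 * x - 0.5)(raw)  # rescale to (-0.5, 9.5)

model = Model(inp, out)
model.compile(optimizer='adam', loss='mae')
# predicted class = np.round(model.predict(x))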

Related

Keras custom loss function with argwhere-like check

I am trying to create a custom loss function in Keras for a generator that generates a matrix. The matrix consists of a high number of ordinary elements and a low number of centers. Centers have high values compared to the other elements: elements have values < 0.1, while centers should reach values > 0.5. It is important that the centers are at exactly the correct indices, while it is less important to fit the other elements. That is why I am trying to create a loss that would do the following:
1. Select all elements from y_true where the value is > 0.5; in numpy I would do indices = np.argwhere(y_true > 0.5).
2. Compare the values at those indices for y_true and y_pred, something like loss = K.square(y_pred[indices] - y_true[indices]).
3. Select all other elements: indices_low = np.argwhere(y_true < 0.5).
4. Same as step 2, saved e.g. as loss_low.
5. Return the weighted loss, i.e. return loss*100 + loss_low, simply to give higher weight to the more important data.
However, I cannot find a way to achieve this with the Keras backend. I have found a question about tf.where and tried to look for something similar to my problem, but there seems to be nothing like tf.argwhere (I can't find it in the docs, nor by browsing the net/SO). So how can I achieve this?
Note that the number and positions of centers can vary, and the generator is bad at the start, so it will generate none at all or far more than there really should be; that is why I think I can't simply use tf.where. I might be wrong here, as I am new to custom loss functions; any thoughts are welcome.
EDIT
In the end it seems K.tf.where was exactly what I was looking for, so I tried it out:
def custom_mse():
    def mse(y_true, y_pred):
        indices = K.tf.where(y_true > 0.5)
        loss = K.square(y_true[indices] - y_pred[indices])
        indices = K.tf.where(y_true < 0.5)
        loss_low = K.square(y_true[indices] - y_pred[indices])
        return 100 * loss + loss_low
    return mse
but this keeps throwing an error:
ValueError: Shape must be rank 1 but is rank 3 for 'loss_1/Generator_loss/strided_slice' (op: 'StridedSlice') with input shapes: [?,?,?,?], [1,?,4], [1,?,4], [1].
How can I use the where output?
After a while I finally found the correct solution, so it might help somebody in the future:
Firstly, my code was biased by my long experience with numpy and pandas, so I expected tensor elements to be addressable as y_true[indices]. There are actually built-in functions, tf.gather and tf.gather_nd, for getting elements of a tensor. However, since the two masks select different numbers of elements, the two losses have different sizes, and adding them together raises a shape-mismatch error.
This led me to a different approach, thanks to this Q&A. Studying the code in the accepted answer, I found that tf.where can be used not only to get indices but also to apply masks to tensors. The final solution to my problem is then to apply two masks to the input tensors and calculate two losses: one counting the loss for the higher values and one counting the loss for the lower values, then multiplying the loss that should have the higher weight.
import tensorflow as tf
from keras import backend as K

def custom_mse():
    def mse(y_true, y_pred):
        # loss over the important elements (centers, > 0.5)
        great = K.tf.greater(y_true, 0.5)
        loss = K.square(tf.where(great, y_true, tf.zeros(tf.shape(y_true)))
                        - tf.where(great, y_pred, tf.zeros(tf.shape(y_pred))))
        # loss over the remaining elements (< 0.5)
        lower = K.tf.less(y_true, 0.5)
        loss_low = K.square(tf.where(lower, y_true, tf.zeros(tf.shape(y_true)))
                            - tf.where(lower, y_pred, tf.zeros(tf.shape(y_pred))))
        return 100 * loss + loss_low
    return mse
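For reference, a hedged usage sketch (the model name is a placeholder): because custom_mse is a factory, you pass its return value to compile:

model.compile(optimizer='adam', loss=custom_mse())  # 'model' is your generator network

Keras then averages the returned per-element loss tensor, so the weighted sum above behaves like a weighted MSE.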

tensorflow 2 keras shuffle each row gradient problem

I need a NN that gives the same output for any permutation of the same input. I searched for a solution ('permutation invariance') and found some layers, but failed to make them work.
So I chose a different approach: I want to create a layer, added as the first one in the model, that randomly shuffles the input (each row independently). Please let's follow this approach; I know it can be done outside the model, but I want it as part of the model. I tried:
class ShuffleLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ShuffleLayer, self).__init__(**kwargs)

    def call(self, inputs):
        batchSize = tf.shape(inputs)[0]
        cols = tf.shape(inputs)[-1]
        # row index repeated once per column
        order0 = tf.tile(tf.expand_dims(tf.range(0, batchSize), -1), [1, cols])
        # an independent random column permutation for each row
        order1 = tf.argsort(tf.random.uniform(shape=(batchSize, cols)))
        indices = tf.stack([tf.reshape(order0, [-1]),
                            tf.reshape(order1, [-1])], axis=-1)
        outputs = tf.reshape(tf.gather_nd(inputs, indices), [batchSize, cols])
        return outputs
I am getting the following error:

ValueError: Variable has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.

How can I avoid it? I tried to use tf.stop_gradient, but unsuccessfully.
Use Lambda layers:
First of all, if your layer doesn't have trainable weights, you should use a Lambda layer, not a custom layer. It's way simpler and easier.
def shuffleColumns(inputs):
    batchSize = tf.shape(inputs)[0]
    cols = tf.shape(inputs)[-1]
    order0 = tf.tile(tf.expand_dims(tf.range(0, batchSize), -1), [1, cols])
    order1 = tf.argsort(tf.random.uniform(shape=(batchSize, cols)))
    indices = tf.stack([tf.reshape(order0, [-1]),
                        tf.reshape(order1, [-1])], axis=-1)
    outputs = tf.reshape(tf.gather_nd(inputs, indices), [batchSize, cols])
    return outputs
In the model, use a Lambda(shuffleColumns) layer.
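For example, a minimal sketch of where it goes (the column count and layer sizes are made up):

import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, Dense
from tensorflow.keras.models import Model

inp = Input(shape=(8,))          # hypothetical number of columns
x = Lambda(shuffleColumns)(inp)  # shuffles each row independently
x = Dense(16, activation='relu')(x)
out = Dense(1)(x)
model = Model(inp, out)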
About the error
If this is the first layer, this error is probably not caused by this layer. (Unless newer versions of Tensorflow are demanding that custom layers have weights and def build(self, input_shape): defined, which doesn't seem very logical).
It seems you are doing something else somewhere: you are using some operation that blocks backpropagation because its derivative does not exist.
Since the derivatives are taken with respect to the model's "weights", this means that the operation is necessarily after the first weight tensor in the model (ie: after the first layer that contains trainable weights).
You need to search for anything in your model that doesn't have derivatives, like the error suggests: round, argmax, conditionals that return constants, losses that return sorted y_true but don't return operations on y_pred, etc.
Of course, K.stop_gradient is also an operation that blocks backpropagation and will certainly cause this error if you just use it like that. (It may even be the cause of your problem, not the solution.)
Below there are easier suggestions for your operation, but none of them will fix this error because this error is somewhere else.
Suggested operation 1
Now, it would be way easier to use tf.random.shuffle for this:
def shuffleColumns(x):
    x = tf.transpose(x)       # shuffle works on the first axis, so flip to columns
    x = tf.random.shuffle(x)  # one random permutation of the columns per batch
    return tf.transpose(x)
Use a Lambda(shuffleColumns) layer in your model. It's true that this applies the same permutation to every row within a batch, but every batch will get a different permutation. And since you're going to have many epochs, and you will (I presume) be shuffling samples between epochs (this is automatic in fit), you will hardly ever have repeated batches. So:
each batch will have a different permutation
it will be almost impossible to have the same batch two times
This approach will probably be way faster than yours.
Suggested operation 2
If you want the model to be permutation invariant, why not use tf.sort instead of random permutations? Sort the columns: instead of having infinitely many permutations to train on, you simply eliminate any possibility of permutation. The model should learn faster, and the order of the columns in your input will not be taken into account.
Use the layer Lambda(lambda x: tf.sort(x, axis=-1))
This suggestion must be used both in training and inference.
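A quick standalone check (values made up) shows why this works: two permutations of the same row become identical after sorting, so the rest of the model never sees the original order:

import tensorflow as tf

a = tf.constant([[3., 1., 2.]])
b = tf.constant([[2., 3., 1.]])  # a permutation of the same row
print(tf.sort(a, axis=-1))       # [[1. 2. 3.]]
print(tf.sort(b, axis=-1))       # [[1. 2. 3.]] -- identical input downstream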

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there were an example in Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like this in Python code - none of the examples I have found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions, each based on a different random set of dropped-out nodes. Then you can look at the distribution of the outcomes and interpret its spread as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
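A hedged sketch of that procedure in modern tf.keras (model, x, and the number of passes T are assumptions, not from the paper): run the trained model several times with dropout kept active, then use the spread of the predictions as the uncertainty estimate:

import numpy as np

T = 100  # number of stochastic forward passes
# `model` is assumed to contain Dropout layers; training=True keeps them active
preds = np.stack([model(x, training=True).numpy() for _ in range(T)])
mean = preds.mean(axis=0)  # predictive mean
var = preds.var(axis=0)    # spread across dropout masks, used as uncertainty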
Since I've found nothing simple to implement, I wrote something myself that models this explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.python.ops import math_ops
from keras import backend as K

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss that makes the interleaved values in the last axis of y_pred
    approximate the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]      # even positions hold the predicted means
    variance = y_pred[..., 1::2]  # odd positions hold the predicted variances
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension: a mean and a variance for each value in the label. The loss function consists of two parts: a mean squared error that pushes the predicted mean towards the label value, and a term that pushes the predicted variance towards the squared deviation of the label from the predicted mean.
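To make the wiring concrete, here is a hedged sketch (sizes are made up) of a model whose last layer produces the interleaved mean/variance pairs the slicing above expects:

from keras.layers import Input, Dense
from keras.models import Model

D = 3                          # hypothetical label dimension
inp = Input(shape=(10,))       # hypothetical input size
h = Dense(32, activation='relu')(inp)
out = Dense(2 * D)(h)          # [mean_0, var_0, mean_1, var_1, ...]
model = Model(inp, out)
model.compile(optimizer='adam', loss=meanAndVariance)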
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on a sampling-free approximation of Monte-Carlo dropout:
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance through the network via error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.

How to use Softmax Activation function within a Neural Network

My understanding so far: an activation function is applied to a neuron. What goes into the function is the sum over each connected neuron's value times the connecting weight; a single value enters the function, and a single value is returned from it. That understanding works fine with tanh and sigmoid. I now know how softmax works, how it sums the values and everything related to it. What confuses me is that softmax takes an array of numbers, so I started asking myself: what are the sources of the numbers that form this array?
The following picture gives more insight into the question
Softmax works on an entire layer of neurons, and must have all their values to compute each of their outputs.
The softmax function looks like softmax_i(v) = exp(v_i)/sum_j(exp(v_j)), where v would be your neuron values (in your image, [0.82, 1.21, 0.74]), and exp is just exp(x) = e^x. Thus, exp(v_i) would be [2.27, 3.35, 2.096]. Divide each of those values by the sum of the entire vector, and you get [0.29, 0.43, 0.27]. These are the activation outputs of your neurons.
This is useful because the values add up to 1 (forgive the rounding errors in the example above that sum to 0.99... you get the idea), and thus can be interpreted as probabilities, e.g., the probability that an image is one particular class (when it can only belong to one class). That's why the computation needs to know the values of the entire vector of neurons, and can't be computed if you only know the value of a single neuron.
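The arithmetic is easy to check in a few lines of numpy:

import numpy as np

v = np.array([0.82, 1.21, 0.74])  # the neuron values from the example
e = np.exp(v)                     # [2.27, 3.35, 2.096]
print(e / e.sum())                # [0.29, 0.43, 0.27], sums to 1.0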
Note that, because of this, you don't usually have another layer after the softmax. Usually, softmax is applied as the activation on your output layer, not on a middle layer like you show. That said, it's perfectly valid to build a network the way you show; you'll just have another weight layer going to your single output neuron, and you'll no longer have any guarantee about what that output value might be. A more typical architecture would be something like 2 neurons -> 3 neurons (sigmoid) -> 4 neurons (softmax), and now you have the probability that your input falls into one of four classes.

Why does Theano throw NaNs when I use dropouts?

I am training a simple feed-forward model with 3 or 4 hidden layers and dropout between each (hidden layer + non-linearity) combination.
Sometimes, after a few epochs (about 10-11), the model starts outputting Infs and NaNs as the NLL error, and the accuracy falls to 0.0%. This problem does not happen when I do not use dropout. Is this a known issue with dropout in Theano? The way I implement dropout is:
def drop(self, input):
    mask = self.theano_rng.binomial(n=1, p=self.p, size=input.shape,
                                    dtype=theano.config.floatX)
    return input * mask
where input is the feature vector to which we want to apply dropout.
I have also observed that the NaNs occur earlier when more units are dropped (in this implementation self.p is the keep probability): p = 0.5 causes NaNs around epoch 1 or 2, while p = 0.7 causes them around epoch 10 or 11.
Also, the NaNs only occur when the hidden layers are large. For example, (800, 700, 700) gives NaNs whereas (500, 500, 500) does not.
In my experience, NaNs during training usually happen for one of two reasons:
First, a mathematical error, e.g. taking the log of a negative value. This can happen when you use log() in your loss function.
Second, a value becomes too large to represent, overflowing to Inf and then propagating NaNs.
In your case, from your good observations, I think it's the second case: your loss value may be growing too large to handle. Try initializing smaller weights when you expand your network, or use a different weight-initialization approach such as those described by Glorot (2010) or He (2015). Hope it helps.
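As an illustration of the suggested fix (a sketch only; the sizes mirror the question's (800, 700, 700) network), Glorot-uniform initialization keeps the initial weights small even for wide layers:

import numpy as np

def glorot_uniform(n_in, n_out, rng=np.random):
    # Glorot & Bengio (2010): sample from U(-limit, limit),
    # where limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W1 = glorot_uniform(800, 700)  # first hidden layer of the (800, 700, 700) net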
