I am new to tensorflow and have problems defining a custom loss function for a customer churn problem, which includes a list of values as penalty.
So far, I replicated a mean squared error function that penalizes wrong predictions with an integer.
def rfm_penalty(y_true, y_pred):
penalty = # Integer value, to be replaced by list
loss = tf.where(tf.less(y_true * y_pred, 0),
penalty * tf.square(y_true - y_pred), # penalize negat. (wrong) preds
tf.square(y_true - y_pred)) # no penalty for pos. preds
return tf.reduce_mean(loss, axis=-1)
This one works, but I'd like to modify it: Previously, I calculated a metric that measures the value of a customer, called RFM (for recency, frequency and monetary value of past purchases). This metric is an integer value from 3 to 12 that sums up the three metrics R, F and M. It is stored in a feature column of my df_train.
df_train['RFM_Score'] = [3,6,5,9,11,12,4,...,4] # same dimensions as y_true
I would like to use this feature column (or list) as penalty, thus penalizing wrong predictions higher for highly valuable customers. I would be happy for every idea to do that, even better with a sigmoid function, as its a binary classification case.
Thank you!
Related
I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for Tensorflow, but I'm also interested in general answers. I'm also not quite clear how I could write something like in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
Since I've found nothing simple to implement, I wrote something myself, that models that explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work but I'm not quite sure how well that works out in practice, and I'd appreciate feedback. This is my loss function:
def meanAndVariance(y_true: tf.Tensor , y_pred: tf.Tensor) -> tf.Tensor :
"""Loss function that has the values of the last axis in y_true
approximate the mean and variance of each value in the last axis of y_pred."""
y_pred = tf.convert_to_tensor(y_pred)
y_true = math_ops.cast(y_true, y_pred.dtype)
mean = y_pred[..., 0::2]
variance = y_pred[..., 1::2]
res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
return K.mean(res, axis=-1)
The output dimension is twice the label dimension - mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the mean of the label value, and the variance that approximates the difference of the value from the predicted mean.
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also checkout our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow ur idea. Treat the activations as random variables and then propagate mean and variance using error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)
I am pretty new to neural networks. I am training a network in tensorflow, but the number of positive examples is much much less than negative examples in my dataset (it is a medical dataset).
So, I know that F-score calculated from precision and recall is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculation (if I am not wrong). But how do I use this F-score as an error function? Is there a tensorflow function for that? Or I have to create a new one?
Thanks in advance.
It appears approaches for optimising directly for these types of metrics have been devised and used successfully, improving scoring and or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using the sums of probabilities, in place of counts, for the sets of true positives, false positives, and false negative metrics. For example F-beta loss (the generalisation of F1) can be calculated in with Torch in Python as follows:
def forward(self, y_logits, y_true):
y_pred = self.sigmoid(y_logits)
TP = (y_pred * y_true).sum(dim=1)
FP = ((1 - y_pred) * y_true).sum(dim=1)
FN = (y_pred * (1 - y_true)).sum(dim=1)
fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
return 1 - fbeta.mean()
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises for a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. An implementation in TF of such an approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model output) with a binary outcome - like cross-entropy. Ideally, this is calibrated such that it is minimised if the predicted mean matches the population mean (given covariates). These rules are called proper scoring rules, and the cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weigh positive and negative cases differently, two methods are
oversample the minority class and correct predicted probabilities when predicting on new examples. For fancier methods, check the under sampling module of imbalanced-learn to get an overview.
use a different proper scoring rule for training loss. This allows to e.g. build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is review of the subject.
I recommend just using simple oversampling in practice.
the loss value and accuracy is a different concept. The loss value is used for training the NN. However, accuracy or other metrics is to value the training result.
When I use keras's binary_crossentropy as the loss function (that calls tensorflow's sigmoid_cross_entropy, it seems to produce loss values only between [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
relu_logits - logits * labels,
math_ops.log1p(math_ops.exp(neg_abs_logits)), name=name)
implies that the range is from [0, infinity). So is Tensorflow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add() I'd assume it'd be for sure greater than 1. Am I right to assume that loss range can definitely exceed 1?
The cross entropy function is indeed not bounded upwards. However it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound to result in the network outputing approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
This can be seen as follows: The cross entropy for a single data point is -sum(p(k)*log(q(k))) where p are the true probabilities (labels), q are the predictions, k are the different classes and the sum is over the classes. Now, with hard labels (i.e. one-hot encoded) only a single p(k) is 1, all others are 0. Thus, the term reduces to -log(q(k)) where k is now the correct class. If with a randomly initialized network q(k) ~ 1/n, we get -log(1/n) = log(n).
We can also go of the definition of the cross entropy which is generally entropy(p) + kullback-leibler divergence(p,q). If p and q are the same distributions (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks) then the KL divergence becomes 0 and we are left with entropy(p).
Now, since the training objective is usually to reduce cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) < 1 and so your cross entropy will generally be quite small.
I need to incorporate additional information into a Keras loss function that depends on the current batch. Since Keras losses only take two arguments, I considered adding this information by making the loss function call next() on a generator object. However, the generator is only called once (probably when adding the loss function in model.compile()).
Here is a sample code:
data_batches = yield_data_batches()
meta_batches = yield_meta_batches()
....
model.compile(loss=loss_function, ...)
model.fit_generator(generator=data_batches, ....)
def loss_function(x, y):
meta_x, meta_y = next(meta_batches)
x *= meta_x # component-wise matrix multiplication
y *= meta_y # component-wise matrix multiplication
return mse(x, y)
Is there a way to make the loss function get a new meta_batch each time it is evaluated on a data_batch? Or is there another way to incorporate this meta information into the loss function?
Clarification:
The meta_x and meta_y are binary matrices that should cancel out certain elements from the prediction as they should not count to the loss.
For example:
y_true = (a,b,c,0)
y_pred = (d,e,f,g)
y_meta = (1,1,1,0)
Now, y_pred*y_meta should cancel out g so that it does not count to the loss.
This does not work, since the loss function will be compiled and added to the compute graph. Your loss function may only depend on y_pred and y_true.
You can either incorporate this information in y_true, or weight the resulting loss with sample weights.
Your approach would be equivalent to a combination of both:
Assuming positive weights a and b (you call them meta_x and meta_y):
|ax-by| = a|x-b/a*y|, so you just weight y_pred by b/a and add the sample weight a.