Tensorflow implementation of NT_Xent contrastive loss function?

As the title suggests, I'm trying to train a model based on the SimCLR framework (see the paper: https://arxiv.org/pdf/2002.05709.pdf; the NT_Xent loss is given in equation (1) and Algorithm 1).
I have managed to write a numpy version of the loss function, but it is not suitable for training the model, because numpy arrays do not carry the information required for backpropagation. I am having difficulty converting my numpy code to TensorFlow. Here is my numpy version:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Define the contrastive loss function, NT_Xent
def NT_Xent(zi, zj, tau=1):
    """ Calculates the contrastive loss of the input data using NT_Xent. The
    equation can be found in the paper: https://arxiv.org/pdf/2002.05709.pdf
    Args:
        zi: One half of the input data, shape = (batch_size, feature_1, feature_2, ..., feature_N)
        zj: Other half of the input data, must have the same shape as zi
        tau: Temperature parameter (a constant), default = 1.
    Returns:
        loss: The complete NT_Xent contrastive loss
    """
    z = np.concatenate((zi, zj), 0)
    loss = 0
    for k in range(zi.shape[0]):
        # Numerator (compare i,j & j,i)
        i = k
        j = k + zi.shape[0]
        sim_ij = np.squeeze(cosine_similarity(z[i].reshape(1, -1), z[j].reshape(1, -1)))
        sim_ji = np.squeeze(cosine_similarity(z[j].reshape(1, -1), z[i].reshape(1, -1)))
        numerator_ij = np.exp(sim_ij / tau)
        numerator_ji = np.exp(sim_ji / tau)
        # Denominator (compare i & j to all samples apart from themselves)
        sim_ik = np.squeeze(cosine_similarity(z[i].reshape(1, -1), z[np.arange(z.shape[0]) != i]))
        sim_jk = np.squeeze(cosine_similarity(z[j].reshape(1, -1), z[np.arange(z.shape[0]) != j]))
        denominator_ik = np.sum(np.exp(sim_ik / tau))
        denominator_jk = np.sum(np.exp(sim_jk / tau))
        # Calculate individual and combined losses
        loss_ij = - np.log(numerator_ij / denominator_ik)
        loss_ji = - np.log(numerator_ji / denominator_jk)
        loss += loss_ij + loss_ji
    # Divide by the total number of samples
    loss /= z.shape[0]
    return loss
I am fairly confident that this function produces the correct results, albeit slowly. I have seen vectorised implementations online, such as this one for PyTorch: https://github.com/Spijkervet/SimCLR/blob/master/modules/nt_xent.py, and my code produces the same result for identical inputs. However, I do not see how their version is mathematically equivalent to the formula in the paper, which is why I am trying to build my own.
As a first try I have converted the numpy functions to their TF equivalents (tf.concat, tf.reshape, tf.math.exp, tf.range, etc.). I believe my main problem is that sklearn's cosine_similarity function returns a numpy array, and I do not know how to build this function myself in TensorFlow. Any ideas?

I managed to figure it out myself!
I did not realise that TensorFlow ships its own cosine similarity function, "tf.keras.losses.CosineSimilarity". Note that it returns the negative of the cosine similarity, hence the minus signs in the code.
Here is my code:
import tensorflow as tf
# Define the contrastive loss function, NT_Xent (Tensorflow version)
def NT_Xent_tf(zi, zj, tau=1):
    """ Calculates the contrastive loss of the input data using NT_Xent. The
    equation can be found in the paper: https://arxiv.org/pdf/2002.05709.pdf
    (This is the Tensorflow implementation of the standard numpy version found
    in the NT_Xent function).
    Args:
        zi: One half of the input data, shape = (batch_size, feature_1, feature_2, ..., feature_N)
        zj: Other half of the input data, must have the same shape as zi
        tau: Temperature parameter (a constant), default = 1.
    Returns:
        loss: The complete NT_Xent contrastive loss
    """
    z = tf.cast(tf.concat((zi, zj), 0), dtype=tf.float32)
    loss = 0
    for k in range(zi.shape[0]):
        # Numerator (compare i,j & j,i)
        i = k
        j = k + zi.shape[0]
        # Instantiate the cosine similarity loss function
        cosine_sim = tf.keras.losses.CosineSimilarity(axis=-1, reduction=tf.keras.losses.Reduction.NONE)
        sim = tf.squeeze(- cosine_sim(tf.reshape(z[i], (1, -1)), tf.reshape(z[j], (1, -1))))
        numerator = tf.math.exp(sim / tau)
        # Denominator (compare i & j to all samples apart from themselves)
        sim_ik = - cosine_sim(tf.reshape(z[i], (1, -1)), z[tf.range(z.shape[0]) != i])
        sim_jk = - cosine_sim(tf.reshape(z[j], (1, -1)), z[tf.range(z.shape[0]) != j])
        denominator_ik = tf.reduce_sum(tf.math.exp(sim_ik / tau))
        denominator_jk = tf.reduce_sum(tf.math.exp(sim_jk / tau))
        # Calculate individual and combined losses
        loss_ij = - tf.math.log(numerator / denominator_ik)
        loss_ji = - tf.math.log(numerator / denominator_jk)
        loss += loss_ij + loss_ji
    # Divide by the total number of samples
    loss /= z.shape[0]
    return loss
As you can see, I have essentially just swapped out the numpy functions for their TF equivalents. One point worth noting is that I had to pass "reduction=tf.keras.losses.Reduction.NONE" when creating "cosine_sim". This keeps the shapes consistent in "sim_ik" and "sim_jk"; without it, the resulting loss did not match my original numpy implementation.
I also noticed that calculating the numerator separately for i,j and j,i was redundant, as the results are the same, so I removed one of those calculations.
Of course, if anybody has a quicker implementation I am more than happy to hear about it!
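For what it is worth, the per-sample loop can be collapsed into a few matrix operations. The following is a minimal NumPy sketch of that idea, not the paper's reference code; the function name nt_xent_vectorised is made up for illustration, and it assumes each sample can be flattened to a vector (as the reshape(1, -1) calls above already do). Each step has a direct TensorFlow counterpart (tf.math.l2_normalize, tf.matmul, tf.reduce_logsumexp), so it should port without much trouble.

```python
import numpy as np

def nt_xent_vectorised(zi, zj, tau=1.0):
    """Vectorised NT-Xent for two views zi, zj of shape (batch_size, ...)."""
    n = zi.shape[0]
    # Stack both views and flatten every sample to a vector: shape (2n, d)
    z = np.concatenate((zi, zj), axis=0).reshape(2 * n, -1).astype(np.float64)
    # After L2-normalisation, a dot product equals the cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    # Exclude self-similarity (k == i) from every denominator
    np.fill_diagonal(sim, -np.inf)
    # The positive partner of row i is row i + n, and vice versa
    pos_idx = np.concatenate((np.arange(n, 2 * n), np.arange(0, n)))
    pos = sim[np.arange(2 * n), pos_idx]
    # Numerically stable log-sum-exp over each row (the log of the denominator)
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    # -log(numerator / denominator), averaged over all 2n samples
    return np.mean(log_denominator - pos)
```

Since -log(exp(s_ij / tau) / sum_k exp(s_ik / tau)) equals logsumexp_k(s_ik / tau) - s_ij / tau, averaging that difference over all 2n rows gives the same value as the loop.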

Here is a more efficient and more stable implementation. It assumes that zi and zj are interlaced!
class NT_Xent(tf.keras.layers.Layer):
    """ Normalized temperature-scaled CrossEntropy loss [1]
    [1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv. 2020, Accessed: Jan. 15, 2021. [Online]. Available: https://github.com/google-research/simclr.
    """
    def __init__(self, tau=1, **kwargs):
        super().__init__(**kwargs)
        self.tau = tau
        self.similarity = tf.keras.losses.CosineSimilarity(axis=-1, reduction=tf.keras.losses.Reduction.NONE)
        self.criterion = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    def get_config(self):
        return {"tau": self.tau}
    def call(self, zizj):
        """ zizj is [B,N] tensor with order z_i1 z_j1 z_i2 z_j2 z_i3 z_j3 ...
        batch_size is twice the original batch_size
        """
        batch_size = tf.shape(zizj)[0]
        # Integer division: tf.eye needs an integer number of rows
        mask = tf.repeat(tf.repeat(~tf.eye(batch_size // 2, dtype=tf.bool), 2, axis=0), 2, axis=1)
        sim = -1*self.similarity(tf.expand_dims(zizj, 1), tf.expand_dims(zizj, 0))/self.tau
        sim_i_j = -1*self.similarity(zizj[0::2], zizj[1::2])/self.tau
        pos = tf.reshape(tf.repeat(sim_i_j, repeats=2), (batch_size, -1))
        neg = tf.reshape(sim[mask], (batch_size, -1))
        logits = tf.concat((pos, neg), axis=-1)
        labels = tf.one_hot(tf.zeros((batch_size,), dtype=tf.int32), depth=batch_size-1)
        return self.criterion(labels, logits)
source: https://github.com/gabriel-vanzandycke/tf_layers
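The layer above expects the two views to alternate row by row. As a small illustration of what "interlaced" means, here is a NumPy sketch that builds that ordering from separate zi and zj arrays. The helper name interlace is hypothetical, and it assumes both inputs are 2-D with shape [B, N]; tf.stack and tf.reshape take the same arguments if you need this inside a model.

```python
import numpy as np

def interlace(zi, zj):
    """Return rows ordered z_i1, z_j1, z_i2, z_j2, ... (hypothetical helper)."""
    stacked = np.stack((zi, zj), axis=1)        # shape (B, 2, N)
    return stacked.reshape(-1, zi.shape[-1])    # shape (2B, N)

zi = np.array([[1.0, 2.0], [3.0, 4.0]])
zj = np.array([[10.0, 20.0], [30.0, 40.0]])
zizj = interlace(zi, zj)
```

With this ordering, zizj[0::2] recovers zi and zizj[1::2] recovers zj, which is exactly how the call method picks out the positive pairs.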

Related

How to implement the Residual Standard Error (RSE) as a metric in Keras?

How can I compute the Residual Standard Error (RSE) as a custom metric in Keras?
The RSE is given by: sqrt[RSS / (n-2)]
Where the RSS is: sum((y_true -y_pred)**2)
This question refers to a post on stackoverflow. In this post, a user by the name of Swain Subrat Kumar shows the implementation of the Residual Standard Error (RSE). He even provides a minimum working example (MWE) which I believe to be correct.
I repost a shortened version here:
import math
import numpy as np
def RSE(y_true, y_predicted):
    '''
    y_true, y_predicted: np.array()
    '''
    RSS = np.sum(np.square(y_true - y_predicted))
    return math.sqrt(RSS / (len(y_true) - 2))
I am trying to translate this code into keras/tensorflow so that I can use it as a metric. So far, I have this:
def rse(y_true, y_pred):
    '''
    y_true, y_pred: tensor
    '''
    tmp=tf.cast(len(y_true), tf.float32) - tf.constant(2.0)
    RSS = K.sum(K.square(y_true - y_pred)) # residual sum of squares
    return K.sqrt(tf.math.divide(RSS, tmp))
However, this is not correct. The RSS is ok. Where it all goes wrong is in dividing the RSS by (len(y_true)-2).
How can I fix this? Many thanks in advance.
P.S.: I am having similar problems when trying to create my own variance metric.

If you are using the rse function as a metric or a loss, it is applied to batches of data, i.e., tensors of shape (B, n), where B is the batch size and n is the number of elements in each vector (assuming each is 1-D). When you divide by len(y_true) - 2, len returns the number of samples in the batch, B (the first dimension), whereas it should use the size of the second dimension, n. If you change the rse function to use the second dimension of the tensor (y_true.shape[1]), the results are correct:
def rse(y_true, y_pred):
    '''
    y_true, y_pred: tensor
    '''
    tmp = tf.cast(y_true.shape[1], tf.float32) - tf.constant(2.0)
    RSS = K.sum(K.square(y_true - y_pred)) # residual sum of squares
    return K.sqrt(tf.math.divide(RSS, tmp))
In a fully reproducible dummy example:
import tensorflow as tf
import tensorflow.keras.backend as K
import numpy as np
def rse(y_true, y_pred):
    '''
    y_true, y_pred: tensor
    '''
    tmp = tf.cast(y_true.shape[1], tf.float32) - tf.constant(2.0)
    RSS = K.sum(K.square(y_true - y_pred)) # residual sum of squares
    return K.sqrt(tf.math.divide(RSS, tmp))
if __name__ == "__main__":
    # NOTE: call `expand_dims` to simulate the idea of a batch (i.e a 2D tensor with shape (1, 5))
    # so B = 1, n = 5
    y_true = np.expand_dims(np.array([1, 2, 3, 4, 6], dtype=np.float32), axis=0)
    y_pred = np.expand_dims(np.array([1, 2, 3, 4, 5], dtype=np.float32), axis=0)
    print(rse(y_true, y_pred))
Output is:
tf.Tensor(0.57735026, shape=(), dtype=float32)
Which is correct (simply the square root of 1/3, since we only have 1 error in the example data).

Need help implementing a custom loss function in lightGBM (Zero-inflated Log Normal Loss)

I'm trying to implement the zero-inflated log normal loss function from this paper in LightGBM (https://arxiv.org/pdf/1912.07753.pdf) (page 5). But, admittedly, I just don't know how. I don't understand how to get the gradient and hessian of this function in order to implement it in LGBM, and I've never needed to implement a custom loss function in the past.
The authors of this paper have open-sourced their code, and the function is available in TensorFlow (https://github.com/google/lifetime_value/blob/master/lifetime_value/zero_inflated_lognormal.py), but I'm unable to translate it to fit the parameters required for a custom loss function in LightGBM. As an example of how LGBM accepts custom loss functions, the loglikelihood loss would be written as:
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess
Similarly, I would need to define a custom eval metric to accompany it, such as:
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False
Both of the above two examples are taken from the following repository:
https://github.com/microsoft/LightGBM/blob/e83042f20633d7f74dda0d18624721447a610c8b/examples/python-guide/advanced_example.py#L136
Would appreciate any help on this, and especially detailed guidance to help me learn how to do this on my own.
According to the LGBM documentation for custom loss functions:
It should have the signature objective(y_true, y_pred) -> grad, hess or objective(y_true, y_pred, group) -> grad, hess:
y_true: numpy 1-D array of shape = [n_samples]
The target values.
y_pred: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
group: numpy 1-D array
Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
grad: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the first order derivative (gradient) of the loss with respect to the elements of y_pred for each sample point.
hess: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the second order derivative (Hessian) of the loss with respect to the elements of y_pred for each sample point.
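To make the grad/hess requirement concrete: both are element-wise derivatives of the per-sample loss with respect to the raw prediction. The sketch below approximates them with central finite differences and can be checked against the closed forms from the loglikelihood example (preds - labels and preds * (1 - preds)). The names numeric_grad_hess and logistic_loss are hypothetical helpers, not part of LightGBM, and the sketch only covers a single raw score per sample. It is meant as a way to sanity-check hand-derived formulas, not as a finished objective.

```python
import numpy as np

def numeric_grad_hess(loss_fn, y_true, y_pred, eps=1e-4):
    """Approximate d(loss)/d(y_pred) and d2(loss)/d(y_pred)2 per sample.

    loss_fn(y_true, y_pred) must return one loss value per sample.
    """
    up = loss_fn(y_true, y_pred + eps)
    mid = loss_fn(y_true, y_pred)
    down = loss_fn(y_true, y_pred - eps)
    grad = (up - down) / (2.0 * eps)            # central first difference
    hess = (up - 2.0 * mid + down) / eps ** 2   # central second difference
    return grad, hess

def logistic_loss(y_true, raw_score):
    """Per-sample log loss on raw margins: log(1 + exp(s)) - y * s."""
    return np.logaddexp(0.0, raw_score) - y_true * raw_score

labels = np.array([0.0, 1.0, 1.0, 0.0])
raw = np.array([-1.5, 0.3, 2.0, 0.7])
grad, hess = numeric_grad_hess(logistic_loss, labels, raw)
```

For the logistic loss, grad and hess agree with sigmoid(raw) - labels and sigmoid(raw) * (1 - sigmoid(raw)) up to the finite-difference error. The same check can be run against any analytic gradient you derive for your own loss.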

This is the "translation", as you defined it, of the tensorflow implementation. Most of the work is just defining the functions yourself (i.e. softplus, crossentropy, etc.)
The mean absolute percentage error is used in the linked paper, not sure if that is the eval metric you want to use.
import numpy as np
epsilon = 1e-7
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def softplus(x, beta=1):
    # (1 / beta) * log(1 + exp(beta * x)), computed stably via logaddexp
    return np.logaddexp(0, beta * x) / beta
def BinaryCrossEntropy(y_true, y_pred):
    # y_pred must be probabilities, not raw logits
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    term_0 = (1-y_true) * np.log(1-y_pred + epsilon)
    term_1 = y_true * np.log(y_pred + epsilon)
    return -np.mean(term_0+term_1, axis=-1)
def lognormal_log_prob(x, loc, scale):
    # Log-density of a log-normal distribution at x > 0
    # (NumPy has no LogNormal class, so it is written out explicitly)
    return (-np.log(x) - np.log(scale) - 0.5 * np.log(2 * np.pi)
            - 0.5 * np.square((np.log(x) - loc) / scale))
def zero_inflated_lognormal_pred(logits):
    positive_probs = sigmoid(logits[..., :1])
    loc = logits[..., 1:2]
    scale = softplus(logits[..., 2:])
    preds = (
        positive_probs *
        np.exp(loc + 0.5 * np.square(scale)))
    return preds
def mean_abs_pct_error(preds, train_data):
    labels = train_data.get_label()
    decile_labels=np.percentile(labels,np.linspace(10,100,10))
    decile_preds=np.percentile(preds,np.linspace(10,100,10))
    MAPE = sum(np.absolute(decile_preds - decile_labels)/decile_labels)
    return 'error', MAPE, False
def zero_inflated_lognormal_loss(train_data,
                                 logits):
    # logits is assumed to have shape (n_samples, 3)
    labels = train_data.get_label().reshape(-1, 1)
    positive = (labels > 0).astype(float)
    # Convert the first logit to a probability before the cross entropy
    positive_probs = sigmoid(logits[..., :1])
    classification_loss = BinaryCrossEntropy(
        y_true=positive, y_pred=positive_probs)
    loc = logits[..., 1:2]
    scale = np.maximum(
        softplus(logits[..., 2:]),
        np.sqrt(epsilon))
    safe_labels = positive * labels + (
        1 - positive) * np.ones(labels.shape)
    regression_loss = -np.mean(
        positive * lognormal_log_prob(safe_labels, loc, scale),
        axis=-1)
    # Per-sample loss values only; LightGBM still needs grad and hess
    return classification_loss + regression_loss

Vectorize for loop in python

I have the following loop, in which I calculate the softmax transform for batches of different sizes:
import numpy as np
def softmax(Z,arr):
    """
    :param Z: numpy array of any shape (output from hidden layer)
    :param arr: numpy array of any shape (start, end)
    :return A: output of multinum_logit(Z,arr), same shape as Z
    :return cache: returns Z as well, useful during back propagation
    """
    A = np.zeros(Z.shape)
    # plain range here; numba's prange only applies inside a jitted function
    for i in range(len(arr)):
        shiftx = Z[:,arr[i,1]:arr[i,2]+1] - np.max(Z[:,int(arr[i,1]):int(arr[i,2])+1])
        A[:,arr[i,1]:arr[i,2]+1] = np.exp(shiftx)/np.exp(shiftx).sum()
    cache = Z
    return A,cache
Since this for loop is not vectorized, it is the bottleneck in my code. What is a possible way to make it faster? I have tried numba's @jit, which makes it a little faster, but not enough. I was wondering if there is another way to make it faster, or to vectorize/parallelize it.
Sample input data for the function:
Z = np.random.random([1,10000])
arr = np.zeros([100,3])
arr[:,0] = 1
temp = int(Z.shape[1]/arr.shape[0])
for i in range(arr.shape[0]):
    arr[i,1] = i*temp
    arr[i,2] = (i+1)*temp-1
arr = arr.astype(int)
EDIT:
I forgot to stress that the number of classes varies. For example, batch 1 may have 10 classes and batch 2 may have 15. I therefore pass an array arr that keeps track of which rows belong to batch 1, and so on. These batches are different from the batches in a traditional neural network framework.
In the above example, arr keeps track of the starting and ending index of the rows, so the denominator in the softmax function is the sum over only those observations whose index lies between the starting and ending index.

Here's a vectorized softmax function. It's the implementation of an assignment from Stanford's cs231n course on conv nets.
The function takes in optimizable parameters, input data, targets, and a regularizer. (You can ignore the regularizer as that references another class exclusive to some cs231n assignments).
It returns a loss and gradients of the parameters.
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    scores = X.dot(W)
    shift_scores = scores - np.amax(scores,axis=1).reshape(-1,1)
    softmax = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis=1).reshape(-1,1)
    loss = -np.sum(np.log(softmax[range(num_train), list(y)]))
    loss /= num_train
    loss += 0.5* reg * np.sum(W * W)
    dSoftmax = softmax.copy()
    dSoftmax[range(num_train), list(y)] += -1
    dW = (X.T).dot(dSoftmax)
    dW = dW/num_train + reg * W
    return loss, dW
For comparison's sake, here is a naive (non-vectorized) implementation of the same method.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    for i in range(num_train):
        scores = X[i].dot(W)
        shift_scores = scores - max(scores)
        loss_i = -shift_scores[y[i]] + np.log(sum(np.exp(shift_scores)))
        loss += loss_i
        for j in range(num_classes):
            softmax = np.exp(shift_scores[j])/sum(np.exp(shift_scores))
            if j==y[i]:
                dW[:,j] += (-1 + softmax) * X[i]
            else:
                dW[:,j] += softmax *X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    # Average the gradient, then add the regularization term (as in the vectorized version)
    dW = dW / num_train + reg * W
    return loss, dW
Source
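For the segment-wise softmax that the question actually describes, one possible loop-free approach is NumPy's ufunc.reduceat. The sketch below is only an illustration under stated assumptions: the name segment_softmax is made up, the segments in arr are assumed to be contiguous, sorted, and to cover every column of Z, and normalisation is done per row (which matches the original loop when Z has a single row, as in the sample input).

```python
import numpy as np

def segment_softmax(Z, arr):
    """Softmax over column segments [arr[i, 1], arr[i, 2]] of each row of Z."""
    starts = arr[:, 1]
    lengths = arr[:, 2] - arr[:, 1] + 1
    # Per-segment maximum, broadcast back to full width for numerical stability
    seg_max = np.maximum.reduceat(Z, starts, axis=1)
    shifted = Z - np.repeat(seg_max, lengths, axis=1)
    exp_z = np.exp(shifted)
    # Per-segment sum of exponentials, broadcast back to full width
    seg_sum = np.add.reduceat(exp_z, starts, axis=1)
    return exp_z / np.repeat(seg_sum, lengths, axis=1)
```

np.maximum.reduceat and np.add.reduceat reduce each slice starts[i]:starts[i+1] in one call, and np.repeat expands the per-segment results back to the shape of Z, so segments of different lengths are handled without a Python loop.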

What is the problem with my implementation of the cross-entropy function?

I am learning about neural networks and I want to write a function cross_entropy in Python. It is defined as
$$\mathrm{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} t_{i,j} \log(p_{i,j})$$
where N is the number of samples, k is the number of classes, log is the natural logarithm, t_i,j is 1 if sample i is in class j and 0 otherwise, and p_i,j is the predicted probability that sample i is in class j.
To avoid numerical issues with the logarithm, clip the predictions to the [10^{−12}, 1 − 10^{−12}] range.
Following the above description, I wrote the code below: it clips the predictions to the [epsilon, 1 − epsilon] range, then computes the cross entropy using the above formula.
def cross_entropy(predictions, targets, epsilon=1e-12):
    """
    Computes cross entropy between targets (encoded as one-hot vectors)
    and predictions.
    Input: predictions (N, k) ndarray
           targets (N, k) ndarray
    Returns: scalar
    """
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    ce = - np.mean(np.log(predictions) * targets)
    return ce
The following code is used to check whether the function cross_entropy is correct.
predictions = np.array([[0.25,0.25,0.25,0.25],
                        [0.01,0.01,0.01,0.96]])
targets = np.array([[0,0,0,1],
                    [0,0,0,1]])
ans = 0.71355817782 #Correct answer
x = cross_entropy(predictions, targets)
print(np.isclose(x,ans))
The output of the above code is False, which means my cross_entropy function is not correct. Printing the result of cross_entropy(predictions, targets) gives 0.178389544455, whereas the correct result is ans = 0.71355817782. Could anybody help me find the problem with my code?

You're not that far off at all, but remember you are taking the average value of N sums, where N = 2 (in this case). So your code could read:
def cross_entropy(predictions, targets, epsilon=1e-12):
    """
    Computes cross entropy between targets (encoded as one-hot vectors)
    and predictions.
    Input: predictions (N, k) ndarray
           targets (N, k) ndarray
    Returns: scalar
    """
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    ce = -np.sum(targets*np.log(predictions+1e-9))/N
    return ce
predictions = np.array([[0.25,0.25,0.25,0.25],
                        [0.01,0.01,0.01,0.96]])
targets = np.array([[0,0,0,1],
                    [0,0,0,1]])
ans = 0.71355817782 #Correct answer
x = cross_entropy(predictions, targets)
print(np.isclose(x,ans))
Here, I think it's a little clearer if you stick with np.sum(). Also, I added 1e-9 inside the np.log() to avoid the possibility of having a log(0) in your computation. Hope this helps!
NOTE: As per @Peter's comment, the offset of 1e-9 is indeed redundant if your epsilon value is greater than 0.
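To see where the factor between the two results comes from, here is a short worked check using the question's own data. It is just arithmetic, not a different method: np.mean divides by every element of the (N, k) array, so it divides by N * k = 8 instead of N = 2.

```python
import numpy as np

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.96]])
targets = np.array([[0, 0, 0, 1],
                    [0, 0, 0, 1]])

per_element = -np.log(predictions) * targets                   # shape (N, k) = (2, 4)
mean_over_all = per_element.mean()                             # divides by N * k = 8
mean_over_samples = per_element.sum() / predictions.shape[0]   # divides by N = 2

print(mean_over_all)       # approximately 0.17839
print(mean_over_samples)   # approximately 0.71356
```

The first value is the second divided by k = 4, which is exactly the discrepancy reported in the question.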

import numpy as np
from scipy.stats import entropy, truncnorm
def cross_entropy(x, y):
    """ Computes cross entropy between two distributions.
    Input: x: iterable of N non-negative values
           y: iterable of N non-negative values
    Returns: scalar
    """
    # Convert first so that plain lists are accepted as well
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    if np.any(x < 0) or np.any(y < 0):
        raise ValueError('Negative values exist.')
    # Force to proper probability mass function.
    x /= np.sum(x)
    y /= np.sum(y)
    # Ignore zero 'y' elements.
    mask = y > 0
    x = x[mask]
    y = y[mask]
    ce = -np.sum(x * np.log(y))
    return ce
def cross_entropy_via_scipy(x, y):
    ''' SEE: https://en.wikipedia.org/wiki/Cross_entropy'''
    return entropy(x) + entropy(x, y)
x = truncnorm.rvs(0.1, 2, size=100)
y = truncnorm.rvs(0.1, 2, size=100)
print(np.isclose(cross_entropy(x, y), cross_entropy_via_scipy(x, y)))

CNTK: Define a custom loss function (Sørensen-Dice coefficient)

I'd like to use the Sørensen–Dice coefficient (see Wikipedia) as a loss function in CNTK/Python. How can I define a custom loss function?

To answer your more general question "How can I define a custom loss function:"
In CNTK, loss functions are not special. Any expression that results in a scalar can be used as a loss function. The learner will compute the minibatch-level loss by summing up the scalar loss values of all samples in the minibatch, and backpropagate through it like through any CNTK expression.
For example, the following is a way of defining a square-error loss:
def my_square_error(x,y):
    diff = x-y
    return times_transpose(diff, diff)
and the cross_entropy_with_softmax() loss can be written in Python like this:
def my_cross_entropy_with_softmax(output, labels):
    logZ = reduce_log_sum(output) # log of softmax denominator
    # loss = -log softmax(output)[label] = logZ - <labels, output>
    return logZ - times_transpose(labels, output)
Lastly, multi-task learning can be trivially realized by using a loss function that is a weighted sum over multiple losses.

import numpy as np
import cntk as C
def dice_coefficient(x, y):
    # https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
    intersection = C.reduce_sum(C.element_times(x, y))
    return 2 * intersection / (C.reduce_sum(x) + C.reduce_sum(y))
shape = (1, 2, 2)
x1 = np.ones(shape)
y1 = np.reshape([0, 1, 0, 1], shape)
x = C.sanitize_input(x1)
y = C.sanitize_input(y1)
dice_coefficient(x, y).eval({x: x1, y: y1})
array([ 0.66666669], dtype=float32)
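Note that the coefficient above is a similarity (higher is better), so it has to be turned into something to minimise before it can serve as a loss. A common convention, which is an assumption here and not part of the answer above, is to use 1 - dice with a small smoothing term to avoid division by zero. The following NumPy sketch only illustrates the arithmetic; dice_loss and smooth are hypothetical names, and the same expression would need to be rebuilt from CNTK operations such as C.reduce_sum and C.element_times to be trainable.

```python
import numpy as np

def dice_loss(x, y, smooth=1e-7):
    """1 - Dice coefficient; the smooth term guards against an empty x and y."""
    intersection = np.sum(x * y)
    dice = (2.0 * intersection + smooth) / (np.sum(x) + np.sum(y) + smooth)
    return 1.0 - dice

shape = (1, 2, 2)
x1 = np.ones(shape)
y1 = np.reshape([0, 1, 0, 1], shape)
loss = dice_loss(x1, y1)  # Dice is about 2/3 here, so the loss is about 1/3
```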
