Background
In TensorFlow 2 there is a class called GradientTape which is used to record operations on tensors, the results of which can then be differentiated and fed to some minimization algorithm. From the documentation we have this example:
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
dy_dx = g.gradient(y, x)  # Will compute to 6.0
The docstring for the gradient method implies that the first argument can be not just a tensor, but a list of tensors:
def gradient(self,
             target,
             sources,
             output_gradients=None,
             unconnected_gradients=UnconnectedGradients.NONE):
  """Computes the gradient using operations recorded in context of this tape.

  Args:
    target: a list or nested structure of Tensors or Variables to be
      differentiated.
    sources: a list or nested structure of Tensors or Variables. `target`
      will be differentiated against elements in `sources`.
    output_gradients: a list of gradients, one for each element of
      target. Defaults to None.
    unconnected_gradients: a value which can either hold 'none' or 'zero' and
      alters the value which will be returned if the target and sources are
      unconnected. The possible values and effects are detailed in
      'UnconnectedGradients' and it defaults to 'none'.

  Returns:
    a list or nested structure of Tensors (or IndexedSlices, or None),
    one for each element in `sources`. Returned structure is the same as
    the structure of `sources`.

  Raises:
    RuntimeError: if called inside the context of the tape, or if called more
      than once on a non-persistent tape.
    ValueError: if the target is a variable or if unconnected gradients is
      called with an unknown value.
  """
In the above example, it is easy to see that y, the target, is the function to be differentiated, and x is the independent variable the "gradient" is taken with respect to.
From my limited experience, it appears that the gradient method returns a list of tensors, one for each element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
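For instance, a minimal sketch (my own, not from the docs) with two differently shaped sources:

import tensorflow as tf

w = tf.Variable(tf.ones((2, 3)))
b = tf.Variable(tf.zeros((3,)))
with tf.GradientTape() as g:
    y = tf.reduce_sum(tf.matmul(tf.ones((4, 2)), w) + b)
grads = g.gradient(y, [w, b])
print(grads[0].shape, grads[1].shape)  # (2, 3) and (3,), i.e. the shapes of w and b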
Question
The above description of the behavior of gradient makes sense if target contains a single 1x1 "tensor" to be differentiated, because mathematically a gradient vector should have the same dimension as the domain of the function.
However, if target is a list of tensors, the output of gradient still has the same shape as sources. Why is this the case? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How am I to interpret this behavior conceptually?
This is how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter can't be used in eager mode. From the docs of tf.gradients():
It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys
where xs are sources and ys are target.
Example 1:
So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:
[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
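As a quick numerical check (a minimal sketch of my own, not from the quoted docs), with y1 = x1*x2 and y2 = x1*x1:

import tensorflow as tf

x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)
with tf.GradientTape() as g:
    y1 = x1 * x2   # dy1/dx1 = x2 = 3,  dy1/dx2 = x1 = 2
    y2 = x1 * x1   # dy2/dx1 = 2*x1 = 4, dy2/dx2 = 0
grads = g.gradient([y1, y2], [x1, x2])
print([t.numpy() for t in grads])  # [7.0, 2.0], i.e. [dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]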
Example 2:
Compute gradients for loss-per-sample (tensor) vs reduced loss (scalar)
Let w, b be two variables.
xentropy = [y1, y2]                 # tensor
reduced_xentropy = 0.5 * (y1 + y2)  # scalar

grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]

reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db]
              = 0.5 * grads
Tensorflow example of the above snippet:
import tensorflow as tf

print(tf.__version__)  # 2.1.0

inputs = tf.convert_to_tensor([[0.1, 0], [0.5, 0.51]])  # two two-dimensional samples
w = tf.Variable(initial_value=inputs)
b = tf.Variable(tf.zeros((2,)))
labels = tf.convert_to_tensor([0, 1])

def forward(inputs, labels, var_list):
    w, b = var_list
    logits = tf.matmul(inputs, w) + b
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return xentropy

# `xentropy` has two elements (one loss value per sample in the batch)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads = g.gradient(xentropy, [w, b])
print(xentropy.numpy())  # [0.6881597  0.71584916]
print(grads[0].numpy())  # [[ 0.20586157 -0.20586154]
                         #  [ 0.2607238  -0.26072377]]

# `reduced_xentropy` is a scalar (gradient of a scalar)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads_reduced = g.gradient(reduced_xentropy, [w, b])
print(reduced_xentropy.numpy())   # 0.70200443 <-- scalar
print(grads_reduced[0].numpy())   # [[ 0.10293078 -0.10293077]
                                  #  [ 0.1303619  -0.13036188]]
If you compute the loss (xentropy) for each element in a batch, the final gradient of each variable will be the sum of the per-sample gradients over the batch (which makes sense).
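As a quick sanity check of that statement (my own addition, reusing the tensors computed above): since reduce_mean averages over the two samples, the reduced gradients should be exactly half the summed per-sample gradients.

import numpy as np

print(np.allclose(grads[0].numpy(), 2 * grads_reduced[0].numpy()))  # True
print(np.allclose(grads[1].numpy(), 2 * grads_reduced[1].numpy()))  # True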
Related
I'm puzzled by the fact that I've seen blocks of code that require tf.GradientTape().watch() to work and blocks that seem to work without it.
For example, this block of code requires the watch() function:
x = tf.ones((2, 2))  # Note: in the original posting of the question,
                     # I didn't include this key line.

with tf.GradientTape() as t:
    # Record the actions performed on tensor x with `watch`
    t.watch(x)
    # Define y as the sum of the elements in x
    y = tf.reduce_sum(x)
    # Let z be the square of y
    z = tf.square(y)

# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
But, this block does not:
with tf.GradientTape() as tape:
    logits = model(images, training=True)
    loss_value = loss_object(labels, logits)

loss_history.append(loss_value.numpy().mean())
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
What is the difference between these two cases?
The documentation of watch says the following:
Ensures that tensor is being traced by this tape.
Any trainable variable that is accessed within the context of the tape is watched by default. This means we can calculate the gradient with respect to that trainable variable by calling tape.gradient(loss, variable). Check the following example:
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
Hence there is no need to use tape.watch in the above code. But sometimes we need to calculate gradients with respect to a tensor that is not a trainable variable. In those cases we need to watch it explicitly.
with tf.GradientTape() as t:
    t.watch(images)
    predictions = cnn_model(images)
    loss = tf.keras.losses.categorical_crossentropy(expected_class_output, predictions)
gradients = t.gradient(loss, images)
In the above code, images is the input to the model and not a trainable variable. I need to calculate the gradient of the loss with respect to images, and hence I need to watch it.
Although the answer and comment given above were helpful, I think this code most clearly illustrates what I needed to know.
# Define a 2x2 array of 1's
x = tf.ones((2, 2))
x1 = tf.Variable([[1, 1], [1, 1]], name='x1', dtype=tf.float32)

with tf.GradientTape(persistent=True) as t:
    # Note: x is NOT explicitly watched here with `watch`
    # Define y as the sum of the elements in x and x1
    y = tf.reduce_sum(x) + tf.reduce_sum(x1)
    # Let z be the square of y
    z = tf.square(y)

# Get the derivative of z wrt the original input tensors x and x1
dz_dx = t.gradient(z, x)
dz_dx1 = t.gradient(z, x1)

# Print results
print("dz_dx =", dz_dx)
print("dz_dx1 =", dz_dx1)
print("type(x) =", type(x))
print("type(x1) =", type(x1))
which gives,
dz_dx = None
dz_dx1 = tf.Tensor(
[[16. 16.]
[16. 16.]], shape=(2, 2), dtype=float32)
type(x) = <class 'tensorflow.python.framework.ops.EagerTensor'>
type(x1) = <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
In TensorFlow, all variables are given the property trainable=True by default, so they are automatically watched. I neglected (out of ignorance) to indicate in my original posting that x was not a variable but an EagerTensor, which is not watched by default.
model.trainable_variables is a list of trainable variables, which are watched by the gradient tape, whereas a regular Tensor is not watched unless you specifically tell the tape to watch it.
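For completeness, a minimal sketch (my own addition) showing that explicitly watching the EagerTensor fixes the None result:

with tf.GradientTape() as t:
    t.watch(x)                        # explicitly watch the plain EagerTensor
    y = tf.reduce_sum(x) + tf.reduce_sum(x1)
    z = tf.square(y)
print(t.gradient(z, x))               # now a tensor of 16's instead of None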
I'm trying to implement this zero-inflated lognormal loss function from this paper in LightGBM (https://arxiv.org/pdf/1912.07753.pdf) (page 5). But, admittedly, I just don't know how. I don't understand how to get the gradient and hessian of this function in order to implement it in LightGBM, and I've never needed to implement a custom loss function in the past.
The authors of this paper have open-sourced their code, and the function is available in tensorflow (https://github.com/google/lifetime_value/blob/master/lifetime_value/zero_inflated_lognormal.py), but I'm unable to translate this to fit the parameters required for a custom loss function in LightGBM. As an example of how LightGBM accepts custom loss functions, a log-likelihood loss would be written as:
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess
Similarly, I would need to define a custom eval metric to accompany it, such as:
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False
Both of the above two examples are taken from the following repository:
https://github.com/microsoft/LightGBM/blob/e83042f20633d7f74dda0d18624721447a610c8b/examples/python-guide/advanced_example.py#L136
Would appreciate any help on this, and especially detailed guidance to help me learn how to do this on my own.
According to the LGBM documentation for custom loss functions:
It should have the signature objective(y_true, y_pred) -> grad, hess or objective(y_true, y_pred, group) -> grad, hess:
y_true: numpy 1-D array of shape = [n_samples]
The target values.
y_pred: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
group: numpy 1-D array
Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
grad: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the first order derivative (gradient) of the loss with respect to the elements of y_pred for each sample point.
hess: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the second order derivative (Hessian) of the loss with respect to the elements of y_pred for each sample point.
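Just to make the required signature concrete, here is a minimal sketch of my own (a plain squared-error objective rather than the zero-inflated lognormal one) that conforms to it; with the sklearn API such a callable can be passed directly as the objective:

import numpy as np
import lightgbm as lgb

def l2_objective(y_true, y_pred):
    # gradient and hessian of 0.5 * (y_pred - y_true)**2 w.r.t. y_pred, per sample
    grad = y_pred - y_true
    hess = np.ones_like(y_pred)
    return grad, hess

# hypothetical usage (dataset names are placeholders)
model = lgb.LGBMRegressor(objective=l2_objective)
# model.fit(X_train, y_train)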
This is the "translation", as you defined it, of the tensorflow implementation. Most of the work is just defining the functions yourself (e.g. softplus, cross-entropy, etc.).
The mean absolute percentage error is used in the linked paper; I'm not sure if that is the eval metric you want to use.
import math
import numpy as np

epsilon = 1e-7

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softplus(x, beta=1, threshold=20):
    return 1 / beta * np.log(1 + np.exp(beta * x))

def BinaryCrossEntropy(y_true, y_pred):
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    term_0 = (1 - y_true) * np.log(1 - y_pred + epsilon)
    term_1 = y_true * np.log(y_pred + epsilon)
    return -np.mean(term_0 + term_1, axis=0)

def lognormal_log_prob(x, loc, scale):
    # NumPy has no log-normal distribution object, so this helper stands in for
    # the log-normal log_prob call used in the tensorflow version.
    return (-np.log(x * scale * np.sqrt(2 * np.pi))
            - (np.log(x) - loc) ** 2 / (2 * scale ** 2))

def zero_inflated_lognormal_pred(logits):
    positive_probs = sigmoid(logits[..., :1])
    loc = logits[..., 1:2]
    scale = softplus(logits[..., 2:])
    preds = (
        positive_probs *
        np.exp(loc + 0.5 * np.square(scale)))
    return preds

def mean_abs_pct_error(preds, train_data):
    labels = train_data.get_label()
    decile_labels = np.percentile(labels, np.linspace(10, 100, 10))
    decile_preds = np.percentile(preds, np.linspace(10, 100, 10))
    MAPE = sum(np.absolute(decile_preds - decile_labels) / decile_labels)
    return 'error', MAPE, False

def zero_inflated_lognormal_loss(train_data, logits):
    labels = train_data.get_label()
    positive = labels > 0

    positive_logits = logits[..., :1]
    # convert the raw logits to probabilities before the binary cross-entropy
    classification_loss = BinaryCrossEntropy(
        y_true=positive, y_pred=sigmoid(positive_logits))

    loc = logits[..., 1:2]
    scale = np.maximum(
        softplus(logits[..., 2:]),
        math.sqrt(epsilon))
    safe_labels = positive * labels + (
        1 - positive) * np.ones(labels.shape)
    regression_loss = -np.mean(
        positive * lognormal_log_prob(safe_labels, loc, scale),
        axis=-1)
    return classification_loss + regression_loss
I have the following loop where I am calculating the softmax transform for batches of different sizes, as below:
import numpy as np
from numba import prange  # prange comes from numba (the question mentions trying @jit); plain range would also work here

def softmax(Z, arr):
    """
    :param Z: numpy array of any shape (output from hidden layer)
    :param arr: numpy array of any shape (start, end)
    :return A: output of multinum_logit(Z, arr), same shape as Z
    :return cache: returns Z as well, useful during back propagation
    """
    A = np.zeros(Z.shape)
    for i in prange(len(arr)):
        shiftx = Z[:, arr[i, 1]:arr[i, 2] + 1] - np.max(Z[:, int(arr[i, 1]):int(arr[i, 2]) + 1])
        A[:, arr[i, 1]:arr[i, 2] + 1] = np.exp(shiftx) / np.exp(shiftx).sum()
    cache = Z
    return A, cache
Since this for loop is not vectorized, it is the bottleneck in my code. What is a possible solution to make it faster? I have tried using @jit of numba, which makes it a little faster, but not enough. I was wondering if there is another way to make it faster or to vectorize/parallelize it.
Sample input data for the function
Z = np.random.random([1, 10000])
arr = np.zeros([100, 3])
arr[:, 0] = 1
temp = int(Z.shape[1] / arr.shape[0])
for i in range(arr.shape[0]):
    arr[i, 1] = i * temp
    arr[i, 2] = (i + 1) * temp - 1
arr = arr.astype(int)
EDIT:
I forgot to stress here that my number of classes varies. For example, batch 1 may have 10 classes and batch 2 may have 15 classes. Therefore I am passing an array arr which keeps track of which entries belong to batch 1, and so on. These batches are different from the batches in a traditional neural network framework.
In the above example, arr keeps track of the starting and ending index of each batch. So the denominator in the softmax function is the sum over only those observations whose index lies between the starting and ending index.
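For concreteness, a tiny check (added for illustration) of the intended behaviour: each [start, end] block of columns is normalized independently, so each block of the output sums to 1.

Z_small = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
arr_small = np.array([[1, 0, 2], [1, 3, 5]])  # two blocks: columns 0-2 and 3-5
A_small, _ = softmax(Z_small, arr_small)
print(A_small[:, 0:3].sum(), A_small[:, 3:6].sum())  # both 1.0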
Here's a vectorized softmax function. It's the implementation of an assignment from Stanford's cs231n course on conv nets.
The function takes in optimizable parameters, input data, targets, and a regularizer. (You can ignore the regularizer as that references another class exclusive to some cs231n assignments).
It returns a loss and gradients of the parameters.
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]

    scores = X.dot(W)
    shift_scores = scores - np.amax(scores, axis=1).reshape(-1, 1)
    softmax = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1).reshape(-1, 1)

    loss = -np.sum(np.log(softmax[range(num_train), list(y)]))
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    dSoftmax = softmax.copy()
    dSoftmax[range(num_train), list(y)] += -1
    dW = (X.T).dot(dSoftmax)
    dW = dW / num_train + reg * W

    return loss, dW
For comparison's sake, here is a naive (non-vectorized) implementation of the same method.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    loss = 0.0
    dW = np.zeros_like(W)

    num_train = X.shape[0]
    num_classes = W.shape[1]

    for i in range(num_train):
        scores = X[i].dot(W)
        shift_scores = scores - max(scores)
        loss_i = -shift_scores[y[i]] + np.log(sum(np.exp(shift_scores)))
        loss += loss_i
        for j in range(num_classes):
            softmax = np.exp(shift_scores[j]) / sum(np.exp(shift_scores))
            if j == y[i]:
                dW[:, j] += (-1 + softmax) * X[i]
            else:
                dW[:, j] += softmax * X[i]

    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W

    return loss, dW
Source
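As a quick sanity check (my own addition, not part of the course assignment), the two implementations should agree on random data up to floating-point error:

import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(5, 3)
X = np.random.randn(10, 5)
y = np.random.randint(3, size=10)

loss_v, dW_v = softmax_loss_vectorized(W, X, y, reg=0.1)
loss_n, dW_n = softmax_loss_naive(W, X, y, reg=0.1)
print(abs(loss_v - loss_n), np.abs(dW_v - dW_n).max())  # both ~0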
I am trying to build a neural network using custom activation functions. I followed the solution given here, and it works when the input and output vectors have the same size, but not when using different sizes (like in a pooling function). Here is my problem so far:
I am trying to generalize this to the case when the input and the output have different sizes. In my code the input 'x' is of size (2,4), the output 'y' is of size (1,2), and the activation function MEX(.) does the mapping y = MEX(x). I have computed the gradient of MEX() as d_MEX(), where d_MEX(x) has the same size as 'x', that is (2,4). Nevertheless, I get this error
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,2] vs. [2,4]
Shouldn't the gradient of MEX(x) be of the same size as x? Here is my complete code:
import tensorflow as tf
import numpy as np

# This is our target function
def MEX(x):
    '''
    :param x: is a row vector which is the concatenation of [input, beta]
    :return MEX_{beta}(x): scalar output
    '''
    # lenx = np.size(x) # Number of columns (ROW vector)
    lenx = x.shape[1]
    N = x.shape[0]
    out = np.zeros((1, N))
    for ii in range(N):
        c = x[ii, 0:lenx-1]
        beta = x[ii, lenx-1]
        out[0, ii] = 1. / beta * np.log(np.mean(np.exp(beta * c)))
    return np.array(out)
# Now we should write its derivative.
def d_MEX(x):
    # lenx = np.size(x) # Number of columns
    lenx = x.shape[1]
    N = x.shape[0]
    out = np.zeros((N, lenx))
    for ii in range(N):
        c = x[ii, 0:lenx-1]
        beta = x[ii, lenx-1]
        d_beta = np.array([0.])
        d_beta[0] = -1. / beta * (MEX(np.array([x[ii, :]])) - np.mean(np.multiply(c, np.exp(beta * c))) / np.mean(np.exp(beta * c)))
        d_c = 1. / lenx * np.exp(beta * c) / np.mean(np.exp(beta * c))
        out[ii, :] = np.concatenate((d_c, d_beta), axis=0)
    return out
# The first step is making it into a numpy function, this is easy:
np_MEX = np.vectorize(MEX, excluded=['x']) # IMPORTANT!! Otherwise np.vectorize() doesnt work
np_d_MEX = np.vectorize(d_MEX, excluded=['x']) # IMPORTANT!! Otherwise np.vectorize() doesnt work
# Now we make a tensorflow function
'''
Making a numpy fct to a tensorflow fct: We will start by making np_d_MEX_32 into a tensorflow function.
There is a function in tensorflow tf.py_func(func, inp, Tout, stateful=stateful, name=name) [doc]
which transforms any numpy function to a tensorflow function, so we can use it:
'''
np_d_MEX_32 = lambda x: np_d_MEX(x=x).astype(np.float32)

def tf_d_MEX(x, name=None):
    with tf.name_scope(name, "d_MEX", [x]) as name:
        y = tf.py_func(np_d_MEX_32,
                       [x],
                       [tf.float32],
                       name=name,
                       stateful=False)
        return y[0]
'''
tf.py_func acts on lists of tensors (and returns a list of tensors), that is why we have [x] (and return y[0]).
The stateful option is to tell tensorflow whether the function always gives the same output for the same input (stateful = False),
in which case tensorflow can simplify the tensorflow graph; this is our case and will probably be the case in most situations.
One thing to be careful of at this point is that numpy uses float64 but tensorflow uses float32, so you need to convert
your function to use float32 before you can convert it to a tensorflow function, otherwise tensorflow will complain.
This is why we need to make np_d_MEX_32 first.

What about the Gradients? The problem with only doing the above is that even though we now have tf_d_MEX, which is the
tensorflow version of np_d_MEX, we couldn't use it as an activation function if we wanted to, because tensorflow doesn't
know how to calculate the gradients of that function.

Hack to get Gradients: As explained in the sources mentioned above, there is a hack to define gradients of a function
using tf.RegisterGradient [doc] and tf.Graph.gradient_override_map [doc]. Copying the code from harpone, we can modify
the tf.py_func function to make it define the gradient at the same time:
'''
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))

    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
'''
Now we are almost done, the only thing is that the grad function we need to pass to the above py_func function needs to
take a special form. It needs to take in an operation, and the previous gradients before the operation and propagate
the gradients backward after the operation.
Gradient Function: So for our MEX activation function that is how we would do it:
'''
def MEXgrad(op, grad):
    x = op.inputs[0]
    # x = op
    n_gr = tf_d_MEX(x)
    return grad * n_gr
'''
The activation function has only one input, that is why x = op.inputs[0]. If the operation had many inputs, we would
need to return a tuple, one gradient for each input. For example, if the operation was a - b, the gradient with respect to a
is +1 and with respect to b is -1, so we would have return +1*grad, -1*grad. Notice that we need to return tensorflow
functions of the input, that is why we need tf_d_MEX; np_d_MEX would not have worked because it cannot act on
tensorflow tensors. Alternatively, we could have written the derivative using tensorflow functions.
'''
# Combining it all together: Now that we have all the pieces, we can combine them all together:
np_MEX_32 = lambda x: np_MEX(x=x).astype(np.float32)

def tf_MEX(x, name=None):
    with tf.name_scope(name, "MEX", [x]) as name:
        y = py_func(np_MEX_32,
                    [x],
                    [tf.float32],
                    name=name,
                    grad=MEXgrad)  # <-- here's the call to the gradient
        return y[0]
with tf.Session() as sess:
    x = tf.constant([[0.2, 0.7, 1.2, 1.7], [0.2, 0.7, 1.2, 1.7]])
    y = tf_MEX(x)
    tf.global_variables_initializer().run()
    print(x.eval(), y.eval(), tf.gradients(y, [x])[0].eval())
In the console, I have checked that the variables have the "correct" shapes:
x.eval()
Out[9]:
array([[ 0.2 , 0.69999999, 1.20000005, 1.70000005],
[ 0.2 , 0.69999999, 1.20000005, 1.70000005]], dtype=float32)
y.eval()
Out[10]: array([[ 0.83393127, 0.83393127]], dtype=float32)
tf_d_MEX(x).eval()
Out[11]:
array([[ 0.0850958 , 0.19909413, 0.46581003, 0.07051659],
[ 0.0850958 , 0.19909413, 0.46581003, 0.07051659]], dtype=float32)
My bad, I just found the mistake.
It's here:

def MEXgrad(op, grad):
    x = op.inputs[0]
    # x = op
    n_gr = tf_d_MEX(x)
    return n_gr

I wonder if there is a typo in the linked solution, where this mistake is also present.
I'm trying to implement this article:
http://ronan.collobert.com/pub/matos/2008_deep_icml.pdf
Specifically equation (3) from section 2.
In short, I want to do a pairwise distance computation for the features of each mini-batch and add this loss to the general network loss.
I have only the Tensor of the batch (16 samples), the labels tensor of the batch and the batch feature Tensor.
After looking for quite a while I still couldn't figure out the following:
1) How do I divide the batch into positive (i.e. same label) and negative pairs? Since Tensors are not iterable, I can't figure out how to get which sample has which label and then divide my vector, or get which indices of the tensor belong to each class.
2) How can I do a pairwise distance calculation for some of the indices in the batch tensor?
3) I also need to define a new distance function for negative examples.
Overall, I need to get which indices belong to which class, do a positive pairwise distance calculation for all positive pairs, and do another calculation for all negative pairs. Then sum it all up and add it to the network loss.
Any help (with one or more of the 3 issues) would be highly appreciated.
1) You should do the pair sampling before feeding the data into a session. Give every pair a boolean label, say y = 1 for a matched pair and 0 otherwise.
2) 3) Just calculate both the positive and negative terms for every pair, and let the 0-1 label y choose which one to add to the loss.
First create placeholders; y_ is for the boolean labels.
dim = 64
x1_ = tf.placeholder('float32', shape=(None, dim))
x2_ = tf.placeholder('float32', shape=(None, dim))
y_ = tf.placeholder('uint8', shape=[None]) # uint8 for boolean
Then the loss tensor can be created by the following function:
def loss(x1, x2, y):
    # Euclidean distance between x1, x2
    l2diff = tf.sqrt(tf.reduce_sum(tf.square(tf.sub(x1, x2)),
                                   reduction_indices=1))

    # you can try margin parameters
    margin = tf.constant(1.)

    labels = tf.to_float(y)

    match_loss = tf.square(l2diff, 'match_term')
    mismatch_loss = tf.maximum(0., tf.sub(margin, tf.square(l2diff)), 'mismatch_term')

    # if label is 1, only match_loss will count, otherwise mismatch_loss
    loss = tf.add(tf.mul(labels, match_loss),
                  tf.mul((1 - labels), mismatch_loss), 'loss_add')

    loss_mean = tf.reduce_mean(loss)
    return loss_mean

loss_ = loss(x1_, x2_, y_)
Then feed your data (randomly generated here, for example):
batchsize = 4
x1 = np.random.rand(batchsize, dim)
x2 = np.random.rand(batchsize, dim)
y = np.array([0,1,1,0])
l = sess.run(loss_, feed_dict={x1_:x1, x2_:x2, y_:y})
Short answer
I think the simplest way to do this is to sample the pairs offline (i.e. outside of the TensorFlow graph).
You create tf.placeholder for a batch of pairs along with their labels (positive or negative, i.e. same class or different class), and then you can compute the corresponding loss in TensorFlow.
With the code
You sample the pairs offline. You sample batch_size pairs of inputs, and output the left elements of the pairs in an array of shape [batch_size, input_size] (and similarly for the right elements). You also output the labels of the pairs (either positive or negative) of shape [batch_size,].
pairs_left = np.zeros((batch_size, input_size))
pairs_right = np.zeros((batch_size, input_size))
labels = np.zeros((batch_size, 1)) # ex: [[0.], [1.], [1.], [0.]] for batch_size=4
Then you create Tensorflow placeholders corresponding to these inputs. In your code, you will feed the previous inputs to these placeholders in the feed_dict argument of sess.run()
pairs_left_node = tf.placeholder(tf.float32, [batch_size, input_size])
pairs_right_node = tf.placeholder(tf.float32, [batch_size, input_size])
labels_node = tf.placeholder(tf.float32, [batch_size, 1])
Now we can perform a feedforward on the inputs (let's say your model is a linear model).
W = ... # shape [input_size, feature_size]
output_left = tf.matmul(pairs_left_node, W) # shape [batch_size, feature_size]
output_right = tf.matmul(pairs_right_node, W) # shape [batch_size, feature_size]
Finally we can compute the pairwise loss.
l2_loss_pairs = tf.reduce_sum(tf.square(output_left - output_right), 1)
positive_loss = l2_loss_pairs
negative_loss = tf.nn.relu(margin - l2_loss_pairs)
final_loss = tf.mul(labels_node, positive_loss) + tf.mul(1. - labels_node, negative_loss)
And that's it! You can now optimize on this loss, with good offline sampling.
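For completeness, a hypothetical training step in the same TF1 style might look like this (a sketch of my own, assuming W is a tf.Variable and reusing the placeholders and sampled arrays from above):

mean_loss = tf.reduce_mean(final_loss)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(mean_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, l = sess.run([train_op, mean_loss],
                    feed_dict={pairs_left_node: pairs_left,
                               pairs_right_node: pairs_right,
                               labels_node: labels})
    print(l)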