I'm trying to implement a max margin loss in TensorFlow.
The idea is that I have some positive examples, and I sample some negative examples and want to compute something like

loss = Σ_{i=1..B} Σ_{j=1..N} max(0, 1 - score(pos_i) + score(neg_ij))

where B is the size of my batch and N is the number of negative samples I want to use.
I'm new to TensorFlow and I'm finding it tricky to implement.
My model computes a vector of scores of dimension B * (N + 1), in which I alternate positive and negative samples. For instance, for a batch size of 2 and 2 negative examples, I have a vector of size 6 with the score for the first positive example at index 0, the score for the second positive example at index 3, and the scores for the negative examples at positions 1, 2, 4 and 5.
Ideally the scores would be something like [1, 0, 0, 1, 0, 0].
What I could come up with, using while loops and conditions, is the following:
# Function for computing max margin inner loop
def max_margin_inner(i, batch_examples_t, j, scores, loss):
    idx_pos = tf.mul(i, batch_examples_t)
    score_pos = tf.gather(scores, idx_pos)
    idx_neg = tf.add_n([tf.mul(i, batch_examples_t), j, 1])
    score_neg = tf.gather(scores, idx_neg)
    loss = tf.add(loss, tf.maximum(0.0, 1.0 - score_pos + score_neg))
    tf.add(j, 1)
    return [i, batch_examples_t, j, scores, loss]

# Function for computing max margin outer loop
def max_margin_outer(i, batch_examples_t, scores, loss):
    j = tf.constant(0)
    pos_idx = tf.mul(i, batch_examples_t)
    length = tf.gather(tf.shape(scores), 0)
    neg_smp_t = tf.constant(num_negative_samples)
    cond = lambda i, b, j, bi, lo: tf.logical_and(
        tf.less(j, neg_smp_t),
        tf.less(pos_idx, length))
    tf.while_loop(cond, max_margin_inner, [i, batch_examples_t, j, scores, loss])
    tf.add(i, 1)
    return [i, batch_examples_t, scores, loss]

# compute the loss
with tf.name_scope('max_margin'):
    loss = tf.Variable(0.0, name="loss")
    i = tf.constant(0)
    batch_examples_t = tf.constant(batch_examples)
    condition = lambda i, b, bi, lo: tf.less(i, b)
    max_margin = tf.while_loop(
        condition,
        max_margin_outer,
        [i, batch_examples_t, scores, loss])
The code has two loops, one for the outer sum and one for the inner sum. The problem I'm facing is that the loss variable keeps accumulating errors at each iteration without being reset. So it actually doesn't work at all.
Moreover, it really doesn't seem in line with the TensorFlow way of implementing things. I suspect there are better, more vectorized ways to implement it; I hope someone will suggest options or point me to examples.
First we need to clean the input:
we want an array of positive scores, of shape [B, 1]
we want a matrix of negative scores, of shape [B, N]
import tensorflow as tf
B = 2
N = 2
scores = tf.constant([0.5, 0.2, -0.1, 1., -0.5, 0.3]) # shape B * (N+1)
scores = tf.reshape(scores, [B, N+1])
scores_pos = tf.slice(scores, [0, 0], [B, 1])
scores_neg = tf.slice(scores, [0, 1], [B, N])
Now we only have to compute the matrix of the loss, i.e. all the individual losses for every pair (positive, negative), and compute its sum.
loss_matrix = tf.maximum(0., 1. - scores_pos + scores_neg) # we could also use tf.nn.relu here
loss = tf.reduce_sum(loss_matrix)
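As a quick sanity check (a minimal sketch of mine, assuming the toy scores above and the TF 1.x session API): the individual hinge terms are max(0, 1 - 0.5 + 0.2) = 0.7, max(0, 1 - 0.5 - 0.1) = 0.4, max(0, 1 - 1.0 - 0.5) = 0.0 and max(0, 1 - 1.0 + 0.3) = 0.3, so the total loss should come out to 1.4:
# Evaluate the graph built above (sanity check, not part of the original answer)
with tf.Session() as sess:
    print(sess.run(loss_matrix))  # [[0.7 0.4]
                                  #  [0.  0.3]]
    print(sess.run(loss))         # 1.4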
I'm trying to calculate a loss value in a variation of multiclass classification.
I have my y tensor (the values correspond to the classes):
y = torch.tensor([ 1, 0, 2])
My y_pred is a 3x3 matrix of probability distributions:
y_pred = torch.tensor([[0.4937, 0.2657, 0.2986],
[0.2553, 0.3845, 0.4384],
[0.2510, 0.3498, 0.2630]])
The complication is that I also have a distance matrix (each class has some distance to other classes):
d_mtx = torch.tensor([[0, 0.7256, 0.7433],
[0.6281, 0, 0.1171],
[0.7580, 0.2513, 0]])
The loss that I'm trying to calculate is:
loss = 0
for class_value in range(len(y)):
    dis = torch.dot(d_mtx[y[class_value]], y_pred[class_value])
    loss += dis
Is there a way to calculate it efficiently without the iteration?
Update 1:
I tried @Yahia Zakaria's approach and it works if my y_pred has the same size as my d_mtx, but otherwise I get an error:
RuntimeError: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 0
For example:
y = torch.tensor([ 1, 0, 2, 1])
y_pred = torch.tensor([[0.4937, 0.2657, 0.2986],
[0.2553, 0.3845, 0.4384],
[0.2510, 0.3498, 0.2630],
[0.2510, 0.3498, 0.2630]])
d_mtx = torch.tensor([[0, 0.7256, 0.7433],
[0.6281, 0, 0.1171],
[0.7580, 0.2513, 0]])
You could do it like that:
loss = (d_mtx[y] * y_pred).sum()
This solution assumes that y is of type torch.int64, which is valid for the example you have shown.
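For illustration, here is a quick check (a sketch of mine, reusing the example tensors from the question) that the indexing version reproduces the loop: d_mtx[y] picks, for each sample, the distance row of its true class, so the element-wise product with y_pred followed by the sum gives the same per-sample dot products.
import torch

y = torch.tensor([1, 0, 2])
y_pred = torch.tensor([[0.4937, 0.2657, 0.2986],
                       [0.2553, 0.3845, 0.4384],
                       [0.2510, 0.3498, 0.2630]])
d_mtx = torch.tensor([[0, 0.7256, 0.7433],
                      [0.6281, 0, 0.1171],
                      [0.7580, 0.2513, 0]])

# Loop version from the question
loss_loop = sum(torch.dot(d_mtx[y[k]], y_pred[k]) for k in range(len(y)))
# Vectorized version
loss_vec = (d_mtx[y] * y_pred).sum()
assert torch.isclose(loss_loop, loss_vec)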
I am trying to create a function which can transform a given input sequence to a transition matrix of the requested order. I found an implementation for the first-order Markovian transition matrix.
Now, I want to be able to come up with a solution which can calculate 2nd and 3rd order transition matrices.
Example of the 1st order matrix implementation:
import numpy as np
# sequence with 3 states -> 0, 1, 2
a = [0, 1, 0, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 2]
def transition_matrix_first_order(seq):
    M = np.full((3, 3), fill_value=1/3, dtype=np.float64)
    for (i, j) in zip(seq, seq[1:]):
        M[i, j] += 1
    M = M / M.sum(axis=1, keepdims=True)
    return M
print(transition_matrix_first_order(a))
Which gives me this:
[[0.61111111 0.19444444 0.19444444]
[0.38888889 0.38888889 0.22222222]
[0.22222222 0.22222222 0.55555556]]
When making a 2nd order matrix, it should have unique_state_count ** order rows and unique_state_count columns. In the example above, I have 3 unique states, so the matrix will have 9x3 structure.
Desirable function sample:
cal_tr_matrix(seq, unique_state_count, order)
I think you have a slight misunderstanding about the Markov chains and their transition matrices.
First of all, the estimated transition matrix your function produces is unfortunately not correct. Why? Let's refresh.
A discrete Markov chain in discrete time with N different states has a transition matrix P of size N x N, where a (i, j) element is P(X_1=j|X_0=i), i.e. the probability of transition from state i to state j in a single time step.
Now a transition matrix of order n, denoted P^{n}, is once again a matrix of size N x N where the (i, j) element is P(X_n=j|X_0=i), i.e. the probability of transition from state i to state j in n time steps.
A wonderful result says: P^{n} = P^n, i.e. taking the n-th power of the single-step transition matrix gives you the n-step transition matrix.
Now with this recap, all that is needed is to estimate P from the given sequence; to estimate P^{n} one can then just take the n-th power of the estimated P. So how to estimate the matrix P? Well, if we denote by N_{ij} the number of observed transitions from state i to state j, and by N_{i*} the number of observations of being in state i, then P_{ij} = N_{ij} / N_{i*}.
Overall here in Python:
import numpy as np
def transition_matrix(arr, n=1):
    """
    Computes the transition matrix from a Markov chain sequence of order `n`.

    :param arr: Discrete Markov chain state sequence in discrete time with states in 0, ..., N
    :param n: Transition order
    """
    M = np.zeros(shape=(max(arr) + 1, max(arr) + 1))
    for (i, j) in zip(arr, arr[1:]):
        M[i, j] += 1
    T = (M.T / M.sum(axis=1)).T
    return np.linalg.matrix_power(T, n)
transition_matrix(arr=a, n=1)
>>> array([[0.63636364, 0.18181818, 0.18181818],
>>> [0.4 , 0.4 , 0.2 ],
>>> [0.2 , 0.2 , 0.6 ]])
transition_matrix(arr=a, n=2)
>>> array([[0.51404959, 0.22479339, 0.26115702],
>>> [0.45454545, 0.27272727, 0.27272727],
>>> [0.32727273, 0.23636364, 0.43636364]])
transition_matrix(arr=a, n=3)
>>> array([[0.46927122, 0.23561232, 0.29511645],
>>> [0.45289256, 0.24628099, 0.30082645],
>>> [0.39008264, 0.24132231, 0.36859504]])
An interesting thing: when you set the order n to a fairly high number, the higher and higher powers of the P matrix seem to converge to some very specific values. That's known as the stationary/invariant distribution of the Markov chain, and it gives a very good indication of how the chain behaves over a long period of time/transitions. Also:
P = transition_matrix(a, 1)
P111 = transition_matrix(a, 111)
print(P)
print(P111.dot(P))
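Since P111 has essentially converged, multiplying it by P once more should barely change it. A one-line check of this (my addition, up to floating point tolerance):
print(np.allclose(P111, P111.dot(P)))  # expected: True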
EDIT: Now to the tweaked solution based on your comment. Instead of exploding the number of rows, I'd suggest using higher-dimensional matrices for higher orders. One way would be like this:
def cal_tr_matrix(arr, order):
    _shape = (max(arr) + 1,) * (order + 1)
    M = np.zeros(_shape)
    for _ind in zip(*[arr[_x:] for _x in range(order + 1)]):
        M[_ind] += 1
    return M

res1 = cal_tr_matrix(a, 1)
res2 = cal_tr_matrix(a, 2)
Now the element res1[i, j] says how many times transition i->j happened, while the element res2[i, j, k] says how many times transition i->j->k happened.
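If you then want conditional probabilities rather than raw counts, one possible normalization step (my own sketch; normalize_counts is a hypothetical helper, not part of the answer above) divides the counts along the last axis by their totals:
def normalize_counts(M):
    # Rows that were never observed are left as zeros to avoid 0/0.
    totals = M.sum(axis=-1, keepdims=True)
    return np.divide(M, totals, out=np.zeros_like(M), where=totals > 0)

P2 = normalize_counts(res2)  # P2[i, j, k] estimates P(X_t = k | X_{t-2} = i, X_{t-1} = j)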
I'm implementing a backward HMM algorithm in PyTorch. I used this link as a reference; it contains the results of the numerical example I am attempting to implement, so I can compare my generated results to it. On page 3, section 2 (Backward probability), there is a table containing the calculated results.
Here is my code:
# Initial Transition matrix as shown in page 2 of above link
A = np.array([[0.6, 0.4], [0.3, 0.7]])
A = torch.from_numpy(A)
# Initial State Probability (page 2)
pi = np.array([0.8, 0.2])
pi = torch.from_numpy(pi)
# Output probabilities (page 2)
emission_matrix = np.array([[0.3, 0.4, 0.3, 0.3], [0.4, 0.3, 0.3, 0.3]])
emission_matrix = torch.from_numpy(emission_matrix)
# Initialize empty 2x4 matrix (dimensions of emission matrix)
backward = torch.zeros(emission_matrix.shape, dtype=torch.float64)
# Backward algorithm
def _backward(emission_matrix):
    # Initialization: A(i, j) * B(T, i) * B(Ot+1, j), where B(Ot+1, j) = 1
    backward[:, -1] = torch.matmul(A, emission_matrix[:, -1])
    # I reversed the emission matrix so as to start from the last column
    rev_emission_mat = torch.flip(emission_matrix[:, :-1], [1])
    # I transposed the reversed emission matrix such that each iterable in the for
    # loop is the observation sequence probability
    T_rev_emission_mat = torch.transpose(rev_emission_mat, 1, 0)
    # This step assigns a reverse index enumeration to each iterable in the
    # emission matrix, starting from time T down to 0 rather than the opposite
    zipped_cols = list(zip(range(len(T_rev_emission_mat)-1, -1, -1), T_rev_emission_mat))
    for i, obs_prob in zipped_cols:
        # Induction: Σ A(i, j) * B(j)(Ot+1) * β(t+1, j)
        if i != 0:
            backward[:, i] = torch.matmul(A * obs_prob, backward[:, i+1])
    # Termination: Σ π(i) * b(i) * β(1, i)
    backward[:, 0] = torch.matmul(pi * obs_prob, backward[:, 1])
# run backward algorithm
_backward(emission_matrix)
# check results, backward is an all zero matrix that was initialized above
print(backward)
>>> tensor([[0.0102, 0.0324, 0.0900, 0.3000],
[0.0102, 0.0297, 0.0900, 0.3000]], dtype=torch.float64)
As you can see, the 0-th index does not match the result on page 3 of the previous link. What did I do wrong? If there is anything I can clarify, please let me know. Thanks in advance!
The termination step shouldn't be a matrix multiplication: torch.matmul of two 1-D tensors collapses to a single scalar (a dot product), so the same value gets written into every entry of backward[:, 0]. Replace that line with the element-wise product:
backward[:, 0] = pi * obs_prob * backward[:, 1]
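A tiny illustration of the difference (my own made-up numbers, not part of the original answer):
u = torch.tensor([0.8, 0.2])
v = torch.tensor([0.5, 0.3])
print(torch.matmul(u, v))  # tensor(0.4600) -- a single scalar (dot product)
print(u * v)               # tensor([0.4000, 0.0600]) -- one value per state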
Although there are many references showing how to register a gradient, I'm still not very clear on exactly what kind of gradient needs to be defined.
Some similar topics:
How to register a custom gradient for a operation composed of tf operations
How Can I Define Only the Gradient for a Tensorflow Subgraph?
Okay, here comes my question:
I have a forward function y = f(A, B), where the sizes are:
y: (batch_size, m, n)
A: (batch_size, a, a)
B: (batch_size, b, b)
Suppose I can write down the mathematical partial derivatives of every element of y with respect every element of A and B. dy/dA, dy/dB. My question is what should I return in the gradient function?
#ops.RegisterGradient("f")
def f_grad(op, grad):
...
return ???, ???
Here it says that "The result of the gradient function must be a list of Tensor objects representing the gradients with respect to each input."
It is very easy to understand the gradient to be defined when y is scalar and A, B are matrix. But when y is matrix and A, B are also matrix, what should that gradient be?
tf.gradients computes the gradient of the sum of each output tensor with respect to each value in the input tensors. A gradient operation receives the op for which you are computing the gradient, op, and the gradient accumulated at this point, grad. In your example, grad would be a tensor with the same shape as y, and each value would be the gradient of the corresponding value in y - that is, if grad[0, 0] == 2, it means that increasing y[0, 0] by 1 will increase the sum of the output tensor by 2 (I know, you are probably already clear on this). Now you have to compute the same thing for A and B. Let's say you figure out that increasing A[2, 3] by 1 will increase y[0, 0] by 3 and have no effect on any other value in y. That means increasing A[2, 3] by 1 would increase the sum of the output by 3 × 2 = 6, so the gradient for A[2, 3] would be 6.
As an example, let's take the gradient of the matrix multiplication (op MatMul), which you can find in tensorflow/python/ops/math_grad.py:
#ops.RegisterGradient("MatMul")
def _MatMulGrad(op, grad):
"""Gradient for MatMul."""
t_a = op.get_attr("transpose_a")
t_b = op.get_attr("transpose_b")
a = math_ops.conj(op.inputs[0])
b = math_ops.conj(op.inputs[1])
if not t_a and not t_b:
grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
grad_b = gen_math_ops.mat_mul(a, grad, transpose_a=True)
elif not t_a and t_b:
grad_a = gen_math_ops.mat_mul(grad, b)
grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True)
elif t_a and not t_b:
grad_a = gen_math_ops.mat_mul(b, grad, transpose_b=True)
grad_b = gen_math_ops.mat_mul(a, grad)
elif t_a and t_b:
grad_a = gen_math_ops.mat_mul(b, grad, transpose_a=True, transpose_b=True)
grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True, transpose_b=True)
return grad_a, grad_b
We will focus on the case where transpose_a and transpose_b are both False, so we are in the first branch, if not t_a and not t_b: (also ignore the conj, which is meant for complex values). 'a' and 'b' are the operands here and, as said before, grad has the gradient of the sum of the output with respect to each value in the multiplication result. So how would things change if I increased a[0, 0] by one? Basically, each element in the first row of the product matrix would be increased by the corresponding value in the first row of b. So the gradient for a[0, 0] is the dot product of the first row of b and the first row of grad - that is, how much I would increase each output value, multiplied by the accumulated gradient of each of these. If you think about it, the line grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True) is doing exactly that. grad_a[0, 0] will be the dot product of the first row of grad and the first row of b (because we are transposing b here), and, in general, grad_a[i, j] will be the dot product of the i-th row of grad and the j-th row of b. You can follow a similar reasoning for grad_b too.
EDIT:
As an example, see how tf.gradients and the registered gradient relate to each other:
import tensorflow as tf
# Import gradient registry to lookup gradient functions
from tensorflow.python.framework.ops import _gradient_registry
# Gradient function for matrix multiplication
matmul_grad = _gradient_registry.lookup('MatMul')
# A matrix multiplication
a = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
b = tf.constant([[6, 7, 8], [9, 10, 11]], dtype=tf.float32)
c = tf.matmul(a, b)
# Gradient of sum(c) wrt each element of a
grad_c_a_1, = tf.gradients(c, a)
# The same is obtained by backpropagating an all-ones matrix
grad_c_a_2, _ = matmul_grad(c.op, tf.ones_like(c))
# Multiply each element of c by itself, but stopping the gradients
# This should scale the gradients by the values of c
cc = c * tf.stop_gradient(c)
# Regular gradients computation
grad_cc_a_1, = tf.gradients(cc, a)
# Gradients function called with c as backpropagated gradients
grad_cc_a_2, _ = matmul_grad(c.op, c)
with tf.Session() as sess:
    print('a:')
    print(sess.run(a))
    print('b:')
    print(sess.run(b))
    print('c = a * b:')
    print(sess.run(c))
    print('tf.gradients(c, a)[0]:')
    print(sess.run(grad_c_a_1))
    print('matmul_grad(c.op, tf.ones_like(c))[0]:')
    print(sess.run(grad_c_a_2))
    print('tf.gradients(c * tf.stop_gradient(c), a)[0]:')
    print(sess.run(grad_cc_a_1))
    print('matmul_grad(c.op, c)[0]:')
    print(sess.run(grad_cc_a_2))
Output:
a:
[[1. 2.]
[3. 4.]]
b:
[[ 6. 7. 8.]
[ 9. 10. 11.]]
c = a * b:
[[24. 27. 30.]
[54. 61. 68.]]
tf.gradients(c, a)[0]:
[[21. 30.]
[21. 30.]]
matmul_grad(c.op, tf.ones_like(c))[0]:
[[21. 30.]
[21. 30.]]
tf.gradients(c * tf.stop_gradient(c), a)[0]:
[[ 573. 816.]
[1295. 1844.]]
matmul_grad(c.op, c)[0]:
[[ 573. 816.]
[1295. 1844.]]
I want to compute the pairwise squared distances of a batch of features in TensorFlow. I have a simple implementation using + and * operations, by tiling the original tensor:
def pairwise_l2_norm2(x, y, scope=None):
    with tf.op_scope([x, y], scope, 'pairwise_l2_norm2'):
        size_x = tf.shape(x)[0]
        size_y = tf.shape(y)[0]
        xx = tf.expand_dims(x, -1)
        xx = tf.tile(xx, tf.pack([1, 1, size_y]))
        yy = tf.expand_dims(y, -1)
        yy = tf.tile(yy, tf.pack([1, 1, size_x]))
        yy = tf.transpose(yy, perm=[2, 1, 0])
        diff = tf.sub(xx, yy)
        square_diff = tf.square(diff)
        square_dist = tf.reduce_sum(square_diff, 1)
        return square_dist
This function takes as input two matrices of sizes (m, d) and (n, d) and computes the squared distance between each pair of row vectors. The output is a matrix of size (m, n) with elements d_ij = dist(x_i, y_j).
The problem is that I have a large batch and high-dimensional features, so for large m, n, d replicating the tensor consumes a lot of memory.
I'm looking for another way to implement this that stores only the final distance tensor, without increasing the memory usage; some kind of double loop over the original tensor.
You can use some linear algebra to turn it into matrix ops. Note that what you need is the matrix D where, with a[i] denoting the i-th row of your original matrix,
D[i,j] = (a[i]-a[j])(a[i]-a[j])'
Expanding the product, (a[i]-a[j])(a[i]-a[j])' = a[i]a[i]' - 2 a[i]a[j]' + a[j]a[j]', so you can rewrite that into
D[i,j] = r[i] - 2 a[i]a[j]' + r[j]
where r[i] is the squared norm of the i-th row of the original matrix.
In a system that supports standard broadcasting rules you can treat r as a column vector and write D as
D = r - 2 A A' + r'
In TensorFlow you could write this as
A = tf.constant([[1, 1], [2, 2], [3, 3]])
r = tf.reduce_sum(A*A, 1)
# turn r into column vector
r = tf.reshape(r, [-1, 1])
D = r - 2*tf.matmul(A, tf.transpose(A)) + tf.transpose(r)
sess = tf.Session()
sess.run(D)
Result:
array([[0, 2, 8],
[2, 0, 2],
[8, 2, 0]], dtype=int32)
Using squared_difference:
def squared_dist(A):
    expanded_a = tf.expand_dims(A, 1)
    expanded_b = tf.expand_dims(A, 0)
    distances = tf.reduce_sum(tf.squared_difference(expanded_a, expanded_b), 2)
    return distances
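A quick usage example of mine (assuming the TF 1.x session API, reusing the same toy matrix as the earlier answer so the results can be compared):
A = tf.constant([[1., 1.], [2., 2.], [3., 3.]])
with tf.Session() as sess:
    print(sess.run(squared_dist(A)))
    # [[0. 2. 8.]
    #  [2. 0. 2.]
    #  [8. 2. 0.]]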
One thing I noticed is that this solution using tf.squared_difference gives me out of memory (OOM) for very large vectors, while the approach by @YaroslavBulatov doesn't. So I think decomposing the operation yields a smaller memory footprint (I had thought squared_difference would handle this better under the hood).
Here is a more general solution for two tensors of coordinates A and B:
def squared_dist(A, B):
    assert A.shape.as_list() == B.shape.as_list()
    row_norms_A = tf.reduce_sum(tf.square(A), axis=1)
    row_norms_A = tf.reshape(row_norms_A, [-1, 1])  # Column vector.
    row_norms_B = tf.reduce_sum(tf.square(B), axis=1)
    row_norms_B = tf.reshape(row_norms_B, [1, -1])  # Row vector.
    return row_norms_A - 2 * tf.matmul(A, tf.transpose(B)) + row_norms_B
Note that this is the squared distance. If you want the Euclidean distance, apply tf.sqrt to the result. In that case, don't forget to add a small constant to compensate for floating point instabilities: dist = tf.sqrt(squared_dist(A, B) + 1e-6).
If you want to compute the distance in some other way, just change the order of the tf operations accordingly.
def compute_euclidean_distance(x, y):
    size_x = x.shape.dims[0]
    size_y = y.shape.dims[0]
    for i in range(size_x):
        tile_one = tf.reshape(tf.tile(x[i], [size_y]), [size_y, -1])
        eu_one = tf.expand_dims(tf.sqrt(tf.reduce_sum(tf.pow(tf.subtract(tile_one, y), 2), axis=1)), axis=0)
        if i == 0:
            d = eu_one
        else:
            d = tf.concat([d, eu_one], axis=0)
    return d
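A hypothetical usage sketch (my own toy values; note this version relies on static shapes, since x.shape.dims is read at graph-construction time, and builds one graph node per row):
x = tf.constant([[0., 0.], [3., 4.]])
y = tf.constant([[0., 0.], [6., 8.]])
with tf.Session() as sess:
    print(sess.run(compute_euclidean_distance(x, y)))
    # [[ 0. 10.]
    #  [ 5.  5.]]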