Tensorflow: What gradients needed to be defined for custom operation?

Although there are many references showing how to register a gradient, I'm still not clear on exactly what kind of gradient needs to be defined.
Some similar topics:
How to register a custom gradient for a operation composed of tf operations
How Can I Define Only the Gradient for a Tensorflow Subgraph?
Okay, here comes my question:
I have a forward function y = f(A, B), where the shapes are:
y: (batch_size, m, n)
A: (batch_size, a, a)
B: (batch_size, b, b)
Suppose I can write down the mathematical partial derivatives of every element of y with respect to every element of A and B, i.e. dy/dA and dy/dB. My question is: what should I return in the gradient function?
@ops.RegisterGradient("f")
def f_grad(op, grad):
    ...
    return ???, ???
The documentation says that the result of the gradient function must be a list of Tensor objects representing the gradients with respect to each input.
It is easy to understand which gradient to define when y is a scalar and A, B are matrices. But when y is a matrix and A, B are also matrices, what should that gradient be?

tf.gradients computes the gradient of the sum of each output tensor with respect to each value in the input tensors. A gradient operation receives the op for which you are computing the gradient, op, and the gradient accumulated at this point, grad. In your example, grad would be a tensor with the same shape as y, and each value would be the gradient of the corresponding value in y - that is, if grad[0, 0] == 2, it means that increasing y[0, 0] by 1 will increase the sum of the output tensor by 2 (I know, you probably are already clear on this). Now you have to compute the same thing for A and B. Let's say you figure out that increasing A[2, 3] by 1 will increase y[0, 0] by 3 and have no effect over any other value in y. That means that would increase the sum of the output value by 3 × 2 = 6, so the gradient for A[2, 3] would be 6.
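To connect this with the shapes in the question, here is a minimal NumPy sketch (not TensorFlow code) of what the gradient function would return for A, assuming the full Jacobian dy/dA were available as an array and that each batch entry of y depends only on the matching batch entry of A. The names `dy_dA` and `grad_A` and the sizes are illustrative assumptions; in practice the contraction is usually computed directly without materializing the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, m, n, a = 2, 3, 4, 2  # illustrative sizes, not taken from the question

# grad: gradient of the final scalar with respect to y, same shape as y
grad = rng.normal(size=(batch, m, n))
# dy_dA[k, i, j, p, q] = d y[k, i, j] / d A[k, p, q]  (hypothetical Jacobian)
dy_dA = rng.normal(size=(batch, m, n, a, a))

# Chain rule: weight each partial derivative by the incoming gradient and
# sum over the output indices. The result has the same shape as A.
grad_A = np.einsum("kij,kijpq->kpq", grad, dy_dA)

# Explicit sum for a single entry, to make the contraction concrete
k, p, q = 1, 0, 1
reference = sum(grad[k, i, j] * dy_dA[k, i, j, p, q]
                for i in range(m) for j in range(n))
print(grad_A.shape, np.isclose(grad_A[k, p, q], reference))
```

The same contraction with dy/dB would give the second return value, with the shape of B.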
As an example, let's take the gradient of the matrix multiplication (op MatMul), which you can find in tensorflow/python/ops/math_grad.py:
@ops.RegisterGradient("MatMul")
def _MatMulGrad(op, grad):
    """Gradient for MatMul."""
    t_a = op.get_attr("transpose_a")
    t_b = op.get_attr("transpose_b")
    a = math_ops.conj(op.inputs[0])
    b = math_ops.conj(op.inputs[1])
    if not t_a and not t_b:
        grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
        grad_b = gen_math_ops.mat_mul(a, grad, transpose_a=True)
    elif not t_a and t_b:
        grad_a = gen_math_ops.mat_mul(grad, b)
        grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True)
    elif t_a and not t_b:
        grad_a = gen_math_ops.mat_mul(b, grad, transpose_b=True)
        grad_b = gen_math_ops.mat_mul(a, grad)
    elif t_a and t_b:
        grad_a = gen_math_ops.mat_mul(b, grad, transpose_a=True, transpose_b=True)
        grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True, transpose_b=True)
    return grad_a, grad_b
We will focus on the case where transpose_a and transpose_b are both False, so we are in the first branch, if not t_a and not t_b: (also ignore the conj, which is meant for complex values). 'a' and 'b' are the operands here and, as said before, grad holds the gradient of the sum of the output with respect to each value in the multiplication result. So how would things change if I increase a[0, 0] by one? Each element in the first row of the product matrix would increase by the corresponding value in the first row of b. So the gradient for a[0, 0] is the dot product of the first row of b and the first row of grad, that is, how much each output value would increase multiplied by the accumulated gradient of that output value. If you think about it, the line grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True) is doing exactly that. grad_a[0, 0] will be the dot product of the first row of grad and the first row of b (because we are transposing b here) and, in general, grad_a[i, j] will be the dot product of the i-th row of grad and the j-th row of b. You can follow similar reasoning for grad_b too.
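The reasoning for the first branch can be checked numerically. The following is a small NumPy sketch (an illustration, not the TensorFlow implementation) that applies the same formula, grad times b transposed, and compares it against a finite-difference estimate. The variable names and the step size `eps` are assumptions chosen for the example.

```python
import numpy as np

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[6., 7., 8.], [9., 10., 11.]])
grad = np.ones((2, 3))  # upstream gradient for the output of a @ b

# Same formula as the branch with no transposes: grad_a = grad @ b^T
grad_a = grad @ b.T

# Finite-difference estimate of d(sum(grad * (a @ b))) / d a[i, j]
eps = 1e-6
base = np.sum(grad * (a @ b))
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        bumped = a.copy()
        bumped[i, j] += eps
        numeric[i, j] = (np.sum(grad * (bumped @ b)) - base) / eps

print(grad_a)
print(np.allclose(numeric, grad_a, atol=1e-4))
```

Each entry of grad_a is the dot product of a row of grad with a row of b, which is exactly the quantity described above.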
EDIT:
As an example, see how tf.gradients and the registered gradient relate to each other:
import tensorflow as tf
# Import gradient registry to lookup gradient functions
from tensorflow.python.framework.ops import _gradient_registry
# Gradient function for matrix multiplication
matmul_grad = _gradient_registry.lookup('MatMul')
# A matrix multiplication
a = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
b = tf.constant([[6, 7, 8], [9, 10, 11]], dtype=tf.float32)
c = tf.matmul(a, b)
# Gradient of sum(c) wrt each element of a
grad_c_a_1, = tf.gradients(c, a)
# The same is obtained by backpropagating an all-ones matrix
grad_c_a_2, _ = matmul_grad(c.op, tf.ones_like(c))
# Multiply each element of c by itself, but stopping the gradients
# This should scale the gradients by the values of c
cc = c * tf.stop_gradient(c)
# Regular gradients computation
grad_cc_a_1, = tf.gradients(cc, a)
# Gradients function called with c as backpropagated gradients
grad_cc_a_2, _ = matmul_grad(c.op, c)
with tf.Session() as sess:
    print('a:')
    print(sess.run(a))
    print('b:')
    print(sess.run(b))
    print('c = a * b:')
    print(sess.run(c))
    print('tf.gradients(c, a)[0]:')
    print(sess.run(grad_c_a_1))
    print('matmul_grad(c.op, tf.ones_like(c))[0]:')
    print(sess.run(grad_c_a_2))
    print('tf.gradients(c * tf.stop_gradient(c), a)[0]:')
    print(sess.run(grad_cc_a_1))
    print('matmul_grad(c.op, c)[0]:')
    print(sess.run(grad_cc_a_2))
Output:
a:
[[1. 2.]
[3. 4.]]
b:
[[ 6. 7. 8.]
[ 9. 10. 11.]]
c = a * b:
[[24. 27. 30.]
[54. 61. 68.]]
tf.gradients(c, a)[0]:
[[21. 30.]
[21. 30.]]
matmul_grad(c.op, tf.ones_like(c))[0]:
[[21. 30.]
[21. 30.]]
tf.gradients(c * tf.stop_gradient(c), a)[0]:
[[ 573. 816.]
[1295. 1844.]]
matmul_grad(c.op, c)[0]:
[[ 573. 816.]
[1295. 1844.]]

Related

python tensorflow l2 loss over axis

I am using Python 3 with TensorFlow.
I have a matrix in which each row is a vector. I want to get a distance matrix computed using the L2 norm, where each value in the matrix is the distance between two vectors,
e.g.
Dij = l2_distance(M(i,:), M(j,:))
Thanks
edit:
This is not a duplicate: that other question is about computing the norm of each row of a matrix, whereas I need the pairwise distance between each row and every other row.

This answer shows how to compute the pair-wise sum of squared differences between a collection of vectors. By simply post-composing with the square root, you arrive at your desired pair-wise distances:
M = tf.constant([[0, 0], [2, 2], [5, 5]], dtype=tf.float64)
r = tf.reduce_sum(M*M, 1)
r = tf.reshape(r, [-1, 1])
D2 = r - 2*tf.matmul(M, tf.transpose(M)) + tf.transpose(r)
D = tf.sqrt(D2)
with tf.Session() as sess:
    print(sess.run(D))
# [[0. 2.82842712 7.07106781]
# [2.82842712 0. 4.24264069]
# [7.07106781 4.24264069 0. ]]
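For comparison, here is a NumPy sketch of the same expansion, ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2. The clipping step is an addition of mine, not part of the answer above: rounding can make a squared distance slightly negative, which would turn into nan under the square root.

```python
import numpy as np

M = np.array([[0., 0.], [2., 2.], [5., 5.]])

r = np.sum(M * M, axis=1, keepdims=True)   # squared row norms, shape (3, 1)
D2 = r - 2.0 * (M @ M.T) + r.T             # squared pairwise distances
D2 = np.maximum(D2, 0.0)                   # guard against tiny negative values from rounding
D = np.sqrt(D2)
print(D)
```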

You can write a TensorFlow operation based on the formula of Euclidean distance (L2 loss).
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2))))
Sample would be
import tensorflow as tf
x1 = tf.constant([1, 2, 3], dtype=tf.float32)
x2 = tf.constant([4, 5, 6], dtype=tf.float32)
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2))))
with tf.Session() as sess:
    print(sess.run(distance))
As pointed out by @fuglede, if you want to output the pairwise distances, then we can use
tf.sqrt(tf.square(tf.subtract(x1, x2)))

Squared Mahalanobis distance function in Python returning array - why?

The code is:
import numpy as np
def Mahalanobis(x, covariance_matrix, mean):
    x = np.array(x)
    mean = np.array(mean)
    covariance_matrix = np.array(covariance_matrix)
    return (x-mean)*np.linalg.inv(covariance_matrix)*(x.transpose()-mean.transpose())
#variables x and mean are 1xd arrays; covariance_matrix is a dxd matrix
#the 1xd array passed to x should be multiplied by the (inverted) dxd array
#that was passed into the second argument
#the resulting 1xd matrix is to be multiplied by a dx1 matrix, the transpose of
#[x-mean], which should result in a 1x1 array (a number)
But for some reason I get a matrix for my output when I enter the parameters
Mahalanobis([2,5], [[.5,0],[0,2]], [3,6])
output:
out[]: array([[ 2. , 0. ],
[ 0. , 0.5]])
It seems my function is just giving me the inverse of the 2x2 matrix that I input in the 2nd argument.

You've made the classic mistake of assuming that the * operator is doing matrix multiplication. This is not true in Python/numpy (see http://www.scipy-lectures.org/intro/numpy/operations.html and https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html). I broke it down into intermediate steps and used the dot function
import numpy as np
def Mahalanobis(x, covariance_matrix, mean):
    x = np.array(x)
    mean = np.array(mean)
    covariance_matrix = np.array(covariance_matrix)
    t1 = (x-mean)
    print(f'Term 1 {t1}')
    icov = np.linalg.inv(covariance_matrix)
    print(f'Inverse covariance {icov}')
    t2 = (x.transpose()-mean.transpose())
    print(f'Term 2 {t2}')
    mahal = t1.dot(icov.dot(t2))
    #return (x-mean)*np.linalg.inv(covariance_matrix).dot(x.transpose()-mean.transpose())
    return mahal
#variables x and mean are 1xd arrays; covariance_matrix is a dxd matrix
#the 1xd array passed to x should be multiplied by the (inverted) dxd array
#that was passed into the second argument
#the resulting 1xd matrix is to be multiplied by a dx1 matrix, the transpose of
#[x-mean], which should result in a 1x1 array (a number)
Mahalanobis([2,5], [[.5,0],[0,2]], [3,6])
produces
Term 1 [-1 -1]
Inverse covariance [[2. 0. ]
[0. 0.5]]
Term 2 [-1 -1]
Out[9]: 2.5

One can use scipy's mahalanobis() function to verify:
import scipy.spatial, numpy as np
scipy.spatial.distance.mahalanobis([2,5], [3,6], np.linalg.inv([[.5,0],[0,2]]))
# 1.5811388300841898
1.5811388300841898**2 # squared Mahalanobis distance
# 2.5000000000000004
def Mahalanobis(x, covariance_matrix, mean):
    x, m, C = np.array(x), np.array(mean), np.array(covariance_matrix)
    return (x-m)@np.linalg.inv(C)@(x-m).T
Mahalanobis([2,5], [[.5,0],[0,2]], [3,6])
# 2.5
np.isclose(
scipy.spatial.distance.mahalanobis([2,5], [3,6], np.linalg.inv([[.5,0],[0,2]]))**2,
Mahalanobis([2,5], [[.5,0],[0,2]], [3,6])
)
# True
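As a variation on the answers above, the squared distance can also be computed without forming the inverse explicitly, by solving a linear system instead. This is a sketch under the assumption that the covariance matrix is non-singular; the function name `mahalanobis_squared` is made up for illustration.

```python
import numpy as np

def mahalanobis_squared(x, covariance_matrix, mean):
    """Squared Mahalanobis distance (x - mean)^T C^{-1} (x - mean)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve C z = diff rather than computing inv(C) and multiplying
    z = np.linalg.solve(np.asarray(covariance_matrix, dtype=float), diff)
    return float(diff @ z)

result = mahalanobis_squared([2, 5], [[.5, 0], [0, 2]], [3, 6])
print(result)
```

For the inputs used in the question this agrees with the value 2.5 obtained above.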

How to transform vector into unit vector in Tensorflow

This is a pretty simple question that I just can't seem to figure out. I am working with an output tensor of shape [100, 250]. I want to be able to access the 250-dimensional array at any position along the hundred and modify each one separately. The TensorFlow mathematical tools that I've found do either element-wise modification or scalar modification of the entire tensor. However, I am trying to do scalar modification on subsets of the tensor.
EDIT:
Here is the numpy code that I would like to recreate with tensorflow methods:
update = sess.run(y, feed_dict={x: batch_xs})
for i in range(len(update)):
    update[i] = update[i]/np.sqrt(np.sum(np.square(update[i])))
    update[i] = update[i] * magnitude
This for loop follows the unit-vector formula $\hat{v} = v / \lVert v \rVert$, with $\lVert v \rVert = \sqrt{\sum_k v_k^2}$, in 250-D instead of 3-D. I then multiply each unit vector by magnitude to rescale it to my desired length.
So update here is the numpy [100, 250] dimensional output. I want to transform each 250 dimensional vector into its unit vector. That way I can change its length to a magnitude of my choosing. Using this numpy code, if I run my train_step and pass update into one of my placeholders
sess.run(train_step, feed_dict={x: batch_xs, prediction: output})
it returns the error:
No gradients provided for any variable
This is because I've done the math in numpy and ported it back into tensorflow. Here is a related stackoverflow question that did not get answered.
tf.nn.l2_normalize is very close to what I am looking for, but it divides by the square root of the maximum sum of squares, whereas I am trying to divide each vector by the square root of its own sum of squares.
Thanks!

There is no real trick here, you can do as in numpy.
The only thing to make sure is that norm is of shape [100, 1] so that it broadcasts well in the division x / norm.
x = tf.ones([100, 250])
norm = tf.sqrt(tf.reduce_sum(tf.square(x), axis=1, keepdims=True))
assert norm.shape == [100, 1]
res = x / norm
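The broadcasting point can be illustrated with a NumPy sketch of the whole per-row operation from the question, without the Python loop. The random data and the value of `magnitude` are placeholders I chose for the example; written with the TensorFlow equivalents of these calls, as in the snippet above, the computation stays inside the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
update = rng.normal(size=(100, 250))  # stand-in for the network output
magnitude = 3.0                       # hypothetical target length

# keepdims=True keeps the shape (100, 1), so the division broadcasts per row
norm = np.sqrt(np.sum(np.square(update), axis=1, keepdims=True))
scaled = update / norm * magnitude

# Every row now has Euclidean length equal to magnitude
lengths = np.linalg.norm(scaled, axis=1)
print(norm.shape, lengths[:3])
```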

You can use tf.norm to get the square root of the sum of squares. (tf version == 1.4 in my code.)
Example code:
import tensorflow as tf
a = tf.random_uniform((3, 4))
b = tf.norm(a, keep_dims=True)
c = tf.norm(a, axis=1, keep_dims=True)
d = a / c
e = a / tf.sqrt(tf.reduce_sum(tf.square(a), axis=1, keep_dims=True) + 1e-8)
f = a / tf.sqrt(tf.reduce_sum(tf.square(a), axis=1, keep_dims=True))
g = tf.sqrt(tf.reduce_sum(tf.square(a), axis=1, keep_dims=True))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    a_eval, b_eval, c_eval, d_eval, e_eval, f_eval, g_eval = sess.run([a, b, c, d, e, f, g])
print(a_eval)
print(b_eval)
print(c_eval)
print(d_eval)
print(e_eval)
print(f_eval)
print(g_eval)
output:
[[ 0.29823065 0.76523042 0.40478575 0.44568062]
[ 0.0222317 0.12344956 0.39582515 0.66143286]
[ 0.01351094 0.38285756 0.46898723 0.34417391]]
[[ 1.4601624]]
[[ 1.01833284]
[ 0.78096414]
[ 0.6965394 ]]
[[ 0.29286167 0.75145411 0.39749849 0.43765712]
[ 0.02846699 0.15807328 0.50684166 0.84694397]
[ 0.01939724 0.54965669 0.6733104 0.49411979]]
[[ 0.29286167 0.75145411 0.39749849 0.43765712]
[ 0.02846699 0.15807328 0.50684166 0.84694397]
[ 0.01939724 0.54965669 0.6733104 0.49411979]]
[[ 0.29286167 0.75145411 0.39749849 0.43765712]
[ 0.02846699 0.15807328 0.50684166 0.84694397]
[ 0.01939724 0.54965669 0.6733104 0.49411979]]
[[ 1.01833284]
[ 0.78096414]
[ 0.6965394 ]]
You can see that, for this input, there's no visible difference between a / tf.norm(a, axis=1, keep_dims=True) and a / tf.sqrt(tf.reduce_sum(tf.square(a), axis=1, keep_dims=True) + 1e-8).
a / tf.sqrt(tf.reduce_sum(tf.square(a), axis=1, keep_dims=True) + 1e-8) is preferred because it also handles an all-zero row, where the norm would otherwise be zero.

Max margin loss in TensorFlow

I'm trying to implement a max margin loss in TensorFlow.
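The idea is that I have some positive examples, I sample some negative examples, and I want to compute something like
$$\sum_{b=1}^{B} \sum_{n=1}^{N} \max\left(0,\ 1 - s_b^{+} + s_{b,n}^{-}\right)$$
where B is the size of my batch and N is the number of negative samples I want to use (here $s_b^{+}$ denotes the score of the b-th positive example and $s_{b,n}^{-}$ the score of its n-th negative example, matching the hinge term in the code below).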
I'm new to tensorflow and I'm finding it tricky to implement it.
My model computes a vector of scores of dimension B * (N + 1) where I alternate positive samples and negative samples. For instance, for a batch size of 2 and 2 negative examples I have a vector of size 6 with scores for the first positive example at index 0 and for the second positive example at position 3 and scores for negative examples in position 1, 2, 4 and 5.
The ideal would be to get values like [1, 0, 0, 1, 0, 0].
What I could come up with is the following, using while loops and conditions:
# Function for computing max margin inner loop
def max_margin_inner(i, batch_examples_t, j, scores, loss):
    idx_pos = tf.mul(i, batch_examples_t)
    score_pos = tf.gather(scores, idx_pos)
    idx_neg = tf.add_n([tf.mul(i, batch_examples_t), j, 1])
    score_neg = tf.gather(scores, idx_neg)
    loss = tf.add(loss, tf.maximum(0.0, 1.0 - score_pos + score_neg))
    tf.add(j, 1)
    return [i, batch_examples_t, j, scores, loss]
# Function for computing max margin outer loop
def max_margin_outer(i, batch_examples_t, scores, loss):
    j = tf.constant(0)
    pos_idx = tf.mul(i, batch_examples_t)
    length = tf.gather(tf.shape(scores), 0)
    neg_smp_t = tf.constant(num_negative_samples)
    cond = lambda i, b, j, bi, lo: tf.logical_and(
        tf.less(j, neg_smp_t),
        tf.less(pos_idx, length))
    tf.while_loop(cond, max_margin_inner, [i, batch_examples_t, j, scores, loss])
    tf.add(i, 1)
    return [i, batch_examples_t, scores, loss]
# compute the loss
with tf.name_scope('max_margin'):
    loss = tf.Variable(0.0, name="loss")
    i = tf.constant(0)
    batch_examples_t = tf.constant(batch_examples)
    condition = lambda i, b, bi, lo: tf.less(i, b)
    max_margin = tf.while_loop(
        condition,
        max_margin_outer,
        [i, batch_examples_t, scores, loss])
The code has two loops, one for the outer sum and the other for the inner one. The problem I'm facing is that the loss variable keeps accumulating errors at each iteration without being reset after each iteration. So it actually doesn't work at all.
Moreover, it seems really not in line with tensorflow way of implementing things. I guess there could be better ways, more vectorized ways to implement it, hope someone will suggest options or point me to examples.

First we need to clean the input:
we want an array of positive scores, of shape [B, 1]
we want a matrix of negative scores, of shape [B, N]
import tensorflow as tf
B = 2
N = 2
scores = tf.constant([0.5, 0.2, -0.1, 1., -0.5, 0.3]) # shape B * (N+1)
scores = tf.reshape(scores, [B, N+1])
scores_pos = tf.slice(scores, [0, 0], [B, 1])
scores_neg = tf.slice(scores, [0, 1], [B, N])
Now we only have to compute the matrix of the loss, i.e. all the individual loss for every pair (positive, negative), and compute its sum.
loss_matrix = tf.maximum(0., 1. - scores_pos + scores_neg) # we could also use tf.nn.relu here
loss = tf.reduce_sum(loss_matrix)
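To see the numbers this produces, here is the same vectorized computation as a NumPy sketch, using the example scores from the snippet above. It mirrors the reshape, slice, and broadcast steps but is only an illustration, not a replacement for the TensorFlow code.

```python
import numpy as np

B, N = 2, 2
scores = np.array([0.5, 0.2, -0.1, 1.0, -0.5, 0.3]).reshape(B, N + 1)

scores_pos = scores[:, :1]   # shape (B, 1): one positive score per row
scores_neg = scores[:, 1:]   # shape (B, N): the negatives for that positive

# Broadcasting pairs each positive score with all of its negatives
loss_matrix = np.maximum(0.0, 1.0 - scores_pos + scores_neg)
loss = loss_matrix.sum()
print(loss_matrix)
print(loss)
```

Only the pairs that violate the margin contribute; the pair (1.0, -0.5) in the second row is already separated by more than 1 and adds zero.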

How to implement the Softmax function in Python

From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:
$$S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$$
where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.
I've tried the following:
import numpy as np
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
scores = [3.0, 1.0, 0.2]
print(softmax(scores))
which returns:
[ 0.8360188 0.11314284 0.05083836]
But the suggested solution was:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.
Can someone show mathematically why? Is one correct and the other one wrong?
Are the implementation similar in terms of code and time complexity? Which is more efficient?

They're both correct, but yours is preferred from the point of view of numerical stability.
You start with
e ^ (x - max(x)) / sum(e ^ (x - max(x)))
By using the fact that a^(b - c) = (a^b)/(a^c) we have
= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))
= e ^ x / sum(e ^ x)
Which is what the other answer says. You could replace max(x) with any constant and it would cancel out.
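The cancellation can be confirmed numerically with a short NumPy sketch: subtracting different constants from the scores leaves the result unchanged. The helper name `plain_softmax` and the particular shift values are arbitrary choices for this illustration.

```python
import numpy as np

def plain_softmax(x):
    """Softmax of a 1-D array, written directly from the definition."""
    return np.exp(x) / np.sum(np.exp(x))

x = np.array([3.0, 1.0, 0.2])
reference = plain_softmax(x)

# Shift by max(x) and by two other arbitrary constants
shifted = [plain_softmax(x - c) for c in (np.max(x), 1.5, -7.0)]
print(reference)
print(all(np.allclose(reference, s) for s in shifted))
```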

(Well... much confusion here, both in the question and in the answers...)
To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had tried also the 2-D score array in the Udacity quiz provided example.
Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument:
import numpy as np
# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference
As I said, for a 1-D score array, the results are indeed identical:
scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
print(softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True, True, True], dtype=bool)
Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:
scores2D = np.array([[1, 2, 3, 6],
[2, 4, 5, 6],
[3, 8, 7, 6]])
print(your_softmax(scores2D))
# [[ 4.89907947e-04 1.33170787e-03 3.61995731e-03 7.27087861e-02]
# [ 1.33170787e-03 9.84006416e-03 2.67480676e-02 7.27087861e-02]
# [ 3.61995731e-03 5.37249300e-01 1.97642972e-01 7.27087861e-02]]
print(softmax(scores2D))
# [[ 0.09003057 0.00242826 0.01587624 0.33333333]
# [ 0.24472847 0.01794253 0.11731043 0.33333333]
# [ 0.66524096 0.97962921 0.86681333 0.33333333]]
The results are different - the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.
So, all the fuss was actually for an implementation detail - the axis argument. According to the numpy.sum documentation:
The default, axis=None, will sum all of the elements of the input array
while here we want to sum over the rows, i.e. down each column, hence axis=0. For a 1-D array, the sum along the only axis and the sum of all the elements happen to be identical, hence your identical results in that case...
The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some other answers here).

So, this is really a comment on desertnaut's answer, but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first uses a 1-dimensional input and then a 2-dimensional input. Let me show this to you.
import numpy as np
# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
# desertnaut solution (copied from his answer):
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference
# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div
Let's take desertnaut's example:
x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)
This is the output:
your_softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
desertnaut_softmax(x1)
array([[ 1., 1., 1., 1.]])
softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
You can see that desertnaut's version would fail in this situation. (It would not if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)
Let's now use 3 samples, since that's the reason why we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.
x2 = np.array([[1, 2, 3, 6], # sample 1
[2, 4, 5, 6], # sample 2
[1, 2, 3, 6]]) # sample 1 again(!)
This input consists of a batch with 3 samples. But sample one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!
your_softmax(x2)
array([[ 0.00183535, 0.00498899, 0.01356148, 0.27238963],
[ 0.00498899, 0.03686393, 0.10020655, 0.27238963],
[ 0.00183535, 0.00498899, 0.01356148, 0.27238963]])
desertnaut_softmax(x2)
array([[ 0.21194156, 0.10650698, 0.10650698, 0.33333333],
[ 0.57611688, 0.78698604, 0.78698604, 0.33333333],
[ 0.21194156, 0.10650698, 0.10650698, 0.33333333]])
softmax(x2)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047],
[ 0.01203764, 0.08894682, 0.24178252, 0.65723302],
[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
I hope you can see that this is only the case with my solution.
softmax(x1) == softmax(x2)[0]
array([[ True, True, True, True]], dtype=bool)
softmax(x1) == softmax(x2)[2]
array([[ True, True, True, True]], dtype=bool)
Additionally, here are the results of TensorFlow's softmax implementation:
import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})
And the result:
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037045],
[ 0.01203764, 0.08894681, 0.24178252, 0.657233 ],
[ 0.00626879, 0.01704033, 0.04632042, 0.93037045]], dtype=float32)

I would say that while both are correct mathematically, implementation-wise, first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick which is essentially what you are doing.

sklearn also offers implementation of softmax
from sklearn.utils.extmath import softmax
import numpy as np
x = np.array([[ 0.50839931, 0.49767588, 0.51260159]])
softmax(x)
# output
array([[ 0.3340521 , 0.33048906, 0.33545884]])

From a mathematical point of view, both sides are equal.
And you can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to
$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} / e^{m}}{\sum_j e^{x_j} / e^{m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Notice that this works for any m, because for all (even complex) numbers e^m != 0.
From a computational complexity point of view they are also equivalent: both run in O(n) time, where n is the size of the vector.
From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for moderately large values of x. Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities; the other will overflow and return nan.
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). To fix it, you need to use sum(axis=0).
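The overflow experiment suggested above can be run with the following NumPy sketch. The function names `softmax_naive` and `softmax_stable` are labels I introduced for the two variants from the question; warnings are suppressed only to keep the output readable.

```python
import numpy as np

def softmax_naive(x):
    """Direct definition; exp overflows for large inputs."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmax_stable(x):
    """Subtract the maximum first, so the largest exponent is 0."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

x = np.array([1000.0, 5.0])

# exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)
stable = softmax_stable(x)

print(naive)
print(stable)
```

In the stable version the second entry underflows to zero, which is harmless here because the true probability is far below what a float can represent.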

EDIT. As of version 1.2.0, scipy includes softmax as a special function:
https://scipy.github.io/devdocs/generated/scipy.special.softmax.html
I wrote a function applying the softmax over any axis:
def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.
    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.
    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """
    # make X at least 2d
    y = np.atleast_2d(X)
    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
    # multiply y against the theta parameter,
    y = y * float(theta)
    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)
    # exponentiate y
    y = np.exp(y)
    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
    # finally: divide elementwise
    p = y / ax_sum
    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()
    return p
Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.

Here you can find out why they used - max.
From there:
"When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

I was curious to see the performance difference between these
import numpy as np
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)
def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)
x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]
Using
print("----- softmax")
%timeit a=softmax(x)
print("----- softmaxv2")
%timeit a=softmaxv2(x)
print("----- softmaxv3")
%timeit a=softmaxv2(x)
print("----- softmaxv4")
%timeit a=softmaxv2(x)
Increasing the values inside x (+100 +200 +500...) I get consistently better results with the original numpy version (here is just one test)
----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop
Until.... the values inside x reach ~800, then I get
----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop
As some said, your version is more numerically stable 'for large numbers'. For small numbers could be the other way around.

A more concise version is:
def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.
import scipy.special as sc
import numpy as np
def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))

I needed something compatible with the output of a dense layer from Tensorflow.
The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:
def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)
Results:
logits = np.asarray([
[-0.0052024, -0.00770216, 0.01360943, -0.008921], # 1
[-0.0052024, -0.00770216, 0.01360943, -0.008921] # 2
])
print(softmax(logits))
#[[0.2492037 0.24858153 0.25393605 0.24827873]
# [0.2492037 0.24858153 0.25393605 0.24827873]]
Ref: Tensorflow softmax

I would suggest this:
def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))
It will work for stochastic as well as the batch.
For more detail see :
https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d

In order to maintain for numerical stability, max(x) should be subtracted. The following is the code for softmax function;
def softmax(x):
    if len(x.shape) > 1:
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp
    return x
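To illustrate why the max is subtracted, here is a minimal sketch. The names `softmax_naive` and `softmax_shifted` are made up for this comparison. For moderate inputs the two versions agree, since softmax is unchanged by adding a constant to every input. For large inputs only the shifted version stays finite:

```python
import numpy as np

def softmax_naive(x):
    # No shift: np.exp overflows to inf once x exceeds roughly 709 in float64.
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_shifted(x):
    # Shift so the largest exponent is exp(0) = 1; the result is mathematically identical.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)  # inf / inf gives nan
shifted_large = softmax_shifted(large)

print(softmax_naive(small), softmax_shifted(small))
print(naive_large)
print(shifted_large)
```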

This has already been answered in detail above: the max is subtracted to avoid overflow. Here is one more implementation, in Python 3.
import numpy as np
def softmax(x):
    mx = np.amax(x, axis=1, keepdims=True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    res = x_exp / x_sum
    return res
x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

Everybody seems to post their solution so I'll post mine:
def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis=-1))
    return (e_x / e_x.sum(axis=0)).T
I get exactly the same results as the softmax imported from sklearn:
from sklearn.utils.extmath import softmax

import tensorflow as tf
import numpy as np
def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T
logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])
sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

Based on all the responses and CS231n notes, allow me to summarise:
def softmax(x, axis):
    x -= np.max(x, axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)
Usage:
x = np.array([[1, 0, 2,-1],
[2, 4, 6, 8],
[3, 2, 1, 0]])
softmax(x, axis=1).round(2)
Output:
array([[0.24, 0.09, 0.64, 0.03],
[0. , 0.02, 0.12, 0.86],
[0.64, 0.24, 0.09, 0.03]])
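One thing to be aware of: `x -= ...` subtracts in place, so the caller's array is modified by the call. If that is not wanted, a variant along the following lines leaves the input untouched. This is only a sketch. The name `softmax_copy`, the float conversion, and the default `axis=-1` are assumptions added here:

```python
import numpy as np

def softmax_copy(x, axis=-1):
    # Work on a float copy so the caller's array is never modified.
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(shifted)  # exponentiate once and reuse
    return e_x / e_x.sum(axis=axis, keepdims=True)

x = np.array([[1, 0, 2, -1],
              [2, 4, 6, 8],
              [3, 2, 1, 0]])
x_before = x.copy()
out = softmax_copy(x, axis=1)
print(out.round(2))
print(np.array_equal(x, x_before))  # the input is unchanged
```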

The softmax function is an activation function that turns numbers into probabilities which sum to one. It outputs a vector that represents a probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.
The softmax function is used when we have multiple classes.
It is useful for finding the class that has the maximum probability.
The softmax function is ideally used in the output layer, where we are actually trying to obtain the probabilities that define the class of each input.
Each output ranges from 0 to 1.
The softmax function turns logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.
The softmax function is, in fact, a smooth approximation of the arg max function. That means that it does not return the largest value from the input, but puts most of the probability at the position of the largest value.
For example:
Before softmax
X = [13, 31, 5]
After softmax
array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])
Code:
import numpy as np
# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference
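To make the difference between the two variants concrete, here is a small sketch. The function names `softmax_all` and `softmax_axis0` are made up for this comparison. On a 1-D input both give the same result and the largest input keeps the largest probability. On a 2-D input with one sample per column, only the `axis=0` version normalizes each column separately:

```python
import numpy as np

def softmax_all(x):
    # Normalizes over every element (e_x.sum() with no axis).
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmax_axis0(x):
    # Normalizes each column separately (e_x.sum(axis=0)).
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

x1 = np.array([13.0, 31.0, 5.0])
p1 = softmax_axis0(x1)
print(p1, p1.argmax())  # the position of the largest input gets the largest probability

x2 = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])  # assumed layout: one sample per column
print(softmax_all(x2).sum())          # the whole matrix sums to 1
print(softmax_axis0(x2).sum(axis=0))  # each column sums to 1
```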

This also works with np.reshape.
def softmax(scores):
    """
    Compute softmax scores given the raw output from the model
    :param scores: raw scores from the model (N, num_classes)
    :return:
        prob: softmax probabilities (N, num_classes)
    """
    prob = None
    exponential = np.exp(
        scores - np.max(scores, axis=1).reshape(-1, 1)
    )  # subtract the largest number https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/
    prob = exponential / exponential.sum(axis=1).reshape(-1, 1)
    return prob
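For comparison, the `reshape(-1, 1)` calls play the same role as `keepdims=True`: both turn the per-row max and per-row sum into column vectors of shape (N, 1) so that they broadcast against the (N, num_classes) scores. A minimal sketch, with function names made up for illustration:

```python
import numpy as np

def softmax_reshape(scores):
    # Per-row max and sum reshaped to (N, 1) by hand.
    e = np.exp(scores - np.max(scores, axis=1).reshape(-1, 1))
    return e / e.sum(axis=1).reshape(-1, 1)

def softmax_keepdims(scores):
    # Same computation; keepdims=True leaves the reduced axis in place as size 1.
    e = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[3.0, 2.0, 4.0],
                   [4.0, 5.0, 6.0]])
print(softmax_reshape(scores))
print(np.allclose(softmax_reshape(scores), softmax_keepdims(scores)))
```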

I would like to add a little more understanding of the problem. Subtracting the max of the array is correct. But if you run the code from the other post, you will find that it does not give the right answer when the array is 2D or higher-dimensional.
Here are some suggestions:
To get the max, take it along the x-axis; you will get a 1D array.
Reshape your max array to match the original shape.
Apply np.exp to get the exponential values.
Apply np.sum along the axis.
Get the final results.
Following these steps, you will get the correct answer in a vectorized way. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if something is unclear.

The goal was to achieve similar results using NumPy and TensorFlow. The only change from the original answer is the axis parameter of the np.sum API.
Initial approach: axis=0 - This, however, does not provide the intended results when the input has N dimensions.
Modified approach: axis=len(e_x.shape)-1 - Always sum over the last dimension. This provides results similar to TensorFlow's softmax function.
def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts row by row
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)
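When summing over the last dimension of a batch, the shape of the sum matters for the division. The sketch below uses made-up variable names and an arbitrary (2, 3) batch. It shows that a sum without `keepdims` has shape (2,) and cannot be broadcast against the (2, 3) array, while `keepdims=True` gives shape (2, 1) and divides each row by its own sum:

```python
import numpy as np

batch = np.array([[1.0, 2.0, 3.0],
                  [1.0, 2.0, 5.0]])  # shape (2, 3)
e_x = np.exp(batch - np.max(batch))

row_sums_flat = e_x.sum(axis=-1)                 # shape (2,)
row_sums_kept = e_x.sum(axis=-1, keepdims=True)  # shape (2, 1)

try:
    e_x / row_sums_flat  # (2, 3) / (2,) does not broadcast
    flat_ok = True
except ValueError:
    flat_ok = False

probs = e_x / row_sums_kept  # (2, 3) / (2, 1): each row divided by its own sum
print(flat_ok)
print(probs.sum(axis=-1))
```

For a 1-D input both forms work, which is why the issue only shows up with batches.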

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:
Data preparation:
import numpy as np
np.random.seed(2019)
batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)
Output:
logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822 0.3930805 ]
[0.62397 0.6378774 ]
[0.88049906 0.299172 ]]]
Softmax using tensorflow:
import tensorflow as tf
logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)
print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)
with tf.Session() as sess:
    scores_np = sess.run(scores_tf)
print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)
print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))
Output:
logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
[0.4965232 0.5034768 ]
[0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]
Softmax using scipy:
from scipy.special import softmax
scores_np = softmax(logits_np, axis=-1)
print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)
print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))
Output:
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
[0.4965232 0.5034768 ]
[0.6413727 0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]
Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy) :
def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.
    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.
    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """
    # make X at least 2d
    y = np.atleast_2d(X)
    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
    # multiply y against the theta parameter,
    y = y * float(theta)
    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)
    # exponentiate y
    y = np.exp(y)
    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
    # finally: divide elementwise
    p = y / ax_sum
    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()
    return p
scores_np = softmax(logits_np, axis=-1)
print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)
print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))
Output:
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
[0.49652317 0.5034768 ]
[0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]
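The `theta` multiplier is not exercised in the comparison above. As a rough illustration of what it does, here is a compact sketch. The name `softmax_theta` and the sample logits are made up, and `keepdims=True` is used here instead of `np.expand_dims`. A larger `theta` sharpens the distribution toward the largest logit, while a smaller `theta` flattens it toward uniform:

```python
import numpy as np

def softmax_theta(x, theta=1.0, axis=-1):
    # Scale the inputs before exponentiation, then apply the usual max-shifted softmax.
    y = np.asarray(x, dtype=float) * float(theta)
    y = y - np.max(y, axis=axis, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([0.9, 0.4, 0.6])
for theta in (0.1, 1.0, 10.0):
    print(theta, softmax_theta(logits, theta=theta).round(3))
```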

The purpose of the softmax function is to preserve the ratio of the vectors, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or to 0 and 1 (logistic)). This is because it preserves more information about the rate of change at the end-points, and is thus more applicable to neural nets with 1-of-N output encoding: if we squashed the end-points, it would be harder to differentiate the 1-of-N output class, because we could not tell which one is the "biggest" or "smallest" once they got squished. It also makes the total output sum to 1, and the clear winner will be closer to 1, while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.
The purpose of subtracting the max value from the vector is that when you do e^y exponents you may get very high value that clips the float at the max value leading to a tie, which is not the case in this example. This becomes a BIG problem if you subtract the max value to make a negative number, then you have a negative exponent that rapidly shrinks the values altering the ratio, which is what occurred in poster's question and yielded the incorrect answer.
The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is they calculate e^y_j TWICE!!! Here is the correct answer:
def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

This generalizes and assumes you are normalizing the trailing dimension.
def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - np.max(x, axis=-1)[..., None])
    e_y = e_x.sum(axis=-1)[..., None]
    return e_x / e_y

I used these three simple lines:
x_exp=np.exp(x)
x_sum=np.sum(x_exp, axis = 1, keepdims = True)
s=x_exp / x_sum
