I wonder how to calculate higher-order gradients through tf.py_function in TF 2.0. The following example (slightly modified from the TensorFlow docs) produces the correct dy_dx, but aa_x is None. Thank you.
import tensorflow as tf
import os

def huber(x, delta):
    if tf.abs(x) <= delta:
        return x*x / (2*delta)
    else:
        return tf.abs(x) - delta/2.0

x = tf.constant([2.0])
z = tf.constant([1.0])
with tf.GradientTape(persistent=True) as g0:
    g0.watch(x)
    with tf.GradientTape(persistent=True) as g:
        g.watch(x)
        y = tf.py_function(func=huber, inp=[x, 3.], Tout=tf.float32)
    dy_dx = g.gradient(y, x)
    aa = tf.reduce_sum(dy_dx * z)
aa_x = g0.gradient(aa, x)
print(dy_dx)
print(aa_x)
According to the documentation of tf.py_function, you cannot compute derivatives of order higher than one through it: "This function allows expressing computations in a TensorFlow graph as Python functions. In particular, it wraps a Python function func in a once-differentiable TensorFlow operation that executes it with eager execution enabled." In other words, you can only differentiate it once.
If you want higher-order derivatives, call the function directly instead of wrapping it in tf.py_function, and use nested gradient tapes as usual in TensorFlow 2.1.0.
Modified Code:
import tensorflow as tf  # TensorFlow 2.1.0
import os

def huber(x, delta):
    if tf.abs(x) <= delta:
        # f(x)   = x^2 / (2*delta)
        # f'(x)  = x / delta
        # f''(x) = 1 / delta
        return x*x / (2*delta)
    else:
        return tf.abs(x) - delta/2.0

x = tf.constant([2.0])
z = tf.constant([1.0])
with tf.GradientTape(persistent=True) as g0:
    g0.watch(x)
    with tf.GradientTape(persistent=True) as g:
        g.watch(x)
        # y = tf.py_function(func=huber, inp=[x, 3.0], Tout=tf.float32)  # once-differentiable
        y = huber(x, 3.0)
    dy_dx = g.gradient(y, x)
    aa = tf.reduce_sum(dy_dx * z)
aa_x = g0.gradient(aa, x)
print(dy_dx)  # tf.Tensor([0.6666667], shape=(1,), dtype=float32)
print(aa_x)   # tf.Tensor([0.33333334], shape=(1,), dtype=float32)
You can read more about tf.py_function in the TensorFlow documentation.
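As a sanity check on the values printed above, here is a minimal plain-Python sketch that estimates the first and second derivatives of the same Huber function with central differences. The helper names central_first and central_second and the step size h are my own choices for illustration and are not part of TensorFlow.

```python
def huber(x, delta):
    # Same piecewise definition as in the question, on plain floats.
    if abs(x) <= delta:
        return x * x / (2 * delta)
    return abs(x) - delta / 2.0

def central_first(f, x, h=1e-4):
    # Central-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

def central_second(f, x, h=1e-4):
    # Central-difference estimate of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

x, delta = 2.0, 3.0
f = lambda t: huber(t, delta)

first = central_first(f, x)    # should be close to x / delta
second = central_second(f, x)  # should be close to 1 / delta
print(first, second)
```

At x = 2 and delta = 3 the quadratic branch is active, so the estimates should land near 2/3 and 1/3, matching dy_dx and aa_x.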
In PyTorch, there are two ways of calculating second-order gradients. The first is to use the torch.autograd.grad function, and the other is to use the backward function. I use the following examples to illustrate:
Method 1:
import torch
x = torch.tensor([3.0], requires_grad=True)
y = torch.pow(x, 2)
grad_1 = torch.autograd.grad(y, x, create_graph=True)
print(grad_1[0].item())
grad_2 = torch.autograd.grad(grad_1[0], x)
print(grad_2)
The result makes sense to me: the second derivative of the function is 2.
Method 2:
x=torch.tensor([3.0], requires_grad=True)
y = torch.pow(x, 2) # y=x**2
y.backward(retain_graph=True)
print(x.grad)
y.backward()
print(x.grad)
When calculating the first gradient, I use retain_graph=True to make sure that we can use backpropagation again to calculate the second gradient. However, the result is 12, which is wrong. I was wondering what's wrong with the second method?
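Calling backward() a second time does not differentiate the gradient. It runs the same first-order backward pass again and accumulates the result into x.grad, which is why you get 6 + 6 = 12. Use the grad function from torch.autograd to differentiate your function instead. The steps would be: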
>>> import torch
>>> from torch.autograd import grad
>>> x = torch.tensor([3.0], requires_grad=True)
>>> y = torch.pow(x,2)
>>> z = grad(y, x, create_graph=True)
>>> print(grad(z, x, create_graph=True))
(tensor([2.], grad_fn=<MulBackward0>),)
Similarly, you can loop through to make the nth derivative.
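A minimal sketch of such a loop, assuming a scalar-valued function of a one-element tensor. The helper name nth_derivative and the test function t ** 3 are hypothetical choices for illustration, not part of PyTorch.

```python
import torch
from torch.autograd import grad

def nth_derivative(f, x, n):
    """Differentiate f at x n times by calling autograd.grad repeatedly."""
    out = f(x)
    for _ in range(n):
        # create_graph=True records the gradient computation itself,
        # so the result can be differentiated again on the next pass.
        (out,) = grad(out.sum(), x, create_graph=True)
    return out

x = torch.tensor([3.0], requires_grad=True)
cube = lambda t: t ** 3

print(nth_derivative(cube, x, 1))  # d/dx   x^3 = 3 x^2
print(nth_derivative(cube, x, 2))  # d2/dx2 x^3 = 6 x
print(nth_derivative(cube, x, 3))  # d3/dx3 x^3 = 6
```

Note that once a derivative becomes a constant that no longer depends on x, a further call to grad will raise an error, so the loop should stop at the highest non-trivial order.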
I implemented the softmax() function, softmax_crossentropy() and the derivative of softmax cross entropy: grad_softmax_crossentropy(). Now I wanted to compute the derivative of the softmax cross entropy function numerically. I tried to do this by using the finite difference method but the function returns only zeros. Here is my code with some random data:
import numpy as np

batch_size = 3
classes = 10
# random preactivations
a = np.random.randint(1,100,(batch_size,classes))
# random labels
y = np.random.randint(0,np.size(a,axis=1),(batch_size,1))

def softmax(a):
    epowa = np.exp(a-np.max(a,axis=1,keepdims=True))
    return epowa/np.sum(epowa,axis=1,keepdims=True)

print(softmax(a))

def softmax_crossentropy(a, y):
    y_one_hot = np.eye(classes)[y[:,0]]
    return -np.sum(y_one_hot*np.log(softmax(a)),axis=1)

print(softmax_crossentropy(a, y))

def grad_softmax_crossentropy(a, y):
    y_one_hot = np.eye(classes)[y[:,0]]
    return softmax(a) - y_one_hot

print(grad_softmax_crossentropy(a, y))

# Finite difference approach to compute grad_softmax_crossentropy()
eps = 1e-5
print((softmax_crossentropy(a+eps,y)-softmax_crossentropy(a,y))/eps)
What did I do wrong?
Here's how you could do it. I think you're referring to the gradient wrt the activations indicated by y's indicator matrix.
First, I instantiate a as float to change individual items.
a = np.random.randint(1,100,(batch_size,classes)).astype("float")
Then,
np.diag(grad_softmax_crossentropy(a, y)[:, y.flatten()])
array([ -1.00000000e+00, -1.00000000e+00, -4.28339542e-04])
But also
b = a.copy()
for i, o in zip(y.max(axis=1), range(y.shape[0])):
    b[o, i] += eps
(softmax_crossentropy(b,y)-softmax_crossentropy(a,y))/eps
[ -1.00000000e+00 -1.00000000e+00 -4.28125536e-04]
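So basically you have to perturb one a_i at a time in softmax, not the entirety of a. Adding eps to every entry shifts all logits by the same amount, which softmax is invariant to, so the loss does not change and the difference quotient comes out as zero.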
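To check the whole gradient matrix rather than only the entries selected by y, each logit can be perturbed in turn. The following is a sketch under my own assumptions: the helper numerical_grad is hypothetical, it uses central differences instead of the forward difference from the question, and the logits are small random floats rather than integers in [1, 100) so that np.log never sees an underflowed zero.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_crossentropy(a, y):
    one_hot = np.eye(a.shape[1])[y[:, 0]]
    return -np.sum(one_hot * np.log(softmax(a)), axis=1)

def grad_softmax_crossentropy(a, y):
    return softmax(a) - np.eye(a.shape[1])[y[:, 0]]

def numerical_grad(a, y, eps=1e-5):
    # Perturb one logit a[i, j] at a time and difference the loss of sample i.
    g = np.zeros_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            plus, minus = a.copy(), a.copy()
            plus[i, j] += eps
            minus[i, j] -= eps
            g[i, j] = (softmax_crossentropy(plus, y)[i]
                       - softmax_crossentropy(minus, y)[i]) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 10))          # float logits (assumed scale)
y = rng.integers(0, 10, size=(3, 1))  # random labels

matches = np.allclose(numerical_grad(a, y), grad_softmax_crossentropy(a, y), atol=1e-6)
# Shifting every logit by the same amount leaves the loss unchanged.
shift_invariant = np.allclose(softmax_crossentropy(a + 1e-5, y), softmax_crossentropy(a, y))
print(matches, shift_invariant)
```

The second check reproduces the symptom from the question: perturbing all of a at once cannot reveal the gradient.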
I want to create a Heaviside step function in TensorFlow. Since the Heaviside function is not differentiable, I also need to choose a derivative approximation and define a custom gradient, so the full implementation looks like this:
import tensorflow as tf

@tf.RegisterGradient("HeavisideGrad")
def _heaviside_grad(unused_op: tf.Operation, grad: tf.Tensor):
    x = unused_op.inputs[0]
    # During backpropagation heaviside behaves like sigmoid
    return tf.sigmoid(x) * (1 - tf.sigmoid(x)) * grad

def heaviside(x: tf.Tensor, g: tf.Graph = tf.get_default_graph()):
    custom_grads = {
        "Sign": "HeavisideGrad"
    }
    with g.gradient_override_map(custom_grads):
        # TODO: heaviside(0) currently returns 0. We need heaviside(0) = 1
        sign = tf.sign(x)
        # tf.stop_gradient is needed to exclude tf.maximum from derivative
        step_func = sign + tf.stop_gradient(tf.maximum(0.0, sign) - sign)
    return step_func
There is one caveat in my implementation: tf.sign(0) returns zero value so heaviside(0) also returns zero and I want heaviside(0) to return 1. How can I achieve such behavior?
A very hacky way would be to use
1 - max(0.0, sign(-x))
as your step function instead of
max(0.0, sign(x))
Another option would be to use greater_equal and cast the result to your desired type, and override its gradient with the sigmoid override you already have.
Ok, I think I figured it out. Many thanks to etarion who pointed out the correct approach to solve my issue.
So the basic idea is to use tf.greater_equal instead of the combination of tf.sign and tf.maximum. The custom gradient is applied to the tf.identity operation.
Here is the updated implementation of the heaviside function:
import uuid
import tensorflow as tf

@tf.RegisterGradient("HeavisideGrad")
def _heaviside_grad(unused_op: tf.Operation, grad: tf.Tensor):
    return tf.maximum(0.0, 1.0 - tf.abs(unused_op.inputs[0])) * grad

def heaviside(x: tf.Tensor, g: tf.Graph = tf.get_default_graph()):
    custom_grads = {
        "Identity": "HeavisideGrad"
    }
    with g.gradient_override_map(custom_grads):
        i = tf.identity(x, name="identity_" + str(uuid.uuid1()))
        ge = tf.greater_equal(x, 0, name="ge_" + str(uuid.uuid1()))
        # tf.stop_gradient is needed to exclude tf.to_float from derivative
        step_func = i + tf.stop_gradient(tf.to_float(ge) - i)
    return step_func
This would make the unit step function, using only TensorFlow APIs so the result is still a tensor:
# in eager mode
def heaviside(v):
    return 1 - tf.reduce_max(tf.constant([0, -tf.sign(v).numpy()], tf.float32))
In TensorFlow 2, it is better to use the @tf.custom_gradient decorator:
@tf.custom_gradient
def heaviside(X):
    # This custom op is converted to a graph, so no 'if'/'else' is allowed;
    # use 'tf.cond' instead.
    List = []
    for I in range(BSIZE):  # BSIZE is the batch size, defined elsewhere
        Item = tf.cond(X[I] < 0, lambda: tf.constant([0], tf.float32),
                       lambda: tf.constant([1], tf.float32))
        List.append(Item)
    U = tf.stack(List)

    # Heaviside half-maximum formula
    # U = (tf.sign(X)+1)/2

    # Div is the intermediate gradient value passed in during differentiation
    def grad(Div):
        return Div * 1  # Heaviside has no gradient, use 1.
    return U, grad
The easiest fix for your code is to add a small number to the result of tf.sign() and take the sign again. This yields 1 for an input of 0:
sign = tf.sign(tf.sign(x) + 0.1)
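The forward values of the variants suggested in this thread can be compared side by side. This is only a sketch: it uses NumPy's np.sign and np.maximum as stand-ins for tf.sign and tf.maximum, looks at the forward pass only (no custom gradient), and the sample inputs are made up.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])

# Original formulation: max(0, sign(x)) maps an input of 0 to 0.
original = np.maximum(0.0, np.sign(x))

# Variant 1: 1 - max(0, sign(-x))
v1 = 1.0 - np.maximum(0.0, np.sign(-x))

# Variant 2: comparison cast to float (the greater_equal approach)
v2 = (x >= 0).astype(np.float64)

# Variant 3: nudge the sign by 0.1 and take the sign again before the max
v3 = np.maximum(0.0, np.sign(np.sign(x) + 0.1))

print(original)
print(v1)
print(v2)
print(v3)
```

All three variants agree with the original everywhere except at x = 0, where they return 1 instead of 0.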
How do I implement this metric in Keras? My code below gives the wrong result!
Note that I'm undoing a previous log(x + 1) transformation via exp(x) - 1, also negative predictions are clipped to 0:
def rmsle_cust(y_true, y_pred):
    first_log = K.clip(K.exp(y_pred) - 1.0, 0, None)
    second_log = K.clip(K.exp(y_true) - 1.0, 0, None)
    return K.sqrt(K.mean(K.square(K.log(first_log + 1.) - K.log(second_log + 1.)), axis=-1))
For comparison, here's the standard numpy implementation:
def rmsle_cust_py(y, y_pred, **kwargs):
    # undo the log(1 + x) transform
    y = np.exp(y) - 1
    y_pred = np.exp(y_pred) - 1
    y_pred[y_pred < 0] = 0.0
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
What am I doing wrong? Thanks!
EDIT: Setting axis=0 seems to give a value very close to the correct one, but I'm not sure, since all the code I've seen uses axis=-1.
I ran into the same problem and searched for it. Here is what I found:
https://www.kaggle.com/jpopham91/rmlse-vectorized
After modifying it a bit, this seems to work for me. rmsle_K is the version implemented with Keras and TensorFlow.
import numpy as np
import math
from keras import backend as K
import tensorflow as tf

def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))

def rmsle_loop(y, y0):
    assert len(y) == len(y0)
    terms_to_sum = [(math.log(y0[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i, pred in enumerate(y0)]
    return (sum(terms_to_sum) * (1.0/len(y))) ** 0.5

def rmsle_K(y, y0):
    return K.sqrt(K.mean(K.square(tf.log1p(y) - tf.log1p(y0))))

r = rmsle(y=[5, 20, 12], y0=[8, 16, 12])
r1 = rmsle_loop(y=[5, 20, 12], y0=[8, 16, 12])
r2 = rmsle_K(y=[5., 20., 12.], y0=[8., 16., 12.])

print(r)
print(r1)

sess = tf.Session()
print(sess.run(r2))
Result:
Using TensorFlow backend
0.263978210565
0.263978210565
0.263978
By the use of a list (to_sum) in the numpy implementation, I suspect your numpy array has shape (length,).
In Keras, since you got different results with axis=0 and axis=-1, your tensors probably have a shape like (length,1).
Also, when creating the to_sum list, you're using y[i] and y_pred[i], which means you're taking elements from the axis=0 in numpy implementation.
The numpy implementation also sums everything for calculating the mean in sum(to_sum). So, you really don't need to use any axis in the K.mean.
If you make sure your model's output shape is either (length,) or (length,1), you can use just K.mean(value) without passing the axis parameter.
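The effect of the axis argument on a (length,1) array can be seen with plain NumPy, used here as a stand-in for the Keras backend. The squared-error values are made up for illustration.

```python
import numpy as np

# Made-up squared log errors with shape (length, 1) = (3, 1).
sq_err = np.array([[0.04], [0.01], [0.09]])

per_sample = np.sqrt(np.mean(sq_err, axis=-1))  # shape (3,): sqrt of each row on its own
over_batch = np.sqrt(np.mean(sq_err, axis=0))   # shape (1,): mean taken over the samples
no_axis = np.sqrt(np.mean(sq_err))              # scalar: mean taken over everything

print(per_sample.shape, over_batch.shape, no_axis.shape)
print(per_sample.mean(), over_batch[0], no_axis)
```

With axis=-1 the square root is applied to each sample separately, so averaging those values afterwards gives the mean of per-sample absolute errors rather than the root of the mean squared error. With axis=0 or no axis, the mean is taken over the samples first, which matches the NumPy reference implementation.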
I'm starting to play around with Theano, so I tried computing a simple function and testing the output. However, when I compare the Theano-compiled version against a non-Theano version, the outputs are a bit different.
The code:
import numpy as np
import theano.tensor as T
from theano import function

np.random.seed(1)
S = np.random.rand(4,3)
Q = np.random.rand(4,3)

def MSE(a, b):
    n = min(a.shape[0], b.shape[0])
    fhat = T.dvector('fhat')
    y = T.dvector('y')
    mse = ((y - fhat)**2).sum() / n
    mse_f = function([y, fhat], mse)
    return mse_f(a,b)

for row in range(S.shape[0]):
    print(MSE(S[row], Q[row]))

for i in range(S.shape[0]):
    print(((S[i] - Q[i])**2).sum() / S.shape[0])
The outputs:
# from MSE function
0.0623486922837
0.0652202301174
0.151698460419
0.187325204482
# non theano output
0.0467615192128
0.0489151725881
0.113773845314
0.140493903362
What am I overlooking here?
In the expression in this statement
print(((S[i] - Q[i])**2).sum() / S.shape[0])
you should divide by S.shape[1], not S.shape[0].
You created S using S = np.random.rand(4,3), which means S has shape (4, 3). That is, S.shape is (4, 3). The length of each row in S is S.shape[1].
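The difference between the two divisors can be checked without Theano. This is a small NumPy-only sketch that reuses the arrays from the question; the variable names by_rows and by_cols are mine.

```python
import numpy as np

np.random.seed(1)
S = np.random.rand(4, 3)
Q = np.random.rand(4, 3)

# Sum of squared differences for each row, shape (4,).
sq = ((S - Q) ** 2).sum(axis=1)

by_rows = sq / S.shape[0]  # divides by 4, the number of rows (the mistake)
by_cols = sq / S.shape[1]  # divides by 3, the length of each row (the MSE)

print(by_cols)
print(by_rows)
print(by_cols / by_rows)
```

Every ratio is 4/3, which is the same factor that separates the two columns of output in the question.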