Tensorflow Gradient Tape returning None - python

I'm using TensorFlow 1.14.0 with Python (3.6.8). I'm trying to use tensorflow_probability's lbfgs optimizer implementation (documentation/example).
If I run the example code provided in the documentation, it works fine. I tried to follow the same procedure for my own code, which uses the tf.GradientTape() approach for computing the objective function. When doing it that way, the gradients come back as None.
I don't see why one works but the other doesn't.
Edit: I realized that the example as given wouldn't run with eager execution when computing the gradients this way, so I adjusted it so it can be run with eager execution.
Non-working example (using GradientTape) with eager execution
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tf.enable_eager_execution()
# A high-dimensional quadratic bowl.
ndims = 3
minimum = np.ones([ndims], dtype='float64')
scales = np.arange(ndims, dtype='float64') + 1.0
# The objective function and the gradient.
def quadratic(x):
    with tf.GradientTape() as g:
        value = tf.reduce_sum(scales * (x - minimum) ** 2)
    grads = g.gradient(value, x)
    print('Gradients: ')
    print(grads)
    return value, grads
start = np.arange(ndims, 0, -1, dtype='float64')
optim_results = tfp.optimizer.lbfgs_minimize(quadratic, initial_position=start, num_correction_pairs=10, tolerance=1e-8)
print('results')
print(optim_results)
# Check that the search converged
assert(optim_results.converged)
# Check that the argmin is close to the actual value.
np.testing.assert_allclose(optim_results.position, minimum)

You need to watch x. For the operations inside this context manager to be recorded, at least one of their inputs must be watched; here x is a plain tensor, so it is not watched automatically.
with tf.GradientTape() as g:
    g.watch(x)
    value = tf.reduce_sum(scales * (x - minimum) ** 2)
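
For completeness, here is a minimal sketch of the full objective function with the watch call added (assuming the same TF 1.14 / TFP setup as in the question):

def quadratic(x):
    with tf.GradientTape() as g:
        g.watch(x)  # x is a plain tensor passed in by the optimizer, so it must be watched explicitly
        value = tf.reduce_sum(scales * (x - minimum) ** 2)
    grads = g.gradient(value, x)
    return value, grads

With this change, g.gradient(value, x) returns an actual gradient tensor instead of None.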

Related

Gradient descent using TensorFlow is much slower than a basic Python implementation, why?

I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a and b such that Y = a*X + b approximates an (x, y) point cloud (which I generated myself for the sake of simplicity).
I am solving this LR problem using a 'fixed step size gradient descent (FSSGD)'. I implemented it using TensorFlow and it works, but I noticed that it is really slow on both GPU and CPU. Because I was curious, I implemented the FSSGD myself in Python/NumPy, and as expected this runs much faster, about:
10x faster than TF@CPU
20x faster than TF@GPU
If TensorFlow is this slow, I cannot imagine that so many people are using this framework. So I must be doing something wrong. Can anyone help me speed up my TensorFlow implementation?
I'm NOT interested in the difference between the CPU and GPU performance. Both performance indicators are merely provided for completeness and illustration. I'm interested in why my TensorFlow implementation is so much slower than a raw Python/NumPy implementation.
As reference, I add my code below.
Stripped to a minimal (but fully working) example.
Using Python v3.7.9 x64.
Used tensorflow-gpu==1.15 for now (because the course uses TensorFlow v1)
Tested to run in both Spyder and PyCharm.
My FSSGD implementation using TensorFlow (execution time about 40 sec @CPU to 80 sec @GPU):
#%% General imports
import numpy as np
import timeit
import tensorflow.compat.v1 as tf
#%% Get input data
# Generate simulated input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define tensorflow model
# Define data size
n_samples = x_data_input.shape[0]
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
# Define variables to be learned
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE):  # reuse = True | False | tf.AUTO_REUSE
    W = tf.get_variable("weights", (1, 1), initializer=tf.constant_initializer(0.0))
    b = tf.get_variable("bias", (1,), initializer=tf.constant_initializer(0.0))
# Define loss function
Y_pred = tf.matmul(X, W) + b
loss = tf.reduce_sum((Y - Y_pred) ** 2 / n_samples) # Quadratic loss function
# %% Solve tensorflow model
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
#Construct TensorFlow optimizer
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE):  # reuse = True | False | tf.AUTO_REUSE
    opt = tf.train.GradientDescentOptimizer(learning_rate=1e-4)
    opt_operation = opt.minimize(loss, name="GDO")
#To measure execution time
time_start = timeit.default_timer()
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    # Train variables
    for index in range(int(total_iterations)):
        _, loss_val_tmp = sess.run([opt_operation, loss], feed_dict={X: x_data, Y: y_data})
    # Get final values of variables
    W_val, b_val, loss_val = sess.run([W, b, loss], feed_dict={X: x_data, Y: y_data})
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W_val[0,0]))
print('b_val = {0:0.3f}'.format(b_val[0]))
print('')
My own Python FSSGD implementation (execution time about 4 sec):
#%% General imports
import numpy as np
import timeit
#%% Get input data
# Define input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define Gradient Descent (GD) model
# Define data size
n_samples = x_data_input.shape[0]
#Initialize data
W = 0.0 # Initial condition
b = 0.0 # Initial condition
# Compute initial loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/n_samples # Quadratic loss function
#%% Execute Gradient Descent algorithm
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
GD_stepsize = 1e-4 # Gradient Descent fixed step size
#To measure execution time
time_start = timeit.default_timer()
for index in range(int(total_iterations)):
    # Compute gradient (derived manually for the quadratic cost function)
    loss_gradient_W = 2.0/n_samples*np.sum(-x_data_input*(y_data_input - y_gd_approx))
    loss_gradient_b = 2.0/n_samples*np.sum(-1*(y_data_input - y_gd_approx))
    # Update trainable variables using fixed step size gradient descent
    W = W - GD_stepsize * loss_gradient_W
    b = b - GD_stepsize * loss_gradient_b
    # Compute loss
    y_gd_approx = W*x_data_input+b
    loss = np.sum((y_data_input - y_gd_approx)**2)/x_data_input.shape[0]
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W))
print('b_val = {0:0.3f}'.format(b))
print('')
I think it's the result of the large iteration count. I changed the number of iterations from 1e5 to 1e3 and also changed x from x_data_input = np.arange(100, step=0.1) to x_data_input = np.arange(100, step=0.0001). This way I reduced the iteration count but increased the total computation by 10x. With NumPy it's done in 22 sec and with TensorFlow it's done in 25 sec.
My guess: TensorFlow has a lot of overhead in each iteration (to give us a framework that can do a lot), but the forward-pass and backward-pass speeds are OK.
The actual answer to my question is hidden in the various comments. For future readers, I will summarize these findings in this answer.
About the speed difference between TensorFlow and a raw Python/NumPy implementation
This part of the answer is actually quite logical.
In each iteration (= each call of Session.run()), TensorFlow performs the computations. TensorFlow has a large overhead for starting each computation; on the GPU, this overhead is even worse than on the CPU. However, TensorFlow executes the actual computations very efficiently, more efficiently than the raw Python/NumPy implementation above does.
So, when the number of data points is increased, and therefore the amount of computation per iteration, you will see that the relative performance between TensorFlow and Python/NumPy shifts in TensorFlow's favor. The opposite is also true.
The problem described in the question is very small, meaning that the amount of computation per iteration is very low while the number of iterations is very large. That is why TensorFlow performs so badly here. This type of small problem is not the typical use case for which TensorFlow was designed.
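As a rough illustration of that per-call overhead (a sketch, not part of the original answer; it assumes the same tensorflow-gpu==1.15 compat.v1 setup as the question), timing Session.run on a graph that does essentially nothing already shows a fixed cost per call:

import timeit
import tensorflow.compat.v1 as tf

c = tf.constant(1.0)  # trivial graph: no real computation
with tf.Session() as sess:
    time_start = timeit.default_timer()
    for _ in range(1000):
        sess.run(c)  # each call pays the fixed Session.run overhead
    print("overhead per call: {0:0.6f} sec".format((timeit.default_timer() - time_start) / 1000))

Multiplied by 1e5 iterations, this fixed per-call cost alone can account for a large share of the measured runtime.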
To reduce the execution time
Still, the execution time of the TensorFlow script can be reduced a lot! To reduce the execution time, the number of iterations must be reduced (no matter the size of the problem, this is a good aim anyway).
As @amin pointed out, this is achieved by scaling the input data. A very brief explanation of why this works: with scaled data, the sizes of the gradient and of the variable updates are better balanced relative to the absolute values that have to be found, so fewer steps (= iterations) are required.
Following @amin's advice, I finally ended up scaling my x-data as follows (some code is repeated to make the position of the new code clear):
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
### START NEW CODE ###
# Scale x_data
x_mean = np.mean(x_data)
x_std = np.std(x_data)
x_data = (x_data - x_mean) / x_std
### END NEW CODE ###
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
Scaling speeds up the convergence by a factor of 1000. Instead of 1e5 iterations, only 1e2 iterations are needed. This is partly because a maximum step size of 1e-1 can be used instead of a step size of 1e-4.
Please note that the found weight and bias are different and that you must feed scaled data from now on.
Optionally, you can choose to unscale the found weight and bias so you can feed unscaled data. Unscaling is done using this code (put somewhere at the end of the code):
#%% Unscaling
W_val_unscaled = W_val[0,0]/x_std
b_val_unscaled = b_val[0]-x_mean*W_val[0,0]/x_std
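
As a quick sanity check (a sketch, not part of the original answer), the unscaled parameters should give the same predictions on the raw input as the fitted parameters give on the scaled input:

# x_data is the scaled input used for training, x_data_input the raw 1D input
y_pred_scaled = W_val[0, 0] * x_data + b_val[0]
y_pred_unscaled = W_val_unscaled * x_data_input.reshape(-1, 1) + b_val_unscaled
print(np.allclose(y_pred_scaled, y_pred_unscaled))  # expected: True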

Trying to understand PyTorch SmoothL1Loss Implementation

I have been trying to go through all of the loss functions in PyTorch and build them from scratch to gain a better understanding of them and I’ve run into what is either an issue with my recreation, or an issue with PyTorch’s implementation.
According to Pytorch’s documentation for SmoothL1Loss it simply states that if the absolute value of the prediction minus the ground truth is less than beta, we use the top equation. Otherwise, we use the bottom one. Please see documentation for the equations.
Below is my implementation of this in the form of a minimum test:
import torch
import torch.nn as nn
import numpy as np
predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
def l1_loss_smooth(predictions, targets, beta = 1.0):
    loss = 0
    for x, y in zip(predictions, targets):
        if abs(x-y).mean() < beta:
            loss += (0.5*(x-y)**2 / beta).mean()
        else:
            loss += (abs(x-y) - 0.5 * beta).mean()
    loss = loss/predictions.shape[0]
    return loss
output = l1_loss_smooth(predictions, target)
print(output)
Gives an output of:
tensor(0.7475, grad_fn=<DivBackward0>)
Now the Pytorch implementation:
loss = nn.SmoothL1Loss(beta=1.0)
output = loss(predictions, target)
Gives an output of:
tensor(0.7603, grad_fn=<SmoothL1LossBackward>)
I can’t figure out where the error in implementation lies.
Upon looking a little deeper into the smooth_l1_loss function in the _C module (file: smooth_c_loss_op.cc) I noticed that the doc string mentions that it’s a variation on Huber Loss but the documentation for SmoothL1Loss says it is Huber Loss.
So overall, I'm just confused about how it's implemented and whether it's a combination of SmoothL1Loss and Huber loss, just Huber loss, or something else.
The description in the documentation is correct. Your implementation wrongly applies the case selection on the mean of the data. It should be an element-wise selection instead (if you think about the implementation of the vanilla L1 loss, and the motivation for smooth L1 loss).
The following code gives a consistent result:
import torch
import torch.nn as nn
import numpy as np
predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
def l1_loss_smooth(predictions, targets, beta = 1.0):
    loss = 0
    diff = predictions-targets
    mask = (diff.abs() < beta)
    loss += mask * (0.5*diff**2 / beta)
    loss += (~mask) * (diff.abs() - 0.5*beta)
    return loss.mean()
output = l1_loss_smooth(predictions, target)
print(output)
loss = nn.SmoothL1Loss(beta=1.0)
output = loss(predictions, target)
print(output)
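
For what it's worth, the same element-wise selection can also be written with torch.where instead of boolean masks (a sketch equivalent to the function above):

def l1_loss_smooth_where(predictions, targets, beta=1.0):
    diff = predictions - targets
    # Choose the quadratic branch where |diff| < beta, the linear branch elsewhere
    return torch.where(diff.abs() < beta,
                       0.5 * diff ** 2 / beta,
                       diff.abs() - 0.5 * beta).mean()

print(l1_loss_smooth_where(predictions, target))  # matches nn.SmoothL1Loss(beta=1.0)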

Optimizing a vector using eager execution

I would like to use TensorFlow's eager execution functionality to optimize the components of a vector. In all the documented examples, each trainable variable is just a scalar, with collections represented by lists of these. However, the loss function I have in mind involves performing vector manipulations on those components, and so this is inconvenient.
For example, let us use the Adam optimizer to normalize a 3-component vector:
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
tf.enable_eager_execution()
def normalize(din=[2.0,1.0,0.0], lr=0.001,
              nsteps=100):
    d = tfe.Variable(din)

    def loss(dvec):
        return tf.sqrt((1.0 - tf.tensordot(dvec, dvec, 1))**2)

    def grad(dvec):
        with tf.GradientTape() as tape:
            loss_val = loss(dvec)
        return tape.gradient(loss_val, dvec)

    optimizer = tf.train.AdamOptimizer(learning_rate=lr)
    for i in range(nsteps):
        grads = grad(d)
        optimizer.apply_gradients(zip(grads, d))  # Throws error
    return d
This code correctly computes the required gradients. However, the "optimizer.apply_gradients" line throws some kind of error, seemingly no matter what I do, essentially because tfe.Variable is not an iterable.
In this specific example the error is "AttributeError: Tensor.name is meaningless when eager execution is enabled". We could also try, for example,
zip(grads, [d[i] for i in range(3)])
instead of d, but then the interpreter complains that d is not iterable.
What is the correct way to pair grads with d?
Optimizer.apply_gradients requires its first argument to be a list of (gradient, variable) pairs.
In the code above, neither grads nor d is a list (try print(type(grads)) for example), so the error is from the call to zip. I think what you want instead is:
optimizer.apply_gradients(zip([grads], [d]))
Or, more simply:
optimizer.apply_gradients([(grads, d)])
Also, FYI, as eager execution is stabilizing, more things are moving out of the experimental "contrib" namespace, so you don't need the tfe module for your example (tf.Variable will work just fine) if you're using a recent version of TensorFlow (1.11, 1.12, etc.), making your whole program look like:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def normalize(din=[2.0,1.0,0.0], lr=0.001,
              nsteps=100):
    d = tf.Variable(din)

    def loss(dvec):
        return tf.sqrt((1.0 - tf.tensordot(dvec, dvec, 1))**2)

    def grad(dvec):
        with tf.GradientTape() as tape:
            loss_val = loss(dvec)
        return tape.gradient(loss_val, dvec)

    optimizer = tf.train.AdamOptimizer(learning_rate=lr)
    for i in range(nsteps):
        dd = grad(d)
        optimizer.apply_gradients([(dd, d)])
    return d
Hope that helps!

Why is my implementations of the log-loss (or cross-entropy) not producing the same results?

I was reading up on log-loss and cross-entropy, and it seems like there are 2 approaches for calculating it, based on the following equations.
The first one is the following.
import numpy as np
from sklearn.metrics import log_loss
def cross_entropy(predictions, targets):
    N = predictions.shape[0]
    ce = -np.sum(targets * np.log(predictions)) / N
    return ce

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.97]])
targets = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1]])
x = cross_entropy(predictions, targets)
print(log_loss(targets, predictions), 'our_answer:', ans)
The output of the previous program is 0.7083767843022996 our_answer: 0.71355817782, which is almost the same. So that's not the issue.
The above implementation is the middle part of the equation above.
The second approach is based on the RHS part of the equation above.
res = 0
for act_row, pred_row in zip(targets, np.array(predictions)):
    for class_act, class_pred in zip(act_row, pred_row):
        res += - class_act * np.log(class_pred) - (1-class_act) * np.log(1-class_pred)

print(res/len(targets))
And the output is 1.1549753967602232, which is not quite the same.
I have tried the same implementation with NumPy, but it also didn't work. What am I doing wrong?
PS: I am also curious: -y log(y_hat) seems to me to be the same as -sum_i(p_i * log(q_i)), so how come there is a -(1-y) log(1-y_hat) part? Clearly I am misunderstanding how -y log(y_hat) is to be calculated.
I cannot reproduce the difference in the results you report in the first part (you also refer to an ans variable, which you do not seem to define, I guess it is x):
import numpy as np
from sklearn.metrics import log_loss
def cross_entropy(predictions, targets):
    N = predictions.shape[0]
    ce = -np.sum(targets * np.log(predictions)) / N
    return ce

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.97]])
targets = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1]])
The results:
cross_entropy(predictions, targets)
# 0.7083767843022996
log_loss(targets, predictions)
# 0.7083767843022996
log_loss(targets, predictions) == cross_entropy(predictions, targets)
# True
Your cross_entropy function seems to work fine.
Regarding the second part:
Clearly I am misunderstanding how -y log (y_hat) is to be calculated.
Indeed, reading the fast.ai wiki you linked to more carefully, you'll see that the RHS of the equation holds only for binary classification (where one of y and 1-y will always be zero), which is not the case here - you have a 4-class multinomial classification. So, the correct formulation is
res = 0
for act_row, pred_row in zip(targets, np.array(predictions)):
    for class_act, class_pred in zip(act_row, pred_row):
        res += - class_act * np.log(class_pred)
i.e. discarding the subtraction of (1-class_act) * np.log(1-class_pred).
Result:
res/len(targets)
# 0.7083767843022996
res/len(targets) == log_loss(targets, predictions)
# True
On a more general level (the mechanics of log loss & accuracy for binary classification), you may find this answer useful.
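To make the binary case concrete (a sketch with made-up labels and probabilities, not taken from the question), the RHS formula with the -(1-y) log(1-y_hat) term agrees with sklearn's log_loss when the labels are binary:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probability of the positive class

# Binary cross-entropy per sample: -y*log(p) - (1-y)*log(1-p)
manual = np.mean(-y_true * np.log(y_prob) - (1 - y_true) * np.log(1 - y_prob))
print(manual, log_loss(y_true, y_prob))  # the two values agree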

Why do I get different weights when using TensorFlow for multiple linear regression?

I have two implementations of multiple linear regression, one using tensorflow and one using only numpy. I generate a dummy set of data and try to recover the weights I used, but although the numpy one returns the initial weights, the tensorflow one always returns different weights (which also sort of work).
The numpy implementation is here, and here's the TF implementation:
import numpy as np
import tensorflow as tf
x = np.array([[i, i + 10] for i in range(100)]).astype(np.float32)
y = np.array([i * 0.4 + j * 0.9 + 1 for i, j in x]).astype(np.float32)
# Add bias
x = np.hstack((x, np.ones((x.shape[0], 1)))).astype(np.float32)
# Create variable for weights
n_features = x.shape[1]
np.random.rand(n_features)
w = tf.Variable(tf.random_normal([n_features, 1]))
w = tf.Print(w, [w])
# Loss function
y_hat = tf.matmul(x, w)
loss = tf.reduce_mean(tf.square(tf.sub(y, y_hat)))
operation = tf.train.GradientDescentOptimizer(learning_rate=0.000001).minimize(loss)
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    for iteration in range(5000):
        session.run(operation)
    weights = w.eval()
print(weights)
Running the script gets me weights around [-0.481, 1.403, 0.701], while running the numpy version gets me weights around [0.392, 0.907, 0.9288] which are much closer to the weights I used to generate the data: [0.4, 0.9, 1]
Both learning rate/epoch parameters are the same, and both initialise the weights randomly. I don't normalize the data for either implementation, and I've run them multiple times.
Why are the results different? I also tried to initialise weights in the TF version using w = tf.Variable(np.random.rand(n_features).reshape(n_features,1).astype(np.float32)) but that didn't fix it either. Is there something wrong with the TF implementation?
The problem appears to be with broadcasting. The shape of y_hat in the above is (100,1), while y is (100,). So, when you do tf.sub(y, y_hat) you end up with a matrix of (100,100) which are all the possible combinations of subtractions between the two vectors. I don't know, but I guess that you managed to avoid this in the numpy code.
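A minimal NumPy sketch of the broadcasting effect described above (the TF subtraction follows the same rules):

import numpy as np

a = np.zeros((100, 1))  # shape of y_hat
b = np.zeros((100,))    # shape of y
print((b - a).shape)    # (100, 100): every pairwise difference, not an element-wise one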
Two ways to fix your code:
y = np.array([[i * 0.4 + j * 0.9 + 1 for i, j in x]]).astype(np.float32).T
or
y_hat = tf.squeeze(tf.matmul(x, w))
That said, when I run this it still doesn't actually converge to the answer you want, but at least it's able to minimize the loss function.
