How to find minimum of a function with TensorFlow - python

I am trying to implement a visualization of optimization algorithms in TensorFlow.
I started with Beale's function:
f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2
The global minimum is at f(3, 0.5) = 0.
[Figure: plot of Beale's function]
I would like to start at the point (x=3.0, y=4.0).
How do I implement this in TensorFlow with optimization algorithms?
My first try looks like this:
import tensorflow as tf

# Beale's function
x = tf.Variable(3.0, trainable=True)
y = tf.Variable(4.0, trainable=True)
f = tf.add_n([tf.square(tf.add(tf.subtract(1.5, x), tf.multiply(x, y))),
              tf.square(tf.add(tf.subtract(2.25, x), tf.multiply(x, tf.square(y)))),
              tf.square(tf.add(tf.subtract(2.625, x), tf.multiply(x, tf.pow(y, 3))))])
Y = [3, 0.5]  # location of the global minimum (unused below)
loss = f
opt = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())  # variables must be initialized before use
for i in range(100):
    print(sess.run([x, y, loss]))
    sess.run(opt)  # one gradient descent step
Obviously this doesn't work. I guess I have to define a correct loss, but how?
To clarify: my problem is that I don't understand how TensorFlow works, and I don't know much Python (I come from Java, C, C++, Delphi, ...). My question is not about how this works or what the best optimization methods are; it is only about how to implement this correctly.

I already figured it out.
The problem is that I need to clip the gradients to the range [-4.5, 4.5] so that x and y don't explode to infinity.
This solution works:
import tensorflow as tf

# Beale's function
x = tf.Variable(3.0, trainable=True)
y = tf.Variable(4.0, trainable=True)
f = tf.add_n([tf.square(tf.add(tf.subtract(1.5, x), tf.multiply(x, y))),
              tf.square(tf.add(tf.subtract(2.25, x), tf.multiply(x, tf.square(y)))),
              tf.square(tf.add(tf.subtract(2.625, x), tf.multiply(x, tf.pow(y, 3))))])
opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(f, [x, y])
# Clip every gradient to [-4.5, 4.5] before applying it
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train = opt.apply_gradients(clipped_grads_and_vars)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(100):
    sess.run(train)  # one clipped gradient descent step
    print(sess.run([x, y]))
If someone knows whether it's possible to add multiple neurons / layers to this code, please feel free to write an answer.
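The effect of the clipping can also be checked without TensorFlow. The following is only a sketch in plain Python, assuming hand-derived gradients of Beale's function; the helper names (`beale`, `beale_grad`, `descend`) are made up for this illustration and are not part of any library. It mirrors the learning rate (0.01), clip range ([-4.5, 4.5]) and step count (100) used above.

```python
def beale(x, y):
    """Beale's function."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)


def beale_grad(x, y):
    """Analytic partial derivatives (df/dx, df/dy) of Beale's function."""
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dy = 2 * t1 * x + 2 * t2 * (2 * x * y) + 2 * t3 * (3 * x * y ** 2)
    return dx, dy


def clip(value, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))


def descend(x, y, lr=0.01, steps=100, limit=4.5):
    """Gradient descent where each gradient component is clipped to [-limit, limit]."""
    for _ in range(steps):
        gx, gy = beale_grad(x, y)
        x -= lr * clip(gx, -limit, limit)
        y -= lr * clip(gy, -limit, limit)
    return x, y


x_end, y_end = descend(3.0, 4.0)
print("start loss:", beale(3.0, 4.0))
print("end point: ", x_end, y_end, "loss:", beale(x_end, y_end))
```

Because every step is at most lr * limit = 0.045 per coordinate, the iterates cannot blow up, but 100 steps are not necessarily enough to reach the minimum at (3, 0.5).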


What is the equivalent syntax for `Tensor.grad` in TensorFlow 2.0?

In PyTorch, we can access the gradient of a variable x with x.grad.
What is the equivalent syntax in TensorFlow 2? My goal is to clip the gradient. Here is the PyTorch code:
if z.grad > 1000:
    z.grad = 10
Can TensorFlow 2 do the same?
In TF2, assume we define the following variables and optimizer:
import tensorflow as tf
from tensorflow import keras
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
x = tf.Variable([3.0, 4.0])
y = tf.Variable([1.0, 1.0, 1.0, 1.0])
var_list = [x, y]
Then we can get gradients by using tf.GradientTape():
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(x ** 2) + tf.reduce_sum(y)
grads = tape.gradient(loss, var_list)
Finally, we can process the gradients with a custom function:
def clip_grad(grad):
    # Called once per scalar element by tf.map_fn below
    if grad > 1000:
        grad = tf.constant(10.0)  # keep the float dtype of the gradient
    return grad
processed_grads = [tf.map_fn(clip_grad, g) for g in grads]
opt.apply_gradients(zip(processed_grads, var_list))
Note that the Keras optimizers have a get_gradients method, but it does not work with eager execution, which is enabled by default in TF2. If you want to use it, you have to write the code in TF1 (graph) style.
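As an alternative to the element-by-element `tf.map_fn`, the same replacement can probably be written in vectorized form with `tf.where`. This is a sketch, assuming TensorFlow 2.x with eager execution; the helper name `cap_large_gradients` and the factor 1000 in the loss (added only so that some gradients exceed the threshold) are assumptions for illustration.

```python
import tensorflow as tf

x = tf.Variable([3.0, 4.0])
y = tf.Variable([1.0, 1.0, 1.0, 1.0])
var_list = [x, y]

with tf.GradientTape() as tape:
    # The factor 1000 only makes the gradients of x exceed the threshold.
    loss = 1000.0 * tf.reduce_sum(x ** 2) + tf.reduce_sum(y)
grads = tape.gradient(loss, var_list)


def cap_large_gradients(grad, threshold=1000.0, replacement=10.0):
    """Replace every entry above `threshold` with `replacement`, element-wise."""
    return tf.where(grad > threshold, replacement * tf.ones_like(grad), grad)


processed_grads = [cap_large_gradients(g) for g in grads]
print([g.numpy() for g in processed_grads])
```

The processed gradients can then be passed to `apply_gradients` exactly as in the answer above.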

Using tf.py_func as loss function to implement gradient descent

I'm trying to use tf.train.GradientDescentOptimizer().minimize(loss) to find the minimum of a loss function. The loss function is very complicated and I need NumPy to calculate its value, so I use tf.py_func to convert the output back to a tensor and then try to run gradient descent on it. This raises the error: No gradients provided for any variable.
import numpy as np
import tensorflow as tf

def getvalue(w):
    return w

train_X = np.array([1,2,3,4,5])
train_Y = np.array([34,56,78,27,96])
X = tf.placeholder("float")
Y = tf.placeholder("float")
w = tf.Variable([0.0,0,0,0,0], name="weight")
loss = tf.py_func(getvalue, [w], tf.float32)
# The next line raises "No gradients provided for any variable"
train_op = tf.train.GradientDescentOptimizer(0.02).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    temp = 1
    for i in range(100):
        for (x, y) in zip(train_X, train_Y):
            _, w_value = sess.run([train_op, w], feed_dict={X: x, Y: y})
        temp += 1
The loss function here is very simple, but building the graph still raises No gradients provided for any variable.
Is there a way to solve this? Do I need to compute everything with tensor operations in order to use gradient descent?
Thanks in advance.

Difference between GradientDescentOptimizer and AdamOptimizer in tensorflow?

When I use GradientDescentOptimizer instead of AdamOptimizer, the model doesn't seem to converge. AdamOptimizer, on the other hand, seems to work fine. Is there something wrong with TensorFlow's GradientDescentOptimizer?
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

def randomSample(size=100):
    # y = 2 * x - 3, plus uniform integer noise
    x = np.random.randint(500, size=size)
    y = x * 2 - 3 - np.random.randint(-20, 20, size=size)
    return x, y

def plotAll(_x, _y, w, b):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(_x, _y)
    x = np.random.randint(500, size=20)
    y = w * x + b
    ax.plot(x, y, 'r')

def lr(_x, _y):
    w = tf.Variable(2, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.AdamOptimizer(0.0003)  # GradientDescentOptimizer
    train = optimizer.minimize(loss)
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    for i in range(10000):
        sess.run(train, {x: _x, y: _y})
    cw, cb, closs = sess.run([w, b, loss], {x: _x, y: _y})
    return cw, cb

x, y = randomSample()
w, b = lr(x, y)
plotAll(x, y, w, b)
I had a similar problem once and it took me a long time to find out the real problem. With gradient descent my loss function was actually growing instead of getting smaller.
It turned out that my learning rate was too high. If you take too big a step with gradient descent, you can end up jumping over the minimum. And if you are really unlucky, like I was, you end up jumping so far that your error increases.
Lowering the learning rate should make the model converge. But it could take a long time.
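To make this concrete, here is a small plain-Python sketch (not the asker's TensorFlow code) of gradient descent on the same kind of loss, a sum of squared errors over x values in [0, 500). The data are noise-free and the helper names are made up for this illustration. Because the loss sums squares of inputs as large as 500, its curvature is huge, so a learning rate of 0.0003 overshoots badly while a much smaller one reduces the loss.

```python
def sse(w, b, xs, ys):
    """Sum of squared errors of the line w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def sse_grad(w, b, xs, ys):
    """Gradient of the sum of squared errors with respect to w and b."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys))
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys))
    return gw, gb


def gradient_descent(lr, steps, xs, ys, w=2.0, b=3.0):
    """Plain gradient descent starting from w=2, b=3 (the question's start values)."""
    for _ in range(steps):
        gw, gb = sse_grad(w, b, xs, ys)
        w, b = w - lr * gw, b - lr * gb
    return w, b


xs = [float(v) for v in range(0, 500, 5)]  # 100 points in [0, 500)
ys = [2.0 * x - 3.0 for x in xs]           # noise-free line y = 2x - 3

start_loss = sse(2.0, 3.0, xs, ys)
loss_large = sse(*gradient_descent(3e-4, 10, xs, ys), xs, ys)
loss_small = sse(*gradient_descent(1e-7, 10, xs, ys), xs, ys)
print("start:", start_loss)
print("after 10 steps, lr=3e-4:", loss_large)  # grows enormously
print("after 10 steps, lr=1e-7:", loss_small)  # decreases
```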
The Adam optimizer has momentum: it doesn't just follow the instantaneous gradient but also keeps track of the direction it was going before, as a sort of velocity. This way, if the gradient makes you go back and forth, the momentum slows you down in that direction. This helps a lot! Adam has a few more tweaks besides momentum that make it the preferred optimizer for deep learning.
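One of those tweaks is that Adam divides the step by a running estimate of the gradient magnitude. The following is a sketch of a single Adam update for one scalar parameter, following the commonly published update rule (TensorFlow's implementation may differ in details such as how epsilon is applied); the function name `adam_step` is made up for this illustration. It shows that the very first step has roughly the size of the learning rate no matter how large the gradient is, which is why a learning rate of 0.0003 stays stable here.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0003, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


for grad in (297000.0, 0.5):
    new_param, _, _ = adam_step(2.0, grad, m=0.0, v=0.0, t=1)
    print("gradient:", grad, "step taken:", 2.0 - new_param)
```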
If you want to read more about optimizers this blog post is very informative.

TensorFlow Variables and Constants

I am new to TensorFlow and I don't understand the difference between variables and constants. I get the idea that we use variables for equations and constants for direct values, but why does only code #1 work and not code #2 and #3? Also, please explain in which cases we have to run our graph first (a) and then our variable (b), i.e.
(a) session.run(model)
(b) print(session.run(y))
and in which cases I can execute the command in (b) directly.
Code #1:
x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
Code #2:
x = tf.Variable(35, name='x')
y = tf.Variable(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
Code #3:
x = tf.constant(35, name='x')
y = tf.constant(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
In TensorFlow, the difference between constants and variables is that the value of a constant cannot be changed after it is declared (and it must be initialized with a value, not with an operation).
A Variable, in contrast, can be changed later with tf.assign() (and it can be initialized with either a value or an operation).
The function tf.global_variables_initializer() initializes all variables in your code with the values passed as parameters, but it runs all initializers together without a guaranteed order, so it doesn't work properly when there are dependencies between variables.
Your first code (#1) works properly because the variable initialization has no dependencies on other variables and the constant is constructed with a value.
The second code (#2) doesn't work because of this unordered behavior of tf.global_variables_initializer(). You can fix it using tf.variables_initializer() as follows:
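x = tf.Variable(35, name='x')
model_x = tf.variables_initializer([x])
y = tf.Variable(x + 5, name='y')
model_y = tf.variables_initializer([y])
with tf.Session() as session:
    session.run(model_x)  # initialize x first
    session.run(model_y)  # then y, which depends on x
    print(session.run(y))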
The third code (#3) doesn't work because you are trying to initialize a constant with an operation, which isn't possible. To solve it, use the strategy from (#1).
Regarding your last question: you need to run (a) session.run(model) when there are variables in your computation graph, before running (b) print(session.run(y)).
I will point out the difference when using eager execution.
As of Tensorflow 2.0.b1, Variables and Constants trigger different behaviours when used with tf.GradientTape. Strangely, the official documentation doesn't say much about it.
Let's look at the example code from the documentation:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y = x * x
    z = y * y
dz_dx = g.gradient(z, x)  # 108.0 (4*x^3 at x = 3)
dy_dx = g.gradient(y, x)  # 6.0
del g  # Drop the reference to the tape
You have to watch x explicitly because it is a constant: GradientTape does NOT automatically watch constants in its context. The example below uses one nested GradientTape per constant, although a single tape can also watch several tensors (call watch once per tensor). For example,
x = tf.constant(3.0)
x2 = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    with tf.GradientTape(persistent=True) as g2:
        g2.watch(x2)
        y = x * x
        y2 = y * x2
dy_dx = g.gradient(y, x)  # 6
dy2_dx2 = g2.gradient(y2, x2)  # 9
del g, g2  # Drop the reference to the tape
On the other hand, Variables are automatically watched by GradientTape. The documentation says:
By default GradientTape will automatically watch any trainable variables that are accessed inside the context.
So the above becomes:
x = tf.Variable(3.0)
x2 = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as g:
    y = x * x
    y2 = y * x2
dy_dx = g.gradient(y, x)  # 6
dy2_dx2 = g.gradient(y2, x2)  # 9
del g  # Drop the reference to the tape
Of course, you can turn off the automatic watching by passing watch_accessed_variables=False. The examples may not be very practical, but I hope this clears up someone's confusion.
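As a small sketch (assuming TensorFlow 2.x with eager execution), a single tape can also watch both constants at once by passing them to `watch`, which avoids the nesting. Without a `watch` call, the gradient with respect to a constant comes back as `None`.

```python
import tensorflow as tf

x = tf.constant(3.0)
x2 = tf.constant(3.0)

# One persistent tape that explicitly watches both constants.
with tf.GradientTape(persistent=True) as g:
    g.watch([x, x2])
    y = x * x
    y2 = y * x2

dy_dx = g.gradient(y, x)      # d(x^2)/dx = 2x
dy2_dx2 = g.gradient(y2, x2)  # d(y * x2)/dx2 = y
del g  # Drop the reference to the tape

# A tape that never watches the constant records nothing for it.
with tf.GradientTape() as unwatched:
    y3 = x * x
missing = unwatched.gradient(y3, x)

print(float(dy_dx), float(dy2_dx2), missing)
```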
Another way to look at the differences:
tf.constant: fixed values, and hence not trainable.
tf.Variable: tensors (arrays) that are initialized in a session and are trainable (by trainable I mean that they can be optimized and can change over time).

Tensorflow: Linear regression with non-negative constraints

I am trying to implement a linear regression model in Tensorflow, with additional constraints (coming from the domain) that the W and b terms must be non-negative.
I believe there are a couple of ways to do this.
1. We can modify the cost function to penalize negative weights [Lagrangian approach] [See: TensorFlow - best way to implement weight constraints]
2. We can compute the gradients ourselves and project them on [0, infinity] [Projected gradient approach]
Approach 1: Lagrangian
When I tried the first approach, I would often end up with negative b.
I modified the cost function from:
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
to:
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 100.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
Making the coefficients of nn_b and nn_w very high leads to instability and a very high cost.
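To see where the instability can come from, here is a plain-Python sketch of the penalty term for a single scalar. The term abs(v) - v is 0 for non-negative v and 2*|v| for negative v, so its slope jumps from 0 to -2 at zero. The learning rate of 0.01 below is a hypothetical value chosen for illustration (the post does not give one), and the helper names are made up.

```python
def non_negativity_penalty(value):
    """abs(v) - v: zero for v >= 0, 2*|v| for v < 0."""
    return abs(value) - value


def penalty_gradient(value):
    """Slope of the penalty: -2 for negative values, 0 otherwise."""
    return -2.0 if value < 0 else 0.0


coefficient = 100.0    # penalty weight, as in the cost above
learning_rate = 0.01   # hypothetical value, not taken from the post

b = -0.01  # only slightly negative
b_after = b - learning_rate * coefficient * penalty_gradient(b)
print(non_negativity_penalty(3.0), non_negativity_penalty(-2.0))
print("b before:", b, "b after one step on the penalty alone:", b_after)
```

A barely negative b is thrown far into the positive range by a single step, so with a large coefficient the penalty gradient dominates the data term whenever a parameter is negative.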
Here is the complete code.
import numpy as np
import tensorflow as tf

n_samples = 50
train_X = np.linspace(1, 50, n_samples)
train_Y = 10*train_X + 6 + 40*np.random.randn(50)
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Initializing the variables
init = tf.global_variables_initializer()
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Penalty terms: zero for non-negative values, 2*|value| for negative ones
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 1.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
# Gradient descent (learning_rate and training_epochs are not defined in the post)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_with_constraint)
with tf.Session() as sess:
    sess.run(init)
    # Fit all training data
    cost_array = np.zeros(training_epochs)
    W_array = np.zeros(training_epochs)
    b_array = np.zeros(training_epochs)
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        W_array[epoch] = sess.run(W)
        b_array[epoch] = sess.run(b)
        cost_array[epoch] = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
The following is the mean of b across 10 different runs.
0 -1.101268
1 0.169225
2 0.158363
3 0.706270
4 -0.371205
5 0.244424
6 1.312516
7 -0.069609
8 -1.032187
9 -1.711668
Clearly, the first approach is not optimal. Further, there is a lot of art involved in choosing the coefficient of penalty terms.
Approach 2: Projected gradient
I then thought of using the second approach, which seems more certain to work.
gr = tf.gradients(cost, [W, b])
We manually compute the gradients and update the W and b.
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            W_del, b_del = sess.run(gr, feed_dict={X: x, Y: y})
            W = max(0, (W - W_del)*learning_rate)  # Project the gradient on [0, infinity]
            b = max(0, (b - b_del)*learning_rate)  # Project the gradient on [0, infinity]
This approach seems to be very slow.
I am wondering if there is a better way to run the second approach, or guarantee the results with the first approach. Can we somehow allow the optimizer to ensure that the learnt weights are non-negative?
Edit: How to do this in Autograd
If you modify your linear model to:
pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))
it will have the same effect as using only positive W and b values.
The reason your second approach is slow is that you clip the W and b values outside of the TensorFlow graph. (Also, it will not converge because (W - W_del)*learning_rate must instead be W - W_del*learning_rate.)
You can implement the clipping inside the TensorFlow graph like this:
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Run the clipping only after the gradient step has been applied
with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)
In this case W and b values will be clipped to 0 and not to small positive numbers.
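The step-then-clip idea can be sketched in plain Python as follows. This is an illustration on toy data, not the TensorFlow graph above: take an ordinary gradient step on the cost sum((pred - y)^2) / (2n), then project each parameter back onto [0, infinity) with `max`. The function name and the data are made up; the data have a negative slope on purpose so that the constraint becomes active.

```python
def projected_gradient_descent(xs, ys, w=1.0, b=1.0, lr=0.05, steps=50):
    """Gradient descent on sum((w*x + b - y)^2) / (2n) with w, b kept >= 0."""
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        # Ordinary gradient step, then projection onto the non-negative range.
        w = max(0.0, w - lr * grad_w)
        b = max(0.0, b - lr * grad_b)
    return w, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [-1.0, -2.0, -3.0, -4.0]  # unconstrained fit would need w = -1

w, b = projected_gradient_descent(xs, ys)
print(w, b)
```

Note that the learning rate multiplies only the gradient (w - lr * grad_w), which is the correction mentioned above.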
Here is a small mnist example with clipping:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x = tf.placeholder(tf.uint8, [None, 28, 28])
x_vec = tf.cast(tf.reshape(x, [-1, 784]), tf.float32) / 255.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x_vec, W) + b
y_target = tf.placeholder(tf.uint8, [None])
y_target_one_hot = tf.one_hot(y_target, 10)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_target_one_hot, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
# Clip W and b to non-negative values after every gradient step
with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        # Wrap around the training set; slicing start:start+100 avoids an
        # empty batch when the end index would wrap to 0
        start = (i*100) % len(x_train)
        sess.run(train_step_with_clip, feed_dict={
            x: x_train[start:start+100],
            y_target: y_train[start:start+100]})
        if not i%100:
            print("Accuracy:", sess.run(accuracy, feed_dict={
                x: x_test,
                y_target: y_test}))
I was actually not able to reproduce your problem of getting negative values of b with your first approach.
But I do agree that it is not optimal for your use case and can result in negative values.
You should be able to constrain your parameters to non-negative values like so:
W *= tf.cast(W > 0., tf.float32)
b *= tf.cast(b > 0., tf.float32)
(Replace > with >= if necessary; the cast is needed because the comparison operators produce boolean values.)
You would then optimize the "standard cost" without the additional constraints.
However, this does not work in every case. For example, you should avoid initializing W or b with negative values.
Your second (and probably better) approach can be accelerated by defining the update logic in the general computational graph, i.e. after the definition of cost:
params = [W, b]
grads = tf.gradients(cost, params)
optimizer = [tf.assign(param, tf.maximum(0., param - grad*learning_rate))
             for param, grad in zip(params, grads)]
I think your solution is slow because it creates new computation nodes every time, which is probably very costly when repeated many times inside the loops.
Update: using a TensorFlow optimizer
In my solution above, it is not the gradients that are clipped but the resulting updated values.
Along the lines of this answer, you could clip the gradients to be at most the value of the parameter being updated, like so:
params = [W, b]
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(cost, params)
clipped_grads_vars = [(tf.clip_by_value(grad, -np.inf, var), var) for grad, var in grads_and_vars]
optimizer = opt.apply_gradients(clipped_grads_vars)
This way, an update will never decrease a parameter to a value below 0.
However, I think this will not work if the variable being updated is already negative.
It will also fail if the optimization algorithm somehow multiplies the clipped gradient by a value greater than 1.
The latter might never actually happen, but I'm not 100% sure.
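These three cases can be checked with a scalar sketch in plain Python. Here `min(grad, var)` plays the role of `tf.clip_by_value(grad, -np.inf, var)`, and `lr` stands for whatever factor the optimizer multiplies the clipped gradient by; the function name is made up for this illustration.

```python
def clipped_update(var, grad, lr):
    """One descent step with the gradient clipped to be at most `var`."""
    clipped = min(grad, var)  # same effect as clipping grad to (-inf, var]
    return var - lr * clipped


# Factor <= 1 and a non-negative variable: the result stays non-negative.
print(clipped_update(0.5, 100.0, 0.1))   # 0.5 - 0.1 * 0.5
# Factor > 1: the update can push the variable below zero.
print(clipped_update(0.5, 100.0, 2.0))   # 0.5 - 2.0 * 0.5
# Variable already negative: it stays negative.
print(clipped_update(-0.5, 100.0, 0.1))  # -0.5 - 0.1 * (-0.5)
```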

