Following Tensorflow LSTM Regularization I am trying to add regularization term to the cost function when training parameters of LSTM cells.
Putting aside some constants I have:
def RegularizationCost(trainable_variables):
cost = 0
for v in trainable_variables:
cost += r(tf.reduce_sum(tf.pow(r(v.name),2)))
return cost
...
regularization_cost = tf.placeholder(tf.float32, shape = ())
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
...
tv = tf.trainable_variables()
s = tf.Session()
r = s.run
...
while (...):
...
reg_cost = RegularizationCost(tv)
r(optimizer, feed_dict = {x: x_b, y: y_b, regularization_cost: reg_cost})
The problem I have is that adding the regularization term hugely slows the learning process and actually the regularization term reg_cost is increasing with each iteration visibly when the term associated with pred - y pretty much stagnated i.e. the reg_cost seems not to be taken into account.
As I suspect I am adding this term in completely wrong way. I did not know how to add this term in the cost function itself so I used a workaround with scalar tf.placeholder and "manually" calculated the regularization cost. How to do it properly?
compute the L2 loss only once:
tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv ])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
you might want to remove the variables that are bias as those should not be regularized.
It slows down because your code creates new nodes in every iteration. This is not how you code with TF. First, you create your whole graph, including regularization terms, then, in the while loop you only execute them, each "tf.XXX" operation creates new nodes.
Related
I have made a small script in Python to solve various Gym environments with policy gradients.
import gym, os
import numpy as np
#create environment
env = gym.make('Cartpole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2
#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size,a_size],learning_rate=0.000001,['softmax']) #a 3layer network might be ([s_size, 5, a_size],learning_rate=1,['tanh','softmax'])
#it supports the sigmoid activation function also
print(policy.weights)
DISCOUNT = 0.95 #parameter for discounting future rewards
#first step
action = policy.feedforward(env.reset)
state,reward,done,info = env.step(action)
for t in range(3000):
done = False
states = [] #lists for recording episode
probs2 = []
rewards = []
while not done:
#env.render() #to visualize learning
probs = policy.feedforward(state)[-1] #calculate probabilities of actions
action = np.random.choice(a_size,p=probs) #choose action from probs
#record and update state
probs2.append(probs)
states.append(state)
state,reward,done,info = env.step(action)
rewards.append(reward) #should reward be before updating state?
#calculate gradients
gradients_w = []
gradients_b = []
for i in range(len((rewards))):
totalReward = sum([rewards[t]*DISCOUNT**t for t in range(len(rewards[i:]))]) #discounted reward
## !! this is the line that I need help with
gradient = policy.backpropagation(states[i],totalReward*(probs2[i])) #what should be backpropagated through the network
## !!
##record gradients
gradients_w.append(gradient[0])
gradients_b.append(gradient[1])
#combine gradients and update the weights and biases
gradients_w = np.array(gradients_w,object)
gradients_b = np.array(gradients_b,object)
policy.weights += policy.learning_rate * np.flip(np.sum(gradients_w,0),0) #np.flip because the gradients are calculated backwards
policy.biases += policy.learning_rate * np.flip(np.sum(gradients_b,0),0)
#reset and record
env.reset()
if t%100==0:
print('t'+str(t),'r',sum(rewards))
What should be passed backwards to calculate the gradients? I am using gradient ascent but I could switch it to descent. Some people have defined the reward function as totalReward*log(probabilities). Would that make the score derivative totalReward*(1/probs) or log(probs) or something else? Do you use a cost function like cross entropy?
I have tried
totalReward*np.log(probs)
totalReward*(1/probs)
totalReward*(probs**2)
totalReward*probs
probs = np.zeros(a_size)
probs[action] = 1
totalRewards*probs
and a couple others.
The last one is the only one that was able to solve any of them and it only worked on Cartpole. I have tested the various loss or score functions for thousands of episodes with gradient ascent and descent on Cartpole, Pendulum, and MountainCar. Sometimes it will improve a small amount but it will never solve it. What am I doing wrong?
And here is the RLPolicy code. It is not well written or pseudo coded but I don't think it is the problem because I checked it with gradient checking several times. But it would be helpful even if I could narrow it down to a problem with the neural network or somewhere else in my code.
#Neural Network
import numpy as np
import random, math, time, os
from matplotlib import pyplot as plt
def activation(x,function):
if function=='sigmoid':
return(1/(1+math.e**(-x))) #Sigmoid
if function=='relu':
x[x<0]=0
return(x)
if function=='tanh':
return(np.tanh(x.astype(float))) #tanh
if function=='softmax':
z = np.exp(np.array((x-max(x)),float))
y = np.sum(z)
return(z/y)
def activationDerivative(x,function):
if function=='sigmoid':
return(x*(1-x))
if function=='relu':
x[x<0]==0
x[x>0]==1
return(x)
if function=='tanh':
return(1-x**2)
if function=='softmax':
s = x.reshape(-1,1)
return(np.diagflat(s) - np.dot(s, s.T))
class NeuralNetwork():
def __init__ (self,layers,learning_rate,momentum,regularization,activations):
self.learning_rate = learning_rate
if (isinstance(layers[1],list)):
h = layers[1][:]
del layers[1]
for i in h:
layers.insert(-1,i)
self.layers = layers
self.weights = [2*np.random.rand(self.layers[i]*self.layers[i+1])-1 for i in range(len(self.layers)-1)]
self.biases = [2*np.random.rand(self.layers[i+1])-1 for i in range(len(self.layers)-1)]
self.weights = np.array(self.weights,object)
self.biases = np.array(self.biases,object)
self.activations = activations
def feedforward(self, input_array):
layer = input_array
neuron_outputs = [layer]
for i in range(len(self.layers)-1):
layer = np.tile(layer,self.layers[i+1])
layer = np.reshape(layer,[self.layers[i+1],self.layers[i]])
weights = np.reshape(self.weights[i],[self.layers[i+1],self.layers[i]])
layer = weights*layer
layer = np.sum(layer,1)#,self.layers[i+1]-1)
layer = layer+self.biases[i]
layer = activation(layer,self.activations[i])
neuron_outputs.append(np.array(layer,float))
return(neuron_outputs)
def neuronErrors(self,l,neurons,layerError,n_os):
if (l==len(self.layers)-2):
return(layerError)
totalErr = [] #total error
for e in range(len(layerError)): #-layers
e = e*self.layers[l+2]
a_ws = self.weights[l+1][e:e+self.layers[l+1]]
e = int(e/self.layers[l+2])
err = layerError[e]*a_ws #error
totalErr.append(err)
return(sum(totalErr))
def backpropagation(self,state,loss):
weights_gradient = [np.zeros(self.layers[i]*self.layers[i+1]) for i in range(len(self.layers)-1)]
biases_gradient = [np.zeros(self.layers[i+1]) for i in range(len(self.layers)-1)]
neuron_outputs = self.feedforward(state)
grad = self.individualBackpropagation(loss, neuron_outputs)
return(grad)
def individualBackpropagation(self, difference, neuron_outputs): #number of output
lr = self.learning_rate
n_os = neuron_outputs[:]
w_o = self.weights[:]
b_o = self.biases[:]
w_n = self.weights[:]
b_n = self.biases[:]
gradient_w = []
gradient_b = []
error = difference[:] #error for neurons
for l in range(len(self.layers)-2,-1,-1):
p_n = np.tile(n_os[l],self.layers[l+1]) #previous neuron
neurons = np.arange(self.layers[l+1])
error = (self.neuronErrors(l,neurons,error,n_os))
if not self.activations[l]=='softmax':
error = error*activationDerivative(neuron_outputs[l+1],self.activations[l])
else:
error = error # activationDerivative(neuron_outputs[l+1],self.activations[l]) #because softmax derivative returns different dimensions
w_grad = np.repeat(error,self.layers[l]) #weights gradient
b_grad = np.ravel(error) #biases gradient
w_grad = w_grad*p_n
b_grad = b_grad
gradient_w.append(w_grad)
gradient_b.append(b_grad)
return(gradient_w,gradient_b)
Thanks for any answers, this is my first question here.
Using as reference this post for the computation of the gradient ( https://medium.com/#jonathan_hui/rl-policy-gradients-explained-9b13b688b146) :
It seems to me that totalRewardOfEpisode*np.log(probability of sampled action) is the right computation. However in order to have a good estimate of the gradient I'd suggest using many episodes to compute it. (30 for example, you'd just need to average your end gradient by dividing by 30)
The main difference with your test with totalReward*np.log(probs) is that for each step I think you should only backpropagate on the probability of the action you sampled, not the whole output. Initialy in the cited article they use the total reward but then they suggest in the end using the discounted reward of the present and future rewards as you do, so that part doesn't seem theoretically problematic.
OLD answer :
To my knowledge deepRL methods usely use some estimate of the value of the state in the game or the value of each action. From what I see in your code you have a neural network that only outputs probabilities for each action.
Although what you want is definitely to maximize the total reward, you can't compute a gradient on the end reward because of the environment. I'd suggest you'd look into methods such as deepQLearning or Actor/Critic based methods such as PPO.
Given the method you chose you'll get different answers on how to compute your gradient.
mprouveur's answer was half correct but I felt that I needed to explain the right thing to backpropagate. The answer to my question on ai.stackexchange.com was how I came to understand this. The correct error to backpropagate is the log probability of taking the action multiplied by the goal reward. This can also be calculated as the cross entropy loss between the outputted probabilities and an array of zeros with the action that was taken being one 1. Because of the derivative of cross entropy loss, this will have the effect of pushing only the probability of the action that was taken closer to one. Then, the multiplication of the total reward makes better actions get pushed more to a higher probability. So, with the label being a one-hot encoded vector, the correct equation is label/probs * totalReward because that is the derivative of cross entropy loss and the derivative of the log of probs. I got this working in other code, but even with this equation I think something else in my code is wrong. It probably has something to do with how I made the softmax derivative too complicated instead of calculating the usual way, by combing the cross entropy derivative and softmax derivative. I will update this answer soon with correct code and more information.
The loss here depends on what output on each problem. Generaly, loss for backpropagate should be a number that represents for everything you have processed. For policy gradient, it will be the reward that it think it will get compare with the original reward, the log is just a way to bring it back to a probabily random variable. Single dimension. If you want to inspect the behavior behind codes, you should always check the shape/dimension between each process to fully understand
I'm trying to program linear regression without much external help and I've done it successfully to an extent since my MSE usually returns a small number and the outputted line of best fit looks about right. I just have a question about the last line of code below. Does the optimizer also change the bias, and if so, is it by the learning rate?
#tf graph input, the 9 training values
X = tf.placeholder("float")
Y = tf.placeholder("float")
random = random.uniform(0,20)
#weights and biases
W = tf.Variable((random), name = "Weight")
b = tf.Variable((random), name = "Bias")
#linear model multiply x by weights and biases to get a y
pred = tf.add(tf.multiply(X, W), b)
#cost function to reduce the error. MSE
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
#minimize cost taking steps of 0.01 down the parabola
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
Yes, the optimizer changes the bias and the learning is done with respect to learning rate. Optimizers update all the trainable variables in the graph unless the var_list option is set (in which case they update the variables in that list).
I am using tensorflow to optimize a simple least squares objective function like the following:
Here, Y is the target vector ,X is the input matrix and vector w represents the weights to be learned.
Example Scenario:
, ,
If I wanted to augment the initial objective function to impose an additional constraint on w1 (the first scalar value in the tensorflow Variable w and X1 represents the first column of the feature matrix X), how would I achieve this in tensorflow?
One solution I can think of is to use tf.slice to index the first value of $w$ and add this in addition to the original cost term but I am not convinced that it will have the desired effect on the weights.
I would appreciate inputs on whether something like this is possible in tensorflow and if so, what the best ways to implement this might be?
An alternate option would be to add weight constraints, and do it using an augmented Lagrangian objective but I would first like to explore the regularization option before going the Lagrangian route.
The current code I have for the initial objective function without additional regularization is the following:
train_x ,train_y are the training data, training targets respectively.
test_x , test_y are the testing data, testing targets respectively.
#Sum of Squared Errs. Cost.
def costfunc(predicted,actual):
return tf.reduce_sum(tf.square(predicted - actual))
#Mean Squared Error Calc.
def prediction(sess,X,y_,test_x,test_y):
pred_y = sess.run(y_,feed_dict={X:test_x})
mymse = tf.reduce_mean(tf.square(pred_y - test_y))
mseval=sess.run(mymse)
return mseval,pred_y
with tf.Session() as sess:
X = tf.placeholder(tf.float32,[None,num_feat]) #Training Data
Y = tf.placeholder(tf.float32,[None,1]) # Target Values
W = tf.Variable(tf.ones([num_feat,1]),name="weights")
init = tf.global_variables_initializer()
sess.run(init)
#Tensorflow ops and cost function definitions.
y_ = tf.matmul(X,W)
cost_history = np.empty(shape=[1],dtype=float)
out_of_sample_cost_history = np.empty(shape=[1],dtype=float)
cost=costfunc(y_,Y)
learning_rate = 0.000001
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
for epoch in range(training_epochs):
sess.run(training_step,feed_dict={X:train_x,Y:train_y})
cost_history = np.append(cost_history,sess.run(cost,feed_dict={X: train_x,Y: train_y}))
out_of_sample_cost_history = np.append(out_of_sample_cost_history,sess.run(cost,feed_dict={X:test_x,Y:test_y}))
MSETest,pred_test = prediction(sess,X,y_,test_x,test_y) #Predict on full testing set.
tf.slice will do. And during optimization, the gradients to w1 will be added (because gradients add up at forks). Also, please check the graph on Tensorboard (the link on how to use it).
I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0
In Tensorflow, after I obtain my loss term, I give it to an optimizer and it adds the necessary differentiation and update terms to the computation graph:
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Decay rate.
staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
I want a small modification to this process. I want to scale the learning_rate differently for my every distinct parameter in the network. For example, let A and B two different trainable parameters in the network and let dL/dA and dL/dB the partial derivatives of the parameters with respect to the loss. The momentum optimizer updates the variables as:
Ma <- 0.9*Ma + learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + learning_rate*dL/dB
B <- B - Mb
I want to modify this as:
Ma <- 0.9*Ma + ca*learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + cb*learning_rate*dL/dB
B <- B - Mb
Where ca and cb are special learning rate scales for different parameters. As far as I understand, Tensorflow has compute_gradients and apply_gradients methods we can call for such cases, but the documentation is not very clear about how to use them. Any help would be much appreciated.
TO calculate gradient:
self.gradients = tf.gradients(self.loss, tf.trainable_variables())
Now, you access the gradients using sess.run([model.gradients], feed_dict)
Assuming, you have declared the learning_rate as a tf.Variable(), you can assign the learning rate using the following code:
sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** epoch)))
The above code is just an example. You can modify it to be used for your purpose.
Custom learning rate, in tensorflow
are very easy to handle.
learning_rate = tf.Variable(INITIAL_LR,trainable=False,name="lr")
and say l1 and l2 are two different learning rates :
l1 = ca * learning_rate
l2 = cb * learning_rate
you can do any type of mathematical manipulation with respect to learning rate, and apply it in this manner :
optimizer=tf.train.MomentumOptimizer(l1,0.9).minimize(network.finalLoss, global_step=global_counter)
Regarding your problem: what you want is actually different gradient for different layers, say L1 layer (trainable variables containing Ma) and L2
(trainable variables containing Mb)
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Dec
staircase=True)
optimizer1 = tf.train.MomentumOptimizer(ca * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L1)
optimizer2 = tf.train.MomentumOptimizer(cb * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L2)
optimizer = tf.group(optimizer1 , optimizer2)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
You can find the optimized version of the above code here
Please note if you can designate learning rate via tf.assign it returns the reference to the learning rate whereas the optimizer expects a float learning value type which probably will/should throw an error