How to accumulate my loss over mini batches then calculate my gradient

How to accumulate my loss over mini batches then calculate my gradient - python

My main question is; is averaging the loss the same thing as averaging the gradient and how do i accumulate my loss over mini batches then calculate my gradient?
I have been trying to implement policy gradient in Tensorflow and run into the issue where i can not feed all my game states into my network at once and then update. The problem is if i lower my network size then train on all frames at once and take the mean of the loss then it begins to converge nicely. But if I accumulate the gradients over mini batches then average them, my gradients explode and i overflow my weights.
Any help or insight will be very appreciated.
Keep in mind also, this is my first time asking a question here.

What you can do is to accumulate gradients after each mini-batch and then update the weights based on gradient averages. Consider following simple case for fitting 50 Gaussian blobs with a single-layered perceptron:
from sklearn.datasets import make_blobs
import tensorflow as tf
import numpy as np
x_train, y_train = make_blobs(n_samples=50,
n_features=2,
centers=[[1, 1], [-1, -1]],
cluster_std=0.5)
with tf.name_scope('x'):
x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.int32, [None])
with tf.name_scope('layer'):
logits = tf.layers.dense(x,
units=2,
kernel_initializer=tf.contrib.layers.xavier_initializer())
with tf.name_scope('loss'):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss_op = tf.reduce_mean(xentropy)
The minimize() method of the tensorflow optimizers calls compute_gradients() and then apply_gradients(). Instead of calling the minimize(), I'm going to call both methods directly. First, to get the gradients we call compute_gradients() (which returns a list of tuples grads_and_vars) and for apply_gradients() instead of gradients I'm going to feed placeholders for future gradient's averages:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss_op)
grads = [g for g, v in grads_and_vars]
# placeholders for gradients averages
placeholder_grads = [tf.placeholder(tf.float32, [None] + g.get_shape().as_list())
for g in grads]
new_grads_and_vars = [(tf.reduce_mean(p, axis=0), gv[1])
for p, gv in zip(placeholder_grads, grads_and_vars)]
apply_grads_op = optimizer.apply_gradients(new_grads_and_vars)
During mini-batches we only compute losses (you can accumulate losses as well - append to some list and then compute average) and gradients, without applying gradients to weights. At the end of each epoch we execute apply_grads_op operation while feeding accumulated gradients to its placeholders:
data = tf.data.Dataset.from_tensor_slices({'x':x_train, 'y':y_train}).batch(10)
iterator = data.make_initializable_iterator()
n_epochs = 2
with tf.Session() as sess:
_ = sess.run([tf.global_variables_initializer(), iterator.initializer])
next_batch = iterator.get_next()
for epoch in range(n_epochs):
epoch_grads = []
while True:
try:
batch = sess.run(next_batch)
evaled = sess.run([loss_op] + grads,
feed_dict={x:batch['x'], y:batch['y']})
epoch_grads.append(evaled[1:])
print('batch loss:', evaled[0])
except tf.errors.OutOfRangeError:
_ = sess.run(iterator.initializer)
feed_dict = {p:[g[i] for g in epoch_grads]
for i, p in enumerate(placeholder_grads)}
_ = sess.run(apply_grads_op, feed_dict=feed_dict)
break

Related

tf.keras GradientTape: get gradient with respect to input

Tensorflow version: Tensorflow 2.1
I want to get the gradients with respect to the input instead of the gradient with respect to the trainable weights. I adjust the example from https://www.tensorflow.org/guide/keras/train_and_evaluate to
import tensorflow as tf
import numpy as np
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, 'Not enough GPU hardware devices available'
tf.config.experimental.set_memory_growth(physical_devices[0], True)
def loss_fun(y_true, y_pred):
loss = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
return loss
# Create a dataset
x = np.random.rand(10, 180, 320, 3).astype(np.float32)
y = np.random.rand(10, 1).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(1)
# Create a model
base_model = tf.keras.applications.MobileNet(input_shape=(180, 320, 3), weights=None, include_top=False)
x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.models.Model(inputs=base_model.input, outputs=output)
for input, target in dataset:
for iteration in range(400):
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
prediction = model(input, training=False) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fun(target, prediction)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.inputs)
print(grads) # output: [None]
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
optimizer.apply_gradients(zip(grads, model.inputs))
print('Iteration {}'.format(iteration))
However, this doesnot work, because grads = tape.gradient(loss_value, model.inputs) returns [None]. Is this intended behaviour or not? If yes, what is the recommended way to get the gradients with respect to the input?

To get it working two things needs to be added:
Converting image to a tf.Variable
Using tape.watch to watch the gradient with respect to the desired variable
image = tf.Variable(input)
for iteration in range(400):
with tf.GradientTape() as tape:
tape.watch(image)
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
prediction = model(image, training=False) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fun(target, prediction)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, image)
#print(grads) # output: [None]
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
optimizer.apply_gradients(zip([grads], [image]))
print('Iteration {}'.format(iteration))

Why does this RNN in tensorflow not learn?

I am trying to train an RNN without using the RNN API in tensorflow (2) in Python 3.7, so the code is very basic. Something is going really wrong, but I'm not sure what it is.
As a reference, I am using a dataset from this tensorflow tutorial so I know what the error should roughly converge to. My RNN code is the following. What it is trying to do is use the previous 20 timesteps to predict the value of a series at the 21st timestep. I am training in batches of size 256.
While there is a decrease in loss over time, the ceiling is approximately 10x what it is if I follow the tutorial approach. Could it be some problem with the backpropagation through time?
state_size = 20 #dimensionality of the network
BATCH_SIZE = 256
#define recurrent weights and biases. W has 1 more dimension that the state
#dimension as also processes the inputs
W = tf.Variable(np.random.rand(state_size+1, state_size), dtype=tf.float32)
b = tf.Variable(np.zeros((1,state_size)), dtype=tf.float32)
#weights and biases for the output
W2 = tf.Variable(np.random.rand(state_size, 1),dtype=tf.float32)
b2 = tf.Variable(np.zeros((1,1)), dtype=tf.float32)
init_state = tf.Variable(np.random.normal(size=[BATCH_SIZE,state_size]),dtype='float32')
optimizer = tf.keras.optimizers.Adam(1e-3)
losses = []
for epoch in range(20):
with tf.GradientTape() as tape:
loss = 0
for batch_idx in range(200):
current_state = init_state
batchx = x_train_uni[batch_idx*BATCH_SIZE:(batch_idx+1)*BATCH_SIZE].swapaxes(0,1)
batchy = y_train_uni[batch_idx*BATCH_SIZE:(batch_idx+1)*BATCH_SIZE]
#forward pass through the timesteps
for x in batchx:
inst = tf.concat([current_state,x],1) #concatenate state and inputs for that timepoint
current_state = tf.tanh(tf.matmul(inst, W) + b) #
#predict using the hidden state after the full forward pass
pred = tf.matmul(current_state,W2) + b2
loss += tf.reduce_mean(tf.abs(batchy-pred))
#get gradients with respect to parameters
gradients = tape.gradient(loss, [W,b,W2,b2])
#apply gradients
optimizer.apply_gradients(zip(gradients, [W,b,W2,b2]))
losses.append(loss)
print(loss)

MLP(ReLu) stops learning after few iterations. Tensor Flow

2 layers MLP (Relu) + Softmax
After 20 iterations, Tensor Flow just gives up and stops updating any weights or biases.
I initially thought that my ReLu where dying, so I displayed histograms to make sure none of them where 0. And none of them are !
They just stop changing after few iterations and cross entropy is still high. ReLu, Sigmoid and tanh gives the same results. Tweaking GradientDescentOptimizer from 0.01 to 0.5 also doesn't change much.
There has to be a bug somewhere. Like an actual bug in my code. I can't even overfit a small sample set !
Here are my histograms and here's my code, if anyone could check it out, that would be a major help.
We have 3000 scalars with 6 values between 0 and 255
to classify in two classes : [1,0] or [0,1]
(I made sure to randomise the order)
def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):
with tf.name_scope(layer_name):
weights = tf.Variable(tf.truncated_normal([input_dim, output_dim], stddev=1.0 / math.sqrt(float(6))))
tf.summary.histogram('weights', weights)
biases = tf.Variable(tf.constant(0.4, shape=[output_dim]))
tf.summary.histogram('biases', biases)
preactivate = tf.matmul(input_tensor, weights) + biases
tf.summary.histogram('pre_activations', preactivate)
#act=tf.nn.relu
activations = act(preactivate, name='activation')
tf.summary.histogram('activations', activations)
return activations
#We have 3000 scalars with 6 values between 0 and 255 to classify in two classes
x = tf.placeholder(tf.float32, [None, 6])
y = tf.placeholder(tf.float32, [None, 2])
#After normalisation, input is between 0 and 1
normalised = tf.scalar_mul(1/255,x)
#Two layers
hidden1 = nn_layer(normalised, 6, 4, "hidden1")
hidden2 = nn_layer(hidden1, 4, 2, "hidden2")
#Finish by a softmax
softmax = tf.nn.softmax(hidden2)
#Defining loss, accuracy etc..
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=softmax))
tf.summary.scalar('cross_entropy', cross_entropy)
correct_prediction = tf.equal(tf.argmax(softmax, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
tf.summary.scalar('accuracy', accuracy)
#Init session and writers and misc
session = tf.Session()
train_writer = tf.summary.FileWriter('log', session.graph)
train_writer.add_graph(session.graph)
init= tf.global_variables_initializer()
session.run(init)
merged = tf.summary.merge_all()
#Train
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
batch_x, batch_y = self.trainData
for _ in range(1000):
session.run(train_step, {x: batch_x, y: batch_y})
#Every 10 steps, add to the summary
if _ % 10 == 0:
s = session.run(merged, {x: batch_x, y: batch_y})
train_writer.add_summary(s, _)
#Evaluate
evaluate_x, evaluate_y = self.evaluateData
print(session.run(accuracy, {x: batch_x, y: batch_y}))
print(session.run(accuracy, {x: evaluate_x, y: evaluate_y}))
Hidden Layer 1. Output isn't zero, so that's not a dying ReLu problem. but still, weights are constant! TF didn't even try to modify them
Same for Hidden Layer 2. TF tried tweaking them a bit and gave up pretty fast.
Cross entropy does decrease, but stays staggeringly high.
EDIT :
LOTS of mistakes in my code.
First one is 1/255 = 0 in python... Changed it to 1.0/255.0 and my code started to live.
So basically, my input was multiplied by 0 and the neural network just was purely blind. So he tried to get the best result he could while being blind and then gave up. Which explains totally it's reaction.
Now I was applying a softmax twice... Modifying it helped also.
And by strying different learning rates and different number of epoch I finally found something good.
Here is the final working code :
def runModel(self):
def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):
with tf.name_scope(layer_name):
#This is standard weight for neural networks with ReLu.
#I divide by math.sqrt(float(6)) because my input has 6 values
weights = tf.Variable(tf.truncated_normal([input_dim, output_dim], stddev=1.0 / math.sqrt(float(6))))
tf.summary.histogram('weights', weights)
#I chose this bias myself. It work. Not sure why.
biases = tf.Variable(tf.constant(0.4, shape=[output_dim]))
tf.summary.histogram('biases', biases)
preactivate = tf.matmul(input_tensor, weights) + biases
tf.summary.histogram('pre_activations', preactivate)
#Some neurons will have ReLu as activation function
#Some won't have any activation functions
if act == "None":
activations = preactivate
else :
activations = act(preactivate, name='activation')
tf.summary.histogram('activations', activations)
return activations
#We have 3000 scalars with 6 values between 0 and 255 to classify in two classes
x = tf.placeholder(tf.float32, [None, 6])
y = tf.placeholder(tf.float32, [None, 2])
#After normalisation, input is between 0 and 1
#Normalising input really helps. Nothing is doable without it
#But my ERROR was to write 1/255. Becase in python
#1/255 = 0 .... (integer division)
#But 1.0/255.0 = 0,003921568 (float division)
normalised = tf.scalar_mul(1.0/255.0,x)
#Three layers total. The first one is just a matrix multiplication
input = nn_layer(normalised, 6, 4, "input", act="None")
#The second one has a ReLu after a matrix multiplication
hidden1 = nn_layer(input, 4, 4, "hidden", act=tf.nn.relu)
#The last one is also jsut a matrix multiplcation
#WARNING ! No softmax here ! Because later we call a function
#That implicitly does a softmax
#And it's bad practice to do two softmax one after the other
output = nn_layer(hidden1, 4, 2, "output", act="None")
#Tried different learning rates
#Higher learning rate means find a result faster
#But could be a local minimum
#Lower learning rate means we need much more epochs
learning_rate = 0.03
with tf.name_scope('learning_rate_'+str(learning_rate)):
#Defining loss, accuracy etc..
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=output))
tf.summary.scalar('cross_entropy', cross_entropy)
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
tf.summary.scalar('accuracy', accuracy)
#Init session and writers and misc
session = tf.Session()
train_writer = tf.summary.FileWriter('log', session.graph)
train_writer.add_graph(session.graph)
init= tf.global_variables_initializer()
session.run(init)
merged = tf.summary.merge_all()
#Train
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
batch_x, batch_y = self.trainData
for _ in range(1000):
session.run(train_step, {x: batch_x, y: batch_y})
#Every 10 steps, add to the summary
if _ % 10 == 0:
s = session.run(merged, {x: batch_x, y: batch_y})
train_writer.add_summary(s, _)
#Evaluate
evaluate_x, evaluate_y = self.evaluateData
print(session.run(accuracy, {x: batch_x, y: batch_y}))
print(session.run(accuracy, {x: evaluate_x, y: evaluate_y}))

I'm afraid that you have to reduce your learning rate. It's to high. High learning rate usually leads you to local minimum not global one.
Try 0.001, 0.0001 or even 0.00001. Or make your learning rate flexible.
I did not checked the code, so firstly try to tune LR.

Just incase someone needs it in the future:
I had initialized my dual layer network's layers with np.random.randn but the network refused to learn. Using the He (for ReLU) and Xavier(for softmax) initializations totally worked.

Tensorflow: Number of nodes in the graph keeps increasing as training goes on

I'm training a convolutional model in tensorflow. After training the model for about 70 epochs, which took almost 1.5 hrs, I couldn't save the model. It gave me ValueError: GraphDef cannot be larger than 2GB. I found that as the training proceeds the number of nodes in my graph keeps increasing.
At epochs 0,3,6,9, the number of nodes in the graph are 7214, 7238, 7262, 7286 respectively. When I use with tf.Session() as sess:, instead of passing the session as sess = tf.Session(), the number of nodes are 3982, 4006, 4030, 4054 at epochs 0,3,6,9 respectively.
In this answer, it is said that as nodes get added to the graph, it can exceed its maximum size. I need help with understanding how the number of nodes keep going up in my graph.
I train my model using the code below:
def runModel(data):
'''
Defines cost, optimizer functions, and runs the graph
'''
X, y,keep_prob = modelInputs((755, 567, 1),4)
logits = cnnModel(X,keep_prob)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name="cost")
optimizer = tf.train.AdamOptimizer(.0001).minimize(cost)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1), name="correct_pred")
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
for e in range(12):
batch_x, batch_y = data.next_batch(30)
x = tf.reshape(batch_x, [30, 755, 567, 1]).eval(session=sess)
batch_y = tf.one_hot(batch_y,4).eval(session=sess)
sess.run(optimizer, feed_dict={X: x, y: batch_y,keep_prob:0.5})
if e%3==0:
n = len([n.name for n in tf.get_default_graph().as_graph_def().node])
print("No.of nodes: ",n,"\n")
current_cost = sess.run(cost, feed_dict={X: x, y: batch_y,keep_prob:1.0})
acc = sess.run(accuracy, feed_dict={X: x, y: batch_y,keep_prob:1.0})
print("At epoch {epoch:>3d}, cost is {a:>10.4f}, accuracy is {b:>8.5f}".format(epoch=e, a=current_cost, b=acc))
What causes an increase in the number of nodes?

You are creating new nodes within your training loop. In particular, you are calling tf.reshape and tf.one_hot, each of which creates one (or more) nodes. You can either:
Create those nodes outside of the graph using placeholders as inputs, and then only evaluate them in the loop.
Not use TensorFlow for those operations and use instead NumPy or equivalent operations.
I would recommend the second one, since there does not seem to be any benefit in using TensorFlow for data preparation. You can have something like:
import numpy as np
# ...
x = np.reshape(batch_x, [30, 755, 567, 1])
# ...
# One way of doing one-hot encoding with NumPy
classes_arr = np.arange(4).reshape([1] * batch_y.ndims + [-1])
batch_y = (np.expand_dims(batch_y, -1) == classes_arr).astype(batch_y.dtype)
# ...
PD: I'd also recommend using tf.Session() in a with context manager to make sure its close() method is called at the end (unless you want to keep using the same session later).

Another option, that solved a similar problem for me, is to use tf.reset_default_graph()

Linear regression with tensorflow

I trying to understand linear regression... here is script that I tried to understand:
'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
from __future__ import print_function
import tensorflow as tf
from numpy import *
import numpy
import matplotlib.pyplot as plt
rng = numpy.random
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X=numpy.asarray(train_X)
train_Y=numpy.asarray(train_Y)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Fit all training data
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
# Display logs per epoch step
if (epoch+1) % display_step == 0:
c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
"W=", sess.run(W), "b=", sess.run(b))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
plt.legend()
plt.show()
Question is what this part represent:
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
And why are there random float numbers?
Also could you show me some math with formals represents cost, pred, optimizer variables?

let's try to put up some intuition&sources together with the tfapproach.
General intuition:
Regression as presented here is a supervised learning problem. In it, as defined in Russel&Norvig's Artificial Intelligence, the task is:
given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ... , (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f
For that sake, the h hypothesis function combines somehow each x with the to-be-learned parameters, in order to have an output that is as close to the corresponding y as possible, and this for the whole dataset. The hope is that the resulting function will be close to f.
But how to learn this parameters? in order to be able to learn, the model has to be able to evaluate. Here comes the cost (also called loss, energy, merit...) function to play: it is a metric function that compares the output of h with the corresponding y, and penalizes big differences.
Now it should be clear what is exactly the "learning" process here: alter the parameters in order to achieve a lower value for the cost function.
Linear Regression:
The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as cost function. Which means:
Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders thorough the learning process.
Linear: The output of h is merely a linear (actually, affine) combination between the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x,w, b)= w.transposed()*x+b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.
Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be directly found with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
The TensorFlow approach:
Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensoFlow follows a declarative style, which means that the graph has to be first completely defined and compiled, before it is deployed and executed. It is very reccommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:
Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.
Then you run the computation in a Session and the library will be able to (re)load the data placeholders and perform the optimization.
The code:
The code of the example follows this approach closely:
Define the test data X and labels Y, and prepare a placeholder in the Graph for them (which is fed in the feed_dict part).
Define the 'W' and 'b' placeholders for the parameters. They have to be Variables because they will be updated during the Session.
Define pred (our hypothesis) and cost as explained before.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that)
Cheers!
Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
This small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach doesn't require looping, and brings better results. For small dimensionality (DIMENSIONS<30k) is probably the preferred approach:
from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf
####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
# forward propagation
train_prediction = tf.matmul(x_train, theta)+theta_0
test_prediction = tf.matmul(x_test, theta) +theta_0
# cost function and optimizer
train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
# test results
test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)
# run the computation
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized"); print(theta.eval())
for step in range(TRAINING_STEPS):
_, train_c, test_c = s.run([optimizer, train_cost, test_cost],
feed_dict={x_train: train_data, y_train: train_labels,
x_test: test_data, y_test: test_labels })
if (step%100==0):
# it should return bias close to zero and parameters all close to 1 (see definition of f)
print("\nAfter", step, "iterations:")
#print(" Bias =", theta_0.eval(), ", Weights = ", theta.eval())
print(" train cost =", train_c); print(" test cost =", test_c)
PARAMETERS_GRADDESC = tf.concat(0, [theta_0, theta]).eval()
print("Solution for parameters:\n", PARAMETERS_GRADDESC)
####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset isused for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
# optimum
optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)
# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized")
opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
PARAMETERS_NORMEQ = opt
print("Solution for parameters:\n",PARAMETERS_NORMEQ)
####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################
# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda(x): PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda(x): PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)

Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
The variable with type tf.Variable is the parameter which we will learn use TensorFlow. Assume you use the gradient descent to minimize the loss function. You need initial these parameter first. The rng.randn() is used to generate a random value for this purpose.
I think the Getting Started With TensorFlow is a good start point for you.

I'll first define the variables:
W is a multidimensional line that spans R^d (same dimensionality as X)
b is a scalar value (bias)
Y is also a scalar value i.e. the value at X
pred = W (dot) X + b # dot here refers to dot product
# cost equals the average squared error
cost = ((pred - Y)^2) / 2*num_samples
#finally optimizer
# optimizer computes the gradient with respect to each variable and the update
W += learning_rate * (pred - Y)/num_samples * X
b += learning_rate * (pred - Y)/num_samples
Why are W and b set to random well this updates based on gradients from the error calculated from the cost so W and b could have been initialized to anything. It isn't performing linear regression via least squares method although both will converge to the same solution.
Look here for more information: Getting Started

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to accumulate my loss over mini batches then calculate my gradient - python

Related

tf.keras GradientTape: get gradient with respect to input

Why does this RNN in tensorflow not learn?

MLP(ReLu) stops learning after few iterations. Tensor Flow

Tensorflow: Number of nodes in the graph keeps increasing as training goes on

Linear regression with tensorflow

Categories

Resources