Tensorflow reuse variables different before and after training

Tensorflow reuse variables different before and after training - python

In trying to learn a bit about Tensorflow, I had been building a Variational Auto Encoder, which is working, however I noticed that, after training, I was getting different results from the decoders which are sharing the same variables.
I created two decoders, because the first I train against my dataset, the second I want to eventually feed a new Z encoding in order to produce new values.
My check is that I shoud be able to send the Z values generated from the encoding process to both decoders and get equal results.
I have 2 Decoders (D, D_new). D_new shares the variable scope from D.
before training, I can send values into the Encoder (E) to generate output values as well as the Z values it generated (Z_gen).
if I use Z_gen as input to D_new before training then its output is identical to the output of D, which is expected.
After a few iterations of training, however, the output of D compared with D_new begins to diverge (although they are quite similar).
I have paired this down to a more simple version of my code which still reproduces the error. I'm wondering if others have found this to be the case and where I might be able to correct for it.
The below code can be run in a jupyter notebook. I'm using Tensorflow r0.11 and Python 3.5.0
import numpy as np
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import os
import pylab as pl
mgc = get_ipython().magic
mgc(u'matplotlib inline')
pl.rcParams['figure.figsize'] = (8.0, 5.0)
##-- Helper function Just for visualizing the data
def plot_values(values, file=None):
t = np.linspace(1.0,len(values[0]),len(values[0]))
for i in range(len(values)):
plt.plot(t,values[i])
if file is None:
plt.show()
else:
plt.savefig(file)
plt.close()
def encoder(input, n_hidden, n_z):
with tf.variable_scope("ENCODER"):
with tf.name_scope("Hidden"):
n_layer_inputs = input.get_shape()[1].value
n_layer_outputs = n_hidden
with tf.name_scope("Weights"):
w = tf.get_variable(name="E_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
with tf.name_scope("Activation"):
a = tf.tanh(tf.matmul(input,w))
prevLayer = a
with tf.name_scope("Z"):
n_layer_inputs = prevLayer.get_shape()[1].value
n_layer_outputs = n_z
with tf.name_scope("Weights"):
w = tf.get_variable(name="E_Z", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
with tf.name_scope("Activation"):
Z_gen = tf.matmul(prevLayer,w)
return Z_gen
def decoder(input, n_hidden, n_outputs, reuse=False):
with tf.variable_scope("DECODER", reuse=reuse):
with tf.name_scope("Hidden"):
n_layer_inputs = input.get_shape()[1].value
n_layer_outputs = n_hidden
with tf.name_scope("Weights"):
w = tf.get_variable(name="D_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
with tf.name_scope("Activation"):
a = tf.tanh(tf.matmul(input,w))
prevLayer = a
with tf.name_scope("OUTPUT"):
n_layer_inputs = prevLayer.get_shape()[1].value
n_layer_outputs = n_outputs
with tf.name_scope("Weights"):
w = tf.get_variable(name="D_Output", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
with tf.name_scope("Activation"):
out = tf.sigmoid(tf.matmul(prevLayer,w))
return out
Here is where the Tensorflow graph is setup:
batch_size = 3
n_inputs = 100
n_hidden_nodes = 12
n_z = 2
with tf.variable_scope("INPUT_VARS"):
with tf.name_scope("X"):
X = tf.placeholder(tf.float32, shape=(None, n_inputs))
with tf.name_scope("Z"):
Z = tf.placeholder(tf.float32, shape=(None, n_z))
Z_gen = encoder(X,n_hidden_nodes,n_z)
D = decoder(Z_gen, n_hidden_nodes, n_inputs)
D_new = decoder(Z, n_hidden_nodes, n_inputs, reuse=True)
with tf.name_scope("COST"):
loss = -tf.reduce_mean(X * tf.log(1e-6 + D) + (1-X) * tf.log(1e-6 + 1 - D))
train_step = tf.train.AdamOptimizer(0.001, beta1=0.5).minimize(loss)
I'm generating a training set of 3 samples of normal distribution noise with 100 data points and then sort it to more easily visualize:
train_data = (np.random.normal(0,1,(batch_size,n_inputs)) + 3) / 6.0
train_data.sort()
plot_values(train_data)
startup the session:
sess = tf.InteractiveSession()
sess.run(tf.group(tf.initialize_all_variables(), tf.initialize_local_variables()))
Lets just look at what the network initially generates before training...
resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
Pulling the Z generated values and feeding them to D_new which is reusing the variables from D:
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
Just for sanity I'll plot the difference between the two to be sure they're the same...
Now run 1000 training epochs and plot the result...
for i in range(1000):
_, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
Now lets feed those same Z values to D_new and plot those results...
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
They look pretty similar. But (I think) they should be exactly the same. Let's look at the difference...
plot_values(resultA - resultB)
You can see there is some variation now. This becomes much more dramatic with a larger network on more complex data, but still shows up in this simple example.
Any clues as to what's going on?

There are some methods (don't know which one specifically) which can be supplied with a seed value. Besides those, I'm not even sure if the training process is completely deterministic, especially when the GPU is involved, simply by the nature of parallelization.
See this question.

While I don't have a full explanation for the reason why, I was able to resolve my issue by changing:
for i in range(1000):
_, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
plot_values(resultA - resultB)
to...
for i in range(1000):
_, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X:train_data})
resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
plot_values(resultA - resultB)
Note, that I simply ran and extracted the result and Z_vals one last time, without the train_step.
The reason I was still seeing problems in my more complex setup was that I had bias variables (even though they were set to 0.0) that were being generated with...
b = tf.Variable(tf.constant(self.bias_k, shape=[n_layer_outputs], dtype=tf.float32))
And that is somehow not considered while using reuse with a tf.variable_scope. So there were variables technically not being reused. Why they presented such a problem when set to 0.0 I'm not sure.

Related

Why not train GANs like this?

I'm new to generative networks and I decided to first try it on my own before seeing up a code. These are the steps I used to train my GAN.
[lib: tensorflow]
1) Train a discriminator on the dataset. (I used a dataset of 2 features with labels of either 'mediatating' or 'not meditating', dataset: https://drive.google.com/open?id=0B5DaSp-aTU-KSmZtVmFoc0hRa3c )
2) Once the the discriminator is trained, save it.
3) Make another file with for another feed forward network (or any other depending on your dataset). This feed forward network is the generator.
4) Once the generator is constructed, restore the discriminator and define a loss function for generator such that it learns to fool the discriminator. (this didn't work in tensorflow because sess.run() doesn't return a tf tensor and the path between G and D breaks but should work when done from scratch)
d_output = sess.run(graph.get_tensor_by_name('ol:0'), feed_dict={graph.get_tensor_by_name('features_placeholder:0'): g_output})
print(d_output)
optimize_for = tf.constant([[0.0]*10]) #not meditating
g_loss = -tf.reduce_mean((d_output - optimize_for)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(g_loss)
Why don't we train a generator like this? This seems so much simpler. It's true I couldn't manage to run this on tensorflow but this should be possible if I do from scratch.
Full code:
Discriminator:
import pandas as pd
import tensorflow as tf
from sklearn.utils import shuffle
data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_train.csv")
learning_rate = 0.001
batch_size = 1
n_epochs = 1000
n_examples = 999 # This is highly unsatisfying >:3
n_iteration = int(n_examples/batch_size)
features = tf.placeholder('float', [None, 2], name='features_placeholder')
labels = tf.placeholder('float', [None, 1], name = 'labels_placeholder')
weights = {
'ol': tf.Variable(tf.random_normal([2, 1]), name = 'w_ol')
}
biases = {
'ol': tf.Variable(tf.random_normal([1]), name = 'b_ol')
}
ol = tf.nn.sigmoid(tf.add(tf.matmul(features, weights['ol']), biases['ol']), name = 'ol')
loss = tf.reduce_mean((labels - ol)**2, name = 'loss')
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for epoch in range(n_epochs):
ptr = 0
data = shuffle(data)
data_f = data.drop("lbl", axis = 1)
data_l = data.drop(["f1", "f2"], axis = 1)
for iteration in range(n_iteration):
epoch_x = data_f[ptr: ptr + batch_size]
epoch_y = data_l[ptr: ptr + batch_size]
ptr = ptr + batch_size
_, lss = sess.run([train, loss], feed_dict={features: epoch_x, labels:epoch_y})
print("Loss # epoch ", epoch, " = ", lss)
print("\nTesting...\n")
data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_eval.csv")
test_data_l = data.drop(["f1", "f2"], axis = 1)
test_data_f = data.drop("lbl", axis = 1)
print(sess.run(ol, feed_dict={features: test_data_f}))
print(test_data_l)
print("Saving model...")
saver = tf.train.Saver()
saver.save(sess, save_path="E:/workspace_py/saved_models/meditation_disciminative_model.ckpt")
sess.close()
Generator:
import tensorflow as tf
# hyper parameters
learning_rate = 0.1
# batch_size = 1
n_epochs = 100
from numpy import random
noise = random.rand(10, 2)
print(noise)
# Model
input_placeholder = tf.placeholder('float', [None, 2])
weights = {
'hl1': tf.Variable(tf.random_normal([2, 3]), name = 'w_hl1'),
'ol': tf.Variable(tf.random_normal([3, 2]), name = 'w_ol')
}
biases = {
'hl1': tf.Variable(tf.zeros([3]), name = 'b_hl1'),
'ol': tf.Variable(tf.zeros([2]), name = 'b_ol')
}
hl1 = tf.add(tf.matmul(input_placeholder, weights['hl1']), biases['hl1'])
ol = tf.add(tf.matmul(hl1, weights['ol']), biases['ol'])
sess = tf.Session()
sess.run(tf.global_variables_initializer())
g_output = sess.run(ol, feed_dict={input_placeholder: noise})
# restoring discriminator
saver = tf.train.import_meta_graph("E:/workspace_py/saved_models/meditation_disciminative_model.ckpt.meta")
saver.restore(sess, tf.train.latest_checkpoint('E:/workspace_py/saved_models/'))
graph = tf.get_default_graph()
d_output = sess.run(graph.get_tensor_by_name('ol:0'), feed_dict={graph.get_tensor_by_name('features_placeholder:0'): g_output})
print(d_output)
optimize_for = tf.constant([[0.0]*10])
g_loss = -tf.reduce_mean((d_output - optimize_for)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(g_loss)

The discriminator's purpose isn't to classify your original data, or really discriminate anything about your original data. Its sole purpose is to discriminate your generator's output from original output.
Think of an example of an art forger. Your dataset is all original paintings. Your generator network G is an art forger, and your discriminator D is a detective whose sole purpose in life is to find forgeries made by G.
D can't learn much just by looking at original paintings. What's really important for him is to figure out what sets G's forgeries apart from everything else. G can't make any money selling forgeries if all his pieces are discovered and marked as such by D, so he must learn how to thwart D.
This creates an environment where G is constantly trying to make his pieces look more "like" original artwork, and D is constantly getting better at finding the nuances to G's forgery style. The better D gets, the better G needs to be in order to make a living. They each get better at their task until they (theoretically) reach some Nash equilibrium defined by the complexity of the networks and the data they're trying to forge.
That's why D needs to be trained back-and-forth with G, because it needs to know and adapt to G's particular nuances (which change over time as G learns and adapts), not just find some average definition of "not forged". By making D hunt G specifically, you force G to become a better forger, and thus end up with a better generator network. If you just train D once, then G can learn some easy, obvious, unimportant way to beat D and never actually produce very good forgeries.

Tensorflow: weights aren't changing and cost set to 1.0

I tried to build a convolutional neural network but I have stumbled over some really strange problems.
first thing's first, here's my code:
import tensorflow as tf
import numpy as np
import matplotlib.image as mpimg
import glob
x = []
y = 1
for filename in glob.glob('trainig_data/*.jpg'):
im = mpimg.imread(filename)
x.append(im)
if len(x) == 10:
break
epochs = 5
weights = [tf.Variable(tf.random_normal([5,5,3,32],0.1)),
tf.Variable(tf.random_normal([5,5,32,64],0.1)),
tf.Variable(tf.random_normal([5,5,64,128],0.1)),
tf.Variable(tf.random_normal([75*75*128,1064],0.1)),
tf.Variable(tf.random_normal([1064,1],0.1))]
def CNN(x, weights):
output = tf.nn.conv2d([x], weights[0], [1,1,1,1], 'SAME')
output = tf.nn.relu(output)
output = tf.nn.conv2d(output, weights[1], [1,2,2,1], 'SAME')
output = tf.nn.relu(output)
output = tf.nn.conv2d(output, weights[2], [1,2,2,1], 'SAME')
output = tf.nn.relu(output)
output = tf.reshape(output, [-1,75*75*128])
output = tf.matmul(output, weights[3])
output = tf.nn.relu(output)
output = tf.matmul(output, weights[4])
output = tf.reduce_sum(output)
return output
sess = tf.Session()
prediction = CNN(tf.cast(x[0],tf.float32), weights)
cost = tf.reduce_mean(tf.square(prediction-y))
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
sess.run(init)
for e in range(epochs):
print('epoch:',e+1)
for x_i in x:
prediction = CNN(tf.cast(x_i,tf.float32), weights)
sess.run([cost, train])
print(sess.run(cost))
print('optimization finished!')
print(sess.run(prediction))
Now here are my problems:
The values of the weights and filters are not changing
The variable 'cost' is always 1.0
The prediction always puts out a 0
After doing some debugging I found out that the problem must come from the optimizer, because the cost and the prediction are not 1.0 and 0 before I put the weights trough the optimizer.
I hope that was enough information and that you can help me with my problem.

Try changing the way you initialise weights, use tf.truncated_normal to initialise weights. Refer answer, which states difference between tf.truncated_normal.
tf.truncted_normal: Outputs random values from a truncated normal distribution. The generated values follow a normal distribution with specified mean and standard deviation, except that values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked.
tf.random_normal: Outputs random values from a normal distribution.

the code seems to be odd. in the last line of your CNN function, you used the tf.reduce_mean to get a single value as an output, which is going to be a positive number (most probably greater than 1) (0, inf) as relu activation function output a positive value only for inputs lying on the positive x-axis. so, i think you should use tf.nn.softmax_with_logits() instead of tf.reduce_mean. Also try with using sigmoid activation function.

Tensorflow sawtooth memory usage during training loop iteration

I am training a neural network in a Python while loop which continues until some stopping condition is reached. I've noticed that when I train my network, I can see "sawtooth"/wave-like memory usage patterns, like this:
I've managed to reproduce this using a much simpler example than my production model. Obviously this is somewhat different as I don't update parameters, but I believe it replicates the behavior I'm seeing.
import tensorflow as tf
import numpy as np
def main(x_vals):
x = tf.placeholder(tf.float32, [500, 1000, 1000])
rs = tf.reduce_sum(x)
sess = tf.Session()
v = sess.run(rs, feed_dict={x:x_vals})
print(v)
if __name__ == "__main__":
x_vals = np.random.rand(500, 1000, 1000)
while True:
main(x_vals)
The size of the sawtooth seems to approximately scale with the size of the input data. Understandably, there appears to be one cycle per loop iteration.
What is happening here? Is Tensorflow copying over all of my data on each session evaluation? This isn't a problem per-se, but if I could avoid copying over my data on each training loop iteration (since my entire dataset fits in memory), I'd like to do that as I imagine the allocations are quite expensive. Have I diverged from best practices somewhere?

Using feed_dict will typically copy the data. Recently there new functionality was added that will avoid the copy, but you have to make sure that your data is word-aligned, see discussion in
https://github.com/tensorflow/tensorflow/issues/9690

I wanted to post a follow-up after a couple days of investigation. #Yaroslav pointed me in the correct direction, but there's a bit of color to the full answer.
One can limit some amount of memory allocation in each session by trying to avoid feed_dict by say, using a preloaded variable (if your dataset fits in memory). However, there's also some dynamic allocation done by the optimization computation graph, presumably to store gradients.
I've included code which demonstrates this. Here's a snippet of what the memory use looks like. On the left is repeated calling the preloaded function, while the right demonstrates repeatedly calling loaded_each_iteration.
import tensorflow as tf
import numpy as np
A_shape = [100000, 3000]
b_shape = [100000, 1]
x_shape = [3000, 1]
def loaded_each_iteration(A_vals):
A = tf.placeholder(tf.float32, A_shape)
b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
diff = tf.matmul(A, x) - b
cost = tf.nn.l2_loss(diff)
train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
sess = tf.Session()
sess.run(x.initializer)
sess.run(tf.global_variables_initializer())
while True:
_, c = sess.run([train_op, cost], feed_dict={A:A_vals})
print(c)
def preloaded(A_vals):
A_init = tf.placeholder(tf.float32, A_shape)
A = tf.Variable(A_init, trainable=False, collections=[], name='A', dtype=tf.float32)
b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
diff = tf.matmul(A, x) - b
cost = tf.nn.l2_loss(diff)
train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
sess = tf.Session()
sess.run([A.initializer, x.initializer], feed_dict={A_init:A_vals})
sess.run(tf.global_variables_initializer())
while True:
_, c = sess.run([train_op, cost])
print(c)
if __name__ == "__main__":
A_vals = np.random.rand(*A_shape)
while True:
loaded_each_iteration(A_vals)

Linear regression with tensorflow

I trying to understand linear regression... here is script that I tried to understand:
'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
from __future__ import print_function
import tensorflow as tf
from numpy import *
import numpy
import matplotlib.pyplot as plt
rng = numpy.random
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X=numpy.asarray(train_X)
train_Y=numpy.asarray(train_Y)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Fit all training data
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
# Display logs per epoch step
if (epoch+1) % display_step == 0:
c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
"W=", sess.run(W), "b=", sess.run(b))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
plt.legend()
plt.show()
Question is what this part represent:
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
And why are there random float numbers?
Also could you show me some math with formals represents cost, pred, optimizer variables?

let's try to put up some intuition&sources together with the tfapproach.
General intuition:
Regression as presented here is a supervised learning problem. In it, as defined in Russel&Norvig's Artificial Intelligence, the task is:
given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ... , (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f
For that sake, the h hypothesis function combines somehow each x with the to-be-learned parameters, in order to have an output that is as close to the corresponding y as possible, and this for the whole dataset. The hope is that the resulting function will be close to f.
But how to learn this parameters? in order to be able to learn, the model has to be able to evaluate. Here comes the cost (also called loss, energy, merit...) function to play: it is a metric function that compares the output of h with the corresponding y, and penalizes big differences.
Now it should be clear what is exactly the "learning" process here: alter the parameters in order to achieve a lower value for the cost function.
Linear Regression:
The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as cost function. Which means:
Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders thorough the learning process.
Linear: The output of h is merely a linear (actually, affine) combination between the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x,w, b)= w.transposed()*x+b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.
Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be directly found with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
The TensorFlow approach:
Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensoFlow follows a declarative style, which means that the graph has to be first completely defined and compiled, before it is deployed and executed. It is very reccommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:
Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.
Then you run the computation in a Session and the library will be able to (re)load the data placeholders and perform the optimization.
The code:
The code of the example follows this approach closely:
Define the test data X and labels Y, and prepare a placeholder in the Graph for them (which is fed in the feed_dict part).
Define the 'W' and 'b' placeholders for the parameters. They have to be Variables because they will be updated during the Session.
Define pred (our hypothesis) and cost as explained before.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that)
Cheers!
Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
This small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach doesn't require looping, and brings better results. For small dimensionality (DIMENSIONS<30k) is probably the preferred approach:
from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf
####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
# forward propagation
train_prediction = tf.matmul(x_train, theta)+theta_0
test_prediction = tf.matmul(x_test, theta) +theta_0
# cost function and optimizer
train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
# test results
test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)
# run the computation
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized"); print(theta.eval())
for step in range(TRAINING_STEPS):
_, train_c, test_c = s.run([optimizer, train_cost, test_cost],
feed_dict={x_train: train_data, y_train: train_labels,
x_test: test_data, y_test: test_labels })
if (step%100==0):
# it should return bias close to zero and parameters all close to 1 (see definition of f)
print("\nAfter", step, "iterations:")
#print(" Bias =", theta_0.eval(), ", Weights = ", theta.eval())
print(" train cost =", train_c); print(" test cost =", test_c)
PARAMETERS_GRADDESC = tf.concat(0, [theta_0, theta]).eval()
print("Solution for parameters:\n", PARAMETERS_GRADDESC)
####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset isused for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
# optimum
optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)
# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized")
opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
PARAMETERS_NORMEQ = opt
print("Solution for parameters:\n",PARAMETERS_NORMEQ)
####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################
# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda(x): PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda(x): PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)

Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
The variable with type tf.Variable is the parameter which we will learn use TensorFlow. Assume you use the gradient descent to minimize the loss function. You need initial these parameter first. The rng.randn() is used to generate a random value for this purpose.
I think the Getting Started With TensorFlow is a good start point for you.

I'll first define the variables:
W is a multidimensional line that spans R^d (same dimensionality as X)
b is a scalar value (bias)
Y is also a scalar value i.e. the value at X
pred = W (dot) X + b # dot here refers to dot product
# cost equals the average squared error
cost = ((pred - Y)^2) / 2*num_samples
#finally optimizer
# optimizer computes the gradient with respect to each variable and the update
W += learning_rate * (pred - Y)/num_samples * X
b += learning_rate * (pred - Y)/num_samples
Why are W and b set to random well this updates based on gradients from the error calculated from the cost so W and b could have been initialized to anything. It isn't performing linear regression via least squares method although both will converge to the same solution.
Look here for more information: Getting Started

How to feed back RNN output to input in tensorflow

In case where suppose I have a trained RNN (e.g. language model), and I want to see what it would generate on its own, how should I feed its output back to its input?
I read the following related questions:
TensorFlow using LSTMs for generating text
TensorFlow LSTM Generative Model
Theoretically it is clear to me, that in tensorflow we use truncated backpropagation, so we have to define the max step which we would like to "trace". Also we reserve a dimension for batches, therefore if I'd like to train a sine wave, I have to feed [None, num_step, 1] inputs.
The following code works:
tf.reset_default_graph()
n_samples=100
state_size=5
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)
def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]
X = tf.placeholder_with_default(zero_x, [None, n_samples, 1])
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Y = np.roll(def_x, 1)
loss = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
opt = tf.train.AdamOptimizer().minimize(loss)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Initial state run
plt.show(plt.plot(output.eval()[0]))
plt.plot(def_x.squeeze())
plt.show(plt.plot(pred.eval().squeeze()))
steps = 1001
for i in range(steps):
p, l, _= sess.run([pred, loss, opt])
The state size of the LSTM can be varied, also I experimented with feeding sine wave into the network and zeros, and in both cases it converged in ~500 iterations. So far I have understood that in this case the graph consists n_samples number of LSTM cells sharing their parameters, and it is only up to me that I feed input to them as a time series. However when generating samples the network is explicitly depending on its previous output - meaning that I cannot feed the unrolled model at once. I tried to compute the state and output at every step:
with tf.variable_scope('sine', reuse=True):
X_test = tf.placeholder(tf.float64)
X_reshaped = tf.reshape(X_test, [1, -1, 1])
output, last_states = tf.nn.dynamic_rnn(lstm_cell, X_reshaped, dtype=tf.float64)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
test_vals = [0.]
for i in range(1000):
val = pred.eval({X_test:np.array(test_vals)[None, :, None]})
test_vals.append(val)
However in this model it seems that there is no continuity between the LSTM cells. What is going on here?
Do I have to initialize a zero array with i.e. 100 time steps, and assign each run's result into the array? Like feeding the network with this:
run 0: input_feed = [0, 0, 0 ... 0]; res1 = result
run 1: input_feed = [res1, 0, 0 ... 0]; res2 = result
run 1: input_feed = [res1, res2, 0 ... 0]; res3 = result
etc...
What to do if I want to use this trained network to use its own output as its input in the following time step?

If I understood you correctly, you want to find a way to feed the output of time step t as input to time step t+1, right? To do so, there is a relatively easy work around that you can use at test time:
Make sure your input placeholders can accept a dynamic sequence length, i.e. the size of the time dimension is None.
Make sure you are using tf.nn.dynamic_rnn (which you do in the posted example).
Pass the initial state into dynamic_rnn.
Then, at test time, you can loop through your sequence and feed each time step individually (i.e. max sequence length is 1). Additionally, you just have to carry over the internal state of the RNN. See pseudo code below (the variable names refer to your code snippet).
I.e., change the definition of the model to something like this:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)
X = tf.placeholder_with_default(zero_x, [None, None, 1]) # [batch_size, seq_length, dimension of input]
batch_size = tf.shape(self.input_)[0]
initial_state = lstm_cell.zero_state(batch_size, dtype=tf.float32)
def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64,
initial_state=initial_state)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Then you can perform inference like so:
fetches = {'final_state': last_state,
'prediction': pred}
toy_initial_input = np.array([[[1]]]) # put suitable data here
seq_length = 20 # put whatever is reasonable here for you
# get the output for the first time step
feed_dict = {X: toy_initial_input}
eval_out = sess.run(fetches, feed_dict)
outputs = [eval_out['prediction']]
next_state = eval_out['final_state']
for i in range(1, seq_length):
feed_dict = {X: outputs[-1],
initial_state: next_state}
eval_out = sess.run(fetches, feed_dict)
outputs.append(eval_out['prediction'])
next_state = eval_out['final_state']
# outputs now contains the sequence you want
Note that this can also work for batches, however it can be a bit more complicated if you sequences of different lengths in the same batch.
If you want to perform this kind of prediction not only at test time, but also at training time, it is also possible to do, but a bit more complicated to implement.

You can use its own output (last state) as the next-step input (initial state).
One way to do this is to:
use zero-initialized variables as the input state at every time step
each time you completed a truncated sequence and got some output state, update the state variables with this output state you just got.
The second can be done by either:
fetching the states to python and feeding them back next time, as done in the ptb example in tensorflow/models
build an update op in the graph and add a dependency, as done in the ptb example in tensorpack.

I know I'm a bit late to the party but I think this gist could be useful:
https://gist.github.com/CharlieCodex/f494b27698157ec9a802bc231d8dcf31
It lets you autofeed the input through a filter and back into the network as input. To make shapes match up processing can be set as a tf.layers.Dense layer.
Please ask any questions!
Edit:
In your particular case, create a lambda which performs the processing of the dynamic_rnn outputs into your character vector space. Ex:
# if you have:
W = tf.Variable( ... )
B = tf.Variable( ... )
Yo, Ho = tf.nn.dynamic_rnn( cell , inputs , state )
logits = tf.matmul(W, Yo) + B
...
# use self_feeding_rnn as
process_yo = lambda Yo: tf.matmul(W, Yo) + B
Yo, Ho = self_feeding_rnn( cell, seed, initial_state, processing=process_yo)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.