I am training a neural network in a Python while loop which continues until some stopping condition is reached. I've noticed that when I train my network, I can see "sawtooth"/wave-like memory usage patterns, like this:
I've managed to reproduce this using a much simpler example than my production model. Obviously this is somewhat different as I don't update parameters, but I believe it replicates the behavior I'm seeing.
import tensorflow as tf
import numpy as np

def main(x_vals):
    x = tf.placeholder(tf.float32, [500, 1000, 1000])
    rs = tf.reduce_sum(x)
    sess = tf.Session()
    v = sess.run(rs, feed_dict={x: x_vals})
    print(v)

if __name__ == "__main__":
    x_vals = np.random.rand(500, 1000, 1000)
    while True:
        main(x_vals)
The size of the sawtooth seems to approximately scale with the size of the input data. Understandably, there appears to be one cycle per loop iteration.
What is happening here? Is Tensorflow copying over all of my data on each session evaluation? This isn't a problem per se, but if I could avoid copying my data on each training loop iteration (since my entire dataset fits in memory), I'd like to do that, as I imagine the allocations are quite expensive. Have I diverged from best practices somewhere?
Using feed_dict will typically copy the data. Recently, new functionality was added that avoids the copy, but you have to make sure that your data is word-aligned; see the discussion in
https://github.com/tensorflow/tensorflow/issues/9690
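As a rough illustration (not part of the original answer), you can check whether a numpy array's buffer starts on a given byte boundary before relying on the zero-copy path. The exact alignment TensorFlow requires is build-dependent (see the issue above), so the 64 below is only an assumed example value:

import numpy as np

x_vals = np.random.rand(500, 1000, 1000)
alignment = 64  # assumed alignment requirement; check the linked issue for your build
print(x_vals.ctypes.data % alignment == 0)  # True if the buffer is aligned to that boundary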
I wanted to post a follow-up after a couple of days of investigation. @Yaroslav pointed me in the right direction, but there's a bit more color to the full answer.
One can limit some of the memory allocation in each session by avoiding feed_dict, for example by using a preloaded variable (if your dataset fits in memory). However, there is also some dynamic allocation done by the optimization computation graph, presumably to store gradients.
I've included code which demonstrates this. Here's a snippet of what the memory use looks like. On the left is the result of repeatedly calling preloaded, while the right shows repeated calls to loaded_each_iteration.
import tensorflow as tf
import numpy as np

A_shape = [100000, 3000]
b_shape = [100000, 1]
x_shape = [3000, 1]

def loaded_each_iteration(A_vals):
    A = tf.placeholder(tf.float32, A_shape)
    b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
    x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
    diff = tf.matmul(A, x) - b
    cost = tf.nn.l2_loss(diff)
    train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
    sess = tf.Session()
    sess.run(x.initializer)
    sess.run(tf.global_variables_initializer())
    while True:
        _, c = sess.run([train_op, cost], feed_dict={A: A_vals})
        print(c)

def preloaded(A_vals):
    A_init = tf.placeholder(tf.float32, A_shape)
    A = tf.Variable(A_init, trainable=False, collections=[], name='A', dtype=tf.float32)
    b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
    x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
    diff = tf.matmul(A, x) - b
    cost = tf.nn.l2_loss(diff)
    train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
    sess = tf.Session()
    sess.run([A.initializer, x.initializer], feed_dict={A_init: A_vals})
    sess.run(tf.global_variables_initializer())
    while True:
        _, c = sess.run([train_op, cost])
        print(c)

if __name__ == "__main__":
    A_vals = np.random.rand(*A_shape)
    while True:
        loaded_each_iteration(A_vals)
I am trying to properly read my own binary data into Tensorflow, based on the Fixed length records section of this tutorial and on the read_cifar10 function here. Mind you, I am new to tensorflow, so my understanding may be off.
My Data
My files are binary with float32 type. The first 32-bit sample is the label, and the remaining 256 samples are the data. I want to reshape the data at the end into a [2, 128] matrix.
My Code So far:
import tensorflow as tf
import os

def read_data(filename_queue):
    item_type = tf.float32
    label_items = 1
    data_items = 256
    label_bytes = label_items * item_type.size
    data_bytes = data_items * item_type.size
    record_bytes = label_bytes + data_bytes
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    key, value = reader.read(filename_queue)
    record_data = tf.decode_raw(value, item_type)
    # labels = tf.cast(tf.strided_slice(record_data, [0], [label_items]), tf.int32)
    label = tf.strided_slice(record_data, [0], [label_items])
    data0 = tf.strided_slice(record_data, [label_items], [label_items + data_items])
    data = tf.reshape(data0, [2, data_items/2])
    return data, label

if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set GPU device
    datafiles = ['train_0000.dat', 'train_0001.dat']
    num_epochs = 2
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        (x, y) = read_data(filename_queue)
        print(y.eval())
This code hangs at print(y.eval()), but I fear I have much bigger issues than that.
Question:
When I execute this, I get a data and a label tensor returned. The problem is that I don't quite understand how to actually read the data from the tensor. For example, I understand the autoencoder example here; however, it has a mnist.train.next_batch(batch_size) function that is called to read the next batch. Do I need to write that for my function, or is it handled by something internal to my read_data() function? If I need to write that function, what does it look like?
Are there any other obvious things I'm missing? My goal in using this method is to reduce I/O overhead and to avoid storing all of the data in memory, since my files are quite large.
Thanks in advance.
Yes. You are pretty much done. At this point you need to:
1) Write your neural network model, which is supposed to take your data and return a label.
2) Write your cost function C which takes the network prediction and the true label and gives you a cost.
3) Choose an optimizer.
4) Put everything together:
opt = tf.train.AdamOptimizer(learning_rate=0.001)
datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    example_batch, label_batch = tf.train.shuffle_batch(
        [data, label], batch_size=128,
        capacity=2000, min_after_dequeue=1000)  # capacity/min_after_dequeue are required; values here are illustrative
    y_pred = model(data)
    loss = C(label, y_pred)
After which you iterate and minimize the loss with:
opt.minimize(loss)
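For concreteness, here is a minimal sketch of what model and C could look like for this [2, 128] data; the layer sizes, variable names, and loss choice are illustrative assumptions, not part of the original answer, and the functions slot into the snippet above:

def model(data):
    # flatten the [2, 128] record and apply a single fully-connected layer
    flat = tf.reshape(data, [-1, 2 * 128])
    w = tf.get_variable("w", shape=[2 * 128, 1], dtype=tf.float32)
    b = tf.get_variable("b", shape=[1], dtype=tf.float32)
    return tf.matmul(flat, w) + b  # raw prediction, shape [batch, 1]

def C(label, y_pred):
    # simple mean squared error between the true label and the prediction
    return tf.reduce_mean(tf.square(y_pred - label))

Note also that when reading from input queues you need to start the queue runners (tf.train.start_queue_runners, typically with a tf.train.Coordinator) before the first sess.run; otherwise the read blocks, which is the likely reason the original snippet hangs at print(y.eval()).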
See also tf.train.string_input_producer behavior in a loop for related information.
I am new to tensorflow. The following code runs successfully, without any error. In the first 10 lines of output the computation is fast, and the output (defined in the last line) flies by line after line. However, as the iterations go on, the computation becomes slower and slower, and finally becomes intolerable. So I wonder whether there are any modifications that can speed this up.
Here is a brief description of this code:
This code applies a single hidden-layer neural network to the dataset. It aims to find the best values of rate[0] and rate[1], which are parameters that affect the loss function. During each training step, one tuple is fed to the model, and the accuracy on that tuple is immediately evaluated (this kind of data arrives as a stream in the real world).
import tensorflow as tf
import numpy as np

n_hidden = 50
n_input = 37
n_output = 2

data_raw = np.genfromtxt(r'data.csv', delimiter=",", dtype=None)
data_info = np.genfromtxt(r'data2.csv', delimiter=",", dtype=None)

def pre_process(tuple):
    ans = []
    temp = [0 for i in range(24)]
    temp[int(tuple[0])] = 1
    # np.append(ans,np.array(temp))
    ans.extend(temp)
    temp = [0 for i in range(7)]
    temp[int(tuple[1]) - 1] = 1
    ans.extend(temp)
    # np.append(ans,np.array(temp))
    temp = [0 for i in range(3)]
    temp[int(tuple[3])] = 1
    ans.extend(temp)
    temp = [0 for i in range(2)]
    temp[int(tuple[4])] = 1
    ans.extend(temp)
    ans.extend([int(tuple[5])])
    return np.array(ans)

x = tf.placeholder(tf.float32, shape=[1, n_input])
y_ = tf.placeholder(tf.float32, shape=[n_output])
y_r = tf.placeholder(tf.float32, shape=[n_output])

W1 = tf.Variable(tf.random_uniform([n_input, n_hidden]))
b1 = tf.Variable(tf.zeros([n_hidden]))
W2 = tf.Variable(tf.zeros([n_hidden, n_output]))
b2 = tf.Variable(tf.zeros([n_output]))

logits_1 = tf.matmul(x, W1) + b1
relu_layer = tf.nn.relu(logits_1)
logits_2 = tf.matmul(relu_layer, W2) + b2

correct_prediction = tf.equal(tf.argmax(logits_2, 1), tf.argmax(y_, 0))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

rate = [0, 0]
for i in range(-100, 200, 10):
    rate[0] = i
    for j in range(-100, i, 10):
        rate[1] = j
        loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits_2) * [rate[0], rate[1]])
        # loss2=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_r, logits=logits_2)*[rate[2],rate[3]])
        # loss=loss1+loss2
        train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        data_line = 1
        accur = 0
        local_local = 0
        remote_remote = 0
        local_remote = 0
        remote_local = 0
        total = 0

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(200):
                # print(int(data_raw[data_line][0]),data_info[i][0])
                if i > 100:
                    total += 1
                if int(data_raw[data_line][0]) == data_info[i][0]:
                    sess.run(train_step, feed_dict={x: pre_process(data_info[i]).reshape(1, -1), y_: [1, 0], y_r: [0, 1]})
                    # print(sess.run(logits_2, {x: pre_process(data_info[i]).reshape(1,-1), y_: [1,0]}))
                    data_line += 1
                    if data_line == len(data_raw):
                        break
                    if i > 100:
                        acc = accuracy.eval(feed_dict={x: pre_process(data_info[i]).reshape(1, -1), y_: [1, 0], y_r: [0, 1]})
                        local_local += acc
                        local_remote += 1 - acc
                        accur += acc
                else:
                    sess.run(train_step, feed_dict={x: pre_process(data_info[i]).reshape(1, -1), y_: [0, 1], y_r: [1, 0]})
                    # print(sess.run(logits_2, {x: pre_process(data_info[i]).reshape(1,-1), y_: [0,1]}))
                    if i > 100:
                        acc = accuracy.eval(feed_dict={x: pre_process(data_info[i]).reshape(1, -1), y_: [0, 1], y_r: [1, 0]})
                        remote_remote += acc
                        remote_local += 1 - acc
                        accur += acc
            print("correctness: (%.3d,%.3d): \t%.2f %.2f %.2f %.2f %.2f" % (rate[0], rate[1], accur/total, local_local/total, local_remote/total, remote_local/total, remote_remote/total))
Though GPhilo's answer addresses why running the code gets slower and slower, in reality that solution will recreate the computation graph again and again, which is not good.
The following two lines of code (as GPhilo also mentioned) keep adding operations to your graph on every iteration.
loss=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits( \
labels=y_, logits=logits_2)*[rate[0],rate[1]])
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
As far as I can see, you have two values, rate[0] and rate[1], that need to be supplied to your graph. Why not supply these two values through placeholders and define your graph only once? Once you start running a Session you shouldn't add more operations to your graph. Also, you shouldn't create a new Session for every iteration.
Check this modified code (only important parts)
# To clear any previously created graph present in memory.
tf.reset_default_graph()

x = tf.placeholder(tf.float32, shape=[1, n_input])
y_ = tf.placeholder(tf.float32, shape=[n_output])
y_r = tf.placeholder(tf.float32, shape=[n_output])

# Add these two placeholders (assuming they are single float values).
rate0 = tf.placeholder(tf.float32, shape=[])
rate1 = tf.placeholder(tf.float32, shape=[])

W1 = tf.Variable(tf.random_uniform([n_input, n_hidden]))
....
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Bring this code outside the loop (note the replacement of rate[0]/rate[1] with placeholders).
loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_,
                                                             logits=logits_2) * [rate0, rate1])
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Instantiate the session only once.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Move the subsequent looping code inside.
    rate = [0, 0]
    for i in range(-100, 200, 10):
        rate[0] = i
After this modification, whenever your Session runs train_step, you need to supply these two extra placeholders in your feed_dict.
Ex:
sess.run(train_step,feed_dict={x:pre_process(data_info[i]).reshape(1,-1),
y_:[1,0],y_r:[0,1], rate0: rate[0], rate1: rate[1]})
In this way, you will not be creating the graph on every iteration, and in fact this code will be faster than GPhilo's solution.
Every time you run train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss) you're adding quite a few operations to your graph, which becomes bigger and bigger with every loop of your program. The bigger the graph, the slower the execution.
Put your model definition in the loop body and call tf.reset_default_graph() each time you start a new iteration:
rate = [0, 0]
for i in range(-100, 200, 10):
    rate[0] = i
    for j in range(-100, i, 10):
        tf.reset_default_graph()
        x = tf.placeholder(tf.float32, shape=[1, n_input])
        y_ = tf.placeholder(tf.float32, shape=[n_output])
        y_r = tf.placeholder(tf.float32, shape=[n_output])
        W1 = tf.Variable(tf.random_uniform([n_input, n_hidden]))
        b1 = tf.Variable(tf.zeros([n_hidden]))
        W2 = tf.Variable(tf.zeros([n_hidden, n_output]))
        b2 = tf.Variable(tf.zeros([n_output]))
        logits_1 = tf.matmul(x, W1) + b1
        relu_layer = tf.nn.relu(logits_1)
        logits_2 = tf.matmul(relu_layer, W2) + b2
        correct_prediction = tf.equal(tf.argmax(logits_2, 1), tf.argmax(y_, 0))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        rate[1] = j
        #...
I'm new to generative networks and I decided to first try it on my own before looking at any existing code. These are the steps I used to train my GAN.
[lib: tensorflow]
1) Train a discriminator on the dataset. (I used a dataset of 2 features with labels of either 'meditating' or 'not meditating'; dataset: https://drive.google.com/open?id=0B5DaSp-aTU-KSmZtVmFoc0hRa3c )
2) Once the discriminator is trained, save it.
3) Make another file for another feed-forward network (or any other, depending on your dataset). This feed-forward network is the generator.
4) Once the generator is constructed, restore the discriminator and define a loss function for the generator such that it learns to fool the discriminator. (This didn't work in tensorflow because sess.run() doesn't return a tf tensor, so the path between G and D breaks, but it should work when done from scratch.)
d_output = sess.run(graph.get_tensor_by_name('ol:0'), feed_dict={graph.get_tensor_by_name('features_placeholder:0'): g_output})
print(d_output)
optimize_for = tf.constant([[0.0]*10]) #not meditating
g_loss = -tf.reduce_mean((d_output - optimize_for)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(g_loss)
Why don't we train a generator like this? It seems so much simpler. It's true I couldn't manage to run this in tensorflow, but it should be possible if I do it from scratch.
Full code:
Discriminator:
import pandas as pd
import tensorflow as tf
from sklearn.utils import shuffle
data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_train.csv")
learning_rate = 0.001
batch_size = 1
n_epochs = 1000
n_examples = 999 # This is highly unsatisfying >:3
n_iteration = int(n_examples/batch_size)
features = tf.placeholder('float', [None, 2], name='features_placeholder')
labels = tf.placeholder('float', [None, 1], name = 'labels_placeholder')
weights = {
'ol': tf.Variable(tf.random_normal([2, 1]), name = 'w_ol')
}
biases = {
'ol': tf.Variable(tf.random_normal([1]), name = 'b_ol')
}
ol = tf.nn.sigmoid(tf.add(tf.matmul(features, weights['ol']), biases['ol']), name = 'ol')
loss = tf.reduce_mean((labels - ol)**2, name = 'loss')
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for epoch in range(n_epochs):
    ptr = 0
    data = shuffle(data)
    data_f = data.drop("lbl", axis = 1)
    data_l = data.drop(["f1", "f2"], axis = 1)
    for iteration in range(n_iteration):
        epoch_x = data_f[ptr: ptr + batch_size]
        epoch_y = data_l[ptr: ptr + batch_size]
        ptr = ptr + batch_size
        _, lss = sess.run([train, loss], feed_dict={features: epoch_x, labels: epoch_y})
    print("Loss # epoch ", epoch, " = ", lss)
print("\nTesting...\n")
data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_eval.csv")
test_data_l = data.drop(["f1", "f2"], axis = 1)
test_data_f = data.drop("lbl", axis = 1)
print(sess.run(ol, feed_dict={features: test_data_f}))
print(test_data_l)
print("Saving model...")
saver = tf.train.Saver()
saver.save(sess, save_path="E:/workspace_py/saved_models/meditation_disciminative_model.ckpt")
sess.close()
Generator:
import tensorflow as tf
# hyper parameters
learning_rate = 0.1
# batch_size = 1
n_epochs = 100
from numpy import random
noise = random.rand(10, 2)
print(noise)
# Model
input_placeholder = tf.placeholder('float', [None, 2])
weights = {
'hl1': tf.Variable(tf.random_normal([2, 3]), name = 'w_hl1'),
'ol': tf.Variable(tf.random_normal([3, 2]), name = 'w_ol')
}
biases = {
'hl1': tf.Variable(tf.zeros([3]), name = 'b_hl1'),
'ol': tf.Variable(tf.zeros([2]), name = 'b_ol')
}
hl1 = tf.add(tf.matmul(input_placeholder, weights['hl1']), biases['hl1'])
ol = tf.add(tf.matmul(hl1, weights['ol']), biases['ol'])
sess = tf.Session()
sess.run(tf.global_variables_initializer())
g_output = sess.run(ol, feed_dict={input_placeholder: noise})
# restoring discriminator
saver = tf.train.import_meta_graph("E:/workspace_py/saved_models/meditation_disciminative_model.ckpt.meta")
saver.restore(sess, tf.train.latest_checkpoint('E:/workspace_py/saved_models/'))
graph = tf.get_default_graph()
d_output = sess.run(graph.get_tensor_by_name('ol:0'), feed_dict={graph.get_tensor_by_name('features_placeholder:0'): g_output})
print(d_output)
optimize_for = tf.constant([[0.0]*10])
g_loss = -tf.reduce_mean((d_output - optimize_for)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(g_loss)
The discriminator's purpose isn't to classify your original data, or really discriminate anything about your original data. Its sole purpose is to discriminate your generator's output from original output.
Think of an example of an art forger. Your dataset is all original paintings. Your generator network G is an art forger, and your discriminator D is a detective whose sole purpose in life is to find forgeries made by G.
D can't learn much just by looking at original paintings. What's really important for him is to figure out what sets G's forgeries apart from everything else. G can't make any money selling forgeries if all his pieces are discovered and marked as such by D, so he must learn how to thwart D.
This creates an environment where G is constantly trying to make his pieces look more "like" original artwork, and D is constantly getting better at finding the nuances to G's forgery style. The better D gets, the better G needs to be in order to make a living. They each get better at their task until they (theoretically) reach some Nash equilibrium defined by the complexity of the networks and the data they're trying to forge.
That's why D needs to be trained back-and-forth with G, because it needs to know and adapt to G's particular nuances (which change over time as G learns and adapts), not just find some average definition of "not forged". By making D hunt G specifically, you force G to become a better forger, and thus end up with a better generator network. If you just train D once, then G can learn some easy, obvious, unimportant way to beat D and never actually produce very good forgeries.
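To make the back-and-forth concrete, here is a rough, self-contained sketch (illustrative names, toy shapes, and a simple loss, not the poster's code) of the usual alternating scheme. G and D live in one graph, so gradients can flow from D's output back into G without any sess.run in between:

import numpy as np
import tensorflow as tf

z_dim, data_dim, batch_size = 2, 2, 32

z_ph = tf.placeholder(tf.float32, [None, z_dim])
real_ph = tf.placeholder(tf.float32, [None, data_dim])

def generator(z):
    with tf.variable_scope("G"):
        w = tf.get_variable("w", [z_dim, data_dim])
        return tf.matmul(z, w)

def discriminator(x, reuse=False):
    with tf.variable_scope("D", reuse=reuse):
        w = tf.get_variable("w", [data_dim, 1])
        return tf.nn.sigmoid(tf.matmul(x, w))

fake = generator(z_ph)
d_real = discriminator(real_ph)
d_fake = discriminator(fake, reuse=True)   # same D variables, applied to G's output

d_loss = -tf.reduce_mean(tf.log(d_real + 1e-8) + tf.log(1.0 - d_fake + 1e-8))
g_loss = -tf.reduce_mean(tf.log(d_fake + 1e-8))  # G is rewarded for fooling D

d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="D")
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="G")
d_train_op = tf.train.GradientDescentOptimizer(0.01).minimize(d_loss, var_list=d_vars)
g_train_op = tf.train.GradientDescentOptimizer(0.01).minimize(g_loss, var_list=g_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        real_batch = np.random.rand(batch_size, data_dim)  # stand-in for real data
        z = np.random.rand(batch_size, z_dim)
        sess.run(d_train_op, feed_dict={real_ph: real_batch, z_ph: z})  # update D
        sess.run(g_train_op, feed_dict={z_ph: z})                       # update G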
I'm trying to learn tensorflow by coding up some simple problems: I was trying to find the value of pi using a direct sampling Monte Carlo method.
The run time is much longer than I thought it would be when using a for loop to do this. I've seen other posts about similar things and I've tried to follow the solutions, but I think I still must be doing something wrong.
Attached below is my code:
import tensorflow as tf
import numpy as np
import time

n_trials = 50000

tf.reset_default_graph()
x = tf.random_uniform(shape=(), name='x')
y = tf.random_uniform(shape=(), name='y')
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')

# perform the monte carlo step
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit_op = hit.assign_add(is_inside)

with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)

    # Make sure no new nodes are added to the graph
    sess.graph.finalize()

    start = time.time()

    # Run monte carlo trials -- This is very slow
    for _ in range(n_trials):
        sess.run(hit_op)

    hits = hit.eval()
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
>>> Pi is 3.15208
>>> Tensorflow operation took 8.98 s
In comparison, a for-loop-style solution in numpy is an order of magnitude faster:
start = time.time()
hits = [ 1 if np.sqrt(np.sum(np.square(np.random.uniform(size=2)))) < 1 else 0 for _ in range(n_trials) ]
a = 0
for hit in hits:
    a += hit
print("numpy operation took {:.2f} s".format((time.time()-start)))
print("Pi is {}".format(4*a/n_trials))
>>> Pi is 3.14032
>>> numpy operation took 0.75 s
Attached below is a plot of the difference in overall execution times for various numbers of trials.
Please note: my question is not about "how to perform this task the fastest", I recognize there are much more effective ways of calculating Pi. I've only used this as a benchmarking tool to check the performance of tensorflow against something I'm familiar with (numpy).
The slowness has to do with the communication overhead between Python and Tensorflow in sess.run, which is executed multiple times inside your loop. I would suggest using tf.while_loop to execute the computations within Tensorflow. That would be a better comparison against numpy.
import tensorflow as tf
import numpy as np
import time

n_trials = 50000

tf.reset_default_graph()
hit = tf.Variable(0, name='hit')

def body(ctr):
    x = tf.random_uniform(shape=[2], name='x')
    r = tf.sqrt(tf.reduce_sum(tf.square(x)))
    is_inside = tf.cond(tf.less(r, 1), lambda: tf.constant(1), lambda: tf.constant(0))
    hit_op = hit.assign_add(is_inside)
    with tf.control_dependencies([hit_op]):
        return ctr + 1

def condition(ctr):
    return ctr < n_trials

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = tf.while_loop(condition, body, [tf.constant(0)])

    start = time.time()
    sess.run(result)
    hits = hit.eval()
    print("Pi is {}".format(4.*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
Simple: session.run has a lot of overhead, and it is not designed to be used that way. Normally, with e.g. a neural net, you would call a single session.run for a dozen multiplications of big matrices, and then the ~0.2 ms it takes would not matter at all.
As for your case, you probably wanted something like the code below. It runs 5 times faster than the numpy version on my machine.
By the way, you are doing exactly the same thing in numpy. If you used a loop to reduce instead of np.sum it would be much slower.
import tensorflow as tf
import numpy as np
import time

n_trials = 50000

tf.reset_default_graph()
x = tf.random_uniform(shape=(n_trials,), name='x')
y = tf.random_uniform(shape=(n_trials,), name='y')  # y must also be sampled per trial
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')

# perform the monte carlo step
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit2 = tf.reduce_sum(is_inside)
#hit_op = hit.assign_add(is_inside)

with tf.Session() as sess:
    # init_op = tf.global_variables_initializer()
    sess.run(tf.initialize_all_variables())

    # Make sure no new nodes are added to the graph
    sess.graph.finalize()

    start = time.time()

    # Run all monte carlo trials in a single vectorized call
    #for _ in range(n_trials):
    sess.run(hit2)
    hits = hit2.eval()
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
In trying to learn a bit about Tensorflow, I had been building a Variational Auto Encoder, which is working; however, I noticed that, after training, I was getting different results from the decoders, which share the same variables.
I created two decoders, because the first I train against my dataset, the second I want to eventually feed a new Z encoding in order to produce new values.
My check is that I should be able to send the Z values generated by the encoding process to both decoders and get equal results.
I have 2 Decoders (D, D_new). D_new shares the variable scope from D.
Before training, I can send values into the Encoder (E) to generate output values as well as the Z values it generated (Z_gen).
If I use Z_gen as input to D_new before training, then its output is identical to the output of D, which is expected.
After a few iterations of training, however, the output of D compared with D_new begins to diverge (although they are quite similar).
I have pared this down to a simpler version of my code which still reproduces the error. I'm wondering if others have found this to be the case and where I might be able to correct for it.
The below code can be run in a jupyter notebook. I'm using Tensorflow r0.11 and Python 3.5.0
import numpy as np
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import os
import pylab as pl

mgc = get_ipython().magic
mgc(u'matplotlib inline')
pl.rcParams['figure.figsize'] = (8.0, 5.0)

##-- Helper function just for visualizing the data
def plot_values(values, file=None):
    t = np.linspace(1.0, len(values[0]), len(values[0]))
    for i in range(len(values)):
        plt.plot(t, values[i])
    if file is None:
        plt.show()
    else:
        plt.savefig(file)
        plt.close()

def encoder(input, n_hidden, n_z):
    with tf.variable_scope("ENCODER"):
        with tf.name_scope("Hidden"):
            n_layer_inputs = input.get_shape()[1].value
            n_layer_outputs = n_hidden
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="E_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                a = tf.tanh(tf.matmul(input, w))
            prevLayer = a
        with tf.name_scope("Z"):
            n_layer_inputs = prevLayer.get_shape()[1].value
            n_layer_outputs = n_z
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="E_Z", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                Z_gen = tf.matmul(prevLayer, w)
    return Z_gen

def decoder(input, n_hidden, n_outputs, reuse=False):
    with tf.variable_scope("DECODER", reuse=reuse):
        with tf.name_scope("Hidden"):
            n_layer_inputs = input.get_shape()[1].value
            n_layer_outputs = n_hidden
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="D_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                a = tf.tanh(tf.matmul(input, w))
            prevLayer = a
        with tf.name_scope("OUTPUT"):
            n_layer_inputs = prevLayer.get_shape()[1].value
            n_layer_outputs = n_outputs
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="D_Output", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                out = tf.sigmoid(tf.matmul(prevLayer, w))
    return out
Here is where the Tensorflow graph is setup:
batch_size = 3
n_inputs = 100
n_hidden_nodes = 12
n_z = 2

with tf.variable_scope("INPUT_VARS"):
    with tf.name_scope("X"):
        X = tf.placeholder(tf.float32, shape=(None, n_inputs))
    with tf.name_scope("Z"):
        Z = tf.placeholder(tf.float32, shape=(None, n_z))

Z_gen = encoder(X, n_hidden_nodes, n_z)
D = decoder(Z_gen, n_hidden_nodes, n_inputs)
D_new = decoder(Z, n_hidden_nodes, n_inputs, reuse=True)

with tf.name_scope("COST"):
    loss = -tf.reduce_mean(X * tf.log(1e-6 + D) + (1-X) * tf.log(1e-6 + 1 - D))
    train_step = tf.train.AdamOptimizer(0.001, beta1=0.5).minimize(loss)
I'm generating a training set of 3 samples of normal-distribution noise with 100 data points each, and then sorting it to make it easier to visualize:
train_data = (np.random.normal(0,1,(batch_size,n_inputs)) + 3) / 6.0
train_data.sort()
plot_values(train_data)
Start up the session:
sess = tf.InteractiveSession()
sess.run(tf.group(tf.initialize_all_variables(), tf.initialize_local_variables()))
Let's just look at what the network initially generates before training...
resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
Pulling the generated Z values and feeding them to D_new, which reuses the variables from D:
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
Just for sanity I'll plot the difference between the two to be sure they're the same...
Now run 1000 training epochs and plot the result...
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})
plot_values(resultA)
Now let's feed those same Z values to D_new and plot those results...
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
They look pretty similar. But (I think) they should be exactly the same. Let's look at the difference...
plot_values(resultA - resultB)
You can see there is some variation now. This becomes much more dramatic with a larger network on more complex data, but still shows up in this simple example.
Any clues as to what's going on?
There are some methods (I don't know which ones specifically) that can be supplied with a seed value. Besides those, I'm not even sure the training process is completely deterministic, especially when a GPU is involved, simply by the nature of parallelization.
See this question.
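For example, randomness can be pinned down to some extent with graph-level and op-level seeds (a sketch; this reduces, but may not eliminate, run-to-run variation, especially on GPU):

np.random.seed(1234)                     # seeds the numpy-generated training data
tf.set_random_seed(1234)                 # graph-level seed for TensorFlow ops
w = tf.random_uniform([3, 3], seed=42)   # op-level seed on an individual op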
While I don't have a full explanation for the reason why, I was able to resolve my issue by changing:
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})

plot_values(resultA)

resultB = sess.run(D_new, feed_dict={Z: Z_vals})
plot_values(resultB)

plot_values(resultA - resultB)
to...
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})

resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X: train_data})
plot_values(resultA)

resultB = sess.run(D_new, feed_dict={Z: Z_vals})
plot_values(resultB)

plot_values(resultA - resultB)
Note that I simply ran and extracted resultA and Z_vals one last time, without the train_step.
The reason I was still seeing problems in my more complex setup was that I had bias variables (even though they were set to 0.0) that were being generated with...
b = tf.Variable(tf.constant(self.bias_k, shape=[n_layer_outputs], dtype=tf.float32))
And that is not handled by reuse in a tf.variable_scope, because tf.Variable always creates a new variable regardless of the reuse flag (only tf.get_variable participates in variable sharing). So there were variables that were technically not being reused. Why they presented such a problem when set to 0.0, I'm not sure.
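A sketch of the kind of change that lets the bias participate in sharing (assuming it is created inside the same variable_scope that is reused; the name "D_bias" is illustrative):

# created with tf.Variable: a brand-new variable on every call, regardless of reuse
b = tf.Variable(tf.constant(0.0, shape=[n_layer_outputs], dtype=tf.float32))

# created with tf.get_variable: shared when the enclosing variable_scope has reuse=True
b = tf.get_variable(name="D_bias", shape=[n_layer_outputs], dtype=tf.float32,
                    initializer=tf.constant_initializer(0.0))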