I am trying to implement a slightly tweaked version of the Batch Normalization operation, in which I need to keep the moving-average values (mean and variance) explicitly. In order to do that, I am experimenting with the assignment and control-dependency mechanisms in TensorFlow, and I have run into a mysterious problem. I have the following toy code, in which I test whether tf.control_dependencies works as intended:
dataset = MnistDataSet(validation_sample_count=10000,
                       load_validation_from="validation_indices")
samples, labels, indices_list, one_hot_labels = \
    dataset.get_next_batch(batch_size=GlobalConstants.BATCH_SIZE)
samples = np.expand_dims(samples, axis=3)
flat_data = tf.contrib.layers.flatten(GlobalConstants.TRAIN_DATA_TENSOR)
mean = tf.Variable(name="mean", initial_value=tf.constant(100.0, shape=[784], dtype=tf.float32),
                   trainable=False, dtype=tf.float32)
a = tf.Variable(name="a", initial_value=5.0, trainable=False)
b = tf.Variable(name="b", initial_value=4.0, trainable=False)
c = tf.Variable(name="c", initial_value=0.0, trainable=False)
batch_mean, batch_var = tf.nn.moments(flat_data, [0])
b_op = tf.assign(b, a)
mean_op = tf.assign(mean, batch_mean)
with tf.control_dependencies([b_op, mean_op]):
    c = a + b
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
results = sess.run([c, mean], feed_dict={GlobalConstants.TRAIN_DATA_TENSOR: samples})
I simply load a data batch in which each entry has 784 dimensions, calculate its moments, and try to store batch_mean into the variable mean. I also trivially store the value of a into b.
In the last line, when I run the graph for the values of c and mean, I see c as 10, which is the expected value. But mean is still a vector of 100's and does not contain the batch mean. It is as if mean_op = tf.assign(mean, batch_mean) has never been executed.
What can be the reason for this? As far as I know, all operations passed to tf.control_dependencies must be executed before any operation defined in the following context; I explicitly fetch c here, which is defined in that context. Am I missing something?
This is a known "feature" of tf.Session.run(). The fetched tensors c and mean are independent of each other, so mean may be evaluated before c (and c is what triggers the assignment that updates mean).
Here's a shorter version of this effect:
a = tf.Variable(name="a", initial_value=1.0, trainable=False)
b = tf.Variable(name="b", initial_value=0.0, trainable=False)
dependent_op = tf.assign(b, a * 3)
with tf.control_dependencies([dependent_op]):
    c = a + 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([c, b]))
    print(sess.run([b]))
The second evaluation of b is guaranteed to return [3.0]. But the first run may return either [2.0, 3.0] or [2.0, 0.0].
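If you need the post-assignment value in the same call, one workaround (a minimal sketch, not from the original answer) is to return the variable through the control-dependency block, so the read is forced to happen after the assignment:
import tensorflow as tf

a = tf.Variable(name="a", initial_value=1.0, trainable=False)
b = tf.Variable(name="b", initial_value=0.0, trainable=False)
dependent_op = tf.assign(b, a * 3)

with tf.control_dependencies([dependent_op]):
    c = a + 1
    b_after = tf.identity(b)  # this read happens only after the assign has run

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([c, b_after]))  # always [2.0, 3.0]
Alternatively, simply fetch b in a second sess.run after the one that fetches c.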
Related
I want to implement a function like this: if x == k, f(x) = 1, else f(x) = 0 (k is a parameter). So I used tf.equal and tf.cast, and my code was like this:
import tensorflow as tf
a = range(12)
a = tf.Variable(a)
b = 6
b = tf.Variable(b)
a = tf.reshape(a, [3, 4])
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
c = tf.equal(a, b)
d = tf.cast(c, tf.int32)
print(sess.run(c))
print(sess.run(d))
It seems fine, but the problem is that tf.gradients(d, a) and tf.gradients(d, b) are None. I tried tf.gradients(c, a) and got a TypeError. Is there any decent way to implement this function?
I'm not sure the gradient is even defined here.
The indicator function is f(a,b) = 1 if a=b, 0 otherwise. Away from a=b, this function is constant (zero) and so has zero derivative. At any point where a=b the function is discontinuous, so it doesn't have a derivative there.
More intuitively: derivatives don't exist where you have a 'jump' in your function.
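To see this in code, here is a hedged sketch (not from this answer) of the usual workaround: replace the hard indicator with a smooth surrogate built from plain tf ops, for example a narrow Gaussian bump, which is close to 1 where a equals b and has a well-defined, non-None gradient everywhere. The value of sigma below is an arbitrary choice.
import tensorflow as tf

a = tf.cast(tf.reshape(tf.range(12), [3, 4]), tf.float32)
b = tf.constant(6.0)
sigma = 0.1  # smaller sigma -> closer to the hard indicator, but steeper gradients

soft_indicator = tf.exp(-tf.square(a - b) / (2.0 * sigma ** 2))
grad_a = tf.gradients(soft_indicator, a)  # defined everywhere, not None

with tf.Session() as sess:
    print(sess.run([soft_indicator, grad_a]))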
It is possible to use the PDF of a normal distribution to approximate the indicator function. I am also new to TensorFlow, so feel free to point out any issues.
## I am using TensorFlow 2 in v1 compatibility mode
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import tensorflow_probability as tfp
a = tf.range(12)
a = tf.Variable(a)
b = 6
b = tf.Variable(b)
a = tf.reshape(a, [3, 4])
## Define the PDF of a normal distribution to approximate the indicator function
dist = tfp.distributions.Normal(0., 0.1)
scalar = dist.prob(0.)  # a normalization constant,
# since the PDF at zero is not one
## Implement the approximated indicator function
a = tf.cast(a, dtype= tf.float32)
b = tf.cast(b, dtype= tf.float32)
c = dist.prob(a-b)/scalar
#d = tf.cast(c, tf.int32)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(c))
## calculate the gradient
c_a = tf.gradients(c, a)
print(sess.run(c_a))
I am new to TensorFlow and I am not able to understand the difference between a variable and a constant. I get the idea that we use variables for equations and constants for direct values, but why does only code #1 work, and not code #2 and #3? Please also explain in which cases we have to run our graph first (a) and then our variable (b), i.e.
(a) session.run(model)
(b) print(session.run(y))
and in which cases I can directly execute this command, i.e.
print(session.run(y))
Code #1 :
x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
Code #2 :
x = tf.Variable(35, name='x')
y = tf.Variable(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
Code #3 :
x = tf.constant(35, name='x')
y = tf.constant(x + 5, name='y')
model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))
In TensorFlow, the difference between constants and variables is that when you declare a constant, its value can't be changed in the future (also, the initialization must be with a value, not with an operation).
In contrast, when you declare a Variable, you can change its value later with the tf.assign() method (and the initialization can be done with a value or an operation).
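A small sketch of that difference (not from the original answer; the names v and k are illustrative):
import tensorflow as tf

v = tf.Variable(1.0, name="v")
k = tf.constant(1.0, name="k")

update_v = tf.assign(v, 2.0)  # valid: a Variable has assignable state
# tf.assign(k, 2.0)           # would fail: a constant cannot be reassigned

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update_v))  # 2.0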
The function tf.global_variables_initializer() initialises every variable in the graph with its declared initial value, but it does not guarantee any initialization order, so it doesn't work properly when dependencies exist between variables.
Your first code (#1) works properly because there are no dependencies between variable initializations and the constant is constructed with a value.
The second code (#2) doesn't work because of the unordered initialization performed by tf.global_variables_initializer(). You can fix it using tf.variables_initializer() as follows:
x = tf.Variable(35, name='x')
model_x = tf.variables_initializer([x])
y = tf.Variable(x + 5, name='y')
model_y = tf.variables_initializer([y])
with tf.Session() as session:
    session.run(model_x)
    session.run(model_y)
    print(session.run(y))
The third code (#3) doesn't work because you are trying to initialize a constant with an operation, which isn't possible. To solve it, an appropriate strategy is the one used in (#1).
Regarding your last question: you need to run (a) session.run(model) whenever there are variables in your computation graph, before (b) print(session.run(y)).
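Another possible fix for code #2 (a sketch using the TF 1.x API, not part of the original answer) is to make y's initial value depend explicitly on x's initial value, so a single global initializer is enough:
import tensorflow as tf

x = tf.Variable(35, name='x')
y = tf.Variable(x.initialized_value() + 5, name='y')  # depends on x's *initial* value

model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    print(session.run(y))  # 40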
I will point out the difference when using eager execution.
As of TensorFlow 2.0.0-beta1, Variables and constants trigger different behaviours when using tf.GradientTape. Strangely, the official documentation is not explicit enough about it.
Let's look at the example code in https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/GradientTape
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y = x * x
    z = y * y
dz_dx = g.gradient(z, x) # 108.0 (4*x^3 at x = 3)
dy_dx = g.gradient(y, x) # 6.0
del g # Drop the reference to the tape
You had to watch x, which is a constant: GradientTape does NOT automatically watch constants in the context. To get gradients with respect to several constants, you can call watch on each of them; the example below uses a separate, nested GradientTape per constant. For example,
x = tf.constant(3.0)
x2 = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    with tf.GradientTape(persistent=True) as g2:
        g2.watch(x2)
        y = x * x
        y2 = y * x2
dy_dx = g.gradient(y, x) # 6
dy2_dx2 = g2.gradient(y2, x2) # 9
del g, g2 # Drop the reference to the tape
On the other hand, Variables are automatically watched by GradientTape.
By default GradientTape will automatically watch any trainable variables that are accessed inside the context. Source: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/GradientTape
So the above will look like,
x = tf.Variable(3.0)
x2 = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as g:
    y = x * x
    y2 = y * x2
dy_dx = g.gradient(y, x) # 6
dy2_dx2 = g.gradient(y2, x2) # 9
del g # Drop the reference to the tape
print(dy_dx)
print(dy2_dx2)
Of course, you can turn off the automatic watching by passing watch_accessed_variables=False. The examples may not be very practical, but I hope this clears up someone's confusion.
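As a minimal sketch of that flag (TF 2.x; not from the original answer): with automatic watching disabled, even a Variable must be watched explicitly.
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(watch_accessed_variables=False) as g:
    g.watch(x)  # without this explicit watch, the gradient below would be None
    y = x * x

print(g.gradient(y, x))  # tf.Tensor(6.0, shape=(), dtype=float32)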
Another way to look at the differences is:
tf.constant: fixed values, and hence not trainable.
tf.Variable: tensors (arrays) that are initialized in a session and are trainable (by trainable I mean they can be optimized and can change over time).
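A small sketch of that distinction (TF 2.x eager style; the names are illustrative, not from the original answer): the optimizer can update the Variable, while the constant is a fixed value by construction.
import tensorflow as tf

w = tf.Variable(2.0)   # trainable by default
k = tf.constant(2.0)   # a fixed value baked into the computation

with tf.GradientTape() as g:
    loss = (w * k - 1.0) ** 2

grads = g.gradient(loss, [w])                     # gradient only w.r.t. the Variable
tf.optimizers.SGD(0.1).apply_gradients(zip(grads, [w]))
print(w.numpy())  # w has changed; k cannot change by definition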
I want to have basically two different graphs based on whether I am training my network or actually running it. Basically, one piece uses some unsupervised techniques to learn the values of a given matrix and then I want to use the exact same matrix in a different graph.
I know how to get the value of the matrix using matrix_value = sess.run(my_matrix, {input: input_data}), but is there a way to initialize a tf.Variable with a set value?
You can try something like this:
import numpy as np
import tensorflow as tf
value = [0, 1, 2, 3, 4, 5, 6, 7]
init = tf.constant_initializer(value)
with tf.Session():
    x = tf.get_variable('x', shape=[2, 4], initializer=init)
    x.initializer.run()
    print(x.eval())
I hope this helps!
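Another option that addresses the question directly (a sketch; matrix_value and its shape stand in for whatever sess.run returned in your graph): pass the fetched numpy array straight in as the Variable's initial value.
import numpy as np
import tensorflow as tf

# e.g. fetched earlier with: matrix_value = sess.run(my_matrix, {input: input_data})
matrix_value = np.arange(8, dtype=np.float32).reshape(2, 4)

x = tf.Variable(initial_value=matrix_value, name='x_from_learned_matrix')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(x))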
You don't have to create two identical graphs; you can use the same graph but run different nodes.
Let me explain what I mean. Let's look at this example:
# placeholders for the inputs and the targets (assumed here; not shown in the original snippet)
x_ph = tf.placeholder(tf.float32, shape=[None, 1])
y_ph = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.truncated_normal([1, 1], stddev=0.1))
# Declare bias variable initialized to a constant 0.1
b = tf.Variable(tf.constant(0.1, shape=[1]))
y_pred = x_ph * W + b
# loss function: mean squared error divided by 2
loss = tf.reduce_mean(tf.square(y_pred - y_ph)) / 2.
The first part, the training, uses the train_op:
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
This op will run a gradient descent step based on the loss op which will induce an update of the variables W and b.
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # initialize all the variables by running the initializer operator
    sess.run(init)
    for epoch in range(num_epoch):
        # Run sequentially the train_op and loss operators with
        # x_ph and y_ph placeholders fed by the arrays x and y
        _, loss_val = sess.run([train_op, loss], feed_dict={x_ph: x, y_ph: y})
        print('epoch %d: loss is %.4f' % (epoch, loss_val))
But now, if you just want to run the model, you can run the y_pred op alone. It will pick up the current values of W and b, and they won't be modified because you didn't call the train_op.
with tf.Session() as sess:
    # see what the model does on the test set
    # by evaluating the y_pred operator using the x_test data
    test_val = sess.run(y_pred, feed_dict={x_ph: x_test})
When you ask TensorFlow to run an op y_pred with new data x_test fed into x_ph, it will only compute y_pred = x_ph * W + b (taking W and b as constant) without modifying anything else.
Also, it is worth mentioning that when you're done with training you have the ability to override some variable values (e.g. if a variable's learnt value is very close to 1, you could just set it to 1 directly), as per the TensorFlow documentation:
We could improve this manually by reassigning the values of W and b to the perfect values of -1 and 1. A variable is initialized to the value provided to tf.Variable but can be changed using operations like tf.assign. For example, W=-1 and b=1 are the optimal parameters for our model. We can change W and b accordingly:
fixW = tf.assign(W, [-1., 1.])
fixb = tf.assign(b, [1.])
sess.run([fixW, fixb])
print(sess.run(loss, {x_ph: x_test, y_ph: y_test}))  # the loss also needs the targets fed into y_ph
Suppose I have a trained RNN (e.g. a language model) and I want to see what it would generate on its own; how should I feed its output back into its input?
I read the following related questions:
TensorFlow using LSTMs for generating text
TensorFlow LSTM Generative Model
Theoretically it is clear to me that in TensorFlow we use truncated backpropagation, so we have to define the maximum number of steps we would like to "trace". We also reserve a dimension for batches, therefore if I'd like to train on a sine wave, I have to feed inputs of shape [None, num_steps, 1].
The following code works:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

tf.reset_default_graph()
n_samples = 100
state_size = 5
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)
def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]
X = tf.placeholder_with_default(zero_x, [None, n_samples, 1])
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Y = np.roll(def_x, 1)
loss = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
opt = tf.train.AdamOptimizer().minimize(loss)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Initial state run
plt.show(plt.plot(output.eval()[0]))
plt.plot(def_x.squeeze())
plt.show(plt.plot(pred.eval().squeeze()))
steps = 1001
for i in range(steps):
    p, l, _ = sess.run([pred, loss, opt])
The state size of the LSTM can be varied; I also experimented with feeding the sine wave into the network and with feeding zeros, and in both cases it converged in ~500 iterations. So far I have understood that in this case the graph consists of n_samples LSTM cells sharing their parameters, and it is only up to me to feed input to them as a time series. However, when generating samples, the network explicitly depends on its previous output, meaning that I cannot feed the unrolled model all at once. I tried to compute the state and output at every step:
with tf.variable_scope('sine', reuse=True):
    X_test = tf.placeholder(tf.float64)
    X_reshaped = tf.reshape(X_test, [1, -1, 1])
    output, last_states = tf.nn.dynamic_rnn(lstm_cell, X_reshaped, dtype=tf.float64)
    pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)

test_vals = [0.]
for i in range(1000):
    val = pred.eval({X_test: np.array(test_vals)[None, :, None]})
    test_vals.append(val)
However in this model it seems that there is no continuity between the LSTM cells. What is going on here?
Do I have to initialize a zero array with, say, 100 time steps and assign each run's result into the array? As in feeding the network like this:
run 0: input_feed = [0, 0, 0 ... 0]; res1 = result
run 1: input_feed = [res1, 0, 0 ... 0]; res2 = result
run 2: input_feed = [res1, res2, 0 ... 0]; res3 = result
etc...
What should I do if I want this trained network to use its own output as its input in the following time step?
If I understood you correctly, you want to find a way to feed the output of time step t as input to time step t+1, right? To do so, there is a relatively easy workaround that you can use at test time:
Make sure your input placeholders can accept a dynamic sequence length, i.e. the size of the time dimension is None.
Make sure you are using tf.nn.dynamic_rnn (which you do in the posted example).
Pass the initial state into dynamic_rnn.
Then, at test time, you can loop through your sequence and feed each time step individually (i.e. max sequence length is 1). Additionally, you just have to carry over the internal state of the RNN. See pseudo code below (the variable names refer to your code snippet).
I.e., change the definition of the model to something like this:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)
def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]
X = tf.placeholder_with_default(zero_x, [None, None, 1])  # [batch_size, seq_length, input dimension]
batch_size = tf.shape(X)[0]
initial_state = lstm_cell.zero_state(batch_size, dtype=tf.float64)
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64,
                                        initial_state=initial_state)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Then you can perform inference like so:
fetches = {'final_state': last_states,
           'prediction': pred}

toy_initial_input = np.array([[[1.]]])  # put suitable data here
seq_length = 20  # put whatever is reasonable here for you

# get the output for the first time step
feed_dict = {X: toy_initial_input}
eval_out = sess.run(fetches, feed_dict)
outputs = [eval_out['prediction']]
next_state = eval_out['final_state']

for i in range(1, seq_length):
    feed_dict = {X: outputs[-1],
                 initial_state: next_state}
    eval_out = sess.run(fetches, feed_dict)
    outputs.append(eval_out['prediction'])
    next_state = eval_out['final_state']

# outputs now contains the sequence you want
# outputs now contains the sequence you want
Note that this can also work for batches; however, it can be a bit more complicated if you have sequences of different lengths in the same batch.
If you want to perform this kind of prediction not only at test time but also at training time, it is also possible, but a bit more complicated to implement.
You can use its own output (last state) as the next-step input (initial state).
One way to do this is to:
use zero-initialized variables as the input state at every time step
each time you complete a truncated sequence and get some output state, update the state variables with the output state you just got.
The second can be done by either:
fetching the states to python and feeding them back next time, as done in the ptb example in tensorflow/models
building an update op in the graph and adding a dependency, as done in the ptb example in tensorpack (a minimal sketch of this approach follows below).
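Here is a rough sketch of that second option (assuming TF 1.x and a BasicLSTMCell; all names are illustrative and not taken from either ptb example): non-trainable variables hold the LSTM state, and an in-graph assign copies the final state back into them after each run.
import tensorflow as tf

state_size = 5
batch_size = 1

cell = tf.nn.rnn_cell.BasicLSTMCell(state_size)
X = tf.placeholder(tf.float32, [batch_size, None, 1])

# non-trainable variables that carry the state across session.run calls
c_var = tf.Variable(tf.zeros([batch_size, state_size]), trainable=False)
h_var = tf.Variable(tf.zeros([batch_size, state_size]), trainable=False)
init_state = tf.nn.rnn_cell.LSTMStateTuple(c_var, h_var)

output, last_state = tf.nn.dynamic_rnn(cell, X, initial_state=init_state)

# copy the final state back into the variables so that the next run
# continues exactly where this one stopped
update_state = tf.group(tf.assign(c_var, last_state.c),
                        tf.assign(h_var, last_state.h))
with tf.control_dependencies([update_state]):
    output = tf.identity(output)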
I know I'm a bit late to the party but I think this gist could be useful:
https://gist.github.com/CharlieCodex/f494b27698157ec9a802bc231d8dcf31
It lets you automatically feed the network's output through a processing filter and back into the network as its next input. To make the shapes match up, the processing can be a tf.layers.Dense layer.
Please ask any questions!
Edit:
In your particular case, create a lambda which performs the processing of the dynamic_rnn outputs into your character vector space. Ex:
# if you have:
W = tf.Variable( ... )
B = tf.Variable( ... )
Yo, Ho = tf.nn.dynamic_rnn( cell , inputs , state )
logits = tf.matmul(W, Yo) + B
...
# use self_feeding_rnn as
process_yo = lambda Yo: tf.matmul(W, Yo) + B
Yo, Ho = self_feeding_rnn( cell, seed, initial_state, processing=process_yo)
While trying to learn a bit about TensorFlow, I have been building a Variational Auto-Encoder, which is working; however, I noticed that, after training, I was getting different results from the two decoders that share the same variables.
I created two decoders because I train the first against my dataset, while I eventually want to feed the second a new Z encoding in order to produce new values.
My check is that I should be able to send the Z values generated by the encoding process to both decoders and get equal results.
I have 2 Decoders (D, D_new). D_new shares the variable scope from D.
Before training, I can send values into the Encoder (E) to generate output values as well as the Z values it generates (Z_gen).
If I use Z_gen as input to D_new before training, then its output is identical to the output of D, which is expected.
After a few iterations of training, however, the output of D compared with D_new begins to diverge (although they are quite similar).
I have pared this down to a simpler version of my code which still reproduces the issue. I'm wondering if others have found this to be the case and where I might be able to correct for it.
The below code can be run in a jupyter notebook. I'm using Tensorflow r0.11 and Python 3.5.0
import numpy as np
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import os
import pylab as pl
mgc = get_ipython().magic
mgc(u'matplotlib inline')
pl.rcParams['figure.figsize'] = (8.0, 5.0)
##-- Helper function Just for visualizing the data
def plot_values(values, file=None):
    t = np.linspace(1.0, len(values[0]), len(values[0]))
    for i in range(len(values)):
        plt.plot(t, values[i])
    if file is None:
        plt.show()
    else:
        plt.savefig(file)
        plt.close()
def encoder(input, n_hidden, n_z):
    with tf.variable_scope("ENCODER"):
        with tf.name_scope("Hidden"):
            n_layer_inputs = input.get_shape()[1].value
            n_layer_outputs = n_hidden
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="E_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                a = tf.tanh(tf.matmul(input, w))
            prevLayer = a
        with tf.name_scope("Z"):
            n_layer_inputs = prevLayer.get_shape()[1].value
            n_layer_outputs = n_z
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="E_Z", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                Z_gen = tf.matmul(prevLayer, w)
    return Z_gen
def decoder(input, n_hidden, n_outputs, reuse=False):
    with tf.variable_scope("DECODER", reuse=reuse):
        with tf.name_scope("Hidden"):
            n_layer_inputs = input.get_shape()[1].value
            n_layer_outputs = n_hidden
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="D_Hidden", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                a = tf.tanh(tf.matmul(input, w))
            prevLayer = a
        with tf.name_scope("OUTPUT"):
            n_layer_inputs = prevLayer.get_shape()[1].value
            n_layer_outputs = n_outputs
            with tf.name_scope("Weights"):
                w = tf.get_variable(name="D_Output", shape=[n_layer_inputs, n_layer_outputs], dtype=tf.float32)
            with tf.name_scope("Activation"):
                out = tf.sigmoid(tf.matmul(prevLayer, w))
    return out
Here is where the Tensorflow graph is setup:
batch_size = 3
n_inputs = 100
n_hidden_nodes = 12
n_z = 2
with tf.variable_scope("INPUT_VARS"):
with tf.name_scope("X"):
X = tf.placeholder(tf.float32, shape=(None, n_inputs))
with tf.name_scope("Z"):
Z = tf.placeholder(tf.float32, shape=(None, n_z))
Z_gen = encoder(X,n_hidden_nodes,n_z)
D = decoder(Z_gen, n_hidden_nodes, n_inputs)
D_new = decoder(Z, n_hidden_nodes, n_inputs, reuse=True)
with tf.name_scope("COST"):
loss = -tf.reduce_mean(X * tf.log(1e-6 + D) + (1-X) * tf.log(1e-6 + 1 - D))
train_step = tf.train.AdamOptimizer(0.001, beta1=0.5).minimize(loss)
I'm generating a training set of 3 samples of normally distributed noise with 100 data points each, and then sorting it to visualize it more easily:
train_data = (np.random.normal(0,1,(batch_size,n_inputs)) + 3) / 6.0
train_data.sort()
plot_values(train_data)
Start up the session:
sess = tf.InteractiveSession()
sess.run(tf.group(tf.initialize_all_variables(), tf.initialize_local_variables()))
Let's just look at what the network initially generates before training...
resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X:train_data})
plot_values(resultA)
Pulling the Z generated values and feeding them to D_new which is reusing the variables from D:
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
Just for sanity I'll plot the difference between the two to be sure they're the same...
Now run 1000 training epochs and plot the result...
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})

plot_values(resultA)
Now let's feed those same Z values to D_new and plot those results...
resultB = sess.run(D_new, feed_dict={Z:Z_vals})
plot_values(resultB)
They look pretty similar. But (I think) they should be exactly the same. Let's look at the difference...
plot_values(resultA - resultB)
You can see there is some variation now. This becomes much more dramatic with a larger network on more complex data, but still shows up in this simple example.
Any clues as to what's going on?
There are some methods (I don't know which ones specifically) that can be supplied with a seed value. Besides those, I'm not even sure the training process is completely deterministic, especially when a GPU is involved, simply because of the nature of parallelization.
See this question.
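For reference, a minimal sketch of seeding in TF 1.x (not from the original answer): the graph-level seed and/or per-op seeds make random initial values repeatable, although GPU kernels and parallelism can still introduce small differences.
import tensorflow as tf

tf.set_random_seed(42)                                # graph-level seed
w = tf.Variable(tf.truncated_normal([2, 2], seed=7))  # op-level seed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(w))  # the same values on every run of the script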
While I don't have a full explanation for the reason why, I was able to resolve my issue by changing:
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})

plot_values(resultA)
resultB = sess.run(D_new, feed_dict={Z: Z_vals})
plot_values(resultB)
plot_values(resultA - resultB)
to...
for i in range(1000):
    _, resultA, Z_vals = sess.run([train_step, D, Z_gen], feed_dict={X: train_data})

resultA, Z_vals = sess.run([D, Z_gen], feed_dict={X: train_data})
plot_values(resultA)
resultB = sess.run(D_new, feed_dict={Z: Z_vals})
plot_values(resultB)
plot_values(resultA - resultB)
Note that I simply ran the graph and extracted resultA and Z_vals one last time, without the train_step. Most likely, in the original version resultA and Z_vals were fetched in the same sess.run as train_step, so they reflected the variable values before that training step, whereas D_new was evaluated in a later run with the updated variables.
The reason I was still seeing problems in my more complex setup was that I had bias variables (even though they were set to 0.0) that were being generated with...
b = tf.Variable(tf.constant(self.bias_k, shape=[n_layer_outputs], dtype=tf.float32))
A variable created directly with tf.Variable() does not participate in variable sharing: only tf.get_variable() respects reuse inside a tf.variable_scope. So those bias variables were technically not being reused, and each decoder got its own copy; even though both started at 0.0, training updated the copy inside D while the copy inside D_new kept its initial value, which is presumably why the outputs diverged.
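A sketch of the corresponding fix (illustrative, following the naming style of the decoder above; it belongs inside the decoder's variable scope): create the bias with tf.get_variable so that reuse=True actually shares it between D and D_new.
b = tf.get_variable(name="D_Bias", shape=[n_layer_outputs],
                    initializer=tf.constant_initializer(self.bias_k))  # shared under reuse, unlike tf.Variable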