I'm currently working on a machine-learning project.
I'm using Python 3 with TensorFlow to train a CNN, and I would like to measure its performance with TensorBoard.
Specifically, I want to plot the loss value per epoch. But instead of getting a single graph with the epoch on the X axis and the loss on the Y axis, I get two graphs: one showing the epoch value and the other showing the loss value.
I put a screenshot here:
Here is the training part of my code:
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter(graphDirectory, sess.graph)

    # Generate Shepp-Logan data for validation
    x_arr_validate, y_arr_validate, x_true_arr_validate, y_true_arr_validate = generateData(datasize, nbiter, reco_space, operator, pseudoinverse, validation=True)

    for step in tqdm(range(epoch)):
        # Generate training data
        x_arr, y_arr, x_true_arr, y_true_arr = generateData(datasize, nbiter, reco_space, operator, pseudoinverse)

        # Training
        feed_dict = {x0: x_arr,
                     x_true: x_true_arr,
                     y: y_arr}
        _, loss_training = sess.run([optimizer, loss], feed_dict)

        # Validation
        feed_dictValidate = {x0: x_arr_validate,
                             x_true: x_true_arr_validate,
                             y: y_arr_validate}
        x_values_result, loss_result = sess.run([x_values, loss], feed_dictValidate)

        lossSummary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=loss_result)])
        epochSummary = tf.Summary(value=[tf.Summary.Value(tag="epoch", simple_value=step)])
        writer.add_summary(lossSummary)
        writer.add_summary(epochSummary)

    saver.save(sess, sessFileName, write_meta_graph=True)
    writer.close()
I tried changing:
writer.add_summary(lossSummary)
writer.add_summary(epochSummary)
to:
writer.add_summary(lossSummary, epochSummary)
But that doesn't work.
I also tried creating an array:
step_per_epoch = []
...
x_values_result, loss_result = sess.run([x_values, loss], feed_dictValidate)
step_per_epoch.append(loss_result)
lossSummary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=step_per_epoch)])
writer.add_summary(lossSummary)
But I got the following error:
lossSummary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=loss_per_epoch)])
TypeError: [] has type list, but expected one of: int, long, float
I have no idea what else to try. Any hints or tips? Thank you.
If you want to see the epochs on the horizontal axis, then you have to pass a global_step parameter along with the summary (see the documentation for tf.summary.FileWriter.add_summary). In your case, that would be:
writer.add_summary(lossSummary, step)
writer.add_summary(epochSummary, step)
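As a side note, another common pattern (an alternative, not required for the fix above) is to build the summary as a graph op with tf.summary.scalar and fetch it in sess.run; a minimal sketch, assuming the loss tensor and feed_dictValidate from your code:

# built once, outside the training loop
loss_summary_op = tf.summary.scalar("loss", loss)

# inside the loop: evaluate the summary op and tag it with the current step
summary_str = sess.run(loss_summary_op, feed_dict=feed_dictValidate)
writer.add_summary(summary_str, global_step=step)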
Alternatively, if you change the "Horizontal Axis" selection in this panel:
From "Step" to "Relative" or "Wall" you will have relative or absolute time stamps in the X axis, which will allow you to see the progress.
I am trying to properly read my own binary data into TensorFlow, based on the "Fixed length records" section of this tutorial and by looking at the read_cifar10 function here. Mind you, I am new to TensorFlow, so my understanding may be off.
My Data
My files are binary with float32 values. The first 32-bit sample is the label, and the remaining 256 samples are the data. I want to reshape the data at the end into a [2, 128] matrix.
My Code So far:
import tensorflow as tf
import os

def read_data(filename_queue):
    item_type = tf.float32
    label_items = 1
    data_items = 256
    label_bytes = label_items * item_type.size
    data_bytes = data_items * item_type.size
    record_bytes = label_bytes + data_bytes

    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    key, value = reader.read(filename_queue)
    record_data = tf.decode_raw(value, item_type)

    # labels = tf.cast(tf.strided_slice(record_data, [0], [label_items]), tf.int32)
    label = tf.strided_slice(record_data, [0], [label_items])
    data0 = tf.strided_slice(record_data, [label_items], [label_items + data_items])
    data = tf.reshape(data0, [2, data_items/2])
    return data, label

if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set GPU device
    datafiles = ['train_0000.dat', 'train_0001.dat']
    num_epochs = 2
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        (x, y) = read_data(filename_queue)
        print(y.eval())
This code hangs at print(y.eval()), but I fear I have much bigger issues than that.
Question:
When I execute this, I get a data and a label tensor returned. The problem is that I don't quite understand how to actually read the data from the tensors. For example, I understand the autoencoder example here, but it has an mnist.train.next_batch(batch_size) function that is called to read the next batch. Do I need to write something like that for my data, or is it handled by something internal to my read_data() function? If I need to write that function, what does it look like?
Are there any other obvious things I'm missing? My goal in using this method is to reduce I/O overhead and avoid storing all of the data in memory, since my files are quite large.
Thanks in advance.
Yes. You are pretty much done. At this point you need to:
1) Write your neural network model, which is supposed to take your data and return a label.
2) Write your cost function C, which takes the network prediction and the true label and gives you a cost.
3) Choose an optimizer.
4) Put everything together:
opt = tf.train.AdamOptimizer(learning_rate=0.001)

datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2

with tf.Session() as sess:
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    example_batch, label_batch = tf.train.shuffle_batch(
        [data, label], batch_size=128,
        capacity=2000, min_after_dequeue=1000)  # capacity/min_after_dequeue are required; values are illustrative

    y_pred = model(example_batch)
    loss = C(label_batch, y_pred)
    # (variable initialization happens below, once the training op has been created)
After which you iterate and minimize the loss with:
opt.minimize(loss)
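Note that opt.minimize(loss) only builds the training op; a loop around it is still needed, and because the input pipeline above is queue-based, the queue runners have to be started before anything can be read (this is also why the print(y.eval()) in your script hangs). A rough sketch, continuing inside the with block above; the loop structure is illustrative:

    train_op = opt.minimize(loss)

    # initialize only after the whole graph (model, loss, optimizer) has been built;
    # the local-variable initializer is needed because num_epochs is set
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # start the background threads that fill the filename and example queues
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            _, loss_val = sess.run([train_op, loss])
    except tf.errors.OutOfRangeError:
        pass  # raised once num_epochs worth of data has been consumed
    finally:
        coord.request_stop()
        coord.join(threads)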
See also tf.train.string_input_producer behavior in a loop for related information.
I have written a program with TensorFlow that identifies a number of figures in an image. The model is trained with one function and then used with another function to label the figures. The training was done on my computer and the resulting model uploaded to AWS together with the solve function.
On my computer it works well, but when I create a Lambda in AWS it behaves strangely and starts giving different answers with the same test data.
The model in the solve function is this:
# Recreate neural network from model file generated during training
# input
x = tf.placeholder(tf.float32, [None, size_of_image])
# weights
W = tf.Variable(tf.zeros([size_of_image, num_chars]))
# biases
b = tf.Variable(tf.zeros([num_chars]))
The solve function code to label the figures is this:
for testi in range(captcha_letters_num):
    # load model from file
    saver = tf.train.import_meta_graph(model_path + '.meta',
                                       clear_devices=True)
    saver.restore(sess, model_path)

    # Data to label
    test_x = np.asarray(char_imgs[testi], dtype=np.float32)
    predict_op = model(test_x, W, b)
    op = sess.run(predict_op, feed_dict={x: test_x})

    # find max probability from the probability distribution returned by softmax
    max_probability = op[0][0]
    max_probability_index = -1
    for i in range(num_chars):
        if op[0][i] > max_probability:
            max_probability = op[0][i]
            max_probability_index = i

    # append it to final output
    final_text += char_map_list[max_probability_index]

    # Reset the model so it can be used again
    tf.reset_default_graph()
With the same test data it gives different answers, and I don't know why.
Solved!
What I finally did was keep the Session outside the loop and initialize the variables there. After the loop ends, I reset the graph.
saver = tf.train.Saver()
sess = tf.Session()

# Initialize variables
sess.run(tf.global_variables_initializer())
.
.
.
# passing each of the 5 characters through the NNet
for testi in range(captcha_letters_num):
    # Data to label
    test_x = np.asarray(char_imgs[testi], dtype=np.float32)
    predict_op = model(test_x, W, b)
    op = sess.run(predict_op, feed_dict={x: test_x})

    # find max probability from the probability distribution returned by softmax
    max_probability = op[0][0]
    max_probability_index = -1
    for i in range(num_chars):
        if op[0][i] > max_probability:
            max_probability = op[0][i]
            max_probability_index = i

    # append it to final output
    final_text += char_map_list[max_probability_index]

# Reset the model so it can be used again
tf.reset_default_graph()
sess.close()
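As a small side note, the inner loop that searches for the most likely class can be replaced by NumPy's argmax; a minimal sketch, assuming op is the softmax output with shape [1, num_chars]:

import numpy as np

# argmax over the class axis replaces the manual max-probability scan
max_probability_index = int(np.argmax(op[0]))
final_text += char_map_list[max_probability_index]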
I'm training a convolutional model in tensorflow. After training the model for about 70 epochs, which took almost 1.5 hrs, I couldn't save the model. It gave me ValueError: GraphDef cannot be larger than 2GB. I found that as the training proceeds the number of nodes in my graph keeps increasing.
At epochs 0,3,6,9, the number of nodes in the graph are 7214, 7238, 7262, 7286 respectively. When I use with tf.Session() as sess:, instead of passing the session as sess = tf.Session(), the number of nodes are 3982, 4006, 4030, 4054 at epochs 0,3,6,9 respectively.
In this answer, it is said that as nodes get added to the graph, it can exceed its maximum size. I need help understanding why the number of nodes keeps going up in my graph.
I train my model using the code below:
def runModel(data):
    '''
    Defines cost, optimizer functions, and runs the graph
    '''
    X, y, keep_prob = modelInputs((755, 567, 1), 4)
    logits = cnnModel(X, keep_prob)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name="cost")
    optimizer = tf.train.AdamOptimizer(.0001).minimize(cost)
    correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1), name="correct_pred")
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    for e in range(12):
        batch_x, batch_y = data.next_batch(30)
        x = tf.reshape(batch_x, [30, 755, 567, 1]).eval(session=sess)
        batch_y = tf.one_hot(batch_y, 4).eval(session=sess)
        sess.run(optimizer, feed_dict={X: x, y: batch_y, keep_prob: 0.5})
        if e % 3 == 0:
            n = len([n.name for n in tf.get_default_graph().as_graph_def().node])
            print("No.of nodes: ", n, "\n")
            current_cost = sess.run(cost, feed_dict={X: x, y: batch_y, keep_prob: 1.0})
            acc = sess.run(accuracy, feed_dict={X: x, y: batch_y, keep_prob: 1.0})
            print("At epoch {epoch:>3d}, cost is {a:>10.4f}, accuracy is {b:>8.5f}".format(epoch=e, a=current_cost, b=acc))
What causes an increase in the number of nodes?
You are creating new nodes within your training loop. In particular, you are calling tf.reshape and tf.one_hot, each of which creates one (or more) nodes. You can either:
Create those nodes outside of the loop, using placeholders as inputs, and then only evaluate them in the loop.
Not use TensorFlow for those operations and instead use NumPy or equivalent operations.
I would recommend the second one, since there does not seem to be any benefit in using TensorFlow for data preparation. You can have something like:
import numpy as np
# ...
x = np.reshape(batch_x, [30, 755, 567, 1])
# ...
# One way of doing one-hot encoding with NumPy
classes_arr = np.arange(4).reshape([1] * batch_y.ndim + [-1])
batch_y = (np.expand_dims(batch_y, -1) == classes_arr).astype(batch_y.dtype)
# ...
PS: I'd also recommend using tf.Session() in a with context manager to make sure its close() method is called at the end (unless you want to keep using the same session later).
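If you want to catch this kind of leak early, another option (standard TF 1.x API, not mentioned above) is to finalize the graph before the training loop, so that any attempt to add new nodes raises an error; a minimal sketch:

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.graph.finalize()  # any op created inside the training loop now raises a RuntimeError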
Another option, which solved a similar problem for me, is to use tf.reset_default_graph().
Suppose I have a trained RNN (e.g. a language model) and I want to see what it would generate on its own. How should I feed its output back into its input?
I read the following related questions:
TensorFlow using LSTMs for generating text
TensorFlow LSTM Generative Model
Theoretically it is clear to me that in TensorFlow we use truncated backpropagation, so we have to define the maximum number of steps we would like to "trace". We also reserve a dimension for batches, so if I'd like to train on a sine wave, I have to feed inputs of shape [None, num_step, 1].
The following code works:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

tf.reset_default_graph()

n_samples = 100
state_size = 5

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)

def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]
X = tf.placeholder_with_default(zero_x, [None, n_samples, 1])
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64)

pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Y = np.roll(def_x, 1)
loss = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
opt = tf.train.AdamOptimizer().minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Initial state run
plt.show(plt.plot(output.eval()[0]))
plt.plot(def_x.squeeze())
plt.show(plt.plot(pred.eval().squeeze()))

steps = 1001
for i in range(steps):
    p, l, _ = sess.run([pred, loss, opt])
The state size of the LSTM can be varied, and I also experimented with feeding the network a sine wave and with feeding it zeros; in both cases it converged in ~500 iterations. So far I have understood that in this case the graph consists of n_samples LSTM cells sharing their parameters, and it is only up to me to feed them input as a time series. However, when generating samples the network explicitly depends on its previous output, meaning that I cannot feed the unrolled model all at once. I tried to compute the state and output at every step:
with tf.variable_scope('sine', reuse=True):
    X_test = tf.placeholder(tf.float64)
    X_reshaped = tf.reshape(X_test, [1, -1, 1])
    output, last_states = tf.nn.dynamic_rnn(lstm_cell, X_reshaped, dtype=tf.float64)
    pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)

test_vals = [0.]
for i in range(1000):
    val = pred.eval({X_test: np.array(test_vals)[None, :, None]})
    test_vals.append(val)
However in this model it seems that there is no continuity between the LSTM cells. What is going on here?
Do I have to initialize a zero array with, say, 100 time steps and assign each run's result into the array? That is, feeding the network like this:
run 0: input_feed = [0, 0, 0 ... 0]; res1 = result
run 1: input_feed = [res1, 0, 0 ... 0]; res2 = result
run 2: input_feed = [res1, res2, 0 ... 0]; res3 = result
etc...
What to do if I want to use this trained network to use its own output as its input in the following time step?
If I understood you correctly, you want to find a way to feed the output of time step t as input to time step t+1, right? To do so, there is a relatively easy work around that you can use at test time:
Make sure your input placeholders can accept a dynamic sequence length, i.e. the size of the time dimension is None.
Make sure you are using tf.nn.dynamic_rnn (which you do in the posted example).
Pass the initial state into dynamic_rnn.
Then, at test time, you can loop through your sequence and feed each time step individually (i.e. max sequence length is 1). Additionally, you just have to carry over the internal state of the RNN. See pseudo code below (the variable names refer to your code snippet).
I.e., change the definition of the model to something like this:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)

def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]

X = tf.placeholder_with_default(zero_x, [None, None, 1])  # [batch_size, seq_length, dimension of input]
batch_size = tf.shape(X)[0]
initial_state = lstm_cell.zero_state(batch_size, dtype=tf.float64)

output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64,
                                        initial_state=initial_state)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Then you can perform inference like so:
fetches = {'final_state': last_states,
           'prediction': pred}

toy_initial_input = np.array([[[1]]])  # put suitable data here
seq_length = 20  # put whatever is reasonable here for you

# get the output for the first time step
feed_dict = {X: toy_initial_input}
eval_out = sess.run(fetches, feed_dict)
outputs = [eval_out['prediction']]
next_state = eval_out['final_state']

for i in range(1, seq_length):
    feed_dict = {X: outputs[-1],
                 initial_state: next_state}
    eval_out = sess.run(fetches, feed_dict)
    outputs.append(eval_out['prediction'])
    next_state = eval_out['final_state']

# outputs now contains the sequence you want
Note that this can also work for batches; however, it can be a bit more complicated if you have sequences of different lengths in the same batch.
If you want to perform this kind of prediction not only at test time but also at training time, that is also possible, just a bit more complicated to implement.
You can use its own output (last state) as the next-step input (initial state).
One way to do this is to:
use zero-initialized variables as the input state at every time step
each time you complete a truncated sequence and get some output state, update the state variables with the output state you just got.
The second can be done by either:
fetching the states to python and feeding them back next time, as done in the ptb example in tensorflow/models
building an update op in the graph and adding a dependency, as done in the ptb example in tensorpack (see the sketch below).
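A rough sketch of that second variant; the variable names here are illustrative (not taken from the ptb examples), and lstm_cell, X, Y, state_size and n_samples are assumed to be defined as in the question's snippet:

# State kept in non-trainable variables (batch size 1, dtype matching the question's code)
state_c = tf.Variable(tf.zeros([1, state_size], dtype=tf.float64), trainable=False)
state_h = tf.Variable(tf.zeros([1, state_size], dtype=tf.float64), trainable=False)
init_state = tf.nn.rnn_cell.LSTMStateTuple(state_c, state_h)

output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64,
                                        initial_state=init_state)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
loss = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
train_op = tf.train.AdamOptimizer().minimize(loss)

# After the gradient step, write this chunk's final state back into the variables,
# so the next sess.run starts where this one ended.
with tf.control_dependencies([train_op]):
    train_and_update = tf.group(state_c.assign(last_states.c),
                                state_h.assign(last_states.h))
# each training step is then a single call: sess.run(train_and_update)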
I know I'm a bit late to the party but I think this gist could be useful:
https://gist.github.com/CharlieCodex/f494b27698157ec9a802bc231d8dcf31
It lets you auto-feed the output through a filter and back into the network as input. To make shapes match up, processing can be set as a tf.layers.Dense layer.
Please ask any questions!
Edit:
In your particular case, create a lambda which performs the processing of the dynamic_rnn outputs into your character vector space. Ex:
# if you have:
W = tf.Variable( ... )
B = tf.Variable( ... )
Yo, Ho = tf.nn.dynamic_rnn( cell , inputs , state )
logits = tf.matmul(W, Yo) + B
...
# use self_feeding_rnn as
process_yo = lambda Yo: tf.matmul(W, Yo) + B
Yo, Ho = self_feeding_rnn( cell, seed, initial_state, processing=process_yo)
I am aiming to do big things with TensorFlow, but I'm trying to start small.
I have small greyscale squares (with a little noise) and I want to classify them according to their colour (e.g. 3 categories: black, grey, white). I wrote a little Python class to generate squares and 1-hot vectors, and modified their basic MNIST example to feed them in.
But it won't learn anything - e.g. for 3 categories it always guesses ≈33% correct.
import tensorflow as tf
import generate_data.generate_greyscale
data_generator = generate_data.generate_greyscale.GenerateGreyScale(28, 28, 3, 0.05)
ds = data_generator.generate_data(10000)
ds_validation = data_generator.generate_data(500)
xs = ds[0]
ys = ds[1]
num_categories = data_generator.num_categories
x = tf.placeholder("float", [None, 28*28])
W = tf.Variable(tf.zeros([28*28, num_categories]))
b = tf.Variable(tf.zeros([num_categories]))
y = tf.nn.softmax(tf.matmul(x,W) + b)
y_ = tf.placeholder("float", [None,num_categories])
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
# let batch_size = 100 --> therefore there are 100 batches of training data
xs = xs.reshape(100, 100, 28*28) # reshape into 100 minibatches of size 100
ys = ys.reshape((100, 100, num_categories)) # reshape into 100 minibatches of size 100
for i in range(100):
    batch_xs = xs[i]
    batch_ys = ys[i]
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
xs_validation = ds_validation[0]
ys_validation = ds_validation[1]
print sess.run(accuracy, feed_dict={x: xs_validation, y_: ys_validation})
My data generator looks like this:
import numpy as np
import random

class GenerateGreyScale():
    def __init__(self, num_rows, num_cols, num_categories, noise):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.num_categories = num_categories
        # set a level of noisiness for the data
        self.noise = noise

    def generate_label(self):
        lab = np.zeros(self.num_categories)
        lab[random.randint(0, self.num_categories - 1)] = 1
        return lab

    def generate_datum(self, lab):
        i = np.where(lab == 1)[0][0]
        frac = float(1) / (self.num_categories - 1) * i
        arr = np.random.uniform(max(0, frac - self.noise), min(1, frac + self.noise), self.num_rows * self.num_cols)
        return arr

    def generate_data(self, num):
        data_arr = np.zeros((num, self.num_rows * self.num_cols))
        label_arr = np.zeros((num, self.num_categories))
        for i in range(0, num):
            label = self.generate_label()
            datum = self.generate_datum(label)
            data_arr[i] = datum
            label_arr[i] = label
        # data_arr = data_arr.astype(np.float32)
        # label_arr = label_arr.astype(np.float32)
        return data_arr, label_arr
For starters, try initializing your W matrix with random values, not zeros - you're not giving the optimizer anything to work with when the output is all zeros for all inputs.
Instead of:
W = tf.Variable(tf.zeros([28*28, num_categories]))
Try:
W = tf.Variable(tf.truncated_normal([28*28, num_categories],
                                    stddev=0.1))
Your issue is that your gradients are growing without bound, causing the loss function to become NaN.
Take a look at this question: Why does TensorFlow example fail when increasing batch size?
Furthermore, make sure that you run the model for a sufficient number of steps. You are only running it once through your training dataset (100 steps of 100 examples), and this is not enough for it to converge. Increase it to something like 2000 steps at a minimum (running 20 times through your dataset).
Edit (I can't comment, so I'll add my thoughts here):
The point of the post I linked is that you can use GradientDescentOptimizer as long as you set the learning rate to something like 0.001. That's the issue: your learning rate was too high for the loss function you were using.
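In other words, something along these lines (only the learning rate changes relative to the question's code):

train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)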
Alternatively, use a different loss function that doesn't blow up the gradients as much: use tf.reduce_mean instead of tf.reduce_sum in the definition of cross_entropy.
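That is, a one-line change to the question's code:

# averaging instead of summing keeps the gradient magnitude independent of the batch size
cross_entropy = -tf.reduce_mean(y_ * tf.log(y))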
While dga and syncd's responses were helpful, I tried using non-zero weight initialization and larger datasets but to no avail. The thing that finally worked was using a different optimization algorithm.
I replaced:
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
with
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)
I also embedded the training for loop in another for loop to train for several epochs (sketched after the output below), resulting in convergence like this:
===# EPOCH 0 #===
Error: 0.370000004768
===# EPOCH 1 #===
Error: 0.333999991417
===# EPOCH 2 #===
Error: 0.282000005245
===# EPOCH 3 #===
Error: 0.222000002861
===# EPOCH 4 #===
Error: 0.152000010014
===# EPOCH 5 #===
Error: 0.111999988556
===# EPOCH 6 #===
Error: 0.0680000185966
===# EPOCH 7 #===
Error: 0.0239999890327
===# EPOCH 8 #===
Error: 0.00999999046326
===# EPOCH 9 #===
Error: 0.00400000810623
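The nested loop was roughly of this shape (a sketch that reuses the 100 minibatches of 100 examples from the question; the epoch count and the error printout are illustrative):

num_epochs = 10
for epoch in range(num_epochs):
    for i in range(100):  # 100 minibatches of 100 examples each
        sess.run(train_step, feed_dict={x: xs[i], y_: ys[i]})
    # report the validation error after each epoch
    err = 1.0 - sess.run(accuracy, feed_dict={x: xs_validation, y_: ys_validation})
    print("===# EPOCH %d #===" % epoch)
    print("Error: %s" % err)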
EDIT - WHY IT WORKS: I suppose the problem was that I didn't manually choose a good learning rate schedule, and Adam was able to generate a better one automatically.
I found this question when I was having a similar issue. I fixed mine by scaling the features.
A little background: I was following the TensorFlow tutorial, but I wanted to use the data from Kaggle (see data here) to do the modeling. In the beginning I kept having the same issue: the model just doesn't learn. After rounds of troubleshooting, I realized that the Kaggle data was on a completely different scale, so I scaled the data to the same (0, 1) range as TensorFlow's MNIST dataset.
Just figured I would add my two cents here, in case some beginners who are trying to follow the tutorial's settings get stuck like I did =)
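For reference, if the features are raw 0-255 pixel intensities (as in Kaggle's digit data), the scaling itself is a one-liner; a small sketch, with X_train standing in for the raw feature matrix:

import numpy as np

# divide by 255 to map raw pixel intensities onto the same [0, 1] range
# as the MNIST data used in the TensorFlow tutorial
X_train = np.asarray(X_train, dtype=np.float32) / 255.0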