I am training a neural network with TensorFlow (1.12) in a supervised fashion, and I'd like to train only on specific examples. The examples are created on the fly by cutting out subsequences, hence I want to do the conditioning within TensorFlow.
This is the relevant part of my original code:
train_step, gvs = minimize_clipped(optimizer, loss,
                                   clip_value=FLAGS.gradient_clip,
                                   return_gvs=True)
gradients = [g for (g, v) in gvs]
gradient_norm = tf.global_norm(gradients)
tf.summary.scalar('gradients/norm', gradient_norm)

eval_losses = {'loss1': loss1,
               'loss2': loss2}
The training step is later executed as:
batch_eval, _ = sess.run([eval_losses, train_step])
I was thinking about inserting something like
train_step_fake = ????
eval_losses_fake = tf.zeros_like(tensor)
train_step_new = tf.cond(my_cond, train_step, train_step_fake)
eval_losses_new = tf.cond(my_cond, eval_losses, eval_losses_fake)
and then doing
batch_eval, _ = sess.run([eval_losses_new, train_step_new])
However, I am not sure how to create a fake train_step.
Also, is this a good idea in general or is there a smoother way of doing this? I am using a tfrecords pipeline, but no other high-level modules (like keras, tf.estimator, eager execution etc.).
Any help is obviously greatly appreciated!
Answering the specific question first: it's certainly possible to perform your training step only based on the tf.cond outcome. Note, though, that the 2nd and 3rd parameters of tf.cond are callables (lambdas), so it would be more like:
train_step_new = tf.cond(my_cond, lambda: train_step, lambda: train_step_fake)
eval_losses_new = tf.cond(my_cond, lambda: eval_losses, lambda: eval_losses_fake)
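As for building the fake train_step itself, here is a minimal sketch (my assumption, not part of the original answer): create the real training op inside the true branch and return a dummy tensor from both branches. Two caveats: ops created outside the branch functions run regardless of the predicate, and optimizers that create slot variables (e.g. Adam) may not tolerate being built inside a cond, which is yet another reason to prefer filtering.
def _do_train():
    # the training op must be created *inside* the branch function,
    # otherwise it executes no matter what `my_cond` evaluates to
    train_op = optimizer.minimize(loss)
    with tf.control_dependencies([train_op]):
        return tf.constant(True)

def _skip_train():
    return tf.constant(False)

train_step_new = tf.cond(my_cond, _do_train, _skip_train)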
Your instinct that this may not be the right thing to do is correct though.
It's much preferable (both in terms of efficiency and in terms of reading and reasoning about your code) to filter out the data you want to ignore before it gets to your model in the first place.
This is something you could achieve using the Dataset API, which has a really useful filter() method. If you are already using the Dataset API to read your TFRecords, this should be as simple as adding something along the lines of:
dataset = dataset.filter(lambda x: {whatever op you were going to use in tf.cond})
If you are not yet using the dataset API, now is probably the time to have a little read up on it and consider it rather than butchering the model with that tf.cond() to act as a filter.
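To make that concrete, here is a hedged sketch of what such a pipeline could look like in TF 1.12; parse_fn, the 'length' feature and min_length are illustrative assumptions, not names from your code:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn)  # your existing per-example parser
# keep only the examples you actually want to train on
dataset = dataset.filter(lambda feats: feats['length'] >= min_length)
dataset = dataset.shuffle(10000).batch(FLAGS.batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()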
I applied this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset). The tutorial did not compute the mean squared error for the individual outputs, so I added the following line in the comparison function:
mean_squared_error(signal_true,signal_pred)
but the loss and MSE from the prediction were different from the loss and MSE reported by model.evaluate on the test data. The errors from model.evaluate (loss, MAE, MSE) on the test set:
[0.013499056920409203, 0.07980187237262726, 0.013792216777801514]
The errors for the individual targets (outputs):
Target0 0.167851388666284
Target1 0.6068108648555771
Target2 0.1710370357827747
Target3 2.747463225418181
Target4 1.7965991690103074
Target5 0.9065426398192563
I think it might be a problem in training the model, but I could not find where exactly. I would really appreciate your help.
Thanks.
There are a number of reasons why you can see differences between the training and evaluation loss.
Certain ops, such as batch normalization, behave differently at prediction time; this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly (see the sketch after this list).
MSE for training is averaged over the entire epoch, while evaluation only happens on the latest "best" version of the model.
It could be due to differences in the datasets if the split isn't random.
You may be using different metrics without realizing it.
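As a hedged illustration of the batch-norm point above (the toy model is made up, not from the original post): with a BatchNormalization layer, the same batch gives different outputs depending on the training flag, which is one common source of a gap between training and evaluation metrics.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, input_shape=(3,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])

x = np.random.rand(4, 3).astype('float32')
print(model(x, training=True))   # normalizes with the batch statistics
print(model(x, training=False))  # normalizes with the moving averages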
I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.
I had the same problem and found a solution. Hopefully this is the same problem you encountered.
It turns out that model.predict doesn't return predictions in the same order as generator.labels, and that is why the MSE was much larger when I attempted to calculate it manually (using the scikit-learn metric function).
>>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
13.17293930053711
>>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
91.1225401637833
My quick and dirty solution:
valid_generator.reset()  # necessary for starting from the first batch
all_labels = []
all_pred = []
for i in range(len(valid_generator)):  # necessary for avoiding an infinite loop
    x = next(valid_generator)
    pred_i = model.predict(x[0])[:, 0]
    labels_i = x[1]
    all_labels.append(labels_i)
    all_pred.append(pred_i)
    print(np.shape(pred_i), np.shape(labels_i))

cat_labels = np.concatenate(all_labels)
cat_pred = np.concatenate(all_pred)
The result:
>>> mean_squared_error(cat_labels, cat_pred)
13.172956865002352
This can be done much more elegantly, but it was enough for me to confirm my hypothesis of the problem and regain some sanity.
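As a hedged aside on the "more elegantly" point: if the validation generator is built with shuffle=False, the order of model.predict should line up with generator.labels and no manual loop is needed. The flow_from_dataframe call and its arguments below are illustrative assumptions, not taken from the original post.
valid_generator = datagen.flow_from_dataframe(valid_df,
                                              x_col='filename',
                                              y_col='target',
                                              class_mode='raw',
                                              shuffle=False,  # keeps prediction order aligned with labels
                                              batch_size=32)

mse = mean_squared_error(valid_generator.labels,
                         model.predict(valid_generator)[:, 0])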
I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. However, the only way to control the number of threads in the Dataset API seems to be the num_parallel_calls argument of the map function, and I'm using the flat_map function instead, which doesn't have such an argument.
Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there are way to use map in combination with flat_map and still specify the number of parallel calls?
Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.
There are two (here and here) related posts on GitHub, but I don't think they answer this question.
Here is a minimal code example of my use-case for illustration:
with tf.Graph().as_default():
    data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
    input_tensors = (data,)

    def pre_processing_func(data_):
        # normally I would do data-augmentation here
        results = (tf.expand_dims(data_, axis=0),)
        return tf.data.Dataset.from_tensor_slices(results)

    dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
    dataset = dataset_source.flat_map(pre_processing_func)

    # do something with 'dataset'
To the best of my knowledge, at the moment flat_map does not offer parallelism options.
Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.
In code:
NUM_THREADS = 5
BUFFER_SIZE = 1000
def pre_processing_func(data_):
    # data-augmentation here:
    # generate new samples starting from the sample `data_`
    artificial_samples = generate_from_sample(data_)
    return artificial_samples

dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors)
                  .map(pre_processing_func, num_parallel_calls=NUM_THREADS)
                  .prefetch(BUFFER_SIZE)
                  .flat_map(lambda *x: tf.data.Dataset.from_tensor_slices(x))
                  .shuffle(BUFFER_SIZE))  # my addition, probably necessary though
Note (to myself and whoever will try to understand the pipeline):
Since pre_processing_func generates an arbitrary number of new samples starting from the initial sample (organised in matrices of shape (?, 512)), the flat_map call is necessary to turn all the generated matrices into Datasets containing single samples (hence the tf.data.Dataset.from_tensor_slices(x) in the lambda) and then flatten all these datasets into one big Dataset containing individual samples.
It's probably a good idea to .shuffle() that dataset, or generated samples will be packed together.
All code assumes TensorFlow 1.3 and Python 3.x.
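For completeness, a hedged usage sketch of how one might pull samples out of the resulting dataset_source for inspection (this part is my assumption, not from the original answer):
iterator = dataset_source.make_one_shot_iterator()
next_sample = iterator.get_next()

with tf.Session() as sess:
    sample = sess.run(next_sample)  # a single augmented sample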
We are working on a GAN algorithm which has an interesting loss function.
Stage 1 - Compute only the completion/generator loss portion of the network. Iterate over the completion portion of the GAN for X iterations.
Stage 2 - Compute only the discriminator loss portion of the network. Iterate over the discriminator portion for Y iterations (but don't train on Stage 1).
Stage 3 - Compute the full loss on the network. Iterate over both completion and discriminator for Z iterations (training on the entire network).
We have this working on a single GPU. We want to make it work on multiple GPUs since training times are long.
We have looked at the Tensorflow/models/tutorials/Images/cifar10/cifar10_multi_gpu_train.py, which talks about tower loss, averaging the towers together, computing your gradients on the GPUs then applying them on the CPU. This is a great start. However, since our loss is more complicated, it has complicated everything a bit for us.
The code is decently complicated, but it is roughly similar to https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN (which won't run as-is because it was written around TensorFlow 0.1, so it has some oddities I haven't gotten working, but it should give you an idea of what we're doing).
When we compute gradients, it looks something like this (pseudocode to try to highlight the important portions):
for i in range(num_gpus):
    with tf.device('/gpu:%d' % gpus[i]):
        with tf.name_scope('Tower_%d' % gpus[i]) as scope:
            with tf.variable_scope("generator"):
                generator = build_generator()

            with tf.variable_scope("discriminator"):
                with tf.variable_scope("real_discriminator"):
                    real_discriminator = build_discriminator(x)

                with tf.variable_scope("fake_discriminator", reuse=True):
                    fake_discriminator = build_discriminator(generator)

            gen_only_loss, discm_only_loss, full_loss = build_loss(
                generator, real_discriminator, fake_discriminator)

            tf.get_variable_scope().reuse_variables()

            gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
            tower_gen_only_grads.append(gen_only_grads)

            discm_only_train_vars = tf.get_collection(
                tf.GraphKeys.TRAINABLE_VARIABLES, "discriminator")
            discm_only_train_vars = discm_only_train_vars + tf.get_collection(
                tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "discriminator")

            discm_only_grads = discm_only_opt.compute_gradients(
                discm_only_loss, var_list=discm_only_train_vars)
            tower_discm_only_grads.append(discm_only_grads)

            full_grads = full_opt.compute_gradients(full_loss)
            tower_full_grads.append(full_grads)

# average_gradients is the same code from cifar10_multi_gpu_train.py.
# We haven't changed it; it just iterates over the gradients and averages
# them... this is part of the problem...
gen_only_grads = average_gradients(tower_gen_only_grads)
gen_only_train = gen_only_opt.apply_gradients(gen_only_grads,
                                              global_step=global_step)

discm_only_grads = average_gradients(tower_discm_only_grads)
discm_only_train = discm_only_opt.apply_gradients(discm_only_grads,
                                                  global_step=global_step)

full_grads = average_gradients(tower_full_grads)
full_train = full_opt.apply_gradients(full_grads, global_step=global_step)
If we call only compute_gradients(full_loss), the algorithm works properly on multiple GPUs. This is pretty much equivalent to the code in the cifar10_multi_gpu_train.py example. The tricky part comes when we need to restrict the network in stage 1 or 2.
compute_gradients(full_loss) has a var_list parameter with a default value of None, which means it trains all the variables. How does it know not to train Tower_0 variables when in Tower_1? I ask because, when we deal with compute_gradients(discm_only_loss, var_list=discm_only_train_vars), I need to know how to gather up the correct variables to restrict training to that portion of the network. I found one thread talking about this, but found it to be inaccurate/incomplete: "freeze" some variables/scopes in tensorflow: stop_gradient vs passing variables to minimize.
The reason being that, if you look at the code in compute_gradients, var_list is filled with a combination of trainable variables and trainable resource variables when None is passed in. So that's how I've limited it as well. This all works properly if we don't attempt to split across multiple GPUs.
Question 1:
Now that I've split the network by towers, am I responsible for gathering up the current tower as well? Do I need to add a line like this?
discm_only_train_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, "Tower_{}/discriminator".format(i))
discm_only_train_vars = discm_only_train_vars + tf.get_collection(
    tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "Tower_{}/discriminator".format(i))
in order to train the proper variables for each tower (and ensure I don't miss training those variables)?
Question 2:
Probably the same answer as question 1. Getting compute_gradients(gen_only_loss) right is a bit harder... in the non-towered version, gen_only_loss never touched the discriminator, so it activated only the tensors in the graph that it needed and everything was fine. However, in the towered version, when I call compute_gradients it returns gradients for variables it never touched, so some of the entries are [(None, tf.Variable), (None, tf.Variable)]. This causes average_gradients to crash because it can't convert a None value to a Tensor. This makes me think I need to restrict these as well.
The confusing thing about all of this is that the cifar example and my full_loss example do not care about training on specific towers, but I'm guessing that once I specify a var_list, whatever magic compute_gradients was using to know which variables to train on which tower disappears? Do I need to worry about grabbing any other variables?
For question 1: yes, if you split the variables by tower manually, you are responsible for gathering them up per tower.
For question 2: you might want to restrict the var_list you pass to compute_gradients, or filter the (None, variable) pairs out of its result.
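A minimal sketch of the second option (my assumption, not from the original answer): drop the pairs whose gradient is None before they ever reach average_gradients.
gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
# keep only the variables that actually received a gradient from this loss
gen_only_grads = [(g, v) for (g, v) in gen_only_grads if g is not None]
tower_gen_only_grads.append(gen_only_grads)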
I have a pipeline to read train and validation datasets from tfrecords.
I build batches using tf.train.batch. During training I want to switch between training and evaluation on the validation dataset.
Here is a simplified snippet of how I implement it now:
is_training_pl = tf.placeholder(tf.bool)
images_train, labels_train = tf.train.batch([img_train, label_train])
images_val, labels_val = tf.train.batch([img_val, label_val])
data = tf.cond(is_training_pl,
               lambda: [images_train, labels_train],
               lambda: [images_val, labels_val])
loss = my_model(input=data)
I know that one can do it with tf.cond, but the problem is that both the train and val batch ops are executed every time tf.cond is evaluated.
On GitHub, ebrevdo said (link to the comment) that it's possible to use tf.train.maybe_batch for this purpose instead, which is more efficient.
Can anyone give an example of how to use tf.train.maybe_batch in my case, please?
Sorry for not posting entire snippets -- the code is very big and spread out, so hopefully this can illustrate my issue. I have these:
train = theano.function([X], output, updates=update_G,
                        givens={train_mode=:np.cast['int32'](1)})
and
test = theano.function([X], output, updates=update_G,
                       givens={train_mode=:np.cast['int32'](0)})
To my understanding, givens would substitute the value of train_mode (i.e. 1/0) wherever it's needed to compute the output.
The output is computed along these lines:
...
network2 = Net2()
# This is sort of a dummy variable so I don't get a NameError when this
# is called before `theano.function()` is called. Not sure if this is the
# right way to do this.
train_mode = T.iscalar('train_mode')
output = loss(network1.get_outputs(network2.get_outputs(X, train_mode=train_mode)),something).mean()
....
class Net2():
    def get_outputs(self, x, train_mode):
        from theano.ifelse import ifelse
        import theano.tensor as T
        my_flag = ifelse(T.eq(train_mode, 1), 1, 0)
        return something if my_flag else something_else
So train_mode is used as an argument in one of the nested functions, and I use it to tell between train and test as I'd like to handle them slightly differently.
However, when I try to run this, I get this error:
theano.compile.function_module.UnusedInputError: theano.function was
asked to create a function computing outputs given certain inputs, but
the provided input variable at index 1 is not part of the computational
graph needed to compute the outputs: <TensorType(int32, scalar)>.To make
this error into a warning, you can pass the parameter
on_unused_input='warn' to theano.function. To disable it completely, use
on_unused_input='ignore'.
If I delete the givens parameter, the error disappears, so to my understanding Theano believes that my train_mode is not necessary to compute the function. I can use on_unused_input='ignore' as per their suggestion, but that would just ignore my train_mode if Theano thinks it's unused. Am I going about this the wrong way? I basically just want to train a neural network with dropout, but not use dropout when evaluating.
Why do you use the "=" sign? I think it makes train_mode unreadable; my code works well when written as:
givens = {train_mode:1}
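Applying that to the question's code, a minimal sketch of the corrected calls (keeping the question's names and the int32 cast from the original snippet) would be:
train = theano.function([X], output, updates=update_G,
                        givens={train_mode: np.cast['int32'](1)})

test = theano.function([X], output, updates=update_G,
                       givens={train_mode: np.cast['int32'](0)})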