All code is assuming Tensorflow 1.3 and Python 3.x
We are working on a GAN algorithm which has an interesting loss function.
Stage 1 - Compute only the completion/generator loss portion of the network
Iterates over the completion portion of the GAN for X iterations.
Stage 2 - Compute only the discriminator loss portion of the network
Iterates over the discriminator portion for Y iterations (but
don't train on Stage 1)
Stage 3 - Compute the full loss on the network
Iterate over both completion and discriminator for Z iterations
(training on the entire network).
We have this working single GPU. We want to make it work multi GPU since training times are long.
We have looked at the Tensorflow/models/tutorials/Images/cifar10/cifar10_multi_gpu_train.py, which talks about tower loss, averaging the towers together, computing your gradients on the GPUs then applying them on the CPU. This is a great start. However, since our loss is more complicated, it has complicated everything a bit for us.
The code is decently complicated, but is roughly similar to this, https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN, (but that won't run because it was written around Tensorflow 0.1, so it has some oddities that I haven't gotten working, but that should give you an idea of what we're doing)
When we compute gradients, it looks something like this (pseudocode to try to highlight the important portions):
for i in range(num_gpus):
with tf.device('/gpu:%d' % gpus[i]):
with tf.name_scope('Tower_%d' % gpus[i]) as scope:
with tf.variable_scope( "generator" )
generator = build_generator()
with tf.variable_scope( "discriminator" ):
with tf.variable_scope( "real_discriminator" ) :
real_discriminator = build_discriminator(x)
with tf.variable_scope( "fake_discriminator", reuse = True ):
fake_discriminator = build_discriminator(generator)
gen_only_loss, discm_only_loss, full_loss = build_loss( generator,
real_discriminator, fake_discriminator )
tf.get_variable_scope().reuse_variables()
gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
tower_gen_only_grads.append(gen_only_grads)
discm_only_train_vars= tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES, "discriminator" )
discm_only_train_vars= discm_only_train_vars+ tf.get_collection(
tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "discriminator" )
discm_only_grads = discm_only_opt.compute_gradients(discm_only_loss,
var_list = discm_only_train_vars)
tower_discm_only_grads.append(discm_only_grads)
full_grads = full_opt.compute_gradients(full_loss)
tower_full_grads.append(full_grads)
# average_gradients is the same code from the cifar10_multi_gpu_train.py.
We haven't changed it. Just iterates over gradients and averages
them...this is part of the problem...
gen_only_grads = average_gradients(tower_gen_only_grads)
gen_only_train = gen_only_opt.apply_gradients(gen_only_grads,
global_step=global_step)
discm_only_grads = average_gradients(tower_discm_only_grads)
discm_only_train = discm_only_opt.apply_gradients(discm_only_grads,
global_step=global_step)
full_grads = average_gradients(tower_full_grads)
full_train = full_opt.apply_gradients(full_grads, global_step=global_step)
If we call only "compute_gradients(full_loss)", the algorithm works properly on multiple GPUs. This is pretty equivalent to the code in the cifar10_multi_gpu_train.py example. The tricky part comes when need to restrict the network in stage 1 or 2.
Compute_gradients(full_loss), has a var_list parameter with a default value of None, which means it trains all the variables. How does it know to not train Tower_0 variables when in Tower_1? I ask, because when we deal with the compute_gradients( discm_only_loss, var_list = discm_only_train_vars), I need to know how to gather up the correct variables to restrict training to that portion of the network. I found one thread talking about this, but found it to be inaccurate/incomplete - "freeze" some variables/scopes in tensorflow: stop_gradient vs passing variables to minimize.
The reason being, that if you look at the code in compute_gradients, var_list is filled out with is a combination of trainable variables and trainable resource variables when None is passed in. So that's how I've limited it as well. This all works properly if we don't attempt to split across multiple GPUs.
Question 1:
Now that I've split the network by towers, am I responsible for gathering up the current tower as well? Do I need to add a line like this?
discm_only_train_vars= tf.get_collection( tf.GraphKeys.TRAINABLE_VARIABLES, "Tower_{}/discriminator".format( i ) )
discm_only_train_vars= discm_only_train_vars + tf.get_collection( tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "Tower_{}/discriminator".format( i ) )
In order to train the proper variables for tower (and ensure I don't miss the training of those variable?)
Question 2:
Probably the same answer as question 1. Getting "compute_gradients(gen_only_loss)" is a bit harder...in the non towered version, gen_only_loss never touched the discriminator, so it activated the tensors in the graph that it needed and everything was fine. However, in the towered version, when I call "compute_gradients", it returns gradients for tensors it hasn't activated yet - so some of the entries are [(None, tf.Variable), (None, tf.Variable)]. This causes average_gradients to crash because it can't convert a None value to a Tensor. This makes me think I need to restrict these as well.
The confusing thing about all of this is that the cifar example, and my full_loss example does not care about training on specific towers, but I'm guessing once I specify a var_list, any magic that compute_gradients was using to know which variables to train on which towers disappear? Do I need to worry about grabbing any other variables?
For question 1, you are responsible for gathering if you split manually, yes.
For question 2 you might want to restrict the call to compute_gradients or filter the result.
Related
I applied this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset), the turorial did not compute the mean squared error from individual output, so I added the following line in the comparison function:
mean_squared_error(signal_true,signal_pred)
but the loss and mse from the prediction were different from loss and mse from the model.evaluation on the test data. The errors from the model.evaluation (Loss, mae, mse) (test-set):
[0.013499056920409203, 0.07980187237262726, 0.013792216777801514]
the error from individual target (outputs):
Target0 0.167851388666284
Target1 0.6068108648555771
Target2 0.1710370357827747
Target3 2.747463225418181
Target4 1.7965991690103074
Target5 0.9065426398192563
I think it might a problem in training the model but i could not find where is it exactly. I would really appreciate your help.
thanks
There are a number of reasons that you can have differences between the loss for training and evaluation.
Certain ops, such as batch normalization, are disabled on prediction- this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly.
MSE for training is averaged over the entire epoch, while evaluation only happens on the latest "best" version of the model.
It could be due to differences in the datasets if the split isn't random.
You may be using different metrics without realizing it.
I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.
I had the same problem and found a solution. Hopefully this is the same problem you encountered.
It turns out that model.predict doesn't return predictions in the same order generator.labels does, and that is why MSE was much larger when I attempted to calculate manually (using the scikit-learn metric function).
>>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
13.17293930053711
>>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
91.1225401637833
My quick and dirty solution:
valid_generator.reset() # Necessary for starting from first batch
all_labels = []
all_pred = []
for i in range(len(valid_generator)): # Necessary for avoiding infinite loop
x = next(valid_generator)
pred_i = model.predict(x[0])[:,0]
labels_i = x[1]
all_labels.append(labels_i)
all_pred.append(pred_i)
print(np.shape(pred_i), np.shape(labels_i))
cat_labels = np.concatenate(all_labels)
cat_pred = np.concatenate(all_pred)
The result:
>>> mean_squared_error(cat_labels, cat_pred)
13.172956865002352
This can be done much more elegantly, but was enough for me to confirm my hypothesis of the problem and regain some sanity.
I am training a neural network with tensorflow (1.12) in a supervised fashion. I'd like to only train on specific examples. The examples are created on the fly by cutting out subsequences, hence I want to do the conditioning within tensorflow.
This is my original part of code:
train_step, gvs = minimize_clipped(optimizer, loss,
clip_value=FLAGS.gradient_clip,
return_gvs=True)
gradients = [g for (g,v) in gvs]
gradient_norm = tf.global_norm(gradients)
tf.summary.scalar('gradients/norm', gradient_norm)
eval_losses = {'loss1': loss1,
'loss2': loss2}
The training step is later executed as:
batch_eval, _ = sess.run([eval_losses, train_step])
I was thinking about inserting something like
train_step_fake = ????
eval_losses_fake = tf.zeros_like(tensor)
train_step_new = tf.cond(my_cond, train_step, train_step_fake)
eval_losses_new = tf.cond(my_cond, eval_losses, eval_losses_fake)
and then doing
batch_eval, _ = sess.run([eval_losses, train_step])
However, I am not sure how to create a fake train_step.
Also, is this a good idea in general or is there a smoother way of doing this? I am using a tfrecords pipeline, but no other high-level modules (like keras, tf.estimator, eager execution etc.).
Any help is obviously greatly appreciated!
Answering the specific question first. It's certainly possible to only perform your training step based on the tf.cond outcome. Note that the 2nd and 3rd params are lambdas though so more something like:
train_step_new = tf.cond(my_cond, lambda: train_step, lambda: train_step_fake)
eval_losses_new = tf.cond(my_cond, lambda: eval_losses, lambda: eval_losses_fake)
Your instinct that this may not be the right thing to do is correct though.
It's much more preferable (both in terms of efficiency and in terms of reading and reasoning about your code) to filter out the data you want to ignore before it gets to your model in the first place.
This is something you could achieve using the Dataset API. which has a really useful filter() method you could use. If you are using the dataset api to read your TFRecords right now then this should be as simple as adding something along the lines of:
dataset = dataset.filter(lambda x: {whatever op you were going to use in tf.cond})
If you are not yet using the dataset API, now is probably the time to have a little read up on it and consider it rather than butchering the model with that tf.cond() to act as a filter.
Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to that of #simlmx's:
classifier = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=model_dir,
params=params)
train_spec = tf.estimator.TrainSpec(
input_fn = training_data_input_fn,
)
eval_spec = tf.estimator.EvalSpec(
input_fn = validation_data_input_fn,
)
tf.estimator.train_and_evaluate(
classifier,
train_spec,
eval_spec
)
Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.
Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.
How can one (if possible not using tf.contrib functions) designate to terminate training after n number of evaluation calls (n * steps) where the evaluation loss does not improve and then save the "best" model / checkpoint (determined by validation dataset) to a unique file name (e.g. best_validation.checkpoint)
I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):
max_steps_without_decrease: int, maximum number of training steps with
no decrease in the given metric.
eval_dir: If set, directory
containing summary files with eval metrics. By default,
estimator.eval_dir() will be used.
Looking through the code of the hook (version 1.11), though, you find:
def stop_if_no_metric_improvement_fn():
"""Returns `True` if metric does not improve within max steps."""
eval_results = read_eval_metrics(eval_dir) #<<<<<<<<<<<<<<<<<<<<<<<
best_val = None
best_val_step = None
for step, metrics in eval_results.items(): #<<<<<<<<<<<<<<<<<<<<<<<
if step < min_steps:
continue
val = metrics[metric_name]
if best_val is None or is_lhs_better(val, best_val):
best_val = val
best_val_step = step
if step - best_val_step >= max_steps_without_improvement: #<<<<<
tf_logging.info(
'No %s in metric "%s" for %s steps, which is greater than or equal '
'to max steps (%s) configured for early stopping.',
increase_or_decrease, metric_name, step - best_val_step,
max_steps_without_improvement)
return True
return False
What the code does is load the evaluation results (produced with your EvalSpec parameters) and extract the eval results and the global_step (or whichever other custom step you use to count) associated with the specific evaluation record.
This is the source of the training steps part of the docs: the early stopping is not triggered according to the number of non-improving evaluations, but to the number of non-improving evals in a certain step range (which IMHO is a bit counter-intuitive).
So, to recap: Yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen in that number of steps.
Examples with numbers to hopefully clarify more
Let's assume you're training indefinitely long having an evaluation every 1k steps. The specifics of how the evaluation runs is not relevant, as long as it runs every 1k steps producing a metric we want to monitor.
If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.
Since you're running 1 eval every 1k steps, this boils down to early-stopping if there's a sequence of 10 consecutive evals without any improvement.
If then you decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
Keeping the best model
First of all, an important note: this has nothing to do with early stopping, the issue of keeping a copy of the best model through the training and the one of stopping the training once performance start degrading are completely unrelated.
Keeping the best model can be done very easily defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):
serving_input_receiver_fn = ... # define your serving_input_receiver_fn
exporter = tf.estimator.BestExporter(
name="best_exporter",
serving_input_receiver_fn=serving_input_receiver_fn,
exports_to_keep=5) # this will keep the 5 best checkpoints
eval_spec = [tf.estimator.EvalSpec(
input_fn=eval_input_fn,
steps=100,
exporters=exporter,
start_delay_secs=0,
throttle_secs=5)]
If you don't know how to define the serving_input_fn have a look here
This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
I am trying to train a network with tensorflow with multiple towers. I had set reuse = True for all the towers. But in the cifar10 multi gpu train of tensorflow tutorials, the reuse variable has set after the first tower was created:
with tf.variable_scope(tf.get_variable_scope()):
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
# Actually the logits (whole network) is defined in tower_loss
loss = tower_loss(scope, image_batch, label_batch)
# Reuse variables for the next tower.
tf.get_variable_scope().reuse_variables()
Does it make any difference? What happens if we set reuse=True beforehand?
You need to have reuse=False for the first run to generate variables. It is an error if reuse=True but the variable is not yet constructed.
If you use a newer version of tensorflow (>1.4 I think) you can use reuse=tf.AUTO_REUSE and it will do the magic for you.
I'm not sure how this interacts with the multi device setup you have. Double check if the variable names don't become prefixed by the device. In that case there's no reuse, each device has a different variable.
There are two ways to share variables.
Either version 1:
with tf.variable_scope("model"):
output1 = my_image_filter(input1)
with tf.variable_scope("model", reuse=True):
output2 = my_image_filter(input2)
or version 2:
with tf.variable_scope("model") as scope:
output1 = my_image_filter(input1)
scope.reuse_variables()
output2 = my_image_filter(input2)
Both methods share the variable. The second method is used in the Cifar10 tutorial because it is much cleaner (and that's only my opinion). You can try to rebuild it with version 1, the code will probably be less readable.
I am trying to run a tsne analysis on a square distance matrix. These are the commands I am using.
model = TSNE(n_components = 2,perplexity = 32, verbose = 10,n_iter = 1000, metric = "precomputed")
embeddings = model.fit_transform(D)
This is the output I receive: output from tsne function
It looks like the program is running through 75 iterations and then calling it good and quitting. When I plot the data from the tsne it's basically just a single dense blob. Why is the program quitting early and how can I make it run longer?
It's quitting because the exit-condition is reached.
Interpreting the log, the exit-condition is probably a metric on the gradient, called gradient-norm here. If needed, checkout the basics of gradient-descent to understand the intuition. As every iteration is making a step towards the negative of the gradient, tiny gradients will not do much to the objective (and will be interpreted as: we found a local/global minimum).
It looks like (still interpreting your log only):
if np.linalg.norm(gradient) < 1e-4:
return solution
There is no merit to further do more iterations for this parameterization of the optimization-problem. The solution won't get better (in terms of minimization).
You can only try other parameters (resulting in other optimization-problems).