I'm working on a Keras model with images separated into patches.
I have quite a peculiar pipeline:
for i in range(n_iteration):
    print("Epoch:", i, "/", n_iteration)

    start = time.time()
    self.train_batch, self.validation_batch = self.get_batch()
    end = time.time()
    print("Time for loading: ", end - start)

    K.set_value(self.batch_source, self.train_batch[0][:self.batch_size])
    K.set_value(self.batch_target, self.train_batch[0][self.batch_size:])

    pred = self.model.predict(self.train_batch[0])
    K.set_value(self.gamma, self.compute_gamma(pred))

    hist = self.model.train_on_batch(self.train_batch[0], self.train_batch[1])
Based on the prediction of my model at a time t (for a given batch), I need to compute a certain value named gamma. This value is then taken into account in my loss function, but it is not differentiable, so I cannot integrate its computation into the loss function itself.
When measuring the necessary time for loading and training, it appears that the bottleneck is in the loading phase.
My question is: is it possible to load several batches (i.e. run self.get_batch()) while computing the prediction, gamma, and the training step on another batch?
I guess the idea would be to create some kind of queue in which I store my batches, but I don't really know how to do that.
PS: in my get_batch function I'm accessing an HDF5 file; can that cause any trouble with multiprocessing?
Thank you in advance.
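For what it's worth, here is a minimal sketch of the queue idea, not the original code: a single background thread keeps filling a bounded queue.Queue while the main loop trains on batches that were already loaded. The helper name _prefetch_worker and the queue size are made up, and it assumes get_batch (and the underlying HDF5 reads) is only ever touched from that one worker thread.

    import queue
    import threading

    # One background thread fills a bounded queue with (train, validation) batches
    # while the main loop trains on previously loaded ones.
    batch_queue = queue.Queue(maxsize=4)  # small buffer to bound memory usage

    def _prefetch_worker(loader, q, n_iteration):
        # 'loader' stands for self.get_batch; with h5py it is safest to keep all
        # HDF5 access confined to this single thread.
        for _ in range(n_iteration):
            q.put(loader())

    threading.Thread(target=_prefetch_worker,
                     args=(self.get_batch, batch_queue, n_iteration),
                     daemon=True).start()

    for i in range(n_iteration):
        train_batch, validation_batch = batch_queue.get()  # loaded in the background
        K.set_value(self.batch_source, train_batch[0][:self.batch_size])
        K.set_value(self.batch_target, train_batch[0][self.batch_size:])
        pred = self.model.predict(train_batch[0])
        K.set_value(self.gamma, self.compute_gamma(pred))
        hist = self.model.train_on_batch(train_batch[0], train_batch[1])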
I'm training a simple VAE model on 64*64 images and I would like to see the images generated after every epoch or every couple of batches to see the progress.
Currently, when I train the model, I wait until training is done and only then look at the results.
I tried to make a custom callback in Keras that generates an image and saves it, but couldn't get it to work. Is it even possible? I couldn't find anything like it.
It would be awesome if you could refer me to a source that explains how to do this, or show me an example.
Note: I'm interested in a clean keras.callbacks solution, not in manually iterating over every epoch, training, and generating the sample.
If you still need it, you can define a custom callback in Keras as a subclass of keras.callbacks.Callback:
class CustomCallback(keras.callbacks.Callback):
    def __init__(self, save_path, VAE):
        self.save_path = save_path
        self.VAE = VAE

    def on_epoch_end(self, epoch, logs={}):
        # load the image
        # get latent_space with self.VAE.encoder.predict(image)
        # get reconstructed image with self.VAE.decoder.predict(latent_space)
        # plot the reconstructed image with matplotlib.pyplot
Then instantiate the callback as image_callback = CustomCallback(...) and place image_callback in the list of callbacks you pass to model.fit.
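For illustration only, the skeleton could be filled in along these lines; it assumes the VAE object exposes compatible encoder and decoder sub-models and that a fixed sample image is handed to the callback, neither of which is spelled out in the original answer:

    import os
    import numpy as np
    import matplotlib.pyplot as plt
    import keras

    class ReconstructionCallback(keras.callbacks.Callback):
        """Saves a reconstruction of a fixed sample image at the end of each epoch."""

        def __init__(self, save_path, VAE, sample_image):
            super(ReconstructionCallback, self).__init__()
            self.save_path = save_path
            self.VAE = VAE                    # assumed to expose .encoder and .decoder
            self.sample_image = sample_image  # assumed shape: (1, 64, 64, channels)

        def on_epoch_end(self, epoch, logs=None):
            latent_space = self.VAE.encoder.predict(self.sample_image)
            reconstruction = self.VAE.decoder.predict(latent_space)
            plt.imshow(np.squeeze(reconstruction[0]))
            plt.axis('off')
            plt.savefig(os.path.join(self.save_path, 'epoch_%03d.png' % epoch))
            plt.close()

    # usage (hypothetical names):
    # image_callback = ReconstructionCallback('samples/', vae, x_val[:1])
    # vae.model.fit(x_train, x_train, epochs=50, callbacks=[image_callback])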
Yeah, it's actually possible, but I always use matplotlib and a self-defined function for that. For example, something like this:
for steps in range(epochs):
    Train, Test = YourDataGenerator()  # load your images for one loop
    model.fit(Train, Test, batch_size=...)
    result = model.predict(Test_image)
    plt.imshow(result[0, :, :, :])  # Keras always returns [batch, height, width, channels]
    filename1 = '/content/runde2/%s_generated_plot_%06d.png' % (test, (steps + 1))
    plt.savefig(filename1)
    plt.close()
I think there is also a clean keras.callbacks version, but I have always used this approach because you can use other libraries for easier data augmentation per loop. That's just my opinion; hope it helps you at least a bit.
Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to that of #simlmx's:
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)

train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
)

eval_spec = tf.estimator.EvalSpec(
    input_fn=validation_data_input_fn,
)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)
Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.
Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.
How can one (preferably without tf.contrib functions) make training terminate after n evaluation calls (n * steps) in which the evaluation loss does not improve, and then save the "best" model / checkpoint (determined by the validation dataset) to a unique file name (e.g. best_validation.checkpoint)?
I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):
max_steps_without_decrease: int, maximum number of training steps with no decrease in the given metric.
eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.
Looking through the code of the hook (version 1.11), though, you find:
def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""
    eval_results = read_eval_metrics(eval_dir)  # <<<<<<<<<<<<<<<<<<<<<<<
    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items():  # <<<<<<<<<<<<<<<<<<<<<<<
        if step < min_steps:
            continue
        val = metrics[metric_name]
        if best_val is None or is_lhs_better(val, best_val):
            best_val = val
            best_val_step = step
        if step - best_val_step >= max_steps_without_improvement:  # <<<<<
            tf_logging.info(
                'No %s in metric "%s" for %s steps, which is greater than or equal '
                'to max steps (%s) configured for early stopping.',
                increase_or_decrease, metric_name, step - best_val_step,
                max_steps_without_improvement)
            return True
    return False
What the code does is load the evaluation results (produced according to your EvalSpec parameters) and extract, for each evaluation record, the eval metrics and the global_step (or whichever other custom step you use for counting).
This is the source of the training steps part of the docs: early stopping is not triggered by the number of non-improving evaluations, but by the absence of improvement within a certain range of training steps (which IMHO is a bit counter-intuitive).
So, to recap: yes, the early-stopping hook uses the evaluation results to decide when to cut off training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen within that number of steps.
Examples with numbers to hopefully clarify more
Let's assume you're training indefinitely, with an evaluation every 1k steps. The specifics of how the evaluation runs are not relevant, as long as it runs every 1k steps and produces a metric we want to monitor.
If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.
Since you're running 1 eval every 1k steps, this boils down to early-stopping if there's a sequence of 10 consecutive evals without any improvement.
If then you decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
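To make the wiring concrete, here is roughly how the hook from the example above would be attached to the TrainSpec of the question; the metric name and the step count are illustrative values, not prescriptions:

    early_stopping = tf.contrib.estimator.stop_if_no_decrease_hook(
        classifier,                        # the tf.estimator.Estimator from the question
        metric_name='loss',                # any metric present in the eval results
        max_steps_without_decrease=10000)  # a range of *training steps*, not a number of evals

    train_spec = tf.estimator.TrainSpec(
        input_fn=training_data_input_fn,
        hooks=[early_stopping])

    tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)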
Keeping the best model
First of all, an important note: this has nothing to do with early stopping. Keeping a copy of the best model throughout training and stopping training once performance starts degrading are completely unrelated issues.
Keeping the best model can be done very easily by defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):
serving_input_receiver_fn = ...  # define your serving_input_receiver_fn
exporter = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    exports_to_keep=5)  # this will keep the 5 best checkpoints

eval_spec = [tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)]
If you don't know how to define the serving_input_receiver_fn, have a look here.
This allows you to keep the 5 overall best models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
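As a quick, purely illustrative sketch (not taken from the linked docs), a raw serving input receiver for a fixed-size image input could look like this; the feature key 'x' and the shape are assumptions:

    def serving_input_receiver_fn():
        # 'x' must match the feature key your model_fn expects.
        inputs = {'x': tf.placeholder(dtype=tf.float32,
                                      shape=[None, 64, 64, 3],
                                      name='input_images')}
        return tf.estimator.export.ServingInputReceiver(inputs, inputs)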
I know this question has been asked more than once, but I couldn't understand the code or the logic behind it.
For my dataset, I first created a hidden layer with a sigmoid activation, then connected it to the output layer, which uses a softmax activation.
fl = tf.layers.dense(x, 10,activation=tf.sigmoid)
output = tf.layers.dense(fl, 2,activation=tf.nn.softmax)
I've created the loss and accuracy ops, initialized the variables, set up the optimizer and train op, then started running on my data:
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=output)
accuracy = tf.metrics.accuracy(tf.argmax(y_train, 1), tf.argmax(output, 1))

# inits
init_local = tf.local_variables_initializer()
init_global = tf.global_variables_initializer()
sess.run(init_global)
sess.run(init_local)

optimizer = tf.train.GradientDescentOptimizer(rate)
train = optimizer.minimize(loss)

for i in range(1000):
    _, lv = sess.run((train, loss))
    if i % 5 == 0:
        print("L: " + str(lv))
        print("Accuracy: " + str(sess.run(accuracy)))
I can see that my loss value decreases every time I run on the training set. And my accuracy is ~0.93.
The problem is that, from here, I don't know how to test this model on real data.
Also, how can I draw a histogram of my real data? I have correct labels for my real data as well.
I will assume that you use Dataset to feed your training data and you want to run on test data immediately after training (since you don't have checkpoints in your code).
When using Dataset, you would create an iterator and call get_next() on it. Then, you would use the return values of get_next() as inputs to your model.
To run your model on the test data, you can use two high-level approaches:
If your test data has the same format as your train data, create a dataset that reads your test data. Then, create another copy (sometimes called a "tower") of your model (operations will be new but variables will be shared) that uses the test Dataset. Then, use sess.run() similarly to how you use it for training - you might not need to compute loss or train, but only accuracy.
If your test data has a different format, you can feed it directly by using the feed_dict argument to sess.run(). You would feed your test data as values for the tensors returned from get_next(). Usually one feeds placeholders, but TensorFlow allows you to feed any tensor, as sketched below.
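A minimal sketch of that second approach; the iterator, build_model and the test arrays are hypothetical names, and it assumes one-hot labels as in the question:

    features_t, labels_t = iterator.get_next()   # tensors the model was built on
    logits = build_model(features_t)             # hypothetical model-building function
    accuracy, accuracy_update = tf.metrics.accuracy(
        labels=tf.argmax(labels_t, 1), predictions=tf.argmax(logits, 1))

    # During testing, override the iterator outputs with numpy arrays:
    sess.run(accuracy_update, feed_dict={features_t: x_test, labels_t: y_test})
    print("Test accuracy:", sess.run(accuracy))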
As for histograms, TensorBoard has a nice way of visualizing them: https://www.tensorflow.org/programmers_guide/tensorboard_histograms.
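Continuing the sketch above, writing a histogram summary that TensorBoard can display might look like this (the log directory and the choice of tensor are placeholders):

    tf.summary.histogram('predictions', output)   # 'output' is the softmax layer from the question
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter('logs/test', sess.graph)

    summary = sess.run(merged, feed_dict={features_t: x_test, labels_t: y_test})
    writer.add_summary(summary, global_step=0)
    writer.close()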
All code assumes TensorFlow 1.3 and Python 3.x.
We are working on a GAN algorithm which has an interesting loss function.
Stage 1 - Compute only the completion/generator loss portion of the network. Iterate over the completion portion of the GAN for X iterations.
Stage 2 - Compute only the discriminator loss portion of the network. Iterate over the discriminator portion for Y iterations (but don't train the Stage 1 variables).
Stage 3 - Compute the full loss on the network. Iterate over both completion and discriminator for Z iterations (training the entire network).
We have this working on a single GPU. We want to make it work multi-GPU since training times are long.
We have looked at the Tensorflow/models/tutorials/Images/cifar10/cifar10_multi_gpu_train.py, which talks about tower loss, averaging the towers together, computing your gradients on the GPUs then applying them on the CPU. This is a great start. However, since our loss is more complicated, it has complicated everything a bit for us.
The code is decently complicated, but is roughly similar to this: https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN (that won't run as-is because it was written around TensorFlow 0.1 and has some oddities I haven't gotten working, but it should give you an idea of what we're doing).
When we compute gradients, it looks something like this (pseudocode to try to highlight the important portions):
for i in range(num_gpus):
    with tf.device('/gpu:%d' % gpus[i]):
        with tf.name_scope('Tower_%d' % gpus[i]) as scope:
            with tf.variable_scope("generator"):
                generator = build_generator()

            with tf.variable_scope("discriminator"):
                with tf.variable_scope("real_discriminator"):
                    real_discriminator = build_discriminator(x)
                with tf.variable_scope("fake_discriminator", reuse=True):
                    fake_discriminator = build_discriminator(generator)

            gen_only_loss, discm_only_loss, full_loss = build_loss(
                generator, real_discriminator, fake_discriminator)

            tf.get_variable_scope().reuse_variables()

            gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
            tower_gen_only_grads.append(gen_only_grads)

            discm_only_train_vars = tf.get_collection(
                tf.GraphKeys.TRAINABLE_VARIABLES, "discriminator")
            discm_only_train_vars = discm_only_train_vars + tf.get_collection(
                tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "discriminator")

            discm_only_grads = discm_only_opt.compute_gradients(
                discm_only_loss, var_list=discm_only_train_vars)
            tower_discm_only_grads.append(discm_only_grads)

            full_grads = full_opt.compute_gradients(full_loss)
            tower_full_grads.append(full_grads)
# average_gradients is the same code from the cifar10_multi_gpu_train.py.
# We haven't changed it. It just iterates over the gradients and averages
# them... this is part of the problem...
gen_only_grads = average_gradients(tower_gen_only_grads)
gen_only_train = gen_only_opt.apply_gradients(gen_only_grads,
                                              global_step=global_step)

discm_only_grads = average_gradients(tower_discm_only_grads)
discm_only_train = discm_only_opt.apply_gradients(discm_only_grads,
                                                  global_step=global_step)

full_grads = average_gradients(tower_full_grads)
full_train = full_opt.apply_gradients(full_grads, global_step=global_step)
If we call only compute_gradients(full_loss), the algorithm works properly on multiple GPUs. This is pretty much equivalent to the code in the cifar10_multi_gpu_train.py example. The tricky part comes when we need to restrict the network in stage 1 or 2.
compute_gradients(full_loss) has a var_list parameter with a default value of None, which means it trains all the variables. How does it know not to train Tower_0 variables when in Tower_1? I ask because, when we deal with compute_gradients(discm_only_loss, var_list=discm_only_train_vars), I need to know how to gather up the correct variables to restrict training to that portion of the network. I found one thread talking about this, but found it to be inaccurate/incomplete - "freeze" some variables/scopes in tensorflow: stop_gradient vs passing variables to minimize.
The reason being that, if you look at the code in compute_gradients, var_list is filled with a combination of trainable variables and trainable resource variables when None is passed in. So that's how I've limited it as well. This all works properly as long as we don't attempt to split across multiple GPUs.
Question 1:
Now that I've split the network by towers, am I responsible for gathering up the current tower's variables as well? Do I need to add a line like this?
discm_only_train_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, "Tower_{}/discriminator".format(i))
discm_only_train_vars = discm_only_train_vars + tf.get_collection(
    tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "Tower_{}/discriminator".format(i))
in order to train the proper variables for that tower (and ensure I don't miss training those variables)?
Question 2:
Probably the same answer as question 1. Getting compute_gradients(gen_only_loss) right is a bit harder... in the non-towered version, gen_only_loss never touched the discriminator, so it only activated the tensors in the graph that it needed and everything was fine. However, in the towered version, when I call compute_gradients, it returns gradients for variables it hasn't touched - so some of the entries are [(None, tf.Variable), (None, tf.Variable)]. This causes average_gradients to crash because it can't convert a None value to a Tensor. This makes me think I need to restrict these as well.
The confusing thing about all of this is that the cifar example, and my full_loss example, do not care about training on specific towers, but I'm guessing that once I specify a var_list, whatever magic compute_gradients was using to know which variables to train on which towers disappears? Do I need to worry about grabbing any other variables?
For question 1: yes, if you split manually, you are responsible for gathering the variables of the current tower yourself.
For question 2: you might want to restrict the call to compute_gradients, or filter the result.
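To illustrate the "filter the result" option (a sketch, not part of the original answer): drop the (None, variable) pairs before handing each tower's list to average_gradients. This assumes the same variables get a None gradient on every tower, so the per-tower lists stay aligned.

    gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
    # Keep only entries that actually have a gradient on this tower.
    gen_only_grads = [(g, v) for g, v in gen_only_grads if g is not None]
    tower_gen_only_grads.append(gen_only_grads)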
I am training a CNN with TensorFlow for a medical imaging application.
As I don't have a lot of data, I am trying to apply random modifications to my training batch during the training loop to artificially increase my training dataset. I made the following function in a different script and call it on my training batch:
def randomly_modify_training_batch(images_train_batch, batch_size):
    for i in range(batch_size):
        image = images_train_batch[i]
        image_tensor = tf.convert_to_tensor(image)

        distorted_image = tf.image.random_flip_left_right(image_tensor)
        distorted_image = tf.image.random_flip_up_down(distorted_image)
        distorted_image = tf.image.random_brightness(distorted_image, max_delta=60)
        distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)

        with tf.Session():
            # .eval() is used to convert the image back from a Tensor to an ndarray
            images_train_batch[i] = distorted_image.eval()

    return images_train_batch
The code works well for applying modifications to my images.
The problem is:
After each iteration of my training loop (feedforward + backpropagation), applying this same function to my next training batch steadily takes about 5 seconds longer than the previous time.
It starts at around 1 second of processing and exceeds a minute of processing after a bit more than 10 iterations.
What causes this slowdown?
How can I prevent it?
(I suspect something with distorted_image.eval(), but I'm not quite sure. Am I opening a new session each time? Isn't TensorFlow supposed to close the session automatically, since I use it in a "with tf.Session()" block?)
You call that code in each iteration, so on every iteration you add these operations to the graph. You don't want to do that. You want to build the graph once at the start and only execute it in the training loop. Also, why do you need to convert back to an ndarray afterwards, instead of putting these ops into your TF graph once and using tensors all the way through?
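A minimal sketch of that suggestion (the placeholder shape is an assumption): build the distortion ops once on a placeholder, open the session once, and only run the existing ops inside the loop.

    # Build the augmentation graph once, outside the training loop.
    image_ph = tf.placeholder(tf.float32, shape=[None, None, 3])  # single HWC image
    distorted = tf.image.random_flip_left_right(image_ph)
    distorted = tf.image.random_flip_up_down(distorted)
    distorted = tf.image.random_brightness(distorted, max_delta=60)
    distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)

    sess = tf.Session()

    def randomly_modify_training_batch(images_train_batch, batch_size):
        # Only executes the existing ops; no new nodes are added per call.
        for i in range(batch_size):
            images_train_batch[i] = sess.run(
                distorted, feed_dict={image_ph: images_train_batch[i]})
        return images_train_batch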