How to write to TensorBoard in TensorFlow 2 - python

I'm quite familiar with TensorFlow 1.x and I'm considering switching to TensorFlow 2 for an upcoming project. I'm having some trouble understanding how to write scalars to TensorBoard logs with eager execution, using a custom training loop.
Problem description
In tf1 you would create some summary ops (one op for each thing you would want to store), which you would then merge into a single op, run that merged op inside a session and then write this to a file using a FileWriter object. Assuming sess is our tf.Session(), an example of how this worked can be seen below:
# While defining our computation graph, define summary ops:
# ... some ops ...
tf.summary.scalar('scalar_1', scalar_1)
# ... some more ops ...
tf.summary.scalar('scalar_2', scalar_2)
# ... etc.
# Merge all these summaries into a single op:
merged = tf.summary.merge_all()
# Define a FileWriter (i.e. an object that writes summaries to files):
writer = tf.summary.FileWriter(log_dir, sess.graph)
# Inside the training loop run the op and write the results to a file:
for i in range(num_iters):
    summary, ... = sess.run([merged, ...], ...)
    writer.add_summary(summary, i)
The problem is that sessions don't exist anymore in tf2 and I would prefer not to disable eager execution to make this work. The official documentation is written for tf1 and all references I can find suggest using the TensorBoard Keras callback. However, as far as I know, this only works if you train the model through model.fit(...) and not through a custom training loop.
What I've tried
The tf1 version of tf.summary functions, outside of a session. Obviously any combination of these functions fails, as FileWriters, merge_ops, etc. don't even exist in tf2.
This Medium post states that there has been a "cleanup" in some TensorFlow APIs, including tf.summary. It suggests importing from tensorflow.python.ops.summary_ops_v2, which doesn't seem to work. This implies using record_summaries_every_n_global_steps; more on this later.
A series of other posts 1, 2, 3, suggest using the tf.contrib.summary and tf.contrib.FileWriter. However, tf.contrib has been removed from the core TensorFlow repository and build process.
A TensorFlow v2 showcase from the official repo, which again uses the tf.contrib summaries along with the record_summaries_every_n_global_steps mentioned previously. I couldn't get this to work either (even without using the contrib library).
tl;dr
My questions are:
Is there a way to properly use tf.summary in TensorFlow 2?
If not, is there another way to write TensorBoard logs in TensorFlow 2, when using a custom training loop (not model.fit())?

Yes, there is a simpler and more elegant way to use summaries in TensorFlow v2.
First, create a file writer that stores the logs (e.g. in a directory named log_dir):
writer = tf.summary.create_file_writer(log_dir)
Anywhere you want to write something to the log file (e.g. a scalar) use your good old tf.summary.scalar inside a context created by the writer. Suppose you want to store the value of scalar_1 for step i:
with writer.as_default():
    tf.summary.scalar('scalar_1', scalar_1, step=i)
You can open as many of these contexts as you like inside or outside of your training loop.
Example:
# create the file writer object
writer = tf.summary.create_file_writer(log_dir)
for i, (x, y) in enumerate(train_set):
    with tf.GradientTape() as tape:
        y_ = model(x)
        loss = loss_func(y, y_)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # write the loss value
    with writer.as_default():
        tf.summary.scalar('training loss', loss, step=i+1)
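For anyone who wants to run the pattern above end to end, here is a minimal self-contained sketch. The toy data, the one-layer model, the SGD optimizer and the temporary log_dir are all assumptions made up for illustration; only create_file_writer, as_default and tf.summary.scalar come from the answer itself.

```python
import glob
import os
import tempfile

import tensorflow as tf

# Assumption: a throwaway directory stands in for your real log_dir.
log_dir = tempfile.mkdtemp()
writer = tf.summary.create_file_writer(log_dir)

# Hypothetical toy regression problem (y = 2x) standing in for train_set.
x_all = tf.random.normal((64, 1))
y_all = 2.0 * x_all
train_set = tf.data.Dataset.from_tensor_slices((x_all, y_all)).batch(16)

# Hypothetical model, optimizer and loss, chosen only to keep the sketch short.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_func = tf.keras.losses.MeanSquaredError()

for step, (x, y) in enumerate(train_set, start=1):
    with tf.GradientTape() as tape:
        loss = loss_func(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Every summary written inside this context goes to `writer`.
    with writer.as_default():
        tf.summary.scalar("training_loss", loss, step=step)

# Make sure buffered events reach the disk before TensorBoard reads them.
writer.flush()
event_files = glob.glob(os.path.join(log_dir, "events.out.tfevents.*"))
```

Pointing TensorBoard at the directory (tensorboard --logdir followed by the path in log_dir) should then show the training_loss curve.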

Related

How to use evaluate and predict functions in keras implementation of SincNet?

Thanks for your attention. I'm developing an automatic speaker recognition system using SincNet.
Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
Since the network is implemented in PyTorch, I searched for and found a Keras implementation here: https://github.com/grausof/keras-sincnet. I adapted the train.py code to train a SincNet on my own data in TensorFlow 2.0, and it worked fine. I saved only the weights of my trained network. Per batch, my training data has shape (128, 3200, 1) for the inputs and (128,) for the labels.
# Creates a Sincnet model with input_size=3200 (wlen), num_classes=40, fs=16000
redsinc = create_model(wlen, num_classes, fs)
# Saves only weights and stopearly callback
checkpointer = ModelCheckpoint(filepath='checkpoints/SincNetBiomex3.hdf5', verbose=1,
                               save_best_only=True, monitor='val_accuracy', save_weights_only=True)
stopearly = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)
callbacks = [checkpointer, stopearly]
# optimizer = RMSprop(lr=learnrate, rho=0.9, epsilon=1e-8)
optimizer = Adam(learning_rate=learnrate)
# Creates generator of training batches
train_generator = batchGenerator(batch_size, train_inputs, train_labels, wlen)
validinputs, validlabels = create_batches_rnd(validation_labels.shape[0],
                                              validation_inputs, validation_labels, wlen)
# Compiling model and train with function fit_generator
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = redsinc.fit_generator(train_generator, steps_per_epoch=N_batches, epochs=epochs,
                                verbose=1, callbacks=callbacks, validation_data=(validinputs, validlabels))
The problem came when I tried to evaluate the network. I didn't use the code in test.py; I only loaded the weights I had previously saved and used the evaluate function. My test data had shape (1200, 3200, 1) for the inputs and (1200,) for the labels.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
test_loss, test_accuracy = redsinc.evaluate(x=eval_in,y=eval_lab)
RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer, loss)`.
Then I added the same compile code I used for training:
optimizer = Adam(learning_rate=0.001)
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Then rerun the test code and got this:
WARNING:tensorflow:From C:\Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
ValueError: A tf.Variable created inside your tf.function has been garbage-collected. Your code needs to keep Python references to variables created inside `tf.function`s.
A common way to raise this error is to create and return a variable only referenced inside your function:
@tf.function
def f():
    v = tf.Variable(1.0)
    return v
v = f() # Crashes with this error message!
The reason this crashes is that @tf.function annotated function returns a **`tf.Tensor`** with the **value** of the variable when the function is called rather than the variable instance itself. As such there is no code holding a reference to the `v` created inside the function and Python garbage collects it.
The simplest way to fix this issue is to create variables outside the function and capture them:
v = tf.Variable(1.0)
@tf.function
def f():
    return v
f() # <tf.Tensor: ... numpy=1.>
v.assign_add(1.)
f() # <tf.Tensor: ... numpy=2.>
I don't understand the error, since I've evaluated other networks with the same function and never had any problems. I then decided to use the predict function to match the predicted labels with the correct labels and compute all the metrics with my own code, but I got another error.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
print('Model loaded')
#Predict labels with test data
predict_labels = redsinc.predict(eval_in)
Error while reading resource variable _AnonymousVar212 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar212/class tensorflow::Var does not exist.
[[node sinc_conv1d/concat_104/ReadVariableOp (defined at \Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_13649]
Function call stack:
keras_scratch_graph
I hope someone can tell me what these errors mean and how to solve them. I've searched for solutions, but most of what I've found doesn't seem related to my problem, so I can't apply it. I'm guessing the errors are caused by the SincNet layer code, because it is a custom-coded layer. The code for the SincNet layer can be found in the GitHub repository, in the file sincnet.py.
I appreciate any help I can get. Again, thank you for your attention.
You should downgrade your TensorFlow and Keras versions; that worked for me when I faced the same problem.
Try keras==2.1.6 and tensorflow-gpu==1.13.1.

How to use feed_dict in Tensorflow multiple GPU case

Recently I have been trying to learn how to use TensorFlow on multiple GPUs to speed up training. I found an official tutorial about training a classification model on the CIFAR-10 dataset. However, this tutorial reads images using a queue. Out of curiosity, how can I use multiple GPUs while feeding values into the Session? I find it hard to work out how to feed different values from the same dataset to different GPUs. Thank you, everybody! The following code is part of the official tutorial.
images, labels = cifar10.distorted_inputs()
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels], capacity=2 * FLAGS.num_gpus)
# Calculate the gradients for each model tower.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                # Dequeues one batch for the GPU
                image_batch, label_batch = batch_queue.dequeue()
                # Calculate the loss for one tower of the CIFAR model. This function
                # constructs the entire CIFAR model but shares the variables across
                # all towers.
                loss = tower_loss(scope, image_batch, label_batch)
                # Reuse variables for the next tower.
                tf.get_variable_scope().reuse_variables()
                # Retain the summaries from the final tower.
                summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                # Calculate the gradients for the batch of data on this CIFAR tower.
                grads = opt.compute_gradients(loss)
                # Keep track of the gradients across all towers.
                tower_grads.append(grads)
The core idea of the multi-GPU example is that you explicitly assign operations to a tf.device. The example loops over FLAGS.num_gpus devices and creates a replica for each of the GPUs.
If you create placeholder ops inside the for loop, they will get assigned to their respective devices. All you need to do is keep handles to the created placeholders and then feed them all independently in a single session.run call.
placeholders = []
for i in range(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        plc = tf.placeholder(tf.int32)
        placeholders.append(plc)
with tf.Session() as sess:
    fd = {plc: i for i, plc in enumerate(placeholders)}
    sess.run(sum(placeholders), feed_dict=fd)  # this should give you the sum of all
                                               # numbers from 0 to FLAGS.num_gpus - 1
To address your specific example, it should suffice to replace the batch_queue.dequeue() call with the construction of two placeholders (for image_batch and label_batch tensors), store these placeholders somewhere, and then feed the values you need to those.
Another (somewhat hacky) way is to override the image_batch and label_batch tensors directly in the session.run call, because you can feed_dict any tensor (not just a placeholder). You will still need to store the tensors somewhere to be able to reference them from the run call.
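To make the placeholder approach concrete, here is a rough sketch that splits one batch across per-tower placeholders and feeds every shard in a single run call. It is written against tf.compat.v1 so it also runs on a TensorFlow 2 install. NUM_GPUS, the input shapes and the stand-in loss are assumptions for illustration (the real tutorial would call tower_loss instead), and allow_soft_placement is set so the sketch falls back to the CPU on a machine without GPUs.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
NUM_GPUS = 2  # assumption: stands in for FLAGS.num_gpus

graph = tf.Graph()
with graph.as_default():
    image_placeholders, label_placeholders, tower_losses = [], [], []
    for i in range(NUM_GPUS):
        with tf1.device('/gpu:%d' % i):
            # One pair of placeholders per tower replaces batch_queue.dequeue().
            image_batch = tf1.placeholder(tf.float32, [None, 4], name='images_%d' % i)
            label_batch = tf1.placeholder(tf.float32, [None], name='labels_%d' % i)
            # Stand-in for tower_loss(scope, image_batch, label_batch).
            loss = tf.reduce_mean(tf.reduce_sum(image_batch, axis=1) - label_batch)
            image_placeholders.append(image_batch)
            label_placeholders.append(label_batch)
            tower_losses.append(loss)
    total_loss = tf.add_n(tower_losses)

# Hypothetical batch of 8 examples, split into one shard per tower.
batch_images = np.ones((8, 4), dtype=np.float32)
batch_labels = np.zeros((8,), dtype=np.float32)
image_shards = np.array_split(batch_images, NUM_GPUS)
label_shards = np.array_split(batch_labels, NUM_GPUS)

feed_dict = {}
for plc, shard in zip(image_placeholders, image_shards):
    feed_dict[plc] = shard
for plc, shard in zip(label_placeholders, label_shards):
    feed_dict[plc] = shard

config = tf1.ConfigProto(allow_soft_placement=True)
with tf1.Session(graph=graph, config=config) as sess:
    # All towers receive their own shard in this one call.
    result = sess.run(total_loss, feed_dict=feed_dict)
```

The important part is only the bookkeeping: keep the placeholders in lists in tower order, and build the feed_dict by zipping them with the shards.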
The QueueRunner and queue-based APIs are relatively outdated, as the TensorFlow docs state clearly:
Input pipelines using the queue-based APIs can be cleanly
replaced by the tf.data API
As a result, it is recommended to use the tf.data API, which is optimized for multi-GPU and TPU use.
How to use it?
dataset = tf.data.Dataset.from_tensor_slices((x_train,y_train))
iterator = dataset.make_one_shot_iterator()
x,y = iterator.get_next()
# define your model
logits = tf.layers.dense(x, 2)  # use x directly in your model
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
train_step = tf.train.AdamOptimizer().minimize(cost)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize model and optimizer variables
    sess.run(train_step)
You can create one iterator per GPU with Dataset.shard(), or more easily use the Estimator API.
For a complete tutorial see here.
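As a small illustration of the Dataset.shard() idea, the sketch below splits one dataset into disjoint per-GPU views. It uses TensorFlow 2 eager mode only to print the contents easily; NUM_GPUS and the Dataset.range toy data are assumptions, and in a real graph-mode setup you would build one iterator per shard inside each tf.device block.

```python
import tensorflow as tf

NUM_GPUS = 2  # assumption: number of towers

# Hypothetical toy dataset: the integers 0..7 stand in for training examples.
dataset = tf.data.Dataset.range(8)

# shard(num_shards, index) keeps every num_shards-th element starting at index,
# so the shards are disjoint and together cover the whole dataset.
shards = [dataset.shard(num_shards=NUM_GPUS, index=i) for i in range(NUM_GPUS)]

shard_contents = [[int(v) for v in shard.as_numpy_iterator()] for shard in shards]
print(shard_contents)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Sharding before shuffling and batching is usually the safer order, since it guarantees that no example is seen by two towers in the same pass.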

TensorFlow Estimator: how to use tf.graph_util.convert_variables_to_constants

I would like to know if it is possible to use the function tf.graph_util.convert_variables_to_constants (in order to store the frozen version of the graph) in a train/evaluation loop while using a custom estimator. For example:
best_validation_accuracy = -1
for _ in range(steps // how_often_validation):
    # Train the model
    estimator.train(input_fn=train_input_fn, steps=how_often_validation)
    # Evaluate the model
    validation_accuracy = estimator.evaluate(input_fn=eval_input_fn)
    # Save best model
    if validation_accuracy["accuracy"] > best_validation_accuracy:
        best_validation_accuracy = validation_accuracy["accuracy"]
        # Save best model performance
        # I WANT TO USE tf.graph_util.convert_variables_to_constants HERE
To use the function tf.graph_util.convert_variables_to_constants, you need the graph and the session of your model.
After going through the TensorFlow code defining the estimators, it appears that:
This code is deprecated,
The graph is created on the fly and not easily accessible (at least, I was not able to retrieve it).
Thus, we will have to use the good old method.
When you call estimator.train, checkpoints of your model are saved in a specified directory (estimator.model_dir). You can use those files to access the graph and session and freeze the variables as follows:
1. Load meta graph
saver = tf.train.import_meta_graph('/path/to/meta')
2. Load weights
sess = tf.Session()  # note the parentheses: this must be a Session instance
saver.restore(sess, '/path/to/weights')
3. Freeze variables
tf.graph_util.convert_variables_to_constants(sess,
                                             sess.graph.as_graph_def(),
                                             ['output'])
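Put together, the three steps could look like the sketch below. It is written against tf.compat.v1 so it runs on a TensorFlow 2 install, and it first saves a tiny hand-built graph only so that there is a checkpoint to load. In the Estimator case you would skip that part and point latest_checkpoint at estimator.model_dir. The tensor names "input" and "output" and the one-variable model are assumptions; use the real name of your output node.

```python
import os
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1

# Assumption: a temporary directory stands in for estimator.model_dir.
model_dir = tempfile.mkdtemp()

# Stand-in for training: build a tiny graph and save one checkpoint.
with tf.Graph().as_default():
    x = tf1.placeholder(tf.float32, [None, 3], name="input")
    w = tf1.get_variable("w", shape=[3, 2])
    tf.identity(tf.matmul(x, w), name="output")
    with tf1.Session() as sess:
        sess.run(tf1.global_variables_initializer())
        tf1.train.Saver().save(sess, os.path.join(model_dir, "model.ckpt"))

# The actual freezing procedure.
checkpoint = tf1.train.latest_checkpoint(model_dir)
with tf.Graph().as_default():
    with tf1.Session() as sess:
        # 1. Load the meta graph into the session's graph.
        saver = tf1.train.import_meta_graph(checkpoint + ".meta")
        # 2. Load the weights.
        saver.restore(sess, checkpoint)
        # 3. Freeze the variables needed to compute the "output" node.
        frozen_graph_def = tf1.graph_util.convert_variables_to_constants(
            sess, sess.graph.as_graph_def(), ["output"])

# Write the frozen graph next to the checkpoints.
frozen_path = os.path.join(model_dir, "frozen_graph.pb")
with open(frozen_path, "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```

Inside the train/evaluate loop from the question, this would go in the branch that detects a new best validation accuracy.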

Adding Tensorboard summaries from graph ops generated inside Dataset map() function calls

I've found the Dataset.map() functionality pretty nice for setting up pipelines to preprocess image/audio data before feeding into the network for training, but one issue I have is accessing the raw data before the preprocessing to send to tensorboard as a summary.
For example, say I have a function that loads audio data, does some framing, makes a spectrogram, and returns this.
import tensorflow as tf
def load_audio_examples(label, path):
    # loads audio, converts to spectrogram
    pcm = ... # this is what I'd like to put into tf.summary.audio() !
    # creates one-hot encoded labels, etc
    return labels, examples
# create dataset
training = tf.data.Dataset.from_tensor_slices((
    tf.constant(labels),
    tf.constant(paths)
))
training = training.map(load_audio_examples, num_parallel_calls=4)
# create ops for training
train_step = # ...
accuracy = # ...
# create iterator
iterator = training.repeat().make_one_shot_iterator()
next_element = iterator.get_next()
# ready session
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
train_writer = # ...
# iterator
test_iterator = testing.make_one_shot_iterator()
test_next_element = test_iterator.get_next()
# train loop
for i in range(100):
    batch_ys, batch_xs, path = sess.run(next_element)
    summary, train_acc, _ = sess.run([summaries, accuracy, train_step],
                                     feed_dict={x: batch_xs, y: batch_ys})
    train_writer.add_summary(summary, i)
It appears as though this does not become part of the graph that is plotted in the "Graph" tab of tensorboard (see screenshot below).
As you can see, it's just X (the output of the preprocessing map() function).
How would I better structure this to get the raw audio into a tf.summary.audio()? Right now the things inside map() aren't accessible as Tensors inside my training loop.
Also, why isn't my graph showing up on TensorBoard? It worries me that I won't be able to export my model or use TensorFlow Serving to put it into production because I'm using the new Dataset API. Maybe I should go back to doing things manually (with queues, etc.)?
I think your use of Dataset API doesn't make much sense. In fact you have 2 disconnected subgraphs. One for reading data and the other for running your training step.
batch_ys, batch_xs, path = sess.run(next_element)
summary, train_acc, _ = sess.run([summaries, accuracy, train_step],
                                 feed_dict={x: batch_xs, y: batch_ys})
The first line in the code above runs session and fetches data items from it. It transfers data from Tensorflow backend into Python.
The next line feeds data using feed_dict and that is said to be inefficient. This time TensorFlow transfers data from Python to runtime.
This has the following consequences:
Your graph looks disconnected
TensorFlow wastes time doing unnecessary data transfer to and from Python.
To have a single graph (without disconnected subgraphs) you need to build your model on top of tensors returned by Dataset API. Please note that it is possible to switch between training and testing datasets without manual fetching of batches (see Dataset guide)
As for the summary defined in map_fn, I believe you can retrieve it from the SUMMARIES collection (the default collection for summaries). You can also pass your own collection name when adding the summary operation.
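To show what "building the model on top of the tensors returned by the Dataset API" can look like, here is a minimal sketch using tf.compat.v1 so it runs on a TensorFlow 2 install. The toy data, the one-variable linear model and the temporary log directory are assumptions for illustration. It only demonstrates the single connected graph without feed_dict; it does not cover summaries created inside map().

```python
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

# Hypothetical toy data standing in for the preprocessed examples and labels.
features = np.random.rand(32, 4).astype(np.float32)
targets = features.sum(axis=1, keepdims=True)

with tf.Graph().as_default():
    dataset = tf.data.Dataset.from_tensor_slices((features, targets))
    dataset = dataset.batch(8).repeat()
    x, y = tf1.data.make_one_shot_iterator(dataset).get_next()

    # The model consumes the iterator's tensors directly: no placeholders,
    # no feed_dict, and therefore one connected graph.
    w = tf1.get_variable("w", shape=[4, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_step = tf1.train.GradientDescentOptimizer(0.1).minimize(loss)

    tf1.summary.scalar("loss", loss)
    summaries = tf1.summary.merge_all()

    with tf1.Session() as sess:
        sess.run(tf1.global_variables_initializer())
        # Passing sess.graph makes the graph appear in TensorBoard's Graph tab.
        train_writer = tf1.summary.FileWriter(tempfile.mkdtemp(), sess.graph)
        for i in range(5):
            # One run call pulls a batch, trains on it and evaluates the summary.
            summary, loss_value, _ = sess.run([summaries, loss, train_step])
            train_writer.add_summary(summary, i)
        train_writer.close()
```

Each sess.run here advances the iterator by one batch, so there is no round trip of the data through Python.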

Tensorflow - Using tf.summary with 1.2 Estimator API

I'm trying to add some TensorBoard logging to a model which uses the new tf.estimator API.
I have a hook set up like so:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    summary_op=tf.summary.merge_all())
# ...
classifier.train(
    input_fn,
    steps=1000,
    hooks=[summary_hook])
In my model_fn, I am also creating a summary -
def model_fn(features, labels, mode):
    # ... model stuff, calculate the value of loss
    tf.summary.scalar("loss", loss)
    # ...
However, when I run this code, I get the following error from the summary_hook:
Exactly one of scaffold or summary_op must be provided.
This is probably because tf.summary.merge_all() is not finding any summaries and is returning None, despite the tf.summary.scalar I declared in the model_fn.
Any ideas why this wouldn't be working?
Use tf.train.Scaffold() and pass tf.summary.merge_all() as follows:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    scaffold=tf.train.Scaffold(summary_op=tf.summary.merge_all()))
For anyone who has this question in the future: the selected solution doesn't work for me (see my comments on the selected solution).
Actually, with TF 1.2 Estimator API, one doesn't need to have summary_hook. I just have tf.summary.scalar("loss", loss) in the model_fn, and run the code without summary_hook. The loss is recorded and shown in the tensorboard. I'm not sure if TF API was changed after this and similar questions.
With TensorFlow r1.3:
Add your summary ops in your Estimator's model_fn. Example:
tf.summary.histogram(tensorOp.name, tensorOp)
If writing summaries costs too much time or disk space, you can control how often they are written through your Estimator's run_config:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir=FLAGS.model_dir)
run_config = run_config.replace(save_summary_steps=150)
Note: this will affect the overall summary writer frequency for TensorBoard logging, of your estimator (tf.estimator.Estimator)
