I'm trying to add some TensorBoard logging to a model which uses the new tf.estimator API.
I have a hook set up like so:
summary_hook = tf.train.SummarySaverHook(
save_secs=2,
output_dir=MODEL_DIR,
summary_op=tf.summary.merge_all())
# ...
classifier.train(
input_fn,
steps=1000,
hooks=[summary_hook])
In my model_fn, I am also creating a summary -
def model_fn(features, labels, mode):
# ... model stuff, calculate the value of loss
tf.summary.scalar("loss", loss)
# ...
However, when I run this code, I get the following error from the summary_hook:
Exactly one of scaffold or summary_op must be provided. This is probably because tf.summary.merge_all() is not finding any summaries and is returning None, despite the tf.summary.scalar I declared in the model_fn.
Any ideas why this wouldn't be working?
Use tf.train.Scaffold() and pass tf.merge_all as following
summary_hook = tf.train.SummarySaverHook(
save_secs=2,
output_dir=MODEL_DIR,
scaffold=tf.train.Scaffold(summary_op=tf.summary.merge_all()))
Just for whoever have this question in the future, the selected solution doesn't work for me (see my comments in the selected solution).
Actually, with TF 1.2 Estimator API, one doesn't need to have summary_hook. I just have tf.summary.scalar("loss", loss) in the model_fn, and run the code without summary_hook. The loss is recorded and shown in the tensorboard. I'm not sure if TF API was changed after this and similar questions.
with Tensorflow ver-r1.3
Add your summary ops in your estimator model_fn
example :
tf.summary.histogram(tensorOp.name, tensorOp)
If you feel writing summaries may consume time and space, you can control the writing frequency of summaries, in your Estimator run_config
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir=FLAGS.model_dir)
run_config = run_config.replace(save_summary_steps=150)
Note: this will affect the overall summary writer frequency for TensorBoard logging, of your estimator (tf.estimator.Estimator)
Related
I have overridden the on_step_end method in TrainerCallback from transformers library in order to be able to get the evaluation on the training set, such that I can compare training loss with accuracy after training.
The code is as follows:
class CustomCallback(TrainerCallback):
"""
Overriding the trainer callback to be able to compute training accuracy as well
Example taken from:
https://stackoverflow.com/questions/67457480/how-to-get-the-accuracy-per-epoch-or-step-for-the-huggingface-transformers-train
"""
def __init__(self, trainer) -> None:
super().__init__()
self._trainer = trainer
def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
if control.should_evaluate:
control_copy = deepcopy(control)
self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix='train')
cuda.empty_cache()
return control_copy
The add the callback to my trainer which trains my Wav2Vec model:
#Set trainig arguments/parameters
training_args = TrainingArguments(
output_dir=out_dir_models_path,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=2,
evaluation_strategy="steps",
prediction_loss_only=False,
num_train_epochs=epochs,
fp16=True, #Daniel commented
save_steps=5, #TODO change these 3 back to 10 after testing
eval_steps=5,
logging_steps=5,
learning_rate=1e-4,
save_total_limit=5,
load_best_model_at_end=True,
metric_for_best_model="eval_accuracy",
seed=seed, )
#Set data collator to pad the small recordings
trainer = CTCTrainer(
model=model,
data_collator=data_collator,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=processor.feature_extractor,
callbacks=[callbackTB]
)
trainer.add_callback(CustomCallback(trainer))
trainer.train()
This is causing a CUDA out of memory exception when evaluating. Without my Custom Callback, the training and evaluation work well with no memory errors.
Anyone knows why my code is breaking the evaluation loop or what to do to solve this issue?
I tried the above code (within context and datasets) and as explained, it works without my Custom Callback but gets memory exception when using custom callback.
I does work for some evaluations, but at some point over some folds, I will get the error.
I also tried to include cuda.empty_cache() after each evaluation on the training set, and the memory is being cleaned, as I am monitoring the GPU. Still, I am getting the same problem.
I tried changing eval_accumulation_steps, but this did not work.
I am also using fp16.
I'm working on HuggingFace Transformers and using toy example from here:
https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-trainer
What I actually need: ability to print input, output, grad and loss at every step.
It is trivial using Pytorch training loop, but it is not obvious using HuggingFace Trainer.
At the current moment I have next idea: create a CustomCallback like this:
class MyCallback(TrainerCallback):
"A callback that prints a grad at every step"
def on_step_begin(self, args, state, control, **kwargs):
print("next step")
print(kwargs['model'].classifier.out_proj.weight.grad.norm())
args = TrainingArguments(
output_dir='test_dir',
overwrite_output_dir=True,
num_train_epochs=1,
logging_steps=100,
report_to="none",
fp16=True,
disable_tqdm=True,
)
trainer = Trainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
callbacks=[MyCallback],
)
trainer.train()
This way I can print grad and weights for any model layer.
But I still can't figure out how to print input/output (for example, I want to check them on nan) and loss?
P.S. I also read something about forward_hook but still can't find good code examples for it.
While using hooks and custom callbacks is the right way to solve the problem I find better solution - use built-in utility for finding nan/Inf in losses / weights / inputs / outputs:
https://huggingface.co/transformers/internal/trainer_utils.html#transformers.debug_utils.DebugUnderflowOverflow
since the 4.6.0 transformers has such option.
You can use it manually in forward function or just use additional option for TrainingArguments like this:
args = TrainingArguments(
output_dir='test_dir',
overwrite_output_dir=True,
num_train_epochs=1,
logging_steps=100,
report_to="none",
fp16=True,
disable_tqdm=True,
debug="debug underflow_overflow"
)
thanks for your atention, I'm developing an automatic speaker recognition system using SincNet.
Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
Since the network is coded in Pytorch I searched and found a Keras implementation here https://github.com/grausof/keras-sincnet. I adapted the train.py code to train a Sincnet with my own data in Tensorflow 2.0, and worked fine, I saved only the weights of my trained network, my training data has shape 128,3200,1 for inputs and 128 for labels per batch
#Creates a Sincnet model with input_size=3200 (wlen), num_classes=40, fs=16000
redsinc = create_model(wlen,num_classes,fs)
#Saves only weights and stopearly callback
checkpointer = ModelCheckpoint(filepath='checkpoints/SincNetBiomex3.hdf5',verbose=1,
save_best_only=True, monitor='val_accuracy',save_weights_only=True)
stopearly = EarlyStopping(monitor='val_accuracy',patience=3,verbose=1)
callbacks = [checkpointer,stopearly]
# optimizer = RMSprop(lr=learnrate, rho=0.9, epsilon=1e-8)
optimizer = Adam(learning_rate=learnrate)
# Creates generator of training batches
train_generator = batchGenerator(batch_size,train_inputs,train_labels,wlen)
validinputs, validlabels = create_batches_rnd(validation_labels.shape[0],
validation_inputs,validation_labels,wlen)
#Compiling model and train with function fit_generator
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = redsinc.fit_generator(train_generator, steps_per_epoch=N_batches, epochs = epochs,
verbose = 1, callbacks=callbacks, validation_data=(validinputs,validlabels))
The problem came when I tried to evaluate the network, I didn't use the code found in test.py, I only loaded the weights I previously saved and use the function evaluate, my test data had the shape 1200,3200,1 for the inputs and 1200 for labels.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
test_loss, test_accuracy = redsinc.evaluate(x=eval_in,y=eval_lab)
RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer,
loss)`.
Then I added the same compile code I used for training:
optimizer = Adam(learning_rate=0.001)
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Then rerun the test code and got this:
WARNING:tensorflow:From C:\Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-
packages\tensorflow_core\python\ops\resource_variable_ops.py:1781: calling
BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is
deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
ValueError: A tf.Variable created inside your tf.function has been garbage-collected. Your code needs to keep Python references to variables created inside `tf.function`s.
A common way to raise this error is to create and return a variable only referenced inside your function:
#tf.function
def f():
v = tf.Variable(1.0)
return v
v = f() # Crashes with this error message!
The reason this crashes is that #tf.function annotated function returns a **`tf.Tensor`** with the **value** of the variable when the function is called rather than the variable instance itself. As such there is no code holding a reference to the `v` created inside the function and Python garbage collects it.
The simplest way to fix this issue is to create variables outside the function and capture them:
v = tf.Variable(1.0)
#tf.function
def f():
return v
f() # <tf.Tensor: ... numpy=1.>
v.assign_add(1.)
f() # <tf.Tensor: ... numpy=2.>
I don't understand the error since I've evaluated other networks with the same function and never got any problems. Then I decided to use predict function to match predicted labels with correct labels and obtain all metrics with my own code but I got another error.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
print('Model loaded')
#Predict labels with test data
predict_labels = redsinc.predict(eval_in)
Error while reading resource variable _AnonymousVar212 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar212/class tensorflow::Var does not exist.
[[node sinc_conv1d/concat_104/ReadVariableOp (defined at \Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_13649]
Function call stack:
keras_scratch_graph
I hope someone can tell me what these errors mean and how to solve them, I've searched for solutions to them but most of the solutions I've found don't seem related to my problem so I can't apply those solutions. I'm guessing the errors are caused by the Sincnet layer code, because it is a custom coded layer. The code for Sincnet layer can be found in the github repository in the file sincnet.py.
I appreciate all help I can get, again thank you for your atention.
You should downgrade your tf and keras version, it works to me when I faced the same problem.
Try this keras==2.1.6; tensorflow-gpu==1.13.1
I'm quite familiar in TensorFlow 1.x and I'm considering to switch to TensorFlow 2 for an upcoming project. I'm having some trouble understanding how to write scalars to TensorBoard logs with eager execution, using a custom training loop.
Problem description
In tf1 you would create some summary ops (one op for each thing you would want to store), which you would then merge into a single op, run that merged op inside a session and then write this to a file using a FileWriter object. Assuming sess is our tf.Session(), an example of how this worked can be seen below:
# While defining our computation graph, define summary ops:
# ... some ops ...
tf.summary.scalar('scalar_1', scalar_1)
# ... some more ops ...
tf.summary.scalar('scalar_2', scalar_2)
# ... etc.
# Merge all these summaries into a single op:
merged = tf.summary.merge_all()
# Define a FileWriter (i.e. an object that writes summaries to files):
writer = tf.summary.FileWriter(log_dir, sess.graph)
# Inside the training loop run the op and write the results to a file:
for i in range(num_iters):
summary, ... = sess.run([merged, ...], ...)
writer.add_summary(summary, i)
The problem is that sessions don't exist anymore in tf2 and I would prefer not disabling eager execution to make this work. The official documentation is written for tf1 and all references I can find suggest using the Tensorboard keras callback. However, as far as I know, this only works if you train the model through model.fit(...) and not through a custom training loop.
What I've tried
The tf1 version of tf.summary functions, outside of a session. Obviously any combination of these functions fails, as FileWriters, merge_ops, etc. don't even exist in tf2.
This medium post states that there has been a "cleanup" in some tensorflow APIs including tf.summary(). They suggest using from tensorflow.python.ops.summary_ops_v2, which doesn't seem to work. This implies using a record_summaries_every_n_global_steps; more on this later.
A series of other posts 1, 2, 3, suggest using the tf.contrib.summary and tf.contrib.FileWriter. However, tf.contrib has been removed from the core TensorFlow repository and build process.
A TensorFlow v2 showcase from the official repo, which again uses the tf.contrib summaries along with the record_summaries_every_n_global_steps mentioned previously. I couldn't make this to work either (even without using the contrib library).
tl;dr
My questions are:
Is there a way to properly use tf.summary in TensroFlow 2?
If not, is there another way to write TensorBoard logs in TensorFlow 2, when using a custom training loop (not model.fit())?
Yes, there is a simpler and more elegant way to use summaries in TensorFlow v2.
First, create a file writer that stores the logs (e.g. in a directory named log_dir):
writer = tf.summary.create_file_writer(log_dir)
Anywhere you want to write something to the log file (e.g. a scalar) use your good old tf.summary.scalar inside a context created by the writer. Suppose you want to store the value of scalar_1 for step i:
with writer.as_default():
tf.summary.scalar('scalar_1', scalar_1, step=i)
You can open as many of these contexts as you like inside or outside of your training loop.
Example:
# create the file writer object
writer = tf.summary.create_file_writer(log_dir)
for i, (x, y) in enumerate(train_set):
with tf.GradientTape() as tape:
y_ = model(x)
loss = loss_func(y, y_)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
# write the loss value
with writer.as_default():
tf.summary.scalar('training loss', loss, step=i+1)
I'm trying to use a hook in my DNNClassifier model using tensorflow.keras.callbacks.EarlyStopping but I have no idea what to put in monitor. The documentation is not exactly helpful here.
From looking at the code a softmax cross-entropy is used as the loss function but for DNNRegressor the loss node is dnn/head/weighted_loss/Sum as per this thread. I have tried getting Tensorboard up and running but I am not able to and the import script from a saved model is equally defective on my machine.
Is there any way to figure out what the node of the DNNClassifier's loss is?
The monitor does not refer to a graph node or a layer, but to a loss or metric value. Indeed any value can be used that is present in your logs dictionary: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/callbacks.py#L676
You can inspect the values you have in logs without debugging, by using CSVLogger for instance:
csv_logger = CSVLogger(filename=os.path.join(args.log_dir, 'train.csv'), separator=',', append=False)
If you cannot write to a file, you can print out everything you have in logs to stdout:
mycallback = LambdaCallback(on_epoch_end=lambda epoch, logs: print('\n'.join(['{}: {}'.format(k, v) for k, v in logs.items()])))
In case you do not have the metric in logs, you can use LambdaCallback to put it there. For instance:
eval_callback = LambdaCallback(on_epoch_end=lambda epoch, logs: logs.update({'metric_name': get_metric_value()}))
early_stopping = EarlyStopping(monitor='metric_name', min_delta=0.0, patience=10, verbose=1, mode='min')