I'm trying to use a hook in my DNNClassifier model via tensorflow.keras.callbacks.EarlyStopping, but I have no idea what to put in monitor. The documentation is not exactly helpful here.
From looking at the code, softmax cross-entropy is used as the loss function, but for DNNRegressor the loss node is dnn/head/weighted_loss/Sum, as per this thread. I have tried getting TensorBoard up and running but I am not able to, and the import script from a saved model is equally broken on my machine.
Is there any way to figure out what the node of the DNNClassifier's loss is?
The monitor does not refer to a graph node or a layer, but to a loss or metric value. In fact, any value that is present in your logs dictionary can be used: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/callbacks.py#L676
You can inspect the values you have in logs without a debugger, for instance by using CSVLogger:
csv_logger = CSVLogger(filename=os.path.join(args.log_dir, 'train.csv'), separator=',', append=False)
If you cannot write to a file, you can print out everything you have in logs to stdout:
mycallback = LambdaCallback(on_epoch_end=lambda epoch, logs: print('\n'.join(['{}: {}'.format(k, v) for k, v in logs.items()])))
In case you do not have the metric in logs, you can use LambdaCallback to put it there. For instance:
eval_callback = LambdaCallback(on_epoch_end=lambda epoch, logs: logs.update({'metric_name': get_metric_value()}))
early_stopping = EarlyStopping(monitor='metric_name', min_delta=0.0, patience=10, verbose=1, mode='min')
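Putting the snippets above together, here is a minimal sketch of how they plug into training (model, get_metric_value, x_train and y_train are assumed placeholders, not from the original post). Note that eval_callback must come before early_stopping in the callbacks list, because callbacks run in list order and the custom value has to be in logs before EarlyStopping reads it:
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

eval_callback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: logs.update({'metric_name': get_metric_value()}))
early_stopping = EarlyStopping(monitor='metric_name', min_delta=0.0,
                               patience=10, verbose=1, mode='min')

# callbacks run in list order, so the metric is added before EarlyStopping checks it
model.fit(x_train, y_train, epochs=100,
          callbacks=[eval_callback, early_stopping])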
Related
I'm training a model and using the TensorFlow callbacks to save my training logs, and I have a ModelCheckpoint callback to save my model's weights.
During training, every epoch it says "WARNING:tensorflow: Can save best model only with val_acc available, skipping". This is issue 1.
Here is the code for the callbacks I include in callbacks=[] during model.fit.
def create_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.

  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback
# Create ModelCheckpoint callback to save model's progress
checkpoint_path = "model_checkpoints/cp.ckpt"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                      monitor="val_acc",
                                                      save_best_only=True,  # SAVING BEST ONLY
                                                      save_weights_only=True,
                                                      verbose=0)
Code for fitting the model with callbacks:
history_101_food_classes_feature_extract = model.fit(
    train_data,
    epochs=3,
    steps_per_epoch=len(train_data),
    validation_data=test_data,
    validation_steps=int(0.15 * len(test_data)),
    callbacks=[create_tensorboard_callback("training_logs",
                                           "efficientnetb0_101_classes_all_data_feature_extract"),
               model_checkpoint])
Also, I cloned my model and used cloned_model.load_weights(checkpoint_path), then evaluated both the original and the cloned model with model.evaluate(test_data). The original model scores 70+% accuracy, while the cloned model always returns the exact same accuracy. This is issue 2.
My guess was that I had previously trained and saved a very high-accuracy model, hence issue 1 where it refuses to save at every epoch. But my model_checkpoint path looks clean to me.
And if I did previously save a high-accuracy model to my checkpoint_path, when I clone a new model and load weights from that path, why would it give 0.54 accuracy every time and not something higher? (Issue 2)
I need help. Let me know if you need more info from my side to solve this issue; I'm happy to answer. Thanks. If you want to see the full code, here's the link to it:
https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/07_food_vision_milestone_project_1.ipynb
I'm working on HuggingFace Transformers and using toy example from here:
https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-trainer
What I actually need is the ability to print the input, output, gradients and loss at every step.
It is trivial with a plain PyTorch training loop, but it is not obvious with the HuggingFace Trainer.
At the moment my idea is to create a custom callback like this:
from transformers import TrainerCallback

class MyCallback(TrainerCallback):
    "A callback that prints a grad at every step"

    def on_step_begin(self, args, state, control, **kwargs):
        print("next step")
        print(kwargs['model'].classifier.out_proj.weight.grad.norm())
args = TrainingArguments(
    output_dir='test_dir',
    overwrite_output_dir=True,
    num_train_epochs=1,
    logging_steps=100,
    report_to="none",
    fp16=True,
    disable_tqdm=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[MyCallback],
)
trainer.train()
This way I can print grad and weights for any model layer.
But I still can't figure out how to print the input/output (for example, I want to check them for NaN) and the loss.
P.S. I also read something about forward_hook but still can't find good code examples for it.
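On the forward_hook point: below is a minimal PyTorch sketch of the idea, not taken from the Trainer docs (the model variable and the NaN check are assumptions). register_forward_hook gives you the module, its inputs and its output on every forward pass, so you can inspect them there:
import torch

def nan_check_hook(module, inputs, output):
    # inputs is a tuple of tensors; output is usually a tensor
    for i, inp in enumerate(inputs):
        if torch.is_tensor(inp) and torch.isnan(inp).any():
            print(f"NaN in input {i} of {module.__class__.__name__}")
    if torch.is_tensor(output) and torch.isnan(output).any():
        print(f"NaN in output of {module.__class__.__name__}")

# register on every submodule; keep the handles so the hooks can be removed later
handles = [m.register_forward_hook(nan_check_hook) for m in model.modules()]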
While hooks and custom callbacks are the right way to solve the problem, I found a better solution: use the built-in utility for finding NaN/Inf in losses / weights / inputs / outputs:
https://huggingface.co/transformers/internal/trainer_utils.html#transformers.debug_utils.DebugUnderflowOverflow
Since version 4.6.0, transformers has this option.
You can use it manually in the forward function, or just pass an additional option to TrainingArguments like this:
args = TrainingArguments(
    output_dir='test_dir',
    overwrite_output_dir=True,
    num_train_epochs=1,
    logging_steps=100,
    report_to="none",
    fp16=True,
    disable_tqdm=True,
    debug="underflow_overflow",
)
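For the manual route, a minimal sketch based on the transformers debug utils docs (assuming model is the same model passed to the Trainer): wrapping the model registers forward hooks that trace activations and report the frames where an underflow/overflow occurs.
from transformers.debug_utils import DebugUnderflowOverflow

# registers forward hooks on the model; keep the reference alive during training
debug_overflow = DebugUnderflowOverflow(model)
trainer.train()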
Thanks for your attention. I'm developing an automatic speaker recognition system using SincNet:
Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
Since the network is coded in PyTorch, I searched and found a Keras implementation here: https://github.com/grausof/keras-sincnet. I adapted the train.py code to train a SincNet with my own data in TensorFlow 2.0, and it worked fine. I saved only the weights of my trained network; my training data has shape (128, 3200, 1) for the inputs and 128 labels per batch.
# Creates a SincNet model with input_size=3200 (wlen), num_classes=40, fs=16000
redsinc = create_model(wlen, num_classes, fs)

# Saves only weights, plus early-stopping callback
checkpointer = ModelCheckpoint(filepath='checkpoints/SincNetBiomex3.hdf5', verbose=1,
                               save_best_only=True, monitor='val_accuracy',
                               save_weights_only=True)
stopearly = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)
callbacks = [checkpointer, stopearly]

# optimizer = RMSprop(lr=learnrate, rho=0.9, epsilon=1e-8)
optimizer = Adam(learning_rate=learnrate)

# Creates generator of training batches
train_generator = batchGenerator(batch_size, train_inputs, train_labels, wlen)
validinputs, validlabels = create_batches_rnd(validation_labels.shape[0],
                                              validation_inputs, validation_labels, wlen)

# Compile the model and train with fit_generator
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = redsinc.fit_generator(train_generator, steps_per_epoch=N_batches, epochs=epochs,
                                verbose=1, callbacks=callbacks,
                                validation_data=(validinputs, validlabels))
The problem came when I tried to evaluate the network. I didn't use the code found in test.py; I only loaded the weights I previously saved and used the evaluate function. My test data had shape (1200, 3200, 1) for the inputs and 1200 labels.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
test_loss, test_accuracy = redsinc.evaluate(x=eval_in,y=eval_lab)
RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer, loss)`.
Then I added the same compile code I used for training:
optimizer = Adam(learning_rate=0.001)
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Then rerun the test code and got this:
WARNING:tensorflow:From C:\Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
ValueError: A tf.Variable created inside your tf.function has been garbage-collected. Your code needs to keep Python references to variables created inside `tf.function`s.
A common way to raise this error is to create and return a variable only referenced inside your function:
@tf.function
def f():
  v = tf.Variable(1.0)
  return v

v = f()  # Crashes with this error message!

The reason this crashes is that the @tf.function annotated function returns a **`tf.Tensor`** with the **value** of the variable when the function is called rather than the variable instance itself. As such there is no code holding a reference to the `v` created inside the function and Python garbage collects it.
The simplest way to fix this issue is to create variables outside the function and capture them:

v = tf.Variable(1.0)

@tf.function
def f():
  return v

f()  # <tf.Tensor: ... numpy=1.>
v.assign_add(1.)
f()  # <tf.Tensor: ... numpy=2.>
I don't understand the error, since I've evaluated other networks with the same function and never had any problems. Then I decided to use the predict function to match predicted labels with correct labels and compute all metrics with my own code, but I got another error.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
print('Model loaded')
#Predict labels with test data
predict_labels = redsinc.predict(eval_in)
Error while reading resource variable _AnonymousVar212 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar212/class tensorflow::Var does not exist.
[[node sinc_conv1d/concat_104/ReadVariableOp (defined at \Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_13649]
Function call stack:
keras_scratch_graph
I hope someone can tell me what these errors mean and how to solve them. I've searched for solutions, but most of the ones I've found don't seem related to my problem, so I can't apply them. I'm guessing the errors are caused by the SincNet layer code, because it is a custom-coded layer. The code for the SincNet layer can be found in the GitHub repository in the file sincnet.py.
I appreciate all the help I can get; again, thank you for your attention.
You should downgrade your TensorFlow and Keras versions; it worked for me when I faced the same problem.
Try keras==2.1.6 and tensorflow-gpu==1.13.1.
My model has two outputs, I want to monitor one to save my model.
Below is part of my code. The version of TensorFlow is 2.0
model = MobileNetBaseModel()()
model.compile(optimizer=tf.keras.optimizers.Adam(),
              metrics={"pitch_yaw_roll": "mae"},
              loss={"pitch_yaw_roll": compute_mse_loss,  # or "mse"
                    "total_logits": compute_cross_entropy_loss(num_classes=num_classes)},
              loss_weights={"pitch_yaw_roll": mse_weight, "total_logits": cross_entropy_weight})

file_path = os.path.join(checkpoint_path, "model.{epoch:2d}-{val_loss:.2f}.h5")
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_loss",
                                   verbose=1,
                                   save_freq=save_freq,
                                   save_best_only=True)
The default is monitor='val_loss' in the ModelCheckpoint callback; how do I choose what I need? I want to monitor {"pitch_yaw_roll": "mae"}.
If you want ModelCheckpoint to save according to another metric value, use the key of that metric in the .compile(metrics={...}, ...) metrics dictionary.
So for example, if you would like to save only the best "pitch_yaw_roll" epoch result (best being the minimum value) you should use
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_pitch_yaw_roll",
                                   verbose=1,
                                   mode="min",
                                   save_freq=save_freq,
                                   save_best_only=True)
If you opt for "pitch_yaw_roll" instead of "val_pitch_yaw_roll", it will save according to the training loss and not according to the validation loss.
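Note also that the val_-prefixed entries only exist when validation data is passed to fit. A minimal sketch of the wiring (x_train, x_val and the label arrays y_pose, y_class, y_pose_val, y_class_val are placeholder names, not from the question):
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                                monitor="val_pitch_yaw_roll",
                                                mode="min",
                                                save_best_only=True)

# without validation_data there are no val_* values for the callback to monitor
model.fit(x_train, {"pitch_yaw_roll": y_pose, "total_logits": y_class},
          validation_data=(x_val, {"pitch_yaw_roll": y_pose_val, "total_logits": y_class_val}),
          epochs=10,
          callbacks=[checkpoint])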
Just adding to the answer above: I believe your checkpoint doesn't work because of an incorrect name of the value to monitor.
In general, a solution here might be to take a peek into the history that your fit call creates:
import pandas as pd

history = model.fit(...)
pd.DataFrame(history.history)
There you will find the names of the metrics you should use in the monitor argument.
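Equivalently, a quick sketch without pandas (using the history object from the fit call above):
# the keys printed here are the exact strings to pass as monitor=...
print(history.history.keys())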
I'm trying to add some TensorBoard logging to a model which uses the new tf.estimator API.
I have a hook set up like so:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    summary_op=tf.summary.merge_all())

# ...

classifier.train(
    input_fn,
    steps=1000,
    hooks=[summary_hook])
In my model_fn, I am also creating a summary -
def model_fn(features, labels, mode):
    # ... model stuff, calculate the value of loss
    tf.summary.scalar("loss", loss)
    # ...
However, when I run this code, I get the following error from the summary_hook:
"Exactly one of scaffold or summary_op must be provided." This is probably because tf.summary.merge_all() is not finding any summaries and is returning None, despite the tf.summary.scalar I declared in the model_fn.
Any ideas why this wouldn't be working?
Use tf.train.Scaffold() and pass tf.summary.merge_all() as follows:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    scaffold=tf.train.Scaffold(summary_op=tf.summary.merge_all()))
Just for whoever has this question in the future: the selected solution doesn't work for me (see my comments on the selected solution).
Actually, with the TF 1.2 Estimator API, one doesn't need a summary_hook. I just have tf.summary.scalar("loss", loss) in the model_fn and run the code without a summary_hook. The loss is recorded and shown in TensorBoard. I'm not sure whether the TF API changed after this and similar questions were asked.
With TensorFlow r1.3:
Add your summary ops in your estimator's model_fn.
Example:
tf.summary.histogram(tensorOp.name, tensorOp)
If you feel writing summaries may consume time and space, you can control the writing frequency of summaries in your Estimator's run_config:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir=FLAGS.model_dir)
run_config = run_config.replace(save_summary_steps=150)
Note: this will affect the overall summary-writing frequency for TensorBoard logging of your estimator (tf.estimator.Estimator).
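To tie it together, a minimal sketch of passing the run_config to the estimator so the save_summary_steps setting takes effect (model_fn and input_fn are assumed from the snippets above):
# run_config already carries model_dir via the replace() call above
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
estimator.train(input_fn, steps=1000)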