Unable to debug where torch Adam optimiser is going wrong - python

I was implementing a training loop in VS Code. I created an Adam optimizer for an XLM-RoBERTa model as follows:
xlm_r_model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=NUM_LABELS,
    output_attentions=False,
    output_hidden_states=False,
)
xlm_r_model.to(device)
optimizer = torch.optim.Adam(xlm_r_model.parameters(), lr=LR)
Then, at the following line:
optimizer.step()
VS Code simply terminates the execution without printing any error stack trace.
So I debugged to find out exactly where this happens. I reached the line that makes the F.adam(...) call (shown in a screenshot in the original post). Weirdly, torch.optim.adam on GitHub does not have this line; the closest match seems to be line 150.
This call then goes into torch.optim._functional.adam. There, the params list (line 72) in the for loop contains 201 elements, and I am unable to figure out exactly which param is going wrong. When I let execution continue, the debugger does not pause when the error occurs; VS Code simply terminates. Again, I am not able to find this function in the GitHub version of _functional.
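One way to narrow down which of the 201 params is misbehaving (a minimal sketch, reusing the model name from the snippet above and assuming the usual loss.backward() call in the training loop; not a guaranteed fix) is to enable autograd anomaly detection and scan parameters and gradients for non-finite values right before optimizer.step(). Note that if the process dies inside native code, these checks may never get a chance to fire:

import torch

# Sketch: surface the op that produced a bad gradient with a full stack trace.
torch.autograd.set_detect_anomaly(True)

loss.backward()

# Sketch: flag any parameter or gradient containing NaN/Inf before stepping.
for name, param in xlm_r_model.named_parameters():
    if not torch.isfinite(param).all():
        print(f"non-finite values in parameter: {name}")
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite values in gradient of: {name}")

optimizer.step()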
When I checked several Kaggle notebooks (1, 2, 3, 4) for training XLM-RoBERTa, they use AdamW and the torch_xla package to train on TPUs, something like this:
import torch_xla.core.xla_model as xm

optimizer = AdamW([{'params': model.roberta.parameters(), 'lr': LR},
                   {'params': [param for name, param in model.named_parameters() if 'roberta' not in name], 'lr': 1e-3}],
                  lr=LR, weight_decay=0)
xm.optimizer_step(optimizer)
Am I missing some context, and is it indeed compulsory to train using AdamW or torch_xla? Or am I making some silly mistake?
PS:
I am running this on Colab. Its pip shows torch version 1.10.0+cu111 and Python 3.7.13. I have run code-server on Colab through colabcode and am debugging in browser-based VS Code.
I was able to train BERT with the Adam optimizer earlier.

Related

No console output using Keras model.fit() function

I'm following this tutorial to perform time series classification using Transformers with Keras and TensorFlow. I'm using Windows 10 and the PyDev Eclipse plugin. Unfortunately, my program stops and the console output is completely blank every time I run the following code:
n_classes = len(np.unique(y_train))
input_shape = np.array(x_trainScaled).shape[0:]

model = build_model(n_classes, input_shape, head_size=256, num_heads=4, ff_dim=4,
                    num_transformer_blocks=4, mlp_units=[128], mlp_dropout=0.4, dropout=0.25)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              metrics=["sparse_categorical_accuracy"])
print(model.summary())

callbacks = [keras.callbacks.EarlyStopping(patience=100, restore_best_weights=True)]
model.fit(x_trainScaled, y_train, validation_split=0.2, epochs=200, batch_size=64, callbacks=callbacks)

pathToModel = 'my/path/to/model/'
model.save(pathToModel)
Even previous warnings and print statements are completely erased, and I have no idea what's going on. If I comment out the model.fit(...) statement, the program terminates and crashes with an error message resulting from a model.predict(...) call.
Any help is highly appreciated.
The solution was to convert the input data and labels to NumPy arrays first. Thus, calling the fit function as follows:
model.fit(np.array(x_trainScaled),np.array(y_train),validation_split=0.2,epochs=200,batch_size=64,callbacks=callbacks)
worked perfectly fine for me, as opposed to:
model.fit(x_trainScaled,y_train,validation_split=0.2,epochs=200,batch_size=64,callbacks=callbacks)
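For reference, a defensive variant of that fix (a sketch reusing the variable names from the question; the explicit dtype is an assumption) is to coerce the inputs once up front and sanity-check the shapes before training:

import numpy as np

# Hypothetical up-front coercion; avoids passing raw Python lists to fit().
x_train_arr = np.asarray(x_trainScaled, dtype="float32")
y_train_arr = np.asarray(y_train)
print(x_train_arr.shape, y_train_arr.shape)  # verify shapes before training

model.fit(x_train_arr, y_train_arr, validation_split=0.2,
          epochs=200, batch_size=64, callbacks=callbacks)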

Different training results on TF 2.4.1/2.2.0 and 2.0.0 with the same script

How are you? Hope y'all good.
I have a very specific scenario and it would be awesome to have your input. I'm working on a project for which I developed a training script using TF 2.4.1. For context, I used MobileNetV2 as the base model for feature extraction plus a single Dense layer with one neuron; over time it became necessary to benchmark different base models, such as InceptionV3 and VGG16, so I made the base model a command-line option of the script. However, for others on my research team to run the script, I had to use TF 2.0.0 due to a software limitation on the laboratory machine (CUDA 10 instead of 11), and that's when things started getting weird.
You can find the script at the end of the question. I changed nothing in it and ran it on the same machine (Windows 10, GTX 1060 6GB, i7 10700K), pointing some environment variables to the correct CUDA version (10.1 or 11). So what's the problem? The validation loss does not decrease when running the script with TensorFlow 2.0.0. After checking that the data and the seeds were the same, I went through the documentation looking for breaking changes or anything similar, but I wasn't able to find anything that explains why the validation loss changed so much and the performance degraded. For instance, running the same script with the same seed on the two versions, I got the following results:
TF 2.0.0

| name   | val_accuracy | val_loss | val_precision | val_recall | test_accuracy | test_loss | test_precision | test_recall |
|--------|--------------|----------|---------------|------------|---------------|-----------|----------------|-------------|
| fold_5 | 80.2899%     | 0.442051 | 85.1852%      | 73.3333%   | 81.2022%      | 0.456165  | 93.0514%       | 67.3961%    |
| fold_4 | 77.3913%     | 0.507347 | 85.6604%      | 65.7971%   | 83.4973%      | 0.412218  | 95.0000%       | 70.6783%    |
| fold_3 | 75.9420%     | 0.517585 | 78.7781%      | 71.0145%   | 76.3934%      | 0.515756  | 81.6273%       | 68.0525%    |
| fold_2 | 74.8191%     | 0.595039 | 95.2381%      | 52.1739%   | 74.7541%      | 0.521118  | 95.9350%       | 51.6411%    |
| fold_1 | 73.0825%     | 0.652833 | 92.5532%      | 50.2890%   | 72.5683%      | 0.590006  | 96.3964%       | 46.8271%    |

TF 2.4.1

| name   | val_accuracy | val_loss | val_precision | val_recall | test_accuracy | test_loss | test_precision | test_recall |
|--------|--------------|----------|---------------|------------|---------------|-----------|----------------|-------------|
| fold_5 | 96.6667%     | 0.096043 | 96.0000%      | 97.3913%   | 95.5191%      | 0.156118  | 92.2764%       | 99.3435%    |
| fold_4 | 94.4928%     | 0.134758 | 92.5208%      | 96.8116%   | 96.3934%      | 0.145110  | 93.6214%       | 99.5624%    |
| fold_3 | 96.0870%     | 0.094560 | 96.7647%      | 95.3623%   | 96.3934%      | 0.136126  | 94.1667%       | 98.9059%    |
| fold_2 | 97.3951%     | 0.101784 | 97.6676%      | 97.1014%   | 96.6120%      | 0.131193  | 94.5607%       | 98.9059%    |
| fold_1 | 96.3821%     | 0.095308 | 97.3451%      | 95.3757%   | 96.5027%      | 0.118284  | 95.1168%       | 98.0306%    |
In the first plot (included in the original post), it looks as if nothing is happening on the validation data. The model does seem to learn something, because the training loss decreases, but "nothing" happens with the validation loss. I honestly searched a lot to explain this behavior, since it's exactly the same code, but wasn't able to find anything useful. I understand that, depending on the implementation, fluctuations may happen, but the scenario I presented above seems really specific to me.
Unfortunately, this is a no-go for my project, because I do need to run the experiments on the other machine, which has CUDA 10, while I develop the whole application on my personal computer.
Given this scenario, can you please help me understand what is going on? Am I missing some other change I have to make in the script? Is this type of behavior expected? I would appreciate any help! It might be worth saying that, to test this, I installed both CUDA versions (10.1 and 11) on my computer and created two separate conda environments so I can switch between them easily. It's also important to say that I tried TF 2.2.0 and it works as expected, just like 2.4.1.
I'm sorry if I'm not in the right channel to ask this; if not, please redirect me to the correct one. Thank you so much!
Below you can find my training script:
from pathlib import Path

import mlflow.tensorflow
from sklearn.model_selection import GroupKFold, StratifiedKFold
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.preprocessing.image import ImageDataGenerator

from classes.Evaluator import Evaluator
from utils.args import get_args_train
from utils.callbacks import TrainingAndValidationMetrics
from utils.inputs import get_inputs_paths_and_targets
from utils.model import build_model, get_image_size, get_preprocess_function
from utils.seed import set_seeds

# Setting the random_state to get reproducible results
seed = set_seeds()

# Constants
ROOT_FOLDER = Path(__file__).resolve().parent.parent
FOLDS_NUMBER = 5
EPOCHS_NUMBER = 100
EPOCHS_WAITING_FOR_IMPROVEMENT = 5

# Gets which input we're going to use
args = get_args_train()
IMAGE_FOLDER_PATH = ROOT_FOLDER / f"crop_result_prop_{args.proportion}"

# Instantiates GroupKFold class to split into train and test
group_k_fold = GroupKFold(n_splits=FOLDS_NUMBER)

# Configuration dictionary that is going to be used to compile the model
config = {
    "optimizer": "adam",
    "loss": "binary_crossentropy",
    "metrics": ["accuracy", Precision(), Recall()],
}

# Gets the dataframe that is going to be used in the flow_from_dataframe
# method from the ImageDataGenerator class
input_df = get_inputs_paths_and_targets(args.proportion)

# Using GroupKFold only to guarantee that a group (in this case, the slide)
# will contain data only in the train or the test group
for train_idx, test_idx in group_k_fold.split(
    input_df.input, input_df.target, input_df.slide
):
    train_data = input_df.iloc[train_idx]
    test_data = input_df.iloc[test_idx]
    # Break here is being used to get only the first fold
    break

print("### Testing data distribution ###")
print(f"{test_data.groupby('slide').count()}")

generator_kwargs = {
    "directory": IMAGE_FOLDER_PATH,
    "x_col": "input",
    "y_col": "target",
    "seed": seed,
    "target_size": get_image_size(args.model.lower()),
    "classes": ["0", "1"],
}

idg = ImageDataGenerator(
    fill_mode="nearest",
    preprocessing_function=get_preprocess_function(args.model.lower()),
)

# Generator that will be used to evaluate the model
test_data_generator = idg.flow_from_dataframe(
    test_data, class_mode="binary", shuffle=False, **generator_kwargs
)

# Callbacks
early_stopping = EarlyStopping(
    monitor="val_loss", patience=EPOCHS_WAITING_FOR_IMPROVEMENT
)
callbacks = [early_stopping, TrainingAndValidationMetrics()]

# Starts run on mlflow to register metrics (experiment)
mlflow.start_run(
    run_name=f"{args.model.lower()}",
    tags={"data_proportion": args.proportion, "environment": "pads"},
)

current_fold = 1
kfold = StratifiedKFold(n_splits=FOLDS_NUMBER, shuffle=True, random_state=seed)
for train_idx, val_idx in kfold.split(train_data.input, train_data.target):
    model = build_model(args.model.lower(), [Dense(1, activation="sigmoid")], config)

    # Starts run on mlflow to register metrics (runs)
    with mlflow.start_run(run_name=f"fold_{current_fold}", nested=True):
        fitting_data = train_data.iloc[train_idx]
        val_data = train_data.iloc[val_idx]
        mlflow.log_text(
            f"Training data: \n{fitting_data.groupby('target').count()} \n"
            f"Validation data: \n{val_data.groupby('target').count()} \n"
            f"Data proportion: {args.proportion} \n",
            artifact_file="data_description.txt",
        )

        train_data_generator = idg.flow_from_dataframe(
            fitting_data, class_mode="binary", **generator_kwargs
        )
        valid_data_generator = idg.flow_from_dataframe(
            val_data, class_mode="binary", **generator_kwargs
        )

        training = model.fit(
            train_data_generator,
            epochs=EPOCHS_NUMBER,
            validation_data=valid_data_generator,
            callbacks=callbacks,
        )

        # Logging model
        mlflow.keras.log_model(keras_model=model, artifact_path="model")

        # Evaluating and logging
        evaluator = Evaluator(model, training, test_data_generator, test_data.target)
        test_metrics = evaluator.evaluate_model()
        mlflow.log_metrics(test_metrics)

        # Saving files to mlflow
        mlflow.log_text(
            evaluator.generate_classification_report(),
            artifact_file="classification_report.txt",
        )
        mlflow.log_figure(
            evaluator.generate_training_history_image(),
            artifact_file="accuracy_loss_epochs.png",
        )
        mlflow.log_figure(evaluator.generate_roc_figure(), artifact_file="roc.png")

        current_fold += 1

mlflow.end_run()
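For reference, the set_seeds helper imported above is not included in the question; a minimal sketch of what such a helper might look like (an assumption based only on how it is used here), with calls available in both TF 2.0 and TF 2.4, is:

import os
import random

import numpy as np
import tensorflow as tf


def set_seeds(seed: int = 42) -> int:
    # Hypothetical helper: seed every RNG the script touches so folds are comparable across runs.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)  # the same call exists in TF 2.0 and TF 2.4
    return seed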

How to use evaluate and predict functions in keras implementation of SincNet?

Thanks for your attention. I'm developing an automatic speaker recognition system using SincNet.
Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
Since the network is coded in PyTorch, I searched and found a Keras implementation here: https://github.com/grausof/keras-sincnet. I adapted the train.py code to train a SincNet with my own data in TensorFlow 2.0, and it worked fine. I saved only the weights of my trained network; my training data has shape (128, 3200, 1) for the inputs and 128 labels per batch.
# Creates a Sincnet model with input_size=3200 (wlen), num_classes=40, fs=16000
redsinc = create_model(wlen, num_classes, fs)

# Saves only weights and stopearly callback
checkpointer = ModelCheckpoint(filepath='checkpoints/SincNetBiomex3.hdf5', verbose=1,
                               save_best_only=True, monitor='val_accuracy', save_weights_only=True)
stopearly = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)
callbacks = [checkpointer, stopearly]

# optimizer = RMSprop(lr=learnrate, rho=0.9, epsilon=1e-8)
optimizer = Adam(learning_rate=learnrate)

# Creates generator of training batches
train_generator = batchGenerator(batch_size, train_inputs, train_labels, wlen)
validinputs, validlabels = create_batches_rnd(validation_labels.shape[0],
                                              validation_inputs, validation_labels, wlen)

# Compiling model and training with fit_generator
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = redsinc.fit_generator(train_generator, steps_per_epoch=N_batches, epochs=epochs,
                                verbose=1, callbacks=callbacks, validation_data=(validinputs, validlabels))
The problem came when I tried to evaluate the network. I didn't use the code found in test.py; I only loaded the weights I previously saved and used the evaluate function. My test data has shape (1200, 3200, 1) for the inputs and 1200 labels.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen, num_clases, fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
test_loss, test_accuracy = redsinc.evaluate(x=eval_in, y=eval_lab)

RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer, loss)`.
Then I added the same compile code I used for training:
optimizer = Adam(learning_rate=0.001)
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Then I reran the test code and got this:
WARNING:tensorflow:From C:\Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
ValueError: A tf.Variable created inside your tf.function has been garbage-collected. Your code needs to keep Python references to variables created inside `tf.function`s.
A common way to raise this error is to create and return a variable only referenced inside your function:

@tf.function
def f():
    v = tf.Variable(1.0)
    return v

v = f()  # Crashes with this error message!

The reason this crashes is that a @tf.function-annotated function returns a `tf.Tensor` with the value of the variable when the function is called rather than the variable instance itself. As such there is no code holding a reference to the `v` created inside the function and Python garbage collects it.
The simplest way to fix this issue is to create variables outside the function and capture them:

v = tf.Variable(1.0)

@tf.function
def f():
    return v

f()  # <tf.Tensor: ... numpy=1.>
v.assign_add(1.)
f()  # <tf.Tensor: ... numpy=2.>
I don't understand the error, since I've evaluated other networks with the same function and never had any problems. Then I decided to use the predict function to match predicted labels with the correct labels and compute all metrics with my own code, but I got another error.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen, num_clases, fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
print('Model loaded')

# Predict labels with test data
predict_labels = redsinc.predict(eval_in)
Error while reading resource variable _AnonymousVar212 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar212/class tensorflow::Var does not exist.
[[node sinc_conv1d/concat_104/ReadVariableOp (defined at \Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_13649]
Function call stack:
keras_scratch_graph
I hope someone can tell me what these errors mean and how to solve them. I've searched for solutions, but most of the ones I found don't seem related to my problem, so I couldn't apply them. I'm guessing the errors are caused by the SincNet layer code, because it is a custom-coded layer; the code for the SincNet layer can be found in the GitHub repository in the file sincnet.py.
I appreciate all the help I can get. Again, thank you for your attention.
You should downgrade your tf and keras versions; it worked for me when I faced the same problem.
Try keras==2.1.6 and tensorflow-gpu==1.13.1.

TensorFlow How to Initialize Global Step

So I'm trying to run a training session, and when I do I get this error when trying to run my algorithm (when I use tf.train.get_global_step()):
ValueError: global_step is required for exponential_decay.
For some reason, tf.train.get_or_create_global_step() doesn't exist for me; I'm not sure whether it's a removed method or something else. I updated TensorFlow and everything, and I'm up to date.
I've dug around the documentation and there's nothing about it. To run, I'm using tf.app.run() with a main function.
Is there another way to initialize the global step variable?
Although tf.train.get_or_create_global_step() is perfectly fine, here is another solution:
g_step = tf.get_variable('global_step', trainable=False, initializer=0)
# decay_steps/decay_rate below are illustrative; exponential_decay requires both
learning_rate = tf.train.exponential_decay(0.1, g_step, decay_steps=10000, decay_rate=0.96)
tf.train.AdamOptimizer(learning_rate).minimize(loss=loss, global_step=g_step)
Create a non-trainable variable that is initialized with zero and pass it to the optimizer.
If you need global_step later, use tf.train.global_step():
sess = tf.Session()
# Initialize the variable
sess.run(g_step.initializer)
print('global_step: %s' % tf.train.global_step(sess, g_step))
So, the reason this function wasn't showing up was that I actually wasn't on the newest version of TensorFlow, even though it told me I was completely up to date (as seen in the screenshot in the original answer).
So all I did to fix it was uninstall TensorFlow and then reinstall it from the actual link. I don't have the link anymore, but a quick Google search should suffice.

Tensorflow - Using tf.summary with 1.2 Estimator API

I'm trying to add some TensorBoard logging to a model which uses the new tf.estimator API.
I have a hook set up like so:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    summary_op=tf.summary.merge_all())

# ...

classifier.train(
    input_fn,
    steps=1000,
    hooks=[summary_hook])
In my model_fn, I am also creating a summary -
def model_fn(features, labels, mode):
    # ... model stuff, calculate the value of loss
    tf.summary.scalar("loss", loss)
    # ...
However, when I run this code, I get the following error from the summary_hook:
Exactly one of scaffold or summary_op must be provided. This is probably because tf.summary.merge_all() is not finding any summaries and is returning None, despite the tf.summary.scalar I declared in the model_fn.
Any ideas why this wouldn't be working?
Use tf.train.Scaffold() and pass tf.summary.merge_all() as follows:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    scaffold=tf.train.Scaffold(summary_op=tf.summary.merge_all()))
Just for whoever has this question in the future: the selected solution doesn't work for me (see my comments on the selected solution).
Actually, with the TF 1.2 Estimator API, one doesn't need a summary_hook at all. I just have tf.summary.scalar("loss", loss) in the model_fn and run the code without a summary_hook; the loss is recorded and shown in TensorBoard. I'm not sure whether the TF API changed after this and similar questions.
With TensorFlow r1.3:
Add your summary ops in your Estimator's model_fn, for example:
tf.summary.histogram(tensorOp.name, tensorOp)
If you feel that writing summaries consumes time and space, you can control the summary-writing frequency in your Estimator's run_config:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir=FLAGS.model_dir)
run_config = run_config.replace(save_summary_steps=150)
Note: this will affect the overall summary-writing frequency for TensorBoard logging of your estimator (tf.estimator.Estimator).
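For reference, if you are on the core Estimator API rather than tf.contrib.learn, the analogous configuration would be roughly as below (a sketch reusing FLAGS.model_dir and model_fn from the snippets above; exact parameter availability depends on the TF 1.x release):

import tensorflow as tf

# Control summary-writing frequency through tf.estimator.RunConfig instead.
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_summary_steps=150,  # write merged summaries every 150 global steps
)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)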
