I'm training a simple fully convolutional network with batch norm. I saved a checkpoint right after initialization. Then I restored it and ran the training again (with the same hyperparameters). However, I got different results from the two training procedures. All my seeds (Python, NumPy, and TF) were set to the same values at the beginning of the two runs.
What could possibly be the reason for the mismatch?
Setting the seeds only at the top of the file will lead to different outcomes, because initialization consumes some of the random values before you get to training: the run that initializes the model draws random numbers that the run restoring a checkpoint never does, so the two runs enter training with different RNG states.
So you should set the seeds again after performing initialization (or after restoring the checkpoint). The initialization can use any seed you want, including another fixed seed, but then you have to reset the seeds again for training.
Here is some very high-level pseudo-code, which assumes you have functions to build, initialize, and train the model, as well as functions to save and load checkpoints.
import random

import numpy as np
import tensorflow as tf

def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

# Make a checkpoint.
set_seeds(INIT_SEED)
model = build_model()
model = initialize_model(model)
save_checkpoint(model)

# Or load a checkpoint.
model = load_checkpoint()

# At this point the RNG state needs to be identical in both runs,
# so re-fix the seeds for training, then proceed.
set_seeds(TRAIN_SEED)
trained = train_model(model)
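To make that concrete, here is a minimal, hypothetical tf.keras sketch of the same pattern, reusing the set_seeds helper above; the small conv+batchnorm model, the dummy data, and the file name init.weights.h5 are placeholders, not your actual setup.

import numpy as np
import tensorflow as tf

INIT_SEED, TRAIN_SEED = 0, 42

# Build and initialize under INIT_SEED, then checkpoint the fresh weights.
set_seeds(INIT_SEED)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.save_weights("init.weights.h5")

# Later, or in the second run: restore the checkpoint, then re-fix the seeds
# so that data shuffling, dropout, etc. consume identical random streams.
model.load_weights("init.weights.h5")
set_seeds(TRAIN_SEED)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
x = np.random.RandomState(0).rand(64, 28, 28, 1).astype("float32")
y = np.random.RandomState(1).randint(0, 10, size=64)
model.fit(x, y, epochs=1, batch_size=16, shuffle=True, verbose=0)

Note that even with identical seeds, GPU kernels can still introduce small run-to-run differences; recent TF versions offer tf.config.experimental.enable_op_determinism() to suppress those as well.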
Related
For a project I'm working on I have to compare the performance of a small, simulated dataset to the performance of well-known datasets (e.g. ImageNet, CIFAR-10) using two neural networks, ResNet v2 at different depths and Inception v3.
To account for the imbalance between my set and the big popular ones, one of the methods I want to employ is augmenting the big sets using either a class from my simulated set or the same class but instead with real-life images. Then, using a pre-trained model, I want to train on both augmented sets and later compare performance.
The issue
The augmentation works fine: I copy over the .tfrecord files containing only images of the extra class to the original dataset and I modify the labels.txt file to contain the appropriate extra labels. However, the issue occurs when trying to train on this augmented dataset:
I0125 16:40:31.279634 140094914221888 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [258] rhs shape= [257]
which is to be expected, as the graph cannot handle the extra class.
My attempted solution
To alleviate this issue I tried applying the same method as was posed in this answer, using the Python script below (modified for brevity):
import numpy as np
import tensorflow.compat.v1 as tf

checkpoint = {}
saver = tf.train.import_meta_graph(f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}.meta")

with tf.Session() as sess:
    saver.restore(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
    # get the nodes where one of the dimensions equals the number of classes
    if "inception_v3" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    elif "resnet_v2" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    # select the nodes defined above
    nodes = [node for node in tf.global_variables() if node.name in target_nodes]
    for node in nodes:
        # load the values of those nodes from the checkpoint
        checkpoint[node.name] = node.eval()
        # extend the last dimension by one entry (initialized to 0.0) for the extra class
        checkpoint[node.name] = np.insert(checkpoint[node.name], checkpoint[node.name].shape[-1], 0.0, axis=-1)
        # assign the widened tensors back into the graph
        sess.run(tf.assign(node, checkpoint[node.name], validate_shape=False))
    saver.save(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
In short, this script loads the checkpoint and isolates all nodes where one of the dimensions is equal to the number of classes and increases it. I have verified these are always equal to the number of classes using the checkpoints from the different datasets the models have been trained on. In fact, it seems that those tensors are the only thing that's different between the checkpoints.
Although this method does execute correctly and allows me to run the training, it feels very boneheaded and I'm almost positive it is not at all what I'm supposed to do. However, I'm curious what exactly is wrong with it (apart from initializing the new values to 0.0), because, aside from a wild fluctuation in loss at the start of training, it seems to work as I had hoped.
Other solutions
When looking further into this issue I also found other users' answers to similar questions suggesting that it is in fact impossible to modify a checkpoint in the way I want, and recommending transfer learning or modifying the output layer instead. I am very new to neural network training and I don't know whom to believe, as the linked answer seems to suggest that what I'm trying to do should work.
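For context, my understanding of what those answers mean by "modifying the output layer" is roughly the following toy sketch (illustrative layer names and a tiny throwaway graph, not the actual Model Garden scopes): build the graph with the new number of classes and restore every variable from the old checkpoint except the class-dependent output layer, which keeps its fresh initialization.

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

def build(num_classes):
    tf.reset_default_graph()
    x = tf.placeholder(tf.float32, [None, 4], name="x")
    hidden = tf.layers.dense(x, 8, name="backbone")        # shared layers
    logits = tf.layers.dense(hidden, num_classes, name="logits")
    return x, logits

# 1) Pretend this is the existing checkpoint, trained with 3 classes.
x, logits = build(num_classes=3)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver().save(sess, "/tmp/toy.ckpt")

# 2) Rebuild the graph with one extra class and restore everything
#    except the output layer, which stays freshly initialized.
x, logits = build(num_classes=4)
restore_vars = [v for v in tf.global_variables()
                if not v.name.startswith("logits")]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(var_list=restore_vars).restore(sess, "/tmp/toy.ckpt")
    print(sess.run(logits, {x: np.zeros((1, 4), np.float32)}).shape)  # (1, 4)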
My question
So I ask: which approach is correct? Could my initial approach work, or should I try a different method to fix this issue? Starting from scratch would not be ideal, as the progress so far has taken over a month of continuous training.
I am using a modified version of the TensorFlow Model Garden as my environment to train the networks, with the modifications pertaining to creating custom datasets and training on them with the supplied training scripts.
Why does my model with a batch norm layer behave differently on every run, when the same model without the batch norm layer produces identical results? In my model, the random seeds have been set by:
np.random.seed(args.seed)
torch.manual_seed(args.seed)
random.seed(args.seed)
torch.cuda.manual_seed_all(args.seed)
os.environ['PYTHONHASHSEED'] = str(args.seed)
After removing the batch norm layer while keeping all other settings the same, my model produces the same results on every run.
It appears that you need to also include the line:
torch.backends.cudnn.deterministic = True
in order to force the CUDA (cuDNN) implementation of batch norm in PyTorch to be deterministic. Here are more details.
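Putting that together with the seed calls from the question, a consolidated helper might look like this sketch; the extra cudnn.benchmark flag is an addition that is commonly paired with the deterministic flag so cuDNN does not auto-tune different kernels between runs.

import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN to pick deterministic kernels (this covers batch norm)
    # and stop it from benchmarking different algorithms between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(1234)  # or args.seed, as in the question; call once before building the model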
I'm currently creating a model, and while creating it I came up with some questions. Does training the same model with the same data multiple times lead to better precision, since you're training it every time? And what could be the issue when sometimes an object gets 90% precision and, when I re-run it, gets lower precision or doesn't even predict the right object? Is it because TensorFlow is running on the GPU?
I will guess that you are doing image recognition and that you want to identify images (objects) using a neural network built with Keras. You should train it once, but during training you will do several epochs, meaning the algorithm adapts the weights in several rounds (epochs). In each round it goes over all training images. Once trained, you can use the model to identify images/objects.
You can evaluate the accuracy of your trained model over the same training set, but it is better to use a different set (see train_test_split from sklearn for instance).
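For instance, a quick way to hold out a validation set; this is only a sketch, with X and y standing in for your images and labels:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for validation; fix random_state for repeatability.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)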
Training is a stochastic process, meaning that every time you train your network it will be different in the end. Hence, you will get different accuracies. The stochasticity comes from different initial weights or from using stochastic gradient descent methods, for instance.
The question does not appear to have anything to do with Keras or TensorFlow, but with a basic understanding of how neural networks work. There is no connection to running TensorFlow on the GPU. You will also not get better precision by training with the same objects. If you train your model on a dataset for a very long time (many epochs), you might get into overfitting. Then the accuracy of your model on this training dataset will be very high, but the model will have low accuracy on other datasets.
A common technique is to split your data into training and validation datasets, then repeatedly train your model using EarlyStopping. This will train on the training dataset, then calculate the loss against the validation dataset, and keep training until no further improvement is seen. You can set a patience parameter to wait for X epochs without an improvement before stopping training (and optionally save the best model).
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
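A minimal sketch of that setup with the Keras callback; model, x_train/y_train, and x_val/y_val are assumed to already exist, and the parameter values are just examples:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=5,                 # tolerate 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch's weights
)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=200,           # upper bound; training stops earlier
          callbacks=[early_stop])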
Another trick is image augmentation with ImageDataGenerator, which will generate synthetic data for you (rotations, shifts, mirror images, brightness adjustments, noise, etc.). This can effectively increase the amount of data you have to train with, thus reducing overfitting.
https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
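A small sketch of such an augmentation pipeline; the parameter values are arbitrary examples, and model, x_train/y_train, and x_val/y_val are assumed to exist:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,           # random rotations up to 15 degrees
    width_shift_range=0.1,       # horizontal shifts
    height_shift_range=0.1,      # vertical shifts
    horizontal_flip=True,        # mirror images
    brightness_range=(0.8, 1.2)  # brightness adjustments
)

# Feed augmented batches straight into training.
model.fit(datagen.flow(x_train, y_train, batch_size=32),
          validation_data=(x_val, y_val),
          epochs=50)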
Let's say I have a training sample (with corresponding training labels) for a defined neural network (the architecture of the neural network does not matter for answering this question). Let's call the neural network 'model'.
To avoid any misunderstanding, let's say that I supply the initial weights and biases for 'model'.
Experiment 1.
I use the training sample and the training labels to train 'model' for 40 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB_Final_experiment1.
Experiment 2.
I use the training sample and the training labels to train 'model' for 20 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB_Intermediate.
Now I load WB_Intermediate into 'model' and train for another 20 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB_Final_experiment2.
Considerations: every single parameter, hyperparameter, activation function, loss function... is exactly the same for both experiments, except the number of epochs.
Question: are WB_Final_experiment1 and WB_Final_experiment2 exactly the same?
If you follow this tutorial here, you will find the results of the two experiments (screenshots for Experiment 1 and Experiment 2 are not reproduced here).
In the first experiment the model ran for 4 epochs, and in the second experiment the model ran for 2 epochs and was then trained for 2 more epochs using the last weights of the previous run. You will find that the results vary, but only by a very small amount, and they will always vary due to the randomized initialization of the weights. Still, the predictions of the two models will lie very close to each other.
If the models are initialized with the same weights, then the results at the end of 4 epochs will be the same for both models.
On the other hand, if you train for 2 epochs, shut down your training session without saving the weights, and then train for 2 more epochs after restarting the session, the predictions won't be the same. To avoid that, always load the saved weights before continuing training, using model.load_weights("path to model").
TL;DR
If the models are initialized with the exact same weights, then the output at the end of the same number of training epochs will be the same. If they are randomly initialized, the output will only vary slightly.
If the operations you are doing are entirely deterministic, then yes. Epochs are implemented as an iteration number for a for loop around your training algorithm. You can see this in implementations in PyTorch.
Typically no: the model weights will not be the same, because the optimiser accrues its own state during training. You will need to save that state too to truly resume from where you left off.
See the Pytorch documentation regarding saving and resuming here. But this concept is not limited to the Pytorch framework.
Specifically:
It is important to also save the optimizer’s state_dict, as this
contains buffers and parameters that are updated as the model trains.
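In PyTorch that looks roughly like the following sketch; model and optimizer are assumed to already exist, and the file name is a placeholder:

import torch

# After the first 20 epochs: save both the model and the optimizer state.
torch.save({
    "epoch": 20,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "checkpoint.pt")

# Later: restore both before training the remaining 20 epochs.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"]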
I've been looking through the TensorFlow FullyConnected tutorial, which also uses the helper code mnist.py.
I understand the code except for one nagging piece. After training the neural net, the weights obtained from training should be used to evaluate the precision of the model on the validation (and test) data. However, I don't see that being done anywhere.
In fact, this is the only thing I see in fully_connected_feed.py:
# Evaluate against the validation set.
print('Validation Data Eval:')
do_eval(sess,
        eval_correct,
        images_placeholder,
        labels_placeholder,
        data_sets.validation)

# Evaluate against the test set.
print('Test Data Eval:')
do_eval(sess,
        eval_correct,
        images_placeholder,
        labels_placeholder,
        data_sets.test)
The do_eval() function is passed a parameter eval_correct, which seems to recalculate the logits on this new data. I've been playing around with TF for a while now, but I'm baffled by this code. Any thoughts would be great.
TensorFlow creates a graph containing the weights and biases. Roughly speaking, while you train this neural net the weights and biases get changed so that it produces the expected outputs. Line 131 in fully_connected_feed.py (with tf.Graph().as_default():) tells TensorFlow to use the default graph. Therefore every statement in the training loop, including the calls to the do_eval() function, uses the default graph. Since the weights obtained from training are not reset before evaluation, they are the ones used for it.
eval_correct is the operation used instead of the training operation, so the neural net is only evaluated, not trained. This is important because otherwise the net would also be trained on the evaluation data, which would result in distorted (too good) results.
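For reference, the evaluation op in the tutorial's mnist.py is built along these lines (a simplified sketch): it reuses the same logits tensor, and therefore the weights exactly as training left them, and only counts correct predictions; nothing in it updates the weights.

import tensorflow.compat.v1 as tf

def evaluation(logits, labels):
    # True where the label is in the top-1 prediction for each example.
    correct = tf.nn.in_top_k(predictions=logits, targets=labels, k=1)
    # Number of correctly classified examples in the batch.
    return tf.reduce_sum(tf.cast(correct, tf.int32))

# eval_correct = evaluation(logits, labels_placeholder)
# do_eval() then just runs sess.run(eval_correct, feed_dict=...) over the
# validation or test batches; no training op is ever executed there.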