best_state changes with the model during training in pytorch - python

I want to save the best model and then load it during the test. So I used the following method:
def train():
#training steps …
if acc > best_acc:
best_state = model.state_dict()
best_acc = acc
return best_state
Then, in the main function I used:
model.load_state_dict(best_state)
to resume the model.
However, I found that best_state is always the same as the last state during training, not the best state. Is anyone know the reason and how to avoid it?
By the way, I know I can use torch.save(the_model.state_dict(), PATH) and then load the model by
the_model.load_state_dict(torch.load(PATH)).
However, I don’t want to save the parameters to file as train and test functions are in one file.

model.state_dict() is OrderedDict
from collections import OrderedDict
You can use:
from copy import deepcopy
To fix the problem
Instead:
best_state = model.state_dict()
You should use:
best_state = copy.deepcopy(model.state_dict())
Deep (not shallow) copy makes the mutable OrderedDict instance not to mutate best_state as it goes.
You may check my other answer on saving the state dict in PyTorch.

When you are saving the state of the model you should save the following things in the network
1) Optimizer state and
2) Model's state dict
You can define one method in your class model as following
def save_state(state,filename):
torch.save(state,filename)
'''
When you are saving the state do as follows:
'''
Model model //for example
model.save_state({'state_dict':model.state_dict(), 'optimizer': optimizer.state_dict()})
The saved model will be stored as model.pth.tar (for an example)
Now during loading do the following steps,
checkpoint = torch.load('model.pth.tar')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
Hope this will help you.

Related

optimizer.step() Not updating Model Weights/Parameters

I'm currently working on a solution via PyTorch. I'm not going to share the exact solution but I will provide code that reproduces the issue I'm having.
I have a model defined as follows:
class Net(nn.Module):
def __init__(self):
super(Net,self).__init__()
self.fc1 = nn.Linear(10,4)
def foward(self,x):
return nn.functional.relu(self.fc1(x))
Then I create a instance: my_model = Net(). Next I create an Adam optimizer as such:
optim = Adam(my_model.parameters())
# create a random input
inputs = torch.tensor(np.array([1,1,1,1,1,2,2,2,2,2]),dtype=torch.float32,requires_grad=True)
# get the outputs
outputs = my_model(inputs)
# compute gradients / backprop via
outputs.backward(gradient=torch.tensor([1.,1.,1.,5.]))
# store parameters before optimizer step
before_step = list(my_model.parameters())[0].detach().numpy()
# update parameters via
optim.step()
# collect parameters again
after_step = list(my_model.parameters())[0].detach().numpy()
# Print if parameters are the same or not
print(np.array_equal(before_step,after_step)) # Prints True
I provided my models parameters to the Adam optimizer, so I'm not exactly sure why the parameters aren't updating. I know in most cases one uses a loss function, however I cannot do that in my case but I assumed if I specified model paramters to the optimizers, it would know to connect the two.
Anyone know why the parameters aren't getting updated?
The problem is with detach (docs).
As noted at the bottom:
Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks
So that is exactly what's happening here. To correctly compare the parameters, you need to clone (docs) them to get a real copy.
list(my_model.parameters())[0].clone().detach().numpy()
On a side note, it can be helpful if you check the gradients after optim.step() with print(list(my_model.parameters())[0].grad) to check if the graph is intact. Also, don't forget to call optim.zero_grad().

What does model.compile() do in keras tensorflow?

According to keras.io:
Once the model is created, you can config the model with losses and
metrics with model.compile().
But this explanation does not provide enough information about what exactly compiling model does.
Configures the model for training. documentation
Personally, I wouldn't call it compile, because what it does has got nothing to do with compilation, in computer science terms, and this is very confusing/ overwhelming to think about machine learning and compilation at the same time.
Its just a method which does configuration:
It just sets the arguments you pass it: optimizer, loss function, metrics, eager execution. You can run it multiple times, it will just overwrite the settings you set previously.
My suggestion to developers of TensorFlow would be to rename it to configure in the short term, and perhaps in the future (not that important), move to having 1 setter (or use the factory/ builder pattern) for each configuration argument.
Heres the code for it:
base_layer.keras_api_gauge.get_cell('compile').set(True)
with self.distribute_strategy.scope():
if 'experimental_steps_per_execution' in kwargs:
logging.warn('The argument `steps_per_execution` is no longer '
'experimental. Pass `steps_per_execution` instead of '
'`experimental_steps_per_execution`.')
if not steps_per_execution:
steps_per_execution = kwargs.pop('experimental_steps_per_execution')
self._validate_compile(optimizer, metrics, **kwargs)
self._run_eagerly = run_eagerly
self.optimizer = self._get_optimizer(optimizer)
self.compiled_loss = compile_utils.LossesContainer(
loss, loss_weights, output_names=self.output_names)
self.compiled_metrics = compile_utils.MetricsContainer(
metrics, weighted_metrics, output_names=self.output_names)
self._configure_steps_per_execution(steps_per_execution or 1)
# Initializes attrs that are reset each time `compile` is called.
self._reset_compile_cache()
self._is_compiled = True
self.loss = loss or {} # Backwards compat.
model.compile is related to training your model. Actually, your weights need to optimize and this function can optimize them. In a way that your accuracy make increases. This was just one of the input parameters called 'optimizer'.
model.compile(
optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics='acc'
)
These are the main inputs. Also you can find more details in TensorFlow documentation in link below:
https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile

The result is different when I apply torch.manual_seed before loading cuda() after loading the model

I tried to make sure my code to be reproducible (always get the same results)
So I applied below settings before my codes.
os.environ['PYTHONHASHSEED'] = str(args.seed)
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
torch.cuda.manual_seed_all(args.seed) # if you are using multi-GPU.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
With these settings, I always achieved the same results with the same environment and GPU.
Howerver, when I applied torch.manual_seed() after loading the model.
torch.manual_seed(args.seed)
model = Net()
Net.cuda()
torch.manual_seed(args.seed)
model = Net()
torch.manual_seed(args.seed)
Net.cuda()
The above two results were different.
How should I understand this situation?
Does seed reinitialize after loading the model?
The Net.cuda() has no effect on the random number generator. Under the hood it just calls cuda() for all of the model parameters. So basically it's multiple calls to Tensor.cuda().
https://github.com/pytorch/pytorch/blob/ecd3c252b4da3056797f8a505c9ebe8d68db55c4/torch/nn/modules/module.py#L293
We can test this by doing the following:
torch.random.manual_seed(42)
x = torch.rand(1)
x.cuda()
y = torch.rand(1)
y.cuda()
print(x, y)
# the above prints the same as below
torch.random.manual_seed(42)
print(torch.rand(1), torch.rand(1))
So that means Net() is using the number generator to initialize random weights within the layers.
torch.manual_seed(args.seed)
model = Net()
print(torch.rand(1))
# the below will print a different result
torch.manual_seed(args.seed)
model = Net()
torch.manual_seed(args.seed)
print(torch.rand(1))
I would recommend narrowing the scope of how random numbers are managed within your Python source code. So that a global block of code outside of the Model isn't responsible for how internal values are generated.
Simply said, pass the seed as a parameter to the __init__ of the model.
model = Net(args.seed)
print(torch.rand(1))
This will force developers to always provide a seed for consistency when using the model, and you can make the parameter optional if seeding isn't always necessary.
I'd avoid using the same seed all the time, because you're going to learn to use parameters that work best with that seed.

Keras: How to save models or weights?

I am sorry if this question seems pretty straight forward. But reading the Keras save and restore help page :
https://www.tensorflow.org/beta/tutorials/keras/save_and_restore_models
I do not understand how to use the "ModelCheckpoint" for saving during training. The help file mentions it should give 3 files, I see only one, MODEL.ckpt.
Here is my code:
checkpoint_dir = FolderName + "/tmp/model.ckpt"
cp_callback = k.callbacks.ModelCheckpoint(checkpoint_dir,verbose=1,save_weights_only=True)
parallel_model.compile(optimizer=tf.keras.optimizers.Adam(lr=learning_rate),loss=my_cost_MSE, metrics=['accuracy])
parallel _model.fit(image, annotation, epochs=epoch,
batch_size=batch_size, steps_per_epoch=10,
validation_data=(image_val,annotation_val),validation_steps=num_batch_val,callbacks=callbacks_list)
Also, when I want to load the weights after training with:
model = k.models.load_model(file_checkpoint)
I get the error:
"raise ValueError('Unknown ' + printable_module_name + ':' + object_name)
ValueError: Unknown loss function:my_cost_MSE"
my-cost_MSE is my cost function that is used in the training.
First of all, it looks like you are using the tf.keras (from tensorflow) implementation rather than keras (from the keras-team/keras repo). In this case, as stated in the tf.keras guide :
When saving a model's weights, tf.keras defaults to the checkpoint
format. Pass save_format='h5' to use HDF5.
On the other hand, note that adding the callback ModelCheckpoint is, usually, roughly equivalent to call model.save(...) at the end of each epoch, so that's why you should expect three files to be saved (according to the checkpoint format).
The reason it's not doing so is because, by using the option save_weights_only=True, you are saving just the weights. Roughly equivalent to replace the call to model.save for model.save_weights at the end of each epoch. Hence, the only file that's being saved is the one with the weights.
From here, you can proceed in two different ways:
Storing just the weights
You need your model (the structure, let's say) to be loaded beforehand and then call model.load_weights instead of keras.models.load_model:
model = MyModel(...) # Your model definition as used in training
model.load_weights(file_checkpoint)
Note that in this case, you won't have problems with custom definitions (my_cost_MSE) since you are just loading model weights.
Storing the whole model
Another way to proceed is to store the whole model and load it accordingly:
cp_callback = k.callbacks.ModelCheckpoint(
checkpoint_dir,verbose=1,
save_weights_only=False
)
parallel_model.compile(
optimizer=tf.keras.optimizers.Adam(lr=learning_rate),
loss=my_cost_MSE,
metrics=['accuracy']
)
model.fit(..., callbacks=[cp_callback])
Then you could load it by:
model = k.models.load_model(file_checkpoint, custom_objects={"my_cost_MSE": my_cost_MSE})
Note that in this latter case, you need to specify custom_objects since its definition is needed to deserialize the model.
keras has a save command. It saves all the details needed to rebuild the model.
(from the keras docs)
from keras.models import load_model
model.save('my_model.h5') # creates a HDF5 file 'my_model.h5'
del model # deletes the existing model
# returns am identical compiled model
model = load_model('my_model.h5')

How do I save/load a tensorflow contrib.learn regressor?

I have a tensorflow contrib.learn.DNNRegressor that I have trained as part of the following code snippet:
regressor = tf.contrib.learn.DNNRegressor(feature_columns=fc,
hidden_units=hu_array,
optimizer=tf.train.AdamOptimizer(
learning_rate=0.001,
),
enable_centered_bias=False,
activation_fn=tf.tanh,
model_dir="./models/my_model/",
)
regressor.fit(x=training_features, y=training_labels, steps=10000)
The trained network performs quite well, and I'd like to use it as a part of some other code, on another machine. I have tried copying over the models/my_model directory, and constructing a new DNNRegressor pointing just at the model_dir, but it requires that I supply feature_columns and hidden_units definitions. Shouldn't that information be available via the snapshots stored in model_dir? Is there a better way to save/recover a trained model which is performing well, to be used as a predictor, without having to separately save the feature_columns and hidden_units?
I came up with something workable- not ideal, but it gets the job done. If anyone has a better idea, I am all ears.
I converted my kwargs for DNNRegressor into a dict, and used the ** operator. Then I was able to pickle the kwargs dict, and reconstruct the DNNRegressor from that. E.g:
reg_args = {'feature_columns': fc, 'hidden_units': hu_array, ...}
regressor = tf.contrib.learn.DNNRegressor(**reg_args)
pickle.dump(reg_args, open('reg_args.pkl', 'wb'))
Later on, I reconstruct via:
reg_args = pickle.load(open('reg_args.pkl', 'rb'))
# On another machine and so my model dir path changed:
reg_args['model_dir'] = NEW_MODEL_DIR
regressor = tf.contrib.learn.DNNRegressor(**reg_args)
It worked well. I'm sure there must be a better way but for now if someone is trying to figure out a workaround for tf.contrib.learn, this is a solution.
When training
You call DNNRegressor(..., model_dir) and then call the fit() and evaluate() method.
When testing
You call DNNRegressor(..., model_dir) and then can call predict() methods. Your model will find a trained model in the model_dir and will load the trained model params.
Reference
Issue #3340 of TF

Categories

Resources