Unexpected key(s) in state_dict: "model", "opt" - python

I'm currently using fast.ai to train an image classifier model.
data = ImageDataBunch.single_from_classes(path, classes, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learner = cnn_learner(data, models.resnet34)
learner.model.load_state_dict(
    torch.load('stage-2.pth', map_location="cpu")
)
which results in:
    torch.load('stage-2.pth', map_location="cpu")
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Sequential:
    ...
    Unexpected key(s) in state_dict: "model", "opt".
I have looked around on SO and tried the following solution:
# original saved file with DataParallel
state_dict = torch.load('stage-2.pth', map_location="cpu")

# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`
    new_state_dict[name] = v

# load params
learner.model.load_state_dict(new_state_dict)
which results in:
RuntimeError: Error(s) in loading state_dict for Sequential:
Unexpected key(s) in state_dict: "".
I'm using Google Colab to train my model, then porting the trained model into Docker and trying to host it on a local server.
What could be the issue? Could it be that a different version of PyTorch results in a model mismatch?
In my docker config:
# Install pytorch and fastai
RUN pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
RUN pip install fastai
While my Colab is using the following:
!curl -s https://course.fast.ai/setup/colab | bash

My strong guess is that stage-2.pth contains two top-level items: the model itself (its weights) and the final state of the optimizer which was used to train it. To load just the model, you need only the former. Assuming things were done in the idiomatic PyTorch way, I would try
learner.model.load_state_dict(
    torch.load('stage-2.pth', map_location="cpu")['model']
)
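If you want to double-check what the file actually contains before loading, you can print its top-level keys first; a minimal sketch:
import torch

state = torch.load('stage-2.pth', map_location="cpu")
print(state.keys())  # per the error message, this should show 'model' and 'opt'
learner.model.load_state_dict(state['model'])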
Update: after applying my first round of advice it becomes clear that you're loading a savepoint created with a different (perhaps differently configured?) model than the one you're loading it into. As you can see in the pastebin, the savepoint contains weights for some extra layers not present in your model, such as bn3, downsample, etc.
"0.4.0.bn3.running_var", "0.4.0.bn3.num_batches_tracked", "0.4.0.downsample.0.weight"
At the same time, some other key names match, but the tensors are of different shapes.
size mismatch for 0.5.0.downsample.0.weight: copying a param with shape torch.Size([512, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 1, 1]).
I see a pattern: you consistently try to load a parameter of shape [2^(x+1), 2^x, 1, 1] in place of [2^x, 2^(x-1), 1, 1]. Perhaps you're trying to load a model of a different depth (e.g. loading vgg-16 weights into vgg-11)? Either way, you need to figure out the exact architecture used to create your savepoint and then recreate it before loading the savepoint.
PS. In case you weren't sure - savepoints contain model weights, along with their shapes and (autogenerated) names. They do not contain the full specification of the architecture itself - you need to make sure you're calling model.load_state_dict with model being of exactly the same architecture as was used to create the savepoint. Otherwise you will likely end up with mismatched weight names.
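A quick way to see exactly where the checkpoint and your model disagree is to diff the key names and tensor shapes; a sketch, reusing learner and stage-2.pth from the question:
import torch

ckpt = torch.load('stage-2.pth', map_location="cpu")['model']
model_sd = learner.model.state_dict()

# keys that exist only on one side
print("only in checkpoint:", sorted(set(ckpt) - set(model_sd)))
print("only in model:", sorted(set(model_sd) - set(ckpt)))

# keys present in both but with different tensor shapes
for k in sorted(set(ckpt) & set(model_sd)):
    if ckpt[k].shape != model_sd[k].shape:
        print(k, tuple(ckpt[k].shape), "vs", tuple(model_sd[k].shape))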

Related

weights = 'noisy-student' ValueError: The `weights` argument should be either `None`, `imagenet`, or the path to the weights file to be loaded

I am using Google Colab and I want to use the weights of EfficientNet Noisy Student. https://www.kaggle.com/c/bengaliai-cv19/discussion/132894
First, I installed the package via:
!pip install git+https://github.com/qubvel/efficientnet
Then I tried the code found on the site mentioned above:
import efficientnet.keras as eff
model = eff.EfficientNetB0(weights='noisy-student')
And got this Value error:
ValueError: The `weights` argument should be either `None` (random initialization), `imagenet` (pre-training on ImageNet), or the path to the weights file to be loaded.
Does someone know how to fix this?
You could download the weights from here.
And load it manually like this:
from efficientnet.keras import EfficientNetB5

path_to_weights = "/..your..path../efficientnet-b5_noisy-student_notop.h5"
model = EfficientNetB5(include_top=False)
model.load_weights(path_to_weights, by_name=True)

Layer not built error, even after model.build() in tensorflow 2.0.0

Reference I was following:
https://www.tensorflow.org/api_docs/python/tf/keras/Model#save
I really want to run the model, give it some inputs, and grab some layer outputs from inside the model.
model = tf.keras.models.load_model('emb_movielens100k_all_cols_dec122019')
input_shape = (None, 10)
model.build(input_shape)
All good so far; no errors no warnings.
model.summary()
ValueError: You tried to call `count_params` on IL, but the layer isn't built. You can build it manually via: `IL.build(batch_input_shape)`
How to fix?
The following code does not fix it:
IL.build(input_shape)  # no
model.layers[0].build(input_shape)  # no
This seems to work, but it's a long way from my goal of running the model and grabbing some layer outputs. Isn't there an easy way in TF 2.0.0?
layer1 = model.get_layer(index=1)
This throws an error:
model = tf.saved_model.load('emb_movielens100k_all_cols_dec122019')
input_shape = (None, 10)
model.build(input_shape) #AttributeError: '_UserObject' object has no attribute 'build'
The fix was to use save_model(), not model.save(). I also needed to use save_format="h5" during save, not the default format. Like this:
tf.keras.models.save_model(model, "h5_emb.hp5", save_format="h5")
I also needed to use load_model(), not saved_model.load(), to load the model from disk. Like this:
model = tf.keras.models.load_model('h5_emb.hp5')
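With the model loaded this way, one way to get at intermediate layer outputs (the original goal) is to wrap it in a second Model that exposes the layer you care about; a minimal sketch, where some_inputs stands in for your own input batch:
import tensorflow as tf

model = tf.keras.models.load_model('h5_emb.hp5')

# build a second model that outputs the layer at index 1 (as with model.get_layer above)
intermediate = tf.keras.Model(inputs=model.input,
                              outputs=model.get_layer(index=1).output)

layer_outputs = intermediate.predict(some_inputs)  # some_inputs: your input data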
The other tutorial and documentation ways of doing save and load returned a model that did not work right for predictions or summary.
This is tensorflow version 2.0.0.
Hope this helps others.

Incorrect results obtained on running a model in LibTorch that was trained and exported from PyTorch

I am trying to export a trained model along with weights for inference in C++ using LibTorch. However, the output tensors from Python and C++ do not match, even though their shapes are the same.
model = FCN()
state_dict = torch.load('/content/gdrive/My Drive/model/trained_model.pth')
model.load_state_dict(state_dict)
example = torch.randn(1, 3, 768, 1024)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save('/content/gdrive/My Drive/model/mymodel.pt')
However, some warnings are generated which I think may be causing the incorrect results.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:137:
TracerWarning: Converting a tensor to a Python index might cause the
trace to be incorrect. We can't record the data flow of Python values,
so this value will be treated as a constant in the future. This means
that the trace might not generalize to other inputs!
/usr/local/lib/python3.6/dist-packages/torch/tensor.py:435:
RuntimeWarning: Iterating over a tensor might cause the trace to be
incorrect. Passing a tensor of different shape won't change the number
of iterations executed (and might lead to errors or silently give
incorrect results).
Following is the LibTorch code to generate the output tensor
at::Tensor predict(std::shared_ptr<torch::jit::script::Module> model, at::Tensor &image_tensor) {
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(image_tensor);
    at::Tensor result = model->forward(inputs).toTensor();
    return result;
}
Has anyone tried using a trained PyTorch model in LibTorch?
Just ran into the same issue, and found a solution:
add
model.eval()
before
traced_script_module = torch.jit.trace(model, example)
and the model gives the same result in C++ as in Python.
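Putting that together with the export code from the question, the trace step would look roughly like this (same paths and input size as above):
import torch

model = FCN()
model.load_state_dict(torch.load('/content/gdrive/My Drive/model/trained_model.pth'))
model.eval()  # disable dropout and use running BatchNorm statistics before tracing

example = torch.randn(1, 3, 768, 1024)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save('/content/gdrive/My Drive/model/mymodel.pt')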

Computing gradients in extracted Tensorflow subgraph which contains a 'while_loop'

In some deep learning workflows, it is useful to train a model, extract it out of its graph using tf.graph_util.convert_variables_to_constants or tf.graph_util.extract_sub_graph so training-related tensors are left out, and then connect the extracted subgraph to other model(s) via tf.import_graph_def. In this way, the trained model can serve as a building block in a larger setup.
Often, we'd like to backpropagate through the new, composite model, in order to fine-tune it, optimize the inputs and so on.
However, it appears that one cannot define a gradient through a while_loop tensorflow operation in an imported graph, since it relies on 'outer context', an object added into the metagraph's collections (see TF issue #7404). Slightly adapting the example in this Github issue, here's an example of what I am trying to do:
import tensorflow as tf

g1 = tf.Graph()
sess1 = tf.Session(graph=g1)
with g1.as_default():
    with sess1.as_default():
        i = tf.constant(0, name="input")
        out = tf.while_loop(lambda i: tf.less(i, 5), lambda i: [tf.add(i, 1)], [i], name="output")
        loss = tf.square(out, name='loss')
        graph_def = tf.graph_util.convert_variables_to_constants(sess1, g1.as_graph_def(), ['output/Exit'])
g2 = tf.Graph()
with g2.as_default():
    tf.import_graph_def(graph_def, name='')
    i_imported = g2.get_tensor_by_name("input:0")
    out_imported = g2.get_tensor_by_name("output/Exit:0")
    tf.gradients(out_imported, i_imported)
The last line raises an AttributeError: 'NoneType' object has no attribute 'outer_context' error.
Tensorflow's solution to this issue is to use tf.train.export_meta_graph and tf.train.import_meta_graph so the outer context is copied, but this copies the entire graph, without editing. In this minimal case, the 'loss' tensor won't be removed.
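For reference, the meta-graph route I'm referring to would look roughly like this (a sketch reusing g1 from above; note that nothing is pruned, so 'loss' comes along too):
# export the full graph, including the while_loop's control-flow context
with g1.as_default():
    meta_graph_def = tf.train.export_meta_graph()

g_meta = tf.Graph()
with g_meta.as_default():
    tf.train.import_meta_graph(meta_graph_def)
    i_imported = g_meta.get_tensor_by_name("input:0")
    out_imported = g_meta.get_tensor_by_name("output/Exit:0")
    grads = tf.gradients(out_imported, i_imported)  # the while context is restored from the collections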
I tried copying the missing context to the new graph:
g2.add_to_collection('while_context', g1.get_collection('while_context'))
But it doesn't solve the issue.
Is there a way to overcome this limitation or is it an irreparable Tensorflow design flaw?

MxNet: label_shapes don't match names specified by label_names

I wrote a script to do the classification of a single input image using a model I trained with MxNet. To classify the incoming image I feed it forward through the network.
In short here is what I am doing:
import mxnet as mx
from collections import namedtuple

symbol, arg_params, aux_params = mx.model.load_checkpoint('model-prefix', 42)
model = mx.mod.Module(symbol=symbol, context=mx.cpu())
model.bind(data_shapes=[('data', (1, 3, 224, 244))], for_training=False)
model.set_params(arg_params, aux_params)

# ... loading the image & resizing ...
# img is the image to classify as a numpy array of shape (3, 244, 244)

Batch = namedtuple('Batch', ['data'])
model.forward(Batch(data=[mx.nd.array(img)]))
probabilities = model.get_outputs()[0].asnumpy()
print(str(probabilities))
This works fine, except that I am getting the following warning
UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
What should I change to avoid getting this warning? It is not clear to me what the label_shapes and label_names parameters are meant for, and what I am expected to fill them with.
Note: I found some threads about them, but none enabled me to solve the problem. Similarly, the MxNet documentation doesn't provide much detail on what those parameters are and how they are supposed to be filled.
Set label_names=None and allow_missing=True. That should get rid of the warning.
model = mx.mod.Module(symbol=symbol, context=mx.cpu(), label_names=None)
...
model.set_params(arg_params, aux_params, allow_missing=True)
If you are curious why the warning is printed in the first place, here is what is happening.
Every module has an associated label. When this model was trained, softmax_label was used as the label (most likely because the output layer was a softmax layer named 'softmax'). When the model was loaded from file, the module that was created had softmax_label as its label.
>>>print(model.label_names)
['softmax_label']
model.bind is then called without providing label_shapes.
model.bind(data_shapes=[('data', (1, 3, 224, 244))], for_training=False)
MXNet sees that the module has a label in it which was not provided during bind and complains about it - which is the warning message you see.
I think if bind is called with for_training=False, MXNet shouldn't complain about the missing label. I've created this issue: https://github.com/dmlc/mxnet/issues/6958
However, for this particular case where we load a model from disk, we can load it with None as the label so that MXNet doesn't complain later when bind doesn't provide a label - which is what the suggested fix does.
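Applied to the loading code from the question, the whole thing would look roughly like this:
import mxnet as mx

symbol, arg_params, aux_params = mx.model.load_checkpoint('model-prefix', 42)

# label_names=None: the module no longer expects a 'softmax_label' input
model = mx.mod.Module(symbol=symbol, context=mx.cpu(), label_names=None)
model.bind(data_shapes=[('data', (1, 3, 224, 244))], for_training=False)

# allow_missing=True: tolerate parameters that are absent from the checkpoint
model.set_params(arg_params, aux_params, allow_missing=True)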
