I've trained a number of graphs using the provided MobileNet V3 (small) definition, but when I run (TensorFlow) Lucid to generate visualisations, Lucid fails with an error. If I modify the definition to exclude the Squeeze/Excite blocks, the visualisations are generated.
With TensorFlow 1.14 and Lucid installed, I downloaded the trained MobileNet V3 graph file "Small dm=0.75 (float)" from here (https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet), extracted the files into "D:/temp", and ran the following code:
import tensorflow as tf
import lucid.optvis.render as render
from lucid.modelzoo.vision_base import Model


class SSDMobilenetV3(Model):
    def __init__(self, graph_path):
        self.model_path = graph_path
        self.input_name = "input"
        self.image_shape = [224, 224, 3]
        self.image_value_range = (-1, 1)
        super().__init__()


model = SSDMobilenetV3("D:/temp/v3-small_224_0.75_float/v3-small_224_0.75_float.pb")
model.load_graphdef()
# model.show_graph()
_ = render.render_vis(model, "MobilenetV3/expanded_conv_6/output:0")
There's a fair bit of stacktrace, but the key errors are:
LookupError: gradient registry has no entry for: AddV2
and
LookupError: No gradient defined for operation 'import/MobilenetV3/expanded_conv_6/squeeze_excite/Conv_1/add' (op type: AddV2)
Then I tried using the V3_SMALL_MINIMALISTIC definition in "mobilenet_v3.py" (registering a new feature extractor) to train a test model. This is essentially the same model but without the "squeeze_excite" insertions (although I also reinstated the hard_swish activation function).
The above code ran fine on the new model, rendering an image.
This leads me to believe that the problem resides in the "squeeze_excite" implementation (in slim/nets/mobilenet/conv_blocks.py).
But I have not been able to diagnose the problem further: is it Lucid, is it the Squeeze/Excite block, is it TensorFlow, or is it just a fact about the world?
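One workaround I've been considering (a minimal, unverified sketch, not something from the Lucid or TensorFlow docs) is to alias the gradient that TensorFlow registers for the older Add op onto AddV2 before calling render_vis, so that the gradient-registry lookup succeeds; I'd still like to understand whether this is sane or whether the Squeeze/Excite block itself is at fault:

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_grad  # private module, used here only for this sketch

# Assumption: AddV2 is elementwise addition like Add, so Add's gradient function
# should apply unchanged. Register it only if this TF build hasn't done so already.
try:
    ops.RegisterGradient("AddV2")(math_grad._AddGrad)  # pylint: disable=protected-access
except KeyError:
    pass  # a gradient for AddV2 is already registered in this build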
Related
I am trying to use the beta Google Custom Prediction Routine in Google's AI Platform to run a live version of my model.
My package includes predictor.py, which contains a Predictor class:
import os
import numpy as np
import pickle
import keras
from keras.models import load_model


class Predictor(object):
    """Interface for constructing custom predictors."""

    def __init__(self, model, preprocessor):
        self._model = model
        self._preprocessor = preprocessor

    def predict(self, instances, **kwargs):
        """Performs custom prediction.

        Instances are the decoded values from the request. They have already
        been deserialized from JSON.

        Args:
            instances: A list of prediction input instances.
            **kwargs: A dictionary of keyword args provided as additional
                fields on the predict request body.

        Returns:
            A list of outputs containing the prediction results. This list must
            be JSON serializable.
        """
        # pre-processing
        preprocessed_inputs = self._preprocessor.preprocess(instances[0])
        # predict
        outputs = self._model.predict(preprocessed_inputs)
        # post-processing (flip each prediction left-to-right)
        outputs = np.array([np.fliplr(x) for x in outputs])
        return outputs.tolist()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of Predictor using the given path.

        Loading of the predictor should be done in this method.

        Args:
            model_dir: The local directory that contains the exported model
                file along with any additional files uploaded when creating the
                version resource.

        Returns:
            An instance implementing this Predictor class.
        """
        model_path = os.path.join(model_dir, 'keras.model')
        model = load_model(model_path, compile=False)
        preprocessor_path = os.path.join(model_dir, 'preprocess.pkl')
        with open(preprocessor_path, 'rb') as f:
            preprocessor = pickle.load(f)
        return cls(model, preprocessor)
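For what it's worth, I exercise the class locally with a quick smoke test roughly like this (a sketch: 'exported_model' and the dummy instance shape are placeholders, not my real export or request payload):

import numpy as np

# Hypothetical local check of the custom prediction routine before deploying.
predictor = Predictor.from_path('exported_model')    # directory holding keras.model and preprocess.pkl
dummy_instance = np.zeros((224, 224, 1)).tolist()    # instance shape assumed purely for illustration
print(predictor.predict([dummy_instance])[:1])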
The full error Create Version failed. Bad model detected with error: "Failed to load model: Unexpected error when loading the model: 'str' object has no attribute 'decode' (Error code: 0)" indicates that the issue is in this script, specifically when loading the model. However, I am able to successfully load the model in my notebook locally with the same code block in predict.py:
from keras.models import load_model
model = load_model('keras.model', compile=False)
I have seen similar posts which suggest setting h5py<3.0.0, but this hasn't helped. I can pin module versions for my custom prediction routine in a setup.py file:
from setuptools import setup

REQUIRED_PACKAGES = ['keras==2.3.1', 'h5py==2.10.0', 'opencv-python', 'pydicom', 'scikit-image']

setup(
    name='my_custom_code',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    version='0.23',
    scripts=['predictor.py', 'preprocess.py'])
Unfortunately, I haven't found a good way to debug model deployment in Google's AI Platform, and the troubleshooting guide is unhelpful. Any pointers would be much appreciated. Thanks!
Edit 1:
The h5py module's version is wrong: it is 3.1.0, despite being set to 2.10.0 in setup.py. Does anyone know why? I confirmed that the Keras version and other modules are set properly, however. I've tried 'h5py==2.9.0' and 'h5py<3.0.0' to no avail. More on including PyPI package dependencies here.
Edit 2:
So it turns out Google currently does not support this capability.
StackOverflow, enzed01
I encountered the same problem using AI Platform with code that was running fine two months ago, when we last trained our models. Indeed, it is due to the h5py dependency, which suddenly fails to load the h5 model.
After a while I was able to make it work with runtime 2.2 and Python version 3.7. I am also using the custom prediction routine, and my model was a simple 2-layer bidirectional LSTM serving classifications.
I had a notebook VM set up with TF == 2.1 and downgraded h5py to <3.0.0 with:
!pip uninstall -y h5py
!pip install 'h5py < 3.0.0'
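After the downgrade I restart the notebook kernel and sanity-check the version the kernel actually sees:

import h5py

# Should now report a 2.x release (e.g. 2.10.0); if it still says 3.x,
# the downgrade didn't take effect in this kernel.
print(h5py.__version__)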
My setup.py looks like this:
from setuptools import setup

REQUIRED_PACKAGES = ['tensorflow==2.1', 'h5py<3.0.0']

setup(
    name="my_package",
    version="0.1",
    include_package_data=True,
    scripts=["preprocess.py", "model_prediction.py"]
)
I added compile=False to my model-load code. Without it, I ran into another problem with deployment, which gave the following error: Create Version failed. Bad model detected with error: "Failed to load model: Unexpected error when loading the model: 'sample_weight_mode' (Error code: 0)"
The code change from OP:
model = keras.models.load_model(
    os.path.join(model_dir, 'model.h5'), compile=False)
And this made the model deploy as before without a problem. I suspect compile=False might mean slower prediction serving, but I have not noticed anything so far.
Hope this helps anyone stuck and googling these issues!
Thanks for your attention; I'm developing an automatic speaker recognition system using SincNet.
Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
Since the network is coded in PyTorch, I searched and found a Keras implementation here: https://github.com/grausof/keras-sincnet. I adapted the train.py code to train a SincNet with my own data in TensorFlow 2.0, and it worked fine. I saved only the weights of my trained network; my training data has shape 128,3200,1 for inputs and 128 for labels per batch.
# Imports assumed for this snippet (tf.keras under TensorFlow 2.0)
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.optimizers import Adam

# Creates a SincNet model with input_size=3200 (wlen), num_classes=40, fs=16000
redsinc = create_model(wlen, num_classes, fs)

# Saves only the weights, plus an early-stopping callback
checkpointer = ModelCheckpoint(filepath='checkpoints/SincNetBiomex3.hdf5', verbose=1,
                               save_best_only=True, monitor='val_accuracy',
                               save_weights_only=True)
stopearly = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)
callbacks = [checkpointer, stopearly]

# optimizer = RMSprop(lr=learnrate, rho=0.9, epsilon=1e-8)
optimizer = Adam(learning_rate=learnrate)

# Creates a generator of training batches
train_generator = batchGenerator(batch_size, train_inputs, train_labels, wlen)
validinputs, validlabels = create_batches_rnd(validation_labels.shape[0],
                                              validation_inputs, validation_labels, wlen)

# Compile the model and train with fit_generator
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = redsinc.fit_generator(train_generator, steps_per_epoch=N_batches, epochs=epochs,
                                verbose=1, callbacks=callbacks,
                                validation_data=(validinputs, validlabels))
The problem came when I tried to evaluate the network. I didn't use the code found in test.py; I only loaded the weights I previously saved and used the evaluate function. My test data had shape 1200,3200,1 for the inputs and 1200 for the labels.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
test_loss, test_accuracy = redsinc.evaluate(x=eval_in,y=eval_lab)
RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer, loss)`.
Then I added the same compile code I used for training:
optimizer = Adam(learning_rate=0.001)
redsinc.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Then I reran the test code and got this:
WARNING:tensorflow:From C:\Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
ValueError: A tf.Variable created inside your tf.function has been garbage-collected. Your code needs to keep Python references to variables created inside `tf.function`s.
A common way to raise this error is to create and return a variable only referenced inside your function:

@tf.function
def f():
    v = tf.Variable(1.0)
    return v

v = f()  # Crashes with this error message!

The reason this crashes is that @tf.function annotated function returns a **`tf.Tensor`** with the **value** of the variable when the function is called rather than the variable instance itself. As such there is no code holding a reference to the `v` created inside the function and Python garbage collects it.
The simplest way to fix this issue is to create variables outside the function and capture them:

v = tf.Variable(1.0)

@tf.function
def f():
    return v

f()  # <tf.Tensor: ... numpy=1.>
v.assign_add(1.)
f()  # <tf.Tensor: ... numpy=2.>
I don't understand the error, since I've evaluated other networks with the same function and never had any problems. I then decided to use the predict function to match predicted labels with the correct labels and obtain all the metrics with my own code, but I got another error.
# Create a Sincnet model and load previously saved weights
redsinc = create_model(wlen,num_clases,fs)
redsinc.load_weights('checkpoints/SincNetBiomex3.hdf5')
print('Model loaded')
#Predict labels with test data
predict_labels = redsinc.predict(eval_in)
Error while reading resource variable _AnonymousVar212 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar212/class tensorflow::Var does not exist.
[[node sinc_conv1d/concat_104/ReadVariableOp (defined at \Users\atenc\Anaconda3\envs\py3.7-tf2.0gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_13649]
Function call stack:
keras_scratch_graph
I hope someone can tell me what these errors mean and how to solve them. I've searched for solutions, but most of those I've found don't seem related to my problem, so I can't apply them. I'm guessing the errors are caused by the SincNet layer code, because it is a custom-coded layer. The code for the SincNet layer can be found in the GitHub repository, in the file sincnet.py.
I appreciate all the help I can get; again, thank you for your attention.
You should downgrade your tf and keras versions; it worked for me when I faced the same problem.
Try keras==2.1.6 with tensorflow-gpu==1.13.1.
These days I am trying to track down an error concerning the deployment of a TF model with TPU support.
I can get a model without TPU support running, but as soon as I enable quantization, I get lost.
I am in the following situation:
Created a model and trained it
Created an eval graph of the model
Froze the model and saved the result as protocol buffer
Successfully converted and deployed it without TPU support
For the last point, I used the TFLiteConverter's Python API. The script that produces a functional tflite model is
import tensorflow as tf
graph_def_file = 'frozen_model.pb'
inputs = ['dense_input']
outputs = ['dense/BiasAdd']
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, inputs, outputs)
converter.inference_type = tf.lite.constants.FLOAT
input_arrays = converter.get_input_arrays()
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
This tells me that my approach seems to be ok up to this point. Now, if I want to utilize the Coral TPU stick, I have to quantize my model (I took that into account during training). All I have to do is to modify my converter script. I figured that I have to change it to
import tensorflow as tf
graph_def_file = 'frozen_model.pb'
inputs = ['dense_input']
outputs = ['dense/BiasAdd']
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, inputs, outputs)
converter.inference_type = tf.lite.constants.QUANTIZED_UINT8 ## Indicates TPU compatibility
input_arrays = converter.get_input_arrays()
converter.quantized_input_stats = {input_arrays[0]: (0., 1.)} ## mean, std_dev
converter.default_ranges_stats = (-128, 127) ## min, max values for quantization (?)
converter.allow_custom_ops = True ## not sure if this is needed
## REMOVED THE OPTIMIZATIONS ALTOGETHER TO MAKE IT WORK
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
This tflite model produces results when loaded with the interpreter's Python API, but I am not able to understand their meaning. Also, there is no documentation (or if there is, it is well hidden) on how to choose mean, std_dev and the min/max ranges. Furthermore, after compiling this with the edgetpu_compiler and deploying it (loading it with the C++ API), I receive an error:
INFO: Initialized TensorFlow Lite runtime.
ERROR: Failed to prepare for TPU. generic::failed_precondition: Custom op already assigned to a different TPU.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.
Segmentation fault
I suppose I missed a flag or something during the conversion process. But as the documentation is also lacking here, I can't say for sure.
In short:
What do the params mean, std_dev, min/max do and how do they interact?
What am I doing wrong during the conversion?
I am grateful for any help or guidance!
EDIT: I have opened a github issue with the full test code. Feel free to play around with this.
You should never need to manually set the quantization stats.
Have you tried the post-training-quantization tutorials?
https://www.tensorflow.org/lite/performance/post_training_integer_quant
Basically they set the quantization options:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
Then they pass a "representative dataset" to the converter, so that the converter can run the model a few batches to gather the necessary statistics:
def representative_data_gen():
    for input_value in mnist_ds.take(100):
        yield [input_value]
converter.representative_dataset = representative_data_gen
While there are options for quantized training, it's always easier to do post-training quantization.
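Putting the pieces together, the full-integer flow from the linked tutorial looks roughly like this (a sketch using the TF 2.x SavedModel API; 'saved_model_dir' and representative_data_gen are placeholders for your own export and calibration generator):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # uint8 I/O as the Edge TPU expects
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_model)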
I'm currently using fast.ai to train an image classifier model.
data = ImageDataBunch.single_from_classes(path, classes, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learner = cnn_learner(data, models.resnet34)
learner.model.load_state_dict(
    torch.load('stage-2.pth', map_location="cpu")
)
which results in:
    torch.load('stage-2.pth', map_location="cpu")
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Sequential:
...
    Unexpected key(s) in state_dict: "model", "opt".
I have looked around in SO and tried to use the following solution:
# original saved file with DataParallel
state_dict = torch.load('stage-2.pth', map_location="cpu")

# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`
    new_state_dict[name] = v

# load params
learner.model.load_state_dict(new_state_dict)
which results in:
RuntimeError: Error(s) in loading state_dict for Sequential:
Unexpected key(s) in state_dict: "".
I'm using Google Colab to train my model, then porting the trained model into Docker and trying to host it on a local server.
What could be the issue? Could it be a different version of PyTorch, resulting in a model mismatch?
In my docker config:
# Install pytorch and fastai
RUN pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
RUN pip install fastai
While my Colab is using the following:
!curl -s https://course.fast.ai/setup/colab | bash
My strong guess is that stage-2.pth contains two top-level items: the model itself (its weights) and the final state of the optimizer which was used to train it. To load just the model, you need only the former. Assuming things were done in the idiomatic PyTorch way, I would try
learner.model.load_state_dict(
    torch.load('stage-2.pth', map_location="cpu")['model']
)
Update: after applying my first round of advice, it becomes clear that you're loading a savepoint created with a different (perhaps differently configured?) model than the one you're loading it into. As you can see in the pastebin, the savepoint contains weights for some extra layers not present in your model, such as bn3, downsample, etc.
"0.4.0.bn3.running_var", "0.4.0.bn3.num_batches_tracked", "0.4.0.downsample.0.weight"
At the same time, some other key names match, but the tensors are of different shapes.
size mismatch for 0.5.0.downsample.0.weight: copying a param with shape torch.Size([512, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 1, 1]).
I see a pattern: you consistently try to load a parameter of shape [2^(x+1), 2^x, 1, 1] in place of [2^x, 2^(x-1), 1, 1]. Perhaps you're trying to load a model of a different depth (e.g. loading vgg-16 weights into vgg-11)? Either way, you need to figure out the exact architecture used to create your savepoint and then recreate it before loading the savepoint.
PS. In case you weren't sure: savepoints contain model weights, along with their shapes and (autogenerated) names. They do not contain the full specification of the architecture itself; you need to make sure that you're calling model.load_state_dict with a model of exactly the same architecture as was used to create the savepoint. Otherwise you will likely end up with mismatched weight names.
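If you want to confirm what the savepoint actually contains before deciding, a quick inspection (a sketch) is:

import torch

# Peek at the top-level keys and a few weight names/shapes in the savepoint.
state = torch.load('stage-2.pth', map_location='cpu')
print(state.keys())                                    # e.g. dict_keys(['model', 'opt'])
for name, tensor in list(state['model'].items())[:5]:
    print(name, tuple(tensor.shape))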
In some deep learning workflows, it is useful to train a model, extract it out of its graph using tf.graph_util.convert_variables_to_constants or tf.graph_util.extract_sub_graph so training-related tensors are left out, and then connect the extracted subgraph to other model(s) via tf.import_graph_def. In this way, the trained model can serve as a building block in a larger setup.
Often, we'd like to backpropagate through the new, composite model, in order to fine-tune it, optimize the inputs and so on.
However, it appears that one cannot define a gradient through a while_loop tensorflow operation in an imported graph, since it relies on 'outer context', an object added into the metagraph's collections (see TF issue #7404). Slightly adapting the example in this Github issue, here's an example of what I am trying to do:
import tensorflow as tf

g1 = tf.Graph()
sess1 = tf.Session(graph=g1)
with g1.as_default():
    with sess1.as_default():
        i = tf.constant(0, name="input")
        out = tf.while_loop(lambda i: tf.less(i, 5), lambda i: [tf.add(i, 1)], [i], name="output")
        loss = tf.square(out, name='loss')
        graph_def = tf.graph_util.convert_variables_to_constants(sess1, g1.as_graph_def(), ['output/Exit'])

g2 = tf.Graph()
with g2.as_default():
    tf.import_graph_def(graph_def, name='')
    i_imported = g2.get_tensor_by_name("input:0")
    out_imported = g2.get_tensor_by_name("output/Exit:0")
    tf.gradients(out_imported, i_imported)
The last line raises an AttributeError: 'NoneType' object has no attribute 'outer_context' error.
TensorFlow's solution to this issue is to use tf.train.export_meta_graph and tf.train.import_meta_graph so the outer context is copied, but this copies the entire graph, without editing. In this minimal case, the 'loss' tensor won't be removed.
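For reference, the meta-graph route looks roughly like this (a sketch; as I understand it, the while-loop context travels with the collections, but so does everything else, including the 'loss' node I want to strip):

with g1.as_default():
    meta_graph_def = tf.train.export_meta_graph()  # carries collections, incl. control-flow contexts

g3 = tf.Graph()
with g3.as_default():
    tf.train.import_meta_graph(meta_graph_def)
    i_imported = g3.get_tensor_by_name("input:0")
    out_imported = g3.get_tensor_by_name("output/Exit:0")
    grads = tf.gradients(out_imported, i_imported)  # gradient works here, but the graph is unpruned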
I tried copying the missing context to the new graph:
g2.add_to_collection('while_context',g1.get_collection('while_context'))
But it doesn't solve the issue.
Is there a way to overcome this limitation or is it an irreparable Tensorflow design flaw?