Convert Keras MobileNet model to TFLite with 8-bit quantization - python

I have used Keras to fine-tune MobileNet v1. Now I have model.h5 and need to convert it to TensorFlow Lite to use it in an Android app.
I use the TFLite conversion script tflite_convert. I can convert the model without quantization, but I need better performance, so I need to quantize it.
If I run this script:
tflite_convert --output_file=model_quant.tflite \
--keras_model_file=model.h5 \
--inference_type=QUANTIZED_UINT8 \
--input_arrays=input_1 \
--output_arrays=predictions/Softmax \
--mean_values=128 \
--std_dev_values=127
It fails:
F tensorflow/contrib/lite/toco/] Array
conv1_relu/Relu6, which is an input to the DepthwiseConv operator
producing the output array conv_dw_1_relu/Relu6, is lacking min/max
data, which is necessary for quantization. If accuracy matters, either
target a non-quantized output format, or run quantized training with
your model from a floating point checkpoint to change the input graph
to contain min/max information. If you don't care about accuracy, you
can pass --default_ranges_min= and --default_ranges_max= for easy
experimentation.\nAborted (core dumped)\n"
If I use default_ranges_min and default_ranges_max (so-called "dummy quantization"), the conversion works, but as the error log says, that is only useful for debugging performance, not accuracy.
So what do I need to do to make the Keras model correctly quantizable? Do I need to find the best default_ranges_min and default_ranges_max? How? Or does it require changes in the Keras training phase?
Library versions:
Python 3.6.4
TensorFlow 1.12.0
Keras 2.2.4

Unfortunately, TensorFlow does not yet provide tooling for post-training per-layer quantization in the flatbuffer (tflite) format, only in protobuf. The only way available for now is to introduce fakeQuantization layers into your graph and re-train / fine-tune your model on the training set or a calibration set. This is called "quantization-aware training".
Once the fakeQuant layers are in place, you can feed in the training set. TF uses them in the forward pass as simulated quantization layers (fp32 datatypes that represent 8-bit values) and back-propagates using full-precision values. This way, you can recover the accuracy lost to quantization.
In addition, the fakeQuant layers capture the ranges per layer or per channel through a moving average and store them in min / max variables.
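To make the idea of simulated quantization more concrete, here is a minimal sketch in plain Python. It is not TensorFlow's implementation: the names `fake_quant` and `RangeTracker`, the decay value, and the sample batches are all hypothetical, and the sketch leaves out details such as nudging the range so that zero is exactly representable. It only shows the two ingredients described above: a quantize-then-dequantize step that keeps values in floating point, and a moving average that tracks the min / max range.

```python
def fake_quant(x, x_min, x_max, num_bits=8):
    """Quantize-then-dequantize one float. The result is still a float,
    but it can only take 2**num_bits distinct values in [x_min, x_max]."""
    levels = 2 ** num_bits - 1                 # 255 steps for 8 bits
    scale = (x_max - x_min) / levels           # width of one step
    clipped = min(max(x, x_min), x_max)        # saturate outside the range
    q = round((clipped - x_min) / scale)       # integer code in 0..levels
    return x_min + q * scale                   # back to float


class RangeTracker:
    """Exponential moving average of the min / max seen in each batch."""

    def __init__(self, decay=0.9):             # decay is an assumed value
        self.decay = decay
        self.min = None
        self.max = None

    def update(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.min is None:                   # first batch initialises the range
            self.min, self.max = b_min, b_max
        else:
            self.min = self.decay * self.min + (1 - self.decay) * b_min
            self.max = self.decay * self.max + (1 - self.decay) * b_max


# Hypothetical activations from three training batches
tracker = RangeTracker()
for batch in ([0.0, 1.5, 5.9], [0.1, 2.0, 6.1], [0.0, 3.0, 6.0]):
    tracker.update(batch)

# Values outside the tracked range saturate; values inside snap to a step
simulated = [fake_quant(v, tracker.min, tracker.max) for v in [0.0, 2.5, 7.0]]
```

The tracked `min` and `max` play the role of the min/max data that the converter error message in the question complains about.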
Later, you can extract the graph definition and get rid of the fakeQuant nodes with the freeze_graph tool.
Finally, the model can be fed into tf_lite_converter (fingers crossed that it won't break) to produce the u8 tflite model with the captured ranges.
A very good white paper explaining all of this is provided by Google here:
Hope that helps.
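As a side note on the `--mean_values` and `--std_dev_values` flags in the question's command: the TFLite converter documentation describes them through the mapping real_value = (quantized_value - mean_value) / std_dev_value. The short sketch below (the helper name `dequantize_input` is hypothetical) shows which float range the uint8 input covers for the values 128 and 127 used in the question.

```python
def dequantize_input(q, mean_value, std_dev_value):
    """Map a uint8 input value to the float value it represents."""
    return (q - mean_value) / std_dev_value


mean_value, std_dev_value = 128, 127   # values from the question's command

low = dequantize_input(0, mean_value, std_dev_value)     # smallest uint8 input
high = dequantize_input(255, mean_value, std_dev_value)  # largest uint8 input
print(low, high)
```

With these settings the uint8 input spans roughly -1.008 to 1.0, so the flags should match whatever input scaling the model was trained with.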


How do you identify a sparse tensor for output purposes?

I want to get the prediction / output of my pre-trained model. The model predicts a symbol for each frame (column) of the convolved image, and the logits (output of the RNN) need post-processing to emit the actual sequence of predicted symbols. Code for model construction can be found here.
logits = graph.get_tensor_by_name("fully_connected/BiasAdd:0")
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
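# `sess` is assumed to be the tf.Session that holds the restored graph
prediction = sess.run(decoded,
                      feed_dict={
                          input: image,
                          seq_len: seq_lengths,
                          rnn_keep_prob: 1.0,
                      })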
Prediction is a SparseTensorValue containing every predicted symbol. Decoded is a sparse tensor of non-empty tensors. Ultimately, I parse the resulting SparseTensorValue for the strings I need.
I want to use this trained model for inference through either TensorFlow Serving or TFLite. To proceed, however, I need to indicate the output nodes of the model, and given the nature of sparse tensors, I won't be able to indicate them by name. Is there a way for me to use this model for proper inference?
I've seen many examples of using CTC decoders such as this one for prediction in a similar way, but none of them used the model for inference without relying closely on the TensorFlow API, so I am unsure how to proceed.
You can save your model in the TensorFlow SavedModel format. After that, you can use the saved_model_cli command-line tool, which is installed with the tensorflow package, to inspect all model signatures: saved_model_cli show --dir . --all. It lists the names, dtypes, and shapes of every input and output. The default signature is called serving_default.
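On the question of naming a sparse output: a tf.SparseTensor is backed by three ordinary dense tensors (indices, values and dense_shape), and each of them has a name of its own, so they can be listed as separate outputs. The sketch below uses plain Python and hypothetical sample data to show how such a triple could be turned back into one symbol sequence per batch item on the client side; the function name `sparse_to_sequences` is an assumption made for illustration.

```python
def sparse_to_sequences(indices, values, dense_shape):
    """Group the values of a 2-D sparse tensor by row (one row per batch item)."""
    batch_size = dense_shape[0]
    sequences = [[] for _ in range(batch_size)]
    # Sort by (row, column) so symbols come out in time order
    for (row, _col), value in sorted(zip(indices, values)):
        sequences[row].append(value)
    return sequences


# Hypothetical decoder output for a batch of two images
indices = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]   # (batch item, time step)
values = [7, 3, 4, 4, 9]                              # predicted symbol ids
dense_shape = (2, 3)                                  # (batch size, max length)

decoded_ids = sparse_to_sequences(indices, values, dense_shape)
print(decoded_ids)  # [[7, 3], [4, 4, 9]]
```

Mapping the symbol ids back to characters would then be a simple lookup in whatever vocabulary the model was trained with.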

Difference in prediction values for a binary classifier in and Tensorflow

I was comparing results between the same model generated and used between two frameworks ( and tensorflow/keras) independently.
These are the steps I followed to get the results:
Train Model in Keras save as .h5 format.
Convert to .pb model to use in, as currently uses tensorflow graphs and not saved model format.
Now, for the same two images, the prediction values should be the same, since the model is the same and the only difference is the underlying framework.
I could find only bits and pieces about why this occurs. I am running both programs with the same underlying hardware and software configuration.
Is the underlying cause the difference in framework?
The image prediction scores for example images : : 0.9929248 Fail
keras/tensorflow : 0.99635077 Fail
The difference is minute, but being deterministic irrespective of platform is a feature that I am looking for.

How to save a trained Neural Network with TensorFlow2

My problem is that after creating and training a neural net with TensorFlow (version 2.1.0), I need to extract all of its basic parameters: the network architecture, the functions used, and the weight values found through training.
These parameters will then be read by a library that generates the VHDL code to deploy the neural network created in Python on an FPGA.
So I wanted to ask whether there are one or more methods to get all this information in a non-binary format. Of all these values, the most important are the weights found at the end of training.

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also minimal, though noticeable, if I decrease the parameter size by 100X, but I'd rather not reduce it by that much anyway. In addition, the predict function is very slow the first time it runs (~0.2s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want is to predict the samples at different times rather than all at once, since I need to update the tree search before making a new prediction. Should I perhaps work with TensorFlow directly instead?
The batch size controls parallelism when predicting, so it is expected that increasing the batch size will have better performance, as you can use more cores and use GPU more efficiently.
There is nothing to really work around here: a batch size of one is simply the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done on a GPU, to minimize the overhead due to data transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.
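The timings quoted in the question can be used for a rough back-of-the-envelope estimate of how much of a single-sample call is fixed overhead. The sketch below assumes, purely for illustration, that the time of one predict call is a fixed overhead plus a per-sample cost, and fits that line through the 1-sample and 10-sample measurements. Two noisy measurements are not a benchmark, so the result is only indicative.

```python
# Timings quoted in the question: (samples in one predict call, seconds)
single = (1, 0.00115528)
batch10 = (10, 0.00136132)

# Fit time = overhead + per_sample * n through the two measurements
per_sample = (batch10[1] - single[1]) / (batch10[0] - single[0])
overhead = single[1] - per_sample * single[0]

# Fraction of a single-sample call that is not spent on the sample itself
overhead_share = overhead / single[1]

print(f"per-sample cost: {per_sample:.2e} s, fixed overhead: {overhead:.2e} s")
print(f"overhead share of a single-sample call: {overhead_share:.0%}")
```

Under that assumption, almost all of the time of a single-sample call is per-call overhead rather than computation, which is consistent with 10 samples costing about the same as 1.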
One way that gave me a speed up was switching from model.predict(x) to,
making sure your x shape has 1 as the first dimension.
I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
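import tensorflow as tf
from custom_layer import CustomLayer
# Load the HDF5 model, then export it as a SavedModel into the "model" directory
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')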
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Alternative to Lambda layer in Keras

I am trying to convert the Keras OCR example into a CoreML model.
I can already train my slightly modified model, and everything looks good in Python. But now I want to convert the model to CoreML to use it in my iOS app.
The problem is that the CoreML file format doesn't support Lambda layers.
I am not an expert in this field, but as far as I understand, the Lambda layer here is used to calculate the loss using ctc_batch_cost().
The layer is created around line 464.
I guess this is used for greater precision than the built-in loss functions offer.
Is there any way the model creation can be rewritten to fit the layer set CoreML supports?
I have no idea which output layer type to use for the model.
Cost functions usually aren't included in the CoreML model, since CoreML only does inference while cost functions are used for training. So strip out that layer before you export the model and you should be good to go.

