Speeding-up inference of T5-like model

I am currently using a model called T0pp (https://huggingface.co/bigscience/T0pp) in production and would like to speed up inference.
I am running the following code on an on-demand EC2 g4dn.12xlarge instance (4 Nvidia T4 GPUs):
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
# Spread the model's layers across the available GPUs
model.parallelize()
input_dict = tokenizer(generation_input.inputs, return_tensors="pt", padding=True)
inputs = input_dict.input_ids.to("cuda:0")
attention_mask = input_dict.attention_mask.to("cuda:0")
# No gradients are needed for inference
with torch.no_grad():
    outputs = model.generate(inputs, attention_mask=attention_mask)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
I would like to know which alternatives you would try in order to speed up inference, and whether you know of good tutorials for them. The main alternatives I see are to use the underlying PyTorch model with:
ONNX
DeepSpeed
fp16 instead of fp32 parameters (with the main drawback of losing some quality)
Does anyone have experience with these tools and know which option is the best / simplest?
All this is quite new to me, and I must admit I've been a bit lost in the ONNX and DeepSpeed tutorials.
PS:
I already tried SageMaker, but it does not work for huge models like T0pp (40 GB).
Batching speeds things up: a batch of size 1 takes 1-2 seconds, while a batch of size 32 takes 16 seconds. In an ideal world, even batch size 32 would take under 1 or 2 seconds.
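For reference, the batching figures quoted above can be turned into per-input cost and throughput. This is only a sketch of the arithmetic on the numbers stated in the question; the helper names are made up for illustration and nothing here is a new measurement.

```python
# Figures quoted in the question (seconds per generate() call).
batch1_latency_s = (1.0, 2.0)   # range reported for batch size 1
batch32_latency_s = 16.0        # reported for batch size 32


def per_sample_seconds(batch_latency_s: float, batch_size: int) -> float:
    """Average time spent per input when a whole batch is processed at once."""
    return batch_latency_s / batch_size


def throughput_per_second(batch_latency_s: float, batch_size: int) -> float:
    """Inputs completed per second at the given batch size."""
    return batch_size / batch_latency_s


per_sample_b32 = per_sample_seconds(batch32_latency_s, 32)
# Throughput gain of batch size 32 over batch size 1, for both ends of the range.
speedup_range = tuple(t / per_sample_b32 for t in batch1_latency_s)

print(f"batch 32: {per_sample_b32:.2f} s per input, "
      f"{throughput_per_second(batch32_latency_s, 32):.1f} inputs/s")
print(f"throughput gain over batch size 1: "
      f"{speedup_range[0]:.0f}x to {speedup_range[1]:.0f}x")
```

So batching improves throughput, but the latency of a single request still grows with the batch, which is why it alone does not reach the 1-2 second target.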

Maybe you could try OpenVINO? It lets you convert your model into Intermediate Representation (IR) and then run it on the CPU with FP16 support. OpenVINO is optimized for Intel hardware, but it should work with any processor. I cannot guarantee your model will be faster on a CPU than on an Nvidia GPU, but it's worth giving it a try. Some NLP models are fast enough (like this BERT).
You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. This sample code assumes the model is for computer vision.
import torch
# Dummy input in NCHW layout, used to trace the model during export
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command line tool that comes from OpenVINO Development Package so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is a default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). The accuracy drop, in most cases, is insignificant. Run in command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
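The --mean_values and --scale_values options in the command above bake the input normalization into the converted model, so raw 0-255 pixel values can be fed at inference time. As a rough sketch of the per-channel arithmetic (plain Python, not OpenVINO code; the function name is hypothetical), the mean is subtracted first and the result is divided by the scale:

```python
from typing import List, Sequence

MEAN = [123.675, 116.28, 103.53]   # values passed to --mean_values
SCALE = [58.395, 57.12, 57.375]    # values passed to --scale_values


def normalize_pixel(rgb: Sequence[float]) -> List[float]:
    """Apply (value - mean) / scale to each channel of one pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, SCALE)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([123.675, 116.28, 103.53]))
# A fully white pixel ends up a little above 2 in every channel.
print(normalize_pixel([255.0, 255.0, 255.0]))
```

If your own preprocessing already normalizes the input before calling the model, drop these two options so the normalization is not applied twice.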
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
# Compile it for the target device
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
It's worth mentioning that the runtime can process the ONNX model directly. In that case, just skip the conversion (Model Optimizer) step and pass the ONNX path to the read_model function.
Disclaimer: I work on OpenVINO.

Related

Tensorflow lite model inference is very slow compared to keras h5 model (VGG16 pretrained)

Tensorflow lite predictions are extremely slow compared to keras (h5) model. The behavior is similar between Colab and also on Windows 10 system. I converted the standard VGG16 model to tflite both with and without optimization (converter.optimizations = [tf.lite.Optimize.DEFAULT])
Here are the results I got:
Keras model (540MB) prediction time: 0.14 seconds
tflite without optimization (540MB) prediction time: 0.5 seconds
tflite with optimization (135MB) prediction time: 39 seconds
Am I missing something here? Isn't tflite supposed to be optimized for speed? Would the behavior be different on Raspberry Pi or other 'lighter' devices?
Link to the code on colab

TensorFlow Lite isn't optimized for desktop/server, so it's not surprising that it performs badly for most models in those environments. TFLite's optimized kernels (including many of their GEMM operations) are especially geared towards mobile CPUs (which don't have the same instruction set as desktop CPUs, IIUC).
Standard TensorFlow is better for your use-case.

I agree with Sachin. TFLite's intended use is mobile devices.
However, if you need faster inference on a desktop or server you can try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
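To get a feel for what storing values in FP16 means for accuracy, here is a small sketch that round-trips a few numbers through IEEE 754 half precision using only the standard library. It is an illustration of the number format, not of what Model Optimizer does internally, and the sample values are arbitrary.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (1.0, 0.1, 3.14159265, 1234.567):
    stored = to_fp16_and_back(value)
    rel_err = abs(stored - value) / abs(value)
    print(f"{value!r:>12} -> {stored!r:<20} relative error {rel_err:.2e}")
```

Half precision keeps 10 mantissa bits, so each stored value is off by a relative error of at most about 0.05%, which is usually small compared with the noise a trained network already tolerates.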
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). I suggest using an AUTO device. It will choose the best hardware for you. Moreover, if you care about latency, provide that performance hint (as shown below). If you depend on throughput, use THROUGHPUT or CUMULATIVE_THROUGHPUT instead.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
# Let AUTO pick the device and optimize for latency
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Why a quantized TensorFlow Lite model performs poorly on latency?

I am currently testing the latency of the inference of a U-Net network transformed with TensorFlow Lite. I am testing three NN with the same architecture on a segmentation problem (I'm testing them on my laptop with Windows OS):
First model: TensorFlow model (without optimization and created with the Keras interface).
Second model: TensorFlow model optimized with TFLite (transformed with the Python TFLite api and without quantization). It is actually the first model transformed.
Third model: TensorFlow model optimized with TFLite and quantized (transformed with the Python TFLite api and quantized with tensorflow.lite.Optimize.DEFAULT). It is actually the first model transformed.
Indeed, the second model (optimized with TFLite) is three times faster than the first model (normal TF model). However, the third model (TFLite & quantization) has the worst latency. It is even slower than the first model (normal TF model).
Why is the quantized model the slowest?

It depends on which kernels your model is running.
Generally, TFLite is more optimized for running on mobile devices. So it might be that in your case (quantized + desktop) it is falling back to a reference implementation for some op(s).
One way to check further is to run the benchmark tool with --enable_op_profiling=true.
It will run your model with dummy data, profile the ops, and then show you a summary of the time spent in each op.
If you see something off, you can file a GitHub issue with details and steps to reproduce it, and the team can debug the performance issue.

convert .pb model into quantized tflite model

Totally new to Tensorflow,
I have created one object detection model (.pb and .pbtxt) using 'faster_rcnn_inception_v2_coco_2018_01_28' model I found from TensorFlow zoo. It works fine on windows but I want to use this model on google coral edge TPU. How can I convert my frozen model into edgetpu.tflite quantized model?

There are 2 more steps to this pipeline:
1) Convert the .pb -> tflite:
I won't go through the details, since there is documentation for this on the official TensorFlow page and it changes very often, but I'll still try to answer your question specifically. There are 2 ways of doing this:
Quantization Aware Training: this happens during training of the model. I don't think this applies to you, since your question seems to indicate that you were not aware of this process. But please correct me if I'm wrong.
Post Training Quantization: basically, loading your model, in which all tensors are of type float, and converting it to a tflite form with int8 tensors. Again, I won't go into too much detail, but I'll give you 2 actual examples of doing so :)
a) with code
b) with tflite_convert tools
2) Compile the model from tflite -> edgetpu.tflite:
Once you have produced a fully quantized tflite model, congrats: your model is now much more efficient on ARM platforms and much smaller. However, it will still run on the CPU unless you compile it for the Edge TPU. You can review this doc for installation and usage. But compiling it is as easy as:
$ edgetpu_compiler -s your_quantized_model.tflite
Hope this helps!
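To make the post-training quantization step in 1) a bit more concrete, here is a minimal sketch of the affine mapping between float values and int8, real_value = (int8_value - zero_point) * scale. It is plain Python on a toy list of weights; the helper names are hypothetical, and the real converter additionally needs calibration data and works per tensor or per channel.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = -128, 127  # int8 range


def choose_qparams(values: Sequence[float]) -> Tuple[float, int]:
    """Pick scale and zero point so [min, max] (extended to contain 0) maps onto int8."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) if hi > lo else 1.0
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point


def quantize(values: Sequence[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to int8, clamping anything that falls outside the range."""
    return [max(QMIN, min(QMAX, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats: real = (q - zero_point) * scale."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print("int8 values:", q)
print("restored   :", restored)
```

Each restored value differs from the original by at most about half a quantization step (scale / 2), which is the accuracy you trade for the 4x smaller tensors.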

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also minimal, though noticeable, if I decrease the parameter size by 100x, but I'd rather not change the parameter size by that much anyway. In addition, the predict function is very slow the first time it runs (~0.2 s), though I don't think that's the problem here, since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want is to be able to predict the samples at different times rather than all at once, since I need to update the tree search before making a new prediction. Should I perhaps work with TensorFlow directly instead?
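For anyone reproducing timings like the ones above, a small helper along these lines keeps the slow first call out of the measurement. This is only a sketch: the name time_calls and the stand-in workload are assumptions, and the lambda would be replaced with the actual prediction call.

```python
import time
from statistics import median
from typing import Callable, List


def time_calls(fn: Callable[[], object], warmup: int = 3, repeats: int = 20) -> float:
    """Median wall-clock seconds per call, after discarding warm-up calls."""
    for _ in range(warmup):
        fn()  # not timed: absorbs one-off setup cost such as graph building
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Stand-in workload; replace the lambda with e.g. lambda: model.predict(x).
elapsed = time_calls(lambda: sum(i * i for i in range(10_000)))
print(f"median: {elapsed * 1e3:.3f} ms per call")
```

The median is used instead of the mean so that an occasional slow call (garbage collection, other processes) does not distort the result.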

The batch size controls parallelism when predicting, so it is expected that increasing the batch size gives better performance, as you can use more cores and use the GPU more efficiently.
There is nothing to really work around here: a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done on a GPU, to minimize the overhead due to transfers.
Don't forget that model.predict does a full forward pass of the network, so its speed depends entirely on the network architecture.
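If some evaluations can be deferred (an assumption that may not hold for a strictly sequential tree search), the fixed per-call cost can be shared by collecting pending samples and predicting them in one call. The class below is a hypothetical sketch in plain Python; predict_batch stands in for something like model.predict_on_batch, and fake_predict exists only to make the example runnable.

```python
from typing import Callable, List, Sequence


class BatchedEvaluator:
    """Collects pending samples and evaluates them with a single batched call."""

    def __init__(self, predict_batch: Callable[[Sequence[float]], Sequence[float]]):
        self._predict_batch = predict_batch
        self._pending: List[float] = []

    def submit(self, sample: float) -> int:
        """Queue a sample and return its position in the next flushed batch."""
        self._pending.append(sample)
        return len(self._pending) - 1

    def flush(self) -> List[float]:
        """Run one batched prediction over everything queued so far."""
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        return list(self._predict_batch(batch))


call_sizes: List[int] = []  # records the batch size of every model call


def fake_predict(batch: Sequence[float]) -> List[float]:
    """Stand-in for the real model: doubles each input."""
    call_sizes.append(len(batch))
    return [2 * x for x in batch]


evaluator = BatchedEvaluator(fake_predict)
tickets = [evaluator.submit(x) for x in (1.0, 2.0, 3.0)]
results = evaluator.flush()
print(results, "model calls:", len(call_sizes))
```

With the timings in the question, one call for 10 samples costs about the same as one call for a single sample, so every sample that can wait for the next flush is close to free.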

One way that gave me a speed up was switching from model.predict(x) to,
model.predict_on_batch(x)
making sure your x shape has 1 as the first dimension.

I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
# Let AUTO pick the device and optimize for latency
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Keras VGG16 predict speed slow

I'm working on a feature extractor for a personal transfer learning project, and the predict function of Keras's VGG16 model seems pretty slow (31 seconds for a batch of 4 images). I do expect it to be slow, but I'm not sure whether the prediction function is slower than it should be.
data = DataGenerator()
data = data.from_csv(csv_path=csv_file,
                     img_dir=img_folder,
                     batch_size=batch)
#####################################################
conv_base = VGG16(include_top=False,
                  weights='imagenet',
                  input_shape=(480, 640, 3))
model = Sequential()
model.add(conv_base)
model.add(MaxPooling2D(pool_size=(3, 4)))
model.add(Flatten())
######################################################
for inputs, y in data:
    feature_batch = model.predict(inputs)
    yield feature_batch, y
So, my hunch is that it is slow for these reasons:
my input data is a bit large (loading in (480, 640, 3) size images)
I am running on a weak CPU (M3-6Y30 @ 0.90GHz)
I have a flatten operation at the end of the feature extractor.
Things I've tried:
Other Stack Overflow posts suggested adding a max pooling layer to reduce the feature size / remove the extraneous zeros. I made what I think is a pretty large max pool window (thus reducing the feature size significantly), but my prediction time increased.
Batch processing doesn't improve the time (which is probably obvious given my M3 CPU). A batch of 1 image takes 8 seconds; a batch of 4 takes 32.
Are there any ideas on how to speed up the prediction function? I need to run this through at least 10,000 images, and due to the nature of the project I would like to retain as much of the raw data as possible before going into the model (will be comparing it with other feature extraction models)
All my image files are saved locally, but I can try to setup a cloud computer and move my code over there to run with GPU support.
Is the issue simply that I am running the VGG16 model on a dinky CPU?
Guidance would be much appreciated.
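For reference, the size of the feature vector produced by the model in the question can be worked out by hand. The sketch below assumes the standard VGG16 layout (five 2x2 max-pool layers with stride 2 and 512 channels in the last block) and Keras's default MaxPooling2D behavior (stride equal to the pool size, no padding); the function names are made up for illustration.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def vgg16_feature_shape(height: int, width: int) -> Shape:
    """Output shape of VGG16 with include_top=False for a given input size."""
    for _ in range(5):          # five stride-2 max-pool layers halve each side
        height //= 2
        width //= 2
    return height, width, 512   # the last conv block has 512 channels


def max_pool_shape(shape: Shape, pool: Tuple[int, int]) -> Shape:
    """Shape after a max pool whose stride equals its window, without padding."""
    h, w, c = shape
    return h // pool[0], w // pool[1], c


conv_out = vgg16_feature_shape(480, 640)
pooled = max_pool_shape(conv_out, (3, 4))
flat_before = conv_out[0] * conv_out[1] * conv_out[2]
flat_after = pooled[0] * pooled[1] * pooled[2]

print("VGG16 output:", conv_out, "->", flat_before, "features if flattened")
print("after (3, 4) pool:", pooled, "->", flat_after, "features")
```

The extra pooling layer shrinks the output by a factor of 12, but it only runs after all the convolutions have been computed, so it cannot reduce the time spent inside VGG16 itself.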

There are several issues with your setup. The main one is, of course, the really slow machine, but since you cannot change that here, I will give some advice on how you could speed up your computations:
VGG16 is a relatively old architecture. The main issue is that the so-called volume of its tensors (area of the feature maps times number of features) decreases really slowly. I would advise you to use a more modern architecture, e.g. ResNet50 or Inception v3, as they have a so-called stem that makes the internal tensors much smaller very quickly. Your speed should benefit from that. There is also a really light architecture called MobileNet, which seems perfect for your task.
Downsample your images - with a size of (480, 640), your image has about 6 times as many pixels as the default VGG input of (224, 224). This makes all computations roughly 6 times slower. You could try to first downsample the images and then use the feature extractor.
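The factor of 6 comes straight from the pixel counts, and it can be combined with the timing from the question (8 seconds per image, 10,000 images) for a rough estimate. This sketch assumes that the convolution time scales linearly with the number of input pixels, which is an approximation rather than a measurement.

```python
# Input sizes: the one used in the question and the default VGG16 input.
question_pixels = 480 * 640
default_pixels = 224 * 224
ratio = question_pixels / default_pixels

# Timing reported in the question for a single image on the M3 CPU.
seconds_per_image = 8.0
n_images = 10_000

hours_now = n_images * seconds_per_image / 3600
# Assumption: time scales linearly with the number of input pixels.
hours_downsampled = hours_now / ratio

print(f"pixel ratio: {ratio:.2f}x")
print(f"estimated total at (480, 640): {hours_now:.1f} h")
print(f"estimated total at (224, 224): {hours_downsampled:.1f} h")
```

Even under this optimistic assumption the full run takes a few hours on that CPU, so downsampling helps most when combined with a lighter architecture.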

VGG16 is a very big model. The same accuracy could be reached with modern smaller models such as MobileNetV3 or EfficientNet.
However, if you have to use your model you could try OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs. Your processor (M3-6Y30) is 6th generation so it should be supported.
It's rather straightforward to convert the Keras model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 480, 640, 3]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
# Compile it for the target device
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
