How to run tflite on CPU only - python

I have a tflite model that runs on a Coral USB accelerator, but I want it to run on the CPU as well (as an alternative, to pass some tests when the Coral USB is not physically available).
I found this very similar question but the answers given are not useful.
My code looks like this:
class CoralObjectDetector(object):

    def __init__(self, model_path: str, label_path: str):
        """
        CoralObjectDetector, this object allows to pre-process images and perform object detection.

        :param model_path: path to the .tflite file with the model
        :param label_path: path to the file with labels
        """
        self.label_path = label_path
        self.model_path = model_path
        self.labels = dict()  # type: Dict[int, str]
        self.load_labels()
        self.interpreter = tflite.Interpreter(
            model_path,
            experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
        # more code and operations
The model and labels are downloaded from here.
I would like to load an alternative version of the same model that lets me run it without the Coral USB accelerator (i.e. on the CPU only). My goal is something as follows:
class CoralObjectDetector(object):

    def __init__(self, model_path: str, label_path: str, run_in_coral: bool):
        """
        CoralObjectDetector, this object allows to pre-process images and perform object detection.

        :param model_path: path to the .tflite file with the model
        :param label_path: path to the file with labels
        :param run_in_coral: whether or not to run it on coral (use CPU otherwise)
        """
        self.label_path = label_path
        self.model_path = model_path
        self.labels = dict()  # type: Dict[int, str]
        self.load_labels()
        if run_in_coral:
            self.interpreter = tflite.Interpreter(
                model_path,
                experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
        else:
            # I expect something like this
            self.interpreter = tflite.CPUInterpreter(model_path)
        # more code and operations
I'm not sure if I need just this or something else in the inference/prediction methods.

When you compile a Coral model, the compiler maps all the operations it can to a single TPU custom op (edgetpu-custom-op).
This means that this model will only work on the TPU. That being said, your TFLite interpreter can run CPU models too (all we did was add the experimental delegate to handle that edgetpu-custom-op). To run the CPU version, simply pass the CPU version of the model (before it was compiled).
For your object detection, if you use one of the models we provide in test_data, you'll see we provide the CPU and TPU version (for example for MNv1 SSD we have CPU and TPU versions). If you plugged these into any of our code, you'd see both work.
I'd simply check to see if a Coral TPU is attached when picking which model you use.
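For illustration, here is a minimal sketch of that idea (the helper name, the model paths and the delegate filename are my own assumptions, not taken from the answer above):

import tflite_runtime.interpreter as tflite

def make_interpreter(cpu_model_path: str, tpu_model_path: str, use_coral: bool):
    """Build a TFLite interpreter for either the Edge TPU or the CPU."""
    if use_coral:
        # Edge TPU-compiled model plus the edgetpu delegate
        return tflite.Interpreter(
            model_path=tpu_model_path,
            experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
    # The un-compiled CPU version of the same model, no delegate needed
    return tflite.Interpreter(model_path=cpu_model_path)

The rest of the pre-processing and inference code stays the same; only the model file (and the delegate) changes between the two paths.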

Related

Loading large pytorch objects on CPU

I generate some data on a gpu and save it using
joblib.dump(results, "./results.sav")
I use joblib rather than pickle, as the latter gives memory errors.
I subsequently need to read the results file on a machine without a GPU:
res = torch.load('./results.sav', map_location=torch.device('cpu'))
This however gives the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I was able to address the equivalent problem when using pickles with:
import io
import pickle

import torch

class CPU_Unpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        else:
            return super().find_class(module, name)

res = CPU_Unpickler(open('./results.pkl', 'rb')).load()
Does anyone have any advice on how to save large gpu generated pytorch objects and then load them on a cpu?
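One common pattern (a sketch of my own, not something from this thread) is to move everything to the CPU before dumping, so the resulting file can be loaded on any machine without custom unpickling:

import joblib
import torch

def to_cpu(obj):
    # Recursively move tensors in nested dicts/lists/tuples to the CPU
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(v) for v in obj)
    return obj

# On the GPU machine
joblib.dump(to_cpu(results), './results.sav')

# On the CPU-only machine
results = joblib.load('./results.sav')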

How to do inference in parallel with tensorflow saved model predictors?

Tensorflow version: 1.14
Our current setup uses TensorFlow estimators to do live NER, i.e. perform inference one document at a time. We have 30 different fields to extract, and we run one model per field, so we have a total of 30 models.
Our current setup uses python multiprocessing to do the inferences in parallel. (The inference is done on CPUs.) This approach reloads the model weights each time a prediction is made.
Using the approach mentioned here, we exported the estimator models as tf.saved_model. This works as expected in that it does not reload the weights for each request. It also works fine for single-field inference in one process, but doesn't work with multiprocessing: all the processes hang when we call the predict function (predict_fn in the linked post).
This post is related, but I'm not sure how to adapt it for a saved model.
Importing tensorflow individually for each of the predictors did not work either:
class SavedModelPredictor():
    def __init__(self, model_path):
        import tensorflow as tf
        self.predictor_fn = tf.contrib.predictor.from_saved_model(model_path)

    def predictor_fn(self, input_dict):
        return self.predictor_fn(input_dict)
How to make tf.saved_model work with multiprocessing?
Ray Serve, Ray's model serving solution, also supports offline batching. You can wrap your model in a Ray Serve backend and scale it to the number of replicas you want.
import ray
from ray import serve

client = serve.start()

class MyTFModel:
    def __init__(self, model_path):
        self.model = ...  # load model

    @serve.accept_batch
    def __call__(self, input_batch):
        assert isinstance(input_batch, list)

        # forward pass
        self.model([item.data for item in input_batch])

        # return a list of responses
        return [...]

client.create_backend("tf", MyTFModel,
    # configure resources
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
    # configure replicas
    config={
        "num_replicas": 2,
        "max_batch_size": 24,
        "batch_wait_timeout": 0.5
    }
)
client.create_endpoint("tf", backend="tf")
handle = serve.get_handle("tf")

# perform inference on a list of inputs
futures = [handle.remote(data) for data in fields]
result = ray.get(futures)
Try it out with the nightly wheel and here's the tutorial: https://docs.ray.io/en/master/serve/tutorials/batch.html
Edit: updated the code sample for Ray 1.0
OK, so the approach outlined in this answer with Ray worked.
I built a class like this, which loads the model once in the constructor and exposes a run function to perform predictions:
import tensorflow as tf
import ray

ray.init()

@ray.remote
class MyModel(object):

    def __init__(self, field, saved_model_path):
        self.field = field
        # load the model once in the constructor
        self.predictor_fn = tf.contrib.predictor.from_saved_model(saved_model_path)

    def run(self, df_feature, *args):
        # ...
        # code to perform prediction using self.predictor_fn
        # ...
        return self.field, list_pred_string, list_pred_proba
Then used the above in the main module as:
from pathlib import Path

# form a dictionary with key 'field' and value MyModel
model_dict = {}
for field in fields:
    export_dir = f"saved_model/{field}"
    subdirs = [x for x in Path(export_dir).iterdir()
               if x.is_dir() and 'temp' not in str(x)]
    latest = str(sorted(subdirs)[-1])
    model_dict[field] = MyModel.remote(field, latest)
Then used the above model dictionary to do predictions like this:
results = ray.get([model_dict[field].run.remote(df_feature) for field in fields])
Update:
While this approach works, I found that running estimators in parallel with multiprocessing is faster than running predictors in parallel with Ray. This is especially true for large documents. It looks like the predictor approach might work well for a small number of dimensions and when the input data is not large. Maybe an approach like the one mentioned here might be better for our use case.

Stable baselines saving PPO model and retraining it again

Hello, I am using the Stable Baselines package (https://stable-baselines.readthedocs.io/), specifically PPO2, and I am not sure how to properly save my model... I trained it for 6 virtual days and got my average return to around 300. Then I decided that this was not enough for me, so I trained the model for another 6 days. But when I looked at the training statistics, the return per episode in the second training started at around 30. This suggests that it did not save all parameters.
This is how I use the package:
def make_env_init(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def env_init():
        # Important: use a different seed for each environment
        env = gym.make(env_id, connection=blt.DIRECT)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return env_init

envs = VecNormalize(SubprocVecEnv([make_env_init(f'envs:{env_name}', i) for i in range(processes)]),
                    norm_reward=False)

if os.path.exists(folder / 'model_dump.zip'):
    model = PPO2.load(folder / 'model_dump.zip', envs, **ppo_kwards)
else:
    model = PPO2(MlpPolicy, envs, **ppo_kwards)

model.learn(total_timesteps=total_timesteps, callback=callback)
model.save(folder / 'model_dump.zip')
The way you saved the model is correct. Training is not a monotonic process: it can also show much worse results after further training.
What you can do, first of all, is to write logs of the progress:
model = PPO2(MlpPolicy, envs, tensorboard_log="./logs/progress_tensorboard/")
In order to see the log, run in terminal:
tensorboard --port 6004 --logdir ./logs/progress_tensorboard/
It will give you a link to the board, which you can then open in a browser (e.g. http://pc0259:6004/).
Secondly, you can make snapshots of the model every X steps:
from stable_baselines.common.callbacks import CheckpointCallback
checkpoint_callback = CheckpointCallback(save_freq=1e4, save_path='./model_checkpoints/')
model.learn(total_timesteps=total_timesteps, callback=[callback, checkpoint_callback])
Combining it with the log, you can pick up the model which performed best!
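As an illustration, resuming from one of those snapshots might look like the sketch below (the checkpoint filename is an assumption based on CheckpointCallback's default 'rl_model' prefix, not something shown above):

from stable_baselines import PPO2

# Hypothetical checkpoint written by CheckpointCallback after 100k steps
checkpoint_path = './model_checkpoints/rl_model_100000_steps.zip'

# Reload that snapshot with the same vectorized environment and keep training
model = PPO2.load(checkpoint_path, envs, **ppo_kwards)
model.learn(total_timesteps=total_timesteps, callback=[callback, checkpoint_callback])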

Can Lucid visualize MobileNet V3 Squeeze/Excite blocks

I've trained a number of graphs using the provided MobileNet V3 definition (small), but when I run (TensorFlow) Lucid to generate visualisations, Lucid fails with an error. If I modify the definition to exclude the Squeeze/Excite blocks, the visualisations are generated.
With Tensorflow 1.14 and Lucid installed, I downloaded the trained MobileNet V3 graph file "Small dm=0.75 (float)" from here (https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet), extracted the files into my "D:/temp", and ran the following code:
import tensorflow as tf
import lucid.optvis.render as render
from lucid.modelzoo.vision_base import Model

class SSDMobilenetV3( Model ):
    def __init__( self, graph_path ):
        self.model_path = graph_path
        self.input_name = "input"
        self.image_shape = [ 224, 224, 3 ]
        self.image_value_range = ( -1, 1 )
        super().__init__()

model = SSDMobilenetV3( "D:/temp/v3-small_224_0.75_float/v3-small_224_0.75_float.pb" )
model.load_graphdef()
#model.show_graph()

_ = render.render_vis( model, "MobilenetV3/expanded_conv_6/output:0" )
There's a fair bit of stacktrace, but the key errors are:
LookupError: gradient registry has no entry for: AddV2
and
LookupError: No gradient defined for operation 'import/MobilenetV3/expanded_conv_6/squeeze_excite/Conv_1/add' (op type: AddV2)
Then I tried using the V3_SMALL_MINIMALISTIC definition in "mobilenet_v3.py" (registering a new feature extractor) to train a test model. This is essentially the same model but without the "squeeze_excite" insertions (although I also reinstated the hard_swish activation function).
The above code ran fine on the new model, rendering an image.
This leads me to believe that the problem resides in the "squeeze_excite" implementation (in slim/nets/mobilenet/conv_blocks.py).
But I have not been able to diagnose the problem further: is it Lucid, is it the Squeeze/Excite block, is it TensorFlow, or is it just a fact about the world?
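One workaround that may be worth trying (my own assumption, not something verified in this post) is to register a gradient for AddV2 before calling render_vis, reusing TensorFlow's existing gradient for Add, since AddV2 is semantically the same element-wise addition:

import tensorflow as tf
from tensorflow.python.ops import math_grad  # note: private TF module

# AddV2 behaves like Add, so reuse Add's registered gradient function.
# If your TF version already registers an AddV2 gradient, this raises a
# KeyError and can simply be skipped.
tf.RegisterGradient( "AddV2" )( math_grad._AddGrad )

With that registration in place, Lucid's backward pass through the squeeze_excite add should at least find a gradient; whether the resulting visualisation is meaningful is a separate question.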

Description of TF Lite's Toco converter args for quantization aware training

These days I am trying to track down an error concerning the deployment of a TF model with TPU support.
I can get a model without TPU support running, but as soon as I enable quantization, I get lost.
I am in the following situation:
- Created a model and trained it
- Created an eval graph of the model
- Froze the model and saved the result as a protocol buffer
- Successfully converted and deployed it without TPU support
For the last point, I used the TFLiteConverter's Python API. The script that produces a functional tflite model is
import tensorflow as tf
graph_def_file = 'frozen_model.pb'
inputs = ['dense_input']
outputs = ['dense/BiasAdd']
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, inputs, outputs)
converter.inference_type = tf.lite.constants.FLOAT
input_arrays = converter.get_input_arrays()
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
This tells me that my approach seems to be ok up to this point. Now, if I want to utilize the Coral TPU stick, I have to quantize my model (I took that into account during training). All I have to do is to modify my converter script. I figured that I have to change it to
import tensorflow as tf
graph_def_file = 'frozen_model.pb'
inputs = ['dense_input']
outputs = ['dense/BiasAdd']
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, inputs, outputs)
converter.inference_type = tf.lite.constants.QUANTIZED_UINT8 ## Indicates TPU compatibility
input_arrays = converter.get_input_arrays()
converter.quantized_input_stats = {input_arrays[0]: (0., 1.)} ## mean, std_dev
converter.default_ranges_stats = (-128, 127) ## min, max values for quantization (?)
converter.allow_custom_ops = True ## not sure if this is needed
## REMOVED THE OPTIMIZATIONS ALTOGETHER TO MAKE IT WORK
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
This tflite model produces results when loaded with the interpreter's Python API, but I am not able to understand their meaning. Also, there is no documentation (or, if there is, it is well hidden) on how to choose mean, std_dev and the min/max ranges. Furthermore, after compiling this with the edgetpu_compiler and deploying it (loading it with the C++ API), I receive an error:
INFO: Initialized TensorFlow Lite runtime.
ERROR: Failed to prepare for TPU. generic::failed_precondition: Custom op already assigned to a different TPU.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.
Segmentation fault
I suppose I missed a flag or something during the conversion process. But as the documentation is also lacking here, I can't say for sure.
In short:
What do the parameters mean, std_dev and min/max do, and how do they interact?
What am I doing wrong during the conversion?
I am grateful for any help or guidance!
EDIT: I have opened a github issue with the full test code. Feel free to play around with this.
You should never need to manually set the quantization stats.
Have you tried the post-training-quantization tutorials?
https://www.tensorflow.org/lite/performance/post_training_integer_quant
Basically they set the quantization options:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
Then they pass a "representative dataset" to the converter, so that the converter can run the model a few batches to gather the necessary statistics:
def representative_data_gen():
    for input_value in mnist_ds.take(100):
        yield [input_value]

converter.representative_dataset = representative_data_gen
While there are options for quantization-aware training, it's always easier to do post-training quantization.
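Putting those pieces together, a minimal end-to-end sketch for the frozen graph above might look like this (the representative dataset and the output filename are placeholders of mine, following the linked tutorial rather than code from this thread):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    'frozen_model.pb', ['dense_input'], ['dense/BiasAdd'])

# Full integer quantization so the edgetpu_compiler can map ops to the TPU
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

def representative_data_gen():
    # A few hundred representative inputs let the converter calibrate the
    # quantization ranges instead of you guessing mean/std_dev by hand
    for sample in calibration_samples:  # placeholder iterable
        yield [sample]

converter.representative_dataset = representative_data_gen

tflite_model = converter.convert()
open('model_quant.tflite', 'wb').write(tflite_model)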
