easyOCR allocates GPU even when gpu=False - python

I need to use multiple (my own) TF CNN models for a rather complex task. I also need to use easyOCR (PyTorch) every now and then. However, easyOCR usage is pretty rare and its task is very small compared to the TF models' inference. Therefore I pass gpu=False to the easyocr.Reader constructor. Nevertheless, as soon as easyOCR predicts anything, GPU memory is allocated for PyTorch (this is a known bug, I already checked easyOCR's GitHub issues) and any TF model throws an error:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
If I predict with any TF model first, the easyOCR model throws:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
I have found a workaround, but it seems rather dangerous to put something like this into production.
Is there a safer way of achieving this?
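One common workaround pattern (not from the original post; a minimal sketch assuming the occasional extra latency of spawning a worker process is acceptable) is to run easyOCR in a separate process that hides the GPU from PyTorch entirely, so TensorFlow keeps the device to itself:
import multiprocessing as mp

def _ocr_worker(image_path, queue):
    # Hide the GPU from this process *before* torch/easyocr are imported,
    # so PyTorch cannot allocate any CUDA memory at all.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    import easyocr
    reader = easyocr.Reader(["en"], gpu=False)
    queue.put(reader.readtext(image_path))

def run_ocr(image_path):
    ctx = mp.get_context("spawn")  # fresh interpreter, no inherited CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=_ocr_worker, args=(image_path, queue))
    proc.start()
    result = queue.get()
    proc.join()
    return result
This keeps the PyTorch allocation out of the main process at the cost of reloading the easyOCR weights per call; a long-lived worker process fed through a queue avoids that overhead.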

Related

Tf.mirrored strategy enabling eager execution of code

I am using TensorFlow 2.6 and my code requires disabling eager execution at startup, because I use symbolic Keras tensors in a partial loss in my model:
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
At the same time I also want to train on multiple GPUs, so I used MirroredStrategy, but the issue is that MirroredStrategy requires eager execution, which conflicts with disabling it as above. Please help me if there is another way of training on multiple GPUs.
I have tried calling my code with the line below, but it got stuck after warning about significant overhead:
tf.config.run_functions_eagerly(True)
but I believe this is wrong since, as I mentioned, I need eager execution to stay disabled.
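For reference (not part of the original question), the canonical multi-GPU pattern in TF2 is to build and compile the model inside strategy.scope(); the sketch below shows that pattern with a placeholder model and makes no claim that it coexists with disable_eager_execution() on every TF version:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each batch across the replicas.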

Tensorflow model quantization best strategy

I'm perplexed by the TensorFlow post-training quantization process. The official site refers to TensorFlow Lite quantization. Unfortunately, this doesn't work in my case: TFLiteConverter returns errors for my Mask R-CNN model:
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime and are not recognized by TensorFlow. If you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: <...>. Here is a list of operators for which you will need custom implementations: DecodeJpeg, StatelessWhile.
Basically, I've tried all the available options offered by TFLiteConverter, including the experimental ones. I'm not too surprised by those errors, as it might make sense not to support DecodeJpeg on mobile; however, I want my model to be served by TensorFlow Serving, so I don't know why TensorFlow Lite is the official way to go.
I've also tried the Graph Transform Tool, which seems to be deprecated, and discovered two issues. Firstly, it's impossible to quantize with bfloat16 or float16, only int8. Secondly, the quantized model breaks with the error:
Broadcast between [1,20,1,20,1,256] and [1,1,2,1,2,1] is not supported yet
which isn't an issue in the regular model.
Furthermore, it's worth mentioning that my model was originally built with TensorFlow 1.x and then ported to TensorFlow 2.1 via tensorflow.compat.v1.
This issue has cost me a significant amount of time. I'd be grateful for any hint.
You can convert the model to TensorFlow Lite and still run the unsupported ops (like DecodeJpeg) with TensorFlow kernels; this is called Select TF ops (SELECT_TF_OPS). See the guide here on how to enable it during conversion.
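As an illustration (a minimal sketch, not taken from the original answer; the saved-model path is a placeholder), enabling Select TF ops during conversion looks roughly like this:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # ops with native TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF kernels for the rest
]
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)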

Tensorflow Hub: Fine-tune and evaluate

Let's say that I want to fine-tune one of the TensorFlow Hub image feature vector modules. The problem arises because in order to fine-tune a module, the following needs to be done:
module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3", trainable=True, tags={"train"})
Assuming that the module is Resnet50.
In other words, the module is imported with the trainable flag set as True and with the train tag. Now, in case I want to validate the model (perform inference on the validation set in order to measure the performance of the model), I can't switch off the batch-norm because of the train tag and the trainable flag.
Please note that this question has already been asked here: Tensorflow hub fine-tune and evaluate, but no answer has been provided.
I also raised a GitHub issue about it.
Looking forward to your help!
With hub.Module for TF1, the situation is as you say: either the training or the inference graph is instantiated, and there is no good way to import both and share variables between them in a single tf.Session. That's informed by the approach used by Estimators and many other training scripts in TF1 (esp. distributed ones): there's a training Session that produces checkpoints, and a separate evaluation Session that restores model weights from them. (The two will likely also differ in the dataset they read and the preprocessing they perform.)
With TF2 and its emphasis on Eager mode, this has changed. TF2-style Hub modules (as found at https://tfhub.dev/s?q=tf2-preview) are really just TF2-style SavedModels, and these don't come with multiple graph versions. Instead, the __call__ function on the restored top-level object takes an optional training=... parameter if the train/inference distinction is required.
With this, TF2 should match your expectations. See the interactive demo tf2_image_retraining.ipynb and the underlying code in tensorflow_hub/keras_layer.py for how it can be done. The TF Hub team is working on making a more complete selection of modules available for the TF2 release.
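To make the TF2 pattern concrete (a minimal sketch, not from the original answer; the /4 module version and the number of classes are placeholders), the training flag is forwarded automatically by Keras:
import tensorflow as tf
import tensorflow_hub as hub

num_classes = 10  # placeholder

# hub.KerasLayer restores the TF2 SavedModel and forwards training=True/False
# to its __call__, so batch norm switches modes between fit() and evaluate()
# without needing separate graphs.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4",
    trainable=True)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    feature_extractor,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")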

How to elegantly debug nan/inf on gradients in a Tensorflow model with RNN?

Occasionally we may encounter NaN/Inf in the gradients during backprop on seq2seq TensorFlow models. How can we easily find the cause of such an issue, e.g. by locating the op and the time step at which the NaN/Inf is produced?
Since the error occurs during backpropagation, we cannot simply observe the gradient values with tf.Print(). Also, in an RNN model, tf.add_check_numerics_ops() doesn't work, and we cannot use tf.check_numerics() unless we dig into the messy TF libraries or reimplement the control flow manually. And tfdbg, as a general solution, is hard to use and extremely slow on large models.
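One partial technique (a sketch, not from the original post; it locates the variable whose gradient went bad, not the individual time step, and it assumes a TF1-style loss and optimizer already exist in your graph) is to wrap each gradient in tf.check_numerics before applying it:
import tensorflow as tf

# loss and optimizer are assumed to come from the existing seq2seq model.
grads_and_vars = optimizer.compute_gradients(loss)
checked_grads_and_vars = [
    (tf.check_numerics(grad, "NaN/Inf in gradient of " + var.name), var)
    for grad, var in grads_and_vars if grad is not None
]
train_op = optimizer.apply_gradients(checked_grads_and_vars)
# If any gradient contains NaN/Inf, the session run fails with the offending
# variable's name in the error message instead of silently propagating.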

Can Caffe or Caffe2 be given input data directly from gpu?

I've read the Caffe2 tutorials and tried the pre-trained models. I know Caffe2 will leverage the GPU to run the model/net, but the input data always seems to be given from CPU (i.e. host) memory. For example, in Loading Pre-Trained Models, after the model is loaded, we can predict an image with
result = p.run([img])
However, the image "img" is read in CPU scope. What I'm looking for is a framework that can pipeline images (decoded from a video and still resident in GPU memory) directly into the prediction model, instead of copying them from GPU to CPU scope and then transferring them to the GPU again to predict the result. Does Caffe or Caffe2 provide such functions or interfaces for Python or C++? Or do I need to patch Caffe to do so? Thanks.
Here is my solution:
I found that in tensor.h, the function ShareExternalPointer() does exactly what I want.
Feed GPU data this way:
pInputTensor->ShareExternalPointer(pGpuInput, InputSize);
then run the predict net with
pPredictNet->Run();
where pInputTensor is the input tensor for the predict net pPredictNet.
I don't think you can do it in Caffe with the Python interface.
But I think it can be accomplished using C++: in C++ you have access to the Blob's mutable_gpu_data(). You may write code that runs on the device and fills the input Blob's mutable_gpu_data() directly from the GPU. Once you have made this update, Caffe should be able to continue its net->forward() from there.
UPDATE
On Sep 19th, 2017 PR #5904 was merged into master. This PR exposes GPU pointers of blobs via the python interface.
You may access blob._gpu_data_ptr and blob._gpu_diff_ptr directly from python at your own risk.
As you've noted, using a Python layer forces data in and out of the GPU, and this can cause a huge hit to performance. This is true not just for Caffe, but for other frameworks too. To elaborate on Shai's answer, you could look at this step-by-step tutorial on adding C++ layers to Caffe. The example given should touch on most issues dealing with layer implementation. Disclosure: I am the author.
