I'm doing neural style transfer using TensorFlow(with the 19.01 Nvidia TensorFlow Docker image) in Python. I have an Nvidia 2070 graphics card, and I've been getting Out of Memory errors when I try to run a newer version of the TensorFlow docker image(19.08 for example). So I decided that perhaps it is time to consider using 16 bit precision instead of 32 bit for storing the parameters of the VGG19 CNN.
My initial research when I built my machine had led me to believe that switching from 32 to 16 was a cakewalk, but that hasn't been my experience now that I'm actively trying to make the transition.
This is what I have done:
I set tf.keras.backend.set_floatx('float16')
Set tf.keras.backend.set_epsilon(1e-4)
Change my image input to the VGG19 network to a float16, and any other miscellaneous parts of my code that use the float32 datatype in conjunction with the float16.
When I run the code, nvidia-smi still reports that essentially 100% of my GPU is being used. Has anyone had any success with reducing their model memory footprint by switching to float16 in TensorFlow?
TensorFlow has various ways of managing mixed precision. The most suitable mixed precision approach depends on which optimizer you plan to use. Keras optimizers, for example, have an API designed to easily port code one way or the other. It is called mixed_precision.
If you are using a TFv1 optimizer, or one of the other non keras optimizers offered by TensorFlow, you can use their graph rewrite function to convert various pieces of the graph to float 16.
Related
I am using a Nvidia RTX GPU with tensor cores, I want to make sure pytorch/tensorflow is utilizing its tensor cores. I noticed in few articles that the tensor cores are used to process float16 and by default pytorch/tensorflow uses float32. They have introduced some lib that does "mixed precision and distributed training". It is a somewhat old answer. I want to know if pytorch or tensorflow GPU now supports tensor core processing out of the box.
Mixed Precision is available in both libraries.
For pytorch it is torch.cuda.amp, AUTOMATIC MIXED PRECISION PACKAGE.
https://pytorch.org/docs/stable/amp.html
https://pytorch.org/docs/stable/notes/amp_examples.html.
Tensorflow has it here, https://www.tensorflow.org/guide/mixed_precision.
This page is a guide to use apex.amp (Automatic Mixed Precision), a tool to enable Tensor Core-accelerated training in only 3 lines of Python.
You can also check quick start for apex API here
Tensorflow lite predictions are extremely slow compared to keras (h5) model. The behavior is similar between Colab and also on Windows 10 system. I converted the standard VGG16 model to tflite both with and without optimization (converter.optimizations = [tf.lite.Optimize.DEFAULT])
Here are the results I got:
Keras model (540MB) prediction time: 0.14 seconds
tflite without optimization (540MB) prediction time: 0.5 seconds
tflite with optimization (135MB) prediction time: 39 seconds
Am I missing something here? Isn't tflite supposed to be optimized for speed? Would the behavior be different on Raspberry Pi or other 'lighter' devices?
Link to the code on colab
TensorFlow Lite isn't optimized for desktop/server, so its not surprising that it performs badly for most models in those environments. TFLite's optimized kernels (including lot of their GEMM operations) are especially geared for mobile CPUs (which don't have the same instruction set as desktop CPUs IIUC).
Standard TensorFlow is better for your use-case.
I agree with Sachin. TFLite's intended use is mobile devices.
However, if you need faster inference on a desktop or server you can try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). I suggest using an AUTO device. It will choose the best hardware for you. Moreover, if you care about latency, provide that performance hint (as shown below). If you depend on throughput, use THROUGHPUT or CUMULATIVE_THROUGHPUT instead.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
as you can see I can predict 10 samples at about the same speed as 1 sample. The change is also very minimal though noticeable if I decrease parameter size by 100X but I'd rather not change parameter size by that much anyway. In addition, the predict function is very slow the first time run through (~0.2s) though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround because clearly the 10 samples can be evaluated very quickly, all I want to be able to do is predict the samples at different times and not all at once since I need to update the Tree Search before making a new prediction. Perhaps should I work with tensorflow instead?
The batch size controls parallelism when predicting, so it is expected that increasing the batch size will have better performance, as you can use more cores and use GPU more efficiently.
You cannot really workaround, there is nothing really to work around, using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done in a GPU, to minimize overhead due to transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.
One way that gave me a speed up was switching from model.predict(x) to,
model.predict_on_batch(x)
making sure your x shape has 1 as the first dimension.
I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I coded both Python and C++ version of Caffe forward classification scripts to test Caffe's inference performance. The model is trained already. And the results are quite similar, GPU utils is not full enough.
My settings:
1. Card: Titan XP, 12GB
2. Model: InceptionV3
3. Img size: 3*299*299
When batch_size set to 40, GRAM usage can reach 10GB, but the GPU utils can just reach 77%~79%, both for Python and C++. So the performance is about 258 frames/s.
In my scripts, I loaded the image, preprocess it, load it into the input layer, and then repeat the net_.forward() operation. According to my understanding, this won't cause any Mem copy ops, so ideally should maximally pull up the GPU utils. But I can only reach no more than 80%.
In the C++ Classification Tutorial, I found below phrase:
Use multiple classification threads to ensure the GPU is always fully utilized and not waiting for an I/O blocked CPU thread.
So I tried to use the multi-thread compiled OpenBLAS, and under CPU backend, actually more CPU is involved to do the forwarding, but no use for the GPU backend. Under the GPU backend, the CPU utils will be fixed to about 100%.
Then I even tried to reduce the batch_size to 20, and start two classification processes in two terminals. The result is, GRAM usage increases to 11GB, but the GPU utils decrease to 64%~66%. Finally, the performance decreases to around 200 frames/s.
Has anyone encountered this problem? I'm really confused.
Any opinion is welcome.
Thanks,
As I had observed, the GPU util is decreased with,
1) low PCI express mode resnet-152(x16)-90% > resnet-152(x8)-80% > resnet-152(x4)-70%
2) large model - VGG-100%(x16) ; ResNet-50(x16)-95~100% ; ResNet-152(x16) - 90%
In addition, if I turn off cuDNN, the GPU Util is always 100%.
So I think there is some problem related with cuDNN, but I don't know more about the problem.
NVCaffe is somewhat better, and MXNet can utilize GPU 100% (resnet-152; x4).
I've read caffe2 tutorials and tried pre-trained models. I knew caffe2 will leverge GPU to run the model/net. But the input data seems always be given from CPU(ie. Host) memory. For example, in Loading Pre-Trained Models, after model is loaded, we can predict an image by
result = p.run([img])
However, image "img" should be read in CPU scope. What I look for is a framework that can pipline the images (which is decoded from a video and still resides in GPU memory) directly to the prediction model, instead of copying it from GPU to CPU scope, and then transfering to GPU again to predict result. Is Caffe or Caffe2 provides such functions or interfaces for python or C++? Or should I need to patch Caffe to do so? Thanks at all.
Here is my solution:
I'd found in tensor.h, function ShareExternalPointer() can exactly do what I want.
Feed gpu data this way,
pInputTensor->ShareExternalPointer(pGpuInput, InputSize);
then run the predict net through
pPredictNet->Run();
where pInputTensor is the entrance tensor for the predict net pPredictNet
I don't think you can do it in caffe with python interface.
But I think that it can be accomplished using the c++: In c++ you have access to the Blob's mutable_gpu_data(). You may write code that run on device and "fill" the input Blob's mutable_gpu_data() directly from gpu. Once you made this update, caffe should be able to continue its net->forward() from there.
UPDATE
On Sep 19th, 2017 PR #5904 was merged into master. This PR exposes GPU pointers of blobs via the python interface.
You may access blob._gpu_data_ptr and blob._gpu_diff_ptr directly from python at your own risk.
As you've noted, using a Python layer forces data in and out of the GPU, and this can cause a huge hit to performance. This is true not just for Caffe, but for other frameworks too. To elaborate on Shai's answer, you could look at this step-by-step tutorial on adding C++ layers to Caffe. The example given should touch on most issues dealing with layer implementation. Disclosure: I am the author.