I've read the Caffe2 tutorials and tried the pre-trained models. I know Caffe2 will leverage the GPU to run the model/net, but the input data always seems to be given from CPU (i.e. host) memory. For example, in Loading Pre-Trained Models, after the model is loaded, we can predict an image with
result = p.run([img])
However, the image "img" has to be read into CPU memory. What I am looking for is a framework that can pipeline images (decoded from a video and still residing in GPU memory) directly into the prediction model, instead of copying them from GPU to CPU memory and then transferring them back to the GPU to predict the result. Does Caffe or Caffe2 provide such functions or interfaces for Python or C++? Or do I need to patch Caffe to do so? Thanks.
Here is my solution:
I found that, in tensor.h, the function ShareExternalPointer() can do exactly what I want.
Feed GPU data this way:
pInputTensor->ShareExternalPointer(pGpuInput, InputSize);
then run the predict net through
pPredictNet->Run();
where pInputTensor is the input tensor of the predict net pPredictNet.
I don't think you can do it in Caffe with the Python interface.
But I think it can be accomplished using C++: in C++ you have access to the Blob's mutable_gpu_data(). You may write code that runs on the device and fills the input Blob's mutable_gpu_data() directly from the GPU. Once you have made this update, Caffe should be able to continue its net->forward() from there.
UPDATE
On September 19th, 2017, PR #5904 was merged into master. This PR exposes the GPU pointers of blobs via the Python interface.
You may access blob._gpu_data_ptr and blob._gpu_diff_ptr directly from Python at your own risk.
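For example, that raw pointer can in principle be wrapped by a GPU array library such as CuPy without a host copy. This is only a hedged sketch, not part of the PR; it assumes CuPy is installed, the prototxt/caffemodel file names are placeholders, and the blob holds float32 data:
import numpy as np
import cupy as cp
import caffe

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

blob = net.blobs['data']            # Caffe blobs are float32
shape = tuple(blob.shape)
nbytes = int(np.prod(shape)) * 4
mem = cp.cuda.UnownedMemory(blob._gpu_data_ptr, nbytes, owner=blob)
gpu_view = cp.ndarray(shape, dtype=cp.float32,
                      memptr=cp.cuda.MemoryPointer(mem, 0))
# gpu_view aliases the blob's device memory; filling it writes the input
# directly on the GPU before calling net.forward().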
As you've noted, using a Python layer forces data in and out of the GPU, and this can cause a huge hit to performance. This is true not just for Caffe, but for other frameworks too. To elaborate on Shai's answer, you could look at this step-by-step tutorial on adding C++ layers to Caffe. The example given should touch on most issues dealing with layer implementation. Disclosure: I am the author.
Related
I am surprised that I could not find a concise way to run a tf.data pipeline on the GPU. I understand that data pipelines can run on the CPU so that they can execute in parallel (with pre-fetching), allowing the GPU to run the actual model and train it.
However, my pre-processing is extremely parallel and computationally intensive. While I could technically write the pre-processing as the first layer in my model, I would really prefer not to do this, to prevent training-data leakage into my model.
Any pointers are appreciated. The closest I found was https://towardsdatascience.com/overcoming-data-preprocessing-bottlenecks-with-tensorflow-data-service-nvidia-dali-and-other-d6321917f851, which involves using the NVIDIA DALI framework.
Here are some critical points:
I have already tried enforcing device placement with tf.device('...').
I don't want to prefetch the data onto the device; rather, I want to run the whole data pipeline on the GPU.
Preferably, since my computation is so expensive, I want to save my dataset as TFRecords so that I can load it directly. This can be done with tf.data.experimental.save for now, but again it uses the CPU! (A minimal sketch of this setup follows below.)
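For concreteness, a minimal sketch of the setup in question (heavy_preprocess and the output path are placeholders, not the real pipeline):
import tensorflow as tf

def heavy_preprocess(x):
    # stand-in for the expensive, highly parallel pre-processing
    return tf.math.sqrt(tf.cast(x, tf.float32))

# Wrapping the pipeline in tf.device does not force the map onto the GPU;
# tf.data transformations are still executed by the CPU-based input pipeline.
with tf.device('/GPU:0'):
    ds = tf.data.Dataset.range(1000).map(
        heavy_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Materializing the pre-processed dataset to disk also runs on the CPU.
tf.data.experimental.save(ds, '/tmp/preprocessed_ds')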
I've tried to load the pretrained model from one article, but I can't do this because I have a single-GPU system, while the model is explicitly set to use gpu:0 and gpu:1. What can I do to load this model on my PC?
I have Ubuntu, Python 3.7, CUDA 10, TensorFlow 2.0.
You need to use gpu:0 only, at every place, since your system has only one GPU. Also, as the code is written to use multiple GPUs, it needs to be changed everywhere it is configured for multi-GPU processing (of data or functions). Please refer to the link below for more information on GPU usage:
https://jhui.github.io/2017/03/07/TensorFlow-GPU/
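For illustration (a minimal sketch, not the article's actual code), every block pinned to the missing GPU can be rewritten to use gpu:0; enabling soft device placement in TF 2.0 may also help when ops are pinned to devices that do not exist:
import tensorflow as tf

# Let TensorFlow fall back to an available device for ops pinned to a missing one.
tf.config.set_soft_device_placement(True)

# Otherwise, change every `with tf.device('/gpu:1'):` block to use the only GPU:
with tf.device('/gpu:0'):
    x = tf.random.normal([4, 4])
    y = tf.matmul(x, x)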
I'm doing neural style transfer using TensorFlow (with the 19.01 NVIDIA TensorFlow Docker image) in Python. I have an NVIDIA 2070 graphics card, and I've been getting out-of-memory errors when I try to run a newer version of the TensorFlow Docker image (19.08, for example). So I decided that perhaps it is time to consider using 16-bit precision instead of 32-bit for storing the parameters of the VGG19 CNN.
My initial research when I built my machine had led me to believe that switching from 32 to 16 was a cakewalk, but that hasn't been my experience now that I'm actively trying to make the transition.
This is what I have done:
Set tf.keras.backend.set_floatx('float16')
Set tf.keras.backend.set_epsilon(1e-4)
Changed my image input to the VGG19 network to float16, along with any other miscellaneous parts of my code that use float32 in conjunction with float16 (roughly as in the sketch after this list).
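Roughly, those steps amount to something like this (the random image and weights=None are placeholders, not the actual style-transfer code):
import numpy as np
import tensorflow as tf

tf.keras.backend.set_floatx('float16')   # step 1
tf.keras.backend.set_epsilon(1e-4)       # step 2

# step 3: feed the network a float16 image (random placeholder here)
image = tf.constant(np.random.rand(1, 224, 224, 3).astype(np.float16))
vgg = tf.keras.applications.VGG19(include_top=False, weights=None)
features = vgg(image)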
When I run the code, nvidia-smi still reports that essentially 100% of my GPU is being used. Has anyone had any success with reducing their model memory footprint by switching to float16 in TensorFlow?
TensorFlow has various ways of managing mixed precision. The most suitable approach depends on which optimizer you plan to use. Keras optimizers, for example, have an API designed to make it easy to port code one way or the other; it is called mixed_precision.
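With a Keras optimizer that looks roughly like this (a sketch; the exact module path depends on your TF version, e.g. tf.keras.mixed_precision.experimental on older releases):
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32.
mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    # Keep the final layer in float32 for numerical stability.
    tf.keras.layers.Dense(10, dtype='float32'),
])
model.compile(optimizer='adam', loss='mse')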
If you are using a TFv1 optimizer, or one of the other non-Keras optimizers offered by TensorFlow, you can use their graph rewrite function to convert various pieces of the graph to float16.
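The graph rewrite route looks roughly like this (again only a sketch; the exact location of the function varies across TF releases):
import tensorflow as tf

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3)
# Rewrites eligible parts of the graph to float16 and adds automatic loss scaling.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)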
I intend to use PoseNet in Python and not in the browser; for that I need the model as a frozen graph to do inference on. Is there a way to do that?
I ported Google's tfjs PoseNet to Python over the holidays. The demo apps in the repository automatically download the weights, freeze a graph, and save it to a model file. You can grab this model and use it in any TF variant.
I wrote a Python version of the multi-person post-processing code that uses vectorized scipy/numpy ops to speed up a few parts. I have not done exhaustive testing of this part, but with a number of spot checks on various test images against the reference, and after using it on some other sources, it seems to be reasonably close to the original, and faster :)
Python + TF at https://github.com/rwightman/posenet-python
I also did a PyTorch conversion at https://github.com/rwightman/posenet-pytorch
And if you happen to be looking for a CoreML port at some point, I started off with the weight conversion code from this project https://github.com/infocom-tpo/PoseNet-CoreML
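For reference, consuming such a frozen graph from plain TF 1.x generally looks like the following (the file name and tensor names are placeholders, not the actual ones used by the repositories above):
import numpy as np
import tensorflow as tf

with tf.gfile.GFile('posenet_model.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = np.zeros((1, 513, 513, 3), dtype=np.float32)   # dummy input
    heatmaps = sess.run('heatmaps:0', feed_dict={'image:0': image})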
We currently do not have the frozen graph for inference publicly, however you could download the assets and run them in a Node.js environment.
So in TensorFlow's guide for using GPUs there is a part about using multiple GPUs in a "multi-tower fashion":
...
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):  # <---- manual device placement
    ...
Seeing this, one might be tempted to leverage this style for multiple GPU training in a custom Estimator to indicate to the model that it can be distributed across multiple GPUs efficiently.
To my knowledge, if manual device placement is absent, TensorFlow does not perform any form of optimal device mapping (except, perhaps, that if you have the GPU version installed and a GPU is available, it will be used over the CPU). So what other choice do you have?
Anyway, you carry on with training your estimator and export it to a SavedModel via estimator.export_savedmodel(...) and wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the device on which the model was trained (or maybe no GPUs)
so when you run
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)
you get
Cannot assign a device for operation <OP-NAME>. Operation was
explicitly assigned to <DEVICE-NAME> but available devices are
[<AVAILABLE-DEVICE-0>,...]
An older S.O. post suggests that changing device placement is not possible... but hopefully things have changed over time.
Thus my question is:
1. When loading a SavedModel, can I change the device placement to be appropriate for the device it is loaded on? E.g., if I train a model with 6 GPUs and a friend wants to run it at home with their e-GPU, can they map '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?
2. If 1 is not possible, is there a (painless) way for me, in the custom Estimator's model_fn, to specify how to generically distribute a graph? E.g.
with tf.device('available-gpu-3')
where 'available-gpu-3' is the third available GPU if there are three or more GPUs, otherwise the second or first available GPU, and the CPU if there is no GPU (a sketch of one way to approximate this follows below).
This matters because if there is a shared machine which is training two models, say one model on '/device:GPU:0' while the other model is trained explicitly on GPUs 1 and 2... then on another 2-GPU machine, GPU 2 will not be available...
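For what it's worth, here is a minimal sketch (not from the original post) of the kind of 'available-gpu-N' helper meant in question 2, using TF 1.x device enumeration:
import tensorflow as tf
from tensorflow.python.client import device_lib

def available_device(preferred_index):
    # Return the preferred GPU if present, else the highest-numbered
    # available GPU, else the CPU.
    gpus = [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']
    if not gpus:
        return '/cpu:0'
    return gpus[min(preferred_index, len(gpus) - 1)]

with tf.device(available_device(2)):   # 'available-gpu-3' from the question
    x = tf.constant([[1.0, 2.0]])
    y = tf.matmul(x, tf.transpose(x))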
I have been doing some research on this topic recently, and to my knowledge your question 1 can only work if you clear all devices when you export the model in the original TensorFlow code, with the flag clear_devices=True.
In my own code, it looks like this:
builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
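Loading such a re-exported model then works even on a machine with fewer (or no) GPUs, since the ops are no longer pinned to specific devices; a minimal loading sketch (tags and path matching the export above):
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # Succeeds on a single-GPU or CPU-only machine because devices were cleared.
    tf.saved_model.loader.load(sess, ['serve'], 'osvos_saved')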
If you only have an exported model, it does not seem to be possible. You can refer to this issue.
I'm currently trying to find a way to fix this, as stated in my Stack Overflow question. I hope the workaround can help you.