I've tried to load the pretrained model from one article, but i can't do this becuse i have one GPU-system, but in model it is is explicitly set to use gpu:0 and gpu:1. What can I do to load this model on my pc?
I have ubuntu, python3.7, cuda10, tensorflow 2.0
You need to use gpu:0 only at every place since it has only one gpu. Also, as the code is written to use mutiple GPUs, it needs to be changed at every place where it is configured for multi level processing (data or function). Please refer below for more information on GPU usage:
https://jhui.github.io/2017/03/07/TensorFlow-GPU/
Related
I have two models with the exact same architecture, but different weights as the same network is used for two different problems. We're using TF-TRT to optimize the model in order to use it on edge devices.
We'd like to be able to switch from one model to the other as fast as possible. As of now, we load the next model using tf.saved_model.load(), however, this reloads the entire model including the architecture. In order to speed up the process, we'd like to simply load the weights & switch them in the model architecture.
From what I've seen, it is possible in Keras by loading a .w1 file, but we don't have such file after converting to TF-TRT.
I've found out that TRT has a Refitter object but I don't think we can use it in this case.
I'd like to know if it is possible to switch weights of a TF-TRT model, perhaps there is something I'm missing out.
Thank you for your help.
Let's say that I want to fine tune one of the Tensorflow Hub image feature vector modules. The problem arises because in order to fine-tune a module, the following needs to be done:
module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3", trainable=True, tags={"train"})
Assuming that the module is Resnet50.
In other words, the module is imported with the trainable flag set as True and with the train tag. Now, in case I want to validate the model (perform inference on the validation set in order to measure the performance of the model), I can't switch off the batch-norm because of the train tag and the trainable flag.
Please note that this question has already been asked here Tensorflow hub fine-tune and evaluate but no answer has been provided.
I also raised a Github issue about it issue about it.
Looking forward to your help!
With hub.Module for TF1, the situation is as you say: either the training or the inference graph is instantiated, and there is no good way to import both and share variables between them in a single tf.Session. That's informed by the approach used by Estimators and many other training scripts in TF1 (esp. distributed ones): there's a training Session that produces checkpoints, and a separate evaluation Session that restores model weights from them. (The two will likely also differ in the dataset they read and the preprocessing they perform.)
With TF2 and its emphasis on Eager mode, this has changed. TF2-style Hub modules (as found at https://tfhub.dev/s?q=tf2-preview) are really just TF2-style SavedModels, and these don't come with multiple graph versions. Instead, the __call__ function on the restored top-level object takes an optional training=... parameter if the train/inference distinction is required.
With this, TF2 should match your expectations. See the interactive demo tf2_image_retraining.ipynb and the underlying code in tensorflow_hub/keras_layer.py for how it can be done. The TF Hub team is working on making more complete selection of modules available for the TF2 release.
So in TensorFlow's guide for using GPUs there is a part about using multiple GPUs in a "multi-tower fashion":
...
for d in ['/device:GPU:2', '/device:GPU:3']:
with tf.device(d): # <---- manual device placement
...
Seeing this, one might be tempted to leverage this style for multiple GPU training in a custom Estimator to indicate to the model that it can be distributed across multiple GPUs efficiently.
To my knowledge, if manual device placement is absent TensorFlow does not have some form of optimal device mapping (expect perhaps if you have the GPU version installed and a GPU is available, using it over the CPU). So what other choice do you have?
Anyway, you carry on with training your estimator and export it to a SavedModel via estimator.export_savedmodel(...) and wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the device on which the model was trained (or maybe no GPUs)
so when you run
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)
you get
Cannot assign a device for operation <OP-NAME>. Operation was
explicitly assigned to <DEVICE-NAME> but available devices are
[<AVAILABLE-DEVICE-0>,...]
An older S.O. Post suggests that changing device placement was not possible... but hopefully over time things have changed.
Thus my question is:
when loading a SavedModel can I change the device placement to be appropriate for the device it is loaded on. E.g. if I train a model with 6 GPUs and a friend wants to run it at home with their e-GPU, can they set '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?
if 1 is not possible, is there a (painless) way for me, in the custom Estimator's model_fn, to specify how to generically distribute a graph?
e.g.
with tf.device('available-gpu-3')
where available-gpu-3 is the third available GPU if there are three or more GPUs, otherwise the second or first available GPU, and if no GPU it is CPU
This matters because if there is a shared machine with is training two models, say one model on '/device:GPU:0' then the other model is trained explicitly on GPUs 1 and 2... so on another 2 GPU machine, GPU 2 will not be available....
I am doing some research on this topic recently and to my knowledge, your question 1 can work only if you clear all devices when you export the model in the original tensorflow code, with flag clear_devices=True.
In my own code, it looks like
builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
If you only have a exported model, seems not possible. You can refer to this issue.
I'm currently trying to find a way to fix this, as stated in my stackoverflow question. Hope the workaround can help you.
Is there a way to reliably enable CUDA on the whole model?
I want to run the training on my GPU. I found on some forums that I need to apply .cuda() on anything I want to use CUDA with (I've applied it to everything I could without making the program crash). Surprisingly, this makes the training even slower.
Then, I found that you could use this torch.set_default_tensor_type('torch.cuda.FloatTensor') to use CUDA. With both enabled, nothing changes. What is happening?
You can use the tensor.to(device) command to move a tensor to a device.
The .to() command is also used to move a whole model to a device, like in the post you linked to.
Another possibility is to set the device of a tensor during creation using the device= keyword argument, like in t = torch.tensor(some_list, device=device)
To set the device dynamically in your code, you can use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to set cuda as your device if possible.
There are various code examples on PyTorch Tutorials and in the documentation linked above that could help you.
With both enabled, nothing changes.
That is because you have already set every tensor to GPU.
Is there a way to reliably enable CUDA on the whole model?
model.to('cuda')
I've applied it to everything I could
You only need to apply it to tensors the model will be interacting with, generally:
the model's pramaters model.to('cuda')
the features data features = features.to('cuda')
the target data targets = targets.to('cuda')
I've read caffe2 tutorials and tried pre-trained models. I knew caffe2 will leverge GPU to run the model/net. But the input data seems always be given from CPU(ie. Host) memory. For example, in Loading Pre-Trained Models, after model is loaded, we can predict an image by
result = p.run([img])
However, image "img" should be read in CPU scope. What I look for is a framework that can pipline the images (which is decoded from a video and still resides in GPU memory) directly to the prediction model, instead of copying it from GPU to CPU scope, and then transfering to GPU again to predict result. Is Caffe or Caffe2 provides such functions or interfaces for python or C++? Or should I need to patch Caffe to do so? Thanks at all.
Here is my solution:
I'd found in tensor.h, function ShareExternalPointer() can exactly do what I want.
Feed gpu data this way,
pInputTensor->ShareExternalPointer(pGpuInput, InputSize);
then run the predict net through
pPredictNet->Run();
where pInputTensor is the entrance tensor for the predict net pPredictNet
I don't think you can do it in caffe with python interface.
But I think that it can be accomplished using the c++: In c++ you have access to the Blob's mutable_gpu_data(). You may write code that run on device and "fill" the input Blob's mutable_gpu_data() directly from gpu. Once you made this update, caffe should be able to continue its net->forward() from there.
UPDATE
On Sep 19th, 2017 PR #5904 was merged into master. This PR exposes GPU pointers of blobs via the python interface.
You may access blob._gpu_data_ptr and blob._gpu_diff_ptr directly from python at your own risk.
As you've noted, using a Python layer forces data in and out of the GPU, and this can cause a huge hit to performance. This is true not just for Caffe, but for other frameworks too. To elaborate on Shai's answer, you could look at this step-by-step tutorial on adding C++ layers to Caffe. The example given should touch on most issues dealing with layer implementation. Disclosure: I am the author.