I work in an environment in which computational resources are shared, i.e., we have a few server machines equipped with a few Nvidia Titan X GPUs each.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow only allocate, say, 4 GB of GPU memory, if one knows that this is enough for a given model?
You can set the fraction of GPU memory to be allocated when you construct a tf.Session by passing a tf.GPUOptions as part of the optional config argument:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The per_process_gpu_memory_fraction acts as a hard upper bound on the amount of GPU memory that the process will use on each GPU of the machine. Currently, this fraction is applied uniformly to all of the GPUs on the same machine; there is no way to set it on a per-GPU basis.
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
https://github.com/tensorflow/tensorflow/issues/1578
For TensorFlow 2.0 and 2.1 (docs):
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TensorFlow 2.2+ (docs):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)
The docs also list some more methods:
Set environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
Use tf.config.experimental.set_virtual_device_configuration to set a hard limit on a Virtual GPU device (see the sketch below).
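A minimal sketch of the second method, assuming TensorFlow 2.x; the 4096 MB limit is illustrative, chosen to roughly match the ~4 GB asked about in the question:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Create a virtual GPU with a hard ~4 GB limit on the first physical GPU
  tf.config.experimental.set_virtual_device_configuration(
      gpus[0],
      [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])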
Here is an excerpt from the book Deep Learning with TensorFlow:
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two configuration options on the session to control this. The first is the allow_growth option, which attempts to allocate only as much GPU memory as needed at runtime: it starts out allocating very little memory, and as sessions get run and more GPU memory is needed, the GPU memory region used by the TensorFlow process is extended.
1) Allow growth: (more flexible)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. Note: the memory is not released afterwards, since doing so can worsen memory fragmentation.
2) Allocate fixed memory:
To allocate only 40% of the total memory of each GPU:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Note: this is only useful if you truly want to cap the amount of GPU memory available to the TensorFlow process.
For TensorFlow versions 2.0 and 2.1, use the following snippet:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
For prior versions, the following snippet used to work for me:
import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)
All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.estimator framework (TensorFlow 1.4 and above), the way to pass the fraction along to the implicitly created MonitoredTrainingSession is:
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
                       config=trainingConfig)
Similarly in Eager mode (TensorFlow 1.5 and above),
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)
Edit: 11-04-2018
As an example, if you are using tf.contrib.gan.train, then you can use something similar to the following:
tf.contrib.gan.gan_train(........, config=conf)
You can use
TF_FORCE_GPU_ALLOW_GROWTH=true
in your environment variables.
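For example, the variable can be exported in the shell before launching the script, or (shown here as a sketch) set from Python before TensorFlow initializes its GPU devices:

import os

# Must be set before TensorFlow touches the GPU for it to take effect
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf  # imported after the variable is set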
In the TensorFlow source code:
bool GPUBFCAllocator::GetAllowGrowthValue(const GPUOptions& gpu_options) {
  const char* force_allow_growth_string =
      std::getenv("TF_FORCE_GPU_ALLOW_GROWTH");
  if (force_allow_growth_string == nullptr) {
    return gpu_options.allow_growth();
  }
  // ... (the remainder parses the string value and returns true or false accordingly)
}
Tensorflow 2.0 Beta and (probably) beyond
The API changed again. It can now be found at:
tf.config.experimental.set_memory_growth(device, enable)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
Tensorflow - Use a GPU: https://www.tensorflow.org/guide/gpu
All the answers above refer either to setting the memory to a certain extent in TensorFlow 1.X versions or to allowing memory growth in TensorFlow 2.X.
The method tf.config.experimental.set_memory_growth indeed works for allowing dynamic growth during allocation/preprocessing. Nevertheless, one may want to allocate a specific upper limit of GPU memory from the start.
The reason for allocating a specific amount of GPU memory would also be to prevent out-of-memory (OOM) errors during training sessions. For example, if one trains while other video-memory-consuming processes (such as Chrome tabs) are open, tf.config.experimental.set_memory_growth(gpu, True) could result in OOM errors being thrown, hence the need to allocate more memory from the start in certain cases.
The recommended and correct way to allot memory per GPU in TensorFlow 2.X is the following:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
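As a small follow-up (not part of the original answer), one can check that the virtual device with the 1 GB limit was actually created:

logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "physical GPU(s),", len(logical_gpus), "logical GPU(s)")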
Shameless plug: if you install the GPU-supported TensorFlow, the session will first allocate memory on all GPUs whether you set it to use only the CPU or the GPU. I may add my tip that even if you set the graph to use the CPU only, you should set the same configuration (as answered above) to prevent unwanted GPU occupation.
And in an interactive interface like IPython or Jupyter, you should also set that configuration; otherwise, it will allocate all the memory and leave almost none for others. This is sometimes hard to notice.
If you're using TensorFlow 2, try the following:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
For TensorFlow 2.0, this solution worked for me (TF-GPU 2.0, Windows 10, GeForce RTX 2070):
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# allocate 60% of GPU memory
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
set_session(tf.Session(config=config))
This code has worked for me:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
Well, I am new to TensorFlow. I have a GeForce 740M or similar GPU with 2 GB of RAM, and I was running an MNIST-style handwritten-digit example for a native language, with training data containing 38,700 images and 4,300 test images. I was trying to get precision, recall, and F1 using the following code, as sklearn was not giving me precise results. Once I added this to my existing code, I started getting GPU errors.
TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)
prec = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * prec * recall / (prec + recall)
Plus, my model was heavy I guess; I was getting a memory error after 147 or 148 epochs. Then I thought, why not create functions for the tasks? I don't know if it works this way in TensorFlow, but I figured that if a local variable goes out of scope, it may release memory, so I defined the above elements for training and testing in modules. I was able to run 10,000 epochs without any issues. I hope this helps.
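A rough sketch of that restructuring, using the metric code above (this merely wraps the author's snippet in a function as described; whether it actually frees GPU memory depends on the TensorFlow version and how the session is used):

import tensorflow as tf

def compute_metrics(predicted, actual):
    # Local tensors go out of scope when the function returns
    TP = tf.count_nonzero(predicted * actual)
    TN = tf.count_nonzero((predicted - 1) * (actual - 1))  # as in the original snippet, though unused below
    FP = tf.count_nonzero(predicted * (actual - 1))
    FN = tf.count_nonzero((predicted - 1) * actual)
    prec = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * prec * recall / (prec + recall)
    return prec, recall, f1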
I tried to train U-Net on the VOC dataset, but because of the huge image size, memory runs out. I tried all the above tips, and even tried with batch size == 1, yet with no improvement. Sometimes the TensorFlow version also causes memory issues. Try using
pip install tensorflow-gpu==1.8.0
I'm wondering what the correct way is to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API. I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on the GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model that they suggest explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up, with all 8 cores at about 70% usage each, and my GPU memory is maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create the model on the CPU and use a tf.device context to train the model on the GPU?
Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
Tensorflow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
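For instance, a sketch (assuming TF 2.x) of restricting TensorFlow to a single visible GPU, so that memory is only allocated and ops only placed there:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Only the first GPU remains visible; the others get no memory or ops
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')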
GPU utilization also depends on the batch_size; it may change as the batch_size varies.
You can also compare your current results (time taken and utilization) with a model using Example 3 from multi_gpu_model.
Also, if you follow the link, it states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
There should be performance improvement and GPU Utilization using tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
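A minimal sketch of that replacement, reusing the Xception setup from the question (the input shape and number of classes here are illustrative assumptions):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Build and compile the model inside the strategy scope so its
    # variables become MirroredVariables replicated across the GPUs
    model = tf.keras.applications.Xception(
        weights=None, input_shape=(299, 299, 3), classes=10)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model.fit(...) then trains the replicated model; no multi_gpu_model needed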
I would recommend going through the Custom training with tf.distribute.Strategy tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops; it trains a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.
So in TensorFlow's guide for using GPUs there is a part about using multiple GPUs in a "multi-tower fashion":
...
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):  # <---- manual device placement
    ...
Seeing this, one might be tempted to leverage this style for multiple GPU training in a custom Estimator to indicate to the model that it can be distributed across multiple GPUs efficiently.
To my knowledge, if manual device placement is absent, TensorFlow does not have some form of optimal device mapping (except perhaps that if you have the GPU version installed and a GPU is available, it uses it over the CPU). So what other choice do you have?
Anyway, you carry on with training your estimator and export it to a SavedModel via estimator.export_savedmodel(...), and you wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the one on which the model was trained (or maybe no GPUs at all).
So when you run
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)
you get
Cannot assign a device for operation <OP-NAME>. Operation was
explicitly assigned to <DEVICE-NAME> but available devices are
[<AVAILABLE-DEVICE-0>,...]
An older S.O. post suggests that changing device placement was not possible... but hopefully over time things have changed.
Thus my question is:
When loading a SavedModel, can I change the device placement to be appropriate for the device it is loaded on? E.g. if I train a model with 6 GPUs and a friend wants to run it at home with their eGPU, can they map '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?
If 1 is not possible, is there a (painless) way for me, in the custom Estimator's model_fn, to specify how to generically distribute a graph?
e.g.
with tf.device('available-gpu-3')
where available-gpu-3 is the third available GPU if there are three or more GPUs, otherwise the second or first available GPU, and if there is no GPU, the CPU.
This matters because if there is a shared machine which is training two models, say one model on '/device:GPU:0' while the other model is trained explicitly on GPUs 1 and 2... then on another machine with 2 GPUs, GPU 2 will not be available.
I have been doing some research on this topic recently and, to my knowledge, your question 1 can work only if you clear all devices when you export the model in the original TensorFlow code, with the flag clear_devices=True.
In my own code, it looks like
builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
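The exported directory can then be loaded on a machine with a different GPU layout using the predictor API from the question; a sketch (this assumes the export also registered a serving signature, which the builder call above does not show, and the feed key below is hypothetical):

from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model('osvos_saved')
# output = predict_fn({'input': some_input_batch})  # feed keys depend on the signature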
If you only have an exported model, it seems not to be possible. You can refer to this issue.
I'm currently trying to find a way to fix this, as stated in my Stack Overflow question. I hope the workaround can help you.
I coded both Python and C++ versions of Caffe forward-classification scripts to test Caffe's inference performance. The model is already trained, and the results are quite similar: GPU utilization is not high enough.
My settings:
1. Card: Titan XP, 12GB
2. Model: InceptionV3
3. Img size: 3*299*299
When batch_size is set to 40, GPU RAM usage can reach 10 GB, but GPU utilization only reaches 77%~79%, for both Python and C++. So the performance is about 258 frames/s.
In my scripts, I load the image, preprocess it, copy it into the input layer, and then repeat the net_.forward() operation. According to my understanding, this won't cause any memory-copy ops, so ideally it should push GPU utilization to its maximum. But I can reach no more than 80%.
In the C++ classification tutorial, I found the phrase below:
Use multiple classification threads to ensure the GPU is always fully utilized and not waiting for an I/O blocked CPU thread.
So I tried using the multi-thread compiled OpenBLAS: under the CPU backend, more CPU cores are indeed involved in the forward pass, but it makes no difference for the GPU backend. Under the GPU backend, CPU utilization stays fixed at about 100%.
Then I even tried reducing the batch_size to 20 and starting two classification processes in two terminals. The result: GPU RAM usage increases to 11 GB, but GPU utilization decreases to 64%~66%. Finally, the performance drops to around 200 frames/s.
Has anyone encountered this problem? I'm really confused.
Any opinion is welcome.
Thanks.
As I have observed, GPU utilization decreases with:
1) a lower PCI Express link width: resnet-152 (x16) - 90% > resnet-152 (x8) - 80% > resnet-152 (x4) - 70%
2) larger models: VGG (x16) - 100%; ResNet-50 (x16) - 95~100%; ResNet-152 (x16) - 90%
In addition, if I turn off cuDNN, GPU utilization is always 100%.
So I think there is some problem related to cuDNN, but I don't know more about it.
NVCaffe is somewhat better, and MXNet can utilize the GPU at 100% (resnet-152; x4).
I need to train a very large number of neural nets using TensorFlow with Python. My neural nets (MLPs) range from very small ones (~2 hidden layers with ~30 neurons each) to large ones (3-4 layers with >500 neurons each).
I am able to run all of them sequentially on my GPU, which is fine, but my CPU is almost idling. Additionally, I found that my CPU is quicker than the GPU for my very small nets (I assume because of the GPU overhead etc.). That's why I want to use both my CPU and my GPU in parallel to train my nets. The CPU should work through the networks from the smaller ones toward the larger ones, and my GPU from the larger ones toward the smaller ones, until they meet somewhere in the middle... I thought this was a good idea :-)
So I simply start my consumers twice in different processes, one with device = CPU and the other with device = GPU. Both start and consume the first 2 nets as expected. But then the GPU consumer throws an exception that its tensor is accessed/violated by another process on the CPU(!), which I find weird, because it is supposed to run on the GPU...
Can anybody help me to fully segregate my two processes?
Do any of your networks share operators?
E.g., do they use variables with the same name in the same variable_scope, with variable_scope(reuse=True) set?
Then multiple nets will try to reuse the same underlying Tensor structures.
Also check whether tf.ConfigProto.allow_soft_placement is set to True or False in your tf.Session. If it is True, you cannot be guaranteed that device placement will actually be executed the way you intended in your code.
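As an illustration of that last point, here is a sketch (assuming TF 1.x sessions, and that hiding the GPU from the CPU-only consumer is acceptable) of hard-separating the two consumer processes; the tiny dense layer is just a stand-in for one of the small nets:

import os

# In the CPU-only consumer process, hide the GPU entirely so TensorFlow
# cannot place anything (or grab memory) there. Must run before importing TF.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import tensorflow as tf

with tf.device('/cpu:0'):
    x = tf.random_normal([32, 30])
    y = tf.layers.dense(x, 30)  # stand-in for one of the small MLPs

# Disable soft placement so a misplaced op fails loudly instead of silently
# migrating to another device.
config = tf.ConfigProto(allow_soft_placement=False, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)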