I'm trying to run the CIFAR10 tutorial with the training code on one gpu and the eval code on the other. I know for sure I have two gpus on my computer, and I can test this by running the simple examples here: https://www.tensorflow.org/how_tos/using_gpu/index.html
However, wrapping code in with tf.device('/gpu:0') does not work for most variables in the CIFAR example. I tried many combinations: some variables on the GPU and others on the CPU, or all variables on one or the other. I always get the same error for some variable, something like this:
Cannot assign a device to node 'shuffle_batch/random_shuffle_queue': Could not satisfy explicit device specification '/gpu:0'
Is this possibly a bug in TensorFlow or am I missing something?
Could not satisfy explicit device specification means you do not have the corresponding device. Do you actually have a CUDA-enabled GPU on your machine?
UPDATE: As it turned out in the discussion below, this error is also raised if the particular operation (in this case, RandomShuffleQueue) cannot be executed on the GPU, because it only has a CPU implementation.
If you are fine with TensorFlow choosing a device for you (particularly, falling back to CPU when no GPU implementation is available), consider setting allow_soft_placement in your configuration, as per this article:
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
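The snippet above uses the TensorFlow 1.x session API. As a minimal sketch, assuming TensorFlow 2.x with eager execution, the same fallback behaviour can be requested through tf.config.set_soft_device_placement. The small matrix here is only an illustration; on a machine without a GPU the pinned ops should fall back to the CPU instead of raising the placement error.

```python
import tensorflow as tf

# Let TensorFlow pick another device when the requested one is missing
# or when an op has no kernel for it.
tf.config.set_soft_device_placement(True)

# Explicitly request the first GPU; with soft placement enabled this
# block still runs on a CPU-only machine.
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

total = float(tf.reduce_sum(b))
print(total)
```

To see where each op actually ran, tf.debugging.set_log_device_placement(True) can be called before creating any tensors.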
I have been trying to optimize some Tensorflow code that was pretty memory inefficient (use of large dense tensors containing very sparse information), and would thus limit batch size and scalability, by trying to make use of SparseTensors.
After some struggle I finally came up with a decent solution with a satisfactory speedup on CPU and very low memory usage, but when the time came to use a GPU I realized that the previous memory-inefficient version is orders of magnitude faster...
Using TensorBoard profiling I've discovered that two of the operations I used in my "optimized" version only run on CPU (namely UniqueV2 and sparse_dense_matmul), but I could not see any hint of that in the documentation.
The only related piece of documentation states:
If a TensorFlow operation has no corresponding GPU implementation,
then the operation falls back to the CPU device. For example, since
tf.cast only has a CPU kernel, on a system with devices CPU:0 and
GPU:0, the CPU:0 device is selected to run tf.cast, even if requested
to run on the GPU:0 device.
In turn there is nothing in the tf.cast documentation hinting that the op has no GPU kernel.
Thus, is there a simple way to know whether a TF op has a registered GPU kernel, without having to use a GPU to find out?
The custom ops guide suggests that this could be seen by looking at the op's C++ source files, but this seems a rather cumbersome way to do it...
I'm using TF v2.8
Thanks!
I recently built my first TensorFlow model (converted from hand-coded python). I'm using tensorflow-gpu, but I only want to use GPU for backprop during training. For everything else I want to use CPU. I've seen this article showing how to force CPU use on a system that will use GPU by default. However, you have to specify every single operation where you want to force CPU use. Instead I'd like to do the opposite. I'd like to default to CPU use, but then specify GPU just for the backprop that I do during training. Is there a way to do that?
Update
Looks like things are just going to run slower over tensorflow because of how my model and scenario are built at present. I tried using a different environment that just uses regular (non-gpu) tensorflow, and it still runs significantly slower than hand-coded python. The reason for this, I suspect, is it's a reinforcement learning model that plays checkers (see below) and makes one single forward prop "prediction" at a time as it plays against a computer opponent. At the time I designed the architecture, that made sense. But it's not very efficient to do predictions one at a time, and less so with whatever overhead there is for tensorflow.
So, now I'm thinking that I'm going to need to change the game playing architecture to play, say, a thousand games simultaneously and run a thousand forward prop moves in a batch. But, man, changing the architecture now is going to be tricky at best.
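To illustrate why batching matters, here is a minimal NumPy sketch, not the poster's actual model. The two-layer network, its sizes (32 board features, 64 hidden units) and the random states are all hypothetical. It evaluates 1000 positions once by looping over single predictions and once as a single batched call, and the two give the same values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny value network: 32 board features -> 64 hidden -> 1 score.
W1 = rng.standard_normal((32, 64))
W2 = rng.standard_normal((64, 1))

def forward(states):
    """Forward pass for a batch of states with shape (batch, 32)."""
    hidden = np.tanh(states @ W1)
    return hidden @ W2

# One board state per simultaneous game (hypothetical data).
states = rng.standard_normal((1000, 32))

# One prediction at a time: 1000 separate small matrix products.
one_at_a_time = np.concatenate([forward(s[None, :]) for s in states])

# All games at once: a single large matrix product.
batched = forward(states)

print(batched.shape, np.allclose(one_at_a_time, batched))
```

The batched call pays the per-call overhead once instead of a thousand times, which is the overhead that dominates when a framework such as TensorFlow is asked for single predictions.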
TensorFlow lets you control device placement with the tf.device context manager.
So for example to run some code on the CPU do
with tf.device('cpu:0'):
    <your code goes here>
Similarly to force GPU usage.
Instead of always running your forward pass on the CPU though you're better off making two graphs: a forward-only cpu-only graph to be used when rolling out the policy and a gpu-only forward-and-backward graph to be used when training.
So in TensorFlow's guide for using GPUs there is a part about using multiple GPUs in a "multi-tower fashion":
...
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d): # <---- manual device placement
        ...
Seeing this, one might be tempted to leverage this style for multiple GPU training in a custom Estimator to indicate to the model that it can be distributed across multiple GPUs efficiently.
To my knowledge, if manual device placement is absent, TensorFlow does not have some form of optimal device mapping (except perhaps that, if you have the GPU version installed and a GPU is available, it uses the GPU over the CPU). So what other choice do you have?
Anyway, you carry on with training your estimator, export it to a SavedModel via estimator.export_savedmodel(...), and wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the machine on which the model was trained (or maybe no GPUs at all).
So when you run
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)
you get
Cannot assign a device for operation <OP-NAME>. Operation was
explicitly assigned to <DEVICE-NAME> but available devices are
[<AVAILABLE-DEVICE-0>,...]
An older S.O. Post suggests that changing device placement was not possible... but hopefully over time things have changed.
Thus my question is:
1. When loading a SavedModel, can I change the device placement to be appropriate for the device it is loaded on? E.g. if I train a model with 6 GPUs and a friend wants to run it at home with their e-GPU, can they set '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?
2. If 1 is not possible, is there a (painless) way for me, in the custom Estimator's model_fn, to specify how to generically distribute a graph?
e.g.
with tf.device('available-gpu-3'):
where available-gpu-3 is the third available GPU if there are three or more GPUs, otherwise the second or first available GPU, and if no GPU it is CPU
This matters because if there is a shared machine that is training two models, say one model on '/device:GPU:0' while the other model is trained explicitly on GPUs 1 and 2, then on another 2-GPU machine, GPU 2 will not be available...
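The fallback rule described for 'available-gpu-3' can be sketched in plain Python. This is only an illustration of the rule, not an existing TensorFlow feature: pick_device is a hypothetical helper, and the GPU count is passed in by the caller.

```python
def pick_device(preferred_gpu_index, num_available_gpus):
    """Return a device string for the preferred GPU, falling back gracefully.

    Hypothetical helper: uses the preferred GPU if it exists, otherwise the
    highest-numbered GPU that does exist, otherwise the CPU.
    """
    if num_available_gpus <= 0:
        return '/device:CPU:0'
    index = max(0, min(preferred_gpu_index, num_available_gpus - 1))
    return '/device:GPU:{}'.format(index)


# "Third available GPU" is index 2.
for gpus in (6, 2, 1, 0):
    print(gpus, pick_device(2, gpus))
```

In a model_fn the returned string could be passed to tf.device, with the count taken from something like len(tf.config.list_physical_devices('GPU')) in TensorFlow 2.x. Note that this only helps at graph-construction time; it does not change device strings already baked into an exported SavedModel.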
I have been doing some research on this topic recently and, to my knowledge, your question 1 can work only if you clear all devices when you export the model in the original TensorFlow code, using the flag clear_devices=True.
In my own code, it looks like
builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
If you only have an exported model, it does not seem to be possible. You can refer to this issue.
I'm currently trying to find a way to fix this, as stated in my stackoverflow question. Hope the workaround can help you.
Is there a way to reliably enable CUDA on the whole model?
I want to run the training on my GPU. I found on some forums that I need to apply .cuda() on anything I want to use CUDA with (I've applied it to everything I could without making the program crash). Surprisingly, this makes the training even slower.
Then, I found that you could use this torch.set_default_tensor_type('torch.cuda.FloatTensor') to use CUDA. With both enabled, nothing changes. What is happening?
You can use the tensor.to(device) command to move a tensor to a device.
The .to() command is also used to move a whole model to a device, like in the post you linked to.
Another possibility is to set the device of a tensor during creation using the device= keyword argument, like in t = torch.tensor(some_list, device=device)
To set the device dynamically in your code, you can use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to set cuda as your device if possible.
There are various code examples on PyTorch Tutorials and in the documentation linked above that could help you.
With both enabled, nothing changes.
That is because you have already set every tensor to GPU.
Is there a way to reliably enable CUDA on the whole model?
model.to('cuda')
I've applied it to everything I could
You only need to apply it to tensors the model will be interacting with, generally:
the model's parameters: model.to('cuda')
the feature data: features = features.to('cuda')
the target data: targets = targets.to('cuda')
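Putting the three items together, a minimal sketch might look like the following. The nn.Linear model and the random tensors are hypothetical stand-ins for the real model and data; the device is chosen dynamically so the same code also runs on a machine without CUDA.

```python
import torch
from torch import nn

# Use CUDA when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical model and data, only for illustration.
model = nn.Linear(4, 2).to(device)        # moves the model's parameters
features = torch.randn(8, 4).to(device)   # moves the feature data
targets = torch.randn(8, 2).to(device)    # moves the target data

# One training-style step: everything involved lives on the same device.
loss = nn.functional.mse_loss(model(features), targets)
loss.backward()

print(device, loss.item())
```

Inside a real training loop, the .to(device) calls on features and targets would be applied to each batch as it comes out of the data loader.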
I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153 inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped with tf.device('/cpu:0') around these two lines would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - There is also a generate_input_fn and parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well or would that basically be forced by having the inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment. Would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so that each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU setup above would be needed if you train in this way. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two pieces of code you pointed to are actually two different parts of training. Lines 282/294 are, in my opinion, the so-called "pre-processing" part, because they parse raw input data into tensors. These operations are not suitable for GPU acceleration, so it is sufficient to place them on the CPU.
Lines 152/153 are part of the training model, because they transform the raw features into different types of features.
'cpu:0' means the operations in this section will be placed on the CPU, but they are not bound to a specific core. Operations placed on the CPU run multi-threaded and use multiple cores.
If your machine has GPUs, TensorFlow will prefer to place operations on the GPUs when no device is specified.
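As a rough sketch of pinning only the parsing stage to the CPU: the example below uses tf.data from TensorFlow 2.x rather than the queue-based input functions of the census sample, so it is a substitute technique, not the sample's code. The parse_line function and the three-column CSV layout are hypothetical.

```python
import tensorflow as tf

def parse_line(line):
    """Parse one CSV line into (features, label); hypothetical 3-column layout."""
    fields = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0], [0]])
    features = tf.stack(fields[:2])
    label = fields[2]
    return features, label

# Build the input pipeline under an explicit CPU scope. 'CPU:0' names the
# CPU device as a whole; the work is still spread over multiple cores.
with tf.device('/CPU:0'):
    lines = tf.data.Dataset.from_tensor_slices(["1.0,2.0,0", "3.0,4.0,1"])
    dataset = lines.map(parse_line).batch(2)

features, labels = next(iter(dataset))
print(features.shape, labels.numpy())
```

Model code that follows is left without a device scope, so TensorFlow remains free to place it on a GPU when one is present.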
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude higher than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.