I don't have a GPU on my machine. Since most of the TensorFlow performance recommendations only mention GPUs, can someone confirm that e.g.
tf.data.Dataset.prefetch
tf.distribute.MirroredStrategy
tf.distribute.MultiWorkerMirroredStrategy
will only work with multiple GPUs?
I tried them on my PC, and most of these functions actually slow the process down instead of speeding it up. Does that mean multiple CPU cores are of no benefit here?
In case you haven't solved your problem yet: you can use Google Colab (https://colab.research.google.com) to get a GPU; there you can change the runtime type to GPU or TPU.
I did not understand exactly what you are asking, but let me give you a 10,000 ft explanation of those three. It might help you understand what they do and when to use them.
tf.data.Dataset.prefetch: let's suppose you have two steps while training your model: a) read data, b) process the data. While you are processing the current batch, you could already be reading more data so that it is available as soon as training is done with that batch. Think of a producer/consumer model: you don't want the consumer idle while you are producing more data.
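A minimal sketch of that pattern with the tf.data API (the pipeline below is just a stand-in, and tf.data.AUTOTUNE assumes a recent TF 2.x release):

    import tensorflow as tf

    # toy pipeline standing in for real file reading / preprocessing
    dataset = (tf.data.Dataset.range(1000)
               .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(32)
               .prefetch(tf.data.AUTOTUNE))  # prepare the next batch while the current one is consumed

    for batch in dataset:
        pass  # stand-in for the training step consuming the batch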
tf.distribute.MirroredStrategy: this one helps if you have a single machine with more than one GPU. It lets you train a model in "parallel" on the same machine.
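A hedged sketch of the usual pattern (the model is a placeholder, not anything from your code):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
    with strategy.scope():
        # variables created here are mirrored across the GPUs
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")
    # model.fit(...) then splits each batch across the replicas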
tf.distribute.MultiWorkerMirroredStrategy: now let's suppose you have a cluster with 5 machines. You could train your model in a distributed fashion using all of them.
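The training code looks almost the same; the main difference is that every worker needs a TF_CONFIG environment variable describing the cluster (the addresses and the index below are made up for illustration):

    import json, os
    import tensorflow as tf

    # each worker sets its own "index"; the host:port values are examples only
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
        "task": {"type": "worker", "index": 0},
    })

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")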
This is just a simple explanation of those 3 items you mentioned here.
I am surprised that I could not find a concise way to run a tf.data pipeline on the GPU. I understand that data pipelines can run on the CPU so that the work happens in parallel (with prefetching), allowing the GPU to run the actual model and train it.
However, my pre-processing is extremely parallel and computationally intensive. While I can technically write the pre-processing as the first layer of my model, I would really prefer not to do this, to prevent training-data leakage into my model.
Any pointers are appreciated. The closest I found was https://towardsdatascience.com/overcoming-data-preprocessing-bottlenecks-with-tensorflow-data-service-nvidia-dali-and-other-d6321917f851, which involves using the NVIDIA DALI framework.
Here are some critical points:
I have already tried enforcing device placement with tf.device('...').
I don't want to prefetch the data onto the device, but rather run the whole data pipeline on the GPU.
Preferably, if the computation is heavy enough, I want to save the processed dataset (e.g. as TFRecords) so that I can load it directly. This can be done with tf.data.experimental.save for now, but again it uses the CPU!
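For reference, a minimal sketch of the save/load path mentioned in the last point (which, as noted, still runs on the CPU); the pipeline and path are stand-ins:

    import tensorflow as tf

    ds = tf.data.Dataset.range(10).map(lambda x: x * 2)  # stand-in for a heavy pipeline
    tf.data.experimental.save(ds, "/tmp/my_dataset")     # materializes the dataset on the CPU

    restored = tf.data.experimental.load(
        "/tmp/my_dataset", element_spec=ds.element_spec)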
I'm training a Mask R-CNN network, which is built on TensorFlow and Keras. I'm searching for a way to reduce training time, so I thought of implementing it with TensorFlow's distributed training.
I've been working with Mask R-CNN for some time, but it seems that what I'm trying to do will require me to modify the Mask R-CNN source code, which is beyond my current skills.
So, my question is: has someone done this, or something similar? Is it possible at all, or am I misunderstanding the use of distributed TensorFlow?
Thanks in advance.
Even without distributed training, the tensorpack implementation of Mask R-CNN in TensorFlow runs 5x faster (and is more accurate as well) than the one you linked to.
It also supports distributed training with MPI.
I recently built my first TensorFlow model (converted from hand-coded Python). I'm using tensorflow-gpu, but I only want to use the GPU for backprop during training. For everything else I want to use the CPU. I've seen this article showing how to force CPU use on a system that uses the GPU by default. However, you have to specify every single operation where you want to force CPU use. Instead I'd like to do the opposite: default to CPU use, but specify the GPU just for the backprop I do during training. Is there a way to do that?
Update
Looks like things are just going to run slower in TensorFlow because of how my model and scenario are built at present. I tried a different environment that uses regular (non-GPU) TensorFlow, and it still runs significantly slower than the hand-coded Python. The reason, I suspect, is that it's a reinforcement-learning model that plays checkers (see below) and makes a single forward-prop "prediction" at a time as it plays against a computer opponent. At the time I designed the architecture, that made sense. But it's not very efficient to do predictions one at a time, and less so with whatever overhead TensorFlow adds.
So, now I'm thinking that I'm going to need to change the game playing architecture to play, say, a thousand games simultaneously and run a thousand forward prop moves in a batch. But, man, changing the architecture now is going to be tricky at best.
TensorFlow lets you control device placement with the tf.device context manager.
So, for example, to run some code on the CPU:
    with tf.device('/cpu:0'):
        # your code goes here; ops created in this block are placed on the CPU
Similarly, use tf.device('/gpu:0') to force GPU usage.
Instead of always running your forward pass on the CPU, though, you're better off building two graphs: a forward-only, CPU-only graph to use when rolling out the policy, and a GPU-only forward-and-backward graph to use when training.
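A rough sketch of that split, written in TF 2.x eager style rather than the original graph API (the model, loss, and device strings are placeholders, not the asker's checkers network):

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()

    def rollout_predict(x):
        # forward-only path pinned to the CPU, used while playing games
        with tf.device("/cpu:0"):
            return model(x, training=False)

    @tf.function
    def train_step(x, y):
        # forward-and-backward path pinned to the GPU, used during training
        with tf.device("/gpu:0"):
            with tf.GradientTape() as tape:
                loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss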
I have trouble understanding where TensorFlow places my variables. In my case, I have a large word-embedding matrix that is used by an RNN to generate text. During text generation there is a lookup into the embeddings, and I know this lookup needs to be executed on the CPU because the GPU doesn't support it.
I want to extend my system to do calculations with the large embedding matrix, but this operation is very slow. I think it is also being executed on the CPU, although the calculation could be done on the GPU. When I watch the tool GPU-Z during the calculation, I can see that most of the time the bus-interface load is very high (>60%) and the GPU load is very low (<10%).
It is very difficult to post a minimal example; I hope the problem is clear. I don't know how to debug this. Are the embeddings automatically placed on the CPU because of the lookup operation? Do you have any idea how to overcome this problem?
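One way to see where TensorFlow actually places the embedding and the lookup is to log device placement; a minimal sketch with the TF 2.x API (in TF 1.x the equivalent is ConfigProto(log_device_placement=True)), using a toy embedding matrix standing in for the real one:

    import tensorflow as tf

    tf.debugging.set_log_device_placement(True)  # prints the device chosen for every op

    embeddings = tf.Variable(tf.random.uniform([50000, 256]))  # toy embedding matrix
    ids = tf.constant([1, 5, 42])
    looked_up = tf.nn.embedding_lookup(embeddings, ids)        # placement is logged to stderr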
I am reading this performance guide on best practices for optimizing TensorFlow code for the GPU. One suggestion is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census sample provided here.
The article suggests wrapping the preprocessing operations in with tf.device('/cpu:0'). However, when I look at the custom Estimator, the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS). If I wrapped these two lines in with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example? (A sketch of what I mean follows these two points.)
Line 282/294: there are also a generate_input_fn and a parse_csv function that are used to set up the input data queues. Would it be necessary to place with tf.device('/cpu:0') inside these functions as well, or would that effectively be covered by having inputs and label_values already wrapped?
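For concreteness, this is roughly what the wrapping in the first option would look like inside the model function; a sketch only, with placeholder values for transformed_columns and LABELS (TF 1.x feature-column API, matching the census sample's Estimator code):

    import tensorflow as tf

    LABELS = ['<=50K', '>50K']                                       # placeholder label vocabulary
    transformed_columns = [tf.feature_column.numeric_column('age')]  # placeholder feature columns

    def model_fn(features, labels, mode):
        with tf.device('/cpu:0'):
            # the two "preprocessing" lines from the question, pinned to the CPU
            inputs = tf.feature_column.input_layer(features, transformed_columns)
            label_values = tf.constant(LABELS)
        # ... the rest of the graph is built outside this block and can be placed on the GPU ...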
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing in the CPU block, the GPU would automatically be used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment: would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU setup above would be needed if you train this way. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two pieces of code you listed are actually two different parts of training. Line 282/294 is, in my opinion, the 'pre-processing' part: it parses the raw input data into tensors, and these operations are not suitable for GPU acceleration, so it is sufficient to allocate them on the CPU.
Line 152/153 is part of the model itself, since it transforms the raw features into different types of features.
'cpu:0' means the operations in that section will be placed on the CPU, but it does not bind them to a specific core. Operations placed on the CPU run multi-threaded and will use multiple cores.
If the machine you run on has GPUs, TensorFlow will prefer to place operations on the GPU whenever the device is not specified explicitly.
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.