I don't have a GPU on my machine. Since most of the performance recommendations for TensorFlow only mention GPUs, can someone confirm that e.g.
tf.data.Dataset.prefetch
tf.distribute.MirroredStrategy
tf.distribute.MultiWorkerMirroredStrategy
will only bring a benefit with multiple GPUs?
I tried them on my PC and most of these functions actually slow the process down instead of speeding it up. So is there no benefit from multiple CPU cores here?
In case you haven't solved your problem yet: you can use Google Colab (https://colab.research.google.com) to get a GPU - there you can change the runtime type to GPU or TPU.
I did not understand exactly what you are asking, but let me give you a 10,000 ft explanation of those three. It might help you understand what they do and when you should use them.
tf.data.Dataset.prefetch: suppose you have two steps while training your model: a) read data, b) process the data. While the model is processing the current batch, you could already be reading more data to make sure it is available as soon as training is done with the current batch. Think of a producer/consumer model: you don't want your consumer sitting idle while you are producing more data.
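For illustration, here is a minimal sketch of a prefetched input pipeline (the in-memory numbers and the toy map function are just placeholders):

import tensorflow as tf

# Toy pipeline: map runs in parallel, and prefetch overlaps data
# preparation (the producer) with training (the consumer).
dataset = (tf.data.Dataset.from_tensor_slices(list(range(1000)))
           .map(lambda x: tf.cast(x, tf.float32) / 255.0,
                num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))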
tf.distribute.MirroredStrategy: this one helps if you have a single machine with more than one GPU. It allows you to train a model in "parallel" on the same machine.
tf.distribute.MultiWorkerMirroredStrategy: now suppose you have a cluster of 5 machines. You could train your model in a distributed fashion across all of them.
This is just a simple explanation of those 3 items you mentioned here.
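As a rough, hedged sketch of how the strategy scope is used (toy model, not your setup):

import tensorflow as tf

# MirroredStrategy replicates the model across the local GPUs;
# MultiWorkerMirroredStrategy does the same across several machines.
# On a CPU-only machine both fall back to a single device, so there is no speedup.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) now splits each batch across the available devices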
I have millions of images to run inference on. I know how to write my own code to create batches and forward them to a trained network using the MXNet Module API in order to get the predictions. However, creating the batches leads to a lot of data manipulation that is not especially optimized.
Before doing any optimisation myself, I would like to know if there are recommended approaches for batch prediction/inference. More specifically, since this is a common use case, I was wondering whether there is an interface/API that can do the usual image pre-processing, batch creation, and inference given a trained model (i.e. symbol file & epoch checkpoint)?
If you are using a standard pretrained model, I would highly recommend taking a look at the GluonCV project - a toolkit for computer vision based on Apache MXNet.
They have really nice implementations of state-of-the-art models, sometimes even beating the results published in the original scientific papers. What is also nice is that they provide the data preprocessing code - as far as I understand, this is what you are looking for (see the gluoncv.data.transforms.presets package).
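As a rough sketch (the model name, image path and context here are my own placeholder choices, so adapt them to your task):

import mxnet as mx
from gluoncv import model_zoo
from gluoncv.data.transforms.presets.imagenet import transform_eval

ctx = mx.cpu()  # or mx.gpu(0)
net = model_zoo.get_model('resnet50_v1', pretrained=True, ctx=ctx)

img = mx.image.imread('some_image.jpg')
x = transform_eval(img).as_in_context(ctx)   # resize, crop, normalize, add batch dim
probs = mx.nd.softmax(net(x))[0]
top5 = probs.topk(k=5)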
I don't know which kind of inference you want to do (image classification, segmentation, etc.), but take a look at the list of tutorials and most probably you will find the one you need.
Other than that, optimizing for wall-clock time requires making sure that your GPU is 100% utilized. You may find it useful to watch this video to learn more tips and tricks on optimizing performance. It discusses training, but the same techniques apply to inference.
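For the "keep the GPU busy" part, here is a hedged sketch of batched inference with a Gluon DataLoader; the folder path, batch size and worker count are illustrative assumptions:

import mxnet as mx
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageFolderDataset, transforms
from gluoncv import model_zoo

ctx = mx.gpu(0) if mx.context.num_gpus() else mx.cpu()
net = model_zoo.get_model('resnet50_v1', pretrained=True, ctx=ctx)

# Standard ImageNet-style preprocessing, applied on the fly by worker processes.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
dataset = ImageFolderDataset('/path/to/images').transform_first(transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

for batch, _ in loader:
    probs = mx.nd.softmax(net(batch.as_in_context(ctx)))
    # ... collect predictions here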
While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.
I googled and found out that this is a common issue when using a GPU to train. Here is a very good/detailed example with short code snippets to verify the existence of that problem.
They pinpointed the non-determinism to the "tf.reduce_sum" function. However, that is not the case for me. It could be because I'm using different hardware (1080 Ti) or a different version of the CUDA libraries or TensorFlow. It seems like there are many different parts of the CUDA libraries that can be non-deterministic, and it doesn't seem easy to figure out exactly which part and how to get rid of it. Also, this must be by design, so it's likely that there is a sufficient efficiency increase in exchange for the non-determinism.
So, my question is:
Since GPUs are popular for training NNs, people in this field must have a way to deal with non-determinism, because I can't see how else you'd be able to reliably tune the hyperparameters. What is the standard way to handle non-determinism when using a GPU?
TL;DR
Non-determinism for a priori deterministic operations comes from concurrent (multi-threaded) implementations.
Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar for the other major toolkits.
During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.
That, but much longer
When you see neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy - everything here is a mathematical equation and should be deterministic. Even pseudo-random operations such as shuffling, dropout, noise and the like are entirely determined by a seed.
When you see those operations from the point of view of their computational implementation, on the other hand, you see them as massively parallelized computations, which can be a source of randomness unless you are very careful.
The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will finish first. This does not matter when threads operate on their own data - applying an activation function to a tensor, for example, remains deterministic. But when those threads need to synchronize, such as when you compute a sum, then the result may depend on the order of the summation, and in turn on which thread finished first.
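A quick illustration of why the order matters - floating-point addition is simply not associative:

a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 - the 1.0 is absorbed by the huge value first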
From there, you have broadly speaking two options:
Accept the non-determinism that comes with the simpler implementation.
Take extra care in the design of your parallel algorithm to reduce or remove non-determinism from your computation. The added constraint usually results in slower algorithms.
Which route does cuDNN take? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations, and more importantly, it used to not offer some operations, such as reductions, that people needed to implement themselves in CUDA with a varying degree of attention to determinism.
Some libraries such as Theano were ahead on this topic, exposing early on a deterministic flag that the user could turn on or off - but as you can see from its description (quoted below), it is far from offering any guarantee.
"If more, sometimes we will select some implementations that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementation, e.g. when we do not have a GPU implementation that is deterministic. Also, see the dnn.conv.algo* flags to cover more cases."
In TensorFlow, the realization of the need for determinism came rather late, but it is slowly getting there - helped by the advances of cuDNN on that front as well. For a long time, reductions were non-deterministic, but now they seem to be deterministic. The fact that cuDNN introduced deterministic reductions in version 6.0 may have helped, of course.
It seems that currently, the main obstacle for TensorFlow on the way to determinism is the backward pass of the convolution. It is indeed one of the few operations for which cuDNN offers a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow, and since the choice of algorithm seems to be based on performance, it could indeed be picked if it is more efficient. (I am not so familiar with TensorFlow's C++ code, so take this with a grain of salt.)
Is this important?
If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.
Deployment is another side of things where deterministic behavior is often desirable, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward if a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)
Those reasons are rightful motivations to fix non-determinism in neural networks.
For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural net training. For all practical purposes, training is stochastic: we use stochastic gradient descent, shuffle data, use random initialization and dropout - and more importantly, the training data is itself but a random sample of data. From that standpoint, the fact that computers can only generate pseudo-random numbers with a seed is an artifact. When you train, your loss is a value that also comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring those confidence intervals does not make much sense - therefore it is vain, in my opinion, to spend too much effort fixing the non-determinism in that, and many other, cases.
Starting from TF 2.9 (TF >= 2.9), if you want your TF models to run deterministically, add the following lines at the beginning of the program.
import tensorflow as tf
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
Important note: the first line sets the random seed for Python, NumPy and TensorFlow; the second line makes each TensorFlow operation deterministic.
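As a small sanity check, here is a hedged sketch with a toy model; on a given hardware/software stack it should print True:

import numpy as np
import tensorflow as tf

def run_once():
    tf.keras.utils.set_random_seed(1)
    tf.config.experimental.enable_op_determinism()
    model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    x = np.random.RandomState(0).rand(256, 5).astype("float32")
    y = x.sum(axis=1, keepdims=True)
    hist = model.fit(x, y, epochs=2, batch_size=32, verbose=0)
    return hist.history["loss"][-1]

# Two runs should now produce bit-identical losses.
print(run_once() == run_once())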
To get an MNIST network (https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py) to train deterministically on my GPU (1050 Ti):
Set PYTHONHASHSEED='SOMESEED'. I do this before starting the Python kernel.
Set seeds for the random generators (not sure all are needed for MNIST):
python_random.seed(42)   # the built-in random module, imported as python_random
np.random.seed(42)
tf.set_random_seed(42)   # TF1-style API; in TF2 use tf.random.set_seed
Make TF select deterministic GPU algorithms
Either:
import tensorflow as tf
from tfdeterminism import patch
patch()
Or:
import os
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
import tensorflow as tf
Note that the resulting loss is repeatable with either method for selecting deterministic algorithms from TF, but the two methods result in different losses. Also, the solution above doesn't make a more complicated model I'm using repeatable.
Check out https://github.com/NVIDIA/framework-determinism for a more current answer.
A side note:
For cuDNN 8.0.1, non-deterministic algorithms exist for (from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html):
cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
cudnnPoolingBackward when CUDNN_POOLING_MAX is used
cudnnSpatialTfSamplerBackward
cudnnCTCLoss and cudnnCTCLoss_v8 when CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC is used
I am using the Estimator API in TensorFlow (1.8) with Python 3.6 to build the neural network for my reinforcement learning project. I noticed that every time you call estimator.predict(), TensorFlow loads the checkpoint under model_dir. This is extremely inefficient if you have to call this function many times with the same checkpoint; e.g. in reinforcement learning I may need to predict the next action based on the current state, and the next state is only realized after a specific action is chosen, so it is commonplace to call this function thousands of times.
So my question is: how can I call this function without reloading the (same) checkpoint every time?
Thank you.
Well, I think I've just found a good answer to my own question. A good solution to this problem is to construct a tf.data.Dataset from a generator. Link is here.
The generator keeps your estimator.predict call open, so you don't need to keep reloading the checkpoint. The only thing you need to do is change the object yielded by this fast-predict object (self.next_feature in this case) when necessary, as in the sketch below.
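For reference, here is a hedged sketch of that pattern; the class and attribute names (FastPredict, next_features) and the feature shape are illustrative, not the linked answer's exact code:

import tensorflow as tf

class FastPredict:
    def __init__(self, estimator, input_fn_builder):
        self.estimator = estimator
        self.input_fn_builder = input_fn_builder
        self.next_features = None
        self.predictions = None

    def _generator(self):
        # Never-ending generator: yields whatever was stored last.
        while True:
            yield self.next_features

    def predict(self, features):
        self.next_features = features
        if self.predictions is None:
            # The checkpoint is loaded once here; afterwards the estimator
            # keeps pulling examples from the open generator.
            self.predictions = self.estimator.predict(
                self.input_fn_builder(self._generator))
        return next(self.predictions)

def input_fn_builder(generator):
    # Shapes/dtypes are placeholders; match them to your model_fn.
    def input_fn():
        ds = tf.data.Dataset.from_generator(
            generator, output_types=tf.float32, output_shapes=(5,))
        return ds.batch(1)
    return input_fn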
However, I should mention that if your ultimate goal is to turn the whole thing into a service, you may want something like TensorFlow Serving (tf.serving), so I suggest you go that way directly. I wasted a lot of time in the process, so I hope this answer helps you save yours.
I need to fit a deep neural network to data coming from a data-generating process, think of an AR(5). So I have five features per observation and one y, for some large number N of observations, in each simulation. I am interested only in the root mean squared error of the best-performing DNN in each simulation.
Since it is a simulation setting, I have to run a large number of these simulations and, within each simulation, fit a neural network to the data. The only reasonable way I can think of doing this is to fit the DNN via hyper-parameter optimisation within each simulation (dlib's find_min_global will be my optimiser).
Does it make sense to do this exercise in C++ (slow development because I am not proficient) or Python (faster iteration because I am fairly proficient)?
From where I am sitting, C++ or Python might not make much of a difference in execution time, because the model has to be compiled each time the optimiser proposes a new hyper-parameter vector (am I wrong here?).
If it is possible to compile once and then test all hyper-parameters between the lower and upper bounds, then C++ would be my go-to solution (is this possible in any of the open-source DNN libraries?).
If anyone has done this exercise before, please advise.
Thank you all for your help.
Looking at your problem, one way to implement this is to use a genetic/evolutionary algorithm. If I understood your problem correctly, you want to sweep through the hyper-parameters to find the best solution.
So I would recommend using Python for this; TensorFlow and Keras both support this kind of sweep, so it should not be a problem.
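To make that concrete, here is a hedged sketch of a tiny evolutionary sweep over two hyper-parameters (layer width and learning rate); the data, search ranges and mutation scheme are my own illustrative assumptions, not part of your setup:

import random
import numpy as np
import tensorflow as tf

rng = np.random.RandomState(0)
X = rng.rand(1000, 5).astype("float32")       # stand-in for the AR(5) features
y = (X @ rng.rand(5, 1)).astype("float32")    # stand-in target

def rmse_for(params):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(params["units"], activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(params["lr"]), loss="mse")
    model.fit(X[:800], y[:800], epochs=5, verbose=0)
    return float(np.sqrt(model.evaluate(X[800:], y[800:], verbose=0)))

def mutate(params):
    return {"units": max(4, params["units"] + random.choice([-8, 0, 8])),
            "lr": params["lr"] * random.choice([0.5, 1.0, 2.0])}

population = [{"units": random.choice([16, 32, 64]),
               "lr": 10 ** random.uniform(-4, -2)} for _ in range(4)]
for generation in range(3):
    best = sorted(population, key=rmse_for)[0]
    # Keep the best individual and refill the population with its mutants.
    population = [best] + [mutate(best) for _ in range(3)]
print("best params:", best)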
Note - If I understood your question differently, then please feel free to correct me.