I'm trying to train a model that contains two sub-models on two GPUs simultaneously, and someone told me I should take a look at multiprocessing (refer to this).
As far as I know, parallelism comes in two flavors, data parallelism and model parallelism, and my case looks more like model parallelism. Pipelining is usually used together with it to reduce the time wasted transferring data between the different model parts; this can be implemented with the torch.distributed.pipeline.sync package.
On the other hand, due to Python's GIL, if we really want to execute multiple threads at the same time (in my case, training two models), one way is to use torch.multiprocessing, which is based on Python's multiprocessing package and spawns new interpreters to work around the GIL's limitation.
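For concreteness, here is a minimal sketch of what I imagine, assuming two independent sub-models (the SubModelA/SubModelB classes and the training loop are hypothetical placeholders), with one process per GPU via torch.multiprocessing:

import torch
import torch.multiprocessing as mp
import torch.nn as nn

# Hypothetical stand-ins for my two sub-models.
class SubModelA(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)
    def forward(self, x):
        return self.fc(x)

class SubModelB(SubModelA):
    pass

def train_one(rank, model_classes):
    # Each spawned process has its own interpreter, so the GIL
    # does not serialize the two training loops.
    device = torch.device(f"cuda:{rank}")
    model = model_classes[rank]().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 10, device=device)  # dummy batch
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    # rank 0 -> cuda:0, rank 1 -> cuda:1
    mp.spawn(train_one, args=((SubModelA, SubModelB),), nprocs=2, join=True)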
My questions are below:
What is the difference between parallelism and multiprocessing?
In my case which one (or which package) should I use?
I found two tutorials in PyTorch's documentation related to parallelism
(SINGLE-MACHINE MODEL PARALLEL BEST PRACTICES and TRAINING TRANSFORMER MODELS USING PIPELINE PARALLELISM), but I suspect the former is not actually doing "parallelism": it just splits the model, and only one part is computing at any given time.
Thanks to anyone who takes a look at my question.
Related
I often use GridSearchCV for hyperparameter tuning, for example for tuning the regularization parameter C in Logistic Regression. Whenever an estimator I am using has its own n_jobs parameter, I am confused about where to set it: in the estimator, in GridSearchCV, or in both? The same applies to cross_validate.
This is a very interesting question. I don't have a definitive answer, but here are some elements worth mentioning to understand the issue that don't fit in a comment.
Let's start with why you should or should not use multiprocessing:
Multiprocessing is useful for independent tasks. This is the case in a GridSearch, where all your different variations of your models are independent.
Multiprocessing is not useful / makes things slower when:
Tasks are too small: creating a new process takes time, and if your tasks are really small, this overhead will slow the execution of the whole code.
Too many processes are spawned: your computer has a limited number of cores. If you have more processes than cores, a load-balancing mechanism will force the computer to regularly switch between the running processes. These switches take some time, resulting in slower execution.
The first takeaway is that you should not set n_jobs in both the GridSearch and the model you're optimizing, because you will spawn a lot of processes and end up slowing down the execution.
Now, a lot of sklearn models and functions are based on NumPy/SciPy, which in turn are usually implemented in C/Fortran and thus may already exploit parallelism internally. That means these should not be used with n_jobs>1 set in the GridSearch.
Assuming your model is not already parallelized, you can choose to set n_jobs at the model level or at the GridSearch level. A few models can be fully parallelized (RandomForest, for instance), but most have at least some sequential part (Boosting, for instance). At the other end, GridSearch has no sequential component by design, so it would make sense to set n_jobs in the GridSearch rather than in the model.
That being said, it depends on the implementation of the model, and you can't have a definitive answer without testing your own case. For example, if your pipeline consumes a lot of memory for some reason, setting n_jobs in the GridSearch may cause memory issues.
As a complement, here is a very interesting note on parallelism in sklearn.
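To illustrate the takeaway above, here is a minimal sketch (toy data and an illustrative parameter grid) that keeps the estimator sequential and puts all the parallelism in the GridSearch:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# The estimator stays sequential; GridSearchCV spawns the worker processes.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),    # no n_jobs set here
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    n_jobs=-1,                            # one worker per core
)
search.fit(X, y)
print(search.best_params_)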
I'm trying to train a very simple neural network to classify samples of data where some classes necessarily succeed others; this is why I decided to feed the input data to the network in batches. Using TensorFlow, there are apparently multiple ways of declaring batches, like tf.data.Dataset.batch (with which I currently train using the Adam optimizer) and tf.train.batch. What is the difference? Should the methods be used together, or are they exclusive? In the latter case: which one should I prefer?
tf.train.* is an older API, more complex and error-prone than the tf.data.* one (you need to take care of queues, thread runners, the coordinator, etc. yourself). For your stated purpose (batching data and feeding it to a model), the two are functionally equivalent, as both achieve your goal. However, you should consider using tf.data, as it is both simpler to use and the currently recommended way to handle input datasets.
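For example, a minimal tf.data sketch of the batching part, in the spirit of the TF 1.x API the question uses (the NumPy arrays are dummy stand-ins for your real samples and labels):

import numpy as np
import tensorflow as tf

features = np.random.rand(100, 4).astype(np.float32)  # dummy samples
labels = np.random.randint(0, 3, size=100)            # dummy class labels

# Build the input pipeline: shuffle, batch, and repeat over epochs.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=100).batch(32).repeat()
next_x, next_y = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    batch_x, batch_y = sess.run([next_x, next_y])  # one batch per run call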
I am reading this performance guide on best practices for optimizing TensorFlow code for the GPU. One suggestion is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests wrapping the preprocessing operations in with tf.device('/cpu:0'). However, when I look at the custom estimator, the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS). If I wrapped these two lines in with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294: there are also a generate_input_fn and a parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well, or would that basically be implied by having inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the CPU, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment. Would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU, and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker independently iterates through the data (passing gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU setup above would be needed. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two code snippets you referenced are actually two different parts of the training. Lines 282/294 are, in my opinion, the so-called "preprocessing" part, since they parse the raw input data into tensors. These operations are not suitable for GPU acceleration, so it is sufficient to allocate them on the CPU.
Lines 152/153 are part of the training model, since they process the raw features into different types of features.
'cpu:0' means the operations in this section will be allocated on the CPU, but not bound to a specific core. Operations allocated on the CPU run multi-threaded and can use multiple cores.
If your machine has GPUs, TensorFlow will prefer to allocate operations on the GPUs whenever the device is not specified.
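As a rough sketch of this placement (the CSV path, column defaults, and the parse_csv stub are illustrative; in the census sample you would use its own parse_csv):

import tensorflow as tf

def parse_csv(line):
    # Illustrative stand-in for the sample's parse_csv.
    columns = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0]])
    return {'x1': columns[0], 'x2': columns[1]}, columns[2]

def input_fn():
    # Pin the parsing/preprocessing ops to the CPU, as the guide suggests.
    with tf.device('/cpu:0'):
        dataset = tf.data.TextLineDataset('train.csv')  # illustrative path
        dataset = dataset.map(parse_csv)
        dataset = dataset.batch(64)
        features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels

# Model ops built without an explicit device are placed on a GPU
# automatically when one is available.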
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.
It is easy to use two threads where one keeps feeding data into a queue and the other consumes data from the queue and performs the computation. Since TensorFlow recommends Dataset as the input pipeline after 1.2.0, I would like to use the Dataset and its iterator to accomplish the task above, namely:
There are two processes, one feeds and the other consumes;
The pipeline suspends when it is either full or empty, and it stops when the consuming computation finishes.
P.S. Why does TensorFlow use threads instead of processes in the Threading and Queues tutorial?
Thank you in advance.
Distributed tf.contrib.data pipelines are not yet supported as of TensorFlow 1.3. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.
In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows:
q = tf.FIFOQueue(...)  # queue arguments (capacity, dtypes) elided here
# Define a dummy dataset that contains the same value repeated indefinitely.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
# Replace each dummy element with a value dequeued from the queue.
dataset_from_queue = dummy.map(lambda _: q.dequeue())
You can then compose other Dataset transformations with dataset_from_queue.
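For example, a rough end-to-end sketch (the capacity, dtype, and toy producer loop are illustrative) where one thread feeds the queue and the main thread consumes through the dataset:

import threading
import tensorflow as tf

q = tf.FIFOQueue(capacity=32, dtypes=tf.int32)  # illustrative capacity/dtype
item = tf.placeholder(tf.int32, shape=[])
enqueue_op = q.enqueue(item)

dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
dataset_from_queue = dummy.map(lambda _: q.dequeue())
next_element = dataset_from_queue.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    def producer():
        for i in range(100):  # the feeding side
            sess.run(enqueue_op, feed_dict={item: i})
    threading.Thread(target=producer, daemon=True).start()
    for _ in range(100):      # the consuming side
        print(sess.run(next_element))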
OK, it's so easy in Torch ML ;) and I am following the indico example on threading to load the data: https://indico.io/blog/tensorflow-data-input-part2-extensions/
So far, I have found three ways, none of which I like, and I am sure there is a better one.
1) Train and evaluate/validate in two different applications/runs: tensorflow/models/image/cifar10/cifar10_train.py and cifar10_eval.py
I don't like this one because I will waste resources, i.e. the GPUs where cifar10_eval.py runs. I could do both from one file or application, but I don't want to save a checkpoint if the model is not the best-performing one!
2) Create a validation model with weight sharing: tensorflow/models/image/mnist/convolutional.py
Much better, but I don't like the fact that I need to keep track of all the model parameters. I am sure there is a better way to share parameters in TensorFlow, i.e. can I just copy the model and say the parameters are shared but the input feeds are different?
3) The one I am currently using is tf.placeholder
But I can't do threading things, i.e. tf.RandomShuffleQueue, with this approach. Maybe I just don't know how to do it this way.
So, how could I use threading to load the training data and do one epoch of training, then use these weights and again use threading to load the validation data and get the model's performance?
Basically, I am asking how to use multiple threads to load the train and validation data and save the best-performing model. An EXACTLY similar example is the ImageNet multi-GPU training in Torch: https://github.com/soumith/imagenet-multiGPU.torch
Thank you so much!
The variable-sharing approach is probably the easiest way to do what you want.
Take a look at the "Sharing Variables" tutorial; by using tf.variable_scope() and tf.get_variable() you can reuse variables without having to manage the sharing explicitly. You can define the model in a function, call it with different arguments, and share the model variables between the two calls.
There are also convenience layers that wrap TensorFlow's variable management. One option is TensorFlow-Slim, which makes it easier to define some classes of models (especially convolutional models).
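A rough sketch of the variable-sharing pattern (the tiny linear model is a placeholder for your real network):

import tensorflow as tf

def build_model(inputs):
    # tf.get_variable creates 'w'/'b' on the first call and reuses
    # them when the enclosing scope has reuse=True.
    w = tf.get_variable('w', shape=[784, 10])
    b = tf.get_variable('b', shape=[10])
    return tf.matmul(inputs, w) + b

train_inputs = tf.placeholder(tf.float32, [None, 784])
valid_inputs = tf.placeholder(tf.float32, [None, 784])

with tf.variable_scope('model'):
    train_logits = build_model(train_inputs)
with tf.variable_scope('model', reuse=True):
    valid_logits = build_model(valid_inputs)  # same weights, different feed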