I often use GridSearchCV for hyperparameter tuning, for example for tuning the regularization parameter C in Logistic Regression. Whenever the estimator I am using has its own n_jobs parameter, I am confused about where to set it: in the estimator, in GridSearchCV, or in both? The same question applies to cross_validate.
This is a very interesting question. I don't have a definitive answer, but here are some elements worth mentioning to understand the issue, which don't fit in a comment.
Let's start with why you should or should not use multiprocessing:
Multiprocessing is useful for independent tasks. This is the case in a GridSearch, where all the different variations of your model are independent.
Multiprocessing is not useful / makes things slower when:
Tasks are too small: creating a new process takes time, and if your tasks are really small, this overhead will slow down the execution of the whole code.
Too many processes are spawned: your computer has a limited number of cores. If you have more processes than cores, a load-balancing mechanism will force the computer to regularly switch between the running processes. These switches take some time, resulting in slower execution.
The first takeaway is that you should not set n_jobs in both GridSearch and the model you're optimizing, because you would spawn far more processes than you have cores and end up slowing down the execution.
Now, a lot of sklearn models and functions are based on NumPy/SciPy, which in turn are usually implemented in C/Fortran and may already use multiple cores internally (for example through multithreaded linear-algebra routines). Such models should not be combined with n_jobs>1 in the GridSearch.
If your model is not already parallelized, you can choose to set n_jobs at the model level or at the GridSearch level. A few models can be fully parallelized (RandomForest, for instance), but most have at least some sequential part (Boosting, for instance). At the other end, GridSearch has no sequential component by design, so it makes sense to set n_jobs in GridSearch rather than in the model.
That being said, it depends on the implementation of the model, and you can't have a definitive answer without testing your own case. For example, if your pipeline consumes a lot of memory for some reason, setting n_jobs in the GridSearch may cause memory issues.
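To make this concrete, here is a minimal sketch of that recommendation (the dataset and parameter grid are placeholders I made up): n_jobs is set only in GridSearchCV and the estimator is left sequential.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Parallelism only at the search level; the estimator itself stays sequential.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),        # no n_jobs here
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)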
As a complement, here is a very interesting note on parallelism in sklearn
In the context of model selection for a classification problem, while running cross-validation, is it OK to specify n_jobs=-1 both in the model specification and in the cross-validation function in order to take full advantage of the power of the machine?
For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:
RF_model = RandomForestClassifier( ..., n_jobs=-1)
XGB_model = XGBClassifier( ..., n_jobs=-1)
RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)
Is it OK to specify the parameter in both? Or should I specify it only once? And in which of them, the model or the cross-validation call?
I used models from two different libraries (sklearn and xgboost) for the example because there may be a difference in how they behave; note that the cross_validate function itself is from sklearn.
Specifying n_jobs twice does have an effect, though whether it has a positive or negative effect is complicated.
When you specify n_jobs twice, you get two levels of parallelism. Imagine you have N cores. The cross-validation function creates N copies of your model, and each copy creates N threads to run fitting and prediction. You then have N*N threads.
This can blow up pretty spectacularly. I once worked on a program which needed to apply ARIMA to tens of thousands of time-series. Since each ARIMA is independent, I parallelized it and ran one ARIMA on each core of a 12-core CPU. I ran this, and it performed very poorly. I opened up htop, and was surprised to find 144 threads running. It turned out that this library, pmdarima, internally parallelized ARIMA operations. (It doesn't parallelize them well, but it does try.) I got a massive speedup just by turning off this inner layer of parallelism. Having two levels of parallelism is not necessarily better than having one.
For your specific case, I benchmarked a random forest with cross-validation in four configurations:
No parallelism
Parallelize across different CV folds, but no model parallelism
Parallelize within the model, but not on CV folds
Do both
(Error bars represent 95% confidence interval. All tests used RandomForestClassifier. Test was performed using cv=5, 100K samples, and 100 trees. Test system had 4 cores with SMT disabled. Scores are mean duration of 7 runs.)
This graph shows that no parallelism is the slowest, CV parallelism is third fastest, and model parallelism and combined parallelism are tied for first place.
However, this is closely tied to which classifiers you're using - a benchmark of pmdarima, for example, would find that cross-val parallelism is faster than model parallelism or combined parallelism. If you don't know which one is faster, then test it.
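A sketch of how such a benchmark can be set up (the dataset generation and timing harness here are illustrative, not the exact script behind the numbers above):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# (model n_jobs, cross_validate n_jobs) for the four configurations
configs = {
    "no parallelism":       (1, 1),
    "CV parallelism":       (1, -1),
    "model parallelism":    (-1, 1),
    "combined parallelism": (-1, -1),
}

for name, (model_jobs, cv_jobs) in configs.items():
    clf = RandomForestClassifier(n_estimators=100, n_jobs=model_jobs, random_state=0)
    start = time.perf_counter()
    cross_validate(clf, X, y, cv=5, n_jobs=cv_jobs)
    print(f"{name}: {time.perf_counter() - start:.1f}s")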
I'm trying to train a model containing two sub-models on two GPUs simultaneously, and someone told me I should take a look at multiprocessing (refer to this).
As I understand it, parallelism covers data parallelism and model parallelism; my case looks more like model parallelism, and pipelining is usually used alongside it to reduce the overhead of transferring data between the sub-models. This can be implemented with the torch.distributed.pipeline.sync package.
On the other hand, because of the GIL in Python, if we really want to execute multiple threads of work (in my case, training two models) at the same time, one way is to use torch.multiprocessing, which is based on Python's multiprocessing package and spawns new interpreters to get around the GIL limitation (a rough sketch of what I have in mind is below).
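Something like this is what I imagine for the torch.multiprocessing route (the sub-models and data are placeholders, and mp.spawn assigns one process per model):

import torch
import torch.multiprocessing as mp

def train(gpu_id: int):
    # Each process trains one placeholder sub-model on its own device.
    device = torch.device(f"cuda:{gpu_id}") if torch.cuda.device_count() > gpu_id else torch.device("cpu")
    model = torch.nn.Linear(10, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(64, 10, device=device)
    y = torch.randn(64, 1, device=device)
    for _ in range(100):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    print(f"process {gpu_id}: final loss {loss.item():.4f}")

if __name__ == "__main__":
    mp.spawn(train, nprocs=2)  # one process per sub-model / GPU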
My questions are:
What is the difference between parallelism and multiprocessing?
In my case, which one (or which package) should I use?
I found two tutorials in PyTorch's documentation related to parallelism
(SINGLE-MACHINE MODEL PARALLEL BEST PRACTICES and TRAINING TRANSFORMER MODELS USING PIPELINE PARALLELISM), but I wonder whether the former is really doing "parallelism" at all: it just splits the model, and only one part of the model is computing at any given time.
Thanks to anyone who takes a look at my question.
I can't understand how n_jobs works:
import sklearn.cluster, sklearn.datasets  # imports added for completeness
data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)
runs in less than 1 second.
With n_jobs=2, it takes nearly twice as long.
With n_jobs=8, it runs for so long that it never finished on my computer... (I have 8 cores).
Is there something I don't understand about how parallelization works?
n_jobs specifies the number of concurrent processes/threads that should be used for parallelized routines.
From the docs:
Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.
Because of Python's GIL, more threads do not guarantee better speed. So check whether your backend is configured for threads or processes. If it uses threads, try changing it to processes (but you will then also pay the overhead of inter-process communication).
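As a sketch (using KMeans from the question just for illustration; whether a given estimator actually routes its work through joblib depends on your sklearn version), the backend can be forced like this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.utils import parallel_backend

data, _ = make_blobs(n_samples=1000, n_features=416, centers=20, random_state=0)

# "loky" uses processes, "threading" uses threads (and is subject to the GIL).
with parallel_backend("loky", n_jobs=2):
    KMeans(n_clusters=10, max_iter=3).fit(data)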
Again from the docs:
Whether parallel processing is helpful at improving runtime depends on many factors, and it's usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.
So n_jobs is not a silver bullet; you have to experiment to see whether it helps for your estimators and your kind of data.
You can use n_jobs=-1 to use all your CPUs or n_jobs=-2 to use all of them except one.
While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.
I googled and found out that this is a common issue when using a GPU to train. Here is a very good/detailed example with short code snippets to verify the existence of that problem.
They pinpointed the non-determinism to the tf.reduce_sum function. However, that is not the case for me. It could be because I'm using different hardware (a 1080 Ti) or a different version of the CUDA libraries or TensorFlow. It seems like many different parts of the CUDA libraries can be non-deterministic, and it doesn't seem easy to figure out exactly which part and how to avoid it. Also, this must have been by design, so there is likely a meaningful efficiency gain in exchange for the non-determinism.
So, my question is:
Since GPUs are popular for training NNs, people in this field must have a way to deal with non-determinism, because I can't see how else you'd be able to reliably tune the hyperparameters. What is the standard way to handle non-determinism when using a GPU?
TL;DR
Non-determinism in a priori deterministic operations comes from concurrent (multi-threaded) implementations.
Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar to the other major toolkits.
During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.
That, but much longer
When you look at neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy: everything here is a mathematical equation and should be deterministic. Even pseudo-random operations such as shuffling, drop-out, noise and the like are entirely determined by a seed.
When you look at those operations from their computational implementation, on the other hand, you see them as massively parallelized computations, which can be a source of randomness unless you are very careful.
The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will finish first. This does not matter when threads operate on their own data, so for example applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when computing a sum, the result may depend on the order of the summation, and in turn on the order in which the threads finished.
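A tiny example of why the order matters: floating-point addition is not associative, so partial sums combined in a different order can give different results.

# Floating-point addition is not associative:
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0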
From there, you have broadly speaking two options:
Keep the non-determinism that comes with simpler implementations.
Take extra care in the design of your parallel algorithm to reduce or remove non-determinism from your computation. The added constraint usually results in slower algorithms.
Which route does cuDNN take? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations, and more importantly, it used to lack some operations, such as reductions, that people then had to implement themselves in CUDA with varying degrees of consideration for determinism.
Some libraries, such as Theano, were ahead on this topic, exposing early on a deterministic flag that the user could turn on or off; but as you can see from its description, it is far from offering any guarantee:
If more, sometimes we will select some implementations that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementation, e.g. when we do not have a GPU implementation that is deterministic. Also, see the dnn.conv.algo* flags to cover more cases.
In TensorFlow, the realization of the need for determinism came rather late, but it is slowly getting there, helped by the progress of cuDNN on that front as well. For a long time, reductions were non-deterministic, but now they seem to be deterministic. The fact that cuDNN introduced deterministic reductions in version 6.0 may have helped, of course.
It seems that currently the main obstacle for TensorFlow on the road to determinism is the backward pass of the convolution. It is indeed one of the few operations for which cuDNN proposes a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow, and since the choice seems to be based on performance, it could indeed be picked when it is more efficient. (I am not so familiar with TensorFlow's C++ code, so take this with a grain of salt.)
Is this important?
If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.
Deployment is another aspect of things, where it is often desirable to have a deterministic behavior, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward that a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)
Those reasons are rightful motivations to fix non-determinism in neural networks.
For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural-network training. For all purposes, training is stochastic: we use stochastic gradient descent, shuffle data, use random initialization and dropout, and, more importantly, the training data is itself only a random sample. From that standpoint, the fact that computers can only generate pseudo-random numbers from a seed is an artifact. When you train, your loss is a value that comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring their confidence intervals does not make much sense; therefore it is vain, in my opinion, to spend too much effort fixing non-determinism in that case and many others.
Starting from TF 2.9 (TF >= 2.9), if you want your TF models to run deterministically, the following lines need to be added at the beginning of the program.
import tensorflow as tf
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
Important note: the first line sets the random seed for Python, NumPy and TensorFlow. The second line makes each TensorFlow operation deterministic.
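As a quick sanity check (a sketch with a made-up toy model), two runs set up this way should produce identical results, even on a GPU:

import numpy as np
import tensorflow as tf

def run_once():
    tf.keras.utils.set_random_seed(1)
    tf.config.experimental.enable_op_determinism()
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    x = np.random.RandomState(0).rand(256, 4)
    y = np.random.RandomState(1).rand(256, 1)
    model.fit(x, y, epochs=1, verbose=0)
    return model.predict(x, verbose=0)

print(np.array_equal(run_once(), run_once()))  # expected: True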
To get a MNIST network (https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py) to train deterministically on my GPU (1050Ti):
Set PYTHONHASHSEED='SOMESEED'. I do it before starting the python kernel.
Set seeds for random generators (not sure all are needed for MNIST)
import random as python_random
import numpy as np
import tensorflow as tf
python_random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)  # TF 1.x API; in TF 2.x use tf.random.set_seed(42)
Make TF select deterministic GPU algorithms
Either:
import tensorflow as tf
from tfdeterminism import patch
patch()
Or:
import os
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
import tensorflow as tf
Note that the resulting loss is repeatable with either method for selecting deterministic algorithms from TF, but the two methods result in different losses. Also, the solution above doesn't make a more complicated model I'm using repeatable.
Check out https://github.com/NVIDIA/framework-determinism for a more current answer.
A side note:
For cuDNN 8.0.1, non-deterministic algorithms exist for:
(from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html)
cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
cudnnPoolingBackward when CUDNN_POOLING_MAX is used
cudnnSpatialTfSamplerBackward
cudnnCTCLoss and cudnnCTCLoss_v8 when CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC is used
I am reading this performance guide on best practices for optimizing TensorFlow code for GPUs. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped these two lines with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294: there are also a generate_input_fn and a parse_csv function that are used to set up the input data queues. Would it be necessary to place with tf.device('/cpu:0') inside these functions as well, or would that effectively be implied by inputs & label_values already being wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the CPU, the GPU would automatically be used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment: would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so that each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU case above would be needed if you train this way. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two snippets you listed are actually two different parts of the training. Line 282/294 is, in my opinion, the so-called "pre-processing" part, since it parses the raw input data into Tensors. These operations are not well suited to GPU acceleration, so it is sufficient to allocate them on the CPU.
Line 152/153 is part of the training model, since it processes the raw features into different types of features.
'cpu:0' means the operations in this section will be allocated on the CPU, but they are not bound to a specific core. The operations allocated on the CPU will run in multiple threads and can use multiple cores.
If your machine has GPUs, TensorFlow will prefer to allocate operations on the GPU when no device is specified.
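As a minimal sketch of the device placement itself (TF 1.x graph style written against the compatibility API, with made-up placeholder ops rather than the census sample's code): wrap only the parsing/pre-processing in tf.device('/cpu:0') and leave the rest unpinned so TensorFlow can put it on the GPU.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

with tf.device('/cpu:0'):
    # "pre-processing": parse raw strings into Tensors on the CPU
    raw = tf.placeholder(tf.string, shape=[None])
    parsed = tf.string_to_number(raw, out_type=tf.float32)

# No explicit device here: TensorFlow prefers the GPU for these ops if one exists.
weights = tf.Variable(tf.random_normal([1]))
output = parsed * weights

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(output, feed_dict={raw: ["1.0", "2.5"]}))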
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.