I have an account on a computing cluster that runs Linux. I'm using scikit-learn to train a Random Forest classifier with 1000 trees on a very large dataset. I tried to use all the cores of the computing cluster by running the following code:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(data, Y)
However, when I run the code I see that only 1.2% of the CPUs are being used! Why isn't it using all the available cores, and how can I fix this?
Edit: I saw that my problem might be related to the one in this issue, but I wasn't able to understand the solution: https://github.com/scikit-learn/scikit-learn/issues/1053
This might not be the root of your problem (n_jobs=-1 should automatically detect and use all cores on your master node), but note that scikit-learn runs in parallel only across the cores of a single machine in your cluster. By default it will NOT use cores from different machines, as that would require knowing the architecture of your cluster and communicating over the network, which scikit-learn does not do since it varies from cluster to cluster.
For that you would have to use a solution like IPython parallel. See the excellent tutorial by Olivier Grisel if you want to use the full power of your cluster.
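A different route, not covered by that tutorial, is scikit-learn's joblib integration with Dask. A minimal sketch, assuming a Dask scheduler is already running on the cluster (the scheduler address below is a placeholder):

import joblib
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier

# Connect to an existing Dask scheduler on the cluster (placeholder address)
client = Client('tcp://scheduler-address:8786')

clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)

# Route joblib's parallelism through the Dask workers instead of local processes
with joblib.parallel_backend('dask'):
    clf.fit(data, Y)   # data, Y as in the question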
I recommend that you update scikit-learn to the latest version, try your code locally (ideally under the same OS and scikit-learn version), and debug the scaling behavior and CPU utilization by setting n_jobs=1, 2, 3, ... and benchmarking the fit. For example, if n_jobs=1 gives high utilization of one core on your local PC but not on the cluster, that points to a problem with the cluster rather than with the code. Also, the top command can behave differently on a cluster, so it is worth checking this with the admin.
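A minimal sketch of that benchmark, reusing data and Y from the question (with a smaller forest so each run finishes quickly):

import time
from sklearn.ensemble import RandomForestClassifier

# Time the fit for increasing n_jobs and watch CPU utilization (top/htop) in parallel
for n_jobs in (1, 2, 4, 8, -1):
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs)
    start = time.time()
    clf.fit(data, Y)
    print('n_jobs=%s: %.1f s' % (n_jobs, time.time() - start))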
Related
I am able to submit jobs to Azure ML services using a compute cluster. It works well, and the autoscaling combined with good flexibility for custom environments seems to be exactly what I need. However, so far all these jobs seem to only use one compute node of the cluster. Ideally I would like to use multiple nodes for a computation, but all methods that I see rely on rather deep integration with azure ML services.
My modelling case is a bit atypical. From previous experiments I identified a group of architectures (pipelines of preprocessing steps + estimators in Scikit-learn) that worked well.
Hyperparameter tuning for one of these estimators can be performed reasonably fast (couple of minutes) with RandomizedSearchCV. So it seems less effective to parallelize this step.
Now I want to tune and train this entire list of architectures.
This should be very easy to parallelize, since all architectures can be trained independently.
Ideally I would like something like (in pseudocode)
tuned = AzurePool.map(tune_model, [model1, model2,...])
However, I could not find any resources on how I could achieve this with an Azure ML Compute cluster.
An acceptable alternative would come in the form of a plug-and-play substitute for sklearn's CV-tuning methods, similar to the ones provided in dask or spark.
There are a number of ways you could tackle this with AzureML. The simplest would be to just launch a number of jobs using the AzureML Python SDK (the underlying example is taken from here)
from azureml.train.sklearn import SKLearn

runs = []

for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    for penalty in [0.5, 1, 1.5]:
        print('submitting run for kernel', kernel, 'penalty', penalty)
        script_params = {
            '--kernel': kernel,
            '--penalty': penalty,
        }

        estimator = SKLearn(source_directory=project_folder,
                            script_params=script_params,
                            compute_target=compute_target,
                            entry_script='train_iris.py',
                            pip_packages=['joblib==0.13.2'])

        runs.append(experiment.submit(estimator))
The above requires you to factor your training out into a script (or a set of scripts in a folder) along with the Python packages required. The above estimator is a convenience wrapper for using scikit-learn. There are also estimators for TensorFlow, PyTorch, Chainer, and a generic one (azureml.train.estimator.Estimator) -- they all differ in the Python packages and base Docker image they use.
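For reference, a minimal sketch of what such an entry script could look like; the script name train_iris.py and the --kernel/--penalty arguments come from the example above, while the model code below is only an assumption, not the tutorial's actual script:

# train_iris.py -- hypothetical entry script matching script_params above
import argparse
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument('--kernel', type=str, default='rbf')
parser.add_argument('--penalty', type=float, default=1.0)
args = parser.parse_args()

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(kernel=args.kernel, C=args.penalty)
clf.fit(X_train, y_train)
print('accuracy:', clf.score(X_test, y_test))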
A second option, if you are actually tuning parameters, is to use the HyperDrive service like so (using the same SKLearn Estimator as above):
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice

estimator = SKLearn(source_directory=project_folder,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train_iris.py',
                    pip_packages=['joblib==0.13.2'])

param_sampling = RandomParameterSampling({
    "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
    "--penalty": choice(0.5, 1, 1.5)
})

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=12,
                                         max_concurrent_runs=4)

hyperdrive_run = experiment.submit(hyperdrive_run_config)
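One thing to keep in mind: HyperDrive ranks runs by the primary_metric_name set above, so the entry script must log a metric with exactly that name. With the azureml-core Run API that would look roughly like:

from azureml.core.run import Run

run = Run.get_context()
# ... after evaluating the model ...
run.log('Accuracy', float(accuracy))   # name must match primary_metric_name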
Or you could use Dask to schedule the work, as you mentioned. Here is a sample of how to set up Dask on an AzureML Compute cluster so you can do interactive work on it: https://github.com/danielsc/azureml-and-dask
There is also a ParallelTaskConfiguration class with a worker_count_per_node setting, which defaults to 1.
I am trying to run a text-based recommendation system to find the category of a part from a file of about 56K parts:
E.g.: Copper tube -> Wire, Television -> Electronics, etc.
However, it takes about 4 hours to get the recommender system's output when running on my machine with 8 GB of RAM. I tried running the same script on a machine with about 32 GB of RAM, but there was no improvement in computation time; it still takes 4 hours. The training set for the recommender system is about 11k rows.
How can I make my recommender system run faster? It seems the script is not making use of the memory effectively. Any help will be greatly appreciated.
NB: The example shown is just for illustration and the original data set is much more complicated.
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Code for the recommendation system
def recommendhts(x, model, train):
    distance, index = model.kneighbors(x.toarray(), n_neighbors=1)
    mi = distance.argmax()
    idx = index[mi][0]
    return (train.iloc[idx]['sHTS'], distance[0][0])

# Training the model on the training set
train = pd.read_csv('train0207190144.csv')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train['keywords'])
x = X.toarray()
df = pd.DataFrame(x, columns=vectorizer.get_feature_names())
model = NearestNeighbors(metric='correlation', n_neighbors=1)
model.fit(df)
vect = vectorizer.fit(train['keywords'])

# Fitting the CountVectorizer on keywords (product descriptions to be queried)
x_new = vect.transform(product['keywords'])
for i in range(len(product)):
    key = x_new[i]
    output, probability = recommendhts(key, model, train)
Edit:
I profiled the code as suggested in the comments and am attaching a snapshot of the results. I ran it for a sample of 1000 rows and the time taken was about 1085 seconds.
First, you definitely need to profile your code. I would recommend using the %prun magic command in IPython/Jupyter to profile your script.
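For example, in a notebook cell you could profile a single end-to-end lookup (names taken from your code):

# -l 20 limits the report to the 20 most expensive functions
%prun -l 20 recommendhts(x_new[0], model, train)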
A couple of other things to try:
Set the 'n_jobs' parameter to allow for parallelism when making predictions.
# setting n_jobs=2 will use 2 cores; setting n_jobs=-1 will use all cores
model = NearestNeighbors(metric='correlation', n_neighbors=1, n_jobs=2)
Unclear to me that re-fitting the vectorizer is necessary.
vect=vectorizer.fit(train['keywords']) # can be removed?
Finally, you should be able to vectorize the predictions and replace the for loop, but that would require refactoring your recommendation system, which I cannot do without more information.
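As a rough illustration of that last point, and assuming you only ever need the single nearest neighbour per product, the Python loop could be replaced by one batched kneighbors call (a sketch using the variable names from your code, not a tested drop-in replacement):

# One call over all query rows instead of calling kneighbors once per row
distances, indices = model.kneighbors(x_new.toarray(), n_neighbors=1)

results = [(train.iloc[idx[0]]['sHTS'], dist[0])
           for idx, dist in zip(indices, distances)]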
I am using AWS SageMaker for model training and deployment. This is a sample example of model training:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)
The Docker image mentioned here is a TensorFlow-based image.
Suppose it takes 1000 seconds to train the model. If I now increase the instance count to 5, the total training time goes up 5 times, i.e. 5000 seconds. As per my understanding, the training job should be distributed across the 5 machines, so ideally it would take about 200 seconds per machine, but it seems each machine is doing its own separate training. Can someone please let me know how this works on distributed systems in general, or with TensorFlow specifically?
I tried to find the answer in this documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf, but it seems the way training works across distributed machines is not described there.
Are you using the TensorFlow estimator APIs in your script? If so, I think you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class, as described in the documentation here. If you run training that way, parallelization and communication between instances should work out of the box.
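For illustration, a sketch with the (older) SageMaker Python SDK v1 and parameter-server distribution enabled; treat the exact argument names and versions as assumptions to check against the SDK release you are using:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',            # your training script
                       role=role,
                       train_instance_count=5,            # 5 instances share one job
                       train_instance_type='ml.m4.xlarge',
                       framework_version='1.12',
                       py_version='py3',
                       distributions={'parameter_server': {'enabled': True}},
                       hyperparameters={'train-steps': 10})

estimator.fit(data_location)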
But note that scaling will not be linear when you increase the number of instances. Communication between instances takes time, and there can be non-parallelizable bottlenecks in your script, such as loading data into memory.
I want to decrease the training time of my models by using a high-end EC2 instance, so I tried a c5.18xlarge instance with 2 CPUs and ran a few models with the parameter n_jobs=-1, but I noticed that only one CPU was utilized.
Can I somehow make Scikit-learn to use all CPUs?
Try adding:
import multiprocessing
multiprocessing.set_start_method('forkserver')
at the top of your code, before running or importing anything else. That's a well-known issue with multiprocessing in Python.
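Putting it together with the main-module guard that setting a start method generally requires, a minimal sketch (the RandomForest call is just an example workload, not your actual script):

import multiprocessing

if __name__ == '__main__':
    # Must be set once, before any parallel work is started
    multiprocessing.set_start_method('forkserver')

    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
    clf.fit(X, y)   # X, y: your training data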
I am reading this performance guide on best practices for optimizing TensorFlow code for the GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped these two lines with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - there are also generate_input_fn and parse_csv functions that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') inside these functions as well, or would that basically be forced by having inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the CPU, the GPU would automatically be used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment - would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU, and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU setup above would be needed if you train in this way. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two code locations you mention are actually two different parts of the training. Lines 282/294 are, in my opinion, the so-called "preprocessing" part: they parse raw input data into Tensors. These operations are not suitable for GPU acceleration, so it is sufficient to allocate them on the CPU.
Lines 152/153 are part of the training model: they process the raw features into different types of features.
'cpu:0' means the operations in this section will be allocated on the CPU, but not bound to a specific core. Operations allocated on the CPU run multi-threaded and can use multiple cores.
If your machine has GPUs, TensorFlow will prefer to allocate operations on the GPUs when no device is specified.
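As a concrete TF 1.x-style sketch of pinning the input pipeline to the CPU, assuming a parse_csv function like the one in the census sample that yields (features, label) pairs:

import tensorflow as tf

def input_fn(filenames, batch_size):
    # Keep parsing/preprocessing on the CPU so the GPU stays free for the model
    with tf.device('/cpu:0'):
        dataset = tf.data.TextLineDataset(filenames)
        dataset = dataset.map(parse_csv)                 # parse_csv as in the sample
        dataset = dataset.shuffle(10000).batch(batch_size)
        features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels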
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times when you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single-machine setup.
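For completeness, the usual TF 1.x parameter-server pattern is tf.train.replica_device_setter, which places variables on the PS tasks while ops run on the local worker, so the model-building code itself stays the same (the addresses and build_model below are placeholders, not part of the sample):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Variables go to the parameter server; everything else runs on the local worker
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    logits = build_model(features)   # your existing model code, unchanged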