I want to decrease the training time of my models by using a high-end EC2 instance. So I tried a c5.18xlarge instance with 2 CPUs and ran a few models with the parameter n_jobs=-1, but I noticed that only one CPU was utilized:
Can I somehow make scikit-learn use all CPUs?
Try adding:
import multiprocessing
multiprocessing.set_start_method('forkserver')
at the top of your code, before running or importing anything else. That's a well-known issue with multiprocessing in Python.
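A minimal sketch of that placement (the RandomForestClassifier and the synthetic data below are just stand-ins for whatever you are fitting; the if __name__ == '__main__' guard keeps child processes from re-running the setup):
import multiprocessing

if __name__ == '__main__':
    # Must run before any pools or worker processes are created.
    multiprocessing.set_start_method('forkserver')

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data and model; replace with your own training code.
    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
    clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # n_jobs=-1: use all cores
    clf.fit(X, y)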
I have a GUI that allows the user to classify images and either train the CNN model or run predictions with an existing model. There are no usable GPUs on the machine, so I set up CPU multiprocessing to speed things up and parallelize the work. I know that TensorFlow must be imported inside the worker functions in order to be used across multiple cores.
def create_model(list_of_params_list):
    import tensorflow as tf
    # Model is defined and starts training here.

def multicore_controller(list_of_params_list):
    import multiprocessing
    with multiprocessing.Pool() as p:
        p.starmap_async(create_model, list_of_params_list)
        p.close()
        p.join()
So create_model creates multiple CNN models with different parameters and hyperparameters (such as the size of the kernels, the loss function, etc.). However, on the `import tensorflow as tf` line, the script stops responding (it throws no errors and produces no visible output) and stays on that line forever. What is the problem and how can I fix it? Thanks in advance.
Edit: I tried TensorFlow versions 2.3.x, 2.6.x, 2.7.x, 2.8.x, and 2.9.1, but nothing changed.
I still couldn't find the reason for the bug. However, you can work around the problem by spawning your processes instead of forking them:
import multiprocessing

ctx = multiprocessing.get_context("spawn")
with ctx.Pool() as p:
    p.starmap_async(create_model, list_of_params_list)
    p.close()
    p.join()
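One extra thing to keep in mind with spawn (a hedged sketch below, reusing the names from the question; the parameter tuples are placeholders): the pool has to be created under an if __name__ == '__main__' guard, because spawned children re-import the main module.
import multiprocessing

def create_model(*params):
    # TensorFlow is imported inside the worker so every spawned process
    # gets its own freshly initialized runtime.
    import tensorflow as tf
    # ... build and train one model from `params` here ...

def multicore_controller(list_of_params_list):
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool() as p:
        p.starmap_async(create_model, list_of_params_list)
        p.close()
        p.join()

if __name__ == '__main__':
    multicore_controller([("relu", 3), ("tanh", 5)])  # placeholder parameter tuples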
Any insight into the actual cause of the problem is still very welcome.
I'm trying to train a model that contains two sub-models on two GPUs simultaneously, and someone told me I should take a look at multiprocessing (refer to this).
As far as I know, parallelism includes data parallelism and model parallelism; my case looks more like model parallelism, and pipelining is usually used along with it to reduce the time wasted transferring data between the different sub-models. This can be implemented with the torch.distributed.pipeline.sync package.
On the other hand, due to the GIL in Python, if we really want to execute multiple threads of work (in my case, training two models) at the same time, one way is to use torch.multiprocessing, which is based on Python's multiprocessing package and spawns new interpreter processes to get around the GIL limitation.
My questions are:
What is difference between parallelism and multiprocessing?
In my case which one (or which package) should I use?
I found two tutorials in PyTorch's documentation related to parallelism (SINGLE-MACHINE MODEL PARALLEL BEST PRACTICES and TRAINING TRANSFORMER MODELS USING PIPELINE PARALLELISM), but I wonder whether the former is actually doing "parallelism" at all: it just splits the model, and only one part of it is computing at any given time.
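To make it concrete, the first tutorial's setup looks roughly like the sketch below (made-up layer sizes; each sub-model sits on its own GPU, but the forward pass still runs them one after the other, which is why I doubt it is really "parallel"):
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each sub-model is placed on its own GPU.
        self.sub1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
        self.sub2 = nn.Linear(256, 10).to("cuda:1")

    def forward(self, x):
        # sub2 only starts after sub1 has finished, and the intermediate
        # activation has to be copied from GPU 0 to GPU 1.
        h = self.sub1(x.to("cuda:0"))
        return self.sub2(h.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 128))  # loss/backward would follow as usual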
Thanks to anyone who takes a look at my question.
Even after setting tf.config.threading.set_inter_op_parallelism_threads(1) and tf.config.threading.set_intra_op_parallelism_threads(1), Keras with TensorFlow on CPU (running a simple CNN model fit) on a Linux machine creates too many threads. Whatever I try, it ends up creating 94 threads while going through the fitting epochs. I have tried playing with tf.compat.v1.ConfigProto settings, but nothing helps. How do I limit the number of threads?
This is why TensorFlow creates so many threads.
Using the two types of parallelism mentioned (inter-op and intra-op), you have only limited control over the number of threads generated by TensorFlow. The minimum number of threads that you can get by setting these two variables is N, where N is the number of cores on your CPU (I don't know whether you use a GPU).
intra_op_parallelism_threads = 1
inter_op_parallelism_threads = 1
Even setting the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS does not help to reduce the number of threads any further.
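For reference, a typical attempt to pin everything down looks like the sketch below (the environment variables must be set before TensorFlow is imported to have any effect); as noted, the thread count still will not drop below N:
import os

# Must be set before TensorFlow is imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import tensorflow as tf

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)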
The following discussions suggest that, without changing the source code of TensorFlow, it is not possible to reduce the number of threads below N:
How can I confine TensorFlow C API to use one and only one thread in total
How to disable Tensorflow's multi-threading?
How to stop TensorFlow from multi-threading
https://github.com/tensorflow/tensorflow/issues/42510
https://github.com/tensorflow/tensorflow/issues/33627
I am using AWS SageMaker for model training and deployment. This is a sample example for model training:
from sagemaker.estimator import Estimator
hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)
estimator.fit(data_location)
The Docker image mentioned here contains a TensorFlow setup.
Suppose it takes 1000 seconds to train the model on one instance. If I increase the instance count to 5, the total training time across the instances becomes 5 times that, i.e. 5000 seconds. As per my understanding, the training job would be distributed across the 5 machines, so ideally it should take about 200 seconds per machine, but it seems each machine runs its own separate training. Can someone please let me know how this works over a distributed system, in general or with TensorFlow?
I tried to find the answer in this documentation https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf, but how training works across distributed machines doesn't seem to be covered there.
Are you using the TensorFlow Estimator APIs in your script? If yes, I think you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class, as described in the documentation here. If you run training that way, parallelization and communication between instances should work out of the box.
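As a rough sketch (argument names follow version 2 of the SageMaker Python SDK and may differ in older versions; 'train.py' and the framework/Python versions are placeholders):
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',          # your TensorFlow training script
                       role=role,
                       instance_count=5,                 # all 5 instances share one training job
                       instance_type='ml.m4.xlarge',
                       framework_version='2.9',          # placeholder; match your TF version
                       py_version='py39',
                       distribution={'parameter_server': {'enabled': True}})
estimator.fit(data_location)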
But note that scaling will not be linear as you increase the number of instances. Communication between instances takes time, and there can be non-parallelizable bottlenecks in your script, such as loading data into memory.
I have an account on a computing cluster that runs Linux. I'm using scikit-learn to train a Random Forest classifier with 1000 trees on a very large dataset. I tried to use all the cores of the computing cluster by running the following code:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(data, Y)
However, when I run the code I see that only 1.2% of the CPUs are being used! Why isn't it using all the cores that exist, and how can I solve this?
Edit: I saw that my problem could be related to the one in this link, but I wasn't able to understand the solution: https://github.com/scikit-learn/scikit-learn/issues/1053
This might not be the root of your problem (since n_jobs=-1 should automatically detect and use all cores on your master node), but scikit-learn will only run in parallel on the cores of a single machine in your cluster. By default it will NOT use cores from different machines in your cluster, as that would require knowing your cluster's architecture and communicating over the network, which scikit-learn does not do since it varies from cluster to cluster.
For this you will have to use a solution like IPython Parallel. See the excellent tutorial by Olivier Grisel if you want to use the full power of your cluster.
I recommend that you update scikit-learn to the latest version, try your code locally (ideally under the same OS and scikit-learn version), and debug the scaling behavior and CPU utilization by setting n_jobs=1, 2, 3, ... and benchmarking the fit. For example, if n_jobs=1 does not give high utilization on one core in the cluster but does on your local PC, that would indicate a problem with the cluster and not with the code. Sometimes the top command behaves differently on a cluster; you should check this with the admin.
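A minimal sketch of that kind of benchmark (synthetic data and made-up sizes, just to compare wall-clock times for different n_jobs values):
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset; substitute your own data to get meaningful numbers.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
for n_jobs in (1, 2, 4, 8, -1):
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = time.time()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: {time.time() - start:.1f}s")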