I am using AWS SageMaker for model training and deployment. Here is a sample of my model training code:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)
estimator.fit(data_location)
The Docker image mentioned here is a TensorFlow-based image.
Suppose it takes 1,000 seconds to train the model on one instance. If I increase the instance count to 5, the training time multiplies by 5, i.e. 5,000 seconds in total. As per my understanding, the training job should be distributed across the 5 machines, so ideally it would take about 200 seconds per machine, but it seems each machine is doing its own separate training run. Can someone please explain how this works on a distributed system, either in general or with TensorFlow?
I tried to find the answer in this documentation https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf, but how training is distributed across machines does not seem to be covered there.
Are you using the TensorFlow Estimator APIs in your script? If so, I think you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class as described in the documentation here. If you run training that way, parallelization and communication between instances should work out of the box.
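For example, here is a minimal sketch in the style of the (v1-era) SageMaker Python SDK; the entry point name, framework version, and distribution settings below are placeholder assumptions, and the exact arguments depend on your SDK version:

from sagemaker.tensorflow import TensorFlow

# Hypothetical script-mode setup; 'train.py' and the versions are assumptions.
tf_estimator = TensorFlow(entry_point='train.py',
                          role=role,
                          train_instance_count=5,
                          train_instance_type='ml.m4.xlarge',
                          framework_version='1.12',
                          py_version='py3',
                          hyperparameters={'train-steps': 10},
                          distributions={'parameter_server': {'enabled': True}})
tf_estimator.fit(data_location)

Whether the work is actually split across the instances then still depends on your training script using TensorFlow's distributed APIs rather than plain single-machine code.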
But note that scaling will not be linear when you increase the number of instances. Communication between instances takes time, and there may be non-parallelizable bottlenecks in your script, such as loading data into memory.
I want to improve my ResNet model by creating an ensemble of X copies of it, taking the X best ones I have trained. From what I've seen, a technique like bagging would make classifying an image take X times longer, which is really not an option in my case.
Is there a way to create an ensemble without increasing the classification time? Note that I don't care about increasing the training time, because it only needs to be done once, whereas classification may be performed a very large number of times.
There is no magic pill for doing what you want. Extra computation cannot come free.
So one way this can be achieved is by using multiple worker machines to run inference in parallel.
Each model could run on a different machine using TensorFlow Serving.
For every new inference request, do the following:
Have a primary machine which takes up the job of coordinating the inference.
This primary machine submits requests to the different workers (all of which can run in parallel).
The primary machine collects the results from each individual worker and creates the final output by combining them according to your ensemble logic.
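As an illustration, here is a minimal sketch of the fan-out on the primary machine; the worker URLs, model name, and averaging logic are assumptions for a hypothetical TensorFlow Serving REST setup, not a reference implementation:

import concurrent.futures
import numpy as np
import requests

# Hypothetical TensorFlow Serving REST endpoints, one per worker machine.
WORKER_URLS = [
    'http://worker-1:8501/v1/models/resnet:predict',
    'http://worker-2:8501/v1/models/resnet:predict',
]

def query_worker(url, image_batch):
    # Send the same input to one worker and return its class probabilities.
    response = requests.post(url, json={'instances': image_batch.tolist()})
    return np.array(response.json()['predictions'])

def ensemble_predict(image_batch):
    # Fan out to all workers in parallel, then average their predictions.
    with concurrent.futures.ThreadPoolExecutor(len(WORKER_URLS)) as pool:
        futures = [pool.submit(query_worker, url, image_batch) for url in WORKER_URLS]
        predictions = [f.result() for f in futures]
    return np.mean(predictions, axis=0)

Because the workers run concurrently, the end-to-end latency is roughly that of the slowest single model plus the network round trip, rather than X times a single model.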
It depends on the ensembling method; it's an active area of research that I suggest you look into, but I'll provide some examples below:
Dropout: trains parts of the model at any given iteration, effectively training a multi-network ensemble.
Weight averaging: train X models on X different splits of the data to learn different features, then average the early-stopped weights (this requires advanced treatment).
Lookahead optimizer: automates the above by performing the averaging during training.
Parallel weak learners: run X models, but with each model taking 1/X of the time to process - which can be achieved by e.g. inserting a strides=X convolutional layer at the input; the best starting bet is X=2, so you'll average two models' predictions at the output, each prediction made in parallel (which can run faster than the original single model).
If you have a multi-core CPU, however, multi-model ensembling shouldn't pose much of a problem; as per the last bullet, you can run inference concurrently, so inference time shouldn't increase much.
More on parallelism: if a single model is large enough, CPU parallelism will no longer help - you'll also need to ensure multiple models can fit in memory (RAM). The alternative then is again a form of downsampling to cut computation
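For the parallel weak learners option, here is a minimal Keras sketch that combines two strided weak learners into a single inference graph and averages their predictions; the input shape, layer sizes, and class count are placeholder assumptions, not your actual ResNet:

import tensorflow as tf

def make_weak_learner(strides, num_classes=10):
    # Toy stand-in for a downsampled backbone: the strided first conv
    # cuts the spatial resolution, so each learner is cheaper to run.
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.layers.Conv2D(32, 3, strides=strides, activation='relu')(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

# Two weak learners share one input; their predictions are averaged inside
# the graph, so a single predict() call runs the whole ensemble.
model_a = make_weak_learner(strides=2)
model_b = make_weak_learner(strides=2)
images = tf.keras.Input(shape=(224, 224, 3))
averaged = tf.keras.layers.Average()([model_a(images), model_b(images)])
ensemble = tf.keras.Model(images, averaged)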
I am trying to run a text-based recommendation system to find the category of a part from a file of about 56K parts:
Eg: Copper tube -> Wire,
Television -> Electronics etc
However, it takes about 4 hours to get the recommender system output when running on my machine with 8 GB of RAM. I tried running the same script on a machine with about 32 GB of RAM, but there was no improvement in computation time; it still takes 4 hours. The training set for the recommender system is about 11K rows.
How can I make my recommender system run faster? It seems the script is not making effective use of the memory. Any help will be greatly appreciated.
NB: The example shown is just for illustration; the original data set is much more complicated.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Code for the recommendation system
def recommendhts(x, model, train):
    # Find the single nearest training row for the query vector x
    distance, index = model.kneighbors(x.toarray(), n_neighbors=1)
    mi = distance.argmax()
    idx = index[mi][0]
    return (train.iloc[idx]['sHTS'], distance[0][0])

# Training the model on the training set
train = pd.read_csv('train0207190144.csv')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train['keywords'])
x = X.toarray()
df = pd.DataFrame(x, columns=vectorizer.get_feature_names())
model = NearestNeighbors(metric='correlation', n_neighbors=1)
model.fit(df)
vect = vectorizer.fit(train['keywords'])

# Fitting the count vectorizer on the keywords (product descriptions to be queried)
x_new = vect.transform(product['keywords'])
for i in range(len(product)):
    key = x_new[i]
    output, probability = recommendhts(key, model, train)
Edit:
I am attaching a snapshot of the profiling output (Code profiling results) as suggested in the comments. I ran it on a sample of 1,000 rows, and it took about 1,085 seconds.
First, you definitely need to profile your code. I would recommend using the %prun magic command in IPython/Jupyter to profile your script.
A couple of other things to try:
Set the 'n_jobs' parameter to allow for parallelism when making predictions.
# setting n_jobs=2 will use 2 cores; setting n_jobs=-1 will use all cores
model=NearestNeighbors(metric='correlation',n_neighbors=1, n_jobs=2)
It's unclear to me that re-fitting the vectorizer is necessary.
vect=vectorizer.fit(train['keywords']) # can be removed?
Finally, you should be able to vectorize the predictions and replace the for loop, but this would require refactoring your recommendation system, and I cannot help with that without more info.
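As a rough illustration of what that vectorization could look like (assuming you only need the single nearest neighbor per product and keeping the column names from the question), a single batched kneighbors call replaces the per-row loop:

# Query all products in one call instead of looping row by row.
distances, indices = model.kneighbors(x_new.toarray(), n_neighbors=1)
outputs = train.iloc[indices[:, 0]]['sHTS'].values
probabilities = distances[:, 0]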
I am training an LSTM model on a very large dataset on my machine, using Keras with the TensorFlow backend. My machine has 16 cores. While training the model I noticed that the load on all the cores is below 40%.
I have gone through different sources looking for a solution and have tried specifying the number of cores to use in the backend as follows:
config = tf.ConfigProto(device_count={"CPU": 16})
backend.tensorflow_backend.set_session(tf.Session(config=config))
Even after that, the load is still the same.
Is this because the model is very small? It takes around 5 minutes per epoch. If it used all the cores, the speed could be improved.
How can I tell Keras or TensorFlow to use all the available cores, i.e. 16 cores, to train the model?
I have gone through these Stack Overflow questions and tried the solutions mentioned there. They didn't help.
Limit number of cores used in Keras
How exactly are you training the model? You might want to look into using model.fit_generator(), but with a Keras Sequence object instead of a custom generator. This allows you to safely use multiprocessing and will result in all cores being used.
You can check out the Keras docs for an example.
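A minimal sketch of that pattern, assuming your data fits in NumPy arrays X and y and that model is your compiled LSTM from the question (names and batch size are placeholders):

import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    # Process-safe batch provider for fit_generator.
    def __init__(self, X, y, batch_size=32):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.X[sl], self.y[sl]

# workers and use_multiprocessing let Keras prepare batches on several cores.
model.fit_generator(BatchSequence(X, y),
                    epochs=10,
                    workers=16,
                    use_multiprocessing=True)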
I am a beginner in machine learning. I recently got a machine learning application running successfully using the TensorFlow Object Detection API.
My dataset is 200 images of the object at 300x300 resolution. However, the training has been running for two days and has not yet completed.
I wonder how long it should take to complete the training? At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S.: the training uses only CPUs.
It depends on your desired accuracy and your data set, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.
I have an account on a computing cluster that runs Linux. I'm using scikit-learn to train a random forest classifier with 1000 trees on a very large dataset. I tried to use all the cores of the computing cluster by running the following code:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(data, Y)
However, when I run the code I see that only 1.2% of the CPUs are being used! Why is it not using all the cores that exist? And how can I solve this?
Edit: I saw that my problem could be related to the one in this link, but I wasn't able to understand the solution. https://github.com/scikit-learn/scikit-learn/issues/1053
This might not be the root of your problem (since n_jobs=-1 should automatically detect and use all cores of your master node), but note that sklearn will run in parallel only across the cores of a single machine in your cluster. By default it will NOT run on cores from different machines, as this would imply knowing the architecture of your cluster and communicating over the network, which sklearn doesn't know how to do, since it varies from cluster to cluster.
For this you will have to use a solution like ipython parallel. See the excellent tutorial by Oliver Grisel if you want to use the full power of your cluster.
I recommend that you update sklearn to the latest version, try your code locally (ideally under the same OS and sklearn version), and debug the scaling behavior and CPU utilization by setting n_jobs=1, 2, 3, ... and benchmarking the fit. For example, if n_jobs=1 does not give high utilization of one core on the cluster but does on your local PC, that would indicate a problem with the cluster rather than with the code. Sometimes the top command behaves differently on a cluster; you should check this with the admin.
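A small benchmarking sketch along those lines, using synthetic data as a stand-in for your dataset (the sizes, tree count, and n_jobs values below are arbitrary assumptions):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with a subset of your real dataset.
data, Y = make_classification(n_samples=20000, n_features=50, random_state=0)

for n_jobs in (1, 2, 4, -1):
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.time()
    clf.fit(data, Y)
    # Watch CPU utilization (e.g. with top) while each fit runs.
    print('n_jobs=%s: %.1f s' % (n_jobs, time.time() - start))

If the timings shrink as n_jobs grows on your local machine but not on the cluster, the issue is likely in how the cluster schedules or reports CPU usage rather than in scikit-learn.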