How to make scikit-learn Nearest Neighbors algorithm run faster? - python

I am trying to run a text-based recommendation system to find the category of a part from a file of about 56K parts:
E.g. Copper tube -> Wire,
Television -> Electronics, etc.
However, it takes about 4 hours to get the recommender system output when running on my machine with 8 GB of RAM. I tried running the same script on a machine with about 32 GB of RAM, but there was no improvement in the computation time; it still takes 4 hours. The training set for the recommender system is about 11k rows.
How can I make my recommender system run faster? It seems the script is not making use of the memory effectively. Any help will be greatly appreciated.
NB: The example shown is just for illustration and the original data set is much more complicated.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# Code for the recommendation system
def recommendhts(x, model, train):
    distance, index = model.kneighbors(x.toarray(), n_neighbors=1)
    mi = distance.argmax()
    idx = index[mi][0]
    return train.iloc[idx]['sHTS'], distance[0][0]

# Training the model on the training set
train = pd.read_csv('train0207190144.csv')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train['keywords'])
x = X.toarray()
df = pd.DataFrame(x, columns=vectorizer.get_feature_names())
model = NearestNeighbors(metric='correlation', n_neighbors=1)
model.fit(df)
vect = vectorizer.fit(train['keywords'])

# Fitting the CountVectorizer on the keywords (product descriptions to be queried);
# 'product' is the DataFrame of parts to be categorized, loaded elsewhere
x_new = vect.transform(product['keywords'])
for i in range(len(product)):
    key = x_new[i]
    output, probability = recommendhts(key, model, train)
Edit:
I am attaching a snapshot of the code profiling results, as suggested in the comments. I ran it for a sample of 1000 rows and the time taken was about 1085 seconds.

First you definitely need to profile your code. I would recommend making use of the %prun magic command in IPython/Jupyter for profiling your script.
A couple of other things to try:
Set the 'n_jobs' parameter to allow for parallelism when making predictions.
# setting n_jobs=2 will use 2 cores; setting n_jobs=-1 will use all cores
model = NearestNeighbors(metric='correlation', n_neighbors=1, n_jobs=2)
It's unclear to me that re-fitting the vectorizer is necessary, since it was already fitted above.
vect=vectorizer.fit(train['keywords']) # can be removed?
Finally, you should be able to vectorize the predictions and replace the for loop, but this would require refactoring your recommendation system and I cannot help with that without more info.
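That said, a minimal sketch of what the vectorized prediction could look like, assuming the same train, vectorizer, model and product objects from the question (a sketch, not a drop-in replacement):

# Query all products in a single kneighbors call instead of looping row by row
x_new = vectorizer.transform(product['keywords'])
distances, indices = model.kneighbors(x_new.toarray(), n_neighbors=1)

# One recommended category and one distance per product row
outputs = train.iloc[indices[:, 0]]['sHTS'].values
probabilities = distances[:, 0]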

Related

Parallel Logistic Regression with GPU in Python (MinPy)

I just managed to set up an Anaconda Environment to run this example from MinPy.
Now as I understand it, the parallelization is in the training and predicting part.
However, for my specific use case I want to go one level higher:
Since I have one very large data set that is split up into fairly small ones, I want to run a separate multinomial regression for each of the subsets.
Is there a way I can parallelize at this higher level?
Since I am very new to the topic I simply do not know how to approach this problem.
Thanks in advance for any advice :)
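For illustration, one way to parallelize at that higher level is to treat each subset's regression as an independent job and fan the fits out across processes with joblib. This is only a sketch under the assumption that each subset fits in memory on its own; the synthetic subsets and the scikit-learn LogisticRegression are stand-ins for your actual data and MinPy model:

from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the many small subsets of the large data set (an assumption)
subsets = [make_classification(n_samples=500, n_features=20, n_classes=3,
                               n_informative=5, random_state=i)
           for i in range(10)]

def fit_subset(X, y):
    # Each multinomial regression is independent, so the fits can run in parallel
    return LogisticRegression(multi_class='multinomial', solver='lbfgs',
                              max_iter=1000).fit(X, y)

# One worker process per core; each worker fits the model for one subset
models = Parallel(n_jobs=-1)(delayed(fit_subset)(X, y) for X, y in subsets)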

AWS SageMaker | Why does training with multiple instances take time multiplied by the instance count?

I am using AWS SageMaker for model training and deployment. This is a sample example for model training:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)
estimator.fit(data_location)
The Docker image mentioned here is a TensorFlow-based system.
Suppose it takes 1000 seconds to train the model. Now if I increase the instance count to 5, the training time increases 5 times, i.e. to 5000 seconds. As per my understanding, the training job should be distributed across the 5 machines, so ideally it should take about 200 seconds per machine, but it seems each machine is doing its own separate training. Can someone please let me know how this works over a distributed system, in general or with TensorFlow?
I tried to find the answer in this documentation https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf but the way it works on distributed machines does not seem to be mentioned there.
Are you using TensorFlow estimator APIs in your script? If yes, I think you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class as described in the documentation here. If you run training that way, parallelization and communication between instances should work out of the box.
But note that scaling will not be linear when you increase the number of instances. Communication between instances takes time, and there could be non-parallelizable bottlenecks in your script, like loading data into memory.
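For illustration, a minimal sketch of what that wrapping could look like with SageMaker Python SDK v1-style arguments; the entry point script name, framework version and parameter-server setting are assumptions and depend on your SDK and TensorFlow versions:

from sagemaker.tensorflow import TensorFlow

# Distributed training via the TensorFlow estimator; 'train.py' is a
# hypothetical script name and the versions below are illustrative
estimator = TensorFlow(entry_point='train.py',
                       role=role,
                       train_instance_count=5,
                       train_instance_type='ml.m4.xlarge',
                       framework_version='1.12',
                       py_version='py3',
                       script_mode=True,
                       distributions={'parameter_server': {'enabled': True}},
                       hyperparameters={'train-steps': 10})
estimator.fit(data_location)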

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge about the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). Therefore I am currently running the model on a 30 GB RAM RDP, which I thought would allow me to do so much more.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example is that using pandas.concat makes my model's RAM usage skyrocket from 3 GB to 11 GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19 GB but actually returning to 11 GB after the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now, just because the RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also thought about using pre-trained models, but I truly did not understand how this process works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
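A minimal sketch of these two passes, assuming the data lives in a CSV with a single categorical column named 'category' (the file name, column name and chunk size are purely illustrative):

import pandas as pd

# Pass 1 (Step 1.1): stream once to collect the full set of categories
categories = set()
for chunk in pd.read_csv('data.csv', chunksize=10000):
    categories.update(chunk['category'].unique())
categories = sorted(categories)

# Pass 2 (Step 1.2): stream again, encode each chunk independently, append to disk
first = True
for chunk in pd.read_csv('data.csv', chunksize=10000):
    # Fixing the category list keeps the OHE columns identical across chunks
    chunk['category'] = pd.Categorical(chunk['category'], categories=categories)
    encoded = pd.get_dummies(chunk, columns=['category'])
    encoded.to_csv('data_ohe.csv', mode='w' if first else 'a',
                   header=first, index=False)
    first = False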
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you can probably also throw together something that streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can again be done like Steps 1.1 and 1.2.
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory for the training, you can eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read the next batch, and so on.
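A minimal sketch of that batch-at-a-time pattern with tf.keras; the file name, column names, feature count and network are assumptions, just to show the shape of the approach:

import numpy as np
import pandas as pd
from tensorflow import keras

n_features = 200  # number of columns after OHE (assumption)

def batch_generator(path, batch_size=256):
    # Keras expects the generator to loop forever; only one batch is in memory at a time
    while True:
        for chunk in pd.read_csv(path, chunksize=batch_size):
            y = chunk.pop('target').values
            yield chunk.values.astype(np.float32), y

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(n_features,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# steps_per_epoch is needed because the generator has no defined length
model.fit(batch_generator('data_ohe.csv'), steps_per_epoch=1000, epochs=5)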

How to make Naive Bayes classifier for large datasets in python

I have large datasets of 2-3 GB. I am using the NLTK Naive Bayes classifier with this data as training data. When I run the code for small datasets it runs fine, but for large datasets it runs for a very long time (more than 8 hours) and then crashes without much of an error. I believe it is because of a memory issue.
Also, after classifying the data I want the classifier dumped into a file so that it can be used later for testing data. This process also takes too much time and then crashes as it loads everything into memory first.
Is there a way to resolve this?
Another question: is there a way to parallelize this whole operation, i.e. parallelize the classification of this large dataset using some framework like Hadoop/MapReduce?
I think you need to increase the memory dynamically to overcome this problem. I hope these links will help you:
Python Memory Management
Parallelism in Python
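One possible way around both the memory and the dump-size problems is to switch from NLTK to an out-of-core learner, e.g. scikit-learn's MultinomialNB with partial_fit, so the 2-3 GB file never has to be fully in memory. A hedged sketch; the file name, column names, labels and chunk size are assumptions:

import joblib
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer needs no fitting pass and, with alternate_sign=False,
# produces the non-negative counts MultinomialNB expects
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()
classes = ['pos', 'neg']  # all labels must be known up front for partial_fit

for chunk in pd.read_csv('train_large.csv', chunksize=10000):
    X = vectorizer.transform(chunk['text'])
    clf.partial_fit(X, chunk['label'], classes=classes)

# Dumping the model is cheap because it does not store the training data
joblib.dump(clf, 'naive_bayes_model.joblib')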

Training a RandomForest is slow on a computing cluster

I have an account on a computing cluster that runs Linux. I'm using scikit-learn to train a Random Forest classifier with 1000 trees on a very large dataset. I tried to use all the cores of the computing cluster by running the following code:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(data, Y)
However, when I run the code I see that only 1.2% of the CPUs are being used! So why isn't it using all the cores that exist? And how can I solve this?
Edit: I saw that my problem could be related to the one in this link, but I wasn't able to understand the solution: https://github.com/scikit-learn/scikit-learn/issues/1053
This might not be the root of your problem (as n_jobs=-1 should automatically detect and use all cores on your master node), but sklearn will only run in parallel across the cores of a single machine in your cluster. By default it will NOT run on cores from different machines in your cluster, as this would imply knowing about the architecture of your cluster and communicating via the network, which sklearn doesn't know how to do, since it varies from cluster to cluster.
For this you will have to use a solution like ipython parallel. See the excellent tutorial by Oliver Grisel if you want to use the full power of your cluster.
I recommend that you update sklearn to the latest version, try your code locally (ideally under the same OS and sklearn version), and debug the scaling behavior and CPU utilization by setting n_jobs=1,2,3... and benchmarking the fit. For example, if n_jobs=1 doesn't give a high utilization rate on one core on the cluster but it does on your local PC, this would indicate a problem with the cluster and not with the code. Sometimes the top command on a cluster behaves differently; you should check this with the admin.
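For illustration, a small benchmarking sketch along those lines; the synthetic dataset is an assumption purely so the snippet is self-contained:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

data, Y = make_classification(n_samples=20000, n_features=50, random_state=0)

# Time the fit for increasing n_jobs and watch how CPU utilization scales
for n_jobs in (1, 2, 4, -1):
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = time.time()
    clf.fit(data, Y)
    print("n_jobs=%s: %.1f s" % (n_jobs, time.time() - start))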
