I've been tasked with solving a sentiment classification problem using scikit-learn, Python, and MapReduce. I need to use MapReduce to parallelize the project, which means training multiple SVM classifiers. I am then supposed to "average" the classifiers together, but I am not sure how that works or whether it is even possible. The end result should be a single trained, averaged classifier.
I have written the code using scikit-learn's SVM with a linear kernel, and it works, but now I need to bring it into a parallelized MapReduce context, and I don't know where to begin.
Any advice?
Make sure that all of the required libraries (scikit-learn, NumPy, pandas) are installed on every node in your cluster.
Your mapper will process each line of input (i.e., each training row) and emit a key that represents the fold on which a classifier will be trained.
Your reducer will collect the lines for each fold and then fit a scikit-learn classifier on all lines for that fold.
You can then average the resulting classifiers from each fold; for a linear SVM this can be done by averaging the learned coefficient vectors and intercepts.
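To make that concrete, here is a minimal Hadoop-streaming-style sketch, not a drop-in for your pipeline: it assumes comma-separated rows with the label in the last field, a fold count you choose yourself, and that each reducer sees the rows of a single fold. Because LinearSVC is a linear model, its coef_ and intercept_ can be averaged element-wise across folds and plugged back into one final classifier.
# mapper.py -- assign each training row to a fold (NUM_FOLDS is an assumption)
import sys

NUM_FOLDS = 4

for i, line in enumerate(sys.stdin):
    line = line.strip()
    if line:
        print(f"{i % NUM_FOLDS}\t{line}")

# reducer.py -- collect one fold's rows, fit a linear SVM, emit its weights
import sys
import numpy as np
from sklearn.svm import LinearSVC

X, y = [], []
for line in sys.stdin:
    _, row = line.strip().split("\t", 1)
    values = [float(v) for v in row.split(",")]
    X.append(values[:-1])
    y.append(values[-1])

clf = LinearSVC().fit(np.array(X), np.array(y))
# Emit the weight vector and intercept; a final pass averages these across
# folds element-wise to build the single "averaged" classifier.
print(",".join(str(v) for v in np.append(clf.coef_.ravel(), clf.intercept_)))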
The concept of KNN is to find the data points nearest to the query point.
Therefore, there should be no math or processing before testing the model.
All it does is find the closest K points, which would mean there is no training process.
If this is right, then what happens during the training process for KNN in Python?
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Something happens in the background when fit gets called.
What is happening there if the process requires no calculations?
KNN is not so much a specific algorithm in itself as a method that you can implement in several ways. The idea behind nearest neighbors is to select one or more examples from the training data to decide the predicted value for the sample at hand. The simplest way to do that is to iterate through the whole dataset and pick the closest data points from the training dataset. In that case, you could skip the fitting step, or you could see the fitting as the production of a callable function that runs that loop. Even in that case, if you are using a library like scikit-learn, it is useful to maintain a similar interface for all predictors, so you can write generic code for them (e.g. training code independent of the specific algorithm used).
However, you can do smarter things for KNN too. In scikit-learn, you will see that KNeighborsClassifier implements three different algorithms. One is brute force, which is just traversing the whole dataset as described, but you also have BallTree (wiki) and KDTree (wiki). These are data structures that can accelerate the search for nearest neighbors, but they need to be constructed in advance from the data. So the fitting step here is building the data structure that will help you find the nearest neighbors.
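For illustration, here is a small sketch of switching between those search strategies via the algorithm parameter of KNeighborsClassifier (the random data is just a stand-in):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))     # stand-in training data
y_train = rng.integers(0, 2, size=1000)

for algo in ("brute", "kd_tree", "ball_tree"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    clf.fit(X_train, y_train)  # for the trees, fit() builds the index structure
    print(algo, clf.predict(X_train[:3]))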
I have a training dataset of shape (90000, 50) and I am trying to fit it with a Gaussian process regression model. This fails with a memory error. I understand the computational cost, but is there a way to pass the data in batches using scikit-learn? I am using the scikit-learn implementation of the GPR algorithm.
Keras has generators because you can create checkpoints and resume from where you left off when training neural networks. However, not all trainable algorithms have this property. Take a look at incremental learning in the scikit-learn API docs.
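GaussianProcessRegressor itself has no partial_fit, so batching means switching to an estimator that supports incremental learning. A minimal sketch with SGDRegressor (the batch size and the random stand-in data are assumptions):
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.random.rand(90000, 50)   # stand-in for your (90000, 50) dataset
y = np.random.rand(90000)

model = SGDRegressor()
for start in range(0, len(X), 5000):   # stream 5000-row chunks
    model.partial_fit(X[start:start + 5000], y[start:start + 5000])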
The Gaussian process implementation (regression/classification) in scikit-learn isn't capable of handling big datasets. It can only run on up to about 15000 rows of data. So I decided to use a different algorithm instead, as this seems to be a limitation of the algorithm itself.
Do you know if models from scikit-learn automatically use multithreading, or just sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm is one that needs to see the data sequentially, we cannot do anything. If the dataset is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for an n_jobs parameter in the utility or algorithm you want to use, and set it to -1 to enable multi-threading; a short sketch follows the examples below.
For example:
LogisticRegression, when working on a binary problem, will only train a single model that consumes the data sequentially, so n_jobs has no effect here. But it handles multi-class problems as one-vs-rest, so it has to train that many estimators on the same data; in that case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class capable and doesn't need to train multiple models, so it doesn't have that parameter.
Ensemble methods like RandomForestClassifier train multiple estimators (irrespective of problem type) which individually work on some part of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV also work on parts of the data or individual parameter combinations, each independent of the other folds, so here too we can use multi-threading.
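A short sketch of where n_jobs shows up in practice (the synthetic data and parameter grid are just for illustration):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The individual trees are fit in parallel across all available cores.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1).fit(X, y)

# Each parameter/fold combination is evaluated in parallel.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [3, 5, None]},
                      cv=5, n_jobs=-1).fit(X, y)
print(search.best_params_)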
I have a dataset which includes 200000 labelled training examples.
For each training example I have 10 features, including both continuous and discrete.
I'm trying to use the sklearn package in Python to train the model and make predictions, but I'm having some trouble (and have some questions too).
First let me write the code which I have written so far:
from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)
The problem is that I get really low accuracy (too many misclassified labels) - around 20%.
However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Any thoughts or suggestions will be much appreciated.
The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
This is not a surprising result for Naive Bayes; it is an extremely simple classifier and you should not expect it to be strong, so more data probably won't help. Your Gaussian estimators are probably already quite good; the naive independence assumptions are the real problem. Use a stronger model. You can start with Random Forest, since it is very easy to use even by non-experts in the field.
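A hedged sketch of that swap, reusing the data and targets arrays from your snippet (nothing else about your pipeline is assumed):
from sklearn.ensemble import RandomForestClassifier

# Drop-in replacement for the GaussianNB lines above.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data, targets)
predicted = rf.predict(data)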
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
No, it is not; you should use different distributions for the discrete features. However, scikit-learn does not support that, so you would have to do it manually. As said before, change your model.
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Nothing is done automatically in this manner; you need to do it on your own (scikit-learn has lots of tools for that: see the cross-validation utilities).
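A minimal sketch of doing the split yourself, again using the data and targets arrays from the question (the 80/20 split is an arbitrary choice):
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.2, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", gnb.score(X_test, y_test))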
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN, and so on, evaluate each of them with cross-validation, and then compare them.
You can use GridSearch in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project,
which tests a range of parameters with a genetic algorithm.
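A rough sketch of the suggested comparison loop (synthetic stand-in data; swap in your own feature matrix and labels):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, model in [("SVM", SVC()),
                    ("Random Forest", RandomForestClassifier(random_state=0)),
                    ("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("KNN", KNeighborsClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())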
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
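A tiny sketch of what that looks like (the categories are invented for illustration):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder(handle_unknown="ignore")
print(encoder.fit_transform(colors).toarray())
# Each row becomes an indicator vector over the sorted categories (blue, green, red).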
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
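For example, a quick PCA projection to 2D (the synthetic data stands in for your 500,000 x 20 matrix):
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=2)  # color points by their label
plt.show()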
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet for what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it, compare different results. Without more information it is not possible to give you better advice.