The concept of KNN is to find the nearest data points to the query point,
so there is no math or processing before testing the model.
All it does is find the closest K points, which means there is no training process.
If this is right, then what happens in the training step for KNN in Python?
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Yet something happens in the background when fit gets called.
What is happening there if the process requires no calculations?
KNN is not quite a specific algorithm in itself, but rather a method that you can implement in several ways. The idea behind nearest neighbors is to select one or more examples from the training data to decide the predicted value for the sample at hand. The simplest way to do that is to iterate through the whole training dataset and pick the closest data points. In that case, you could skip the fitting step, or you could see the fitting as the production of a callable function that runs that loop. Even in that case, if you are using a library like scikit-learn, it is useful to maintain a similar interface for all predictors, so you can write generic code for them (e.g. training code independent of the specific algorithm used).
However, you can do smarter things for KNN too. In scikit-learn, you will see that KNeighborsClassifier implements three different algorithms. One is brute force, which is just traversing the whole dataset as described, but you also have BallTree (wiki) and KDTree (wiki). These are data structures that can accelerate the search for nearest neighbors, but they need to be constructed in advance from the data. So the fitting step here is building the data structure that will help you find the nearest neighbors.
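For illustration, a minimal sketch (the toy data here is invented) of how that choice shows up through the algorithm parameter, and of fit() being the place where the search structure gets built:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, size=100)

for algo in ("brute", "ball_tree", "kd_tree"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    clf.fit(X_train, y_train)  # brute: essentially just stores the data; ball_tree / kd_tree: the tree is built here
    print(algo, clf.predict(X_train[:1]))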
I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering and set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by the KMeans:
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv=5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
the score is in the 60s. My questions are:
What is the significance (or meaning) of the score produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column containing the clustering labels to the training data and fitting the logistic regression on it, but that didn't improve the score either.
Why is there a huge discrepancy between the score when evaluated via cross_val_score and via the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevant ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small part of the story. Knowing what data you're classifying and its general form is pretty vital, and accuracy doesn't tell us a lot about how inaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and the other 100% accurate? Or are both classes 75% accurate?
What is the class balance (is there more of one class than the other)?
how much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
These plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high-dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high-dimensional space. In an easy classification problem, each class sits on its own island; the more these islands mix together, the harder classification will be.
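A rough sketch of that TSNE check, assuming X_train and y_train are your own arrays (the settings are just scikit-learn defaults):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X_2d = TSNE(n_components=2).fit_transform(X_train)  # map high-dimensional X to 2d
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train)      # color each point by its class
plt.title("t-SNE projection colored by class")
plt.show()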
did a grid search to find the best parameters
Hot take, but don't use grid search; random search is better (source: Artificial Intelligence by Jones and Bartlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
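A hedged sketch of what random search looks like in scikit-learn (the parameter range below is purely illustrative, not a recommendation for your data):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {"C": np.logspace(-4, 2, 50)}  # candidate regularization strengths
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions, n_iter=20, cv=5)
search.fit(X_train, y_train)  # X_train / y_train are your own data
print(search.best_params_)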
I tried using KMeans clustering and set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
OK, answering your questions more directly:
These accuracy scores mean your model isn't complex enough to learn the problem, or that there's too much overlap between the two classes to do any better.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. you can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.
From what I understand about the random forest algorithm, it randomly samples the original dataset to build new sampled/bootstrapped datasets. Each sampled dataset is then turned into a decision tree.
In scikit-learn, you can visualize each individual tree in the random forest. But my question is: how can I show the sampled/bootstrapped dataset used for each of those trees?
I want to see the features and the rows of data used to build each individual tree.
I am not aware of a way to see the bootstrapped rows (samples) of your data, as this is a random process. Neither do I think it is of much importance, if the algorithm trained well.
Nevertheless, for the features: In your visualization of the tree you might already see which features were used by the splits that have been done. But you can also access them directly via the attribute feature_names_in_ (see here) for each DecisionTree in the forest, e.g.
print(rf.estimators_[0].feature_names_in_)
If the feature names in your data have not been defined as given in the documentation, a workaround is to use feature_importances_ (see here) as a proxy instead - the features that were unused obviously have an importance of 0. You can contrast this with the max_features_ value defined for training to see whether you caught them all.
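A small sketch of that workaround, assuming rf is your fitted RandomForestClassifier and feature_names is your own list of column names:

import numpy as np

tree = rf.estimators_[0]  # first tree in the forest
used = np.flatnonzero(tree.feature_importances_ > 0)  # features with non-zero importance were used in at least one split
print("features used by tree 0:", [feature_names[i] for i in used])
print("max_features per split:", tree.max_features_)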
I have my own precomputed data for running AP or KMeans in Python. However, when I go to run predict() (I would like to run a train and test on the data to see whether the clusterings have good accuracy on the classes or clusters), Python tells me that predict() is not available for "precomputed" data.
Is there another way to run a train / test on clustered data in python?
Most clustering algorithms, including AP, have no well-defined way to "predict" on new data. K-means is one of the few cases simple enough to allow a "prediction" consistent with the initial clusters.
Now sklearn has this oddity of trying to squeeze everything into a supervised API. Clustering algorithms have a fit(X, y) method, but ignore y, and are supposed to have a predict method even though the algorithms don't have such a capability.
For affinity propagation, someone at some point decided to add a predict based on k-means: it always predicts the nearest center. Computing the mean is only possible with coordinate data, and hence the method fails with affinity='precomputed'.
If you want to replicate this behavior, compute the distances to all cluster centers and choose the argmin, that's all. You can't fit this into the sklearn API easily with "precomputed" metrics. You could require the user to pass a distance vector to all "training" examples for the precomputed metric, but only a few of them are needed...
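A rough sketch of that idea (the names here are assumptions): ap is a fitted AffinityPropagation(affinity="precomputed") and dist_to_train is an (n_new, n_train) matrix of distances from your new points to the training points:

import numpy as np

exemplar_idx = ap.cluster_centers_indices_  # training indices of the exemplars
labels = np.argmin(dist_to_train[:, exemplar_idx], axis=1)  # nearest exemplar = predicted cluster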
In my opinion, I'd rather remove this method altogether:
It does not appear in any published research on affinity propagation that I know of.
Affinity propagation is based on concepts of similarity ("affinity") not on distance or means
This predict will not return the same results as the points were labeled by AP, because AP is labeling points using a "propagated responsibility", rather than the nearest "center". (The current sklearn implementation may be losing this information...)
Clustering methods don't have a consistent predict anyway - it's not a requirement to have this.
If you want to do this kind of prediction, just pass the cluster centers to a nearest neighbor classifier. That is what is re-implemented here, a hidden NN classifier. So you get more flexibility if you make prediction a second (classification) step.
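A minimal sketch of that second classification step, assuming coordinate (non-precomputed) data, with ap a fitted AffinityPropagation and X_new your new samples:

from sklearn.neighbors import KNeighborsClassifier

centers = ap.cluster_centers_  # exemplar coordinates (coordinate data only)
nn = KNeighborsClassifier(n_neighbors=1).fit(centers, list(range(len(centers))))  # label each center with its cluster id
new_labels = nn.predict(X_new)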
Note that in clustering it is not common to do any test-train split, because you don't use the labels anyway and only use unsupervised evaluation methods (if any at all, because these have their own array of issues). You cannot reliably do "hyperparameter optimization" here, but have to choose parameters based on experience and humans looking at the data.
I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: let us say that we run LogisticRegression on a dataset with the best hyperparameters (found through GridSearch) and want to run another classifier like SVC (an SVM classifier) on the same dataset. Instead of finding all hyperparameters using GridSearch, can we fix some values (or reduce their ranges to limit the search space for GridSearch)?
As an experiment, I used scikit-learn classifiers like LogisticRegression, SVC, LinearSVC, SGDClassifier and Perceptron to classify some well-known datasets. In some cases, I am able to see some correlation empirically, but not always for all datasets.
So please help me to clear this point.
I don't think you can correlate different parameters of different classifiers like this. This is mainly because each classifier behaves differently, as it has its own way of fitting the data according to its own set of equations. For example, take the case of SVC with two different kernels, rbf and sigmoid. It might be the case that the rbf kernel fits perfectly over the data with the regularization parameter C set to, say, 0.001, while the sigmoid kernel over the same data may fit best with a C value of 0.00001. Both values may also turn out to be equal, but you can never say that for sure. When you say that:
In some cases, I am able to see some correlation empirically, but not always for all datasets.
It may simply be a coincidence, since it all depends on the data and the classifiers. You cannot apply it globally. Correlation does not always imply causation.
You can visit this site and see for yourself that although different regressor functions have the same parameter a, their equations are vastly different, and hence over the same dataset you might get drastically different values of a.
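To make the kernel example above concrete, here is an illustrative sketch only (X_train / y_train stand for your own data); the point is simply that the same search run on two kernels will typically land on different C values:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = {"C": [1e-5, 1e-3, 1e-1, 1, 10]}
for kernel in ("rbf", "sigmoid"):
    search = GridSearchCV(SVC(kernel=kernel), grid, cv=5).fit(X_train, y_train)
    print(kernel, search.best_params_)  # the best C usually differs per kernel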
self-learner here.
I am building a web application that predicts events.
Let's consider this quick example.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
How can I keep the state of neigh so that when I enter a new value like neigh.predict([[1.2]]) I don't need to re-train the model? Is there any good practice, or a hint to start solving the problem?
You've chosen a slightly confusing example for a couple of reasons. First, when you say neigh.predict([[1.2]]), you aren't adding a new training point, you're just doing a new prediction, so that doesn't require any changes at all. Second, KNN algorithms aren't really "trained" -- KNN is an instance-based algorithm, which means that "training" amounts to storing the training data in a suitable structure. As a result, this question has two different answers. I'll try to answer the KNN question first.
K Nearest Neighbors
For KNN, adding new training data amounts to appending new data points to the structure. However, it appears that scikit-learn doesn't provide any such functionality. (That's reasonable enough -- since KNN explicitly stores every training point, you can't just keep giving it new training points indefinitely.)
If you aren't using many training points, a simple list might be good enough for your needs! In that case, you could skip sklearn altogether, and just append new data points to your list. To make a prediction, do a linear search, saving the k nearest neighbors, and then make a prediction based on a simple "majority vote" -- if out of five neighbors, three or more are red, then return red, and so on. But keep in mind that every training point you add will slow the algorithm.
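A tiny sketch of that list-based approach (binary labels, Euclidean distance; every name here is made up for illustration):

import math
from collections import Counter

train = []  # list of (features, label) pairs

def add_point(x, y):
    train.append((x, y))  # "training" is just appending the new point

def predict(x, k=5):
    neighbors = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]  # linear search for the k closest
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # simple majority vote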
If you need to use many training points, you'll want to use a more efficient structure for nearest neighbor search, like a K-D Tree. There's a scipy K-D Tree implementation that ought to work. The query method allows you to find the k nearest neighbors. It will be more efficient than a list, but it will still get slower as you add more training data.
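A rough sketch of the scipy route (the data here is invented); note that scipy's tree is static once built, so adding points means rebuilding it:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 3)  # stand-in training features
labels = np.random.randint(0, 2, size=1000)  # stand-in training labels

tree = cKDTree(points)
dist, idx = tree.query([[0.5, 0.5, 0.5]], k=5)  # indices of the 5 nearest training points
prediction = np.bincount(labels[idx[0]]).argmax()  # majority vote among them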
Online Learning
A more general answer to your question is that you are (unbeknownst to yourself) trying to do something called online learning. Online learning algorithms allow you to use individual training points as they arrive, and discard them once they've been used. For this to make sense, you need to be storing not the training points themselves (as in KNN) but a set of parameters, which you optimize.
This means that some algorithms are better suited to this than others. sklearn provides only a few algorithms capable of online learning. These all have a partial_fit method that will allow you to pass training data in batches. SGDClassifier with 'hinge' or 'log' loss is probably a good starting point.
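A hedged sketch of that online-learning route (the batches below are invented for illustration):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")  # 'hinge' loss; a logistic-style loss is also available
classes = np.array([0, 1])  # all possible labels must be declared on the first call

for X_batch, y_batch in [(np.array([[0], [1]]), np.array([0, 0])),
                         (np.array([[2], [3]]), np.array([1, 1]))]:
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict([[1.1]]))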
Or maybe you just want to save your model after fitting
import joblib

joblib.dump(neigh, FName)  # FName is the path where the model should be saved
and load it when needed
neigh = joblib.load(FName)
neigh.predict([[1.1]])