Sklearn KNeighborsClassifier gives different results when using KNeighborsTransformer? - python

I am trying to get my head around how to use KNeighborsTransformer correctly, so I am using the Iris dataset to test it.
However, I find that when I use KNeighborsTransformer before the KNeighborsClassifier I get different results than using KNeighborsClassifier directly.
When I plot the decision boundaries, they are similar, but different.
I have specified the metric and the weights explicitly, so that cannot be the problem.
Why do I get this difference?
Does it have something to do with whether they count a point as its own nearest neighbour?
Or does it have something to do with metric='precomputed'?
Below is the code I use to compare the two classifiers.
import numpy as np
from sklearn import neighbors, datasets
from sklearn.pipeline import make_pipeline

# import data
iris = datasets.load_iris()

# We only take the first two features.
X = iris.data[:, :2]
y = iris.target

n_neighbors = 15
knn_metric = 'minkowski'
knn_mode = 'distance'

# With KNeighborsTransformer in front of KNeighborsClassifier
estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        # one extra neighbour should already be computed when mode == 'distance',
        # and the extra neighbour should be filtered by the following KNeighborsClassifier
        n_neighbors=n_neighbors + 1,
        metric=knn_metric,
        mode=knn_mode),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors, metric='precomputed'))
estimator.fit(X, y)
print(estimator.score(X, y))  # 0.82

# With just KNeighborsClassifier
clf = neighbors.KNeighborsClassifier(
    n_neighbors,
    weights=knn_mode,
    metric=knn_metric)
clf.fit(X, y)
print(clf.score(X, y))  # 0.9266666666666666

Your pipeline uses the default uniform vote in its final KNeighborsClassifier, while your direct approach uses the distance-weighted vote. Making them match (either both distance or both uniform) almost makes the behaviour match; the remaining difference seems to come from tie-breaking among nearest neighbours. I'm not sure yet why the tie-breaking happens differently in the two cases, but it's likely not a big issue on more realistic datasets.
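For example, once the pipeline's KNeighborsClassifier is given the same weights as the direct classifier, the two scores line up almost exactly; a quick sketch reusing the variables from the question:
# Same pipeline as above, but with the vote made explicit: weights='distance'
# matches the knn_mode used by the direct classifier.
estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        n_neighbors=n_neighbors + 1,
        metric=knn_metric,
        mode=knn_mode),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors,
        metric='precomputed',
        weights='distance'))
estimator.fit(X, y)

clf = neighbors.KNeighborsClassifier(
    n_neighbors, weights='distance', metric=knn_metric)
clf.fit(X, y)

# The two scores should now agree up to neighbour tie-breaking.
print(estimator.score(X, y), clf.score(X, y))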

Related

Large Margin Classifier

I am building a classifier to maximize the margin between positively and negatively labelled points.
I am using sklearn.LinearSVC to do this. I have to find both the weights (a vector, theta) and the intercept (a scalar, theta_0). I also need to calculate the maximum margin, so I wrote the code below.
import numpy as np
import sklearn
from sklearn.svm import LinearSVC
# training data
X_train = np.array([[0,0],[2,0],[3,0],[0,2],[2,2],[5,1],[5,2],[2,4],[4,4],[5,5]])
y_train = [-1,-1,-1,-1,-1,1,1,1,1,1]
classifier = LinearSVC(random_state = 0, C=1.0, fit_intercept= True)
classifier.fit(X_train, y_train)
theta = classifier.coef_
theta_0.intercept_
norm = np.linalg.norm(theta)
margin = 2/norm
As per my understanding, LinearSVC is the right package for this; though I see some tutorials in which people use SVC and then kernel = 'linear'.
I am not sure whether I should set the fit_intercept parameter to True. I get different values for theta and theta_0 when I set it to False instead.
Can somebody help me understand this parameter, tell me whether the margin calculation is correct, and whether LinearSVC is the right model? Thanks.
This statement is wrong:
theta_0.intercept_
I assume that it should be:
theta_0 = classifier.intercept_
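With that fix, the rest of the snippet works as written; for completeness, a corrected version of the relevant lines (note that 2/||theta|| is the geometric margin only when the data are separable and C is large enough that no points violate the margin):
theta = classifier.coef_          # weight vector, shape (1, n_features)
theta_0 = classifier.intercept_   # intercept, shape (1,)
norm = np.linalg.norm(theta)
margin = 2 / norm                 # margin width, 2 / ||theta||
print(theta, theta_0, margin)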

Backward Elimination on large datasets in python

I took an online course where the instructor explained backward elimination using a (50, 5) dataset, where you eliminate the columns manually by looking at their p-values.
import statsmodels.api as sm
X = np.append(arr = np.ones((2938, 1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Second Step
X_opt = X[:, [0,1,3,4,5]]  # column 2 eliminated
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# and so on
Now, while practicing on a large dataset such as the (2938, 214) one I have, do I have to eliminate all the columns myself? That is a lot of work, so is there some sort of algorithm or way to do it automatically?
This might be a stupid question but I am a beginner in machine learning, so any help is appreciated. Thanks
What you are trying to do is called "Recursive Feature Elimination", RFE for short.
Example from sklearn.feature_selection.RFE:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
This would eliminate features one by one using SVR until only the 5 most important are left. You can use any estimator that exposes a coef_ or feature_importances_ attribute.
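As a small usage follow-up, the fitted selector tells you which columns survived and can drop the rest for you:
print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 = selected, larger = eliminated earlier
X_reduced = selector.transform(X)   # keep only the selected columns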
When it comes to p-values, you could eliminate all features whose p-values are greater than some threshold (the null hypothesis being that the coefficient has no effect, i.e. is zero), but see below.
Just remember that the coefficient weights will usually change as some features are removed (as here or in RFE), so the result is only an approximation that depends on many factors. You could also do other preprocessing, like removing correlated features or using OLS with an L1 penalty, which keeps only the most informative features.
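If you would rather automate the p-value-based elimination from the course instead of using RFE, a minimal sketch could look like the helper below; it assumes X already contains the column of ones and y is defined, and the function name and the 0.05 significance level are just illustrative choices:
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Repeatedly drop the column whose p-value is highest and above the threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] > significance_level:
            del cols[worst]      # eliminate the least significant column
        else:
            break                # every remaining column is significant
    return cols, model

kept_columns, final_model = backward_elimination(X, y)
print(kept_columns)
print(final_model.summary())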

Should GridSearchCV score results be equal to score of cross_validate using same input?

I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination from a grid search I had carried out.
For the grid search I used the GridSearchCV class, and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same, since the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC

np.random.seed(2018)

# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)

clf = NuSVC(nu=0.4, gamma='auto')

# Compute score for one parameter combination
grid = GridSearchCV(clf,
                    cv=StratifiedKFold(n_splits=10, random_state=2018),
                    param_grid={'nu': [0.4]},
                    scoring=['f1_macro'],
                    refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])

# Recompute score for exact same input
result = cross_validate(clf,
                        X,
                        y,
                        cv=StratifiedKFold(n_splits=10, random_state=2018),
                        scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
This is because mean_test_f1_macro is not a simple average over the folds; it is a weighted average, with the weights being the sizes of the test folds. To learn more about the actual implementation, refer to this answer.
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result', grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights = [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X, y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865

sklearn agglomerative clustering: dynamically updating the number of clusters

The documentation for sklearn.cluster.AgglomerativeClustering mentions that,
when varying the number of clusters and using caching,
it may be advantageous to compute the full tree.
This seems to imply that it is possible to first compute the full tree, and then quickly update the number of desired clusters as necessary, without recomputing the tree (with caching).
However this procedure for changing the number of clusters does not seem to be documented. I would like to do this but am unsure how to proceed.
Update: To clarify, the fit method does not take number of clusters as an input:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit
You set a caching directory with the parameter memory='mycachedir', and if you also set compute_full_tree=True, then when you rerun fit with different values of n_clusters it will use the cached tree rather than recomputing it each time. To give you an example of how to do this with sklearn's grid search API:
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV

ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)
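Outside of a grid search, the same caching idea is just refitting with a new n_clusters while keeping the memory setting; a minimal sketch:
from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(memory='mycachedir', compute_full_tree=True,
                             n_clusters=2)
labels_2 = ac.fit_predict(X)

ac.set_params(n_clusters=4)
labels_4 = ac.fit_predict(X)   # reuses the cached tree, only the cut is recomputed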
I know it's an old question; however, the solution below might still be helpful:
# scores = input matrix
from scipy.cluster.hierarchy import linkage, cut_tree
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

linkage_mat = linkage(scores, method="ward")
euc_scores = euclidean_distances(scores)

n_l = 2
n_h = scores.shape[0]
silh_score = -2

# Selecting the best number of clusters based on the silhouette score
for i in range(n_l, n_h):
    local_labels = list(cut_tree(linkage_mat, n_clusters=i).flatten())
    sc = silhouette_score(
        euc_scores,
        metric="precomputed",
        labels=local_labels,
        random_state=42)
    if silh_score < sc:
        silh_score = sc
        labels = local_labels
        n_clusters = len(set(labels))

print(f"Optimal number of clusters: {n_clusters}")
print(f"Best silhouette score: {silh_score}")
# ...
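To tie this back to the sklearn estimator in the question, the chosen number of clusters can then be fed to AgglomerativeClustering directly (a small sketch, assuming the same scores matrix as above):
from sklearn.cluster import AgglomerativeClustering

# Refit with the n_clusters selected by the silhouette loop above.
final_model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
final_labels = final_model.fit_predict(scores)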

Train scikit SVM, customize score assessment

I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors: one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm

clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    preds[i] = clf.predict(observations[i:i + 1, :])[0]  # keep the input 2-D for predict
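To turn those collected predictions into an overall leave-one-out accuracy (a small addition, assuming labels holds the true classes):
loo_accuracy = np.mean(preds == labels)   # fraction of held-out points classified correctly
print(loo_accuracy)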
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

clf = svm.SVC(...)
loo = LeaveOneOut()
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
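As a sketch of that KFold alternative, assuming the same observations and labels (the linear kernel and 5 folds are just illustrative choices):
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score

clf = svm.SVC(kernel='linear')
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, observations, labels, cv=cv)
print(scores.mean(), scores.std())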
