The documentation for sklearn.cluster.AgglomerativeClustering mentions that,
when varying the number of clusters and using caching,
it may be advantageous to compute the full tree.
This seems to imply that it is possible to compute the full tree once (with caching), and then quickly change the number of desired clusters as necessary without recomputing the tree.
However, this procedure for changing the number of clusters does not seem to be documented. I would like to do this but am unsure how to proceed.
Update: To clarify, the fit method does not take number of clusters as an input:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit
You set a caching directory with the parameter memory='mycachedir', and if you also set compute_full_tree=True, then when you rerun fit with different values of n_clusters, it will use the cached tree rather than recomputing it each time. To give you an example of how to do this with sklearn's grid search API:
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)
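If you don't need a grid search, a minimal sketch of the same idea (with toy data standing in for your X): fit once, then change n_clusters via set_params and refit; with memory set and compute_full_tree=True the cached tree is reused.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(50, 3)  # toy data for illustration

# First fit computes the full merge tree and caches it in 'mycachedir'
ac = AgglomerativeClustering(n_clusters=2, memory='mycachedir',
                             compute_full_tree=True)
labels_2 = ac.fit_predict(X)

# Refit with a different n_clusters: the cached tree is loaded from disk,
# so only the cut of the tree is recomputed
labels_5 = ac.set_params(n_clusters=5).fit_predict(X)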
I know it's an old question, but the solution below might turn out to be helpful.
# scores = input matrix
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import cut_tree
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

linkage_mat = linkage(scores, method="ward")
euc_scores = euclidean_distances(scores)

n_l = 2
n_h = scores.shape[0]
silh_score = -2  # start below the minimum possible silhouette score of -1

# Selecting the best number of clusters based on the silhouette score
for i in range(n_l, n_h):
    local_labels = list(cut_tree(linkage_mat, n_clusters=i).flatten())
    sc = silhouette_score(
        euc_scores,
        metric="precomputed",
        labels=local_labels,
        random_state=42)
    if silh_score < sc:
        silh_score = sc
        labels = local_labels

n_clusters = len(set(labels))
print(f"Optimal number of clusters: {n_clusters}")
print(f"Best silhouette score: {silh_score}")
# ...
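If you then want the final clustering as an sklearn estimator, here is a short sketch (assuming the scores matrix above; sklearn's ward linkage performs the same merges as the scipy ward linkage used here, though cluster label numbering may differ):

from sklearn.cluster import AgglomerativeClustering

final_labels = AgglomerativeClustering(n_clusters=n_clusters,
                                       linkage="ward").fit_predict(scores)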
I am trying to get my head around how to use KNeighborsTransformer correctly, so I am using the Iris dataset to test it.
However, I find that when I use KNeighborsTransformer before the KNeighborsClassifier I get different results than using KNeighborsClassifier directly.
When I plot the decision boundaries, they are similar, but different.
I have set the metric, weights and mode explicitly, so that cannot be the problem.
Why do I get this difference?
Does it have something to do with whether they count a point as its own nearest neighbour?
Or does it have something to do with the metric='precomputed'?
Below is the code I use to consider the two classifiers.
import numpy as np
from sklearn import neighbors, datasets
from sklearn.pipeline import make_pipeline
# import data
iris = datasets.load_iris()
# We only take the first two features.
X = iris.data[:, :2]
y = iris.target
n_neighbors = 15
knn_metric = 'minkowski'
knn_mode = 'distance'
# Estimator with KNeighborsTransformer
estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        # one extra neighbour is computed when mode == 'distance'; the extra
        # neighbour should then be filtered out by the following KNeighborsClassifier
        n_neighbors=n_neighbors + 1,
        metric=knn_metric,
        mode=knn_mode),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors, metric='precomputed'))
estimator.fit(X, y)
print(estimator.score(X, y))  # 0.82
# With just KNeighborsClassifier
clf = neighbors.KNeighborsClassifier(
    n_neighbors,
    weights=knn_mode,
    metric=knn_metric)
clf.fit(X, y)
print(clf.score(X, y)) # 0.9266666666666666
Your pipeline approach uses the default uniform vote, but your direct approach uses the distance-weighted vote. Making them match (either both 'distance' or both 'uniform') brings the behaviour almost into agreement; the remaining difference appears to be in how ties among nearest neighbours are broken. I'm not sure yet why the tie-breaking happens differently in the two cases, but it's likely not a big issue on more realistic datasets.
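As a quick check, here is a sketch that gives both approaches the same distance-weighted vote (reusing X, y, n_neighbors and knn_metric from the question):

from sklearn import neighbors
from sklearn.pipeline import make_pipeline

estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        n_neighbors=n_neighbors + 1, metric=knn_metric, mode='distance'),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors, metric='precomputed',
        weights='distance'))  # now matches the direct classifier
estimator.fit(X, y)

clf = neighbors.KNeighborsClassifier(
    n_neighbors, weights='distance', metric=knn_metric)
clf.fit(X, y)

# The two scores should now (almost) agree
print(estimator.score(X, y), clf.score(X, y))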
Why does my KNN classifier, built from scratch with numpy, give different results than sklearn's KNeighborsClassifier? What is wrong with my code?
# Compute Euclidean distances and return the most common class label
# among the k nearest neighbours for the given k.
def k_neighbors(self, x):
    lengths = [self.euclidean_length(x, x_train) for x_train in self.X_training]
    k_index = np.argsort(lengths)[: self.k]
    k_nearest_labels = [self.y_training[i] for i in k_index]
    counts = np.bincount(k_nearest_labels)
    most_common_label = np.argmax(counts)
    return most_common_label
# running KNN classifier with K=5 to fit the data and make predictions.
classifier1 = KNN_Classifier(k=5)
classifier1.fit(X_training, y_training)
predicted1 = classifier1.predicting(X_test)
They both apparently do the same thing, but I get different outcomes. Where is the bug in my code?
from sklearn.neighbors import KNeighborsClassifier
classifier2 = KNeighborsClassifier(n_neighbors=5, algorithm='brute', p=2)
classifier2.fit(X_training, y_training)
predicted2 = classifier2.predict(X_test)
Based on the sklearn documentation, there are a few things to check:
Distance metric: your implementation uses the Euclidean distance, while sklearn by default uses Minkowski. Note, though, that Minkowski with p=2 reduces to Euclidean, so with your settings the metrics should agree.
To find the k nearest neighbours, sklearn by default chooses among kd_tree, ball_tree and brute force; your k_neighbors() function always uses brute force. Since you pass algorithm='brute', this should match as well.
Last but not least, double-check that the k value is the same in both runs; any remaining differences most likely come from ties in distance or in the vote, which np.argsort/np.argmax and sklearn may break differently.
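To see how large the discrepancy actually is, a small sketch (assuming predicted1 and predicted2 from the snippets above):

import numpy as np

# Indices where the two classifiers disagree
diff = np.flatnonzero(np.asarray(predicted1) != np.asarray(predicted2))
print(f"{len(diff)} differing predictions at indices {diff}")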
I'm using sklearn's KMeans algorithm to group observations into 4 clusters, and I have set the init parameters and a random_state seed so as to always obtain the same results. However, each time I reload the code in Google Colab and rerun the training, I obtain different results in terms of the number of observations in each cluster. Here is the code:
import numpy as np
np.random.seed(5)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4,init='k-means++',n_init=1,max_iter=3000,random_state=354)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
How can I always obtain the same results (in terms of the number of observations in each cluster)?
Thank you in advance.
Here's the relevant passage from the docs:
If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.
To get a good handle on max_iter, see k_means from sklearn.cluster: setting return_n_iter=True additionally returns best_n_iter, the number of iterations corresponding to the best result.
Here's an example:
from sklearn.cluster import k_means
# With return_n_iter=True, k_means returns (centers, labels, inertia, best_n_iter)
centroids, labels, inertia, best_n_iter = k_means(X, n_clusters=2, init='k-means++', random_state=0, return_n_iter=True)
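As a follow-up (a sketch reusing best_n_iter from above), you can then give KMeans enough iterations to converge fully, which keeps labels_ consistent with predict on the training set:

from sklearn.cluster import KMeans

# Allow a margin over the observed best_n_iter so the run fully converges
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=1,
                max_iter=best_n_iter + 10, random_state=0).fit(X)
y_kmeans = kmeans.predict(X)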
I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination of a carried out grid search.
For the grid search, I used the GridSearchCV class and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected scores: to my understanding they should be exactly the same, since the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
                    cv=StratifiedKFold(n_splits=10, random_state=2018),
                    param_grid={'nu': [0.4]},
                    scoring=['f1_macro'],
                    refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])
# Recompute score for exact same input
result = cross_validate(clf,
                        X,
                        y,
                        cv=StratifiedKFold(n_splits=10, random_state=2018),
                        scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
It is because mean_test_f1_macro is not a simple average over the folds; it is a weighted average, with the weights being the sizes of the test folds. (In older versions of scikit-learn this behaviour was controlled by the iid parameter of GridSearchCV.)
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights= [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X,y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
I am working on a numerical dataset using KNN Classifier of sklearn package.
Once the prediction is complete, the top 4 important variables should be displayed in a bar graph.
Here is the solution I have tried, but it throws an error saying that feature_importances_ is not an attribute of KNeighborsClassifier:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)

# AttributeError: 'KNeighborsClassifier' object has no attribute 'feature_importances_'
(pd.Series(neigh.feature_importances_, index=X_test.columns)
 .nlargest(4)
 .plot(kind='barh'))
To display the variable-importance graph for a decision tree, the argument passed to pd.Series() is classifier.feature_importances_.
For SVM and linear discriminant analysis, the argument passed to pd.Series() is classifier.coef_[0].
However, I am unable to find a suitable argument for KNN classifier.
Feature importance is not defined for the KNN Classification algorithm. There is no easy way to compute the features responsible for a classification here. What you could do is use a random forest classifier which does have the feature_importances_ attribute. Even in this case though, the feature_importances_ attribute tells you the most important features for the entire model, not specifically the sample you are predicting on.
If you are set on using KNN, though, then the best way to estimate feature importance is to take the sample you want to predict and compute its distance from each of its nearest neighbours for each feature (call these neighb_dist). Do the same computation for a few random points instead of the nearest neighbours (call these rand_dist). Then, for each feature, take the ratio neighb_dist / rand_dist: the smaller the ratio, the more important that feature is.
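A minimal sketch of that heuristic (an illustration only; knn_feature_importance is a hypothetical helper, not an sklearn API):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_feature_importance(X_train, x, k=3, n_random=50, random_state=0):
    # Per-feature mean distance to the k nearest neighbours vs. to random
    # points; a smaller ratio means the feature matters more locally.
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    neighbours = X_train[idx[0]]
    random_pts = X_train[rng.integers(0, len(X_train), size=n_random)]
    neighb_dist = np.abs(neighbours - x).mean(axis=0)
    rand_dist = np.abs(random_pts - x).mean(axis=0)
    return neighb_dist / rand_dist

# Example: ratios = knn_feature_importance(np.asarray(X_train), np.asarray(X_test)[0])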
Here is a good, generic example.
#importing libraries
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

# Loading the dataset (note: load_boston was removed in scikit-learn 1.2)
x = load_boston()
df = pd.DataFrame(x.data, columns=x.feature_names)
df["MEDV"] = x.target
X = df.drop("MEDV", axis=1)  # Feature Matrix
y = df["MEDV"] #Target Variable
df.head()
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = coef.sort_values()
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")
All details are listed below.
https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
Here are two more great examples of the same.
https://www.scikit-yb.org/en/latest/api/features/importances.html
https://github.com/WillKoehrsen/feature-selector/blob/master/Feature%20Selector%20Usage.ipynb