KMeans clustering results change for each training run - python

I'm using sklearn's KMeans algorithm to group multiple observations into 4 clusters, and I have set the initialization and seed so that I always obtain the same results; but each time I reload the code in Google Colab and rerun the training, I get different results in terms of the number of observations in each cluster. Here is the code:
import numpy as np
np.random.seed(5)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=1, max_iter=3000, random_state=354)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
How can I always obtain the same results (in terms of the number of observations in each cluster)?
Thank you in advance

Here's the relevant passage from the docs:
If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.
To get a good handle on max_iter, see k_means from sklearn.cluster. Setting return_n_iter to True also returns best_n_iter, which corresponds to the number of iterations needed to obtain the best result.
Here's an example:
from sklearn.cluster import k_means
centroids, labels, inertia, best_n_iter = k_means(X, n_clusters=2, init='k-means++', random_state=0, return_n_iter=True)
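As a sanity check, you can also verify that KMeans itself is deterministic for a fixed random_state: two fits on the same X should give identical cluster sizes. Here is a minimal sketch, assuming X is your feature matrix; note that a larger n_init (this sketch uses 10 instead of your 1) restarts k-means several times and keeps the best run, which makes the solution less sensitive to initialization:
import numpy as np
from sklearn.cluster import KMeans

def cluster_sizes(X, seed=354):
    km = KMeans(n_clusters=4, init='k-means++', n_init=10,
                max_iter=3000, random_state=seed)
    km.fit(X)
    # number of observations assigned to each of the 4 clusters
    return np.bincount(km.labels_, minlength=4)

# with the same data and the same random_state, both calls should print identical counts
print(cluster_sizes(X))
print(cluster_sizes(X))
If the counts still differ between Colab sessions, the variation is most likely coming from X itself (e.g. non-deterministic preprocessing or row ordering) rather than from KMeans.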

Getting probability values for random forest and Gradient Boosting in python

I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some help from code available online, I tried to write code in python3 for random forest and GBM. My objective is to get the probability values from the model and not just look at accuracy, as I intend to use the probability values to create a KS table later on.
I used the readily available titanic data set to start practicing.
Following are some of the steps I did:
import pandas as pd
# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerId']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two different models, I notice that for the random forest output a significant chunk has a probability score of 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with a probability score of 0, my KS table goes for a toss.
Welcome to SO! Since you don't seem to be having an issue with code execution specifically, or with totally incorrect outputs, this looks like it is more appropriate for Cross Validated, where you can find answers to questions about statistical concerns.
In fact, I'd suggest that answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do is very different, so a direct comparison of the model parameters is not necessarily appropriate.
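To see the difference concretely, here is a small sketch on synthetic data (not your Titanic features, so purely illustrative): a random forest's predict_proba is the average of per-tree class estimates, and since fully grown trees fit the training set almost perfectly, many training-set probabilities end up exactly 0 or 1, whereas gradient boosting passes an additive score through a logistic function and produces a much more continuous spread of values:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=10)

rf = RandomForestClassifier(random_state=10).fit(X, y)
gbm = GradientBoostingClassifier(random_state=10).fit(X, y)

rf_prob = rf.predict_proba(X)[:, 1]
gbm_prob = gbm.predict_proba(X)[:, 1]

# fraction of training points with a probability of exactly 0 for class 1,
# and how many distinct probability values each model produces
print('RF :', np.mean(rf_prob == 0.0), 'distinct values:', len(np.unique(rf_prob)))
print('GBM:', np.mean(gbm_prob == 0.0), 'distinct values:', len(np.unique(gbm_prob)))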

How to make this KNN code faster in google colab or any other ipython based environment?

I am using Google Colaboratory for KNN classification of the DonorsChoose dataset. When I apply KNeighborsClassifier to the avgw2v and tfidf datasets, the following code takes around 4 hours to execute.
I have already tried running it on kaggle notebooks, still the issue persists.
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
train_auc_set3 = []
cv_auc_set3 = []
K = [51, 101]
for i in tqdm(K):
    neigh = KNeighborsClassifier(n_neighbors=i, n_jobs=-1)
    neigh.fit(X_tr_set3, y_train)
    # batch_predict is a helper defined elsewhere in the notebook
    y_train_set3_pred = batch_predict(neigh, X_tr_set3)
    y_cv_set3_pred = batch_predict(neigh, X_cr_set3)
    train_auc_set3.append(roc_auc_score(y_train, y_train_set3_pred))
    cv_auc_set3.append(roc_auc_score(y_cv, y_cv_set3_pred))
plt.plot(K, train_auc_set3, label='Train AUC')
plt.plot(K, cv_auc_set3, label='CV AUC')
plt.scatter(K, train_auc_set3, label='Train AUC points')
plt.scatter(K, cv_auc_set3, label='CV AUC points')
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
This may be inherently slow. I'm not terribly familiar with this dataset but glancing at it on Kaggle it looks like it contains over 4 million datapoints. From the sklearn page on KNN:
For each iteration, time complexity is O(n_components x n_samples x min(n_samples, n_features)).
Also keep in mind that for a large dataset, knn is going to have to measure the distance between a given datapoint and all datapoints in the training set in order to make a prediction, which is computationally expensive.
Using large values of k on a very large dataset may also give very poor performance. What I might do is:
first see how much time fitting KNN with a single value of k takes, and how long making predictions for the training set with that single value of k takes. If that alone takes a long time, then that's your problem, as I suspect (see the timing sketch below).
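A minimal sketch of that timing check, assuming X_tr_set3 and y_train are your training matrix and labels (since batch_predict is your own helper, this just times a plain predict_proba on a slice and extrapolates):
import time
from sklearn.neighbors import KNeighborsClassifier

start = time.time()
neigh = KNeighborsClassifier(n_neighbors=51, n_jobs=-1)
neigh.fit(X_tr_set3, y_train)  # fitting KNN mostly just stores the data
print('fit: %.1f s' % (time.time() - start))

start = time.time()
neigh.predict_proba(X_tr_set3[:1000])  # prediction is where the distance computations happen
per_row = (time.time() - start) / 1000
print('predict: ~%.4f s per row, ~%.0f s for the full training set'
      % (per_row, per_row * X_tr_set3.shape[0]))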
Unfortunately sometimes for very large datasets we are constrained in our choice of algorithm by the time complexity of the algorithms we might like to use. Kernel ridge regression, for example, is a great algorithm that just doesn't scale well to large datasets because it has O(N^3) time complexity.

Should GridSearchCV score results be equal to score of cross_validate using same input?

I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination of a grid search I had carried out.
For the grid search, I used the GridSearchCV class and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same, since the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
                    cv=StratifiedKFold(n_splits=10, random_state=2018),
                    param_grid={'nu': [0.4]},
                    scoring=['f1_macro'],
                    refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])
# Recompute score for exact same input
result = cross_validate(clf,
                        X,
                        y,
                        cv=StratifiedKFold(n_splits=10, random_state=2018),
                        scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
It is because mean_test_f1_macro is not a simple average over all folds; it is a weighted average, with the weights being the sizes of the test folds. To learn more about the actual implementation, refer to this answer.
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights = [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X, y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
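The reason the weighted and simple means differ at all here is that StratifiedKFold preserves the class proportions in each fold, so with 100 samples whose classes are not exactly balanced the test folds are not all the same size. Printing the weights computed above makes this visible:
print(weights)  # the test folds differ slightly in size, which is why the two means differ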

sklearn agglomerative clustering: dynamically updating the number of clusters

The documentation for sklearn.cluster.AgglomerativeClustering mentions that,
when varying the number of clusters and using caching,
it may be advantageous to compute the full tree.
This seems to imply that it is possible to first compute the full tree, and then quickly update the number of desired clusters as necessary, without recomputing the tree (with caching).
However this procedure for changing the number of clusters does not seem to be documented. I would like to do this but am unsure how to proceed.
Update: To clarify, the fit method does not take number of clusters as an input:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit
You set a caching directory with the parameter memory='mycachedir', and then if you set compute_full_tree=True, when you rerun fit with different values of n_clusters it will use the cached tree rather than recomputing it each time. To give you an example of how to do this with sklearn's grid search API:
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV
ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)
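If you don't want to go through GridSearchCV, a more direct sketch of the same idea (assuming X is your data, and variable names are just illustrative) is to keep one estimator with memory and compute_full_tree set, and simply re-fit it with different values of n_clusters; only the first fit builds the tree, later fits read it back from the cache:
from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(memory='mycachedir', compute_full_tree=True)

labels_by_k = {}
for k in range(2, 6):
    # set_params changes only n_clusters; the cached tree is reused on each fit
    labels_by_k[k] = ac.set_params(n_clusters=k).fit_predict(X)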
I know it's an old question, however the solution below might turn out to be helpful:
# scores = input matrix
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import cut_tree
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances
linkage_mat = linkage(scores, method="ward")
euc_scores = euclidean_distances(scores)
n_l = 2
n_h = scores.shape[0]
silh_score = -2
# Selecting the best number of clusters based on the silhouette score
for i in range(n_l, n_h):
    local_labels = list(cut_tree(linkage_mat, n_clusters=i).flatten())
    sc = silhouette_score(
        euc_scores,
        metric="precomputed",
        labels=local_labels,
        random_state=42)
    if silh_score < sc:
        silh_score = sc
        labels = local_labels
        n_clusters = len(set(labels))
print(f"Optimal number of clusters: {n_clusters}")
print(f"Best silhouette score: {silh_score}")
# ...

Train scikit SVM, customize score assessment

I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training of an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and then train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors - one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm
clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    preds[i] = clf.predict(observations[i:i + 1, :])
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score
clf = svm.SVC(...)
loo = LeaveOneOut()
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
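Since you actually want to leave out one experiment (a whole group of observations) at a time rather than a single observation, LeaveOneGroupOut may be a closer fit. Here is a minimal sketch, assuming a groups array that maps every observation to its experiment id:
import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

clf = svm.SVC()
logo = LeaveOneGroupOut()
# groups[i] is the experiment id of observation i (about 100 distinct ids in your case)
scores = cross_val_score(clf, observations, labels, groups=groups, cv=logo)
print(scores.mean())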
