Clustering results change on each run in Python scikit-learn - python

I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But every time I run it I get different results. I know this is a problem with initialization, but I don't know how to fix it. This is the part of my code that runs on the sentences:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import kneighbors_graph
from sklearn import cluster

vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize,
                             stop_words='english', charset_error="ignore",
                             ngram_range=(1, 5), min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      assign_labels="discretize")
spectral.fit(X)
data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results using spectral clustering? I also have the same problem with KMeans. This is my code for KMeans:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english',
                             charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100,
            n_init=1, verbose=0)
km.fit(X_data)
I appreciate your help.

When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.
km = KMeans(n_clusters=number_of_k, init='k-means++',
            max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)
This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
I'm not sure about the spectral clustering example, though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." The OP's code doesn't seem to fall under those cases, though setting the parameter might be worth a shot.
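Setting it would look like the sketch below; the parameters are taken from the question, and whether this removes all run-to-run variation in this exact arpack/discretize configuration is an assumption to test, not something the documentation promises:
# Hedged sketch: fix random_state on SpectralClustering. Per the docs it
# mainly affects the 'amg' solver and the k-means label assignment, so its
# effect on this configuration is an assumption worth verifying.
import numpy as np
from sklearn import cluster

X = np.random.RandomState(0).rand(50, 10)   # stand-in for the TF-IDF matrix
number_of_k = 3                             # stand-in for the question's value

spectral = cluster.SpectralClustering(n_clusters=number_of_k,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      assign_labels="discretize",
                                      random_state=3425)
spectral.fit(X)
print(spectral.labels_)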

As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.
The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.
In my opinion, when the results vary highly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may be not well separated; and other algorithms may yield better results.
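If you want to quantify how unstable your clustering is, one option (my suggestion, not part of the answer above) is to compare the labelings of two differently seeded runs with the adjusted Rand index:
# Hedged sketch: measure run-to-run stability of k-means with the adjusted
# Rand index (1.0 = identical clusterings, ~0 = no better than chance).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.RandomState(0).rand(100, 5)   # stand-in for your TF-IDF matrix

labels_a = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)

print(adjusted_rand_score(labels_a, labels_b))  # low value = unstable clustering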

I had a similar issue, but I wanted a data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in cluster 0 and all gray images were in cluster 1. For another data set, I want color images / gray images to end up in cluster 0 and cluster 1 as well.
Here is the code I stole from a Kaggler: in addition to setting random_state to a seed, you reuse the fitted KMeans model to cluster the other data set. This works reasonably well. However, I can't find an official scikit-learn document saying that.
# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
import numpy as np
from sklearn.cluster import KMeans

seed = 42

def create_color_clusters(img_df, cluster_count=2, cluster_maker=None):
    if cluster_maker is None:
        cluster_maker = KMeans(cluster_count, random_state=seed)
        cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])
    img_df['cluster-id'] = np.argmin(
        cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]), -1)
    return img_df, cluster_maker

# Now K-Means your images `img_df` into two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same KMeans model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)
However, even setting random_state to an int seed cannot ensure that the same data will always be assigned the same group number across machines. The same data may be clustered as group 0 on one machine and as group 1 on another. But at least with the same KMeans model (cluster_maker in my code) we make sure data from another distribution will be clustered in the same way as the original data set.
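If you also need the group numbers themselves to be stable, one workaround (my own suggestion, not from the Kaggle kernel) is to renumber clusters by a fixed property of the centroids, e.g. their norm:
# Hedged sketch: relabel clusters by sorting centroids, so cluster ids no
# longer depend on initialization order or the machine the code runs on.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 3)   # stand-in for the image features
km = KMeans(n_clusters=2, random_state=42).fit(X)

order = np.argsort(np.linalg.norm(km.cluster_centers_, axis=1))  # fixed centroid ordering
relabel = np.empty_like(order)
relabel[order] = np.arange(len(order))      # old cluster id -> rank-based id
stable_labels = relabel[km.labels_]         # ids now independent of init order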

Typically when running algorithms with many local minima it's common to take a stochastic approach and run the algorithm many times with different initial states. This will give you multiple results, and the one with the lowest error is usually chosen to be the best result.
When I use K-Means I always run it several times and use the best result.
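For k-means that might look like the sketch below, keeping the run with the lowest inertia (within-cluster sum of squared distances); note that scikit-learn's n_init parameter already does the same thing internally:
# Hedged sketch: run k-means several times and keep the best result.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 4)    # stand-in for your data

best = min(
    (KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
     for seed in range(10)),
    key=lambda km: km.inertia_,              # lowest within-cluster SSE wins
)
print(best.inertia_, best.labels_[:10])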

Related

Number of distinct clusters in KMeans is less than n_clusters?

I have some food images stored in a single folder. All the images are unlabeled, nor are they stored in separate folders such as "pasta" or "meat". My current goal is to cluster the images into a number of categories so that I can later assess if the taste of foods depicted in images of the same cluster is similar.
To do that, I load the images and process them in a format that can be fed into the VGG16 for feature extraction and then pass the features to my KMeans to cluster the images. The code I am using is:
import os
import glob
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from sklearn.cluster import KMeans

path = r'C:\Users\Hi\Documents\folder'
train_dir = os.path.join(path)
model = VGG16(weights='imagenet', include_top=False)
vgg16_feature_list = []
files = glob.glob(r'C:\Users\Hi\Documents\folder\*.jpg')
for i, img_path in enumerate(files):
    img = image.load_img(img_path, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)
    vgg16_feature = model.predict(img_data)
    vgg16_feature_np = np.array(vgg16_feature)
    vgg16_feature_list.append(vgg16_feature_np.flatten())
vgg16_feature_list_np = np.array(vgg16_feature_list)
print(vgg16_feature_list_np.shape)
print(vgg16_feature_np.shape)
kmeans = KMeans(n_clusters=3, random_state=0).fit(vgg16_feature_list_np)
print(kmeans.labels_)
The issue is that I get the following warning:
ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
How can I fix that?
This is one of those situations where, although your code is fine from a programming point of view, it does not produce satisfactory results due to an ML-related issue (data, model, or both), and hence it is rather difficult to "debug" (quotes intended, since the code itself runs fine and this is not the typical debugging procedure).
At first glance, the situation seems to imply that there is not enough diversity in your features to justify 3 different clusters. And, provided that we remain in a K-means context, there is not much you can do; among the few options available (refer to the documentation for details of the respective parameters):
Increase the number of iterations max_iter (default 300)
Increase the number of different centroid initializations n_init (default 10)
Change the init argument to random (the default is k-means++) or, even better, provide a 3-element array with one sample from each of your targeted clusters (if you already have an idea which these clusters may actually be in your data)
Run the model with different random_state values
Combine the above (a sketch combining several of these options follows after this list)
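For illustration, a hedged sketch combining several of these knobs; the specific values are placeholders, not recommendations:
# Hedged sketch: combine more iterations, more initializations, random init,
# and a fixed seed. Values are illustrative placeholders only.
import numpy as np
from sklearn.cluster import KMeans

features = np.random.RandomState(0).rand(40, 512)  # stand-in (reduced size) for vgg16_feature_list_np

kmeans = KMeans(n_clusters=3,
                init='random',    # instead of the default 'k-means++'
                n_init=50,        # more centroid initializations (default 10)
                max_iter=1000,    # more iterations per run (default 300)
                random_state=0).fit(features)
print(kmeans.labels_)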
If none of the above works, it would very likely mean that K-means is actually not applicable here, and you may have to look for alternative approaches (which are out of the scope of this thread). Truth is, as correctly pointed out in the comment below, K-means does not usually work well with data of such high dimensionality.

Scikit-Learn DBSCAN clustering yielding no clusters

I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
import numpy as np
import sklearn.cluster

covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?
You need to choose the parameter eps, too.
DBSCAN results depend on this parameter very much. You can find some methods for estimating it in literature.
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances probably is too small to reliably measure density, in particular with a dozen variables.
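One common way from the literature to estimate eps is a k-distance plot: sort each point's distance to its k-th nearest neighbor and look for the "knee" of the curve. A hedged sketch (Mahalanobis metric and min_samples taken from the question; the data here is a stand-in, and the knee still has to be read off manually):
# Hedged sketch: k-distance plot for choosing DBSCAN's eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
data = rng.rand(200, 12)                     # stand-in for the question's DataFrame
covariance = np.cov(data, rowvar=False)      # as in the question

k = 6  # match min_samples from the question
nn = NearestNeighbors(n_neighbors=k, metric="mahalanobis",
                      metric_params={"V": covariance}).fit(data)
dist, _ = nn.kneighbors(data)

plt.plot(np.sort(dist[:, -1]))               # distance to the k-th neighbor
plt.xlabel("points sorted by distance")
plt.ylabel("distance to %d-th nearest neighbor" % k)
plt.show()                                   # choose eps near the bend of the curve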

How to set the seed value for the K-means algorithm?

I am trying to group customers according to a certain given dataset with attributes like DOB, Gender, State, pincode, transaction_id, promocode etc.
Every time I run the algorithm there is a huge difference between the silhouette score of the clustering and that of the previous run, i.e. the results are not consistent.
Probably that is because of the random seeding of the initialization. Here is the line which passes the attributes to the algorithm:
km1 = KMeans(n_clusters=6, n_init=25, max_iter=600)
Is there any method to assign the clusters, or to optimise, such that every time I run the program the score is consistent and better?
I am using Python 3 with scikit-learn.
It looks (I'm guessing) like you are using scikit-learn.
In this case, just use:
km1 = KMeans(n_clusters=6, n_init=25, max_iter=600, random_state=MYSEED)
where MYSEED can be an integer, a RandomState object or None (the default), as explained in the documentation.
This means that
km1 = KMeans(n_clusters=6, n_init=25, max_iter=600, random_state=0)
produces deterministic results.
Remark: this only affects the random nature of k-means. If you do any splitting / CV on your data, you have to make those operations deterministic too!
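For example, a train/test split done beforehand should get its own fixed seed (a minimal sketch; train_test_split is just one example of such an operation):
# Hedged sketch: seed the data split as well, not only KMeans.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.RandomState(0).rand(100, 4)   # stand-in for your feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)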
You can fix random_state= to a constant value, but don't tweak this value until you like the results.
If k-means is sensitive to the starting conditions (i.e. the "quality" varies a lot), this usually indicates that the algorithm doesn't work very well on this data. It has been shown that if there is a good k-means clustering, then most runs will get at least close to it. So with n_init=25 you should find a good solution almost every time, if one exists. But there are many data sets where k-means cannot find a good solution!

What is the purpose of the holdout set in k-means clustering?

Link to the MIT problem set
Here are my current thoughts. Please point out where I'm wrong :)
What I believe: The holdout set's purpose is to serve as a contrast to the training set, to show that k-means reduces the error at each round.
To do this, the holdout set shows the error at the very beginning: it doesn't recompute the centroid of each cluster to be at the very center of the cluster after each point has been assigned. It just stops, and the error is calculated.
The training set, the initial 80% of the points partitioned using randomPartition(), simply goes through the entire k-means function, and the error is returned after that.
Where I'm probably wrong: The problem probably just requests another run of k-means, but with a smaller set. Also, the ways of calculating error for the training set vs. the holdout set seem identical to me. They're probably not. Also, I heard something about it involving feature selection.
Current method I'm considering, based on my current belief: Duplicate the k-means function, and modify the duplicate so that it returns the clusters and maxDistance after the initial run. Use this function for the holdout set.
The goal of clustering is to group similar data points. But how would you know whether the similar data points you have grouped are grouped correctly? How can you judge your results? For this reason you divide your available data into two sets: training and holdout.
Take this as an analogy.
Think of the training set as practice questions for some examination. You work through the practice questions and try to do your best, improving your skills.
You can think of the holdout set as the actual examination. If you did well on the practice questions (training set), then you will probably perform well in the examination (holdout set).
Now you know how well you did in practice and in the examination (of course, after attempting them), based on which you can infer your overall performance and judge what is good (what number of clusters is good, or how well the data is clustered).
So you apply your clustering algorithm on the training data only, not on the holdout data, and find the cluster centers (representatives of the clusters). For the holdout data, you simply use the cluster centers found by the algorithm and assign each data point to the cluster whose center is nearest. Then calculate your performance on the training and holdout data using some performance metric (squared distance error in your case). Finally, compare these metrics over different values of k to arrive at a good judgement. There is more to it, but for the assignment's sake this seems enough.
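A minimal sketch of that procedure (my illustration, not code from the problem set): fit on the training portion, assign each holdout point to the nearest learned center, and compare the squared-distance error across values of k:
# Hedged sketch: evaluate k-means on a holdout set for several k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = np.random.RandomState(0).rand(300, 5)          # stand-in data
X_train, X_hold = train_test_split(X, test_size=0.2, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=0).fit(X_train)
    # squared distance of each holdout point to its nearest learned center
    holdout_err = np.min(km.transform(X_hold) ** 2, axis=1).sum()
    print(k, km.inertia_, holdout_err)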
In practice, there are many other methods, but the key idea in most of them is the same. There is a statistics community where you can find more similar questions: https://stats.stackexchange.com/
References:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method

Parameter estimation for linear One Class SVM training via libsvm for n-grams

I know there are multiple questions about this, but not a single one addresses my particular problem.
I'll simplify my problem in order to make it clearer.
Let's say I have multiple sentences from an English document and I want to classify them using a one-class SVM (in libsvm) in order to detect anomalies (e.g. a German sentence) afterwards.
For training: I have samples of one class only (let's assume the other classes do not exist beforehand). I extract all 3-grams (so the feature space includes at most 16777216 different features) and save them in libsvm format (label=1, just in case that matters).
Now I want to estimate my parameters. I tried to use grid.py with additional parameters; however, the runtime is too long for RBF kernels. So I am trying linear kernels instead (grid.py may therefore be changed to use only one value of gamma, as it does not matter for linear kernels).
In any case, the smallest c that grid.py tests is shown as the best solution (does -c matter for linear kernels?).
Furthermore, no matter how much I change the -n (nu) value, the same relation between the scores is achieved every time (even though the number of support vectors changes). The scores are gathered using the Python implementation. (By "the same relation between scores" I mean that, e.g., at first they are -1 and -2; I change nu and afterwards they are e.g. -0.5 and -1; so if I sort them, the same order always appears, as in this example):
# python2
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from svmutil import *

y, x = svm_read_problem("/tmp/english-3-grams.libsvm")   # 5000 sentence samples
ym, xm = svm_read_problem("/tmp/german-3-grams.libsvm")  # 50 sentence samples
m = svm_train(y, x, "-s 2 -t 2 -n 0.5")

# do the prediction in one or two steps, here is one step:
p_l, p_a, p_v = svm_predict(y[:100] + ym[:100], x[:100] + xm[:100], m)

# p_v are our scores.
# let's plot a ROC curve
roc_ret = roc_curve([1] * 100 + [-1] * 100, p_v)
plt.plot(roc_ret[0], roc_ret[1])
plt.show()
Here, exactly the same ROC curve is achieved every time (even though -n is varied). Even if there is only one support vector, the same curve is shown.
Hence, my questions (let's assume a maximum of 50000 samples per training):
- why does -n not change anything in the one-class training process?
- which parameters do I need to change for a one-class SVM?
- is a linear kernel the best approach? (+ with regard to runtime) The RBF kernel parameter grid search takes ages for such big datasets
- liblinear is not being used because I want to do anomaly detection = one-class SVM
Best regards,
mutilis
The performance impact is a result of your huge feature space of 16777216 elements. This results in very sparse vectors for elements like German sentences.
A study by Yang & Pedersen, A Comparative Study on Feature Selection in Text Categorization, shows that aggressive feature selection does not necessarily decrease classification accuracy. I achieved similar results while performing text classification on (medical) German text documents.
As stated in the comments, LIBLINEAR is fast because it is built for such sparse data. However, you end up with a linear classifier, with all its pitfalls and benefits.
I would suggest the following strategy:
Perform aggressive feature selection (e.g. with information gain) down to a remaining feature space of size N
Increase N stepwise, in combination with cross-validation, and find the best matching N for your data
Do a grid search with the N found in step 2
Train your classifier with the best matching parameters found in step 3 and the N found in step 2
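A hedged sketch of steps 1 and 2 using scikit-learn (chi-squared scores stand in for information gain, LinearSVC stands in for the libsvm classifier, and load_sentences is a hypothetical loader, since ranking features this way needs labeled examples of both classes):
# Hedged sketch: stepwise feature selection with cross-validation.
# chi2 stands in for information gain; load_sentences is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs, labels = load_sentences()              # hypothetical: texts + 1/-1 labels

for n in [1000, 5000, 20000, 100000]:        # step 2: increase N stepwise
    pipe = Pipeline([
        ("vec", CountVectorizer(analyzer="char", ngram_range=(3, 3))),
        ("sel", SelectKBest(chi2, k=n)),     # step 1: keep the N best features
        ("svm", LinearSVC()),
    ])
    print(n, cross_val_score(pipe, docs, labels, cv=5).mean())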
