How to put the seed values of K-means algorithm?

How to put the seed values of K-means algorithm? - python

I am trying to group customers according to a certain given dataset with attributes like DOB, Gender, State, pincode, transaction_id, promocode etc.
Every time I run the algorithm there is a huge difference in the silhouette score of the clustering from the previous one i.e. the result is not consistent.
Probably that is because of the random seeds to the datasets. Here is the line which passes attribute to the algorithm.
km1 = KMeans(n_clusters=6, n_init=25, max_iter = 600)
Is there any method to assign clusters or optimise such that after everytime I run the program, the score is consistent and better?
I am using Python 3 with scikit-learn.

It looks (i'm guessing) like you are using scikit-learn.
In this case, just use:
km1 = KMeans(n_clusters=6, n_init=25, max_iter = 600, random_state=MYSEED)
where MYSEED can be an integer, RandomState object or None (default) as explained in above link.
This means:
km1 = KMeans(n_clusters=6, n_init=25, max_iter = 600, random_state=0)
is inducing deterministic results.
Remark: this only effects k-means random-nature. If you did some splitting / CV to your data, you have to make these operations deterministic too!

You can fix your random_state= to a constant value. But don't tweak this value until you like the results.
If k-means is sensitive to the starting conditions (I.e. the "quality" varies a lot) this usually indicates that the algorithm doesn't work on this data very well. It has been shown that if there is a good k-means clustering then it will be easy to get at least close to this with most runs. So with n_init=25 you should find a good solution almost every time, if there is one. But there are many data sets where k-means cannot find a good solution!

Related

Scikit-Learn DBSCAN clustering yielding no clusters

I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?

You need to choose the parameter eps, too.
DBSCAN results depend on this parameter very much. You can find some methods for estimating it in literature.
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances probably is too small to reliably measure density, in particular with a dozen variables.

What is the purpose of the holdout set in k-means clustering?

Link to the MIT problem set
Here are my current thoughts--please point to where I'm wrong :)
What I believe: The holdout set's purpose is to foil,
contrast, for the training set - to prove that the
k-means eliminates error at each round.
To do this, the holdout set shows the error at the very begin-
ning, i.e. it doesn't recompute the centroid of each clusters
to be at the very center of each cluster, after each
point has been assigned. It just stops, and the error is
calculated.
The training set, for the initial 80% of the points--
partitioned using randomPartition()--simply go through
the entire k-means function, and return the error after
that.)
Where I'm probably wrong: The problem probably just
requests another run of k-means, but with a smaller set.
Also, the way of calculating error for training set vs. the holdout
set seem identical to me. They're probably not.
Also, I heard something about it involving feature selection.
Current methods I'm considering based on current belief:
Duplicate the k-means function, and modify the duplicate
so that it returns the clusters, maxDistance after initial
run. Use this function for the holdout set.

The goal of clustering is to group similar data points. But how would you know if the similar data points you have grouped are grouped correctly? How can you judge your results? For this reason you divide your available data into 2 sets: training and holdout.
Take this as an analogy.
Think about training set as practice questions for some examination. You work the practice questions, try to do best in it and improve your skills.
You can think holdout set as the actual examination. If you have worked good on the practice questions (training set) then you will probably perform good in the examination (holdout set).
Now you know how well did you do in practice and examination (of-course after attempting ) based on which you can infer your overall performance and judge what is good (what number of clusters are good or how good is the data clustered).
So you will apply your clustering algorithm on the training data but not on holdout data and find out cluster centers (representatives of clusters). For holdout data, you will simply use the cluster centers you have found from algorithm and assign data-points to cluster whose center is nearest. Calculate your performance on training and holdout data based on some performance metric (squared distance error in your case). Finally compare these metrics over different values of k to get a good judgement. There is more to it but for assignment sake it seems enough.
In practice, there are many other methods. But the key idea in most of them is same. There is a statistics community where you can find more similar questions: https://stats.stackexchange.com/
References:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method

Scikit-Learn: Label not x is present in all training examples

I'm trying to do multilabel classification with SVM.
I have nearly 8k features and also have y vector of length with nearly 400. I already have binarized Y vectors, so I didn't use MultiLabelBinarizer() but when I use it with my Y data's raw form, it still gives same thing.
I'm running this code:
X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")
training_X = X[:2600,:]
training_y = Y[:2600,:]
test_sample = X[2600:2601,:]
test_result = Y[2600:2601,:]
classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)
After all fitting process when it comes to prediction part, it says Label not x is present in all training examples (x is a few different numbers in range of my y vector length which is 400). After that it gives predicted y vector which is always zero vector with length of 400(y vector length).
I'm new at scikit-learn and also in machine learning. I couldn't figure out the problem here. What's the problem and what should I do to fix it?
Thanks.

There are 2 problems here:
1) The missing label warning
2) You are getting all 0's for predictions
The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
If you made it this far, thanks for reading. Hope that helps.

Changes of clustering results after each time run in Python scikit-learn

I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But, every time I run it I get different results. I know this is the problem with initiation but I don't know how to fix it. This is my a part of my code that runs on sentences:
vectorizer = TfidfVectorizer(norm='l2',sublinear_tf=True,tokenizer=tokenize,stop_words='english',charset_error="ignore",ngram_range=(1, 5),min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k,eigen_solver='arpack',affinity="nearest_neighbors",assign_labels="discretize")
spectral.fit(X)
Data is a list of sentences. Everytime the code runs, my clustering results differs. How can I get consistent results using Spectral clustering. I also have the same problem with Kmean. This is my code for Kmean:
vectorizer = TfidfVectorizer(sublinear_tf=True,stop_words='english',charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1,verbose=0)
km.fit(X_data)
I appreciate your helps.

When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.
km = KMeans(n_clusters=number_of_k, init='k-means++',
max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)
This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
I'm not sure about the spectral clustering example though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." OP's code doesn't seem to be contained in those cases, though setting the parameter might be worth a shot.

As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.
The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.
In my opinion, when the results vary highly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may be not well separated; and other algorithms may yield better results.

I had a similar issue, but it's that I wanted the data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in the cluster 0 and all gray images of the original data set were in the cluster 1. For another data set, I want color images / gray images to be in cluster 0 and cluster 1 as well.
Here is the code I stole from a Kaggler - in addition to set the random_state to a seed, you use the k-mean model returned by KMeans for clustering the other data set. This works reasonably well. However, I can't find the official scikit-Learn document saying that.
# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
from sklearn.cluster import KMeans
seed = 42
def create_color_clusters(img_df, cluster_count = 2, cluster_maker=None):
if cluster_maker is None:
cluster_maker = KMeans(cluster_count, random_state=seed)
cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])
img_df['cluster-id'] = np.argmin(cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]),-1)
return img_df, cluster_maker
# Now K-Mean your images `img_df` to two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same kmean-model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)
However, even setting random_state to a int seed cannot ensure the same data will always be grouped in the same order across machines. The same data may be clustered as group 0 on one machine and clustered as group 1 on another machine. But at least with the same K-Means model (cluster_maker in my code) we make sure data from another distribution will be clustered in the same way as the original data set.

Typically when running algorithms with many local minima it's common to take a stochastic approach and run the algorithm many times with different initial states. This will give you multiple results, and the one with the lowest error is usually chosen to be the best result.
When I use K-Means I always run it several times and use the best result.

What is "The sum of true positives and false positives are equal to zero for some labels." mean?

I'm using scikit learn to perform cross validation using StratifiedKFold to compute the f1 score, but it says that some of my labels have the sum of true positives and false positives are equal to zero for some labels. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?

Your classifier is probably classifying all data points as negative, so there are no positives. You can check that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
bug in your code. Check your training data contains negative data points, and that these data points contain non-zero features.
inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try using grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
Re your second question: stratification ensures that each fold contains roughly the same proportion of data points from all classes. This does not mean your classifier will perform sensibly.
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.