What is the purpose of the holdout set in k-means clustering? - python

Link to the MIT problem set
Here are my current thoughts--please point to where I'm wrong :)
What I believe: The holdout set's purpose is to act as a foil, a contrast, to the training set - to show that k-means reduces the error at each round.
To do this, the holdout set shows the error at the very beginning, i.e. it doesn't recompute the centroid of each cluster to be at the center of its assigned points after each point has been assigned. It just stops, and the error is calculated.
The training set - the initial 80% of the points, partitioned using randomPartition() - simply goes through the entire k-means function, and the error is returned after that.
Where I'm probably wrong: The problem probably just asks for another run of k-means, but on a smaller set.
Also, the ways of calculating error for the training set and the holdout set seem identical to me. They're probably not.
Also, I heard something about it involving feature selection.
Current approach I'm considering, based on the belief above: Duplicate the k-means function, and modify the duplicate so that it returns the clusters and maxDistance after the initial run. Use this function for the holdout set.

The goal of clustering is to group similar data points. But how do you know whether the points you have grouped together really are grouped correctly? How can you judge your results? For this reason you divide your available data into two sets: training and holdout.
Take this as an analogy.
Think of the training set as practice questions for some examination. You work through the practice questions, try to do your best on them, and improve your skills.
You can think of the holdout set as the actual examination. If you have done well on the practice questions (training set), then you will probably perform well in the examination (holdout set).
Now you know how well you did in both the practice and the examination (after attempting them, of course), from which you can infer your overall performance and judge what is good (e.g. what number of clusters is good, or how well the data is clustered).
So you apply your clustering algorithm to the training data only, and find the cluster centers (representatives of the clusters). For the holdout data, you simply use the cluster centers found by the algorithm and assign each data point to the cluster whose center is nearest. Then calculate your performance on the training and holdout data with some performance metric (squared-distance error in your case). Finally, compare these metrics over different values of k to make a good judgement. There is more to it, but for the assignment's sake this seems enough.
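For concreteness, here is a minimal sketch of that procedure, using scikit-learn's KMeans as a stand-in for the problem set's own k-means code (the data, the 80/20 split, and the values of k are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))                 # stand-in for the real data
    idx = rng.permutation(len(X))
    train, holdout = X[idx[:400]], X[idx[400:]]   # 80% training, 20% holdout

    for k in (2, 4, 6, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train)
        train_error = km.inertia_                 # sum of squared distances on the training set
        nearest = km.cluster_centers_[km.predict(holdout)]
        holdout_error = ((holdout - nearest) ** 2).sum()   # same metric, but the centroids stay frozen
        print(k, train_error, holdout_error)

The holdout error never feeds back into the centroids; it only tells you how well the centroids learned from the training set generalize.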
In practice, there are many other methods, but the key idea in most of them is the same. There is a statistics community where you can find more questions like this: https://stats.stackexchange.com/
References:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method

Related

Confusion on using clustering algorithms that do and do not generate centroids

Firstly, I want to apologize if any of my questions seem obvious or dumb. I'm still doing a lot of learning in this field, so I would appreciate any help I can get, even if the answer is obvious. I'm not sure if this belongs on Stack Overflow or not, but I thought I would give it a try here.
My understanding of clustering has always been that you never need to split the data into training and testing sets, since you don't have any labels (or are ignoring the labels). But from some recent reading I have done online, I see some people saying that a train/test split is still necessary for some clustering algorithms (those that generate centroids). So I have some scenarios listed below, and I was hoping some people could help me better understand what I have done wrong in those scenarios and what I need to do to fix it.
For all scenarios, my dataset is 95% unlabeled and 5% labeled, with two features. Some scaling was also done on the entire dataset prior to these scenarios. I know you're supposed to train/test split before scaling, but again, I thought you don't need to split for clustering.
Scenario 1
Using sklearn.cluster.KMeans, I ran KMeans on the whole dataset via the fit_predict() method. Then, for the 5% of labeled data, I took the clusters that were returned and compared them to the labels I have, to see how close the clusters are to the labels. I now feel that what I have done is incorrect because it introduces data leakage. Was I supposed to split the unlabeled/labeled data first, then do the scaling separately, then fit() on the unlabeled data and predict() on the labeled data?
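For concreteness, here is a rough sketch of the split-then-fit workflow this scenario is asking about (the data is synthetic and n_clusters is just a placeholder):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import StandardScaler

    # synthetic stand-ins for the 95% unlabeled / 5% labeled split described above
    rng = np.random.default_rng(0)
    X_unlabeled = rng.normal(size=(950, 2))
    X_labeled = rng.normal(size=(50, 2))
    y_labeled = rng.integers(0, 3, size=50)

    scaler = StandardScaler().fit(X_unlabeled)       # scaler fitted on the unlabeled part only
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X_unlabeled))

    pred = km.predict(scaler.transform(X_labeled))   # assign labeled points to the learned centroids
    print(adjusted_rand_score(y_labeled, pred))      # one way to compare clusters against labels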
Scenario 2
Using scipy.cluster.hierarchy.linkage, I ran linkage on the dataset to get my linkage matrix. I drew the dendrogram, selected a distance cutoff, saw how many clusters I would get, and generated the cluster labels via the fcluster method. Then, for the 5% of labeled data, I took the clusters that were returned and compared them to the labels I have, to see how close the clusters are to the labels. Am I doing anything wrong in this situation so far? I'm not sure what my next steps are from here. Since hierarchical clustering doesn't generate any centroids, how can it be used to properly classify/predict data?
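For reference, a minimal sketch of the linkage/fcluster workflow described in this scenario (the data and the cutoff value are made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                    # stand-in for the scaled feature matrix

    Z = linkage(X, method="ward")                    # the linkage matrix
    # scipy.cluster.hierarchy.dendrogram(Z) can be drawn here to pick a cutoff visually
    cutoff = 10.0                                    # illustrative distance cutoff
    labels = fcluster(Z, t=cutoff, criterion="distance")
    print(len(np.unique(labels)), "clusters at this cutoff")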
Thank you.

How to effectively tune the hyper-parameters of Gensim Doc2Vec to achieve maximum accuracy in Document Similarity problem?

I have around 20k documents of 60-150 words each. For 400 of these 20k documents, the most similar document is known. These 400 documents serve as my test data.
At present I remove those 400 documents and use the remaining 19,600 documents to train the Doc2Vec model. Then I extract the vectors of the train and test data. For each test document, I compute its cosine distance to all 19,600 train documents and select the top 5 with the smallest cosine distance. If the marked similar document is present in this top 5, I count the record as accurate. Accuracy % = number of accurate records / total number of records.
The other way I find similar documents is with Doc2Vec's most_similar method, then calculate accuracy using the same formula.
These two accuracies don't match: with each epoch, one increases while the other decreases.
For training the Doc2Vec model I am using the code given here: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.
I would like to know how to tune the hyperparameters so that I get the maximum accuracy by the above formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.
Instead, leave the default min_alpha alone, call train() just once with your desired number of epochs, and let the method manage the alpha decay itself.
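As a rough sketch of what that looks like in gensim (the tiny corpus and parameter values are only placeholders, not recommendations):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # tiny stand-in corpus; in practice this would be your ~20k tokenized documents
    docs = [["human", "machine", "interface"],
            ["graph", "of", "trees"],
            ["user", "interface", "system"]]
    corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]

    model = Doc2Vec(vector_size=50, min_count=1, epochs=40)   # placeholder values
    model.build_vocab(corpus)
    # a single call to train(); gensim decays alpha down to min_alpha internally
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)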
Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior idea of what a good result looks like, then try a wide range of model parameters and pick the one that scores best.
When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query point against all known doc-vectors and returns those with the greatest cosine similarity. Those should be identical to the ones you've calculated to have the least cosine distance. If you added your exact code to the question – how you're calculating cosine distances, and how you're calling most_similar() – it would probably become clear what subtle differences or errors cause the discrepancy. (There shouldn't be any essential difference, but given that there is: you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array operations that are probably faster than whatever loop you've authored.)
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
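For illustration, a hedged sketch of the top-5 scoring built on infer_vector() and most_similar(), continuing from the model in the earlier sketch (gensim 4.x attribute names; test_pairs is a hypothetical list of (test_doc_tokens, tag_of_known_similar_doc) pairs):

    # model: a trained Doc2Vec model as in the earlier sketch
    # test_pairs: hypothetical list of (test_doc_tokens, tag_of_known_similar_doc) pairs
    hits = 0
    for tokens, expected_tag in test_pairs:
        vec = model.infer_vector(tokens)               # vector for a document not looked up by tag
        top5 = model.dv.most_similar([vec], topn=5)    # list of (tag, cosine similarity) pairs
        if expected_tag in {tag for tag, _ in top5}:
            hits += 1
    accuracy = hits / len(test_pairs)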
Checking whether the desired docs are in the top-5 (or top-N) most-similar results is just one way to score a model. Another way, used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is, for each such pair, to also pick another random document, and count the model as accurate each time it reports the known-similar docs as closer to each other than to the third, randomly chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results page, or the same category, were checked to see whether they were 'closer' to each other inside a model than other random docs.
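A minimal sketch of that pair-versus-random-document check, under the same assumptions as above (known_pairs is a hypothetical list of (tag_a, tag_b) pairs known to be similar, and all documents were trained together so their tags can be looked up directly):

    import random

    # known_pairs: hypothetical list of (tag_a, tag_b) document pairs known to be similar
    correct = 0
    for tag_a, tag_b in known_pairs:
        candidates = [t for t in range(len(model.dv)) if t not in (tag_a, tag_b)]
        random_tag = random.choice(candidates)         # a third, randomly chosen document
        if model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, random_tag):
            correct += 1
    print(correct / len(known_pairs))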
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.

NN: outputting a probability density function instead of a single value

This might sound silly, but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are classifying images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it (similar to the posterior plot in Bayesian optimisation).
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought a Keras callback could be of use: it provides the flexibility I wanted (i.e. evaluating only on the test data, or only on a subset, rather than on every batch), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any metric with the two arrays (the true values and the predicted values) and, once those calculations are done, writes them to a file for later use.
Then, once we had a way to gather the data for every sample, the next step was to implement a method that gives a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives or additional metrics, along with some other Bayesian methods.
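To make the Bayesian-bootstrapping idea concrete, here is a minimal NumPy sketch (not the lmc2179 implementation); it reweights the per-sample residuals with Dirichlet weights to get a distribution over the mean absolute error, and the function name and choice of absolute error are just for illustration:

    import numpy as np

    def bayesian_bootstrap_error(y_true, y_pred, n_draws=2000, seed=0):
        """Distribution over the mean absolute error via Dirichlet-weighted resampling."""
        rng = np.random.default_rng(seed)
        residuals = np.abs(np.asarray(y_true) - np.asarray(y_pred))
        weights = rng.dirichlet(np.ones(len(residuals)), size=n_draws)  # (n_draws, n_samples)
        return weights @ residuals                                      # one weighted error per draw

    # example: y_true / y_pred would be the per-sample values gathered by the custom metric
    draws = bayesian_bootstrap_error(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.4]))
    print(np.percentile(draws, [2.5, 50, 97.5]))   # a credible-interval-style summary of the error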
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, so that I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start, since the network is not yet optimized. Some filtering could be useful to remove a good portion of those points and improve the results of the error predictors.
I'll update if I find anything interesting.

Accuracy difference on normalization in KNN

I trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained my model; now I am getting an accuracy of only 87%. What could be the reason? Should I stick with the data that is not normalised, or should I switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots. Therefore, this point would be labeled red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is on different scales and putting it on a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features, and its y value therefore becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Therefore, your algorithm would label it blue, and it would be incorrect. Ouch!
Now to answer your questions: sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up on the data set you are currently working with doesn't mean you will get the same results on a different data set.
Long story short, instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and it's unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by the features with the largest scales.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the training data very well but does not work well at all on new data. The first model might have memorized more of the data due to some characteristic of that data, but that's not a good thing. You need to check your prediction accuracy on a different set of data from the one you trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different from those with unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, normalized features may select a different set of k neighbors than unnormalized features would, hence the difference in accuracy.
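As a quick, hedged illustration of the two answers above (evaluate on held-out data, and scaling changes which neighbors are chosen), here is the same KNN with and without scaling on a held-out test set; the dataset is just an example, not the asker's data:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

    print("unscaled test accuracy:", raw.score(X_test, y_test))
    print("scaled test accuracy:  ", scaled.score(X_test, y_test))

Measuring both models on the same held-out test set is what lets you tell whether the unnormalized 97% was genuine or an artifact of evaluating on the training data.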

scikits.learn clusterization methods for curve fitting parameters

I would like some suggestions on the best clustering technique to use with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix of n_samples rows with 5 parameters for each sample. The number of samples can vary, but it is usually around several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods, or on the best way to assess the clustering quality.
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), with each sample colored by its assigned cluster;
the sigmoid curves from which the input parameters have been extracted, colored by their assigned cluster.
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze their semantic meaning - it may well be that the result is merely a mathematical local optimum that is meaningless for your task.
For 5000 samples, all methods should work without problem.
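A hedged sketch of the preprocessing and the single-attribute reality check (the DataFrame params, its size, and all column names other than "Area" and "Height" are hypothetical):

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # hypothetical stand-in for the matrix of five fitted curve parameters
    rng = np.random.default_rng(0)
    params = pd.DataFrame(rng.normal(size=(5000, 5)),
                          columns=["Area", "Height", "p3", "p4", "p5"])

    X = StandardScaler().fit_transform(params)       # put all parameters on a comparable scale
    labels = KMeans(n_clusters=10, n_init=100, random_state=0).fit_predict(X)

    # reality check: if one column alone separates the clusters, the result is just a binning of it
    print(params.assign(cluster=labels).groupby("cluster").mean())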
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are Ward and DBSCAN, which are also solid choices.
You can also consult the chart of my personal opinion in the scikit-learn docs, though I can't find the link to it right now...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, the silhouette score, but as far as I know it favours very compact clusters like those found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?
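For the judging and visualization points, one common (if imperfect, as noted above) sketch is to scan k and look at the inertia (elbow) and the silhouette score together; X here is the standardized parameter matrix from the previous sketch:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # X: the standardized parameter matrix from the previous sketch
    for k in range(2, 12):
        km = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
        print(k, km.inertia_, silhouette_score(X, km.labels_))  # elbow (inertia) and silhouette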
