I have implemented K-Means clustering on a dataset after reducing its dimensionality to 2 features with PCA.
Now I am wondering how to interpret this analysis, since there is no reference for which variables end up on the axes. Given that doubt, I am also wondering whether it is good practice to run K-Means on a dataset reduced with PCA.
How can I interpret this kind of clustering?
Thank you!
It is hard to give an answer that addresses your question directly, since it is not specific enough and I have no idea about your data or the objective of your research. So let me answer in general terms, in case that helps.
First of all, PCA reduces the interpretability of the analysis, because it reduces the dimensions based on linear relations between the variables, and you can no longer name the reduced components. In addition, check the correlation scores among the variables before PCA to get an intuition of how successful PCA will be, and check the variance explained by PCA. The lower the explained variance ratio, the greater the information loss, which may mislead your interpretations.
If your objective is to analyse the data and make inferences, I would suggest not reducing the dimension. You have only 3 dimensions. You can apply K-Means without PCA and plot the clusters in 3D; Matplotlib and Plotly provide interactive features for this.
However, if your objective is to build a machine learning model, then you should reduce the dimension if the features are highly correlated. This can help your model considerably.
Finally, applying K-Means after PCA is not forbidden, but it does make interpretation harder.
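As a rough sketch of this workflow (the data and parameter values below are made up for illustration), you can check the explained variance ratio before trusting a 2-D clustering:

```python
# Sketch: K-Means after PCA, checking how much variance the 2 components keep.
# Synthetic data stands in for your own feature matrix X.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=0)

X_scaled = StandardScaler().fit_transform(X)   # scale before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# If this ratio is low, the 2-D picture may misrepresent the clusters.
print("explained variance ratio:", pca.explained_variance_ratio_.sum())

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)
```

The axes of `X_2d` are the two principal components, not any original variable, which is exactly the interpretability loss described above.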
I was going through a document from Western Michigan University to understand the limitations of the k-means clustering algorithm. Below is the link:
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
On slide 33, it's mentioned that k-means has problems when clusters are of different
Sizes
Densities
Non globular shapes
Since we explore our data and try to figure out the different groups present in it through the k-means algorithm, how would we know beforehand that the sizes of the clusters differ? We can visualize this if we have two-dimensional data, but how can it be done if the data is n-dimensional? Is there any way to examine the data before applying k-means?
Also, the explanation given for this limitation is: if we have clusters of different sizes, k-means will not give the desirable clusters, as it tries to partition the clusters equally. But I don't think that's always the case. I applied k-means with k-means++ initialization on the following dataset:
https://archive.ics.uci.edu/ml/datasets/online+retail
It gave me clusters with a highly uneven distribution: 4346, 23, and 3 points.
I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.
That's a limitation of k-means: you don't really have a hard fact about whether your clustering is good or not.
Pre-steps could be:
Normalization/Standardization of the data with StandardScaler
Missing value handling
Dimension reduction (there are several techniques like: PCA), especially if you have a lot of dimensions
Random initialization (results can vary with the starting points)
A real measure of how good your k-means clustering is doesn't really exist; here is a topic about how to "measure" it: Clustering Quality Measure
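One such "measure" is the silhouette score in scikit-learn (range -1 to 1, higher is better). A minimal sketch on synthetic data (centers and values below are made up for illustration):

```python
# Sketch: compare silhouette scores for several values of k.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, scores[k])

# The k with the highest silhouette is a reasonable candidate.
```

This is still a heuristic, not a hard fact, but it works for n-dimensional data where you can't just look at a scatter plot.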
I am working on a project currently and I wish to cluster multi-dimensional data. I tried K-Means clustering and DBSCAN clustering, both being completely different algorithms.
The K-Means model returned fairly good output: it found 5 clusters. But I have read that when the dimensionality is large, Euclidean distance breaks down, so I don't know whether I can trust this model.
On trying the DBSCAN model, it generated a lot of noise points and put a lot of points into one cluster. I tried the k-NN distance plot method to find the optimal eps for the model, but I can't seem to make it work. This led me to conclude that maybe the density of the points is very high, and maybe that is why I am getting so many points in one cluster.
For clustering, I am using 10 different columns of data. Should I change the algorithm I am using? What would be a better algorithm for multi-dimensional data and with less-varying density?
You can first reduce the dimensionality of your dataset with PCA/LDA/t-SNE or autoencoders, then run some standard clustering algorithms.
Another option is to use deep clustering methods. This blog post is a really nice explanation of how to apply deep clustering to high-dimensional datasets.
Maybe this provides you with some inspiration: Scikit-learn clustering algorithms
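A minimal sketch of running a few of those algorithms on the same scaled data and comparing the results (all numbers below are made up for illustration):

```python
# Sketch: try several scikit-learn clusterers on one scaled dataset.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X, _ = make_blobs(n_samples=400, n_features=10, centers=5, random_state=1)
X = StandardScaler().fit_transform(X)  # DBSCAN's eps is scale-sensitive

results = {}
for model in (KMeans(n_clusters=5, n_init=10, random_state=1),
              AgglomerativeClustering(n_clusters=5),
              DBSCAN(eps=1.5, min_samples=5)):
    labels = model.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
    n_noise = int(np.sum(labels == -1))
    results[type(model).__name__] = (n_clusters, n_noise)
    print(type(model).__name__, "clusters:", n_clusters, "noise:", n_noise)
```

Comparing how many clusters and noise points each method produces on your data is a quick first sanity check before committing to one algorithm.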
I suggest you try a few out. Hope that helps!
I'm studying machine learning from here, and the course uses scikit-learn for regression - https://www.udemy.com/machinelearning/
I can see that for some regression algorithms the author uses feature scaling, and for some he doesn't, because some scikit-learn regression algorithms take care of feature scaling by themselves.
How do I know which training algorithms need feature scaling and which don't?
No machine learning technique strictly needs feature scaling; for some algorithms, scaled inputs make the optimization easier for the computer, which results in faster training times.
Typically, algorithms that leverage distance or assume normality benefit from feature scaling. https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
It depends on the algorithm you are using and your dataset.
Support Vector Machines (SVMs): these models converge faster if you scale your features. The main advantage of scaling is to avoid attributes with greater numeric ranges dominating those with smaller ranges.
In k-means clustering, you compute the Euclidean distance to group data points together. That is a good reason to scale your features, so that the centroids are not unduly affected by large or abnormal values.
In the case of regression, scaling your features will not be of much help, since the relationship captured by the coefficients on the original dataset is the same as the one captured on the scaled dataset.
Decision trees don't usually require feature scaling, since splits depend only on the ordering of the values.
For models that involve a learning rate and use gradient descent, the input scale does affect the gradients, so feature scaling should be considered.
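The distance argument can be seen in a tiny sketch (the values are made up): one wide-range column dominates Euclidean distance until the data is standardized.

```python
# Sketch: a large-range feature dominates Euclidean distance before scaling.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100_000.0],
              [2.0, 100_100.0],
              [3.0, 100_050.0]])

# Unscaled: the distance is driven almost entirely by the second column.
d_raw = np.linalg.norm(X[0] - X[1])

X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(d_raw, d_scaled)  # after scaling, both columns contribute comparably
```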
A very simple answer: some algorithms do feature scaling for you, and some do not. If the algorithm does not, you need to scale the features manually.
You can look up which algorithms scale features internally, but it's safest to scale them manually. Always make sure the features are scaled; otherwise the algorithm's output will be offset from the ideal.
I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance, I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or a simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, Isomap has a reconstruction error, t-SNE can return its KL divergence, and MDS can return the reconstruction stress.
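A minimal sketch of reading those three measures out of scikit-learn (the data here is synthetic; substitute your own image features):

```python
# Sketch: goodness-of-fit measures exposed by three manifold learners.
from sklearn.datasets import make_blobs
from sklearn.manifold import Isomap, TSNE, MDS

X, _ = make_blobs(n_samples=100, n_features=5, random_state=0)

iso = Isomap(n_components=2).fit(X)
print("Isomap reconstruction error:", iso.reconstruction_error())

tsne = TSNE(n_components=2, random_state=0).fit(X)
print("t-SNE KL divergence:", tsne.kl_divergence_)

mds = MDS(n_components=2, random_state=0).fit(X)
print("MDS stress:", mds.stress_)
```

Note that these numbers are not comparable across methods; each is only useful for comparing different settings of the same method.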
I am using scikit-learn to make predictions on a very large dataset. The data is very wide, but not very long, so I want to assign weights to parts of the data. If I know some parts of the data are more important than others, how should I inform scikit-learn of this? Or does it kind of break the whole machine learning approach to do some pre-teaching?
The most straightforward way of doing this is perhaps by using Principal Component Analysis on your data matrix X. The principal vectors form an orthogonal basis of X, and each one is a linear combination of the original feature space (normally the columns) of X. The decomposition is such that each principal vector has a corresponding eigenvalue (or singular value, depending on how you compute the PCA), a scalar that reflects how much of X can be reconstructed from that principal vector alone, in a least-squares sense.
The magnitudes of the coefficients of the principal vectors can be interpreted as the importance of the individual features of your data, since each coefficient maps 1:1 to a feature (column) of the matrix. By selecting one or two principal vectors and examining their coefficient magnitudes, you can get a preliminary insight into which columns are most relevant, up to how well those vectors approximate the matrix.
This is the detailed scikit-learn API description. Again, PCA is a simple way of doing this, but just one among others.
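A small sketch of inspecting loadings this way (the data here is synthetic: two correlated "signal" columns and one noise column, made up for illustration):

```python
# Sketch: use PCA component coefficients as a rough feature-importance signal.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([signal + 0.1 * rng.normal(size=n),   # col 0: strong signal
                     signal + 0.1 * rng.normal(size=n),   # col 1: strong signal
                     rng.normal(size=n)])                 # col 2: pure noise

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rows of components_ are the principal vectors; coefficient magnitudes
# map 1:1 to the original columns.
loadings = np.abs(pca.components_[0])
print("first-PC loadings per column:", loadings)
```

On this toy data the noise column gets a near-zero loading on the first component, while the two correlated columns dominate it.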
This probably depends a bit on the machine learning algorithm you're using -- many will discover feature importances on their own (as elaborated via the feature_importances_ property in random forest and others).
If you're using a distance-based method (e.g. k-means, k-NN), you could manually weight the features differently by scaling the values of each feature accordingly (though it's possible scikit-learn does some normalization...).
Alternatively, if you know some features really don't carry much information you could simply eliminate them, though you'd lose any diagnostic value these features might unexpectedly bring. There are some tools in scikit for feature selection that might help make this kind of judgement.
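The manual-weighting idea for distance-based methods can be sketched like this (the weights here are made-up values; you would choose them from domain knowledge):

```python
# Sketch: weight features for a distance-based model by scaling columns.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 4))
X = StandardScaler().fit_transform(X)      # start from comparable scales

weights = np.array([2.0, 1.0, 1.0, 0.5])   # hypothetical importances
X_weighted = X * weights                   # column i now counts weights[i]**2
                                           # as much in squared Euclidean distance

nn = NearestNeighbors(n_neighbors=5).fit(X_weighted)
dist, idx = nn.kneighbors(X_weighted[:1])  # neighbors under the weighted metric
```

Standardizing first matters: otherwise the weights compound with whatever raw scale each column already had.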