For PCA we can look at the explained variance ratio and say what percentage of the original data's variance each principal component captures. With these variance scores we can plot an elbow graph and decide how many dimensions to keep for visualising the data. But for t-SNE I couldn't find anything similar.
Is there any way to decide the number of dimensions in t-SNE?
Use the n_components parameter when constructing the object. If you're working with a large dataset, UMAP scales better than t-SNE when reducing into more than two or three dimensions.
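A minimal sketch with scikit-learn's TSNE; the data here is a synthetic placeholder:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(200, 10)  # placeholder data

# n_components sets the dimensionality of the embedding.
# Note: the default method='barnes_hut' only supports
# n_components <= 3; higher values need method='exact',
# which is much slower.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(embedding.shape)  # (200, 2)
```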
I have implemented a K-Means clustering on a dataset in which I have reduced the dimensionality to 2 features with PCA.
Now I am wondering how to interpret this analysis, since there is no reference to which variables end up on the axes. Given that doubt, I am also wondering whether it is good practice to run K-Means on a dataset that has been reduced with PCA.
How can I interpret this kind of clustering?
Thank you!
It is hard to give an answer that addresses your question directly, since it is not specific enough and I have no idea about your data or the objective of your research. So let me answer in general terms, in case it helps.
First of all, PCA strictly decreases the interpretability of the analysis, because it reduces the dimensions based on linear relations between the variables and you can no longer name the reduced components. In addition, check the correlation scores among the variables before running PCA, to get an intuition of how successful PCA will be, and check the variance explained by PCA. The lower the explained variance ratio, the greater the information loss, and that can mislead your interpretations.
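A rough sketch of both checks, using a synthetic DataFrame as a stand-in for your data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Placeholder data; substitute your own DataFrame here.
df = pd.DataFrame(np.random.RandomState(0).rand(100, 3),
                  columns=['var1', 'var2', 'var3'])

# Correlations among the original variables: strong linear
# correlations suggest PCA can compress the data well.
print(df.corr())

# Variance explained by each principal component.
pca = PCA().fit(df.values)
print(pca.explained_variance_ratio_)           # per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative
```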
If your objective is to analyse the data and make inferences, I would suggest not reducing the dimension at all. You have only 3 dimensions. You can apply K-Means without PCA and plot the result in 3D; matplotlib and plotly both provide interactive features for this.
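For example, with matplotlib (the data and number of clusters here are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(300, 3)  # placeholder for the 3 features

# Cluster directly on the three original features.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# 3D scatter coloured by cluster; the axes keep their original
# variable names, so the result stays interpretable.
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
ax.set_xlabel('feature 1')
ax.set_ylabel('feature 2')
ax.set_zlabel('feature 3')
plt.show()
```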
However, if your objective is to build a machine learning model, then you should reduce the dimensions if they are highly correlated. This would be a big favor to your model.
Finally, applying K-Means after PCA is not something you must avoid, but it does make interpretation more difficult.
I am trying to use scikit's factor analysis on some financial data to find betas to use in a model. FA has parameters called n_components and tolerance. I am having some trouble wrapping my head around how these variables influence the outcome. I have read the docs and done research but have had trouble finding any relevant information. I am new to machine learning and am not a stats wizard. Could someone explain how these influence the outcome of the algorithm?
From sklearn.decomposition.FactorAnalysis
n_components : int | None
Dimensionality of latent space, the number of components of X that are obtained after transform. If None, n_components is set to the number of features.
tol : float
Stopping tolerance for EM algorithm.
I am assuming that your financial data is a matrix of shape (n_samples, n_features). Factor analysis uses an expectation-maximization (EM) optimizer to find the best Gaussian distribution that can accurately model your data; tol is the stopping tolerance for that EM loop. In simple terms, n_components is the dimensionality of the Gaussian distribution.
Data that can be modelled with a Gaussian distribution sometimes has negligible variance in one dimension. Think of an ellipsoid that is squashed along its depth such that it resembles an ellipse. If the raw data was the ellipsoid, you want your n_components = 2, so that you can model your data with the least complicated model.
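A minimal sketch; the shapes and parameter values here are placeholders:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.RandomState(0).rand(500, 10)  # placeholder financial data

# n_components: dimensionality of the latent space the EM
# algorithm fits; tol: stop once the EM log-likelihood
# improves by less than this amount.
fa = FactorAnalysis(n_components=2, tol=1e-2)
factors = fa.fit_transform(X)  # (500, 2) latent scores
print(fa.components_.shape)    # (2, 10) factor loadings
```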
I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance, I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
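A quick sketch of that computation, with scikit-learn's PCA reporting the same quantity directly (the data is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(200, 5)  # placeholder data

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# "Explained variance" of the first k components is their
# share of the total eigenvalue mass.
k = 2
print(eigvals[:k].sum() / eigvals.sum())

# PCA reports the same ratios directly:
print(PCA().fit(X).explained_variance_ratio_[:k].sum())
```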
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, isomap has a reconstruction error, tsne can return its KL-divergence, and MDS can return the reconstruction stress.
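A sketch of where those numbers live in the scikit-learn API (again with placeholder data):

```python
import numpy as np
from sklearn.manifold import Isomap, TSNE, MDS

X = np.random.RandomState(0).rand(200, 10)  # placeholder data

iso = Isomap(n_components=2).fit(X)
print(iso.reconstruction_error())  # Isomap reconstruction error

tsne = TSNE(n_components=2).fit(X)
print(tsne.kl_divergence_)         # KL divergence after optimisation

mds = MDS(n_components=2).fit(X)
print(mds.stress_)                 # final value of the stress function
```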
I am working with the Mean Shift clustering algorithm, which is based on the kernel density estimate of a dataset. I would like to generate a large, high-dimensional dataset, and I thought the Scikit-Learn function make_blobs would be suitable. But when I try to generate a 1-million-point, 8-dimensional dataset, I end up with almost every point being treated as a separate cluster.
I am generating the blobs with standard deviation 1, and then setting the bandwidth for the Mean Shift to the same value (I think this makes sense, right?). For two dimensional datasets this produced fine results, but for higher dimensions I think I'm running into the curse of dimensionality in that the distance between points becomes too big for meaningful clustering.
Does anyone have any tips/tricks on how to generate a good high-dimensional dataset that is suitable for (something like) Mean Shift clustering? Or am I doing something wrong? (Which is, of course, a good possibility.)
The standard deviation of the clusters isn't 1.
You have 8 dimensions, each of which has a stddev of 1, so the overall spread of each cluster is on the order of sqrt(8) ≈ 2.83, not 1.
Kernel density estimation does not work well in high-dimensional data because of bandwidth problems.
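A quick sanity check on the distance scale; the dataset is shrunk so it runs fast, and the parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import estimate_bandwidth

X, y = make_blobs(n_samples=10_000, n_features=8,
                  centers=5, cluster_std=1.0)

# Typical distance between two points in the same cluster:
# each of the 8 coordinates contributes a difference with
# variance 2, so expect distances around sqrt(2 * 8) = 4, not 1.
same = X[y == 0]
print(np.linalg.norm(same[0] - same[1:500], axis=1).mean())

# Let scikit-learn suggest a bandwidth instead of hard-coding 1.
print(estimate_bandwidth(X, quantile=0.2, n_samples=1000))
```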
I would like some suggestions on the best clustering technique to use, with Python and scikits.learn. Our data comes from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a series of curve parameters by fitting a sigmoid function.
We would like to "rank" this activity curves through clusterization, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples and 5 parameters for each sample. The number of samples can vary, but it is usually around several thousands (i.e. 5'000). The clustering seems efficient and effective, but I would appreciate any suggestion on different methods or on the best way to perform an assessment of the clustering quality.
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated); the color of each sample reflects its assigned cluster.
the sigmoid curves from which the input parameters have been extracted; their color reflects the assigned cluster.
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze their semantic meaning: it may well be just a mathematical local optimum that is meaningless for your task.
For 5000 samples, all methods should work without problem.
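For the preprocessing point above, a minimal sketch: standardize the five curve parameters before clustering, so that no single attribute (such as "Area") can dominate the Euclidean distance on its own. The data here is synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(5000, 5)  # placeholder for the 5 curve parameters

# Put all five parameters on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=10, n_init=100, max_iter=1000).fit_predict(X_scaled)
print(silhouette_score(X_scaled, labels))
```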
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are Ward and DBSCAN; both are also solid choices.
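A sketch of two of these options; the gamma and eps values are placeholders you would need to tune, and the data here is synthetic:

```python
import numpy as np
from sklearn.cluster import SpectralClustering, DBSCAN

X_scaled = np.random.RandomState(0).rand(500, 5)  # placeholder scaled data

# Spectral clustering with an RBF affinity; gamma controls the
# kernel width and usually needs tuning.
labels_sc = SpectralClustering(n_clusters=10, affinity='rbf',
                               gamma=1.0).fit_predict(X_scaled)

# DBSCAN finds the number of clusters itself; eps sets the
# neighbourhood radius, min_samples the density threshold.
labels_db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
```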
You can also consult this chart, which reflects my personal opinion; I can't find the link to it in the scikit-learn docs right now...
For judging the result: if you have no ground truth of any kind (which I imagine you don't, if this is exploratory), there is no good measure [yet] in scikit-learn.
There is one unsupervised measure, silhouette score, but afaik that favours very compact clusters as found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?
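For example, projecting onto two principal components purely for plotting, while the clustering itself is done in the original space (the data and cluster labels here are synthetic stand-ins):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_scaled = np.random.RandomState(0).rand(5000, 5)  # placeholder scaled data
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X_scaled)

# 2D projection for inspection only, coloured by cluster label.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
```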