Scikit Learn Variable Bias - python

I am using Scikit to make some prediction on a very large set of data. The data is very wide, but not very long so I want to set some weights to the parts of the data. If I know some parts of the data are more important then other parts how should I inform SCikit of this, or does it kinda break the whole machine learning approach to do some pre-teaching.

The most straightforward way of doing this is perhaps by using Principal Component Analysis on your data matrix X. Principal vectors form an orthogonal basis of X, and they are each one a linear combination of the original feature space (normally columns) of X. The decomposition is such that each principal vector has a corresponding eigenvalue (or singular value depending on how you compute PCA) a scalar that reflects how much reconstruction can be made solely on the basis of that principal vector alone, in a least-squares sense.
The magnitude of coefficients of principal vectors can be interpreted as importance of the individual features of your data, since each coefficient maps 1:1 to a feature or column of the matrix. By selecting one or two principal vectors and examining their magnitudes, you may have a preliminary insight of what columns are more relevant, of course up to how much these vectors approximate the matrix.
This is the detailed scikit-learn API description. Again, PCA is a simple but just one way of doing it, among others.

This probably depends a bit on the machine learning algorithm you're using -- many will discover feature importances on their own (as elaborated via the feature_importances_ property in random forest and others).
If you're using a distance-based measure (e.g. k-means, knn) you could manually weight the features differently by scaling the values of each feature accordingly (though it's possible scikit does some normalization...).
Alternatively, if you know some features really don't carry much information you could simply eliminate them, though you'd lose any diagnostic value these features might unexpectedly bring. There are some tools in scikit for feature selection that might help make this kind of judgement.

Related

Does correlation important factor in Unsupervised learning (Clustering)?

I am working with the dataset of size (500, 33).
In particular the data set contains 9 features say
[X_High, X_medium, X_low, Y_High, Y_medium, Y_low, Z_High, Z_medium, Z_low]
Both visually & after correlation matrix calculation I observed that
[X_High, Y_High, Z_High] & [ X_medium, Y_medium, Z_medium ] & [X_low, Y_low, Z_low] are highly correlated (above 85%).
I would like to perform a Clustering algorithm (say K means or GMM or DBSCAN).
In that case,
Is it necessary to remove the correlated features for Unsupervised learning ?
Whether removing correlation or modifying features creates any impact ?
My assumption here is that you're asking this question because in cases of linear modeling, highly collinear variables can cause issues.
The short answer is no, you don't need to remove highly correlated variables from clustering for collinearity concerns. Clustering doesn't rely on linear assumptions, and so collinearity wouldn't cause issues.
That doesn't mean that using a bunch of highly correlated variables is a good thing. Your features may be overly redundant and you may be using more data than you need to reach the same patterns. With your data size/feature set that's probably not an issue, but for large data you could leverage the correlated variables via PCA/dimensionality reduction to reduce your computation overhead.
Removal of features in unsupervised learning is not a complicated matter. You should include features you want to analyze and remove features you don't want to analyze. Including too many features makes inference much more difficult. I can typically do inference on around 10 to 20 features at the most. More than that and you have a mess of a diagram to explain to someone. If you don't need to do inference, then you could consider adding more features but it is still not advisable due to blowing up your vector space.
The objective way to determine if 20 features vs. 100 features improved your segmentation model is to utilize supervised learning to validate the segments. This is one approach towards validating your unsupervised method.

The use of feature scaling in scikit learn

I'm studing machine learning from here and the course uses 'scikit learn' from regression - https://www.udemy.com/machinelearning/
I can see that for some training regression algorithms, the author uses feature scaling and for some he doesn't because some 'scikit learn' regression algorithms take care of feature scaling by themselves.
How to know in which training algorithm we need to do feature scaling and where we don't need to ?
No machine learning technique needs feature scaling, for some algoirthms scaled inputs make the optimizing easier on the computer which results in faster training time.
Typically, algorithms that leverage distance or assume normality benefit from feature scaling. https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
It depends on the algorithm you are using and your dataset.
Support Vector Machines (SVM), these models converge faster if you scale your features . The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges
In K-means clustering, you find out the Euclidean distance for clustering different data points together. Thus it seems to be a good reason to scale your features so that the centroid doesn't get much affected by the large or abnormal values.
In case of regression, scaling your features will not be of much help since the relation of coefficients between original dataset and the relation of coefficients between scaled dataset will be the same.
In case of Decision Trees, they don't usually require feature scaling.
In case of models which have learning rates involved and are using gradient descent, the input scale does effect the gradients. So feature scaling would be considered in this case.
A very simple answer. Some algorithm does the feature scaling even if you don't and some do not. So, if the algorithm does not, you need to manually scale the features.
You can google which algorithm does the feature scaling, but its good to be safe by manually scaling the feature. Always make sure, the features are scaled, otherwise, the algorithm would give output offset to ideal.

See retained variance in scikit-learn manifold learning methods

I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance, I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, isomap has a reconstruction error, tsne can return its KL-divergence, and MDS can return the reconstruction stress.

Semi-supervised Gaussian mixture model clustering in Python

I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned is likely the initial values for the mean vectors and covariance matrices for the gaussians which will be updated by the EM algorithm.
A simple hack will be to group your labeled data based on their labels and individually estimate mean vectors and covariance matrices for them and pass these as the initial values to your MATLAB function (scikit-learn does not allow this as far as I'm aware). Hopefully this will position your Gaussians at the "correct locations". The EM algorithm will then take it from there to adjust these parameters.
The downside of this hack is that it does not guarantee that it will respect your true label assignment, hence even if a data point is assigned a particular cluster label, there is a chance that it might be re-assigned to another cluster. Also, noise in your feature vectors or labels could also cause your initial Gaussians to cover a much larger region than it is suppose to, hence wrecking havoc on the EM algorithm. Also, if you do not have sufficient data points for a particular cluster, your estimated covariance matrices might be singular, hence breaking this trick altogether.
Unless it is a must for you to use GMM to cluster your data (for e.g., you know for sure that gaussians model your data well), then perhaps you can just try the semi-supervised methods in scikit-learn . These will propagate the labels based on feature similarities to your other data point. However, I doubt this can handle large dataset as it requires the graph laplacian matrix to be built from pairs of samples, unless there is some special implementation trick to handle this in scikit-learn.

scikits.learn clusterization methods for curve fitting parameters

I would like some suggestion on the best clusterization technique to be used, using python and scikits.learn. Our data comes from a Phenotype Microarray, which measures the metabolism activity of a cell on various substrates over time. The output are a series of sigmoid curves for which we extract a series of curve parameters through a fitting to a sigmoid function.
We would like to "rank" this activity curves through clusterization, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples and 5 parameters for each sample. The number of samples can vary, but it is usually around several thousands (i.e. 5'000). The clustering seems efficient and effective, but I would appreciate any suggestion on different methods or on the best way to perform an assessment of the clustering quality.
Here a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), the color of the single samples is relative to the assigned cluster.
the sigmoid curves from which the input parameters have been extracted, whose color is relative to their assigned cluster
EDIT
Below some elbow plots and the silhouette score for each number of cluster.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data good enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give. It just optimizes some number. It's up to you to check that the results are useful, and analyze what their semantic meaning is - and it might well be that it just is mathematically a local optimum, but meaningless for your task.
For 5000 samples, all methods should work without problem.
The is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are WARD and DBSCAN, also solid choices.
You can also consult this chart of my personal opinion which I can't find the link to in the scikit-learn docs...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, silhouette score, but afaik that favours very compact clusters as found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?

Categories

Resources