What happens after creating PCA? - python

What happens after I create a dimensionality reduction algorithm (PCA) that has produced a matrix W?
How do I now use it to make predictions on real-time data?
Do I need to create a user interface? If that's the case, how and where?
I'm completely lost.

It depends on what you wanted to achieve by doing PCA. One common use case is clustering. If you look at the summary of your PCA model, you will see the list of principal components and the proportion of variance explained by each of them. Choose the components that explain most of the variance (normally the cumulative proportion of variance explained by the chosen PCs should be >80%). You can also plot a scree plot and look for the break in the graph to determine the number of components to use in clustering. Feed those PCs into your clustering model and you may see some good clustering results.
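As a rough sketch (assuming scikit-learn and a NumPy feature matrix X; the data below is a placeholder), this is how you would read off the per-component and cumulative explained variance for a scree plot and build the projection matrix W:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# placeholder data; replace with your own feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

explained = pca.explained_variance_ratio_             # proportion of variance per component
cumulative = np.cumsum(explained)
n_keep = int(np.searchsorted(cumulative, 0.80)) + 1    # components needed to reach ~80%

W = pca.components_[:n_keep]                           # the projection matrix W
scores = X_std @ W.T                                   # inputs for a clustering model
print(explained, cumulative, n_keep)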
For your second query, follow this link.
You don't need to create any user interface right now. Just follow the link above and you should get some understanding of PCA and of how to use it for predictive modeling.
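For the "predict real-time data" part of your question, here is a minimal sketch (the logistic regression is just a placeholder classifier) of fitting PCA inside a scikit-learn pipeline, so the same learned projection W is applied to any new sample:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# placeholder training data; replace with your own X, y
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
model.fit(X, y)                 # W is learned from the training data only

X_new = X[:3]                   # pretend these are new, incoming samples
print(model.predict(X_new))     # they are projected with the same W, then classified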

Related

Interpretation of K-Means clustering with PCA

I have implemented a K-Means clustering on a dataset in which I have reduced the dimensionality to 2 features with PCA.
Now I am wondering how to interpret this analysis, since there is no reference to which variables lie on the axes. Given that doubt, I am also wondering whether it is good practice to apply K-Means to a dataset whose dimensionality has been reduced with PCA.
How can I interpret this kind of clustering?
Thank you!
It is hard to give an answer that addresses your question directly, since it is not specific enough and I have no idea about your data or the objective of your research. So let me answer your question from a general perspective, in case it helps.
First of all, PCA strictly decreases the interpretability of the analysis, because it reduces the dimensions based on linear relations among the variables and you can no longer name the reduced components. In addition, check the correlations among the variables before PCA to get an intuition of how successful the PCA will be, and check the variance it explains afterwards: the lower the explained variance ratio, the greater the information loss, and the more it may mislead your interpretations.
If your objective is to analyse the data and make inferences, I would suggest not reducing the dimensionality. You only have 3 dimensions: you can apply K-Means without PCA and plot the clusters in 3D. Matplotlib and Plotly provide interactive features for this (see the sketch at the end of this answer).
However, if your objective is to build a machine learning model, then you should reduce the dimensionality if the variables are highly correlated. This would be a big favour to your model.
Finally, applying K-Means after PCA is not something you must avoid, but it does make interpretation more difficult.
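A minimal sketch of the 3D suggestion above, assuming the three original features are the columns of a NumPy array X (placeholder data here):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# placeholder (n_samples, 3) feature matrix; replace with your own data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")      # 3D axes
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")
plt.show()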

Linear regression model selection for time series data prediction

I have a signal and want to predict y, which represents the number of requests, using regression models. Currently I am using an OLS regression model to predict y, but the prediction error is very high because my signal has a lot of variation (ups and downs), as shown below.
I noticed that my model overestimates y (number of requests) most of the time, especially when the points to be predicted are preceded by large values of y, as indicated below by the yellow and red circles.
So I am not sure whether there is a robust regression model that accommodates this problem of high variation in my dataset. Also, is there any way to segment out these large values by adapting the window size so that it does not include them?
Could you please advise?
From the visualization of the error I would say a linear model is not appropriate, and you should consider something that handles periodic data as well as a moving average: your data appears to have periodic elements and a moving-average element that go beyond anything "linear". Consider something like ARIMA. Here's a link to a tutorial on ARIMA: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/ Please post the results :)
Vishaal
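A minimal ARIMA sketch with statsmodels, assuming the request counts live in a pandas Series with a time index; the synthetic series and the order (p, d, q) = (2, 1, 2) are placeholders to replace and tune (e.g. via AIC):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# placeholder series standing in for the request counts; replace with your own data
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=200, freq="h")
y = pd.Series(100 + 10 * np.sin(np.arange(200) / 12) + rng.normal(0, 3, 200), index=idx)

train, test = y[:-24], y[-24:]               # hold out the last 24 observations

fit = ARIMA(train, order=(2, 1, 2)).fit()    # (p, d, q) is a placeholder to tune
forecast = fit.forecast(steps=len(test))     # out-of-sample forecast

print(fit.summary())
print("MAE:", (forecast - test).abs().mean())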

Semi-supervised Gaussian mixture model clustering in Python

I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned are likely the initial values for the mean vectors and covariance matrices of the Gaussians, which are then updated by the EM algorithm.
A simple hack would be to group your labeled data by label, estimate a mean vector and covariance matrix for each group individually, and pass these as the initial values to your MATLAB function (recent versions of scikit-learn's GaussianMixture also accept such initial values via means_init and precisions_init). Hopefully this will position your Gaussians at the "correct" locations. The EM algorithm will then take it from there and adjust these parameters.
The downside of this hack is that it does not guarantee that the model will respect your true label assignment: even if a data point carries a particular cluster label, there is a chance it will be re-assigned to another cluster. Noise in your feature vectors or labels could also cause your initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. And if you do not have enough data points for a particular cluster, your estimated covariance matrices might be singular, breaking this trick altogether.
Unless you must use a GMM to cluster your data (e.g. you know for sure that Gaussians model your data well), you could instead try the semi-supervised methods in scikit-learn. These propagate labels to your other data points based on feature similarities. However, I doubt they can handle large datasets, since they require building the graph Laplacian matrix from pairs of samples, unless there is some special implementation trick for this in scikit-learn.
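A minimal sketch of the initialization hack in scikit-learn, assuming a version whose GaussianMixture accepts means_init (the labeled arrays below are placeholders for your labeled samples):

import numpy as np
from sklearn.mixture import GaussianMixture

# placeholder data: X_labeled/y_labeled are samples with known cluster labels,
# X_all is everything you want to segment
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y_labeled = np.repeat([0, 1], 50)
X_all = np.vstack([X_labeled, rng.normal(2.5, 2, (200, 3))])

n_components = len(np.unique(y_labeled))

# estimate per-label means to seed EM near the "correct" locations
means_init = np.vstack([X_labeled[y_labeled == k].mean(axis=0)
                        for k in range(n_components)])

gmm = GaussianMixture(n_components=n_components, means_init=means_init,
                      covariance_type="full", random_state=0)
labels = gmm.fit_predict(X_all)              # EM may still move or reassign clusters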

Scikit Learn Variable Bias

I am using scikit-learn to make some predictions on a very large set of data. The data is very wide but not very long, so I want to assign weights to parts of the data. If I know some parts of the data are more important than other parts, how should I inform scikit-learn of this? Or does it somewhat break the whole machine learning approach to do this kind of pre-teaching?
The most straightforward way of doing this is perhaps by using principal component analysis on your data matrix X. Principal vectors form an orthogonal basis of X, and each one is a linear combination of the original features (normally the columns) of X. The decomposition is such that each principal vector has a corresponding eigenvalue (or singular value, depending on how you compute the PCA): a scalar that reflects how much of the matrix can be reconstructed from that principal vector alone, in a least-squares sense.
The magnitudes of the coefficients of the principal vectors can be interpreted as importances of the individual features of your data, since each coefficient maps 1:1 to a feature (column) of the matrix. By selecting one or two principal vectors and examining their coefficient magnitudes, you can get a preliminary insight into which columns are more relevant, of course only up to how well those vectors approximate the matrix.
This is the detailed scikit-learn API description. Again, PCA is simple, but it is just one way of doing it among others.
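A minimal sketch of reading feature relevance off the leading principal vectors with scikit-learn (the data matrix is a placeholder for your wide-but-short X):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# placeholder wide-but-short data matrix; replace with your own X
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))

X_std = StandardScaler().fit_transform(X)     # scale so no column dominates by units
pca = PCA(n_components=2).fit(X_std)

# coefficient magnitudes of the first principal vector, one per original column
importance = np.abs(pca.components_[0])
top_cols = np.argsort(importance)[::-1][:10]
print("explained variance ratio:", pca.explained_variance_ratio_)
print("10 columns with the largest loadings on PC1:", top_cols)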
This probably depends a bit on the machine learning algorithm you're using: many will discover feature importances on their own (as exposed via the feature_importances_ attribute in random forests and other estimators).
If you're using a distance-based method (e.g. k-means, k-NN) you could manually weight the features differently by scaling the values of each feature accordingly (though it's possible scikit-learn does some normalization...).
Alternatively, if you know some features really don't carry much information, you could simply eliminate them, though you'd lose any diagnostic value those features might unexpectedly bring. There are some tools in scikit-learn for feature selection that might help you make this kind of judgement.
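A minimal sketch of both suggestions, using a random forest's feature_importances_ and manual scaling for a distance-based method; the data and the weight values are placeholders standing in for your domain knowledge:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# placeholder data; replace with your own X, y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

# let a tree ensemble discover importances on its own
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("learned importances:", rf.feature_importances_)

# for a distance-based method, up-weight features by scaling their values
weights = np.array([3.0, 1.0, 1.0, 1.0, 0.5, 0.5])   # assumed domain knowledge
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X * weights)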

scikits.learn clusterization methods for curve fitting parameters

I would like some suggestions on the best clustering technique to use with Python and scikits.learn. Our data comes from a phenotype microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters per sample. The number of samples can vary, but it is usually around several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods or on the best way to assess the clustering quality.
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), where the colour of each sample corresponds to its assigned cluster;
the sigmoid curves from which the input parameters were extracted, coloured by their assigned cluster.
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened along this axis.
You absolutely must:
perform careful preprocessing (see the sketch after this answer)
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze their semantic meaning - it might well be that the result is just a mathematical local optimum that is meaningless for your task.
For 5000 samples, all methods should work without problem.
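A minimal preprocessing sketch, assuming you standardize the five curve parameters before running K-Means with the settings from the question, so that a large-scale, highly correlated pair such as "Area"/"Height" no longer dominates the distance (the data is a placeholder):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# placeholder n_samples x 5 matrix of fitted curve parameters, on wildly different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5)) * [100.0, 50.0, 1.0, 0.1, 0.01]

X_std = StandardScaler().fit_transform(X)    # put every parameter on a comparable scale

km = KMeans(n_clusters=10, init="random", n_init=100, max_iter=1000, random_state=0)
labels = km.fit_predict(X_std)
print("silhouette:", silhouette_score(X_std, labels))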
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are Ward and DBSCAN, which are also solid choices (see the sketch at the end of this answer).
You can also consult this chart, which reflects my personal opinion; I can't find the link to it in the scikit-learn docs...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, the silhouette score, but AFAIK it favours the kind of very compact clusters found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?
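A minimal sketch comparing spectral clustering (RBF affinity, gamma to be tuned) and DBSCAN on the same standardized parameters, with the silhouette score as a rough unsupervised check; the blob data is a placeholder for your curve parameters:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# placeholder data standing in for the standardized curve parameters
X, _ = make_blobs(n_samples=1000, n_features=5, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

spec = SpectralClustering(n_clusters=4, affinity="rbf", gamma=1.0,   # gamma must be tuned
                          assign_labels="kmeans", random_state=0)
spec_labels = spec.fit_predict(X)
print("spectral silhouette:", silhouette_score(X, spec_labels))

db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)           # no n_clusters needed
mask = db_labels != -1                                               # ignore DBSCAN noise points
if mask.any() and len(set(db_labels[mask])) > 1:
    print("DBSCAN silhouette:", silhouette_score(X[mask], db_labels[mask]))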
