So I'm currently working on a project that involves the use of Principal Component Analysis (PCA), and I'm attempting to learn it on the fly. Luckily, Python has a very convenient implementation in sklearn.decomposition that seems to do most of the work for you. Before I really start to use it, though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
   0  1
0  1  2
1  3  1
2  4  6
3  5  3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes some vectors onto which you can project your data in order to reduce its dimension. Since each row of your data is 2-dimensional, there will be a maximum of 2 vectors onto which the data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it will have the same size as the number of columns in your training data. Since you did a full PCA, you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors maximizes the variance of the projected data. The second maximizes the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data onto the vectors that were learned when fit was called. For each row of the data you pass to transform you'll get 1 row in the output, and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor (see the sketch below for the exact calculation).
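Concretely, transform centers each row using the per-column mean learned during fit and then takes dot products with the rows of components_. Here's a minimal sketch, using the exact numbers from your question, that reproduces the transform output:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2],
              [3, 1],
              [4, 6],
              [5, 3]], dtype=float)

pca = PCA()
pca.fit(X)

# Center with the mean learned during fit ([3.25, 3.0] here), then project
# each row onto the component vectors (the rows of pca.components_).
X_centered = X - pca.mean_
manual = X_centered @ pca.components_.T

print(np.allclose(manual, pca.transform(X)))  # True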
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
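For instance, a quick sketch of just the shapes, with made-up random data (100, 500 and 10 are only the numbers from the paragraph above):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 500)        # 100 samples, 500 features
pca = PCA(n_components=10)
transformed = pca.fit_transform(X)

print(pca.components_.shape)        # (10, 500): one 500-dimensional vector per component
print(transformed.shape)            # (100, 10): every row projected into 10 dimensions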
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
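If you want to peek under the hood, here's a rough sketch of that idea in plain NumPy (not the exact scikit-learn implementation, which also handles sign conventions, solver selection and efficiency):

import numpy as np

X = np.random.rand(100, 500)
X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 10
components = Vt[:k]                      # corresponds to pca.components_ (up to sign)
projected = X_centered @ components.T    # corresponds to pca.transform(X)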
Related
I need to use pca to identify the dimensions with the highest variance of a certain set of data. I'm using scikit-learn's pca to do it, but I can't identify from the output of the pca method what are the components of my data with the highest variance. Keep in mind that I don't want to eliminate those dimensions, only identify them.
My data is organized as a matrix with 150 rows of data, each one with 4 dimensions. I'm doing as follows:
pca = sklearn.decomposition.PCA()
pca.fit(data_matrix)
When I print pca.explained_variance_ratio_, it outputs an array of variance ratios ordered from highest to lowest, but it doesn't tell me which dimension from the data they correspond to (I've tried changing the order of columns on my matrix, and the resulting variance ratio array was the same).
Printing pca.components_ gives me a 4x4 matrix (I left the original number of components as the argument to pca) with some values whose meaning I can't understand... according to scikit's documentation, they should be the components with the maximum variance (the eigenvectors perhaps?), but there's no sign of which dimension those values refer to.
Transforming the data doesn't help either, because the dimensions are changed in a way that I can't really tell which ones they were originally.
Is there any way I can get this information with scikit's pca? Thanks
The values returned by pca.explained_variance_ratio_ are the variance ratios of the principal components. You can use them to find how many dimensions (components) PCA should keep when transforming your data. You can use a threshold for that (e.g., count how many variance ratios are greater than 0.5). After that, you can transform the data by PCA using a number of components equal to the number of principal components above the threshold (a small sketch follows below). The data reduced to these dimensions is different from the data expressed in the original dimensions.
you can check the code from this link:
http://scikit-learn.org/dev/tutorial/statistical_inference/unsupervised_learning.html#principal-component-analysis-pca
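To make the threshold idea above concrete, here's a small sketch (the 0.5 cut-off is only an example, and data_matrix is the 150 x 4 matrix from the question):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(data_matrix)                        # data_matrix: 150 rows x 4 columns

ratios = pca.explained_variance_ratio_      # sorted from highest to lowest
n_keep = max(1, int(np.sum(ratios > 0.5)))  # count ratios above the threshold, keep at least one

reduced = PCA(n_components=n_keep).fit_transform(data_matrix)
print(reduced.shape)                        # (150, n_keep)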
I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply it to my data. I know a scatter plot takes 2 vectors as a parameter, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
import matplotlib.pyplot as plt

# our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0, :], our_data[1, :], s=40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensional feature space, it is difficult to plot such data in 2D (i.e. on a canvas). In such cases one usually applies dimension-reduction techniques first; this can help you understand the data structure. You can try applying, e.g., PCA (principal component analysis) first:
# your_matrix.shape = (569, 30)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
plt.scatter(projected_data[:, 0], projected_data[:, 1])  # this can be very helpful for understanding the data structure
plt.show()
You can also look at other (including non-linear) dimension-reduction techniques, such as t-SNE.
Further, you can apply k-means (or something else) to the original data, or apply k-means to the projected data, e.g. as sketched below.
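For example, a minimal continuation of the snippet above that clusters the 2-D projection (the choice of 3 clusters is arbitrary here):

from sklearn.cluster import KMeans

# projected_data and plt come from the PCA snippet above (shape: 569 x 2)
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(projected_data)

plt.scatter(projected_data[:, 0], projected_data[:, 1], c=labels, s=40)
plt.show()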
If by "initialize" you mean picking the k initial cluster centers, one of the common ways of doing so is to use k-means++ (described here), which was developed in order to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based upon a probability distribution of distances away from a first center that is chosen completely randomly.
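In scikit-learn, k-means++ is already the default initialization; here's a minimal sketch just to show where the option lives (the data is a placeholder):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                 # placeholder data

# init='k-means++' is the default; written out explicitly for clarity
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.cluster_centers_)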
I'm currently learning the basics of data science online. In one of the sessions, on Multiple Linear Regression using Python, the tutor executed the step below to add an array of ones to the matrix of features; I did not understand why it is being added. On online forums, it is mentioned that it is added so that the model (equation) has a constant offset. But why 1 and not any other value? Does the number of independent variables (3) have any impact on this value?
X -> matrix of features; number of rows in data set: 50; number of independent variables: 3
X = np.append(arr = np.ones([50,1]).astype(int), values = X,axis=1)
To better explain, let's imagine you have only 1 feature and, say, 3 training examples.
Then your parameters are:
$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$
And your input variables are:
$x^{(1)}, x^{(2)}, x^{(3)}$
If you want to fit a linear regression, you compute the model's prediction (the hypothesis) for each training example $i$:
$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$
And if you want to vectorize the calculation (for efficiency and code readability), you want to compute the following matrix product:
$X\theta$
However, by definition of the matrix product, the number of columns of the matrix $X$ must equal the number of rows of the matrix $\theta$. Thus, to make the product defined while leaving the result unchanged, you add a column of ones to the left of $X$:
$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ 1 & x^{(3)} \end{bmatrix}$
Then the result for each sample $i$ is:
$h_\theta(x^{(i)}) = \theta_0 \cdot 1 + \theta_1 x^{(i)}$
TLDR: You need to append a column of ones to X for the matrix product $X\theta$ to be defined. If you added any other constant c instead of 1, then your constant offset $\theta_0$ would be multiplied by that coefficient c.
I think the cost function is the summation of errors between the predicted label and the actual label, which we want to minimise; the formula given above is the hypothesis function, not the cost function.
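As a small numeric illustration of the column-of-ones trick with NumPy (the numbers are hypothetical, 3 samples and 1 feature as above):

import numpy as np

X = np.array([[2.0],
              [4.0],
              [6.0]])                    # 3 samples, 1 feature
theta = np.array([[1.5],                 # theta_0: the constant offset (intercept)
                  [0.5]])                # theta_1: the coefficient of the feature

# Prepend a column of ones so that X @ theta is defined and theta_0 is
# multiplied by 1 (not by the feature value) for every sample.
X_with_ones = np.append(np.ones([3, 1]), X, axis=1)   # shape (3, 2)

predictions = X_with_ones @ theta        # h(x_i) = theta_0 * 1 + theta_1 * x_i
print(predictions.ravel())               # [2.5 3.5 4.5]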
I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:
   A  B  C  D ...         A  B  C  D ...
A  0  1  5  3          A  0  5  3  9
B  4  0  4  1          B  2  0  7  8
C  2  6  0  3          C  2  6  0  1
D  2  7  1  0          D  5  2  5  0
...                    ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first adjusting the distances in each matrix by dividing every distance by the largest distance in the matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resultant matrix. The algorithm I was advised to use for this was the k means algorithm. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two and vice versa for clusters of negative numbers.
Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple different modules that can be used. I've tried all three of these:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy') #loads difference matrix from file
a = np.array([x[0:] for x in data])
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print(model)
2.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print(model.fit(data))
3.
import numpy as np
import sys
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=sys.maxsize)  # print full arrays; np.nan is rejected by newer NumPy versions
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print(centroids)
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist, if the reader hadn't already guessed.) I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster (so three in this case), that I could then trace back to my two original matrices to identify the names of the pairs of interest.
However, what I get is an array containing a list of numbers (one for each cluster), but I don't really understand what these numbers are; they don't obviously correspond to what I had in my input matrix, other than the fact that there are 232 items in each list, which is the same as the number of rows and columns in the input matrix. And the last item in the array is another single number, which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
I've been trying to figure this out for quite a while now, but I'm struggling to get anywhere. Whenever I search for interpreting the output of k-means, I just get explanations of how to plot my clusters on a graph, which isn't what I want to do. Please can someone explain what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?
You have two issues here, and the recommendation of k-means probably was not very good...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
If you compute the difference of two distance matrices, small values correspond to points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0 and will thus be considered identical now.
So you haven't understood the input of k-means, no wonder you do not understand the output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity matrix; the usual implementations, which expect a distance matrix, will not work.
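If you go the hierarchical route, here is one possible sketch with SciPy; note that turning the difference matrix into something a standard distance-based implementation accepts (here: absolute values as a similarity, then max minus similarity as a distance) is my own crude assumption, not a prescription, and whether it is meaningful for your data is for you to judge:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

diff = np.load('difference_matrix_file.npy')     # 232 x 232 difference matrix

sim = np.abs(diff)                               # e.g. absolute values as a similarity
dist = sim.max() - sim                           # crude similarity -> distance conversion
np.fill_diagonal(dist, 0.0)                      # self-distance must be zero
dist = (dist + dist.T) / 2.0                     # enforce symmetry

Z = linkage(squareform(dist, checks=False), method='average')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)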
Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
Sorry for being blunt, I also hate the "RTFM"-type answers, but the functions you used are well documented at:
sklearn.cluster,
scipy.cluster.vq.
In short,
sklearn.cluster.k_means() returns a tuple with three fields (a short unpacking sketch follows this list):
an array with the centroids (that should be 3x232 for you)
the label assignment for each point (i.e. a 232-long array with values 0-2)
and "intertia", a measure of how good the clustering is; there are several measures for that, so you might be better off not paying too much attention to this;
scipy.cluster.vq.kmeans2() returns a tuple with two fields:
the cluster centroids (as above)
the label assignment (as above)
kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
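A minimal sketch of unpacking both return values (reusing the file name from your snippets):

import numpy as np
import sklearn.cluster
from scipy.cluster.vq import kmeans2

data = np.load('difference_matrix_file.npy')   # 232 x 232

# sklearn: returns (centroids, labels, inertia)
centroids, labels, inertia = sklearn.cluster.k_means(data, 3)
print(centroids.shape)   # (3, 232)
print(labels.shape)      # (232,)

# scipy: kmeans2 returns (centroids, labels)
centroids2, labels2 = kmeans2(data, 3)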
As for how to get to the coordinates of the points in each cluster, you could:
for cc in range(clust_centers):
    print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))
where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points x coordinates array, difference_matrix in your case.