How to get contributions and squared cosines in sklearn PCA? - python

Working primarily based on this paper I want to implement the various PCA interpretation metrics mentioned - for example cosine squared and what the article calls contribution.
However the nomenclature here seems very confusing, namely it's not clear to me what exactly sklearns pca.components_ is. I've seen some answers here and in various blogs stating that these are loadings while others state it's component scores (which I assume is the same thing as factor scores).
The paper defines contribution (of observation to component) as:
and states all contributions for each component must add to 1, which is not the case assuming pca.explained_variance_ is the eigenvalues and pca.components_ are the factor scores:
df = pd.DataFrame(data = [
[0.273688,0.42720,0.65267],
[0.068685,0.008483,0.042226],
[0.137368, 0.025278,0.063490],
[0.067731,0.020691,0.027731],
[0.067731,0.020691,0.027731]
], columns = ["MeS","EtS", "PrS"])
pca = PCA(n_components=2)
X = pca.fit_transform(df)
ctr=(pd.DataFrame(pca.components_.T**2)).div(pca.explained_variance_)
np.sum(ctr,axis=0)
# Yields random values 0.498437 and 0.725048
How can I calculate these metrics? The paper defines cosine squared similarly as:

This paper does not play well with sklearn as far as definitions are concerned.
The pca.components_ are the two principal components of your data after your data is centered. And pca.fit_transform(df) gives you the components of your centered data set w.r.t. those two principal components, i.e., the factor scores.
> pca.fit_transform(df)
array([[ 0.60781787, -0.00280834],
[-0.1601333 , -0.01246807],
[-0.11667497, 0.04584743],
[-0.1655048 , -0.01528551],
[-0.1655048 , -0.01528551]])
Next, the lambda_l of equation (10) in the paper is just the sum of the squares of the factor scores for the l-th component, i.e. l-th column of pca.fit_transform(df). But pca.explained_variance_ gives you the two variances, and since sklearn uses as degrees of freedom the value len(df.index) - 1, we have lambda_l == (len(df.index)-1) pca.explained_variance_[l].
> X = pca.fit_transform(df)
> lmbda = np.sum(X**2, axis = 0)
> lmbda
array([0.46348196, 0.00273262])
> (5-1) * pca.explained_variance_
array([0.46348196, 0.00273262])
Thus, as a summary, I recommend computing the contributions as:
> ctr = X**2 / np.sum(X**2, axis = 0)
For the squared cosine it's the same except that we sum over the rows of pca.fit_transform(df):
> cos_sq = X**2 / np.sum(X**2, axis = 1)[:, np.newaxis]

Related

How to get the unscaled regression coefficients errors using statsmodels?

I'm trying to compute the coefficient errors of a regression using statsmodels. Also known as the standard errors of the parameter estimates. But I need to compute their "unscaled" version. I've only managed to do so with NumPy.
You can see the meaning of "unscaled" in the docs: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
cov bool or str, optional
If given and not False, return not just the estimate but also its covariance matrix.
By default, the covariance are scaled by chi2/dof, where dof = M - (deg + 1),
i.e., the weights are presumed to be unreliable except in a relative sense and
everything is scaled such that the reduced chi2 is unity. This scaling is omitted
if cov='unscaled', as is relevant for the case that the weights are w = 1/sigma, with
sigma known to be a reliable estimate of the uncertainty.
I'm using this data to run the rest of the code in this post:
import numpy as np
x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
# The convention for weights are different
sm_weights = np.array([1.0/sigma**2 for sigma in sigmas])
np_weights = np.array([1.0/sigma for sigma in sigmas])
With NumPy:
coefficients, cov = np.polyfit(x, y, deg=2, w=np_weights, cov='unscaled')
# The errors I need to get
print(np.sqrt(np.diag(cov))) # [917.57938013 191.2100413 211.29028248]
If I compute the regression using statsmodels:
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as smapi
polynomial_features = PolynomialFeatures(degree=2)
polynomial = polynomial_features.fit_transform(x.reshape(-1, 1))
model = smapi.WLS(y, polynomial, weights=sm_weights)
regression = model.fit()
# Get coefficient errors
# Notice the [::-1], statsmodels returns the coefficients in the reverse order NumPy does
print(regression.bse[::-1]) # [0.24532856, 0.05112286, 0.05649161]
So the values I get are different, but related:
np_errors = np.sqrt(np.diag(cov))
sm_errors = regression.bse[::-1]
print(np_errors / sm_errors) # [3740.2061481, 3740.2061481, 3740.2061481]
The NumPy documentation says the covariance are scaled by chi2/dof where dof = M - (deg + 1). So I tried the following:
degree = 2
model_predictions = np.polyval(coefficients, x)
residuals = (model_predictions - y)
chi_squared = np.sum(residuals**2)
degrees_of_freedom = len(x) - (degree + 1)
scale_factor = chi_squared / degrees_of_freedom
sm_cov = regression.cov_params()
unscaled_errors = np.sqrt(np.diag(sm_cov * scale_factor))[::-1] # [0.09848423, 0.02052266, 0.02267789]
unscaled_errors = np.sqrt(np.diag(sm_cov / scale_factor))[::-1] # [0.61112427, 0.12734931, 0.14072311]
What I notice is that the covariance matrix I get from NumPy is much larger than the one I get from statsmodels:
>>> cov
array([[ 841951.9188366 , -154385.61049538, -188456.18957375],
[-154385.61049538, 36561.27989418, 31208.76422516],
[-188456.18957375, 31208.76422516, 44643.58346933]])
>>> regression.cov_params()
array([[ 0.0031913 , 0.00223093, -0.0134716 ],
[ 0.00223093, 0.00261355, -0.0110361 ],
[-0.0134716 , -0.0110361 , 0.0601861 ]])
As long as I can't make them equivalent, I won't be able to get the same errors. Any idea of what the difference in scale could mean and how to make both covariance matrices equal?
statsmodels documentation is not well organized in some parts.
Here is a notebook with an example for the following
https://www.statsmodels.org/devel/examples/notebooks/generated/chi2_fitting.html
The regression models in statsmodels like OLS and WLS, have an option to keep the scale fixed. This is the equivalent to cov="unscaled" in numpy and scipy.
The statsmodels option is more general, because it allows fixing the scale at any user defined value.
https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html
We we have a model as defined in the example, either OLS or WLS, then using
regression = model.fit(cov_type="fixed scale")
will keep the scale at 1 and the resulting covariance matrix is unscaled.
Using
regression = model.fit(cov_type="fixed scale", cov_kwds={"scale": 2})
will keep the scale fixed at value two.
(some links to related discussion motivation are in https://github.com/statsmodels/statsmodels/pull/2137 )
Caution
The fixed scale cov_type will be used for inferential statistic that are based on the covariance of the parameter estimates, cov_params.
This affects standard errors, t-tests, wald tests and confidence and prediction intervals.
However, some other results statistics might not be adjusted to use the fixed scale instead of the estimated scale, e.g. resid_pearson.
https://github.com/statsmodels/statsmodels/issues/8190

statsmodels PCA eigenvalues sum

When I apply statsmodels.multivariate.pca.PCA to some data, I am finding that the sum of the produced eigenvalues does not equal to the total variance of the data. I am using the following code
import numpy as np
import statsmodels.api as sm
corr_matrix = np.array([
[1, 0.8, 0.4],
[0.8, 1, 0.6],
[0.4, 0.6, 1]])
Z = np.random.multivariate_normal([0,0,0], corr, 1000)
pc = sm.PCA(Z, standardize=False, demean=False, normalize=False)
pc.eigenvals.sum()
and the result (in a given random sample) is 2994.51488403581 while I was expecting this to add up to 3.
What am I missing?
Add 1
It seems that when the PCA is performed on the data X (i.e. using the matrix X^TX), the relationship between sum of variances and eigenvalues no longer holds, and it is only when the PCA is performed on the covariance matrix (i.e. on X^TX/n) when the sum of eigenvalues is eual to the sum of variances, i.e. trace(X^TX/n) = sum(eigenvalues). I wish this was more clearly stated on all the post one finds on PCA.
The eigenvalues are not the variance of the data. eigenvalues are the variances of the data in specific direction, defined by eigenvectors. The Variance of the data is the sum of the distance of all points to the mean value of the data. PC's are the characteristic of data and shows how the data is expanded in the space in specific directions. You should not confuse the variance of the data with eigenvalue (which shows the variance in the direction of the eigenvector).
quick answer by reverse engineering (I don't remember the details)
pc = PCA(Z, standardize=False, demean=True, normalize=False)
​
pc.eigenvals.sum() / 1000
2.7550787264061087
Z.var(0).sum()
2.7550787264061087
In the computation of the variance, the data is demeaned. If we don't demean, then we only get a uncentered quadratic product.
pc = PCA(Z, standardize=False, demean=False, normalize=False)
​
pc.eigenvals.sum(), pc.eigenvals.sum() / Z.shape[0]
(2756.1915877060546, 2.7561915877060548)
(Z**2).mean(0).sum()
2.7561915877060548

Batch Gradient Descent for Logistic Regression

I've been following Andrew Ng CSC229 machine learning course, and am now covering logistic regression. The goal is to maximize the log likelihood function and find the optimal values of theta to do so. The link to the lecture notes is: [http://cs229.stanford.edu/notes/cs229-notes1.ps][1] -pages 16-19. Now the code below was shown on the course homepage (in matlab though--I converted it to python).
I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for a introductory machine learning course). The data has two features which are two scores on two exams. The output is a 1 if the student received admission and 0 is the student did not receive admission. The have shown all of the code below. The following code causes the likelihood function to converge to maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. Using these values when I test out a training example like [1, 30.28671077, 43.89499752] which should give a value of 0 as output, I obtain 0.576 which makes no sense to me. If I test the hypothesis function with input [1, 10, 10] I obtain 0.515 which once again makes no sense. These values should correspond to a lower probability. This has me quite confused.
import numpy as np
import sig as s
def batchlogreg(X, y):
max_iterations = 800
alpha = 0.00001
(m,n) = np.shape(X)
X = np.insert(X, 0, 1, 1)
theta = np.array([0] * (n+1), 'float')
ll = np.array([0] * max_iterations, 'float')
for i in range(max_iterations):
hx = s.sigmoid(np.dot(X, theta))
d = y - hx
theta = theta + alpha*np.dot(np.transpose(X),d)
ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
return (theta, ll)
Note that the sigmoid function has:
sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
Since you get all probabilities above 0.5, this suggests that you never make X * theta negative, or that you do, but your learning rate is too small to make it matter.
for i in range(max_iterations):
hx = s.sigmoid(np.dot(X, theta)) # this will probably be > 0.5 initially
d = y - hx # then this will be "very" negative when y is 0
theta = theta + alpha*np.dot(np.transpose(X),d) # (1)
ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
The problem is most likely at (1). The dot product will be very negative, but your alpha is very small and will negate its effect. So theta will never decrease enough to properly handle correctly classifying labels that are 0.
Positive instances are then only barely correctly classified for the same reason: your algorithm does not discover a reasonable hypothesis under your number of iterations and learning rate.
Possible solution: increase alpha and / or the number of iterations, or use momentum.
It sounds like you could be confusing probabilities with assignments.
The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.
So, for the example you gave, the decisions would both be 1 (which means the model is wrong for the first example where it should be 0).
I came to the same question and found the reason.
Normalize X first or set a scale-comparable intercept like 50.
Otherwise contours of cost function are too "narrow". A big alpha makes it overshoot and a small alpha fails to progress.

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1,4,6 were the primary features where as cluster k2's primary features were 2,5,7.
This is the basic setup of what I am using:
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use
Principle Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say, we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresp. vectors. The second value belongs to the first principle component as it explains 50 % off the overall variance and the last value belongs to the second principle component explaining 25 % of the overall variance.
the eigenvectors are the component's linear combinations. The give the weights for the features so that you can know, which feature as high/low impact.
use PCA based on correlation matrix instead of empiric covariance matrix, if the eigenvalues strongly differ (magnitudes).
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute empiric covariance matrix (e.g. np.cov) or correlation (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might want to check also the scikit decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
r"""
Performs principal component analysis.
Parameters
----------
X : array, (n, d)
Original observations (n observations, d features)
d : int
Number of principal components (default is ``0`` => all components).
corr : bool
If true, the PCA is performed based on the correlation matrix.
Notes
-----
Always all eigenvalues and eigenvectors are returned,
independently of the desired number of components ``d``.
Returns
-------
Xred : array, (n, m or d)
Reduced data matrix
e_values : array, (m)
The eigenvalues, sorted in descending manner.
e_vectors : array, (n, m)
The eigenvectors, sorted corresponding to eigenvalues.
"""
# Center to average
X_ = X-X.mean(0)
# Compute correlation / covarianz matrix
if corr:
CO = np.corrcoef(X_.T)
else:
CO = np.cov(X_.T)
# Compute eigenvalues and eigenvectors
e_values, e_vectors = sp.linalg.eigh(CO)
# Sort the eigenvalues and the eigenvectors descending
idx = np.argsort(e_values)[::-1]
e_vectors = e_vectors[:, idx]
e_values = e_values[idx]
# Get the number of desired dimensions
d_e_vecs = e_vectors
if d > 0:
d_e_vecs = e_vectors[:, :d]
else:
d = None
# Map principal components to original data
LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script, which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run. This is due to the starting clusters a initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component. All observations yield rather high values. What does it mean? Now looking at the linear combination of the first component we see on first sight, that the second feature is rather unimportant (for this component). The first and fourth feature are the highest weighted and the third one has a negative score. This means, that - as all red vertices have a rather high score on the first PC - these vertices will have high values in the first and last feature, while at the same time they have low scores concerning the third feature.
Concerning the second feature we can have a look at the second PC. However, note that the overall impact is far smaller as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this,
estimator=KMeans()
estimator.fit(X)
res=estimator.__dict__
print res['cluster_centers_']
You will get matrix of cluster and feature_weights, from that you can conclude, the feature having more weight takes major part to contribute cluster.
I assume that by saying "a primary feature" you mean - had the biggest impact on the class. A nice exploration you can do is look at the coordinates of the cluster centers . For example, plot for each feature it's coordinate in each of the K centers.
Of course that any features that are on large scale will have much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
a method I came up with is calculating the standard deviation of each feature in relation to the distribution - basically how is the data is spread across each feature
the lesser the spread, the better the feature of each cluster basically:
1 - (std(x) / (max(x) - min(x))
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follow. Note that the Euclidean distance between two cluster centers is a sum of square difference between individual features. We can then just use the square difference as the weight for each feature.

Statsmodels: Calculate fitted values and R squared

I am running a regression as follows (df is a pandas dataframe):
import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()
Which gave me, among others, an R-squared of 0.942. So then I wanted to plot the original y-values and the fitted values. For this, I sorted the original values:
orig = df['p'].values
fitted = est.fittedvalues.values
args = np.argsort(orig)
import matplotlib.pyplot as plt
plt.plot(orig[args], 'bo')
plt.plot(orig[args]-resid[args], 'ro')
plt.show()
This, however, gave me a graph where the values were completely off. Nothing that would suggest an R-squared of 0.9. Therefore, I tried to calculate it manually myself:
yBar = df['p'].mean()
SSTot = df['p'].apply(lambda x: (x-yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()
1 - SSReg/SSTot
Out[79]: 0.2618159806908984
Am I doing something wrong? Or is there a reason why my computation is so far off what statsmodels is getting? SSTot, SSReg have values of 48084, 35495.
If you do not include an intercept (constant explanatory variable) in your model, statsmodels computes R-squared based on un-centred total sum of squares, ie.
tss = (ys ** 2).sum() # un-centred total sum of squares
as opposed to
tss = ((ys - ys.mean())**2).sum() # centred total sum of squares
as a result, R-squared would be much higher.
This is mathematically correct. Because, R-squared should indicate how much of the variation is explained by the full-model comparing to the reduced model. If you define your model as:
ys = beta1 . xs + beta0 + noise
then the reduced model can be: ys = beta0 + noise, where the estimate for beta0 is the sample average, thus we have: noise = ys - ys.mean(). That is where de-meaning comes from in a model with intercept.
But from a model like:
ys = beta . xs + noise
you may only reduce to: ys = noise. Since noise is assumed zero-mean, you may not de-mean ys. Therefore, unexplained variation in the reduced model is the un-centred total sum of squares.
This is documented here under rsquared item. Set yBar equal to zero, and I would expect you will get the same number.
If your model is:
a = <yourmodel>.fit()
Then, to compute fitted values:
a.fittedvalues
and to compute R squared:
a.rsquared

Categories

Resources