PCA on word2vec embeddings - python

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf
Specifically this part:
To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.
I am using the same set of word vectors as the authors (Google News Corpus, 300 dimensions), which I load into word2vec.
The 'ten gender pair difference vectors' the authors refer to are computed from the following word pairs:
I've computed the differences between each normalized vector in the following way:
import gensim
import numpy as np

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims()

pairs = [('she', 'he'),
         ('her', 'his'),
         ('woman', 'man'),
         ('Mary', 'John'),
         ('herself', 'himself'),
         ('daughter', 'son'),
         ('mother', 'father'),
         ('gal', 'guy'),
         ('girl', 'boy'),
         ('female', 'male')]

difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True) for a in pairs])
I then perform PCA on the resulting matrix, with 10 components, as per the paper:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(difference_matrix)
However I get very different results when I look at pca.explained_variance_ratio_ :
array([ 2.83391436e-01, 2.48616155e-01, 1.90642492e-01,
9.98411858e-02, 5.61260498e-02, 5.29706681e-02,
2.75670634e-02, 2.21957722e-02, 1.86491774e-02,
1.99108478e-32])
or with a chart:
The first component accounts for less than 30% of the variance when it should be above 60%!
The results I get are similar to what I get when I try to do the PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.
Note: I've tried without normalizing the vectors, but I get the same results.

They released the code for the paper on github: https://github.com/tolga-b/debiaswe
Specifically, you can see their code for creating the PCA plot in this file.
Here is the relevant snippet of code from that file:
def doPCA(pairs, embedding, num_components = 10):
    matrix = []
    for a, b in pairs:
        center = (embedding.v(a) + embedding.v(b))/2
        matrix.append(embedding.v(a) - center)
        matrix.append(embedding.v(b) - center)
    matrix = np.array(matrix)
    pca = PCA(n_components = num_components)
    pca.fit(matrix)
    # bar(range(num_components), pca.explained_variance_ratio_)
    return pca
Based on the code, it looks like they are taking the difference between each word in a pair and the pair's average vector. To me, it's not clear this is what they meant in the paper. However, I ran this code with their pairs and was able to recreate the graph from the paper:

To expand on oregano's answer:
For each pair, a and b, they calculate the center, c = (a + b) / 2 and then include vectors pointing in both directions, a - c and b - c.
The reason this is critical is that PCA gives you the vector along which the most variance occurs. All of your vectors point in the same direction, so there is very little variance in precisely the direction you are trying to reveal.
Their set includes vectors pointing in both directions in the gender subspace, so PCA clearly reveals gender variation.
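To make this concrete, here is a minimal sketch of the centered construction applied to the question's setup (reusing the model and pairs defined there; this is my adaptation rather than the authors' exact script, and the word_vec call may need adjusting for your gensim version):

import numpy as np
from sklearn.decomposition import PCA

matrix = []
for a, b in pairs:
    va = model.word_vec(a, use_norm=True)
    vb = model.word_vec(b, use_norm=True)
    center = (va + vb) / 2
    matrix.append(va - center)  # points one way along the gender direction
    matrix.append(vb - center)  # and the opposite way
matrix = np.array(matrix)       # shape (20, 300): two rows per pair

pca = PCA(n_components=10)
pca.fit(matrix)
print(pca.explained_variance_ratio_)  # the first component should now dominate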

Related

PCA from scratch and Sklearn PCA giving different output

I am trying to implement PCA from scratch. Following is the code:
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #standardization
X_new = sc.fit_transform(X)
Z = np.divide(np.dot(X_new.T,X_new),X_new.shape[0]) #covariance matrix
eig_values, eig_vectors = np.linalg.eig(Z) #eigen vectors calculation
eigval_sorted = np.sort(eig_values)[::-1]
ev_index = np.argsort(eigval_sorted)[::-1]
pc = eig_vectors[:,ev_index] #eigen vectors sorted on the basis of eigen values
W = pc[:,0:2] #extracting 2 components
print(W)
and getting the following components:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
When I use the sklearn's PCA I get the following two components:
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105],
[ 0.37231836, 0.92555649, 0.02109478, 0.06541577]])
Projection onto new feature space gives following different figures:
Where am I doing it wrong and what can be done to resolve the problem?
The result of a PCA are technically not n vectors, but a subspace of dimension n. This subspace is represented by n vectors that span that subspace.
In your case, while the vectors are different, the spanned subspace is the same, so the result of the PCA is the same.
If you want to align your solution perfectly with the sklearn solution, you need to normalise your solution in the same way. Apparently sklearn prefers positive values over negative values? You'd need to dig into their documentation.
edit:
Yes, of course, what I wrote is wrong. The algorithm itself returns ordered orthonormal basis vectors: vectors of length one, orthogonal to each other, and ordered by their 'importance' to the dataset. So it gives much more information than just the subspace.
However, if v, w, u are a solution of the PCA, then so are ±v, ±w, ±u.
edit: It seems that np.linalg.eig has no mechanism to guarantee that it will always return the same set of eigenvectors representing an eigenspace; see also here:
NumPy linalg.eig
So a new version of numpy, or just how the stars are aligned today, can change your result, although for a PCA it should only vary in the signs (+/-).
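If you want a deterministic sign, one common convention (a sketch; I'm not claiming this is exactly what sklearn does internally) is to flip each eigenvector so that its largest-magnitude entry is positive:

import numpy as np

def fix_signs(vectors):
    # One eigenvector per column, as returned by np.linalg.eig.
    # Flip each column so its largest-magnitude entry is positive,
    # removing the arbitrary +/- ambiguity.
    max_abs_rows = np.argmax(np.abs(vectors), axis=0)
    signs = np.sign(vectors[max_abs_rows, np.arange(vectors.shape[1])])
    return vectors * signs

W_fixed = fix_signs(W)  # W from the question: the first two eigenvectors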

compare data with ideal values for game scoring

I am calculating a score based on the cosine similarity between an array of ideal values and an array of collected data (code below).
However, when I run the following code, the result is 99.4, which I think is weird because 150 is very different from the ideal value 300.
import numpy as np

def cos_sim(speechrate, pitch): #speechrate and pitch are the data collected
    v1 = np.array([300, 25]) #array of ideal values
    v2 = np.array([speechrate, pitch]) #array of data
    similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print("{:.1f}".format(similarity*100))

cos_sim(150, 23)
Does anyone have any idea how to calculate the score based on the difference of the values? (It doesn't necessarily have to use cosine similarity.)
Your formula for similarity calculates the cosine similarity between the vectors (300, 25) and (150, 23), or in other words measures the cosine of the angle between them.
If you plot the two vectors, you will see that there isn't much of an angle between them.
In fact, the angle is only about 4 degrees, which is not much different from 0 degrees, where cos takes its highest value of 1.
The metric you use here should depend on your definition of similarity. A trivial metric you can use is the Euclidean distance between the two points.
Euclidean distance between these two points is d = 150.01. And for instance between (300, 25) and (280,23) is d = 20.09 which gives you an idea about how separated they are in a 2D plane.
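If you want a 0-100 score based on that distance, one simple possibility (a sketch; the scaling constants are arbitrary choices you would tune for your game) is to normalise each feature by a rough expected range and map the resulting distance to a score:

import numpy as np

def distance_score(speechrate, pitch, ideal=(300, 25), ranges=(300, 25)):
    # ranges are hypothetical scales for each feature; dividing by them
    # keeps one feature from dominating the distance.
    v_ideal = np.array(ideal, dtype=float)
    v_data = np.array([speechrate, pitch], dtype=float)
    scaled_diff = (v_data - v_ideal) / np.array(ranges, dtype=float)
    dist = np.linalg.norm(scaled_diff)   # 0 means a perfect match
    return 100 * max(0.0, 1.0 - dist)    # clip so the score stays >= 0

print("{:.1f}".format(distance_score(150, 23)))  # noticeably lower than 99.4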

Sklearn LDA-analysis won't generate 2 dimensions

I'm trying to plot a 3-feature dataset with a binary classification on a matplotlib plot. This worked with an example dataset provided in a guide (http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/) but when I try to instead insert my own dataset, the LinearDiscriminantAnalysis will only output a one-dimensional series, no matter what number I put in "n_components". Why would this not work with my own code?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

Data = pd.read_csv("DataFrame.csv", sep=";")
x = Data.iloc[:, [3, 5, 7]]
y = Data.iloc[:, 8]
lda = LDA(n_components=2)
lda_transformed = pd.DataFrame(lda.fit_transform(x, y))
plt.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], label='Loss', c='red')
plt.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], label='Win', c='blue')
plt.legend()
plt.show()
In the case when the number of different class labels, C, is less than the number of observations (almost always), then linear discriminant analysis will always produce C - 1 discriminating components. Using n_components from the sklearn API is only a means to choose possibly fewer components, e.g. in the case when you know what dimensionality you'd like to reduce down to. But you could never use n_components to get more components.
This is discussed in the Wikipedia section on Multiclass LDA. The definition of the between-class scatter is given as
\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T
which is the empirical covariance matrix among the population of class means. By definition, such a covariance matrix has rank at most C - 1.
... the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C − 1 largest eigenvalues ...
So because LDA uses a decomposition of the class mean covariance matrix, it means the dimensionality reduction it can provide is based on the number of class labels, and not on the sample size nor the feature dimensionality.
In the example you linked, it doesn't matter how many features there are. The point is that the example uses 3 simulated cluster centers, so there are 3 class labels. This means linear discriminant analysis could produce projection of the data onto either 1-dimensional or 2-dimensional discriminating subspaces.
But in your data, you start out with only 2 class labels, a binary problem. This means the dimensionality of the linear discriminant model can be at most 1-dimensional, literally a line that forms the decision boundary between the two classes. Dimensionality reduction with LDA in this case would simply be the projection of data points onto a particular normal vector of that separating line.
If you want to specifically reduce down to two dimensions, you can try many of the other algorithms that sklearn provides: t-SNE, ISOMAP, PCA and kernel PCA, random projection, multi-dimensional scaling, among others. Many of these allow you to choose the dimensionality of the projected space, up to the original feature dimensionality, or sometimes you can even project into larger spaces, like with kernel PCA.
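To illustrate both points, here is a minimal sketch (assuming the x and y from the question) showing that LDA can only produce a single discriminant for a binary problem, while PCA will happily give you two dimensions to plot:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

lda = LDA(n_components=1)        # at most C - 1 = 1 component for 2 classes
lda_1d = lda.fit_transform(x, y)
print(lda_1d.shape)              # (n_samples, 1)

pca = PCA(n_components=2)        # unsupervised alternative for a 2D plot
pca_2d = pca.fit_transform(x)
print(pca_2d.shape)              # (n_samples, 2)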
In the example you linked, dimension reduction by LDA goes from 13 features to 2 because there are 3 classes, but in your case it goes from 3 features to 1 (even though you wanted 2 features), so it is not possible to plot in 2D.
If you really want to select 2 features out of 3, you can use feature_selection.SelectKBest to choose 2 best features and there won't be any problems plotting in 2D.
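For reference, a minimal sketch of that feature-selection route (assuming the x and y from the question; using f_classif as the scoring function is my assumption, not something stated above):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=2)
x_2d = selector.fit_transform(x, y)   # keeps the 2 highest-scoring features
print(selector.get_support())         # boolean mask of the chosen columns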
For more information, please read this fantastic answer for PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
This is probably because the sklearn implementation won't allow it if you only have 2 classes. The problem has been discussed here: https://github.com/scikit-learn/scikit-learn/issues/1967.

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
    pca = PCA(n_components=comp)
    pca.fit(X)
    PCA(copy=True, n_components=comp, whiten=False)
    Xpca = pca.fit_transform(X)
    return Xpca

for n_comp in range(10,1000,20):
    Xpca = mypca(X,n_comp) # X is a 2 dimensional array
    print Xpca
I am calling the mypca function from a loop with different values for comp, in order to find the best value of comp for the problem I am trying to solve. But mypca always returns the same Xpca, irrespective of the value of comp.
The value it returns is correct for the first value of comp sent from the loop, i.e. the Xpca it returns each time is the one for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
    Xpca = mypca(X,n_comp) # X is a 2 dimensional array
    print Xpca
Your input dataset X is only a 2 dimensional array; the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
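For example, to keep 95% of the variance you could then pick the smallest k whose cumulative ratio crosses that threshold (a short sketch continuing from the r2 and y_all arrays above):

import numpy as np

# smallest number of components whose cumulative explained variance >= 95%
k = int(np.searchsorted(r2, 0.95)) + 1
k = min(k, y_all.shape[1])   # guard in case the threshold is never reached
y = y_all[:, 0:k]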
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
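Here is a hedged sketch of that cross-validation route, using a pipeline so that each candidate n_components is evaluated consistently (the classifier and the candidate values are placeholders, and labels stands for whatever target your downstream task uses):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('pca', PCA()),
    ('clf', LogisticRegression(max_iter=1000)),    # placeholder downstream model
])
param_grid = {'pca__n_components': [2, 5, 10, 20, 50]}  # hypothetical candidates

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x, labels)          # labels: your (assumed) target variable
print(search.best_params_)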
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1,4,6 were the primary features where as cluster k2's primary features were 2,5,7.
This is the basic setup of what I am using:
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use
Principal Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding vectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, which explains 25 % of the overall variance.
the eigenvectors are the components' linear combinations. They give the weights for the features, so you can tell which features have high or low impact.
use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues strongly differ in magnitude.
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute empiric covariance matrix (e.g. np.cov) or correlation (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might also want to check the scikit-learn decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)
    d : int
        Number of principal components (default is ``0`` => all components).
    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    Always all eigenvalues and eigenvectors are returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix
    e_values : array, (m)
        The eigenvalues, sorted in descending manner.
    e_vectors : array, (n, m)
        The eigenvectors, sorted corresponding to eigenvalues.
    """
    # Center to average
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)
    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Get the number of desired dimensions
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    else:
        d = None
    # Map principal components to original data
    LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
    return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
                                color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component: all of its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first sight that the second feature is rather unimportant (for this component). The first and fourth features are weighted highest and the third one has a negative score. This means that, since all red vertices have a rather high score on the first PC, these vertices will have high values in the first and last feature, while at the same time having low values for the third feature.
Concerning the second feature, we can have a look at the second PC. However, note that the overall impact is far smaller, as this component explains only roughly 16 % of the variance, compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this,
estimator=KMeans()
estimator.fit(X)
res=estimator.__dict__
print res['cluster_centers_']
You will get a matrix of cluster centers and feature weights; from that you can conclude that the features with larger weights contribute most to a cluster.
I assume that by saying "a primary feature" you mean the one that had the biggest impact on the class. A nice exploration you can do is to look at the coordinates of the cluster centers. For example, plot for each feature its coordinate in each of the K centers, as in the sketch below.
Of course, any features that are on a large scale will have a much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
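A minimal sketch of that idea (assuming the k_means and data_features from the question, with the features already scaled; the grouped-bar layout is just one way to show it):

import numpy as np
import matplotlib.pyplot as plt

centers = k_means.cluster_centers_      # shape (n_clusters, n_features)
n_clusters, n_features = centers.shape

fig, ax = plt.subplots()
width = 0.8 / n_clusters
for i in range(n_clusters):
    # one group of bars per cluster, one bar per feature
    ax.bar(np.arange(n_features) + i * width, centers[i],
           width=width, label='Cluster {}'.format(i))
ax.set_xlabel('Feature index')
ax.set_ylabel('Center coordinate')
ax.legend()
plt.show()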
A method I came up with is calculating the standard deviation of each feature in relation to its range, i.e. basically measuring how the data is spread across each feature within a cluster.
The smaller the spread, the more characteristic the feature is of that cluster, basically:
1 - (std(x) / (max(x) - min(x)))
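A minimal sketch of that measure per cluster and feature, as I read it (my own interpretation, not the code from the links below):

import numpy as np

def feature_spread_scores(data, labels):
    # For each cluster: 1 - std/(max - min) per feature.
    # Values near 1 mean the cluster is tightly concentrated on that feature,
    # which by this heuristic makes the feature more characteristic of it.
    scores = {}
    for k in np.unique(labels):
        x = data[labels == k].astype(float)
        rng = x.max(axis=0) - x.min(axis=0)
        rng[rng == 0] = 1.0               # avoid division by zero
        scores[k] = 1 - x.std(axis=0) / rng
    return scores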
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follows. Note that the Euclidean distance between two cluster centers is the sum of the squared differences between the individual features. We can then just use the squared difference as the weight for each feature.
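A minimal sketch of that idea for a pair of cluster centers (normalizing the squared differences so the weights sum to 1 is my own addition):

import numpy as np

def feature_weights(center_a, center_b):
    # Per-feature squared difference between two cluster centers.
    # The squared Euclidean distance is the sum of these terms, so each term
    # tells you how much a feature contributes to separating the two clusters.
    sq_diff = (np.asarray(center_a, dtype=float) - np.asarray(center_b, dtype=float)) ** 2
    return sq_diff / sq_diff.sum()

# example with the fitted estimator from the question:
# w = feature_weights(k_means.cluster_centers_[0], k_means.cluster_centers_[1])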
