compare data with ideal values for game scoring - python

I am calculating a score based on the cosine similarity between an array of ideal values and an array of collected data (code below).
However, when I run the following code, the result is 99.4, which I find odd because 150 is very different from the ideal value 300.
import numpy as np

def cos_sim(speechrate, pitch):  # speechrate and pitch are the data collected
    v1 = np.array([300, 25])            # array of ideal values
    v2 = np.array([speechrate, pitch])  # array of collected data
    similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print("{:.1f}".format(similarity * 100))

cos_sim(150, 23)
Does anyone have any idea how to calculate the score based on the difference of the values? (not necessarily must use cosine similarity)

Your formula calculates the cosine similarity between the vectors (300, 25) and (150, 23); in other words, it measures the cosine of the angle between them.
If you plot the two vectors, you'll see there isn't much angle between them.
In fact, the angle is only about 4 degrees, which is not much different from 0 degrees, where cos has its highest value of 1.
The metric you use here should depend on your definition of similarity. A simple metric you can use is the Euclidean distance between the two points.
The Euclidean distance between these two points is d = 150.01, while between (300, 25) and (280, 23) it is d = 20.10, which gives you an idea of how far apart the points are in the 2D plane.
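For example, here is a minimal sketch (not from the answer above) that turns the Euclidean distance into a 0-100 score; the ideal values match the question, but the max_distance cutoff of 300 is an arbitrary assumption you would tune for your game:
import numpy as np

def distance_score(speechrate, pitch, max_distance=300.0):
    ideal = np.array([300.0, 25.0])           # ideal values from the question
    observed = np.array([speechrate, pitch])  # collected data
    d = np.linalg.norm(ideal - observed)      # Euclidean distance
    # Map distance to a score: 100 at distance 0, 0 at max_distance or beyond
    return max(0.0, 1.0 - d / max_distance) * 100

print("{:.1f}".format(distance_score(150, 23)))  # 50.0
print("{:.1f}".format(distance_score(280, 23)))  # 93.3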

Related

Classifying a sample into a cluster by distance from its centroid

I have data labelled into two clusters (e.g. positive, negative). I will get new sample data, and based on the shortest distance from the centroids of the two clusters I want to classify each new sample as positive or negative. For this I could find examples of KMeans. My positive sample data looks like this:
x1 = np.array([ 0.170755, ...... 0.313704, 0.10206 ]) # 56 dimensions
x2 = np.array([-0.371852, ...... 0.255809, 0.475981])
.... x28
Now I am trying to calculate the centroid of my cluster using the example mentioned in the link above.
X = np.array(list(zip(x1, x2, x3, ..., x28))).reshape(len(x1), 28)
kmeans_model = KMeans(n_clusters=1).fit(X)
Since I know this data belongs to one cluster, I set n_clusters=1, but when I print the centroid via kmeans_model.cluster_centers_ it gives me an array like [[0.02490224, 0.12898346]], whereas I am expecting an array with the same dimension as x1. Am I calculating the centroid correctly, or did my basic understanding go wrong?
In that case, how will I be able to calculate the distance between that centroid and a new sample that has the same shape as x1?
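A rough sketch of the intended workflow, with random placeholder arrays standing in for the real 56-dimensional samples: stacking the samples as rows gives a (28, 56) matrix, the centroid of a single cluster is simply the column-wise mean (so it keeps the 56 dimensions of x1), and a new sample can be classified by comparing its distance to each centroid:
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: 28 positive and 28 negative samples, one 56-dim sample per row
X_pos = rng.normal(loc=0.2, size=(28, 56))
X_neg = rng.normal(loc=-0.2, size=(28, 56))

# Column-wise mean = centroid; same dimension (56) as each individual sample
centroid_pos = X_pos.mean(axis=0)
centroid_neg = X_neg.mean(axis=0)

new_sample = rng.normal(loc=0.2, size=56)
dist_pos = np.linalg.norm(new_sample - centroid_pos)
dist_neg = np.linalg.norm(new_sample - centroid_neg)
print('positive' if dist_pos <= dist_neg else 'negative')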

Calculation of Mahalanobis distance doesn't work when using scipy's cdist function

I want to calculate the Mahalanobis distance between every row of a matrix and a single row vector. Originally, I used np.tile to build a matrix containing multiple copies of the vector, so that both matrices have the same shape, and then used sklearn.metrics.pairwise.paired_distances to calculate the distances row-wise.
I wanted to know if there was a more readable alternative to this, which is why I asked this question a few weeks ago. User Divakar came up with the idea of using cdist(a,b).ravel(). I found this alternative more readable and therefore used it. I tested both functions with metric='euclidean' and found that they produce the same result. However, when I switch the metric to metric='mahalanobis' (which I want to use now), the first method seems to work but the cdist alternative throws the following error:
ValueError: The number of observations (3) is too small; the
covariance matrix is singular. For observations with 3 dimensions, at
least 4 observations are required.
Can someone explain what is going on here? Why do the functions give the same results when using euclidean distance but not when using Mahalanobis distance?
Here's my code:
import numpy as np
from sklearn.metrics.pairwise import paired_distances
from scipy.spatial.distance import cdist

# get matrix a and vector b
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7], [8], [9]]).transpose()

# define metric
METRIC = 'mahalanobis'

# define a distance function that computes the distance between each row of a
# matrix and another matrix that contains duplicates of the single input vector
def vector_to_matrix_distance(matrix_array, vector_array, metric):
    vector_array_tiled = np.tile(vector_array, (matrix_array.shape[0], 1))
    return paired_distances(matrix_array, vector_array_tiled)

# run distance function
distances_1 = vector_to_matrix_distance(matrix_array=a,
                                        vector_array=b,
                                        metric=METRIC)

# Use cdist to do the SAME calculation. This works when METRIC = 'euclidean' but
# not when METRIC = 'mahalanobis'
distances_2 = cdist(a, b, metric=METRIC).ravel()
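As a small, self-contained sketch (not part of the question above): scipy's cdist accepts an inverse covariance matrix explicitly through the VI keyword, which is one way to call the Mahalanobis metric when there are too few rows to estimate a covariance matrix from the data. Here the identity matrix is used purely for illustration, which makes the result coincide with Euclidean distance:
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9]])

VI = np.eye(a.shape[1])  # illustrative inverse covariance matrix (identity)
distances = cdist(a, b, metric='mahalanobis', VI=VI).ravel()
print(distances)  # identical to cdist(a, b, metric='euclidean').ravel()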

PCA on word2vec embeddings

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf
Specifically this part:
To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.
I am using the same set of word vectors as the authors (Google News Corpus, 300 dimensions), which I load into word2vec.
The 'ten gender pair difference vectors' the authors refer to are computed from the following word pairs:
I've computed the differences between each normalized vector in the following way:
import gensim
import numpy as np

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims()

pairs = [('she', 'he'),
         ('her', 'his'),
         ('woman', 'man'),
         ('Mary', 'John'),
         ('herself', 'himself'),
         ('daughter', 'son'),
         ('mother', 'father'),
         ('gal', 'guy'),
         ('girl', 'boy'),
         ('female', 'male')]

difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True) for a in pairs])
I then perform PCA on the resulting matrix, with 10 components, as per the paper:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(difference_matrix)
However I get very different results when I look at pca.explained_variance_ratio_ :
array([ 2.83391436e-01, 2.48616155e-01, 1.90642492e-01,
9.98411858e-02, 5.61260498e-02, 5.29706681e-02,
2.75670634e-02, 2.21957722e-02, 1.86491774e-02,
1.99108478e-32])
Plotted as a chart, the pattern is the same: the first component accounts for less than 30% of the variance, when it should be above 60%!
The results I get are similar to what I get when I try to do the PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.
Note: I've tried without normalizing the vectors, but I get the same results.
They released the code for the paper on github: https://github.com/tolga-b/debiaswe
Specifically, you can see their code for creating the PCA plot in this file.
Here is the relevant snippet of code from that file:
def doPCA(pairs, embedding, num_components=10):
    matrix = []
    for a, b in pairs:
        center = (embedding.v(a) + embedding.v(b)) / 2
        matrix.append(embedding.v(a) - center)
        matrix.append(embedding.v(b) - center)
    matrix = np.array(matrix)
    pca = PCA(n_components=num_components)
    pca.fit(matrix)
    # bar(range(num_components), pca.explained_variance_ratio_)
    return pca
Based on the code, it looks like they are taking the difference between each word in a pair and the average vector of the pair. To me, it's not clear that this is what they meant in the paper. However, I ran this code with their pairs and was able to recreate the graph from the paper.
To expand on oregano's answer:
For each pair, a and b, they calculate the center, c = (a + b) / 2 and then include vectors pointing in both directions, a - c and b - c.
The reason this is critical is that PCA gives you the vector along which the most variance occurs. All of your vectors point in the same direction, so there is very little variance in precisely the direction you are trying to reveal.
Their set includes vectors pointing in both directions in the gender subspace, so PCA clearly reveals gender variation.
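Here is a sketch that applies the same center-based construction to the gensim setup from the question (it assumes model and pairs are already defined as above, so it is not self-contained):
import numpy as np
from sklearn.decomposition import PCA

matrix = []
for a, b in pairs:
    va = model.word_vec(a, use_norm=True)
    vb = model.word_vec(b, use_norm=True)
    center = (va + vb) / 2
    matrix.append(va - center)  # one vector pointing each way around the center,
    matrix.append(vb - center)  # so the gender direction carries the variance

pca = PCA(n_components=10)
pca.fit(np.array(matrix))
print(pca.explained_variance_ratio_)  # the first component should now dominate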

Using K-means with cosine similarity - Python

I am trying to implement the K-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance as the metric forces me to change the average function (the average consistent with cosine distance must be an element-by-element average of the normalized vectors).
I have seen this elegant solution for manually overriding the distance function of sklearn, and I want to use the same technique to override the averaging section of the code, but I couldn't find it.
Does anyone know how it can be done?
How critical is it that the distance metric doesn't satisfy the triangle inequality?
If anyone knows of a different efficient implementation of k-means that uses a cosine metric, or accepts custom distance and averaging functions, it would also be really helpful.
Thank you very much!
Edit:
After switching to the angular distance instead of cosine distance, the code looks something like this:
def KMeans_cosine_fit(sparse_data, nclust=10, njobs=-1, randomstate=None):
    # Manually override the euclidean distance used internally by sklearn
    def euc_dist(X, Y=None, Y_norm_squared=None, squared=False):
        # return pairwise_distances(X, Y, metric='cosine', n_jobs=10)
        return np.arccos(cosine_similarity(X, Y)) / np.pi
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=njobs, random_state=randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans
I noticed (by working through the math) that if the vectors are normalized, the standard average works well for the angular metric. As far as I understand, I would have to change _mini_batch_step() in k_means_.py, but the function is pretty complicated and I couldn't work out how to do it.
Does anyone know of an alternative solution?
Or does anyone know how I can replace this function with one that always forces the centroids to be normalized?
So it turns out you can just normalise X to unit length and use K-means as normal. The reason is that if X1 and X2 are unit vectors, then

||X1 - X2||^2 = ||X1||^2 + ||X2||^2 - 2 * X1 . X2 = 2 * (1 - cos(X1, X2))

where the term inside the brackets in the last expression is exactly the cosine distance, so minimising the squared Euclidean distance minimises the cosine distance as well.
So in terms of using k-means, simply do:
length = np.sqrt((X**2).sum(axis=1))[:,None]
X = X / length
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
And if you need the centroids and distance matrix do:
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:,None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T) # K x N matrix of cosine distances
Notes:
Just realised that you are trying to minimise the distance between the mean vector of the cluster and its constituents. The mean vector has a length of less than one when you simply average the unit vectors. But in practice, it's still worth running the normal sklearn algorithm and checking the length of the mean vectors. In my case the mean vectors were close to unit length (averaging around 0.9, but this depends on how dense your data is).
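A quick numerical check of the identity in this answer (unit vectors only; the example data here is random):
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.random(5)
x2 = rng.random(5)
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

squared_euclidean = np.sum((x1 - x2) ** 2)
cosine_distance = 1 - np.dot(x1, x2)
print(np.isclose(squared_euclidean, 2 * cosine_distance))  # True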
TLDR: Use the spherecluster package as #σηγ pointed out.
You can normalize your data and then use KMeans.
from sklearn import preprocessing
from sklearn.cluster import KMeans
kmeans = KMeans().fit(preprocessing.normalize(X))
Unfortunately, no.
sklearn's current implementation of k-means only uses Euclidean distance.
The reason is that k-means alternates between computing cluster centers and assigning each sample to the closest center, and the arithmetic mean is only a meaningful center with respect to Euclidean distance.
If you want to use k-means with cosine distance, you need to write your own function or class, or try another clustering algorithm, such as DBSCAN, that accepts an arbitrary distance metric.
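For completeness, a minimal sketch of that DBSCAN alternative; the eps and min_samples values here are placeholders that would need tuning for real data:
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((100, 20))

labels = DBSCAN(eps=0.3, min_samples=5, metric='cosine').fit_predict(X)
print(np.unique(labels))  # -1 marks points treated as noise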

Calculating the Angle Between Vectors by using a vector as a reference point

I have been trying to find a fast algorithm for calculating all the angles between n vectors of dimension x. For example, if x=3 and n=4, my data would look something like this:
A: [1,2,3]
B: [2,3,4]
C: [...]
D: [...]
I was wondering whether it is acceptable to find the angle between each of the vectors (A, B, C, D) and some fixed reference vector (i.e. X: [100,100,100,100]), and then subtract those angles from one another to find the angle between any pair of them. I want to do this because I would only have to compute n angles, and could then obtain the angle between any two of my vectors by subtraction. In short, I want to know: is it safe to make this assumption?
angle_between(A,B) == angle_between(A,X) - angle_between(B,X)
where the angle_between function is based on cosine similarity.
That approach will only work for 2-D vectors. In higher dimensions, any two vectors define a plane, and only if the third (reference) vector also lies within that plane will your approach work. Unfortunately, instead of only calculating n angles and subtracting, you have to calculate all n choose 2 of them to determine the angle between each pair of vectors.
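A small sketch of that pairwise computation, using normalized dot products to get all n choose 2 angles (the last two rows are placeholder values for C and D, which are elided in the question):
import numpy as np
from itertools import combinations

vectors = np.array([[1, 2, 3],
                    [2, 3, 4],
                    [1, 0, 1],
                    [3, 1, 2]], dtype=float)

unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
for i, j in combinations(range(len(unit)), 2):
    cos_ij = np.clip(unit[i] @ unit[j], -1.0, 1.0)  # clip guards against rounding error
    print(i, j, np.degrees(np.arccos(cos_ij)))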
