Semantically weighted mean of word embeddings - python

Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
All the stuff I found is on just finding the mean vector (which is quite trivial of course) which represents the average meaning of the list OR some kind of weighted average of words for document representation, however that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!

Actually averaging of word vectors can be done in two ways
Mean of word vectors without tfidf weights.
Mean of Word vectors multiplied with tfidf weights.
This will solve your problem of word importance.

Related

Text similarity using Word2Vec

I would like to use Word2Vec to check similarity of texts.
I am currently using another logic:
from fuzzywuzzy import fuzz
def sim(name, dataset):
matches = dataset.apply(lambda row: ((fuzz.ratio(row['Text'], name) ) = 0.5), axis=1)
return
(name is my column).
For applying this function I do the following:
df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)
Could you please tell me how to replace fuzzy.ratio with Word2Vec in order to compare texts in a dataset?
Example of dataset:
Text
Hello, this is Peter, what would you need me to help you with today?
I need you
Good Morning, John here, are you calling regarding your cell phone bill?
Hi, this this is John. What can I do for you?
...
The first text and the last one are quite similar, although they have different words to express similar concept.
I would like to create a new column where to put, for each row, text that are similar.
I hope you can help me.
TLDR; skip to the last section (part 4.) for code implementation
1. Fuzzy vs Word embeddings
Unlike a fuzzy match, which is basically edit distance or levenshtein distance to match strings at alphabet level, word2vec (and other models such as fasttext and GloVe) represent each word in a n-dimensional euclidean space. The vector that represents each word is called a word vector or word embedding.
These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed up to create a representation of the sentence's embedding. Sentences with word with similar semantics will have similar vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.
Let's say I have a sentence with 2 words. Word2Vec will represent each word here as a vector in some euclidean space. Summing them up, just like standard vector addition will result in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.
NOTE: There are other methods of combining word embeddings such as a weighted sum with tf-idf weights OR just directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
2. Similarity between word vectors / sentence vectors
“You shall know a word by the company it keeps”
Words that occur with words (context) are usually similar in semantics/meaning. The great thing about word2vec is that words vectors for words with similar context lie closer to each other in the euclidean space. This lets you do stuff like clustering or just simple distance calculations.
A good way to find how similar 2 words vectors is cosine-similarity. Read more here.
3. Pre-trained word2vec models (and others)
The awesome thing about word2vec and such models is that you don't need to train them on your data for most cases. You can use pre-trained word embedding that has been trained on a ton of data and encodes the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.
You can check similarity between these sentence embeddings using cosine_similarity
4. Sample code implementation
I use a glove model (similar to word2vec) which is already trained on wikipedia, where each word is represented as a 50-dimensional vector. You can choose other models than the one I used from here - https://github.com/RaRe-Technologies/gensim-data
from scipy import spatial
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50") #choose from multiple models https://github.com/RaRe-Technologies/gensim-data
s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'
def preprocess(s):
return [i.lower() for i in s.split()]
def get_vector(s):
return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)
print('s0 vs s1 ->',1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
If you want to compare sentences you should not use Word2Vec or GloVe embeddings. They translate every word in a sentence into a vector. It is quite cumbersome to get how similar those sentences out of the two sets of such vectors. You should use something that is tailored to convert whole sentence into a single vector. Then you just need to compare how similar two vector are. Universal Sentence Encoder is one of the best encoders considering computational cost and accuracy trade off (the DAN variant). See example of usage in this post. I believe it describes a use case which is quite close to yours.

In count vectorizer which axis to use?

I want to create a document term matrix. In my case it is not like documents x words but it is sentences x words so the sentences will act as the documents. I am using 'l2' normalization post doc-term matrix creation.
The term count is important for me to create summarization using SVD in further steps.
My query is which axis will be appropriate to apply 'l2' normalization. With sufficient research I understood:
Axis=1 : Will give me the importance of the word in a sentence (column wise normalization)
Axis=0 : Importance of the word in a document (row wise normalization).
Even after knowing the theory I am not able to decide which alternative to choose because the choice will greatly affect my summarization results. So kindly guide me a solution along with a reason for the same.
By L2 normalization, do you mean division by the total count?
If you normalize along axis=0, then the value of x_{i,j} is the probability of the word j over all sentences i (division by the global word count), which is dependent on the length of the sentence, as longer ones can repeat some words over and over again and will have a much higher probability for this word, as they contribute a lot to the global word count.
If you normalize along axis=1, then you're asking whether sentences have the same composition of words, as you normalize along the lenght of the sentence.

What is the meaning of "size" of word2vec vectors [gensim library]?

Assume that we have 1000 words (A1, A2,..., A1000) in a dictionary. As fa as I understand, in words embedding or word2vec method, it aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, or the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of "size" parameter for Word2vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1,x2,...,x100), what do x1,x2,...,x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that a v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs, and averaged all their differences, you might find some 100-dimensional vector-dimension that is suggestive of the gender direction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.

How do I calculate the semantic similarity between two n-grams?

I'm trying to calculate the semantic similarity between two bi-grams and I need to use fasttext's pre-trained word vectors to accomplish this task.
For ex :
The b-grams are python lists of two elements:
[his, name] and [I, am]
They are two tuples and I need to calculate the similarity between these two tuples by any means necessary.
I'm hoping there's a score which can give me a good approximation of similarity.
For ex - If there are methods which can tell me that [His, name] is more similar to [I, am] than [An, apple].
Right now I only made use of cosine similarity which does include any semantic similarity.
Cosine similarity might be useful if you average both word vectors in a bi-gram first. So you want to take the vector for 'his' and 'name', average them into one vector. Then take the vector for 'I' and 'am' and average them into one vector. Finally, calculate cosine similarity for both resulting vectors and it should give you a rough semantic similarity.

What is the output of Spark MLLIB LDA topicsmatrix?

The output of LDAModel.topicsMatrix() is unclear to me.
I think I understand the concept of LDA and that each topic is represented by a distribution over terms.
In the LDAModel.describeTopics() it is clear (I think):
The highest sum of likelihoods of words of a sentence per topic, indicates the evidence of this tweet belonging to a topic.
With n topics, the output of describeTopics() is a n times m matrix where m stands for the size of the vocabulary. The values in this matrix are smaller or equal to 1.
However in the LDAModel.topicsMatrix(), I have no idea what I am looking at. The same holds when reading the documentation.
The matrix is a m times n matrix, the dimensions have changed and the values in this matrix are larger than zero (and thus can take the value 2, which is not a probability value). What are these values? The occurrence of this word in the topic perhaps?
How do I use these values do calculate the distance of a sentence to a topic?
i think the matrix is m*n m is the words number and n is the topic number

Categories

Resources