Nearest Neighbours of OOV are also OOV - FastText - python

I am trying to get the nearest neighbors of an out-of-vocabulary (OOV) word in fasttext, but it appears that the nearest neighbors are OOV as well. This is the code I'm using:
# get the intersection of the vocabularies of both models
all_vocab = []
all_vocab.append(model1.words)
all_vocab.append(model2.words)
# get the intersection of all vocabulary
common_vocab = list(set.intersection(*map(set, all_vocab)))
print('len of common vocab: {}'.format(len(common_vocab)))
# len of common vocab: 112251
nnsims1 = model1.get_nearest_neighbors(w, k)
nnsims2 = model2.get_nearest_neighbors(w, k)
nn1 = [n[1] for n in nnsims1 if n in model1.words]
nn2 = [n[1] for n in nnsims2 if n in model2.words]
print(len(nn1) == len(nnsims1)) # False
print(len(nnsims1), len(nn1)) # 50 0
print(len(nn2) == len(nnsims2)) # False
print(len(nnsims2), len(nn2)) # 50 0
My interpretation is that if the word is OOV, its vector is an aggregation of some sub-words, and since this "aggregated representation" is not in the vocabulary, it has no neighbors from that vocabulary. However, how are the neighbors for an OOV word generated? I can't seem to find any explanation in FastText's documentation.

Models that are trained separately will not have compatible coordinate spaces, unless you take some other steps to force that to be the case. (There's enough randomness in the model initialization, & training, & effects of different training data & even arbitrary thread-ordering during multithreaded training, that each training session is essentially creating its own separate, but internally-sensible, 'space'.)
So even a simple word like 'hot' could be in an arbitrarily different place, and the relative distances/directions are only meaningful inside the same model.
Thus I suspect what's happening (without totally stepping through your code) is that for any OOV word from one model, all of the nearest in-vocabulary words (of that same model) are still OOV in the other model.
To the extent the language/word-senses in both corpora are compatible, you may want to train all texts/words into a single combined model.
(There are other 'alignment'/translation techniques you could also consider, if the models have some set of shared 'anchor' words that you are willing to stipulate should have identical coordinates - but that's a more complicated & probably 'looser' process than just getting all the data/words into one unified model.)
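As a rough illustration of the combined-model route, here is a minimal sketch using the fasttext Python bindings (the file name and query word are placeholders; you would first concatenate both corpora into a single training file):
import fasttext
# Hypothetical combined file, e.g. produced by: cat corpus1.txt corpus2.txt > combined.txt
model = fasttext.train_unsupervised('combined.txt', model='skipgram', dim=100)
# Neighbors of an OOV word are now drawn from the one shared vocabulary/space
print(model.get_nearest_neighbors('someoovword', k=10))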

Related

Ignore out-of-vocabulary words when averaging vectors in Spacy

I would like to use a pre-trained word2vec model in Spacy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings.
To do this I use the following code:
import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector
Where nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided and (3) averages the word vectors within a sentence to provide a single output vector. That's fast and cool.
However, in this process, out-of-vocabulary (OOV) terms are mapped to n-dimensional 0 vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when performing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.
I have been able to do this with a post-processing step (see code below), but this becomes too slow for my purposes, so I was wondering if there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that calling nlp(sentence).vector does not include OOV-term vectors when computing the mean.
import numpy as np
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)
Approaches tried
In both cases, documents is a list of 200 strings with ≈400 words each.
Without dealing with OOV terms:
import spacy
import time
nlp = spacy.load('myspacy.bioword2vec.model')
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in list(nlp.pipe(documents))]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s
Ignoring OOV terms in the output vector. Note that in this case we need to add an extra 'if' statement for those cases in which all words are OOV (if this happens the output vector is r_vec):
r_vec = np.random.rand(200)  # Random vector for empty text
# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s
In this example the mean time difference for vectorizing 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents made up entirely of OOV terms, which might slightly increase computation time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.
I could always look for extra computational resources to apply the second piece of code, but I was wondering if there is any way of applying the nlp.pipe(documents) ignoring the OOV terms in the output. Any suggestion will be very much welcome.
See this post by the author of Spacy, which says:
The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.
Try this for example:
import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np
sentence = "I love Stack Overflow butitsalsodistractive"
print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])
np.array_equal(tokens.vector, tokensClean.vector)
#False
If you want to speed things up, disable the pipeline components in spacy that you don't use (such as NER, dependency parsing, etc.).
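For instance, a minimal sketch combining both ideas, rebuilding each text from its in-vocabulary tokens and then vectorizing in bulk with nlp.pipe while unused components are disabled (documents is the list from the question; component names may vary by model and spaCy version):
import spacy
# Load the model used in the example above, with unused components disabled
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser', 'ner'])
# First pass: rebuild each text from its in-vocabulary tokens only
cleaned = [" ".join(t.text for t in doc if t.has_vector) for doc in nlp.pipe(documents)]
# Second pass: the averaged .vector now ignores the dropped OOV terms
# (texts that were entirely OOV become empty strings with a zero vector)
vectors = [doc.vector for doc in nlp.pipe(cleaned)]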

What is the meaning of "size" of word2vec vectors [gensim library]?

Assume that we have 1000 words (A1, A2,..., A1000) in a dictionary. As far as I understand, word embedding or word2vec methods aim to represent each word in the dictionary by a vector where each element represents the similarity of that word to the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, i.e. that the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of the "size" parameter for Word2Vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1, x2, ..., x100), what do x1, x2, ..., x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with the dimension axes, but angled through all the dimensions. (That is, you're not going to find that v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs, and averaged all their differences, you might find a 100-dimensional vector whose direction is suggestive of the gender direction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.
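For example, a minimal gensim sketch (the toy corpus is made up; the parameter is called size in gensim 3.x and vector_size in gensim 4.x):
from gensim.models import Word2Vec
sentences = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)  # size=100 in older gensim
vec = model.wv["cat"]
print(vec.shape)  # (100,) -- one learned coordinate per dimension, none individually interpretable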

Understanding and applying k-means clustering for topic modeling

I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Are they the most frequent words that occur in the cluster? One blog post I read said that the output words at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore, the top terms generated for each cluster are the words that, on average, are most significant in the documents of that cluster.
The way it has been done here works and is efficient, but I find it difficult to understand myself and not particularly intuitive: it is hard to see why, if cluster_centers_ are the coordinates of the centroids, the features with the highest coordinate values are the top words. I kind of get it, but not quite (if anyone wants to explain how this works, that would be great!).
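As a quick check that the two views line up: for k-means on TFIDF vectors, each centroid coordinate is simply the mean TFIDF weight of that term over the cluster's documents, so argsort-ing a centroid surfaces the terms with the highest average weight. Here is a sketch using X and model from the question; it should print True once k-means has fully converged:
import numpy as np
cluster_id = 0
# Mean TFIDF vector of the documents assigned to this cluster
members = X[model.labels_ == cluster_id].toarray()
manual_centroid = members.mean(axis=0)
print(np.allclose(manual_centroid, model.cluster_centers_[cluster_id]))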
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np
def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # sum each term's scores over the selected documents and flatten to a 1-D array
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]
term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I assign target=1; the function filters the X array, keeping only the rows assigned to cluster 1. It then sums all the scores across those documents row-wise, so it has a single row with a column for each word. It then uses argsort to sort that row from highest to lowest value, replacing the values with the original index positions of the words. Finally, it uses a list comprehension to grab the index numbers down to n_top_words and builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about the closeness of words as how similar they are. For instance, you could use cosine similarity to determine how close two words are to each other. The vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words: the word pair (the, cat) can be close, depending on the sentences they appear in. That is how word2vec and similar algorithms work to create word vectors.
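A toy illustration of that notion of closeness (the vectors below are made up rather than taken from a trained model; they just show the cosine computation):
import numpy as np
vectors = {"dog": np.array([0.9, 0.8, 0.1]),
           "puppy": np.array([0.85, 0.75, 0.2]),
           "car": np.array([0.1, 0.2, 0.9])}
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine(vectors["dog"], vectors["puppy"]))  # close to 1.0: "near" words
print(cosine(vectors["dog"], vectors["car"]))    # much lower: "far" words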

Compute probability of sentence with out of vocabulary words

I trained Ngram language models (unigram and bigram) on a corpus of English and I'm trying to compute the probabilities of sentences from a disjoint corpus.
For example, the training corpus consists of the 3 sentences:
1: I, am, Sam
2: Sam, I, am
3: I, do, not, like, green, eggs, and, ham
N = 14 (length of the corpus)
For unigram, I end up with probabilities:
Pr("i") = #("i") / N = 3/14, Pr("am") = 2/14, Pr("like") = 1/14, and so forth...
For bigram, I end up with probabilities:
Pr("am"|"i") = 2/3, Pr("do"|"i") = 1/3, and so forth...
Now, I'm trying to compute the probability of the following sentence where not all ngrams (uni or bi) appear in the training corpus:
I, ate, a, burrito
For unigram, I need the following probability estimates:
Pr("i"), Pr("ate"), Pr("a"), and Pr("burrito")
and for bigram, I need the following probabilities estimates:
Pr("ate"|"i"), Pr("a"|"ate"), Pr("burrito"|"a")
Apparently not all unigrams ("ate", "burrito") and bigrams (like ("i", "ate")) appear in the training corpus.
I understand that you can do smoothing (like add-one smoothing) to deal with these cases:
For example, the vocabulary of the training corpus is
i, am, sam, do, not, like, green, eggs, and, ham
and you can expand the vocabulary by including new words from the new sentence:
ate, a, burrito
So the size of the expanded vocabulary would be V = 13
So for unigram, the original probability estimates Pr(w_i) = #(w_i)/N would be turned into (#(w_i) + 1) / (N + V)
So Pr("i") = 4/27, Pr("am") = 3/27, Pr("sam") = 3/27, Pr("do") = 2/27, Pr("not") = 2/27, Pr("like") = 2/27, Pr("green") = 2/27, Pr("eggs") = 2/27, Pr("and") = 2/27, Pr("ham") = 2/27
And for the 3 new words:
Pr("ate") = 1/27, Pr("a") = 1/27, Pr("burrito") = 1/27
And these probabilities would still sum to 1.0
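A quick sketch reproducing the add-one arithmetic above with collections.Counter (toy corpus and test words taken from the question):
from collections import Counter
corpus = "i am sam sam i am i do not like green eggs and ham".split()
test_words = ["i", "ate", "a", "burrito"]
counts = Counter(corpus)               # N = 14 tokens
vocab = set(corpus) | set(test_words)  # expanded vocabulary, V = 13
N, V = len(corpus), len(vocab)
probs = {w: (counts[w] + 1) / (N + V) for w in vocab}
print(probs["i"], probs["burrito"])    # 4/27 and 1/27
print(round(sum(probs.values()), 10))  # 1.0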
Though this can handle the cases where some ngrams were not in the original training set, you would have to know the set of "new" words when you estimate the probabilities using (#(w_i) + 1) / (N + V) (where V is the vocabulary size of the original training set (10) plus that of the test corpus (3)). I think this is equivalent to assuming that every new unigram or bigram in the test corpus occurs only once, no matter how many times it actually occurs.
My question is: is this the way out-of-vocabulary tokens are typically handled when computing the probability of a sentence?
The NLTK module nltk.module.NGramModel seems to have been removed due to bugs, so I have to implement it on my own. Another question: are there Python modules other than NLTK that implement ngram training and computing the probability of a sentence?
Thanks in advance!
My answer is based on a solution in "Speech and Language Processing" by Jurafsky & Martin, for the scenario in which you are building your vocabulary from your training data (you start with an empty dictionary).
In this case, you treat the first instance of any new out-of-vocabulary (OOV) word as an unknown token <UNK>.
This way, all rare words are mapped to one token, just like unseen words. To understand why, consider that a single occurrence is not enough for your model to base a decision on. The unknown token also helps your accuracy on seen tokens.
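A minimal sketch of one common variant of that scheme, using the toy corpus from the question: words seen only once in training are mapped to <UNK>, and unseen test words map there as well, so they still receive a (small) non-zero probability:
from collections import Counter
train_tokens = "i am sam sam i am i do not like green eggs and ham".split()
counts = Counter(train_tokens)
# Keep only words seen more than once; everything else becomes <UNK>
vocab = {w for w, c in counts.items() if c > 1}
def map_unk(tokens):
    return [w if w in vocab else "<UNK>" for w in tokens]
train_mapped = map_unk(train_tokens)        # rare training words -> <UNK>
print(map_unk("i ate a burrito".split()))   # ['i', '<UNK>', '<UNK>', '<UNK>']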
I found this pdf version:
https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf
Regarding your second question, I think that with a tweak and some preprocessing of your text you can use CountVectorizer in scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
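For instance, a small sketch of the kind of preprocessing meant here, using CountVectorizer only to collect the unigram and bigram counts (it does not compute sentence probabilities itself; use get_feature_names() instead on older scikit-learn):
from sklearn.feature_extraction.text import CountVectorizer
sentences = ["i am sam", "sam i am", "i do not like green eggs and ham"]
# token_pattern keeps one-letter words like "i"; ngram_range adds bigrams
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
X_counts = vec.fit_transform(sentences)
totals = dict(zip(vec.get_feature_names_out(), X_counts.sum(axis=0).A1))
print(totals["i"], totals["i am"])  # 3 2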

Add-1 laplace smoothing for bigram implementation

I am doing an exercise where I am determining the most likely corpus from a number of corpora when given a test sentence. I am trying to test an add-1 (Laplace) smoothing model for this exercise. I generally think I have the algorithm down, but my results are very skewed. I am aware that add-1 is not optimal (to say the least), but I just want to be certain my results come from the add-1 methodology itself and not from my attempt at it.
Now, the add-1/Laplace smoothing technique seeks to avoid 0 probabilities by, essentially, taking from the rich and giving to the poor. Therefore, a bigram that is found to have a zero probability becomes:
1/V, V=the number of types
This means that the probability of every other bigram becomes:
P(B|A) = Count(W[i-1], W[i]) / (Count(W[i-1]) + V)
You would then take a test sentence, break it into bigrams, and test them against the probabilities (doing the above for 0 probabilities), then multiply them all together to get the final probability of the sentence occurring.
I am implementing this in Python. My code looks like this; all function calls are verified to work:
#return is a counter of tuples containing ngrams: {('A','B'):C}
#this means ('A','B') means (B|A) in probabilistic terms
bigrams[0]=getBigrams(corpus[0])
...
bigrams[n]=getBigrams(corpus[n])
#return is a dictionary of the form P['A']=C
unigrams[0]=getUnigrams(corpus[0])
...
unigrams[n]=getUnigrams(corpus[n])
#generate bigram probabilities, return is P('A','B')=p, add one is done
prob[0]=getAddOneProb(unigrams[0],bigrams[0])
...
prob[n]=getAddOneProb(unigrams[n],bigrams[n])
for sentence in test:
    bi=getBigrams(sentence)
    uni=getUnigrams(sentence)
    P[0]=...=P[n]=1 #set to 1
    for b in bi:
        tup=b
        try:
            P[0]*=prob[0][tup]
        except KeyError:
            P[0]=(1/len(unigrams[0]))
        #do this for all corpora
At the end I would compare all corpora, P[0] through P[n], and find the one with the highest probability.
My results aren't that great but I am trying to understand if this is a function of poor coding, incorrect implementation, or inherent add-1 problems.
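For comparison, a minimal self-contained sketch of standard add-1 bigram scoring, where every bigram (seen or unseen) is estimated as (count(a,b) + 1) / (count(a) + V); note this differs slightly from the 1/V fallback described above (toy corpus; sentence boundaries are ignored):
from collections import Counter
def train(tokens):
    # Unigram counts, bigram counts, and vocabulary size V (number of types)
    return Counter(tokens), Counter(zip(tokens, tokens[1:])), len(set(tokens))
def sentence_prob(tokens, unigram_counts, bigram_counts, V):
    p = 1.0
    for a, b in zip(tokens, tokens[1:]):
        # Add-1: unseen bigrams get 1 / (count(a) + V) instead of zero
        p *= (bigram_counts[(a, b)] + 1) / (unigram_counts[a] + V)
    return p
corpus = "i am sam sam i am i do not like green eggs and ham".split()
unigram_counts, bigram_counts, V = train(corpus)
print(sentence_prob("i am sam".split(), unigram_counts, bigram_counts, V))
print(sentence_prob("i ate a burrito".split(), unigram_counts, bigram_counts, V))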
