Add-1 laplace smoothing for bigram implementation - python

I am doing an exercise where I determine the most likely corpus from a number of corpora when given a test sentence. I am trying to test an add-1 (Laplace) smoothing model for this exercise. I generally think I have the algorithm down, but my results are very skewed. I am aware that add-1 is not optimal (to say the least), but I just want to be certain my results are from the add-1 methodology itself and not my attempt.
Now, the add-1/Laplace smoothing technique seeks to avoid 0 probabilities by, essentially, taking from the rich and giving to the poor. Therefore, a bigram that is found to have a zero probability becomes:
1/V, V=the number of types
This means that the probability of every other bigram becomes:
P(B|A) = Count(W[i-1], W[i]) / (Count(W[i-1]) + V)
You would then take a sentence to test and break each into bigrams and test them against the probabilities (doing the above for 0 probabilities), then multiply them all together to get the final probability of the sentence occurring.
I am implementing this in Python. My code looks like this, all function calls are verified to work:
# each return is a Counter of tuples containing n-grams: {('A','B'): C}
# here ('A','B') means (B|A) in probabilistic terms
bigrams[0] = getBigrams(corpus[0])
...
bigrams[n] = getBigrams(corpus[n])
# each return is a dictionary of the form P['A'] = C
unigrams[0] = getUnigrams(corpus[0])
...
unigrams[n] = getUnigrams(corpus[n])
# generate bigram probabilities; each return maps ('A','B') to p, add-one already applied
prob[0] = getAddOneProb(unigrams[0], bigrams[0])
...
prob[n] = getAddOneProb(unigrams[n], bigrams[n])
for sentence in test:
    bi = getBigrams(sentence)
    uni = getUnigrams(sentence)
    P[0] = ... = P[n] = 1  # set to 1
    for b in bi:
        tup = b
        try:
            P[0] *= prob[0][tup]
        except KeyError:
            P[0] = 1 / len(unigrams[0])
        # do this for all corpora
At the end I would compare all corpora, P[0] through P[n], and find the one with the highest probability.
My results aren't that great, but I am trying to understand whether this is a function of poor coding, incorrect implementation, or inherent add-1 problems.
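For reference, the usual textbook form of add-one smoothing adds 1 to every bigram count, giving P(B|A) = (Count(A,B) + 1) / (Count(A) + V), so an unseen bigram gets 1 / (Count(A) + V) rather than a flat 1/V. The sketch below is a minimal, self-contained version under that assumption. The names train and add_one_log_prob and the two tiny corpora are hypothetical stand-ins, not the getBigrams/getUnigrams/getAddOneProb helpers from the question. It sums log-probabilities instead of multiplying raw ones, which avoids underflow on long sentences, and it takes V from each corpus alone, which is a simplification.

```python
import math
from collections import Counter

def train(tokens):
    """Return (unigram counts, bigram counts) for a list of tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def add_one_log_prob(sentence, unigrams, bigrams):
    """Sum of log P(b|a) over the sentence's bigrams, with add-one smoothing."""
    vocab_size = len(unigrams)  # V = number of word types seen in this corpus
    total = 0.0
    for a, b in zip(sentence, sentence[1:]):
        # Counter returns 0 for unseen keys, so an unseen bigram
        # contributes log(1 / (Count(a) + V)) instead of raising KeyError
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total

# Hypothetical toy corpora, for illustration only
corpora = {
    "pets": "the cat sat on the mat the cat ate".split(),
    "tech": "the app runs on the phone the app crashed".split(),
}
models = {name: train(tokens) for name, tokens in corpora.items()}

test_sentence = "the cat sat".split()
scores = {name: add_one_log_prob(test_sentence, *m) for name, m in models.items()}
best = max(scores, key=scores.get)  # corpus with the highest log-probability
print(best, scores)
```

Every bigram multiplies into the running score here, including the unseen ones. Overwriting the score in the fallback branch would discard everything accumulated before it.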

Related

Gensim: word mover distance with string as input instead of list of string

I'm trying to find out how similar two sentences are.
To do this I'm using gensim's word mover's distance, and since what I want is a similarity, I compute it as follows:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are two strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one you can find on the web: word2vec-google-news-300
I load it with this code:
wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation here, it seems that wmdistance takes a list of strings as input, not a single string like I pass!
def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]
sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get very different results:
wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113
I'm really confused. Why does it work with a string as input, and why does it seem to work better than when I give it what the documentation asks for?
The function needs a list-of-string-tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists-of-single-characters, the value there is going to be a function of how different the letters in the two texts are - and the fact that all English sentences of about the same length have very similar letter-distributions, means most texts will rate as very-similar under that error.
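A quick way to see what the function receives in each case, using only plain Python (no gensim needed). This is a small illustration of how a str is iterated, not a reproduction of gensim's internals:

```python
sentence = "Obama speaks to the media in Illinois"

# Iterating over a str yields single characters, so code that expects
# a list of tokens ends up treating every letter (and space) as a "word"
as_characters = list(sentence)

# What the function is meant to receive: one string per word
as_tokens = sentence.lower().split()

print(as_characters[:5])  # ['O', 'b', 'a', 'm', 'a']
print(as_tokens)          # ['obama', 'speaks', 'to', 'the', 'media', 'in', 'illinois']
```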
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not two different results from different processes (where one of them is essentially random). You shouldn't consider absolute values that are exceeding some set threshold, or close to 1.0, as definitively good. You should instead consider relative differences, between two similarity/distance values, to mean one pair is more similar/distant than another pair.
Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most-similar to -1.0 or 0.0 for least-similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could have arbitrarily negative similarities with that approach, and be fooled to think -0.017 (etc) is bad, because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
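As one example of such a conversion, 1 / (1 + distance) maps a distance in [0, infinity) onto a similarity in (0, 1]. This is just one common choice picked for illustration, not something gensim provides, and distance_to_similarity is a hypothetical helper name:

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]; 0.0 distance gives 1.0."""
    return 1.0 / (1.0 + distance)

# A distance just above 1.0 (which gives a negative value under 1 - distance)
# stays positive here, and larger distances always give smaller similarities
for d in [0.0, 0.44, 1.0175, 5.0]:
    print(d, round(distance_to_similarity(d), 4))
```

The ordering of pairs is unchanged by the conversion, so it only affects how the numbers read, not which pair ranks as closer.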

What is the meaning of "size" of word2vec vectors [gensim library]?

Assume that we have 1000 words (A1, A2,..., A1000) in a dictionary. As far as I understand, the word embedding or word2vec method aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, or that the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of "size" parameter for Word2vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1,x2,...,x100), what do x1,x2,...,x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that a v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs, and averaged all their differences, you might find some 100-dimensional direction vector that is suggestive of the gender direction.)
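The averaging idea can be sketched in a few lines. The 4-dimensional vectors below are invented purely for illustration (they are not real word2vec output), and a real experiment would use many more pairs taken from a trained model:

```python
# Toy vectors, made up for this example only
vectors = {
    "man":   [0.9, 0.1, 0.3, 0.5],
    "woman": [0.5, 0.5, 0.3, 0.1],
    "king":  [0.8, 0.2, 0.9, 0.6],
    "queen": [0.4, 0.6, 0.9, 0.2],
    "boy":   [1.0, 0.0, 0.1, 0.4],
    "girl":  [0.7, 0.5, 0.2, 0.1],
}
pairs = [("man", "woman"), ("king", "queen"), ("boy", "girl")]

def difference(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

# One difference vector per (male-like, female-like) pair
diffs = [difference(vectors[female], vectors[male]) for male, female in pairs]

# Average the differences dimension by dimension to get one direction vector
direction = [sum(column) / len(diffs) for column in zip(*diffs)]
print([round(x, 3) for x in direction])
```

The resulting direction has sizeable components in several dimensions at once, which is the sense in which no single axis "is" the gender dimension.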
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.

Understanding and applying k-means clustering for topic modeling

I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Are they the most frequent words that occur in the cluster? One blog post I read said that the words output at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
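print("Top terms per cluster:")
# for each centroid, feature indices sorted from highest to lowest weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()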
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore with this generation of top terms for each cluster, they are the words that, on average, are most significant in the documents for that cluster.
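To make the "on average" part concrete: once k-means has converged, each centroid is the mean of the vectors assigned to that cluster, so a centroid coordinate is the average TF-IDF score of one term over the cluster's documents. A minimal sketch with invented TF-IDF rows (plain Python, values made up for illustration):

```python
# Hypothetical TF-IDF rows for three documents that ended up in the same cluster
terms = ["cat", "google", "kitten", "map"]
cluster_rows = [
    [0.8, 0.0, 0.6, 0.0],
    [0.7, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.3, 0.1],
]

# The centroid is the column-wise mean of the rows in the cluster
centroid = [sum(column) / len(cluster_rows) for column in zip(*cluster_rows)]

# Sorting term indices by centroid coordinate, highest first, mirrors
# what cluster_centers_.argsort()[:, ::-1] does for each cluster
order = sorted(range(len(terms)), key=lambda i: centroid[i], reverse=True)
top_terms = [terms[i] for i in order[:2]]
print(top_terms)  # ['cat', 'kitten']
```

So the terms with the largest centroid coordinates are the ones with the highest mean TF-IDF score in that cluster.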
The way it has been done here works and is efficient but I find it difficult to understand myself and I don't think it is particularly intuitive as it is difficult to comprehend why, if cluster_centers_ are the co-ordinates for the centroids, then the features with the highest co-ordinate numbers are the top words. I kind of get it but not quite (if anyone wants to explain how this works that would be great!).
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
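import numpy as np
def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        # keep only the rows (documents) assigned to the target cluster
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # sum each term's scores over the documents; asarray/ravel flattens the
    # 1 x n_terms matrix that a scipy sparse input returns into a 1-D array
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    # term indices ordered from highest to lowest score
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]
term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)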
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I assign target=1 and the function filters the X array, keeping only rows assigned to cluster 1. It then sums the scores over those documents (down each column), so it has one single row with a column for each word. It then uses argsort to order that row from highest value to lowest, which yields the original index positions of the words rather than the values. Finally it uses a list comprehension to grab index numbers from the top score to n_top_words and then builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about closeness of words as how similar they are. So for instance, you could use cosine similarity for determining how close two words are to each other. The vectors of "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other terms, closeness is also determined by the context words. So, word pair (the, cat) can be close, as per the sentences. That is how word2vec or similar algorithms work to create word vectors.
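A small sketch of the cosine similarity mentioned above, using made-up 3-dimensional vectors (real word vectors would come from a trained model and have far more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors invented for illustration only
toy_vectors = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])
print(round(dog_puppy, 3), round(dog_car, 3))
```

With these toy values "dog" comes out much closer to "puppy" than to "car", which is the sense of "close" used here.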

Optimizing Gensim word mover's distance function for speed (wmdistance)

I am using gensim wmdistance for calculating similarity between a reference sentence and 1000 other sentences.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
# wmdistance expects each document as a list of tokens, not a raw string
reference_sentence = "it is a reference sentence".split()
other_sentences = [...]  # 1000 sentences, each a list of tokens
distance = {}
for index, sentence in enumerate(other_sentences):
    distance[index] = model.wmdistance(reference_sentence, sentence)
According to gensim source code, model.wmdistance returns the following:
emd(d1, d2, distance_matrix)
where
d1 = # Compute nBOW representation of reference_sentence.
d2 = # Compute nBOW representation of other_sentence (one by one).
distance_matrix = see the source code, as it's a bit too much to paste here.
This code is inefficient in two ways for my use case.
1) For the reference sentence, it is repeatedly calculating d1 (1000 times) for the distance function emd(d1, d2, distance_matrix).
2) This distance function is called by multiple users from different points which repeat this whole process of model.wmdistance(doc1, doc2) for the same other_sentences and it is computationally expensive. For this 1000 comparisons, it takes around 7-8 seconds.
Therefore, I would like to isolate the two tasks: the final calculation of the distance, emd(d1, d2, distance_matrix), and the preparation of its inputs d1, d2, and the distance matrix. Since the distance matrix depends on both documents, at least its input preparation should be isolated from the final matrix calculation.
My initial plan is to create three customized functions:
d1 = prepared1(reference_sentence)
d2 = prepared2(other_sentence)
distance_matrix inputs = prepare inputs
Is it possible to do this with this gensim function, or should I just write my own customized version? Any ideas and solutions to deal with this problem in a better way?
You are right to observe that this code could be refactored & optimized to avoid doing repetitive operations, especially in the common case where one reference/query doc is evaluated against a larger set of documents. (Any such improvements would also be a welcome contribution back to gensim.)
Simply preparing single documents outside the calculation might not offer a big savings; in each case, all word-to-word distances between the two docs must be calculated. It might make sense to precalculate a larger distance_matrix (to the extent that the relevant vocabulary & system memory allows) that includes all words needed for many pairwise WMD calculations.
(As tempting as it might be to precalculate all word-to-word distances, with a vocabulary of 3 million words like the GoogleNews vector-set, and mere 4-byte float distances, storing them all would take at least 18TB. So calculating distances for relevant words, on manageable batches of documents, may make more sense.)
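The 18TB figure can be checked with a couple of lines of arithmetic, assuming each unordered word pair is stored once (the distance is symmetric) and using decimal terabytes:

```python
vocab_size = 3_000_000      # words in the GoogleNews vector set
bytes_per_distance = 4      # one 32-bit float per pair

# Each unordered pair is stored once, since dist(a, b) == dist(b, a)
unique_pairs = vocab_size * (vocab_size - 1) // 2
total_bytes = unique_pairs * bytes_per_distance
terabytes = total_bytes / 1e12

print(round(terabytes, 1))  # 18.0
```

Storing the full square matrix instead of only the unique pairs would roughly double that.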
A possible way to start would be to create a variant of wmdistance() that explicitly works on one document versus a set-of-documents, and can thus combine the creation of histograms/distance-matrixes for many comparisons at once.
For the common case of not needing all WMD values, but just wanting the top-N nearest results, there's an optimization described in the original WMD paper where another faster calculation (called there 'RWMD') can be used to deduce when there's no chance a document could be in the top-N results, and thus skip the full WMD calculation entirely for those docs.
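As a rough illustration of that relaxed bound: each word's weight is moved entirely to its nearest word in the other document, this is done in both directions, and the larger of the two costs is kept as a lower bound on the full WMD. The sketch below is my reading of that idea in plain Python, not gensim code. The function names and the 2-dimensional toy vectors are hypothetical:

```python
import math
from collections import Counter

def nbow(tokens):
    """Normalized bag-of-words: word -> share of the document's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def one_sided_cost(doc_from, doc_to, vectors):
    """Move each word's whole weight to its nearest word in the other document."""
    return sum(
        weight * min(math.dist(vectors[word], vectors[other]) for other in doc_to)
        for word, weight in doc_from.items()
    )

def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Lower bound on WMD: the larger of the two one-sided relaxed costs."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))

# Toy 2-D vectors, invented for illustration only
toy = {
    "obama": [1.0, 0.0], "president": [0.9, 0.1],
    "media": [0.1, 0.9], "press": [0.0, 1.0],
    "banana": [5.0, 5.0],
}
query = ["obama", "media"]
near = relaxed_wmd(query, ["president", "press"], toy)
far = relaxed_wmd(query, ["banana"], toy)
print(round(near, 3), round(far, 3))
```

A document whose relaxed bound is already worse than the current N-th best full WMD can be skipped without running the expensive solver on it.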

tf-idf : should I do normalization of documents length

When using TF-IDF to compare documents A and B,
I know that the length of the documents is not important.
But when comparing A-B against A-C,
I think documents B and C should be the same length.
For example:
Log : 100 words
Document A : 20 words
Document B : 30 words
Log - A 's TF-IDF score : 0.xx
Log - B 's TF-IDF score : 0.xx
Should I normalize documents A and B?
(If the comparison targets have different lengths, it seems the result could be problematic or wrong.)
Generally you want to do whatever gives you the best cross validated results on your data.
If all you are doing to compare them is taking cosine similarity then you have to normalize the vectors as part of the calculation but it won't affect the score on account of varying document lengths. Many general document retrieval systems consider shorter documents to be more valuable but this is typically handled as a score multiplier after the similarities have been calculated.
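A quick check of the length-invariance point, using raw term counts as a stand-in for TF-IDF weights (an assumption made to keep the example short; the sentences are invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

log = Counter("error disk full error retry".split())
doc = Counter("disk full".split())
doc_twice = Counter("disk full disk full".split())  # same content, double the length

print(cosine(log, doc), cosine(log, doc_twice))
```

Both calls give the same score, because the normalization inside cosine similarity cancels any uniform scaling of a document's vector.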
Oftentimes ln(TF) is used instead of raw TF scores as a normalization feature, because the difference between seeing a term 1 and 2 times is far more important than the difference between seeing a term 100 and 200 times; it also keeps excessive use of a term from dominating the vector and is typically much more robust.
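The effect is easy to see numerically. The sketch below uses 1 + ln(tf), a common variant that keeps a single occurrence from scoring zero (scikit-learn's sublinear_tf option uses this form). Treat the exact variant as an assumption, since the text above only says ln(TF):

```python
import math

def log_tf(tf):
    """Sublinear term frequency: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

gaps = {}
for low, high in [(1, 2), (100, 200)]:
    raw_gap = high - low                   # 1 versus 100
    log_gap = log_tf(high) - log_tf(low)   # ln(2) in both cases
    gaps[(low, high)] = (raw_gap, log_gap)
    print(low, high, raw_gap, round(log_gap, 3))
```

On the raw scale the jump from 100 to 200 is a hundred times larger than the jump from 1 to 2. On the log scale the two jumps count the same, so a heavily repeated term no longer dominates the vector.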
