When using TF-IDF to compare Document A with Document B, I know that document length is not important.
But when comparing A-B against A-C, I think documents B and C should be the same length.
For example:
Log : 100 words
Document A : 20 words
Document B : 30 words
Log - A's TF-IDF score : 0.xx
Log - B's TF-IDF score : 0.xx
Should I normalize documents A and B?
(If the comparison targets have different lengths, it seems like this could produce a skewed or wrong result.)
Generally you want to do whatever gives you the best cross-validated results on your data.
If all you are doing to compare them is taking cosine similarity, then you have to normalize the vectors as part of the calculation, but the score won't be affected by varying document lengths. Many general document-retrieval systems consider shorter documents to be more valuable, but this is typically handled as a score multiplier after the similarities have been calculated.
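To see why document length washes out under cosine similarity, here is a minimal numpy sketch (toy term-count vectors of my own, not tied to any particular retrieval library): scaling a document's vector, which is what a longer document with the same word proportions amounts to, cancels out when you divide by the norms.

import numpy as np

def cosine(a, b):
    # Normalization is part of the cosine calculation itself.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 2.0, 0.0, 1.0])  # toy term-count vector for the query
doc_short = np.array([2.0, 1.0, 1.0, 0.0])
doc_long = 10 * doc_short               # same word proportions, 10x the length

print(cosine(query, doc_short))  # identical to the next line...
print(cosine(query, doc_long))   # ...because the length scaling cancels out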
Oftentimes ln(TF) is used instead of raw TF scores as a normalization feature, because the difference between seeing a term 1 and 2 times is far more important than the difference between seeing a term 100 and 200 times; it also keeps excessive use of a term from dominating the vector and is typically much more robust.
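One readily available form of this is scikit-learn's sublinear_tf option, which replaces tf with 1 + ln(tf); a small sketch (the toy documents are my own, just to show the flag):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the cat cat cat cat cat sat"]  # heavy repetition of one term

# sublinear_tf=True applies 1 + ln(tf) instead of the raw term count,
# damping the influence of a term that is repeated many times.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(3))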
I'm trying to find out how similar 2 sentences are.
To do that I'm using gensim's Word Mover's Distance, and since what I'm trying to find is a similarity, I compute it as follows:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one you can find on the web: word2vec-google-news-300. I load it with this code:
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation here, it seems wmdistance takes a list of strings as input, not a single string like I'm passing!
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # assumption: the usual NLTK English stopword list

def preprocess(sentence):
    # Lowercase, split on whitespace, and drop stopwords.
    return [w for w in sentence.lower().split() if w not in stop_words]
sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get really different results:
wmd using a string as input: 0.5562025871542842
wmd using a list of strings as input: -0.0174646259300113
I'm really confused. Why does it work with strings as input, and why does it seem to work better than when I give what the documentation is asking for?
The function needs a list of string tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists of single characters, the value there is going to be a function of how different the letters in the two texts are. And because all English sentences of about the same length have very similar letter distributions, most texts will rate as very similar under that error.
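You can see what the function actually receives in that case: iterating over a plain Python string yields single characters, so those become the "tokens".

sentence = "Obama speaks to the media in Illinois"
print(list(sentence)[:10])
# ['O', 'b', 'a', 'm', 'a', ' ', 's', 'p', 'e', 'a']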
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not to results from different processes (where one of them is essentially random). You shouldn't treat absolute values that exceed some set threshold, or that are close to 1.0, as definitively good. You should instead look at relative differences between two similarity/distance values, which tell you that one pair is more similar/distant than another pair.
Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most similar down to -1.0 or 0.0 for least similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
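Purely for illustration (the linked question covers the options properly), one common convention maps a non-negative distance d to the similarity 1 / (1 + d), which stays in (0, 1] and never goes negative:

def similarity_from_distance(d):
    # d = 0 maps to 1.0; larger distances shrink smoothly toward 0, never below it.
    return 1.0 / (1.0 + d)

print(similarity_from_distance(0.0))    # 1.0
print(similarity_from_distance(1.017))  # ~0.50, rather than the misleading -0.017 from 1 - d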
I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so the sentences act as the documents. I am applying 'l2' normalization after creating the doc-term matrix.
The term counts are important to me for creating a summarization using SVD in further steps.
My query is which axis is appropriate for the 'l2' normalization. After some research, my understanding is:
Axis=1 : Gives me the importance of each word within a sentence (row-wise normalization).
Axis=0 : Gives me the importance of each word across the whole document (column-wise normalization).
Even after knowing the theory, I am not able to decide which alternative to choose, because the choice will greatly affect my summarization results. So kindly suggest an option, along with the reason for it.
By L2 normalization, do you mean division by the total count?
If you normalize along axis=0, then the value of x_{i,j} is the probability of word j over all sentences i (division by the global word count), which depends on sentence length: longer sentences can repeat some words over and over, so they will have a much higher probability for those words, as they contribute a lot to the global word count.
If you normalize along axis=1, then you're asking whether sentences have the same composition of words, as you normalize along the length of the sentence.
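A small sketch with scikit-learn's normalize, just to make concrete which direction each axis works in (the toy counts are my own):

import numpy as np
from sklearn.preprocessing import normalize

# Toy sentences x words count matrix: 3 sentences, 4 vocabulary words.
X = np.array([[2, 1, 0, 1],
              [0, 3, 1, 0],
              [1, 1, 1, 1]], dtype=float)

rows_unit = normalize(X, norm='l2', axis=1)  # each sentence (row) becomes a unit vector
cols_unit = normalize(X, norm='l2', axis=0)  # each word (column) becomes a unit vector

print(np.linalg.norm(rows_unit, axis=1))  # all 1.0
print(np.linalg.norm(cols_unit, axis=0))  # all 1.0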
Assume that we have 1000 words (A1, A2, ..., A1000) in a dictionary. As far as I understand, the word embedding or word2vec method aims to represent each word in the dictionary by a vector where each element represents the similarity of that word to the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, i.e. that the size of each word2vec vector should be 999?
But with Gensim in Python, we can modify the value of the "size" parameter of Word2Vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1, x2, ..., x100), what do x1, x2, ..., x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with the dimension axes, but angled through all the dimensions. (That is, you're not going to find that v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like/female-like word pairs and averaged all their differences, you might find a direction through the 100-dimensional space that is suggestive of the gender distinction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.
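To make the 'size' choice concrete, here is a minimal gensim sketch on a made-up toy corpus (note that in gensim 4.x the parameter is named vector_size rather than size):

from gensim.models import Word2Vec

# Tiny invented corpus; real training needs far more data for useful vectors.
sentences = [["king", "rules", "the", "kingdom"],
             ["queen", "rules", "the", "kingdom"],
             ["man", "walks", "the", "dog"],
             ["woman", "walks", "the", "dog"]]

model = Word2Vec(sentences, vector_size=100, min_count=1, epochs=50)

vec = model.wv["king"]  # a 100-dimensional numpy array
print(vec.shape)        # (100,) - one value per learned dimension, none individually interpretable
# On this toy corpus the analogy below is noise; with enough data it tends toward 'queen'.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))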
I am using gensim wmdistance for calculating similarity between a reference sentence and 1000 other sentences.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
reference_sentence = "it is a reference sentence"
other_sentences = [1000 sentences]

distance = [0.0] * len(other_sentences)
index = 0
for sentence in other_sentences:
    # Compare the reference against each sentence in turn.
    distance[index] = model.wmdistance(reference_sentence, sentence)
    index = index + 1
According to gensim source code, model.wmdistance returns the following:
emd(d1, d2, distance_matrix)
where
d1 = # Compute nBOW representation of reference_sentence.
d2 = # Compute nBOW representation of other_sentence (one by one).
distance_matrix = # See the source code; it's a bit too much to paste here.
This code is inefficient in two ways for my use case.
1) For the reference sentence, it is repeatedly calculating d1 (1000 times) for the distance function emd(d1, d2, distance_matrix).
2) This distance function is called by multiple users from different entry points, which repeats this whole process of model.wmdistance(doc1, doc2) for the same other_sentences, and it is computationally expensive. For these 1000 comparisons, it takes around 7-8 seconds.
Therefore, I would like to separate the two tasks: the final distance calculation, emd(d1, d2, distance_matrix), and the preparation of its inputs d1, d2, and distance_matrix. Since the distance matrix depends on both documents, at least the preparation of its inputs should be separated from the final distance calculation.
My initial plan is to create three customized functions:
d1 = prepared1(reference_sentence)
d2 = prepared2(other_sentence)
distance_matrix inputs = prepare inputs
Is it possible to do this with this gensim function, or should I just write my own customized version? Any ideas or solutions for dealing with this problem in a better way?
You are right to observe that this code could be refactored & optimized to avoid doing repetitive operations, especially in the common case where one reference/query doc is evaluated against a larger set of documents. (Any such improvements would also be a welcome contribution back to gensim.)
Simply preparing single documents outside the calculation might not offer a big savings; in each case, all word-to-word distances between the two docs must be calculated. It might make sense to precalculate a larger distance_matrix (to the extent that the relevant vocabulary & system memory allows) that includes all words needed for many pairwise WMD calculations.
(As tempting as it might be to precalculate all word-to-word distances, with a vocabulary of 3 million words like the GoogleNews vector-set, and mere 4-byte float distances, storing them all would take at least 18TB. So calculating distances for relevant words, on manageable batches of documents, may make more sense.)
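As a rough sketch of that idea (the batching strategy, names, and use of scipy's cdist are my own assumptions here, not gensim internals), you could compute the word-to-word distances once per manageable batch, restricted to the vocabulary that actually occurs in it:

import numpy as np
from scipy.spatial.distance import cdist

def batch_distance_matrix(model, docs):
    # Euclidean word-to-word distances for just the vocabulary of this batch of token lists,
    # reusable across all pairwise WMD calculations within the batch.
    vocab = sorted({w for doc in docs for w in doc if w in model})
    vectors = np.array([model[w] for w in vocab])
    return vocab, cdist(vectors, vectors)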
A possible way to start would be to create a variant of wmdistance() that explicitly works on one document versus a set-of-documents, and can thus combine the creation of histograms/distance-matrixes for many comparisons at once.
For the common case of not needing all WMD values, but just wanting the top-N nearest results, there's an optimization described in the original WMD paper where another faster calculation (called there 'RWMD') can be used to deduce when there's no chance a document could be in the top-N results, and thus skip the full WMD calculation entirely for those docs.
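For the one-query-versus-many-documents pattern, gensim's existing WmdSimilarity class can serve as a baseline (as far as I know it still computes the full WMD for every document rather than applying the RWMD pruning, and its exact arguments can differ between gensim versions):

from gensim.similarities import WmdSimilarity

tokenized_others = [s.lower().split() for s in other_sentences]  # token lists, as wmdistance expects
index = WmdSimilarity(tokenized_others, model, num_best=10)

# Returns the 10 nearest sentences as (position, similarity) pairs.
top_matches = index[reference_sentence.lower().split()]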
I am doing an exercise where I am determining the most likely corpus from a number of corpora when given a test sentence. I am trying to test an and-1 (laplace) smoothing model for this exercise. I generally think I have the algorithm down, but my results are very skewed. I am aware that and-1 is not optimal (to say the least), but I just want to be certain my results are from the and-1 methodology itself and not my attempt.
Now, the And-1/Laplace smoothing technique seeks to avoid 0 probabilities by, essentially, taking from the rich and giving to the poor. Therefore, a bigram that is found to have a zero probability becomes:
1/V, V=the number of types
This means that the probability of every other bigram becomes:
P(W[i] | W[i-1]) = (Count(W[i-1], W[i]) + 1) / (Count(W[i-1]) + V)
You would then take a test sentence, break it into bigrams, look up their probabilities (using the above for zero-count bigrams), and multiply them all together to get the final probability of the sentence occurring.
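In other words, conceptually I expect the scoring to look something like this minimal sketch of my understanding (the helper names here are just illustrative, not the functions I actually use below):

from collections import Counter

def train_counts(tokens):
    # Unigram and bigram counts for one corpus.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def addone_sentence_prob(sentence_tokens, unigrams, bigrams):
    V = len(unigrams)  # number of types
    p = 1.0
    for prev, cur in zip(sentence_tokens, sentence_tokens[1:]):
        # Add-one: (count + 1) / (context count + V); an unseen bigram just has count 0.
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
    return p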
I am implementing this in Python. My code looks like this, all function calls are verified to work:
# return is a counter of tuples containing ngrams: {('A','B'): C}
# this means ('A','B') means (B|A) in probabilistic terms
bigrams[0] = getBigrams(corpus[0])
...
bigrams[n] = getBigrams(corpus[n])

# return is a dictionary of the form P['A'] = C
unigrams[0] = getUnigrams(corpus[0])
...
unigrams[n] = getUnigrams(corpus[n])

# generate bigram probabilities, return is P[('A','B')] = p, add-one is done
prob[0] = getAddOneProb(unigrams[0], bigrams[0])
...
prob[n] = getAddOneProb(unigrams[n], bigrams[n])

for sentence in test:
    bi = getBigrams(sentence)
    uni = getUnigrams(sentence)
    P[0] = ... = P[n] = 1  # set to 1
    for b in bi:
        tup = b
        try:
            P[0] *= prob[0][tup]
        except KeyError:
            # unseen bigram: assign 1/V (V = number of types)
            P[0] = 1 / len(unigrams[0])
    # do this for all corpora
At the end I would compare all corpora, P[0] through P[n], and find the one with the highest probability.
My results aren't that great but I am trying to understand if this is a function of poor coding, incorrect implementation, or inherent and-1 problems.