I wanted to understand the way fastText vectors for sentences are created. According to this issue (#309), the vectors for sentences are obtained by averaging the vectors for words.
In order to confirm this, I wrote the following script:
import numpy as np
import fastText as ft
# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')
# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)
# Printing the result.
print(is_equal)
# Result: FALSE
But it seems that the obtained vectors are not equal.
Why aren't both values the same? Could it be related to the way I am averaging the vectors, or is there something I am missing?
First, you missed that get_sentence_vector is not just a simple average. Before fastText sums the word vectors, each vector is divided by its L2 norm, and the averaging then only involves vectors with a positive L2 norm.
Second, a sentence always ends with an EOS token, so if you calculate the average manually you need to include the EOS vector as well.
Try this (assuming the L2 norm of each word vector is positive):
def l2_norm(x):
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
eos = model.get_word_vector('\n')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
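To compare this manual average with get_sentence_vector, an exact equality check is too strict; a tolerance-based comparison is more appropriate (a minimal sketch reusing the variables above):

# True if the two vectors agree up to a small floating-point tolerance.
print(np.allclose(one_two, one_two_avg, atol=1e-5))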
You can see the source code here or you can see the discussion here.
You might be hitting an issue with floating-point math - e.g. if one addition was done on a CPU and one on a GPU, the results could differ.
The best way to check whether it's doing what you want is to make sure the vectors are almost exactly the same.
You might want to print out the two vectors and inspect them manually, or take the dot product of (one_two minus one_two_avg) with itself (i.e. the squared length of the difference between the two).
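For example, a quick way to quantify that difference (a small sketch, assuming one_two and one_two_avg are the NumPy arrays computed above):

import numpy as np

diff = one_two - one_two_avg
# Squared length of the difference vector; a value near zero means the two
# approaches produce effectively the same sentence vector.
print(np.dot(diff, diff))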
Related
I would like to take a word, e.g. "book", get its vector representation, call it v_1, and find all words whose vector representation is within a ball of radius r around v_1, i.e. ||v_1 - v_i|| <= r for some real number r.
I know gensim has a most_similar function, which lets you specify the number of top vectors to return, but that is not quite what I need. I could of course use a brute-force search and get the answer, but it would be too slow.
If you call most_similar() with topn=0, it will return the raw, unsorted cosine similarities to all other words known to the model. (These similarities will not be in tuples with the words, but simply in the same order as the words in the index2entity property.)
You could then filter those similarities for those higher than your preferred threshold, and return just those indexes/words, using a function like numpy's argwhere.
For example:
import numpy as np

target_word = 'apple'
threshold = 0.9
# Raw cosine similarities, in the same order as the words in wv.index2entity.
all_sims = wv.most_similar(target_word, topn=0)
# Indices of the words whose similarity exceeds the threshold.
satisfactory_indexes = np.argwhere(all_sims > threshold).flatten()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]
I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings.
To do this I use the following code:
import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector
Where nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided and (3) averages the word vectors within a sentence to provide a single output vector. That's fast and cool.
However, in this process, out-of-vocabulary (OOV) terms are mapped to n-dimensional 0 vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when performing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.
I have been able to do this with a post-processing step (see the code below), but this becomes too slow for my purposes, so I was wondering if there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that calling nlp(sentence).vector does not include OOV-term vectors when computing the mean:
import numpy as np
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)
Approaches tried
In both cases documents is a list with 200 string elements with ≈ 400 words each.
Without dealing with OOV terms:
import spacy
import time

nlp = spacy.load('myspacy.bioword2vec.model')
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in list(nlp.pipe(documents))]
    fin = time.time()
    times.append(fin - init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s
Ignoring OOV terms in the output vector. Note that in this case we need an extra 'if' statement for those cases in which all words are OOV (if this happens, the output vector is r_vec):
r_vec = np.random.rand(200)  # Random vector for empty text

# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector

times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin - init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s
In this example, the mean difference in time for vectorizing 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents made up entirely of OOV terms, which might slightly increase the computational time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.
I could always look for extra computational resources to run the second piece of code, but I was wondering if there is any way of applying nlp.pipe(documents) while ignoring the OOV terms in the output. Any suggestion will be very much welcome.
See this post by the author of spaCy, which says:
The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.
Try this, for example:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_md')

sentence = "I love Stack Overflow butitsalsodistractive"
print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
# Rebuild the text from only the tokens that have a vector (i.e. drop OOV tokens).
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])
np.array_equal(tokens.vector, tokensClean.vector)
# False
If you want to speed things up, disable the pipeline components in spaCy that you don't use (such as NER, the dependency parser, etc.).
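For example, a minimal sketch (the component names here assume a standard spaCy pipeline; adjust them to whatever your model actually contains):

import spacy

# Load the model without the components that aren't needed for vectors.
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser', 'ner'])
# Tokenization and vectors are enough for averaging word embeddings.
docs = list(nlp.pipe(documents))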
I have a document with many reviews. I am creating a bag-of-words BW using TfidfVectorizer. What I want to do is only use words in BW that also appear in another document D.
Document D is a document with positive words. I am using this positive-word list to improve my model; what I mean is that I only want to count the words that are positive.
Is there a way of doing this?
Thank you
I created a piece of code to do that job, as follows.
train_x is a pandas DataFrame with the reviews.
pos_words = []
neg_words = []
sentiment_words = []

pos_file = open("positive-words.txt")
neg_file = open("negative-words.txt")

# creating arrays based on the files
for ln in pos_file:
    pos_words.append(ln.strip())
for ln in neg_file:
    neg_words.append(ln.strip())

# adding all the positive and negative words together
sentiment_words.extend(pos_words)
sentiment_words.extend(neg_words)

pos_file.close()
neg_file.close()

# filtering out all the words that are not in the positive list
filtered_res = []
for r in train_x:
    keep = []
    parts = r.split()
    for p in parts:
        if p in pos_words:
            keep.append(p)
    # turning the review back into text again
    filtered_res.append(" ".join(keep))

train_x = filtered_res
Although I was able to accomplish what I needed, I know the code is not the best, and I was also trying to find a standard function in Python to do this.
PS: Python has so many features that I always wonder whether it can do this without the amount of code that I used.
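As a side note on the "standard function" point: scikit-learn's TfidfVectorizer has a vocabulary parameter that restricts the features to exactly the terms you pass in, which may cover this use case directly. A minimal sketch, assuming pos_words holds the positive words and train_x holds the raw review texts (one string per review):

from sklearn.feature_extraction.text import TfidfVectorizer

# Only the words in pos_words become features; every other word is ignored.
# (The vocabulary must not contain duplicates, hence the set.)
vectorizer = TfidfVectorizer(vocabulary=sorted(set(pos_words)))
X = vectorizer.fit_transform(train_x)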
Here is a slightly more optimized version of your original filtering loop, because:
- it does not do a linear search (p in pos_words) inside the loop,
- it vectorizes the loop (more Pythonic),
- and instead of keeping a list for each r, it uses a generator version.
import re

pos_words_set = set(pos_words)

def filter_review(r):
    # use [A-Za-z]+ instead if you want to avoid numbers
    return " ".join(p for p in re.findall(r"[A-Za-z0-9]+", r)
                    if p in pos_words_set)

# assuming train_x is a pandas Series of review strings
train_x = train_x.apply(filter_review)
I have some documents and I'd like to find the k documents most similar to a selected document. For the sake of a reproducible example, let's say k is 1 and my documents are these
documents = ['Two roads diverged in a yellow wood,',
             'And sorry I could not travel both',
             'And be one traveler, long I stood',
             'And looked down one as far as I could',
             'To where it bent in the undergrowth']
Then I think what I want to do is the below. (I'm using CountVectorizer for transparency and simplicity, even though maybe later I'd want to use Tf-Idf and a hashing vectorizer.)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc = documents[1]
one_doc_code = vectorizer.transform([one_doc])
doc_match = np.matrix(ft) * np.matrix(one_doc_code.transpose())
and now doc_match is a column vector with weights that indicate closeness of match (0 = bad match, 1 = perfect match). But in order to do the multiplication, I (in desperation, in the face of element-wise multiplication) converted to a numpy matrix, so now I have this CSR format matrix that doesn't have a todense() member (so I can't just look, not that that would scale beyond my tiny example).
What I think I want now (but haven't been able to figure out so far) is how to say "what are the indices of the top k elements of doc_match?" (even if k is not 1).
If all you want are the indices in doc_match that have the highest score, you can do:
sorted_indices = np.argsort(doc_match)
doc_match_vals_sorted = doc_match[sorted_indices]
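If you only need the top k rather than a full sort, a sketch along the same lines (here the scores are computed with the sparse product ft * one_doc_code.T directly, rather than going through np.matrix):

import numpy as np

k = 1
# Sparse (n_docs x 1) column of scores, flattened into a plain 1-D array.
scores = np.asarray((ft * one_doc_code.T).todense()).ravel()
top_k_indices = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
top_k_docs = [documents[i] for i in top_k_indices]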
I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...] etc. I have around 1M such companies, and I want to calculate the correlation between them to find their relevance, e.g. Microsoft.com, Microsoft, and Microsoft com share common words.
This is what I did, but it is very slow:
import hashlib
import pandas as pd
from pandas import DataFrame, Series

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()

df = DataFrame()
for company in unique_companies:
    # md5 needs bytes, so the company name is encoded first
    df[hashlib.md5(company[0].encode()).hexdigest()] = [{'name': company[0], 'code': [ord(c) for c in company[0]]}]

rows = df.unstack()
for company in rows:
    series1 = Series(company['code'])
    for word in rows:
        series2 = Series(word['code'])
        if series1.corr(series2) > 0.8:
            company['match'] = [word['name']]
Can anyone please guide me on how to find a correlation matrix for the words?
I don't think there's a corr function that will work for strings - only numerics.
If you can somehow compress your words into meaningful numeric values that preserve the "closeness" of one against another, you might then be able to "corr" them, but other options are available.
Hamming distance is one (basic) method, but slightly better is calculating the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
It's tricky, but one way of trying this would be to build a matrix of m x n cells, where m is the number of unique words in your first word list and n is the number of unique words in the second word list - then calculate the Hamming or Levenshtein distances between the row/column identifiers.
There are python modules that package up the distance-algorithms for you -
e.g. https://pypi.python.org/pypi/python-Levenshtein/
Or you could write your own; I think the packaged ones are likely to be faster, as they're implemented in C.
So, assuming the Levenshtein module (I don't know it, as I have not used it) provides a function, say getLev(word1, word2), that generates a numeric score, you should be able to feed in the contents of the two word lists from sources 1 and 2. If you make sure your inputs are already filtered for uniqueness, and maybe sorted alphabetically, that would help too.
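For instance, with the python-Levenshtein package, such a getLev could simply wrap its distance function (a sketch; getLev itself is just the placeholder name used above):

import Levenshtein

def getLev(word1, word2):
    # Plain edit distance: 0 means identical strings, larger means less similar.
    return Levenshtein.distance(word1, word2)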
Feed them into a matrix generation function.
Here, I've imported numpy as np and am using that module for speed
import numpy as np

def genLevenshteinMatrix(wordlist1, wordlist2):
    x = len(wordlist1)
    y = len(wordlist2)
    l_matrix = np.zeros((x, y))
    for i in range(0, x):
        x_word = wordlist1[i]
        for j in range(0, y):
            y_word = wordlist2[j]
            l_matrix[i][j] = getLev(x_word, y_word)
    return l_matrix
Something like that should allow you to generate a matrix that stores a measure of which words are most like which other words.
Once that's created, you can interrogate it using a function like this:
def interrogate_Levenshtein_matrix(ndarray_x, wordlist1, wordlist2, float_threshold):
    l = []
    x = len(ndarray_x)
    y = len(ndarray_x[0])
    for i in range(0, x):
        for j in range(0, y):
            # keep pairs whose distance is at or below the threshold, i.e. the closest words
            if ndarray_x[i][j] <= float_threshold:
                l.append([(wordlist1[i], wordlist2[j]), ndarray_x[i][j]])
    return l
And that will output a list of the word pairs that are "close" (i.e. have a lower distance) as measured by the Levenshtein function used earlier, each entry being a list containing the pair of similar words and their distance.
You might need to trim it down somehow, as I think you'll get all like-combinations twice, i.e. ['word','work'] as one return value and ['work','word'] as another.
As you develop the code, you could swap in different correlation functions and try different threshold values.
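Putting the pieces together, a hypothetical end-to-end usage might look like this (tiny word lists and an arbitrary threshold, just to show the flow, and assuming the getLev sketch above):

words1 = ['microsoft', 'apple', 'google']
words2 = ['microsoft.com', 'apples', 'amazon']
distance_matrix = genLevenshteinMatrix(words1, words2)
# Keep only the pairs whose edit distance is at or below 4.
close_pairs = interrogate_Levenshtein_matrix(distance_matrix, words1, words2, 4)
print(close_pairs)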