Gensim word2vec augment or merge pre-trained vectors - python

I am loading pre-trained vectors from a binary file generated from the word2vec C code with something like:
model_1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)
I am using those vectors to generate vector representations of sentences that contain words that may not have already existing vectors in vectors.bin. For example, if vectors.bin has no associated vector for the word "yogurt", and I try
yogurt_vector = model_1['yogurt']
I get KeyError: 'yogurt', which makes good sense. What I want is to be able to take the sentence words that do not have corresponding vectors and add representations for them to model_1. I am aware from this post that you cannot continue to train the C vectors. Is there then a way to train a new model, say model_2, for the words with no vectors and merge model_2 with model_1?
Alternatively, is there a way to test if the model contains a word before I actually try to retrieve it, so that I can at least avoid the KeyError?

Avoiding the key error is easy:
[x for x in 'this model hus everything'.split() if x in model_1.vocab]
The more difficult problem is merging a new word into an existing model. The problem is that word2vec learns from the likelihood of two words occurring next to each other; if the word 'yogurt' wasn't in the first corpus the model was trained on, it was never seen next to any of those words, so a second model trained separately would not be in the same vector space as the first and its vectors would not be comparable.
You can look at the internals of how a model is saved (it uses numpy.save), and I would be interested in working with you to come up with code that allows adding vocabulary.
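If all you need is to make lookups stop raising KeyError, here is a minimal sketch. It assumes a recent gensim (4.x), where the C-format file loads into a KeyedVectors object and add_vectors lets you inject externally obtained vectors; keep in mind that vectors copied from a separately trained model_2 (or random vectors, as below) live in a different coordinate space, so similarities against the original vectors will not be meaningful.
import numpy as np
from gensim.models import KeyedVectors

# load the pre-trained C-format vectors (gensim 4.x API)
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# inject a vector for 'yogurt'; here a random one, but you could copy
# model_2.wv['yogurt'] instead (it still won't be aligned with the original space)
new_words = ['yogurt']
new_vectors = np.random.uniform(-0.25, 0.25, (len(new_words), kv.vector_size))
kv.add_vectors(new_words, new_vectors)   # gensim 4.x; older releases used kv.add(...)

print(kv['yogurt'][:5])                  # no KeyError any more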

This is a great question, and unfortunately there is no way to add to the vocabulary without changing the internals of the code. Check out this discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ
My advice is to ignore words that are not in the vocabulary, and only use the ones that are in the vocabulary. If you are using python, you can do this by:
present, absent = [], []
for word in wordlist:
    if word in model.vocab:
        present.append(word)
    else:
        # these are the words absent from your model;
        # might be useful for debugging, ignore if you don't need this info
        absent.append(word)
# <Do whatever you want with the words in the list 'present'>
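For example, a common thing to do with the 'present' list is to average the available word vectors into a crude sentence vector. A minimal sketch, using the same older gensim API as the snippet above (model.vocab and model[word]); sentence_vector is just a hypothetical helper name:
import numpy as np

def sentence_vector(model, words):
    # average the vectors of the in-vocabulary words; None if nothing matched
    present = [w for w in words if w in model.vocab]
    if not present:
        return None
    return np.mean([model[w] for w in present], axis=0)

sent_vec = sentence_vector(model_1, 'I like yogurt'.split())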

A possible alternative for handling absent/missing words is suggested by Yoon Kim in "Convolutional Neural Networks for Sentence Classification".
The code: https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L88
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance as the pre-trained ones.
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
But this only works because you are using the model as a plain lookup table of word vectors; functionality such as similarity queries is lost for these randomly initialized words.
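A toy usage of that snippet, assuming word_vecs is a plain dict mapping word to numpy vector (e.g. copied out of model_1) and vocab maps each word to its document frequency in your own corpus; the words and counts below are made up for illustration:
import numpy as np

# hypothetical document-frequency counts from your own corpus
vocab = {'milk': 3, 'cheese': 2, 'yogurt': 5}

# start from the vectors you already have (whatever model_1 knows)
word_vecs = {w: model_1[w] for w in vocab if w in model_1.vocab}

# give every remaining word a random vector of the same dimensionality
add_unknown_words(word_vecs, vocab, min_df=1, k=model_1.vector_size)
print(word_vecs['yogurt'][:5])   # usable as a lookup, but not for similarity queries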

You can continue adding new words/sentences to a model's vocabulary and train the augmented model with gensim's online training (https://rutumulkar.com/blog/2015/word2vec/); see also https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html:
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
['Advanced', 'users', 'can', 'load', 'a', 'model',
'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)
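After the update, the newly seen words can be looked up like any other word (a quick check, assuming gensim 4.x naming; older versions expose model.wv.vocab instead of key_to_index):
print('Advanced' in model.wv.key_to_index)
print(model.wv['Advanced'][:5])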
related:
Update gensim word2vec model
Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

Related

Text similarity using Word2Vec

I would like to use Word2Vec to check similarity of texts.
I am currently using another logic:
from fuzzywuzzy import fuzz
def sim(name, dataset):
    matches = dataset.apply(lambda row: (fuzz.ratio(row['Text'], name) >= 0.5), axis=1)
    return matches
(name is my column).
For applying this function I do the following:
df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)
Could you please tell me how to replace fuzzy.ratio with Word2Vec in order to compare texts in a dataset?
Example of dataset:
Text
Hello, this is Peter, what would you need me to help you with today?
I need you
Good Morning, John here, are you calling regarding your cell phone bill?
Hi, this this is John. What can I do for you?
...
The first text and the last one are quite similar, although they use different words to express a similar concept.
I would like to create a new column where, for each row, I put the texts that are similar to it.
I hope you can help me.
TLDR; skip to the last section (part 4.) for code implementation
1. Fuzzy vs Word embeddings
Unlike a fuzzy match, which is basically edit distance or Levenshtein distance matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed up to create a representation of the sentence's embedding. Sentences with words with similar semantics will have similar vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.
Let's say I have a sentence with 2 words. Word2Vec will represent each word here as a vector in some Euclidean space. Summing them up, just like standard vector addition, results in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.
NOTE: There are other methods of combining word embeddings such as a weighted sum with tf-idf weights OR just directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
2. Similarity between word vectors / sentence vectors
“You shall know a word by the company it keeps”
Words that occur in similar contexts are usually similar in semantics/meaning. The great thing about word2vec is that word vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do stuff like clustering or just simple distance calculations.
A good way to measure how similar two word vectors are is cosine similarity. Read more here.
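As a quick illustration (not part of the original answer), cosine similarity is just the dot product of the two vectors divided by the product of their norms; a minimal sketch with plain numpy:
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 1.0 means same direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g. with the glove model loaded in part 4 below:
# cosine_similarity(model['king'], model['queen'])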
3. Pre-trained word2vec models (and others)
The awesome thing about word2vec and similar models is that you don't need to train them on your data in most cases. You can use pre-trained word embeddings that have been trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.
You can check similarity between these sentence embeddings using cosine_similarity
4. Sample code implementation
I use a GloVe model (similar to word2vec) which is already trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used from here - https://github.com/RaRe-Technologies/gensim-data
import numpy as np
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # choose from multiple models: https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sum the word vectors of the sentence into a single sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)
print('s0 vs s1 ->',1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
If you want to compare sentences you should not use Word2Vec or GloVe embeddings. They translate every word in a sentence into a vector, and it is quite cumbersome to work out how similar two sentences are from two sets of such vectors. You should use something that is tailored to convert a whole sentence into a single vector; then you just need to compare how similar two vectors are. The Universal Sentence Encoder is one of the best encoders considering the computational-cost/accuracy trade-off (the DAN variant). See the example of usage in this post; I believe it describes a use case quite close to yours.
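For reference, a minimal sketch of what using the Universal Sentence Encoder could look like, assuming tensorflow and tensorflow_hub are installed; the module URL below is the publicly documented DAN variant, but check tfhub.dev for the current version:
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "Hello, this is Peter, what would you need me to help you with today?",
    "Hi, this this is John. What can I do for you?",
]
vectors = embed(sentences).numpy()   # one 512-dimensional vector per sentence

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(vectors[0], vectors[1]))   # higher means more semantically similar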

I want to get a list of semantically similar words from the two embedded documents in Python

I am working on text embedding in Python, where I find the similarity between two documents with the Doc2Vec model. The code is as follows:
for doc_id in range(len(train_corpus)):
    # infer a vector for each document from its words
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # compare it against all trained document vectors, most similar first
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Now, from these two embedded documents, how can I extract a set of semantically similar words for those particular documents?
Please help me out.
Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors with added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.
A call to the model.wv.similarity(word1, word2) method will give you the pairwise similarity for any two words.
So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
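A minimal sketch of that procedure, assuming the model was trained in one of the word-vector-producing modes above and that train_corpus[0] and train_corpus[1] are the two documents of interest:
# collect tokens that actually have word-vectors in the model
doc1_words = [w for w in train_corpus[0].words if w in model.wv]
doc2_words = [w for w in train_corpus[1].words if w in model.wv]

# for each word in doc1, report its single most similar word in doc2
for w1 in doc1_words:
    best = max(doc2_words, key=lambda w2: model.wv.similarity(w1, w2))
    print(w1, '->', best, model.wv.similarity(w1, best))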

embedding word positions in keras

I am trying to build a relation extraction system for drug-drug interactions using a CNN and need to make embeddings for the words in my sentences. The plan is to represent each word in the sentences as a combination of 3 embeddings: (w2v,dist1,dist2) where w2v is a pretrained word2vec embedding and dist1 and dist2 are the relative distances between each word in the sentence and the two drugs that are possibly related.
I am confused about how I should approach the issue of padding so that every sentence is of equal length. Should I pad the tokenised sentences with some series of strings (what string?) to equalise their lengths before any embedding?
You can compute the maximal separation between entity mentions linked by a relation and choose an input width greater than this distance. This will ensure that every input (relation mention) has the same length, by trimming longer sentences and padding shorter sentences with a special token.
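In Keras the trimming/padding itself is usually done with pad_sequences; a minimal sketch, assuming the sentences have already been mapped to integer ids with 0 reserved as the padding token (the embedding layer can then ignore it via mask_zero=True):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# toy integer-encoded sentences of different lengths
encoded = [[4, 12, 7], [9, 3, 15, 2, 8], [6]]
MAX_LEN = 4   # choose this larger than the longest relation mention you expect

padded = pad_sequences(encoded, maxlen=MAX_LEN, padding='post', truncating='post', value=0)
# -> shape (3, 4): shorter sentences padded with 0, longer ones trimmed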

Remove single occurrences of words in vocabulary TF-IDF

I am attempting to remove words that occur once in my vocabulary to reduce my vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform function on my data frame.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is to use the preprocessor field of the TfidfVectorizer, or the preprocessing package before machine learning.
Any tips or links to further implementation?
You are looking for the min_df parameter (minimum document frequency); from the documentation of scikit-learn's TfidfVectorizer:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
# remove words occurring in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
You can also remove common words:
# remove words occurring in more than half of the documents
tfidf = TfidfVectorizer(max_df=0.5)
You can also remove stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
ShmulikA's answer will most likely work well but will remove words based on document frequency. Thus, if the specific word occurs 200 times in only 1 document, it will be removed. TF-IDF vectorizer does not provide exactly what you want. You would have to:
Fit the vectorizer to your corpus and extract the complete vocabulary from it.
Take the words as keys in a new dictionary.
Count every word occurrence:
for every document in corpus: for word in document: vocabulary[word] += 1
Now find the entries whose value equals 1 and drop them from the dictionary. Put the remaining keys into a list and pass that list as the vocabulary parameter to the TF-IDF vectorizer.
This needs a lot of looping, so maybe just use min_df, which works well in practice; a rough sketch of the manual approach follows anyway.
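A rough sketch of the manual approach, assuming the column and dataframe from the question; note that splitting on whitespace is a simplification of the vectorizer's own tokenization (you could use tfidf.build_analyzer() to match it exactly):
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['original_post'].values.astype('U')

# count raw occurrences of every word across the whole corpus
counts = Counter(word for doc in docs for word in doc.split())

# keep only the words seen more than once, and fix the vocabulary explicitly
keep = [word for word, c in counts.items() if c > 1]
tfidf = TfidfVectorizer(vocabulary=keep)
tfs = tfidf.fit_transform(docs)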

Term weighting for original LDA in gensim

I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf...
My question is, what is the term weighting that should be used for the original LDA? If I have understood correctly the weights should be term frequencies, but I am not sure.
It should be a corpus represented as a "bag of words". Or, yes, lists of term counts.
The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (these are really useful).
Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]
then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey refers to the dictionary key of a term and count is the number of times it occurs in the document. This is done for you with
corpus = [dictionary.doc2bow(doc) for doc in docs]
That doc2bow function means "document to bag of words".
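Putting it together, a minimal end-to-end sketch with the toy documents above (num_topics and passes are arbitrary illustrative values):
from gensim import corpora, models

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

dictionary = corpora.Dictionary(docs)                 # term -> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]    # plain term counts ("bag of words")

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())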
