How to create word embeddings using Word2Vec in Python?

I have seen many tutorials online on how to use Word2Vec (gensim).
Most of them show how to find the .most_similar word or the similarity between two words.
But what if I have text data X and want to produce a word-embedding vector X_vector,
so that X_vector can be used with classification algorithms?

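If X is a word (string token), you can look up its vector with word_model[X] (or word_model.wv[X] when word_model is a full Word2Vec model rather than its KeyedVectors, since a full model is not subscriptable in Gensim 4 and later).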
If X is a text - say, a list-of-words – well, a Word2Vec model only has vectors for words, not texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy popular baseline (a fair starting point especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming numpy is imported as np):
np.mean([word_model[word] for word in word_list], axis=0)
But, recent versions of Gensim also have a convenience .get_mean_vector() method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_mean_vector
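As a dependency-free sketch of the averaging baseline, the snippet below uses a hypothetical toy_vectors dictionary in place of a trained model and a hypothetical helper named mean_text_vector. It also skips words that have no vector, which the one-liner above does not handle; whether skipping is the right policy for your data is an assumption.

```python
from statistics import fmean

# Hypothetical toy lookup standing in for a trained model's word vectors.
toy_vectors = {
    "my":    [0.1, 0.3, 0.0],
    "phone": [0.9, 0.1, 0.4],
    "table": [0.2, 0.8, 0.5],
}

def mean_text_vector(word_list, vectors):
    """Average the vectors of the known words; return None if none are known."""
    known = [vectors[w] for w in word_list if w in vectors]
    if not known:
        return None
    # zip(*known) groups the values of each dimension together.
    return [fmean(dim) for dim in zip(*known)]

# "is", "on" and "the" have no vector here and are simply ignored.
x_vector = mean_text_vector(["my", "phone", "is", "on", "the", "table"], toy_vectors)
print(x_vector)
```

The resulting fixed-length list can be used as one row of the feature matrix passed to a classifier.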

Related

How to determine if two sentences talk about similar topics?

I would like to ask you a question. Is there any algorithm/tool which can allow me to do some association between words?
For example: I have the following group of sentences:
(1)
"My phone is on the table"
"I cannot find the charger". # no reference on phone
(2)
"My phone is on the table"
"I cannot find the phone's charger".
What I would like to do is to find a connection, probably a semantic one, which would allow me to say that the first two sentences are talking about one topic (phone), since the two terms (phone and charger) commonly occur together. The same goes for the second group.
In the first group, I need something that can connect phone to charger.
I was thinking of using Word2vec, but I am not sure if this is something that I can do with it.
Do you have any suggestions about algorithms that I can use to determine similarity of topics (i.e. sentences that are formulated in a different way but have the same topic)?
In Python I'm pretty sure you have a SequenceMatcher that you can use:
from difflib import SequenceMatcher
def similar(a, b):
    # Similarity ratio in [0, 1] based on matching character blocks
    return SequenceMatcher(None, a, b).ratio()
If you want your own algorithm, I would suggest the Levenshtein distance (it counts how many single-character edits you need to turn one string or sentence into another, which might be useful). I coded it myself like this for two strings:
def levenshtein(str1, str2):
    # edits[i][j] = edit distance between str2[:i] and str1[:j]
    edits = [[x for x in range(len(str1) + 1)] for y in range(len(str2) + 1)]
    for i in range(len(str2) + 1):
        edits[i][0] = i
    for i in range(1, len(str2) + 1):
        for j in range(1, len(str1) + 1):
            if str2[i-1] == str1[j-1]:
                edits[i][j] = edits[i-1][j-1]
            else:
                edits[i][j] = 1 + min(edits[i-1][j-1], edits[i-1][j],
                                      edits[i][j-1])
    return edits[-1][-1]
[EDIT] In your case, you want to check whether the sentences are about a similar topic. I would suggest any of the following algorithms (all are pretty easy):
Jaccard Similarity
K-means and Hierarchical Clustering Dendrogram
Cosine Similarity
This type of task is called sentence similarity, or more generally semantic textual similarity. You might use a few different approaches for it. On paperswithcode you can find benchmarks and the current state of the art.
First, you can look at the ratio of shared words. The Jaccard index is probably the simplest metric you can use for this. If you model both sentences as sets of words, the Jaccard index is the size of the intersection divided by the size of the union of these two sets.
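A minimal sketch of the Jaccard index on the question's sentences, assuming naive lowercasing and whitespace tokenization (the function name jaccard is just illustrative):

```python
def jaccard(sentence_a, sentence_b):
    """Jaccard index of the two sentences' word sets (lowercased, whitespace-split)."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a and not b:
        return 0.0
    # Size of the intersection divided by the size of the union.
    return len(a & b) / len(a | b)

s1 = "My phone is on the table"
s2 = "I cannot find the phone's charger"
score = jaccard(s1, s2)
print(score)
```

With this tokenization only "the" is shared, because "phone" and "phone's" count as different words. Pure word overlap is therefore sensitive to preprocessing and cannot connect phone to charger on its own.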
Another way is to turn these sentences into vectors by counting the words and using cosine similarity to measure how closely they are related.
But not every word is equally important. To use this in your computations you can use a weighting scheme such as term frequency - inverse document frequency (TF-IDF) or BM25, which in their essence assign a greater weight to more important words. They measure the importance of the words by looking at how frequently they appear in all documents in your corpus.
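The count-vector and weighting ideas can be combined in a few lines. This is only a sketch: the smoothed IDF formula is one common variant chosen here as an assumption, the helper names are made up, and the three toy documents stand in for a real corpus.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one sparse {word: tf * idf} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; other variants exist).
        vectors.append(
            {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "my phone is on the table".split(),
    "i cannot find the phone charger".split(),
    "the weather is nice today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

The two phone sentences score higher than the charger and weather pair, because the shared word "phone" is rarer across the corpus than "the" and so carries more weight.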
You can improve these methods by only using entities mentioned in the text. In your example that would be I, phone, table and the charger. You can use spaCy or stanza for finding the entities if you are using Python.
If you are using word embeddings such as word2vec, GloVe, or fastText, you can take the average of the word vectors and use it as a vector for the whole sentence. Then you can again use cosine similarity.
Or on the more complicated side using word embeddings, you can use word mover's distance to measure the distance between two collections of word vectors.
There are also neural models for sentence similarity. Transformer models are currently the state of the art for this kind of problem: on STSBenchmark, a BERT-based transformer model is currently in first place. Models of this type usually need a lot of computing power, but you don't have to train each one from scratch; you can download a pretrained model and use it right away.
There are probably many more methods. Here is a recent survey on semantic similarity methods.

Creating vector space

I've got a question:
I have a lot of documents, and each line is built from some pattern.
Of course, I have the array of these patterns.
I want to create a vector space and then vectorize these patterns by some rule (I have no idea yet what this rule should be), i.e. make the patterns the "centroids" of my vector space.
Then I want to vectorize each line of the current document (again by this rule) and find the closest centroid to the line (i.e. the minimum distance between two vectors).
I don't know how to do this.
I know about the sklearn library and CountVectorizer/TfidfVectorizer/HashingVectorizer, but these depend on the vocabulary size. Since I have a lot of documents, there would be too many words in the vocabulary, and a new document may contain a word the vocabulary doesn't have. That's why this is the wrong way to solve my problem.
Also, the Keras library with its Text Preprocessing won't solve my problem either. E.g. "one hot" encodes a text into a list of word indexes of size . But each document may have a different size and, of course, a different word order. That's why comparing two vectors may give a big distance even though the words encoded by these vectors are in fact very similar.

Need of context while using Word2Vec

I have a large number of strings in a list:
A small example of the list contents is :
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purpose.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
I have done this code so far..
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?
To find word vectors using word2vec you need a list of sentences (each one a list of words), not a flat list of strings.
What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it within a specified window (typically around 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
I'm not sure, but I think you will find LDA better suited in this case, because your list of strings doesn't have an inherent order.
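To make the role of the window concrete, here is a small sketch that lists the (center word, context word) pairs a window-based model would look at. It illustrates the idea only; it is not gensim's implementation and leaves out the training itself, and the function name context_pairs and the example sentence are made up.

```python
def context_pairs(sentence, window=2):
    """Yield (center, context) pairs for each word and its neighbours within `window`."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

sentence = ["funding", "for", "machine", "learning", "startups"]
pairs = list(context_pairs(sentence, window=1))
print(pairs)

# A one-word "sentence" has no neighbours, so it produces no pairs.
isolated = list(context_pairs(["Finance"]))
print(isolated)
```

An isolated entry such as "Finance" gives the model no context to learn from, which is why a bare list of terms is not enough training data.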
Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression such as New York as a single word); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (in a window, under the continuous bag-of-words or skip-gram assumption).
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
You need more words. The quality of Word2Vec's output depends on the size of the training set. Training Word2Vec on your data makes no sense.
The links provided by @Beta are a good introduction/explanation.
Word2Vec is an artificial neural network method. It creates embeddings that reflect the relationships among the words. The links below will help you get the complete code to implement Word2Vec.
Some good links are this and this. For the 2nd link, try his GitHub repo for the detailed code; the blog explains only the major parts. The main article is this.
You can use the following code to convert words to their corresponding numerical values.
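from collections import Counter
# `words` is assumed to be a flat list of word tokens from your corpus
word_counts = Counter(words)
# Most frequent word first, so it gets the smallest index
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}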
word2vec + context = doc2vec
Build sentences from the text you have and tag them with labels.
Train doc2vec on the tagged sentences to get a vector for each label, embedded in the same space as the words.
Then you can do vector inference and get labels for an arbitrary piece of text.

How to find most similar terms/words of a document in doc2vec? [duplicate]

This question already has answers here:
How to intrepret Clusters results after using Doc2vec?
(3 answers)
Closed 5 years ago.
I have applied Doc2vec to convert documents into vectors. After that, I used the vectors for clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster.
My question is: is there any way to find the most dominant or similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.
@TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.
Some gensim Doc2Vec training modes – either the default PV-DM (dm=1) or PV-DBOW with added word-training (dm=0, dbow_words=1) – train both doc-vectors and word-vectors into the same coordinate space, and to some extent that means doc-vectors are near related word-vectors, and vice-versa.
So you could take an individual document's vector, or the average/centroid vectors you've synthesized, and feed it to the model to find most_similar() words. (To be clear that this is a raw vector, rather than a list of vector-keys, you should use the form of most_similar() that specifies an explicit list of positive examples.)
For example:
docvec = d2v_model.docvecs['doc77145']  # assuming such a doc-tag exists; Gensim 4+ names this d2v_model.dv
similar_words = d2v_model.wv.most_similar(positive=[docvec])  # word-vectors nearest to the doc-vector
print(similar_words)
To find out the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.
Latent Dirichlet Allocation (LDA): A topic modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in a cluster as one document and apply LDA to generate the topics and see topic distributions across documents.
TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:
calculate IDF for every single word that appears in the documents, based on the number of documents that contain that word
concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster.
Take a look at Section 5.1 here for more details on the use of TF-IDF.
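The three TF-IDF steps above can be sketched as follows. The unsmoothed log(N / df) form of IDF, the function name cluster_keywords, and the toy documents are all assumptions made for illustration; documents are expected to be pre-tokenized lists of words.

```python
import math
from collections import Counter

def cluster_keywords(all_docs, cluster_docs, top_n=3):
    """Rank the words of a cluster's concatenated 'super-document' by TF * IDF."""
    n_docs = len(all_docs)
    # Step 1: IDF from the number of documents that contain each word.
    df = Counter(word for doc in all_docs for word in set(doc))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    # Step 2: TF over the concatenated documents of the cluster.
    tf = Counter(word for doc in cluster_docs for word in doc)
    # Step 3: TF * IDF, highest scores first (words unseen in all_docs score 0).
    scores = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

all_docs = [
    "the phone is on the table".split(),
    "the phone charger is lost".split(),
    "the cat is on the mat".split(),
    "the dog chased the cat".split(),
]
phone_cluster = all_docs[:2]
keywords = cluster_keywords(all_docs, phone_cluster, top_n=4)
print(keywords)
```

Words that occur in every document, such as "the", get an IDF of zero and drop to the bottom of the ranking, while words concentrated in the cluster rise to the top.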

Method to do Feature Agglomeration/summation?

I.e., combining the least frequent or least informative bigram frequency counts together.
E.g., if I have frequency counts of letter pairs for a sequence, what's a good way to merge similar features together? (For example: "KR" and "RK" into a single feature and so on, or combining all the pairs with a count of 0 together.)
I know scikit-learn has something called "Ward's agglomerative clustering", but that seems aimed at visual data/pixels, and I'm interested in text data (protein sequences and bioinformatics). I'd rather avoid clustering if there's a more direct method for concatenating the features together. (I lack background and haven't done clustering before, and analysis of the features is important to us.)
Thanks!
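One direct, non-clustering reading of the "KR"/"RK" example is to make each pair's key order-insensitive and to pool rare pairs into a single bucket. The sketch below assumes that interpretation; the function name merged_pair_counts, the min_count threshold, the "rare" bucket name, and the toy sequence are all hypothetical.

```python
from collections import Counter

def merged_pair_counts(sequence, min_count=1):
    """Count adjacent letter pairs, treating 'KR' and 'RK' as the same feature."""
    counts = Counter()
    for a, b in zip(sequence, sequence[1:]):
        # Sorting the two letters gives both orders the same key.
        counts["".join(sorted(a + b))] += 1
    # Pool every pair seen fewer than `min_count` times into one "rare" feature.
    merged = Counter()
    for pair, n in counts.items():
        merged[pair if n >= min_count else "rare"] += n
    return merged

features = merged_pair_counts("KRKRAKL", min_count=2)
print(features)
```

Because each merged feature is just a sum of named original features, it remains easy to trace back during analysis.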
