There are some existing questions about this which I've studied, but I'm still not sure about a couple of things. I want to split a page stream (concatenated PDFs) into individual documents, so the trick is to find where one document ends and the next one starts. A PDF can have 1000 pages and consist of 20 documents, each with a different length.
One feature I want to introduce is 'page similarity', where each page (p) gets a similarity score against the page before it (p-1).
So studying this problem leads me to a lot of examples using LDA and LSI models, but is this the way to go?
I have made a corpus with all the tokens, bigrams and trigrams from all 1000 pages. What is the best way to compare two pages with each other? I have looked at this example where an LSI model is used to compare a query with a whole corpus, but I can't figure out how to compare it with just one previous page/document.
Any other ideas will be greatly appreciated!
from gensim import corpora, models, similarities

texts = data_lemmatized  # --> all tokenized + filtered + bigrams + trigrams using gensim
dictionary = corpora.Dictionary(data_lemmatized)
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

vec_bow = dictionary.doc2bow(data_lemmatized[1])  # --> this is page 2, which I want to compare with data_lemmatized[0]
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]  # --> this performs a similarity query against the whole corpus, but I want only 1 page
There are many different possible text-similarity calculations; which one is the "way to go" will depend on your data, project goals, & resources. Both LDA & LSI are reasonable things to try.
The Gensim models work on whatever 'documents' you give them - so if you've preprocessed your 1000-page PDF into 20 of your true documents, or 1000 separate pages, before training a topic model, those are also the units-of-text it will analyze.
Are you trying to compare the pages to infer document-boundaries? (Are you sure there's no other better hint of boundaries in the PDF? Are you sure all the desired document boundaries align with page-boundaries?)
Doing page-level comparisons might work for that, or might not; it'd depend on how vividly the word/phrase usage changes from the end of one document to the start of another.
You can generally think of many of the Gensim models as providing a summary vector for a text. The queries against all documents are useful for listing ranked matches, but if you want to do a simple pairwise calculation, you'd get the vectors for each text from the model individually, then use a direct calculation on the vectors (such as cosine-similarity). For example:
import gensim

page1_bow = dictionary.doc2bow(data_lemmatized[0])
page1_lsi = lsi[page1_bow]
page2_bow = dictionary.doc2bow(data_lemmatized[1])
page2_lsi = lsi[page2_bow]
cossim = gensim.matutils.cossim(page1_lsi, page2_lsi)
(The gensim.matutils.cossim() function is a convenience helper for calculating cosine-similarity between the sparse arrays returned by many of Gensim's bag-of-words-fed models, including LSI. With other, dense vectors, you might use the raw calculation for cosine-similarity, the cosine-distance function offered by scipy.spatial.distance.cosine, or other methods.)
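For the dense-vector case, a minimal sketch might look like this (the two vectors here are made-up placeholders, not outputs of the code above):

import numpy as np
from scipy.spatial.distance import cosine

vec_a = np.array([0.1, 0.4, 0.2])
vec_b = np.array([0.3, 0.1, 0.5])
cos_sim = 1.0 - cosine(vec_a, vec_b)  # scipy gives cosine *distance*, so subtract from 1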
Related
I have seen many tutorials online on how to use Word2Vec (gensim).
Most tutorials show how to find the .most_similar word or the similarity between two words.
But what if I have text data X and I want to produce the word embedding vector X_vector?
So that, this X_vector can be used for classification algorithms?
If X is a word (string token), you can look up its vector with word_model[X].
If X is a text - say, a list-of-words – well, a Word2Vec model only has vectors for words, not texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy popular baseline (a fair starting point especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming numpy is imported as np):
np.mean([word_model[word] for word in word_list], axis=0)
But, recent versions of Gensim also have a convenience .get_mean_vector() method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_mean_vector
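For instance, a minimal sketch of that method (assuming word_model is a KeyedVectors instance in Gensim 4.2+, or the .wv attribute of a full Word2Vec model, and word_list is a list of string tokens):

text_vector = word_model.get_mean_vector(word_list)  # averages the vectors of the in-vocabulary words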
I would like to ask you a question. Is there any algorithm/tool which can allow me to do some association between words?
For example: I have the following group of sentences:
(1)
"My phone is on the table"
"I cannot find the charger". # no reference on phone
(2)
"My phone is on the table"
"I cannot find the phone's charger".
What I would like to do is to find a connection, probably a semantic connection, which can allow me to say that the first two sentences are talking about a topic (phone) as two terms (phone and charger) are common within it (in general). Same for the second sentence.
I should have something that can connect phone to charger, in the first sentence.
I was thinking of using Word2vec, but I am not sure if this is something that I can do with it.
Do you have any suggestions about algorithms that I can use to determine similarity of topics (i.e. sentence which are formulated in a different way, but having same topic)?
In Python I'm pretty sure you have a SequenceMatcher that you can use:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
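For example, on two of the sentences from the question (just a usage sketch; the exact ratio depends on the strings):

print(similar("My phone is on the table", "I cannot find the phone's charger"))  # a ratio between 0 and 1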
If you want your own algorithm I would suggest Levenshtein distance (it calculates how many operations you need to turn one string (sentence) into another; might be useful). I coded it myself like this for two strings:
def levenshtein(str1, str2):
    # edits[i][j] holds the edit distance between str2[:i] and str1[:j]
    edits = [[x for x in range(len(str1) + 1)] for y in range(len(str2) + 1)]
    for i in range(len(str2) + 1):
        edits[i][0] = i
    for i in range(1, len(str2) + 1):
        for j in range(1, len(str1) + 1):
            if str2[i-1] == str1[j-1]:
                edits[i][j] = edits[i-1][j-1]
            else:
                edits[i][j] = 1 + min(edits[i-1][j-1], edits[i-1][j],
                                      edits[i][j-1])
    return edits[-1][-1]
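As a quick sanity check of the function above:

print(levenshtein("kitten", "sitting"))  # 3 operations (2 substitutions + 1 insertion)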
[EDIT] In your case, you want to compare whether the sentences are about a similar topic. I would suggest any of the following algorithms (all are pretty easy):
Jaccard Similarity
K-means and Hierarchical Clustering Dendrogram
Cosine Similarity
This type of task is called sentence similarity, or more generally semantic textual similarity. You might use a few different approaches for this type of task. On paperswithcode you can find benchmarks and the current state of the art.
First you can look at the ratio of shared words. The Jaccard index is probably the simplest metric you can use for this. If you model both sentences as sets of words, the Jaccard index is the size of the intersection divided by the size of the union of these two sets.
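A minimal sketch of that calculation on two of the example sentences (simple whitespace tokenization is an assumption; real code would use a proper tokenizer):

def jaccard(sentence_a, sentence_b):
    set_a = set(sentence_a.lower().split())
    set_b = set(sentence_b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

jaccard("My phone is on the table", "I cannot find the phone's charger")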
Another way is to turn these sentences into vectors by counting the words and using cosine similarity to measure how closely they are related.
But not every word is equally important. To use this in your computations you can use a weighting scheme such as term frequency-inverse document frequency (TF-IDF) or BM25, which in essence assign a greater weight to more important words. They measure the importance of the words by looking at how frequently they appear in all documents in your corpus.
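For example, a rough sketch of TF-IDF vectors plus cosine similarity using scikit-learn (just one possible implementation; any TF-IDF library works the same way, and in practice you would fit the vectorizer on your whole corpus rather than two sentences):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["My phone is on the table", "I cannot find the phone's charger"]
tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf[0], tfidf[1])[0][0]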
You can improve these methods by only using entities mentioned in the text. In your example that would be I, phone, table and the charger. You can use spaCy or stanza for finding the entities if you are using python.
If you are using word embeddings such as word2vec, glove, or fasttext, you can take the average of the word vectors and use it as a vector for the whole sentence. Then you can again use cosine similarity.
Or on the more complicated side using word embeddings, you can use word mover's distance to measure the distance between two collections of word vectors.
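A sketch of a WMD call with Gensim (assuming kv is a KeyedVectors object with pretrained embeddings loaded; wmdistance also needs its optional POT/pyemd dependency installed):

distance = kv.wmdistance(
    "my phone is on the table".split(),
    "i cannot find the charger".split(),
)  # lower distance means more similar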
There are also neural models for sentence similarity. Using transformer models is currently the state of the art for this kind of problem; as we can see on STSBenchmark, a BERT-based transformer model is currently in first place. These models usually need a lot of computation power, but you don't have to train each one from scratch; you can just download a pretrained model and use it right away.
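A minimal sketch with the sentence-transformers package (the model name is just one commonly used pretrained choice):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["My phone is on the table",
                           "I cannot find the phone's charger"])
score = util.cos_sim(embeddings[0], embeddings[1])  # cosine similarity of the two sentence vectors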
There are many more methods for this probably. Here is a recent survey on semantic similarity methods.
I have a large number of strings in a list:
A small example of the list contents is :
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purpose.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about getting the vectors of these strings if I have just this list containing the strings?
I have done this code so far:
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?
To find word vectors using word2vec you need a list of sentences, not a list of strings.
What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it in a specified window (usually around 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
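For illustration, a minimal sketch of the kind of input Word2Vec expects, tokenized sentences rather than bare strings (parameter names follow Gensim 4; older versions use size instead of vector_size, and the sentences here are made up):

from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "drives", "funding", "in", "finance"],
    ["apple", "invests", "in", "ai", "research"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["finance"]  # per-word vector, learned from surrounding context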
I'm not sure, but I think you will find LDA better suited in this case, because your list of strings doesn't have an inherent order.
Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or string, since it can consider a multiword expression as a unique word, e.g. New York); this vector describes the word by its context. It assumes that similar words will appear in similar contexts. The context is composed of the surrounding words (in a window, with the bag-of-words or skip-gram assumption).
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
You need more words. Word2Vec's output quality depends on the size of the training set; training Word2Vec on your data alone makes no sense.
The links provided by #Beta are a good introduction/explanation.
Word2Vec is an artificial neural network method. Word2Vec actually creates embeddings, which reflect the relationships among the words. The links below will help you get the complete code to implement Word2Vec.
Some good links are this and this. For the 2nd link, try his GitHub repo for the detailed code; he explains only the major parts in the blog. The main article is this.
You can use the following code to convert words to their corresponding numerical values:
from collections import Counter

# `words` is the flat list of all tokens in the corpus
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
word2vec + context = doc2vec
Build sentences from text you have and tag them with labels.
Train doc2vec on tagged sentences to get vectors for each label embedded in the same space as words.
Then you can do vector inference and get labels for an arbitrary piece of text.
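A rough sketch of that workflow with Gensim's Doc2Vec (the tags and texts here are made up for illustration):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [
    TaggedDocument(words=["my", "phone", "is", "on", "the", "table"], tags=["doc_0"]),
    TaggedDocument(words=["i", "cannot", "find", "the", "charger"], tags=["doc_1"]),
]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

new_vector = model.infer_vector(["where", "is", "my", "phone"])
print(model.dv.most_similar([new_vector], topn=2))  # nearest tagged labels in the shared space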
I have applied Doc2vec to convert documents into vectors. After that, I used the vectors in clustering and figured out the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster.
My question is: is there any way to figure out the most dominant or similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.
#TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.
Some gensim Doc2Vec training modes – either the default PV-DM (dm=1) or PV-DBOW with added word-training (dm=0, dbow_words=1) train both doc-vectors and word-vectors into the same coordinate space, and to some extent that means doc-vectors are near related word-vectors, and vice-versa.
So you could take an individual document's vector, or the average/centroid vectors you've synthesized, and feed it to the model to find most_similar() words. (To be clear that this is a raw vector, rather than a list of vector-keys, you should use the form of most_similar() that specifies an explicit list of positive examples.)
For example:
docvec = d2v_model.docvecs['doc77145'] # assuming such a doc-tag exists
similar_words = d2v_model.most_similar(positive=[docvec])
print(similar_words)
To find out the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.
Latent Dirichlet Allocation (LDA): a topic modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in each cluster as one document and apply LDA to generate the topics and see topic distributions across documents.
TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/ngrams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:
calculate IDF for every single word that appears in the documents based on the number of documents that contain that keyword
concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster.
Take a look at Section 5.1 here for more details on the use of TF-IDF.
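A rough sketch of that recipe with scikit-learn (the cluster_docs mapping and its contents are hypothetical, and get_feature_names_out needs scikit-learn 1.0+):

from sklearn.feature_extraction.text import TfidfVectorizer

cluster_docs = {0: ["first document text", "second document text"],
                1: ["third document text"]}
super_docs = [" ".join(docs) for docs in cluster_docs.values()]  # one super-document per cluster

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(super_docs)
terms = vectorizer.get_feature_names_out()

for cluster_id, row in zip(cluster_docs, tfidf.toarray()):
    top_terms = [terms[i] for i in row.argsort()[::-1][:10]]  # 10 highest-weighted terms per cluster
    print(cluster_id, top_terms)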
I'm doing a text classification / tagging task and I would like to ask what kind of data structure would serve me best. The training data set I have is about 4 gigs (after some cleaning, but should be even smaller if I discard the rare words) with 6 million documents. Each document has 4 fields:
Document ID
Title
Body
Tags (as a string, e.g. "apple sql-server linux". This represents three tags, separated by a space. Documents can have 1-5 tags)
I've just finished the cleaning phase (stemming, stop words etc etc) and I'm about to convert them into a TF-IDF word vector with scikit so the output is a scipy sparse matrix. I would like to keep the Title and Body as two vectors and combine them at a later stage when I decide on what weighting to give the Title. The Title and Body are sparse vectors, but they are built with the same dictionary so have the same no. of columns.
What is the best way to represent this information? I come from R so I'm just used to storing things in data.tables / data frames but that doesn't seem very applicable for text classification and sparse matrices. One thing I thought about doing is creating my own "Document" class and just have a list of these objects to represent the corpus. I don't think this is very efficient, since I would probably want to do something like return all docs with the Tag apple.
The ML algorithms I plan to run are k-means clustering, kNN, Naive Bayes and possibly SVM. There will probably be others that I haven't thought about yet.
I'm new to Python and text classification - any help is greatly appreciated, and I'm especially interested in hearing from people who have done this before.
Thank you!
Your best bet is a list of dictionary objects: a list to keep all the documents, and a dictionary to keep all the information regarding each document.
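A minimal sketch of that layout (field names and shapes are just illustrative), keeping the sparse matrices separate and letting each dictionary point to its row:

import numpy as np
from scipy.sparse import csr_matrix

title_tfidf = csr_matrix(np.random.rand(3, 5))  # stand-in for the real scikit-learn TF-IDF output
body_tfidf = csr_matrix(np.random.rand(3, 5))

corpus = [
    {"doc_id": 101, "row": 0, "tags": {"apple", "sql-server"}},
    {"doc_id": 102, "row": 1, "tags": {"linux"}},
    {"doc_id": 103, "row": 2, "tags": {"apple"}},
]

# e.g. pull the title vectors of every document tagged "apple"
apple_rows = [d["row"] for d in corpus if "apple" in d["tags"]]
apple_titles = title_tfidf[apple_rows]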