How to determine if two sentences talk about similar topics? - python

I would like to ask you a question. Is there any algorithm/tool which can allow me to do some association between words?
For example: I have the following group of sentences:
(1)
"My phone is on the table"
"I cannot find the charger". # no reference on phone
(2)
"My phone is on the table"
"I cannot find the phone's charger".
What I would like to do is to find a connection, probably a semantic connection, which allows me to say that the first two sentences are talking about the same topic (the phone), since the two terms (phone and charger) are related in general. The same applies to the second group.
I should have something that can connect phone to charger in the first group.
I was thinking of using Word2vec, but I am not sure if this is something that I can do with it.
Do you have any suggestions about algorithms that I can use to determine the similarity of topics (i.e. sentences that are formulated differently but have the same topic)?

In Python I'm pretty sure you have a SequenceMatcher that you can use:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
If you want your own algorithm I would suggest the Levenshtein distance (it calculates how many edit operations you need to turn one string (sentence) into another; it might be useful). I coded it myself like this for two strings:
def levenshtein(str1, str2):
    # edits[i][j] holds the edit distance between the first i characters of str2
    # and the first j characters of str1
    edits = [[x for x in range(len(str1) + 1)] for y in range(len(str2) + 1)]
    for i in range(len(str2) + 1):
        edits[i][0] = i
    for i in range(1, len(str2) + 1):
        for j in range(1, len(str1) + 1):
            if str2[i-1] == str1[j-1]:
                edits[i][j] = edits[i-1][j-1]
            else:
                edits[i][j] = 1 + min(edits[i-1][j-1], edits[i-1][j], edits[i][j-1])
    return edits[-1][-1]
[EDIT] Since you want to compare whether the sentences are about a similar topic, I would suggest any of the following algorithms (all are pretty easy; a minimal Jaccard sketch follows the list):
Jaccard similarity
K-means and Hierarchical Clustering Dendrogram
Cosine Similarity
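As a rough sketch of the Jaccard idea on the sentences from the question (the whitespace tokenization here is deliberately naive):

def jaccard_similarity(sentence_a, sentence_b):
    # Model each sentence as a set of lower-cased words.
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    # Size of the intersection divided by the size of the union.
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard_similarity("My phone is on the table",
                         "I cannot find the phone's charger"))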

This type of task is called sentence similarity, or more generally semantic textual similarity. You can use a few different approaches for this type of task. On Papers with Code you can find benchmarks and the current state of the art.
First, you can look at the ratio of shared words. The Jaccard index is probably the simplest metric you can use for this: if you model both sentences as sets of words, the Jaccard index is the size of the intersection divided by the size of the union of these two sets.
Another way is to turn these sentences into vectors by counting the words, and then use cosine similarity to measure how closely they are related.
But not every word is equally important. To account for this you can use a weighting scheme such as term frequency-inverse document frequency (TF-IDF) or BM25, which in essence assign greater weight to more important words. They measure a word's importance by looking at how frequently it appears across all documents in your corpus.
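A minimal sketch of the TF-IDF plus cosine-similarity route, assuming scikit-learn is available (the two sentences are just the ones from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["My phone is on the table",
             "I cannot find the phone's charger"]
# Fit TF-IDF weights on the two sentences and compare the resulting vectors.
vectors = TfidfVectorizer().fit_transform(sentences)
print(cosine_similarity(vectors[0], vectors[1]))  # value between 0 and 1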
You can improve these methods by only using the entities mentioned in the text. In your example those would be I, phone, table and the charger. You can use spaCy or stanza to find the entities if you are using Python.
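A rough spaCy sketch, assuming the en_core_web_sm model is installed; note that for everyday sentences like these the named-entity recognizer may find nothing, so noun chunks are often a more useful stand-in:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I cannot find the phone's charger")
print(doc.ents)                                   # named entities (may be empty here)
print([chunk.text for chunk in doc.noun_chunks])  # noun phrases, e.g. "the phone's charger"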
If you are using word embeddings such as word2vec, glove, or fasttext, you can take the average of the word vectors and use it as a vector for the whole sentence. Then you can again use cosine similarity.
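A sketch of the averaging approach, assuming pretrained embeddings are already loaded into a gensim KeyedVectors object called word_vectors (the tokenization is again naive):

import numpy as np

def sentence_vector(sentence, word_vectors):
    # Average the vectors of the words the model knows about.
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

v1 = sentence_vector("My phone is on the table", word_vectors)
v2 = sentence_vector("I cannot find the charger", word_vectors)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity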
Or, on the more complicated side of using word embeddings, you can use Word Mover's Distance to measure the distance between two collections of word vectors.
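Gensim exposes this as wmdistance on KeyedVectors; a hedged sketch (recent gensim versions need the POT package installed for this, older ones used pyemd):

# word_vectors is again a gensim KeyedVectors instance with pretrained embeddings
doc1 = "my phone is on the table".split()
doc2 = "i cannot find the charger".split()
print(word_vectors.wmdistance(doc1, doc2))  # lower distance means more similar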
There are also neural models for sentence similarity. Transformer models are currently the state of the art for this kind of problem; as we can see on STSBenchmark, a BERT-based transformer model is currently in first place. These models usually need a lot of computation power, but you don't have to train each model from scratch: you can just download a pretrained model and use it right away.
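A minimal sketch with the sentence-transformers library (the model name below is just one commonly used pretrained option, not something prescribed by the benchmark):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads a pretrained model on first use
embeddings = model.encode(["My phone is on the table",
                           "I cannot find the phone's charger"])
print(util.cos_sim(embeddings[0], embeddings[1]))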
There are probably many more methods for this. Here is a recent survey on semantic similarity methods.

Related

How to create word embedding using Word2Vec on Python?

I have seen many tutorials online on how to use Word2Vec (gensim).
Most tutorials show how to find the .most_similar word or the similarity between two words.
But what if I have text data X and I want to produce the word embedding vector X_vector?
So that this X_vector can be used for classification algorithms?
If X is a word (string token), you can look up its vector with word_model[X].
If X is a text (say, a list-of-words), well, a Word2Vec model only has vectors for words, not texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy popular baseline (a fair starting point especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming numpy is imported as np):
np.mean([word_model[word] for word in word_list], axis=0)
But, recent versions of Gensim also have a convenience .get_mean_vector() method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_mean_vector
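A minimal usage sketch, assuming word_model is a gensim 4.x KeyedVectors object and word_list is your tokenized text:

text_vector = word_model.get_mean_vector(word_list)  # averages the per-word vectors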

Find most similar words for OOV word

I am looking for the most similar words for out-of-vocabulary (OOV) words using gensim. Something like this:
def get_word_vec(self, model, word):
    try:
        if word not in model.wv.vocab:
            mostSimWord = model.wv.similar_by_word(word)
            print(mostSimWord)
        else:
            print(word)
    except Exception as ex:
        print(ex)
Is there a way to achieve this task? Options other than gensim are also welcome.
If you train a FastText model instead of a Word2Vec model, it inherently learns vectors for word-fragments (of configurable size ranges) in addition to full words.
In languages like English and many others (but not all), unknown words are often typos, alternate forms, or related in terms of roots and suffixes to known words. Thus, having vectors for subwords, and then using those to tally up a good guess vector for an unknown word, can work well enough to be worth trying; it is better than ignoring such words or using a totally random or origin-point vector.
There's no built-in way to extract such relationships from an existing set of word vectors that isn't FastText/subword-based, but it would be theoretically possible. You could compute edit distances to, or counts of shared subwords with, all known words, and create a guess vector as a weighted combination of the N closest words. (This might work really well for typos and rarer alternate spellings, but less so for truly absent novel words.)
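A rough gensim FastText sketch (the tiny training corpus below is only a stand-in; a real corpus is needed for meaningful vectors):

from gensim.models import FastText

sentences = [["the", "charger", "is", "missing"],
             ["my", "phone", "is", "charging"]]
model = FastText(sentences, vector_size=50, min_count=1, epochs=50)
# Works even for a token never seen in training, via its character n-grams.
print(model.wv.most_similar("chargerr", topn=3))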

Construct a suffix tree of a concatenation of a million words and query it with a test set to find the closest match and classify

The problem I'm trying to solve: I have a million words (multiple languages) and some classes that they classify into as my training corpora. Given the testing corpora of words (which is bound to increase in number over time) I want to get the closest match of each of those words in the training corpora and hence classify that word as the corresponding class of its closest match.
My Solution: Initially, I did this by brute force, which doesn't scale. Now I'm thinking I should build a suffix tree over the concatenation of the training corpora (O(n)) and query it with the testing corpora (constant time). Trying to do this in Python.
I'm looking for tools or packages that get me started or for other more efficient ways to solve the problem at hand. Thanks in advance.
Edit 1: As for how I am finding the closest match, I was thinking a combination of exact match alignment (from the suffix tree) and then for the part of the input string that is left over, I thought of doing a local alignment with affine gap penalty functions.
What distance metric are you using for the closest match?
There are papers that cover how to do an edit-distance search using a suffix tree. For each suffix there is an extension of the edit matrix, and these can be ordered so as to let one do a ranked search of the suffix tree to find the matching items in order of increasing distance.
An example for this is Top-k String Similarity Search with Edit-Distance Constraints (2013) https://doi.org/10.1109/ICDE.2013.6544886 https://scholar.google.com/scholar?cluster=13387662751776693983
The solution presented avoids computing all the entries in the table as columns are added.
In your problem it seems that for each word there are classes that apply to it. If the classes don't depend on context, then the above would work and a word-to-class map would be all that is needed. But if they do depend on context, then this seems closer to part-of-speech tagging.
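This is not the suffix-tree approach described above, but as a crude baseline for the word-to-class-map idea (all names below are hypothetical), difflib can return the closest known training word:

import difflib

word_to_class = {"apple": "fruit", "apples": "fruit", "table": "furniture"}  # toy training map

def classify(word, cutoff=0.6):
    # Find the closest training word by difflib's ratio-based similarity.
    matches = difflib.get_close_matches(word.lower(), list(word_to_class), n=1, cutoff=cutoff)
    return word_to_class[matches[0]] if matches else None

print(classify("aple"))  # likely "fruit"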

Need of context while using Word2Vec

I have a large number of strings in a list:
A small example of the list contents is :
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purpose.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about getting the vectors of these strings if I have just this list containing the strings?
I have done this code so far:
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?
To find word vectors using word2vec you need a list of tokenized sentences, not a list of strings.
What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it within a specified window (usually around 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matter when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
I'm not sure, but I think you will find LDA better suited in this case, because your list of strings doesn't have an inherent order.
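To illustrate, Word2Vec expects an iterable of tokenized sentences, roughly like this (the corpus below is a made-up stand-in):

from gensim.models import Word2Vec

sentences = [["machine", "learning", "needs", "funding"],
             ["apple", "invests", "in", "ai"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv["funding"])  # the learned vector for one word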
Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression as a single word, e.g. New York); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (in a window, under a bag-of-words or skip-gram assumption).
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
You need more words. The quality of Word2Vec's output depends on the size of the training set, so training Word2Vec on your short list alone makes little sense.
The links provided by @Beta are a good introduction/explanation.
Word2Vec is an artificial neural network method. Word2Vec actually creates embeddings, which reflect the relationships among words. The links below will help you get the complete code to implement Word2Vec.
Some good links are this and this. For the 2nd link, try his GitHub repo for the detailed code; he explains only the major parts in the blog. The main article is this.
You can use the following code to convert words to their corresponding numerical values:
from collections import Counter

# words is your list of word tokens
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
word2vec + context = doc2vec
Build sentences from the text you have and tag them with labels.
Train doc2vec on the tagged sentences to get vectors for each label, embedded in the same space as the words.
Then you can do vector inference and get labels for an arbitrary piece of text.
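A minimal doc2vec sketch with gensim (the texts and tags below are hypothetical):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["machine", "learning", "needs", "funding"], tags=["tech"]),
        TaggedDocument(words=["apple", "raised", "new", "funding"], tags=["finance"])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
# Infer a vector for an arbitrary new piece of text and compare it against the tags.
vector = model.infer_vector(["ai", "startup", "funding"])
print(model.dv.most_similar([vector]))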

How to calculate the distance in meaning of two words in Python

I am wondering if it's possible to calculate the distance/similarity between two related words in Python (like "fraud" and "steal"). These two words are not synonymous per se but they are clearly related. Are there any concepts/algorithms in NLP that can show this relationship numerically? Maybe via NLTK?
I'm not looking for the Levenshtein distance as that relates to the individual characters that make up a word. I'm looking for how the meaning relates.
Would appreciate any help provided.
My suggestion is as follows:
Put each word through the same thesaurus, to get a list of synonyms.
Get the size of the set of synonyms shared by the two words.
That is a measure of similarity between the words.
If you would like to do a more thorough analysis:
Also get the antonyms for each of the two words.
Get the size of the intersection of the sets of antonyms for the two words.
If you would like to go further!...
Put each word through the same thesaurus, to get a list of synonyms.
Use the top n (=5, or whatever) words from the query result to initiate a new query.
Repeat this to a depth you feel is adequate.
Make a collection of synonyms from the repeated synonym queries.
Get the size of the set of synonyms shared between the two collections.
That is a measure of similarity between the words.
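A rough sketch of the synonym-overlap idea using NLTK's WordNet as the thesaurus (it assumes the WordNet corpus has been downloaded via nltk.download('wordnet'); the scoring is only illustrative):

from nltk.corpus import wordnet

def synonyms(word):
    # Collect lemma names across all senses of the word.
    return {lemma.name() for synset in wordnet.synsets(word) for lemma in synset.lemmas()}

a, b = synonyms("fraud"), synonyms("steal")
print(len(a & b), "shared synonyms out of", len(a | b))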
NLTK's WordNet is the tool you'd want to use for this. First get the set of all the senses of each word using:
synonymSet = wordnet.synsets(word)
Then loop through each possible sense of each of the two words and compare them to each other in a nested loop:
similarity = synonym1.res_similarity(synonym2,semcor_ic)
Either average that value or use the maximum you find; up to you.
This example is using a word similarity comparison that uses "IC" or information content. This will score similarity higher if the word is more specific, or contains more information, so generally it's closer to what we mean when we think about word similarity.
To use this stuff you'll need the imports and variables:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
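Putting those pieces together, a hedged sketch of the nested loop described above (it also requires nltk.download('wordnet') and nltk.download('wordnet_ic'); taking the maximum over sense pairs is just one reasonable choice):

import nltk
from nltk.corpus import wordnet, wordnet_ic
from nltk.corpus.reader.wordnet import WordNetError

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def max_res_similarity(word1, word2):
    best = 0.0
    for syn1 in wordnet.synsets(word1):
        for syn2 in wordnet.synsets(word2):
            try:
                # res_similarity only works for senses with the same part of speech.
                best = max(best, syn1.res_similarity(syn2, semcor_ic))
            except WordNetError:
                pass
    return best

print(max_res_similarity("fraud", "steal"))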
As @jose_bacoy suggested above, the Gensim library can provide a measure of similarity between words using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download the pretrained Google News vectors and load them up:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. To use your example:
word_vectors.similarity('fraud', 'steal')
>>> 0.19978741
Twenty percent similarity may seem surprisingly low if you consider these words to be similar. But consider that fraud is a noun while steal is generally a verb. This gives them very different associations as viewed by word2vec.
They become much more similar if you modify the noun to become a verb:
word_vectors.similarity('defraud', 'steal')
>>> 0.43293646
