I wonder which algorithm is the best for semantic similarity? Can anyone explain why?
Thank you!
Semantic similarity of what - words, phrases, sentences, paragraphs, documents, other? And 'best' with respect to what end goal?
The original paper that defined Word Mover's Distance, "From Word Embeddings To Document Distances", gives some examples of where WMD works well and compares its behavior against other similarity calculations.
But, WMD is far more expensive to calculate, especially on longer texts. And as a method which uses every word's presence, regardless of ordering, it still isn't strong in cases where tiny grammatical changes – such as the addition of a 'not' in the right place – might completely reverse a text's meaning to human readers. (But then again, quick-and-simple comparisons like the cosine-similarity between two bag-of-words representations, or between two average-of-word-vectors representations, aren't strong there either.)
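To make that concrete, here is a minimal sketch (plain NumPy with made-up toy vectors, not a real pretrained model) of cosine similarity over bag-of-words counts and over averaged word vectors; to both representations the added 'not' is just one more token, which is exactly the weakness mentioned above:

# Minimal sketch (toy vectors, not a pretrained model): cosine similarity
# over bag-of-words counts and over averaged word vectors.
from collections import Counter
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bow_vector(tokens, vocab):
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

# Made-up 4-d word vectors; in practice these would come from
# word2vec / GloVe / fastText.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in ["the", "food", "was", "good", "not"]}

def avg_vector(tokens):
    return np.mean([embeddings[t] for t in tokens], axis=0)

a = "the food was good".split()
b = "the food was not good".split()
vocab = sorted(set(a) | set(b))

print("BoW cosine:       ", cosine(bow_vector(a, vocab), bow_vector(b, vocab)))
print("Avg-vector cosine:", cosine(avg_vector(a), avg_vector(b)))
# Neither representation treats the 'not' as anything more than one extra token.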
Related
Is there an algorithm that can automatically calculate a numerical rating of the degree of abstractness of a word? For example, the algorithm rates purvey as 1, donut as 0, and immodestly as 0.5 (these are example values).
By abstract words I mean words that refer to ideas and concepts that are distant from immediate perception, such as economics, calculating, and disputable. On the other side, concrete words refer to things, events, and properties that we can perceive directly with our senses, such as trees, walking, and red.
There's no definition of abstractness that I know of, nor any algorithm to calculate it.
However, there are several directions I would use as proxies:
Frequency - Abstract concepts are likely to be pretty rare in common speech, so a simple IDF should help identify rare words (see the sketch at the end of this answer).
Etymology - Common words in English are usually descended from Germanic origins, while more technical words are usually borrowed from French or Latin.
Supervised learning - If you have Wikipedia articles you find abstract, then their common phrases or words would probably also describe similar abstract concepts. Training a classifier on them can be a way to score abstractness.
There's no ground truth as to what is abstract and what is concrete, especially if you try to quantify it.
I suggest aggregating these proxies into a metric you find useful for your needs.
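For the frequency proxy, here is a minimal IDF sketch; the four-line corpus is just a placeholder, and in practice you would compute document frequencies over a large reference corpus such as a Wikipedia dump:

# Minimal IDF sketch for the frequency proxy. The toy corpus is a placeholder;
# a large reference corpus would be used in practice.
import math
from collections import defaultdict

corpus = [
    "i ate a donut this morning",
    "the donut shop was closed",
    "economics is the study of scarcity",
    "they purvey goods to local shops",
]

docs = [set(doc.split()) for doc in corpus]
df = defaultdict(int)          # document frequency of each word
for doc in docs:
    for word in doc:
        df[word] += 1

def idf(word):
    # Smoothed IDF; words seen in fewer documents score higher.
    return math.log((1 + len(docs)) / (1 + df[word])) + 1.0

for w in ["donut", "economics", "purvey"]:
    print(w, round(idf(w), 3))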
I have got a set of verbatims/sentences, and what I am trying to do is this: if two sentences have the same meaning, those sentences should be replaced by the original one, and later on I need to take the frequency of such sentences.
Is there a way I can do it in NLTK? Any suggestions in this regard are welcome and appreciated.
I am looking for an NLP approach.
Thanks
I would consider using some more up-to-date ideas for word/document embeddings for sentence similarity, such as:
https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1
https://github.com/facebookresearch/StarSpace - recently this implementation has been added to RASA NLU - https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/classifiers/embedding_intent_classifier.py
https://github.com/commonsense/conceptnet-numberbatch
http://alt.qcri.org/semeval2017/task1/ - it's an annual competition on NLP tasks, and Semantic Textual Similarity is one of them. It could be a really nice source of ideas for you.
On the one hand, sentence embeddings can be used to compare sentences directly; on the other hand, word embeddings can be averaged or summed to get a whole-sentence embedding. To compare sentence vectors, metrics such as cosine similarity can be used.
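To connect this back to the original question (replace near-duplicate sentences and count them), here is a rough sketch that uses scikit-learn TF-IDF vectors as a stand-in for the embeddings above, an arbitrary cosine-similarity cutoff, and a greedy grouping:

# Rough sketch: group sentences whose cosine similarity exceeds a cutoff,
# keep the first sentence of each group as the "original", and count members.
# TF-IDF is only a stand-in for the sentence/word embeddings discussed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the delivery was very late",
    "delivery arrived very late",
    "the product quality is great",
    "great quality product",
]

vectors = TfidfVectorizer().fit_transform(sentences)
sims = cosine_similarity(vectors)

THRESHOLD = 0.4              # arbitrary; tune on your data
groups = {}                  # representative sentence -> count
assigned = [False] * len(sentences)

for i, sent in enumerate(sentences):
    if assigned[i]:
        continue
    groups[sent] = 1
    assigned[i] = True
    for j in range(i + 1, len(sentences)):
        if not assigned[j] and sims[i, j] >= THRESHOLD:
            groups[sent] += 1
            assigned[j] = True

print(groups)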
I found some papers that might give you a few ideas on how to solve this problem. They use WordNet, a lexical database that can be used for checking word similarity, and it is available in NLTK:
Corley, Courtney, and Rada Mihalcea. "Measuring the semantic similarity of texts." Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment. Association for Computational Linguistics, 2005.
--> This translates word-to-word similarity to the text level, and I believe you can adapt it for sentences. (https://aclanthology.info/pdf/W/W05/W05-1203.pdf)
Honeck, Richard P. "Semantic similarity between sentences." Journal of psycholinguistic research 2.2 (1973): 137-151. --> Here is another paper that calculates similarity scores between sentences.
I only skimmed the two papers, but it seems that the first paper uses syntactic and semantic similarity techniques sequentially, whereas the second one uses them in parallel.
Miller, George A., and Walter G. Charles. "Contextual correlates of semantic similarity." Language and cognitive processes 6.1 (1991): 1-28. --> This is a linguistics paper which might give you a better understanding on how to compare the semantic similarity of sentences in case the first two methods do not work out for you, and you have to come up with your own solution.
Good luck and hope this helps!
How to cluster only words in a given set of data: I have been going through a few algorithms online, like the k-means algorithm, but they seem to be related to document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please note I am new to Python.
Based on the fact that my last answer was indeed a false answer, since it was about document clustering and not word clustering, here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a Google tool based on deep learning that works really well. It transforms words into vector representations, and therefore allows you to do multiple things with them.
For example, one of the things that works well is algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
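If you want to try these analogies yourself, a quick way (assuming gensim is installed; the downloader fetches a small pretrained GloVe model on first use) is:

# Quick check of the analogy examples above using pretrained GloVe vectors.
# gensim's downloader fetches the model the first time this runs.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # small pretrained word vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears among the top results.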
What this means is that it can sort of encompass the context of a word, and therefore it will work really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can, for example, do k-means clustering with cosine distance as the measure of dissimilarity...
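As a rough illustration of that pipeline (gensim 4.x API assumed; the toy corpus below is far too small to produce meaningful clusters, so treat it purely as a sketch):

# Sketch: train word2vec on a toy corpus, then k-means-cluster the word vectors.
# Unit-normalising the vectors makes Euclidean k-means behave roughly like
# clustering by cosine similarity.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
    ["i", "drive", "a", "car", "to", "work"],
    ["the", "truck", "and", "the", "car", "are", "fast"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

words = model.wv.index_to_key
vectors = np.array([model.wv[w] for w in words])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for cluster_id in range(3):
    print(cluster_id, [w for w, c in zip(words, labels) if c == cluster_id])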
Hope this answers well to your question. You can read more about word2vec in different papers or websites if you'd like. I won't link them here since it is not the subject of the question.
Word clustering will be really disappointing because the computer does not understand language.
You could use Levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
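For completeness, a minimal sketch of that approach (a pure-Python edit distance plus SciPy hierarchical clustering); it also makes the caveat visible, since dog is grouped with fog (distance 1) rather than with cat (distance 3):

# Sketch: Levenshtein distance matrix + hierarchical clustering over words.
# Edit distance reflects spelling, not meaning, which is exactly the caveat above.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                     # deletion
                            curr[j - 1] + 1,                 # insertion
                            prev[j - 1] + (ca != cb)))       # substitution
        prev = curr
    return prev[-1]

words = ["dog", "fog", "cat", "bat", "dogs"]
n = len(words)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = levenshtein(words[i], words[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="distance")
print(dict(zip(words, labels)))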
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do it fast enough? Splitting sentences, using FTS to find words that appear in terms, and filtering terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search (FTS matching terms in the sentence) be faster?
If speed is a real issue, and if your glossary of terms is not going to be updated often, compared to the number of searches you want to do, you could look into something like a Levenshtein Automaton. I don't know of any python libraries that support it, but if you really need it you could implement it yourself. To find all possible paths will require some dynamic programming.
If you just need to get it done, just loop over the glossary and test each one against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
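A brute-force sketch of that loop, using difflib's ratio() as the similarity cutoff (python-Levenshtein's ratio() is a faster drop-in); the 0.8 cutoff and the helper names are only illustrative:

# Brute-force sketch: test every glossary term against every window of words
# of the same length in the sentence, keeping matches above a cutoff.
from difflib import SequenceMatcher

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def find_terms(sentence, glossary, cutoff=0.8):   # cutoff is arbitrary
    words = sentence.lower().split()
    hits = []
    for term in glossary:
        term_len = len(term.split())
        for i in range(len(words) - term_len + 1):
            window = " ".join(words[i:i + term_len])
            score = ratio(window, term.lower())
            if score >= cutoff:
                hits.append((term, window, round(score, 2)))
    return hits

glossary = ["sentence", "example sentence"]
print(find_terms("This is an exmaple sentence", glossary))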
As stated by most spelling-corrector tutorials, the correct word Ŵ for an incorrectly spelled word X is:
Ŵ = argmax_W P(X|W) · P(W)
where P(X|W) is the likelihood and P(W) is the language model.
In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed using a confusion matrix, which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus, and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W instead of the confusion matrix? Does it make much of a difference?
The way I am going to compute the Levenshtein distance in Python is this:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
P.S. I am working with Python.
There are a few things to say.
The model you are using to predict the most likely correction is a simple, cascaded probability model: There is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data. So it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)
If you were to use Levenshtein distance for P(X|W), you would get integers between 0 and the length of the longer of the two strings. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the value would get larger the more different the candidate is from the input. That's the opposite of what you want.
However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It's an implementation of a similarity measure and returns values between 0 and 1. The closer to 1, the more similar the two strings are. So this makes sense.
Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For this, you'd have to check if the sum of all ratios you get for all possible misspellings of W is a total of 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not in fact a mathematically valid choice.
However, it will still give you reasonable results, and I'd say it can be used for a practical, prototypical implementation of a spell-checker. There is a performance concern, though: since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply it to a huge number of possible candidates from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms).
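To tie this together, here is a toy sketch of the argmax scoring from the question, with SequenceMatcher.ratio() standing in for P(X|W) and relative corpus frequency for P(W); the one-line corpus is a placeholder for a real dictionary and frequency list:

# Toy sketch of the noisy-channel scoring: pick the dictionary word W that
# maximises ratio(X, W) * P(W). ratio() is only a stand-in for a proper P(X|W),
# and the tiny corpus is a placeholder for real frequency data.
from collections import Counter
from difflib import SequenceMatcher

corpus = "the cat sat on the mat the cat ate the rat".split()
counts = Counter(corpus)
total = sum(counts.values())

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def correct(x):
    return max(counts, key=lambda w: ratio(x, w) * counts[w] / total)

print(correct("teh"))   # likely 'the'
print(correct("catt"))  # likely 'cat'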
Yes, it is OK to use Levenshtein distance instead of the corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by the distance of the differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because p is farther away from c on my keyboard than x is. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end.
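A toy sketch of the keyboard-weighted idea (the neighbour map below is only a small QWERTY fragment I made up for illustration; a real error model would cover the full layout and tune the costs):

# Toy sketch: substituting a letter for one of its QWERTY neighbours is cheaper
# than substituting a distant letter. The neighbour map is only a small fragment.
NEIGHBOURS = {
    "c": "xdfv", "x": "zsdc", "p": "ol",
    "a": "qwsz", "s": "awedxz", "w": "qase",
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if b in NEIGHBOURS.get(a, "") else 1.0

def weighted_edit_distance(x, w):
    # Standard dynamic-programming edit distance, with weighted substitutions.
    m, n = len(x), len(w)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # deletion
                          d[i][j - 1] + 1,                              # insertion
                          d[i - 1][j - 1] + sub_cost(x[i - 1], w[j - 1]))
    return d[m][n]

print(weighted_edit_distance("abx", "abc"))  # 0.5: x is a keyboard neighbour of c
print(weighted_edit_distance("abp", "abc"))  # 1.0: p is far from c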
The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can (and people have) come up with really complex rules. Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.
P.S. All of the example mistakes above have been purely due to the use of a keyboard. Sometimes, people simply do not know how to spell a word; that is a whole other can of worms. Google "soundex".