I have a set of verbatims/sentences. If two sentences have the same meaning, they should be replaced by a single original sentence, and I then need to count the frequency of each such sentence.
Is there a way I can do it in NLTK? Any suggestions in this regard are welcome and appreciated.
I am looking for an NLP approach.
Thanks
I would consider using some more up-to-date ideas for word/document embeddings for sentence similarity, such as:
https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1
https://github.com/facebookresearch/StarSpace - recently this implementation has been added to RASA NLU - https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/classifiers/embedding_intent_classifier.py
https://github.com/commonsense/conceptnet-numberbatch
http://alt.qcri.org/semeval2017/task1/ - this is an annual competition on NLP tasks, and Semantic Textual Similarity is one of them. It could be a really nice source of ideas for you.
On the one hand, sentence embeddings can be used to compare sentences directly; on the other hand, word embeddings can be averaged or summed to produce a whole-sentence embedding. To compare sentence vectors, metrics such as cosine similarity can be used.
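For instance, here is a minimal sketch of the second idea (averaging word vectors and comparing with cosine similarity), assuming a pre-trained model fetched through gensim's downloader; the model name is just an example, and the similarity threshold you would eventually pick is up to you:

```python
import re
import numpy as np
import gensim.downloader as api

# Assumption: a small pre-trained GloVe model from gensim-data; any
# word2vec/fastText/GloVe model would work the same way.
wv = api.load("glove-wiki-gigaword-50")

def sentence_vector(sentence):
    """Average the vectors of the in-vocabulary words of a sentence."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

s1 = "Pete and Rob have found a dog near the station."
s2 = "Patricia found a dog near the station."
print(cosine_similarity(sentence_vector(s1), sentence_vector(s2)))
```

Once each verbatim has a vector, near-duplicates can be grouped by thresholding the pairwise similarity, and the frequency you are after is simply the size of each group.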
I found some papers that might give you a few ideas on how to solve this problem. They use WordNet, a lexical database that can be used to check the similarity of words and is available through NLTK:
Corley, Courtney, and Rada Mihalcea. "Measuring the semantic similarity of texts." Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment. Association for Computational Linguistics, 2005.
--> translates word-to-word similarity to the text level, and I believe you can adapt it for sentences. (https://aclanthology.info/pdf/W/W05/W05-1203.pdf)
Honeck, Richard P. "Semantic similarity between sentences." Journal of psycholinguistic research 2.2 (1973): 137-151. --> Here is another paper that calculates similarity scores between sentences.
I only skimmed the two papers, but it seems that the first paper uses syntactic and semantic similarity techniques sequentially, whereas the second one uses them in parallel.
Miller, George A., and Walter G. Charles. "Contextual correlates of semantic similarity." Language and cognitive processes 6.1 (1991): 1-28. --> This is a linguistics paper which might give you a better understanding on how to compare the semantic similarity of sentences in case the first two methods do not work out for you, and you have to come up with your own solution.
Good luck and hope this helps!
Related
I wonder which algorithm is the best for semantic similarity? Can anyone explain why?
Thank you!
Semantic similarity of what - words, phrases, sentences, paragraphs, documents, other? And 'best' with respect to what end goal?
The original paper which defined 'Word Mover's Distance', "From Word Embeddings To Document Distances", gave some examples of where WMD works well, and comparisons of its behavior against other similarity-calculations.
But, WMD is far more expensive to calculate, especially on longer texts. And as a method which uses every word's presence, regardless of ordering, it still isn't strong in cases where tiny grammatical changes – such as the addition of a 'not' in the right place – might completely reverse a text's meaning to human readers. (But then again, quick-and-simple comparisons like the cosine-similarity between two bag-of-words representations, or between two average-of-word-vectors representations, aren't strong there either.)
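For reference, gensim exposes WMD directly on its word-vector models. A rough sketch, using the example pair from the WMD paper (the model name is just an example, and you need an optimal-transport backend such as pyemd or POT depending on your gensim version):

```python
import gensim.downloader as api

# Assumption: any pre-trained word-vector model works; this one is just small.
wv = api.load("glove-wiki-gigaword-50")

s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()

# Lower distance = more similar; removing stop words usually sharpens the result.
print(wv.wmdistance(s1, s2))
```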
Is there an algorithm that can automatically calculate a numerical rating of the degree of abstractness of a word? For example, the algorithm rates purvey as 1, donut as 0, and immodestly as 0.5 (these are example values).
By abstract words I mean words that refer to ideas and concepts distant from immediate perception, such as economics, calculating, and disputable. Concrete words, on the other hand, refer to things, events, and properties that we can perceive directly with our senses, such as trees, walking, and red.
There's no definition of abstractness that I know of, nor any algorithm to calculate it.
However, there are several directions I would use as proxies:
Frequency - Abstract concepts are likely to be pretty rare in common speech, so a simple IDF score should help identify rare words (a rough sketch follows at the end of this answer).
Etymology - Common words in English are usually of Germanic origin, while more technical words are usually borrowed from French or Latin.
Supervised learning - If you have Wikipedia articles you find abstract, then their common phrases or words would probably also describe similar abstract concepts, so training a classifier on them can be a way to score new words.
There's no ground truth as to what is abstract, and what is concrete, especially if you try to quantify it.
I suggest aggregating these proxies to a metric you find useful for your needs.
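As a concrete starting point for the frequency proxy, here is a rough sketch that uses scikit-learn's TF-IDF machinery to score words by rarity; the toy corpus and the scaling to [0, 1] are purely illustrative, and you would substitute a corpus you consider representative of common speech:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: `documents` stands in for a large corpus of everyday language.
documents = [
    "the dog ate a donut in the park",
    "economics is a disputable subject",
    "we were walking under the red trees",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(documents)

# Map each vocabulary word to its IDF; higher IDF = rarer word.
idf = {word: vectorizer.idf_[i] for word, i in vectorizer.vocabulary_.items()}
max_idf = max(idf.values())

def rarity_score(word):
    """Crude abstractness proxy: rarer words score closer to 1."""
    return idf.get(word.lower(), max_idf) / max_idf

for w in ["donut", "economics", "walking"]:
    print(w, round(rarity_score(w), 2))
```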
How do I cluster only the words in a given set of data? I have been going through a few algorithms online, like the k-means algorithm, but they seem to be aimed at document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please note I am new to Python.
Since my previous answer was indeed a wrong answer (it applied to document clustering, not word clustering), here is the real answer.
What you are looking for is word2vec.
word2vec is a Google tool based on neural networks that works really well. It transforms words into vector representations, which lets you do many things with them.
For example, one thing it does surprisingly well is capture algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What this means is that the vectors capture something of a word's context, and therefore work really well for numerous applications.
When you have vectors instead of words, you can do pretty much anything you want. You can, for example, do k-means clustering with cosine distance as the measure of dissimilarity, as sketched below.
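Here is a minimal sketch of that, assuming a pre-trained model loaded through gensim's downloader; scikit-learn's KMeans uses Euclidean distance, so the vectors are L2-normalized first, which makes it behave like cosine-based (spherical) k-means:

```python
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

# Assumption: a small pre-trained GloVe model; any word2vec/fastText model works.
wv = api.load("glove-wiki-gigaword-50")

words = ["dog", "cat", "kitten", "puppy", "car", "truck", "bicycle"]
vectors = np.array([wv[w] for w in words])

# Normalize so that Euclidean k-means approximates clustering by cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```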
Hope this answers well to your question. You can read more about word2vec in different papers or websites if you'd like. I won't link them here since it is not the subject of the question.
Word clustering will be really disappointing because the computer does not understand language.
You could use levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
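To make the caveat concrete, here is a small sketch with NLTK's edit distance and SciPy's hierarchical clustering; note how it happily groups dog with fog and log while keeping cat apart:

```python
import numpy as np
from nltk.metrics.distance import edit_distance
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["dog", "fog", "log", "cat"]

# Pairwise Levenshtein distances as a symmetric matrix.
n = len(words)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = edit_distance(words[i], words[j])

# Average-linkage hierarchical clustering, cut at distance 1.5.
clusters = fcluster(linkage(squareform(dist), method="average"),
                    t=1.5, criterion="distance")
for word, c in zip(words, clusters):
    print(c, word)  # dog/fog/log share a cluster, cat is on its own
```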
I am new to clustering and need some advice on how to approach this problem...
Let's say I have thousands of sentences, but a few from the sample could be:
Experience In Networking
STRONG Sales Experience
Strong Networking Skills Preferred
Sales Expertise REquired
Chocolate Apples
Jobs are crucial for Networking Majors
In order to cluster these the best way, what approach could I take?
I have looked into k-means with word vectors, but with thousands of sentences that may all contain different words, would it be efficient to build a vector of that size and then go through each sentence to see which of those words it contains?
What other approaches are out there that I have not found?
What I have done so far:
Imported the sentences from CSV into a dict of {ID: sentence}
I am removing stop words from each sentence
I am then counting all words individually to build a master vector and keeping a count of how many times a word appears.
There are two related (but technique-wise distinct) questions here; the first relates to the choice of clustering technique for this data.
The second, which is really a prerequisite, relates to the data model--i.e., how to transform each sentence in the raw data into a data vector suitable for input to a clustering algorithm.
Clustering Technique
k-means is probably the most popular clustering technique, but there are many better ones; consider how k-means works: the user selects a small number of data points from the data (the cluster centers for the initial iteration of the k-means algorithm, aka centroids). Next, the distance between each data point and the set of centroids is determined, and each data point is assigned to the centroid it is closest to; then new centroids are determined from the mean value of the data points assigned to the same cluster. These two steps are repeated until some convergence criterion is reached (e.g., between two consecutive iterations, the centroids' combined movement falls below some threshold).
The better techniques do much more than just move the cluster centers around--for instance, spectral methods rotate and stretch/squeeze the data to find a single axis of maximum variance, then determine additional axes orthogonal to the first one and to each other--i.e., a transformed feature space. PCA (principal component analysis), LDA (linear discriminant analysis), and kPCA are all members of this class; their defining characteristic is the calculation of eigenvalue/eigenvector pairs for the original data or for its covariance matrix, and they are typically applied as a transformation step before clustering rather than as clustering algorithms themselves. Scikit-learn has a module for PCA computation.
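For instance, a minimal sketch of that module (random data stands in for the sentence vectors built in the next section):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: X is your (n_samples, n_features) matrix of sentence vectors;
# random data is used here only to keep the sketch self-contained.
X = np.random.rand(100, 50)

pca = PCA(n_components=10)           # keep the 10 directions of highest variance
X_reduced = pca.fit_transform(X)     # transformed feature space, ready for clustering
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```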
Data Model
As you have observed, the common dilemma in constructing a data model from unstructured text is this: including a feature for every word in the entire corpus (minus stop words) often results in very high sparsity over the dataset (each sentence includes only a small fraction of the total words across all sentences, so each data vector is consequently sparse); on the other hand, if the corpus is trimmed so that, for instance, only the top 10% of the words are used as features, then some or many of the sentences end up with completely unpopulated data vectors.
Here's one common sequence of techniques to help solve this problem, which might be particularly effective given your data: Combine related terms into a single term using the common processing sequence of normalizing, stemming and synonymizing.
This is intuitive: e.g.,
Normalize: transform all words to lowercase (Python strings have a lower method, so
REquired.lower()
gives required). Obviously, this prevents Required, REquired, and required from comprising three separate features in your data vector, and instead collapses them into a single term.
Stem: After stemming, required, require, and requiring are all collapsed to a single token, requir.
Two of the most common stemmers are the Porter and Lancaster stemmers (the NLTK, discussed below, has both).
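A quick illustration with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["required", "require", "requiring"]])
# -> ['requir', 'requir', 'requir']
```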
Synonymize: Terms like fluent, capable, and skilled, can, depending on context, all be collapsed to a single term, by identifying in a common synonym list.
The excellent Python NLP library NLTK has (at least) several synonym compilations, or digital thesauri, to help you do all three of these programmatically.
For instance, nltk.corpus.reader.lin is one (just one; there are several more synonym-finders in NLTK), and it's simple to use--just import this module and call synonym, passing in a term.
Multiple stemmers are in NLTK's stem package.
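If the Lin thesaurus data isn't available, WordNet (also bundled with NLTK) can serve the synonymizing step; a rough sketch, with the caveat that deciding which senses to keep is up to you:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def synonyms(word):
    """Collect WordNet lemma names across all senses of a word."""
    return {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()}

print(synonyms("skilled"))
```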
I actually just recently put together a guide to document clustering in Python. I would suggest using a combination of k-means and latent dirichlet allocation. Take a look and let me know if I can further explain anything: http://brandonrose.org/clustering
I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything but if I did, please do point me to the question.
In the meantime, though, I will describe what I am trying to do. A common notion that I observed in many posts is that semantic similarity is difficult. For instance, from this post, the accepted solution suggests the following:
First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. ....
Consider these examples:
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-4 are similar to 1? 2 is the exact opposite of 1, still it is about Pete and Rob (not) finding a dog.
My high-level requirement is to use k-means clustering and categorize the text based on semantic similarity, so all I need to know is whether two sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 will be backed up with some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
I am thinking I need to ultimately construct vector representations of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
This thread did a fantastic job of enumerating all related techniques but unfortunately stopped just when the post got to what I wanted. Any suggestions on what is the latest state-of-the-art in this area?
Latent Semantic Modeling could be useful. It's basically just another application of the Singular Value Decomposition. SVDLIBC is a pretty nice C implementation of this approach; it's an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
I suggest you try a topic modelling framework such as Latent Dirichlet Allocation (LDA). The idea there is that documents (in your case sentences, which might prove to be a problem) are generated from a set of latent (hidden) topics; LDA retrieves those topics, representing them by word clusters.
An implementation of LDA in Python is available as part of the free Gensim package. You could try to apply it to your sentences, then run k-means on its output.
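A rough sketch of that combination with Gensim and scikit-learn; the toy sentences, the number of topics, and the number of clusters are all placeholders for your own data and tuning:

```python
import numpy as np
from gensim import corpora, models
from sklearn.cluster import KMeans

# Assumption: `sentences` stands in for your corpus; tokenization here is
# deliberately naive (lowercase + split) to keep the sketch short.
sentences = [
    "Pete and Rob have found a dog near the station",
    "Patricia found a dog near the station",
    "Pete and Rob both like programming a lot",
]
texts = [s.lower().split() for s in sentences]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# One dense topic-distribution vector per sentence, then k-means on those vectors.
topic_vectors = np.array([
    [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_vectors)
print(labels)
```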