What are some good ways of estimating 'approximate' semantic similarity between sentences?

I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything but if I did, please do point me to the question.
In the meantime, I will describe what I am trying to do. A common notion that I observed in many posts is that semantic similarity is difficult. For instance, in this post, the accepted solution suggests the following:
First of all, neither from the perspective of computational
linguistics nor of theoretical linguistics is it clear what
the term 'semantic similarity' means exactly. ....
Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-4 are similar to 1? 2 is the exact
opposite of 1, still it is about Pete and Rob (not) finding a
dog.
My high-level requirement is to utilize k-means clustering and categorize the text based on semantic similarity so all I need to know is whether they are an approximate match. For instance, in the above example, I am OK with classifying 1,2,4,5 into one category and 3 into another (of course, 3 will be backed up with some more similar sentences). Something like, find related articles, but they don't have to be 100% related.
I think I ultimately need to construct a vector representation of each sentence, a sort of fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, or something from WordNet, or just the individual stemmed words, or something else altogether?
This thread did a fantastic job of enumerating all related techniques but unfortunately stopped just when the post got to what I wanted. Any suggestions on what is the latest state-of-the-art in this area?

Latent Semantic Modeling could be useful. It's basically just yet another application of the Singular Value Decomposition. SVDLIBC is a pretty nice C implementation of this approach, which is an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
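Latent semantic approaches start from a term-by-sentence matrix, which the SVD then reduces. As a rough sketch of that first stage only (the SVD step is left out, and the crude tokenizer and the smoothed IDF weighting are assumptions of this example, not something SVDLIBC or sparsesvd prescribes), the following builds TF-IDF vectors for three of the example sentences and compares them with cosine similarity:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(sentences):
    """Return one sparse {term: weight} dict per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption): terms found in every sentence get IDF 1.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
]
vectors = tfidf_vectors(sentences)
print(cosine(vectors[0], vectors[1]))  # sentence 1 vs. sentence 2
print(cosine(vectors[0], vectors[2]))  # sentence 1 vs. sentence 3
```

The resulting rows are what would be stacked into the matrix handed to an SVD routine; the reduced vectors could then be clustered with k-means.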

I suggest you try a topic modelling framework such as Latent Dirichlet Allocation (LDA). The idea there is that documents (in your case sentences, which might prove to be a problem) are generated from a set of latent (hidden) topics; LDA retrieves those topics, representing them by word clusters.
An implementation of LDA in Python is available as part of the free Gensim package. You could try to apply it to your sentences, then run k-means on its output.

Related

Compare similarity of two names and identify duplicates with neural network

I have a dataset which contains pairs of names, it looks like this:
ID; name1; name2
1; Mike Miller; Mike Miler
2; John Doe; Pete McGillen
3; Sara Johnson; Edita Johnson
4; John Lemond-Lee Peter; John LL. Peter
5; Marta Sunz; Martha Sund
6; John Peter; Johanna Petera
7; Joanna Nemzik; Joanna Niemczik
I have some cases, which are labelled. So I check them manually and decide if these are duplicates or not. The manual judgement in these cases would be:
1: Is a duplicate
2: Is not a duplicate
3: Is not a duplicate
4: Is a duplicate
5: Is not a duplicate
6: Is not a duplicate
7: Is a duplicate
(The 7th case is a specific case, because here phonetics come into the game too. However, this is not the main problem, I am ok with ignoring phonetics.)
A first approach would be to calculate the Levenshtein distance for each pair and mark a pair as a duplicate if the distance is, for example, less than or equal to 2. This would lead to the following output:
1: Levenshtein distance: 1 => duplicate
2: Levenshtein distance: 11 => not a duplicate
3: Levenshtein distance: 4 => not a duplicate
4: Levenshtein distance: 8 => not a duplicate
5: Levenshtein distance: 2 => duplicate
6: Levenshtein distance: 4 => not a duplicate
7: Levenshtein distance: 2 => duplicate
This would be an approach which uses a "fixed" algorithm based on the Levenshtein distance.
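For reference, this fixed rule fits in a few lines of plain Python. The sketch below is only illustrative: the function names and the default threshold of 2 are assumptions, and every insertion, deletion and substitution is given a cost of 1.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def is_duplicate(name1, name2, threshold=2):
    """Fixed rule: a pair is a duplicate if the distance is within the threshold."""
    return levenshtein(name1, name2) <= threshold

pairs = [
    ("Marta Sunz", "Martha Sund"),
    ("John Lemond-Lee Peter", "John LL. Peter"),
    ("John Peter", "Johanna Petera"),
]
for name1, name2 in pairs:
    print(name1, "|", name2, "|", levenshtein(name1, name2), is_duplicate(name1, name2))
```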
Now, I would like to do this task using a neural network / machine learning:
I do not need the neural network to detect semantic similarity, like "hospital" and "clinic". However, I would like to avoid the Levenshtein distance, as I would like the ML algorithm to be able to detect "John Lemond-Lee Peter" and "John LL. Peter" as a potential duplicate, even if not with 100% certainty. The Levenshtein distance is relatively high in this case (8), as quite a few characters have to be added. In a case like "John Peter" and "Johanna Petera" the Levenshtein distance is smaller (4), but this is in fact not a duplicate, and I would hope that the ML algorithm could detect that it is likely not one. So I need the ML algorithm to "learn the way I need the duplicates to be checked". The labelling I provide as input would give the ML algorithm the direction of what I want.
I actually thought that this should be an easy task for a ML algorithm / neural network, but I am not sure.
How can I implement a neural network to compare the pairs of names and identify duplicates without using an explicit distance metric (like the Levenshtein distance, euclidean etc.)?
I thought that it would be possible to convert the strings to numbers, so that a neural network can work with them and learn to detect duplicates according to my labelling style, without my having to specify a distance metric. I thought about a human: I would give this task to a person, and this person would judge and make a decision. This person has no clue about the Levenshtein distance or any other mathematical concept. So I just want to train the neural network to do what the human is doing. Of course, every human is different, and it also depends on my labelling.
(Edit: The ML/neural network solutions I have seen so far (like this) use a metric like Levenshtein as a feature input. But as I said, I thought it should be possible to teach the neural network the "human judgement" without making use of such a distance measure. Regarding my specific case of pairs of names: what would be the benefit of an ML approach that uses the Levenshtein distance as a feature? It will just detect those pairs of names as duplicates that have a low Levenshtein distance. So I could use a simple algorithm that marks a pair as a duplicate if the Levenshtein distance between the two names is less than x. Why use ML instead, what would be the additional benefit?)

In my experience, OpenAI's GPT-3 works well with such tasks (I'm using it for analyzing astrophysical texts). You describe the task in natural language and then provide a few examples for few-shot learning. Here's a quick experiment I performed in the OpenAI Playground (green text was generated by GPT-3):

A naive approach will be somewhat similar to using the Levenshtein distance. First, convert both names to vectors via a pretrained language model (I think FastText will be the best choice, as it uses n-grams and will be more sensitive to individual characters). Then combine these two vectors (the first thing that comes to mind is to compute a metric, e.g. the Euclidean distance between them). Now you can treat this task as a classification problem and pass the calculated metric (or another function of the two vectors) together with the label (duplicate/not duplicate) to a classifier. So, in fact, you will still be computing a distance between names, but between their high-dimensional representations instead of the names themselves.
This approach probably isn't the best choice, but it can be a nice baseline for your task. Your problem is a special case of so-called similarity learning, so you can do some research and choose a specific method from this field.
You can also take a look at this paper. Its authors use character-based measures to vectorize texts and then pass them to ML models.
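As a rough illustration of the character-based idea (this is neither FastText nor the method from the paper, just plain character trigram counts swapped in as a stand-in), the sketch below turns each name into a unit-length n-gram vector and uses the Euclidean distance between the two vectors as a single feature. The function names and the choice of n=3 are assumptions.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, padded with spaces so name boundaries are captured."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def unit_vector(counts):
    """Scale a sparse count vector to length 1 so that name length matters less."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}

def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def pair_feature(name1, name2):
    """One distance feature that could be passed to a classifier with its label."""
    return euclidean(unit_vector(char_ngrams(name1)), unit_vector(char_ngrams(name2)))

print(pair_feature("Mike Miller", "Mike Miler"))
print(pair_feature("John Doe", "Pete McGillen"))
```

A classifier trained on such features (possibly several of them per pair) learns the decision threshold from the labels instead of having it fixed by hand.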

The task you are solving is usually called fuzzy matching. There are some libraries that implement well-known algorithms that may help you, like fuzzyset, fuzzywuzzy or difflib. Consider trying some of them.
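Of the libraries mentioned, difflib ships with Python, so a first experiment needs no installation. A minimal sketch, where the 0.8 cutoff and the list of known names are arbitrary choices made for this example:

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(name1, name2):
    """Ratio in [0, 1]; 1.0 means the two strings are identical (ignoring case)."""
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()

print(similarity("Mike Miller", "Mike Miler"))
print(similarity("John Doe", "Pete McGillen"))

# Closest candidate from a list of known names; the cutoff is an arbitrary choice.
known = ["Mike Miller", "John Doe", "Sara Johnson"]
print(get_close_matches("Mike Miler", known, n=1, cutoff=0.8))
```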
If you still want a machine learning approach, keep in mind that your first requirement is a dataset of text pairs labeled as match or no match; with that you can implement a binary classifier.
As a general rule, classical machine learning algorithms require less data and less parameter tuning to solve the task, but you need to provide better features to the model (which often means you spend more time in the feature engineering stage). I think your problem is simple enough to be solved with classical machine learning alone.
If you want to try neural networks, you could try Siamese networks or implement a binary classifier.
That said, make sure that your implementation considers the input text at the character level instead of the word level.

I have carefully read your whole question, but I still don't know why you want a neural network for this.
Real, sad answer
Tweak the edit distance (use a more general distance than Levenshtein) by adding some weights. The idea: swapping characters that are close on the keyboard is more likely than swapping those that are far apart. So the distance between Asa and Ada is smaller than between Asa and Ala.
Case (4) you can cover with regex.
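A minimal sketch of such a weighted edit distance follows. It assumes a QWERTY layout, treats only left/right neighbours on the same row as "close", and charges 0.5 for such a substitution; all three choices are illustrative assumptions, not part of any standard algorithm.

```python
KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Simplifying assumption: only left/right neighbours on the same row count as "close".
ADJACENT = set()
for row in KEYBOARD_ROWS:
    for left, right in zip(row, row[1:]):
        ADJACENT.add((left, right))
        ADJACENT.add((right, left))

def substitution_cost(a, b, near_cost=0.5):
    """0 for equal characters, near_cost for keyboard neighbours, 1 otherwise."""
    if a == b:
        return 0.0
    return near_cost if (a, b) in ADJACENT else 1.0

def weighted_edit_distance(s, t):
    """Levenshtein-style distance with keyboard-aware substitution costs."""
    s, t = s.lower(), t.lower()
    previous = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        current = [float(i)]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                            # deletion
                current[j - 1] + 1.0,                         # insertion
                previous[j - 1] + substitution_cost(cs, ct),  # substitution
            ))
        previous = current
    return previous[-1]

print(weighted_edit_distance("Asa", "Ada"))  # s and d are neighbours
print(weighted_edit_distance("Asa", "Ala"))  # s and l are not
```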
Happy answer
If you insist on going with an ML solution, here is a sketch of what I would do if forced:
Prepare a lot of labeled pairs (a lot means e.g. 50 thousand).
Pad the names to constant length (e.g. 32 characters).
Apply character level encoding (one-hot should do the job).
Train a binary classifier (e.g. in a form of siamese network) on such inputs.
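The padding and encoding steps above can be sketched in plain Python as follows. The character set is an assumption (lowercase ASCII plus a few punctuation marks), and training the classifier itself is left out because it depends on the deep learning framework used.

```python
import string

ALPHABET = string.ascii_lowercase + " -.'"   # assumed character set
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 32

def one_hot_name(name, max_len=MAX_LEN):
    """Return max_len rows, each a one-hot list over ALPHABET.

    Characters outside ALPHABET and padding positions become all-zero rows;
    names longer than max_len are truncated.
    """
    rows = []
    for ch in name.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:
            row[CHAR_INDEX[ch]] = 1
        rows.append(row)
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows

def encode_pair(name1, name2):
    """Two fixed-size matrices: the inputs to the two branches of a siamese model."""
    return one_hot_name(name1), one_hot_name(name2)

left, right = encode_pair("John Lemond-Lee Peter", "John LL. Peter")
print(len(left), len(left[0]))  # rows x alphabet size
```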

Feedback on Data Science LSTM Project

I realize that this is slightly outside the realm of what sort of questions are normally asked here, so please forgive that. I have been tasked with an open-ended technical screening for a job as a data scientist. This is the first job that has asked me for something like this, so I want to make sure that I am submitting really good work. I was given a dataset and asked to identify the problem and how to use machine learning to solve it, give stats on the target feature, pre-process the data, model the data, and interpret the results.
I am looking for feedback about if I am missing anything huge in my results. High level feedback is fine. Hopefully some of you are data scientists and have either had to complete a technical screening like this or have had to review one and can offer some valuable feedback to an up-and-coming data scientist.
Thank you!
Github Link to Project

Have a look at the Mars Express Power Challenge ("Get the data, model and predict the thermal power consumption") here: https://kelvins.esa.int/mars-express-power-challenge/
The challenge was to take the data and predict the future power consumption of the orbiter in order to plan how to save energy (when exposed to the sun there is a risk of overheating, and in the solar night a risk of getting too cold).
The teams used different approaches; LSTM is probably the one I would choose. But the winning team wrote a very detailed explanation of their "Feature Engineering and Selection". The point is that what matters is not the tool used but the correct choice of feature extraction and selection.
https://arc.aiaa.org/doi/pdf/10.2514/6.2018-2561
I read both the winning paper and your work. Really I prefer your way.
As you see if you read the paper, your methodology is quite comparable, but they put the feature extraction study at the center of the research.
You may strengthen your work by providing more evidence that you picked the right method for the feature engineering (FE). For example, you could apply two FE methods and compare the results, or explain that you chose one based on the current state of the art, citing a paper that supports the choice.
You could add comparative results for ARIMA, VAR, VARMA and your model to illustrate the "outperform" claim, and reference state-of-the-art papers from the past 3 years in the field, as well as recent publications on LSTM for energy consumption prediction.
Your document ends abruptly; one would expect a proper conclusion, as we are used to finding in a regular paper.
That's it.
(Please don't rely on my opinion alone, as I don't consider myself a data scientist :) I will be very proud of myself the day I am able to produce what you have done ;) Thanks for sharing, it was nice to read.)

If I was the evaluator, I would ask questions like,
1) What is the research/business problem?
Suggestion: Begin the report by clearly specifying the question
2) What are the existing solutions to solve the problem?
Suggestion: Add a brief literature review on existing solutions for similar problems and their results preferably in a tabular format.
3) Briefly elaborate on the descriptive and multivariate properties of the data.
Suggestion: Add descriptive and inferential statistics on the data including some preliminary hypothesis that can be derived from the variable correlations.
4) Why did you choose this particular approach to solve the problem?
Suggestion: Give credible justification backed up by quantitative hypothetical example solutions, that are in favour of the proposed approach.
5) If it's a classification task, I would ask a question like, "What is the baseline accuracy of the model?" And if it's a clustering task, "What is the baseline for cluster purity?"
Suggestion: Find this accuracy from the target variable distribution.
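For a classification task, one common reading of this suggestion is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. A minimal sketch, with a made-up target column and a hypothetical function name:

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

labels = ["no", "no", "yes", "no", "yes"]   # made-up target column
print(majority_baseline(labels))
```

Any model in the report should be compared against this number; a classifier with 90% accuracy is not impressive if 90% of the rows share one label.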
Finally, you need to understand, why such an open-ended question is asked. There can be two possibilities;
(a) The company is new with reference to data science and is unsure of what they are looking for, meaning, they do not have either the required expertise to evaluate the candidate skills or they are simply unsure of what is their requirement. If this is the case, then it's imperative that the report is as simple and detailed as possible. Stay away from throwing jargon.
OR
(b) the company is experienced in data science and this is a filtering test. To filter out the self-proclaimed data scientist nincompoops, who think chaining some ready-made solution steps (like preprocessing, dimensionality reduction, modelling) solves a problem. The underlying idea is to figure out the analytical capabilities of a candidate.
Therefore, write the report wisely and ensure nothing is falsified.
Best of luck.

Sentence meaning Similarity and frequency

I have a set of verbatims/sentences. What I am trying to do is this: if two sentences have the same meaning, they should be replaced by the original one, and afterwards I need to count the frequency of such sentences.
Is there a way I can do it in NLTK? Any suggestions in this regard are welcome and appreciated.
I am looking for NLP approach.
Thanks

I would consider using some more up-to-date ideas for word/document embeddings for sentence similarity, such as:
https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1
https://github.com/facebookresearch/StarSpace - recently this implementation has been added to RASA NLU - https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/classifiers/embedding_intent_classifier.py
https://github.com/commonsense/conceptnet-numberbatch
http://alt.qcri.org/semeval2017/task1/ - SemEval is an annual competition on NLP tasks, and Semantic Textual Similarity is one of them. It could be a really nice source of ideas for you.
On the one hand, sentence embeddings can be used to compare sentences directly; on the other hand, word embeddings can be averaged or summed to get an embedding for the whole sentence. To compare sentence vectors, metrics such as cosine similarity can be used.
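The averaging idea can be sketched without any library. The three-dimensional vectors below are invented purely for illustration; in practice they would come from a pretrained model such as the ones linked above.

```python
import math

# Toy 3-dimensional vectors, invented for illustration; real embeddings
# have hundreds of dimensions and come from a pretrained model.
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "barks": [0.7, 0.0, 0.3],
    "stock": [0.0, 0.9, 0.2],
    "market": [0.1, 0.8, 0.3],
    "falls": [0.1, 0.6, 0.5],
}

def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a = sentence_vector("the dog barks")
b = sentence_vector("a puppy barks")
c = sentence_vector("stock market falls")
print(cosine(a, b), cosine(a, c))
```

Sentences whose cosine similarity exceeds some threshold could then be mapped to one representative sentence before counting frequencies; the threshold has to be tuned on real data.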

I found some papers that might give you a few ideas on how to solve this problem. They use WordNet, a lexical database that can be used to check the similarity of words and is available through NLTK:
Corley, Courtney, and Rada Mihalcea. "Measuring the semantic similarity of texts." Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment. Association for Computational Linguistics, 2005.
--> translates word-to-word similarity at a text level and I believe you can adapt it for sentences. (https://aclanthology.info/pdf/W/W05/W05-1203.pdf)
Honeck, Richard P. "Semantic similarity between sentences." Journal of psycholinguistic research 2.2 (1973): 137-151. --> Here is another paper that calculates similarity scores between sentences.
I only skimmed the two papers, but it seems that the first paper uses syntactic and semantic similarity techniques sequentially, whereas the second one uses them in parallel.
Miller, George A., and Walter G. Charles. "Contextual correlates of semantic similarity." Language and cognitive processes 6.1 (1991): 1-28. --> This is a linguistics paper which might give you a better understanding on how to compare the semantic similarity of sentences in case the first two methods do not work out for you, and you have to come up with your own solution.
Good luck and hope this helps!

Word clustering in python

How can I cluster only the words in a given set of data? I have been going through a few algorithms online, like the k-means algorithm, but they seem to be aimed at document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please note that I am new to Python.

Based on the fact that my last answer was indeed a false answer since it was used for document clustering and not word clustering, here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a tool from Google, based on a shallow neural network, that works really well. It transforms words into vector representations, and therefore allows you to do many things with them.
For example, one of its well-known properties is that it captures algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What it means by that is it can sort of encompass the context of a word, and therefore it will work really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can for example do a k-means clustering with a cosine distance as the measure of dissimilarity...
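A minimal sketch of k-means with cosine similarity (often called spherical k-means) in plain Python is shown here. The two-dimensional word vectors are invented for illustration and would normally come from word2vec; the deterministic farthest-first initialisation is an assumption chosen to keep the example reproducible.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity and return one label per vector."""
    points = [normalize(v) for v in vectors]
    # Deterministic farthest-first initialisation (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        far = min(points, key=lambda p: max(cosine(p, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j])) for p in points]
        # Recompute each centroid as the normalized mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Invented toy vectors; real ones would be looked up in a trained word2vec model.
words = ["dog", "cat", "car", "bus"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = spherical_kmeans(vectors, k=2)
print(dict(zip(words, labels)))
```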
I hope this answers your question. You can read more about word2vec in various papers or websites if you'd like. I won't link them here since they are not the subject of the question.

Word clustering will be really disappointing because the computer does not understand language.
You could use the Levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.

Sentiment analysis for Twitter in Python [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source implementation I can use?
I'm writing an application that searches twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets.
I'm using Google's App Engine, so it's in Python. I'd like to be able to classify the search results returned from Twitter, and I'd like to do that in Python.
I haven't been able to find such sentiment analyzer so far, specifically not in python.
Are you familiar with such open source implementation I can use? Preferably this is already in python, but if not, hopefully I can translate it to python.
Note, the texts I'm analyzing are VERY short, they are tweets. So ideally, this classifier is optimized for such short texts.
BTW, twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately, the classification provided by them isn't that great, so I figured I might give this a try myself.
Thanks!
BTW, an early demo is here and the code I have so far is here, and I'd love to open-source it with any interested developer.

Good luck with that.
Sentiment is enormously contextual, and tweeting culture makes the problem worse because you aren't given the context for most tweets. The whole point of twitter is that you can leverage the huge amount of shared "real world" context to pack meaningful communication in a very short message.
If they say the video is bad, does that mean bad, or bad?
A linguistics professor was lecturing
to her class one day. "In English,"
she said, "A double negative forms a
positive. In some languages, though,
such as Russian, a double negative is
still a negative. However, there is no
language wherein a double positive can
form a negative."
A voice from the back of the room
piped up, "Yeah . . .right."

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that easily cooperate with it.
You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. This has been used for classification tasks in various areas of natural language processing. You also have a pick of a number of different models. I'd suggest starting with Maximum Entropy classification so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.
The University of Texas at Austin computational linguistics groups have held classes where most of the projects coming out of them have used this great tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.
Another great tool which works in the same vein is Mallet. The difference with Mallet is that there's a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical stuff, but it's really mostly meant for pedagogical purposes, and isn't really something I'd put into production.
Good luck with your task. The really difficult part will probably be the amount of knowledge engineering required up front for you to classify the 'seed set' off of which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run some tenfold or leave-one-out tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.
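The held-out testing mentioned above can be sketched as a plain k-fold split. The helper below is hypothetical and only produces index lists; training and scoring a model on each split is left to whichever toolkit is used.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Setting k equal to the number of labeled tweets turns this into the one-at-a-time variant, at the cost of training the model once per tweet.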

Thanks everyone for your suggestions, they were indeed very useful!
I ended up using a Naive Bayesian classifier, which I borrowed from here.
I started by feeding it with a list of good/bad keywords and then added a "learn" feature by employing user feedback. It turned out to work pretty nicely.
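As an illustration of the general idea (this is not the borrowed classifier, just a minimal multinomial Naive Bayes with add-one smoothing, and the class and method names are made up), seeding with keywords and learning from feedback could look like this:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def learn(self, text, label):
        """Add one labeled example; also usable for later user feedback."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

classifier = NaiveBayes()
classifier.learn("great love awesome", "happy")   # seed keywords
classifier.learn("bad hate awful", "sad")
print(classifier.classify("i love youtube"))
classifier.learn("this rocks", "happy")           # feedback from a user
print(classifier.classify("youtube rocks"))
```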
The full details of my work are in a blog post.
Again, your help was very useful, so thank you!

I have constructed a word list labeled with sentiment. You can access it from here:
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip
You will find a short Python program on my blog:
http://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/
This post displays how to use the word list with single sentences as well as with Twitter.
Word list approaches have their limitations. You will find an investigation of the limitations of my word list in the article "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs". That article is available from my homepage.
Please note that a unicode(s, 'utf-8') is missing from the code (for pedagogical reasons).
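The word list idea can be sketched in a few lines. The scores below are placeholders made up for this example, not values copied from the actual list, and the function name is hypothetical:

```python
import re

# Placeholder valence scores for illustration; the real list is in the zip file above.
WORD_SCORES = {"good": 2, "love": 3, "bad": -2, "hate": -3}

def sentiment(text):
    """Sum of the scores of known words; 0 means neutral or no known words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_SCORES.get(w, 0) for w in words)

print(sentiment("I love YouTube"))
print(sentiment("bad, bad video"))
```

With the real list, the dictionary would be filled by reading the word and score pairs from the downloaded file instead of being written by hand.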

A lot of research papers indicate that a good starting point for sentiment analysis is looking at adjectives, e.g., are they positive adjectives or negative adjectives. For a short block of text this is pretty much your only option... There are papers that look at entire documents, or sentence level analysis, but as you say tweets are quite short... There is no real magic approach to understanding the sentiment of a sentence, so I think your best bet would be hunting down one of these research papers and trying to get their data-set of positively/negatively oriented adjectives.
Now, this having been said, sentiment is domain specific, and you might find it difficult to get a high-level of accuracy with a general purpose data-set.
Good luck.

I think you may find it difficult to find what you're after. The closest thing that I know of is LingPipe, which has some sentiment analysis functionality and is available under a limited kind of open-source licence, but is written in Java.
Also, sentiment analysis systems are usually developed by training a system on product/movie review data which is significantly different from the average tweet. They are going to be optimised for text with several sentences, all about the same topic. I suspect you would do better coming up with a rule-based system yourself, perhaps based on a lexicon of sentiment terms like the one the University of Pittsburgh provide.
Check out We Feel Fine for an implementation of a similar idea with a really beautiful interface (and twitrratr).

Take a look at the Twitter sentiment analysis tool. It's written in Python, and it uses a Naive Bayes classifier with semi-supervised machine learning. The source can be found here.

Maybe TextBlob (based on NLTK and pattern) is the right sentiment analysis tool for you.

I came across Natural Language Toolkit a while ago. You could probably use it as a starting point. It also has a lot of modules and addons, so maybe they already have something similar.

Somewhat wacky thought: you could try using the Twitter API to download a large set of tweets, and then classifying a subset of that set using emoticons: one positive group for ":)", ":]", ":D", etc, and another negative group with ":(", etc.
Once you have that crude classification, you could search for more clues with frequency or ngram analysis or something along those lines.
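A minimal sketch of this two-step idea, working on tweets that have already been downloaded (fetching them through the Twitter API is left out). The ":[" emoticon is an addition made for this example, and tweets containing both kinds of emoticon are simply skipped:

```python
from collections import Counter

POSITIVE = (":)", ":]", ":D")
NEGATIVE = (":(", ":[")   # ":[" is an addition made for this sketch

def emoticon_label(tweet):
    """Return 'pos', 'neg', or None when the tweet has no or mixed emoticons."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"

def word_frequencies(tweets):
    """Word counts per emoticon-derived class, emoticons themselves excluded."""
    freqs = {"pos": Counter(), "neg": Counter()}
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is None:
            continue
        words = [w.lower() for w in tweet.split() if w not in POSITIVE + NEGATIVE]
        freqs[label].update(words)
    return freqs

tweets = ["great video :D", "boring video :(", "just a video"]
print(word_frequencies(tweets))
```

Words that are frequent in one class and rare in the other are the "clues" that can then be used to classify tweets without emoticons.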
It may seem silly, but serious research has been done on this (search for "sentiment analysis" and emoticon). Worth a look.

There's a Twitter Sentiment API by TweetFeel that does advanced linguistic analysis of tweets, and can retrieve positive/negative tweets. See http://www.webservius.com/corp/docs/tweetfeel_sentiment.htm

For those interested in coding Twitter sentiment analysis from scratch, there is a Coursera course "Data Science" with Python code on GitHub (as part of assignment 1 - link). The sentiment scores come from AFINN-111.
You can find working solutions, for example here. In addition to the AFINN-111 sentiment list, there is a simple implementation of building a dynamic term list based on the frequency of terms in tweets that have a pos/neg score (see here).
