I need to generate paraphrase of an english sentence using the PPDB paraphrase database
I have downloaded the datasets from the website.
I would say your first step needs to be reducing the problem into more manageable components. Second figure out whether you want to paraphrase on a one-to-one, lexical, syntactical, phrase or combination basis. To inform this decision I would take one sentence and paraphrase it myself in order to get an idea of what I am looking for. Next I would start writing a parser for the downloaded data. Then I would remove the stopwords and incorporate a part-of-speech tagger like the ones included in spaCy or nltk for your example phrase.
Since they seem to give you all the information needed to make a successive dictionary filter that is where I would start. I would write a filter which found the parts of speech for each word in my sentence in the [LHS] column of the dataset and select a source that matches the word while minimizing/maximizing the value of 1 feature (like minimizing WordLenDiff) which in the case of "businessnow" <- "business now" = -1.5. Keeping track of the target feature you will then have a basic paraphrased sentence.
using this strategy your output could turn:
"the business uses 4 gb standard."
sent_score = 0
into:
"businessnow uses 4gb standard"
sent_score = -3
After you have a basic example the you can start exploring feature selection algorithms in like those in scikit-learn, etc. and incorporate word alignment. But I would seriously cut down on the scope of the problem and increase it gradually. In the end, how you approach the problem it depends on what the designated use is and how functional it needs to be.
Hope this helps.
Related
my goal is very simple: I have a set of strings or a sentence and I want to find the most similar one within a text corpus.
For example I have the following text corpus: "The front of the library is adorned with the Word of Life mural designed by artist Millard Sheets."
And I'd like to find the substring of the original corpus which is most similar to: "the library facade is painted"
So what I should get as output is: "fhe front of the library is adorned"
The only thing I came up with is to split the original sentence in substrings of variable lengths (eg. in substrings of 3,4,5 strings) and then use something like string.similarity(substring) from the spacy python module to assess the similarities of my target text with all the substrings and then keep the one with the highest value.
It seems a pretty inefficient method. Is there anything better I can do?
It probably works to some degree, but I wouldn't expect the spacy similarity method (averaging word vectors) to work particularly well.
The task you're working on is related to paraphrase detection/identification and semantic textual similarity and there is a lot of existing work. It is frequently used for things like plagiarism detection and the evaluation of machine translation systems, so you might find more approaches by looking in those areas, too.
If you want something that works fairly quickly out of the box for English, one suggestion is terp, which was developed for MT evaluation but shown to work well for paraphrase detection:
https://github.com/snover/terp
Most methods are set up to compare two sentences, so this doesn't address your potential partial sentence matches. Maybe it would make sense to find the most similar sentence and then look for substrings within that sentence that match better than the sentence as a whole?
I am wondering for the term in Machine Learning, Deep Learning, or in Natural Language Processing that split the word in a paragraph when there is no space between them.
example:
"iwanttocook"
become:
"i want to cook"
It wouldn't be easy since you do not have the character to tokenize the word.
I appreciate any help
You could achieve this using the polyglot package. There is an option for morphological analysis.
This kind of analysis is based on morfessor models trained on most frequent words to encounter morphemes ("primitive units of syntax, the smallest individually meaningful elements in the utterances of a language").
From the documentation:
from polyglot.text import Text
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
print(text.morphemes)
The output would be:
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
Note that if you want to start working with polyglot, you should first read the documentation carefully, as there are a few things to consider, for example the downloading of language specific models.
I created an simple application using Scala to do this.
https://github.com/shredder47/Nonspaced-Sentence-Tokenizer
Currently I'am working on a project which is emotions detecting (happy,sad etc) from text in a chat application using Python and NLTK. I'am not much familiar with NLP and Python. As a basic way, I hope to use keyword based method. In that case I have to make a emotional keyword list under each emotional state and need to find whether there is any emotional keyword in given sentence and identify the relevant emotional state accordingly. So what I need here to know, do I need to create a training data set and feature list to do that task, If yes, how can I do it. Please help me.
You will need a set of words that have been labeled. One place to start is the AFINN sentiment dictionary which is a large set of words that have been manually labeled. The slides by Wei-Ting Kuo shows how to use the AFINN word set.
Laurent Luce's blog walks through the entire sentiment analysis process using Tweets although he starts with a labeled training set.
Also take a look at NLTK's 'How To' on sentiment analysis
There are a number of emotion data sets that may help at https://www.w3.org/community/sentiment/wiki/Datasets#Emotions_datasets_by_Media_Core_.40_UFL.
The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NPL but I have to precise that I don't want to extract words using POS tagging (or at least this is not the expected final outcome but maybe a necessary preliminary step).
I read about Wordnet for sense desambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input) nor obtain the definition of one selected word thanks to the context provided by the whole bunch of words (which word to select in this case ?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the xml structure of the wordnet) and then summarize the context in one or few words.
Any ideas ? I can use R or python, I read a little about nltk but I don't find a way to use it in my context.
Your best bet is probably is to label the clusters manually, especially if there are few of them. This a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) you may discover you had the wrong number of clusters (k parameter) or that there was too much junk in the input to begin with. 2) you will gain qualitative insight into what is being talked about and what topic there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...
When we talk about semantics in this area we mean Statistical Semantics. The statistical or distributional semantics is very different from other definitions of semantics which has logic and reasoning behind it. Statistical semantics is based on Distributional Hypothesis, which considers context as meaning aspect of words and phrases. Meaning in very abstract and general sense in different litterers is called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide word similarity metric or suggest a list of similar words for a document as another context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
However, for several reasons you might accept low accuracy assignment of a word as the general topic (or as in your words "global semantic") to a list of phrases. If this is the case, I would suggest to take a look at Word Sense Disambiguation tasks which look for coarse grained word senses. For WordNet, it might be called supersense tagging task.
This paper worth to take a look: More or less supervised supersense tagging of Twitter
And about your question about choosing words from current phrases, there is also an active question about "converting phrase to vectors", my answer to that question in word2vec fashion might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if it comes to my mind.
The paper Automatic Labelling of Topic Models explains the author's approach to this problem. To provide an overview I can tell you that they generate some label candidates using the information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the below papers suggest is identifying the set of keywords from the cluster, getting all the synonyms and then finding the hypernyms for each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: A word cluster containing words dog and wolf should not be labelled with either word but as canids. They achieve it using synonymy and hypernymy.
Cluster Labeling by Word Embeddings
and WordNet’s Hypernymy
Automated Text Clustering and Labeling using Hypernyms
Is there any metric that measures wealth of information on a text?
I am thinking in terms of anything that can reliably show unique information segments within a text. Simple metrics using frequency distributions or unique words are okay but they don't quite show unique information in sentences.
Using coding methods I would have to manually code each sentence/word or anything that would count as unique piece of information in a text but that could take a while. So, I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I can count 4 unique information points: What is Navtilos, where is it, how it was created and when.
Obviously a human interprets text different than a computer. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine but be reliable across different sentences.
A frequency distribution may work effectively but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that uniqueness of a keyword to a document reflects how well it describes the contents of the document). For this, you can use a measure like tf-idf.