I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song"
From the above sentences, I need to extract the following phrases:
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
spaCy has no built-in way to extract summarized sentences like these, but the following methods might work for you:
The simplest one is to extract the noun phrases or verb phrases. Most of the time that should give you the text you want. (Phrase structure grammar)
You can use dependency parsing and extract the dependencies of the central word.
Dependency grammar
You can train a sequence model where the input is the full sentence and the output is your summarized sentence.
Sequence models for text summarization
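As a rough illustration of the dependency-based option, the sketch below drops the subject of the main verb and keeps the rest of the sentence. The parse is written by hand purely for illustration (word, index of its head, dependency label), in the style of what a parser such as spaCy produces; a real parser may attach or label the words differently. The names SUBJECT_DEPS, subtree and predicate_text are made up for this example.

```python
# Hand-written parse (an assumption, not real parser output):
# each entry is (word, index of head word, dependency label).
# The root points at itself, following spaCy's convention.
PARSE = [
    ("Perpetually", 1, "advmod"),
    ("Drifting", 2, "nsubj"),
    ("is", 2, "ROOT"),
    ("haunting", 2, "acomp"),
    ("in", 3, "prep"),
    ("all", 8, "predet"),
    ("the", 8, "det"),
    ("best", 8, "amod"),
    ("ways", 4, "pobj"),
    (".", 2, "punct"),
]

# Labels treated as "subject of the main verb" (hypothetical choice).
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj"}


def subtree(index, parse):
    """Return the indices of `index` and all of its descendants."""
    result = {index}
    for child, (_, head, _) in enumerate(parse):
        if head == index and child != index:
            result |= subtree(child, parse)
    return result


def predicate_text(parse):
    """Keep everything except the root verb, its subject and punctuation."""
    root = next(i for i, (_, _, dep) in enumerate(parse) if dep == "ROOT")
    skip = {root}
    for i, (_, head, dep) in enumerate(parse):
        if head == root and dep in SUBJECT_DEPS:
            skip |= subtree(i, parse)
    return " ".join(
        word
        for i, (word, _, dep) in enumerate(parse)
        if i not in skip and dep != "punct"
    )


print(predicate_text(PARSE))
```

With this hand-written parse the function returns "haunting in all the best ways". Whether the same rule works on real parser output depends on how the parser analyses the title at the start of the sentence.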
Extracting the meaning of a sentence is quite an arbitrary task. What do you mean by the meaning? Using spaCy you can extract the dependencies between the words (which reflect the structure of the sentence), find the POS tags to check how words are used in the sentence, and also find places, organizations and people using the NER tagger. However, "the meaning of a sentence" is too general a notion, even for humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)
Related
I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such a piece of text I want to be able to identify and extract phrases like "may contain nuts", to preprocess them separately.
These kinds of phrases can change quite a lot in terms of length and content. I thought of using NER taggers, but I don't know if they will do the job correctly as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers
IMHO NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem as it requires LOTS of manual annotation to do it right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities, which can then be queried using some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). The next step would be either to create dictionaries containing the ingredient words & phrases, or to try clustering the vocabulary of the model using cosine similarity between each word/phrase pair.
Now if you want to take things further you can always match your created dictionaries/clusters back to the corpus the W2V model was trained on and then train a custom entity recognition model using those matches as you now have annotated examples.
I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these
Build a named entity recognition model (NER)
Named Entity Recognition labels known spans of tokens as an entity type.
For each input token you have to label it as part of a known named entity.
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea: with some known phrases you can easily search input texts, using word boundaries as minimal context.
import re
text = "milk, sugar, eggs, flour, may contain nuts"
# \b matches a word boundary, so the phrase is not matched inside longer words
re.findall(r"\bmay contain nuts\b", text)
This would require knowing all the phrases you want to search for up front, which might not be possible.
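If several phrases are known, they can be combined into one pattern. This is a minimal sketch using only the standard re module; the phrase list and the helper name build_pattern are made up for illustration.

```python
import re


def build_pattern(phrases):
    """Compile one case-insensitive pattern that matches any known phrase."""
    # Longer phrases first, so that a longer phrase wins over a shorter prefix.
    alternatives = sorted((re.escape(p) for p in phrases), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# Hypothetical list of phrases known up front.
KNOWN_PHRASES = [
    "may contain nuts",
    "may contain traces of milk",
    "produced in a facility",
]

PATTERN = build_pattern(KNOWN_PHRASES)
text = "milk, sugar, eggs, flour, May contain nuts"
matches = PATTERN.findall(text)
print(matches)
```

The limitation stays the same: a phrase that is not in the list is never found.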
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this by knowing the sequences of POS tags you expect:
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
The problem is that you may not know the tag sequences in advance and would have to cover many cases. It will also capture many sequences that you won't want to capture (false positives).
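The tag-sequence idea can be sketched without any NLP library if the tokens are assumed to be tagged already. The tags below are assigned by hand for illustration; in practice they would come from a POS tagger (NLTK, spaCy, ...). The function name find_tag_sequences is made up.

```python
def find_tag_sequences(tagged_tokens, tag_pattern):
    """Return every run of words whose POS tags equal tag_pattern."""
    size = len(tag_pattern)
    hits = []
    for start in range(len(tagged_tokens) - size + 1):
        window = tagged_tokens[start:start + size]
        if tuple(tag for _, tag in window) == tuple(tag_pattern):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Hand-tagged input (an assumption, not real tagger output).
tagged = [
    ("milk", "NN"), (",", ","), ("sugar", "NN"), (",", ","),
    ("eggs", "NNS"), (",", ","), ("flour", "NN"), (",", ","),
    ("may", "MD"), ("contain", "VB"), ("nuts", "NNS"),
]

chunks = find_tag_sequences(tagged, ("MD", "VB", "NNS"))
print(chunks)
```

Any other modal + verb + plural noun sequence in the text would match as well, which is where the false positives come from.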
Find collocations of tokens
This is probably the best way, as it seems you are looking for highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do essentially the same thing: they look for statistically common sequences of tokens which occur in a corpus.
That can then be used to extract the same collocation phrases from new texts.
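To show what "statistically common" can mean, here is a simplified stand-in written with the standard library only. It scores adjacent word pairs by pointwise mutual information (PMI) and keeps pairs seen at least min_count times. It is not the gensim or NLTK implementation, and the tiny corpus is invented for the example.

```python
import math
from collections import Counter

# Invented mini-corpus of ingredient lists, already lowercased.
DOCS = [
    "milk sugar eggs flour may contain nuts",
    "water salt sugar may contain nuts",
    "flour yeast salt may contain traces of milk",
    "cocoa sugar milk may contain soy",
]


def find_collocations(docs, min_count=2):
    """Return (bigram, PMI) pairs for bigrams seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross documents
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


collocations = find_collocations(DOCS)
for bigram, score in collocations:
    print(bigram, round(score, 2))
```

On this toy corpus only ("may", "contain") and ("contain", "nuts") pass the frequency threshold. Chaining overlapping pairs, or running the procedure a second time on merged tokens, is one way to get longer phrases such as "may contain nuts".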
It looks like your ingredient list is easy to split into a list. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on different items in the list to label them as "food" or "non-food". You should be able to start with rules and train a basic classifier using anything really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy), and if there's a verb you can guess that it's not a food item.
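A rule-based starting point along these lines could look like the sketch below. It splits the list on commas and flags items that contain a cue word. The cue-word set is a hypothetical stand-in for the "contains a verb" check; a PoS tagger would replace it in a real pipeline.

```python
# Hypothetical cue words that suggest an item is a statement, not an ingredient.
NON_FOOD_CUES = {"may", "contain", "contains", "produced", "packed", "store"}


def split_items(ingredient_text):
    """Split a comma-separated ingredient string into clean items."""
    return [item.strip() for item in ingredient_text.split(",") if item.strip()]


def label_items(ingredient_text):
    """Label each item 'food' or 'non-food' using the cue-word rule."""
    labels = {}
    for item in split_items(ingredient_text):
        words = set(item.lower().split())
        labels[item] = "non-food" if words & NON_FOOD_CUES else "food"
    return labels


labels = label_items("milk, sugar, eggs, flour, may contain nuts")
print(labels)
```

The labels produced this way can also serve as first training examples for the classifier.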
I'm trying to match an input text (e.g. a headline of a news article) to sets of keywords, so that the best-matching set can be selected.
Let's assume, I have some sets of keywords:
[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]
and as input the (hypothetical) headline: "New Pfizer vaccine might beat COVID-19 pandemic in the next few months." Obviously, it fits the second set of keywords well.
Matching exact words is one way to do it, but more complex situations might arise, for which it might make sense to use base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
I experimented with Spacy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
Related: What NLP tools to use to match phrases having similar meaning or semantics
There is not one correct solution for such a task. You have to try what fits your problem!
Possible ways to solve your problem I can think of:
Matching: either exact or more elaborate, such as lemmatization/stemming or Levenshtein distance.
Embedding Similarity: I guess word similarity would outperform document-keywords similarity, but again, just experiment with it.
Classification: Your problem seems to be a classic classification problem, with each set being one class. If you don't have enough labeled training data, you could try active learning.
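The exact-matching baseline is easy to try first. The sketch below scores each keyword set by the fraction of its keywords found in the headline, using only the standard library. The function names are made up, and there is no lemmatization, so "vaccines" would not match "vaccine".

```python
import re

KEYWORD_SETS = [
    ["democracy", "votes", "democrats"],
    ["health", "corona", "vaccine", "pandemic"],
    ["security", "police", "demonstration"],
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens (keeps 'covid-19')."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def best_keyword_set(headline, keyword_sets):
    """Return (index of best set, per-set scores); ties go to the earliest set."""
    tokens = tokenize(headline)
    scores = [len(tokens & set(keywords)) / len(keywords) for keywords in keyword_sets]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


headline = "New Pfizer vaccine might beat COVID-19 pandemic in the next few months."
best, scores = best_keyword_set(headline, KEYWORD_SETS)
print(best, scores)
```

For the example headline this selects the second set (index 1), since "vaccine" and "pandemic" both occur. If every score is 0, the function still returns index 0, so a real implementation should treat that case as "no match".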
What's the most convenient way to analyze a sentence's phrases and structure using NLTK or spaCy?
The main goal is to get well-organized, clean data in order to apply some inferential statistics to it.
Here is a simple example of what I need, as shown in the tree above:
NP which is a Noun Phrase
VP, a Verb Phrase
ADJP, Adjective Phrase
-, a coordinating conjunction, implying that it is a compound sentence
PP, a Prepositional Phrase
The most convenient way is to use dependency parsing from spacy. https://spacy.io/usage/linguistic-features#dependency-parse
From its output you can extract whatever information you need.
It is important to remember that no parser will ever have perfect accuracy, so it is best to choose a large model to get better quality.
I'm building a text classifier that will classify text into topics.
In the first phase of my program as a part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it removes 'modern' English words such as Facebook, Instagram etc. Does anybody know another, more 'modern' corpus which I can replace or union with the present one?
I prefer nltk corpus but I'm open to other suggestions.
Thanks in advance
Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll be removing words that are not "non-English". And to what purpose? Leave them in, they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.
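As a rough sketch of the statistical idea, the code below compares character-trigram counts of a sentence with per-language profiles using cosine similarity. This is a simple profile comparison rather than a full n-gram language model, and the two training samples are tiny, hand-written placeholders; a usable detector needs far more text per language.

```python
import math
from collections import Counter

# Tiny placeholder samples (assumption); real profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs "
          "through the forest to the river",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "läuft dann durch den wald zum fluss",
}


def trigram_profile(text):
    """Count character trigrams of the lowercased, whitespace-normalized text."""
    normalized = " " + " ".join(text.lower().split()) + " "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(guess_language("this is the text that we want to keep"))
print(guess_language("der hund läuft durch den wald"))
```

Sentences or paragraphs whose best guess is not English can then be dropped as a whole, instead of filtering word by word.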
I'd use Wikipedia, but it's pretty time consuming to tokenize the entirety of it. Fortunately, comparable work has been done for you already. You could use a pretrained Word2Vec model, such as the widely used one trained on roughly 100 billion words of Google News, and just check if the word is in the model.
I also found this project where Chris made text files of the model's 3-million-word vocabulary.
Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to find the union of your list from nltk and this one.
I want to be able to measure the ambiguity of a sentence, and my current idea is to do so by measuring how many ways a sentence can be parsed. For example, the sentence "Fruit flies like a banana" can have two interpretations.
So far I have tried using the Stanford Parser, but it only interpreted each sentence in one way. My other idea was to measure how many different parts of speech each word in a sentence could have, but each POS tagger I found only marked each word with 1 tag even when it could have multiple.
Are there any tools to do either?
From the Stanford Parser FAQ page, hope it helps:
Can I obtain multiple parse trees for a single input sentence?
Yes, for the PCFG parser (only). With a PCFG parser, you can give the option -printPCFGkBest n and it will print the n highest-scoring parses for a sentence. They can be printed either as phrase structure trees or as typed dependencies in the usual way via the -outputFormat option, and each receives a score (log probability). The k best parses are extracted efficiently using the algorithm of Huang and Chiang (2005).
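To make "number of parses" concrete, here is a small sketch that counts parse trees with the CYK algorithm. The grammar and lexicon are toy ones written by hand for the example sentence only (an assumption, unrelated to the Stanford Parser's grammar); a real ambiguity measure needs a broad-coverage grammar, or the k-best output described above.

```python
from collections import defaultdict

# Toy lexicon: every category a word may have (hand-written assumption).
LEXICON = {
    "fruit": {"N", "NP"},
    "flies": {"N", "V"},
    "like": {"V", "P"},
    "a": {"Det"},
    "banana": {"N"},
}

# Toy binary rules in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "N", "N"),
    ("NP", "Det", "N"),
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(tokens, start="S"):
    """Count the distinct parse trees for tokens under the toy grammar."""
    n = len(tokens)
    # chart[(i, j)][symbol] = number of ways symbol derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for category in LEXICON.get(token, ()):
            chart[(i, i + 1)][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    ways = chart[(i, k)].get(left, 0) * chart[(k, j)].get(right, 0)
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)].get(start, 0)


print(count_parses("fruit flies like a banana".split()))
```

Under this toy grammar the sentence has 2 parses: "fruit flies" as the subject of "like", and "fruit" as the subject of "flies" with "like a banana" as a prepositional phrase.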