Extracting all nouns from a string [duplicate] - python

This question already has an answer here:
Closed 11 years ago.
Possible Duplicate:
Extracting nouns from Noun Phrase in NLP
Does anyone have an example of how to extract all nouns from a string using Python's NLTK?
For example, I have this string: "I Like Tomatoes and Lettuce". I want to build a method that returns "Tomatoes" and "Lettuce".
If not in Python, does anyone know of any other solution?

Get the NLTK package, and either use its built-in parser and then this method, or, much faster, part-of-speech tag the string and pull out all the words tagged NN; those are the nouns. Read up on the other part-of-speech tags to find out how to properly handle "I" and "like".
Neither method is flawless, but it's about the best you can do. Accuracy of a good part-of-speech tagger will be above 95% on clean input. I don't think you can reach such accuracy with a WordNet-based method without a lot of extra work.
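For example, a minimal sketch with NLTK's default tokenizer and part-of-speech tagger (the exact tags depend on the tagger model, and the tokenizer and tagger data need to be downloaded once):

    import nltk

    # One-time downloads (package names may vary slightly between NLTK releases):
    # nltk.download('punkt')
    # nltk.download('averaged_perceptron_tagger')

    def extract_nouns(text):
        """Return the tokens tagged as nouns (NN, NNS, NNP, NNPS)."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return [word for word, tag in tagged if tag.startswith('NN')]

    print(extract_nouns("I like tomatoes and lettuce"))
    # Typically: ['tomatoes', 'lettuce']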

Dave Taylor wrote an ad-lib generator in Bash that queried Princeton's WordNet to get this done. You could do something very similar in Python, of course, with WordNet's help.
Here is the link:
Linux Journal - Dave Taylor ad-lib generator.
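For comparison, a WordNet-based check is easy to sketch with NLTK's wordnet corpus, but as noted above it needs extra work, since many ambiguous words also have noun senses:

    from nltk.corpus import wordnet as wn

    # Assumes: nltk.download('wordnet')
    def has_noun_sense(word):
        """True if WordNet lists at least one noun synset for the word."""
        return bool(wn.synsets(word, pos=wn.NOUN))

    print([w for w in "I like tomatoes and lettuce".split() if has_noun_sense(w)])
    # Over-generates: 'like' also has noun senses in WordNet, which is why
    # the part-of-speech tagging route is usually more reliable.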

Related

Building a simple comprehension-based Q&A Python model, without the need for training

Here's the approach I have in mind:
1. Break the data paragraph into sentences, so that each sentence is a separate string.
2. Find the keywords in the question (nouns, verbs, and adjectives), perhaps using POS tagging. Lemmatization might be necessary.
3. Search through all the data strings to find the one that has all or most of these keywords. That sentence is most probably the answer we're looking for.
But I find the implementation difficult, as I'm new to Python and unaware of the necessary commands/libraries. I would be grateful if someone could guide me down this path or suggest a better way to do it.
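A rough sketch of the keyword-overlap idea described above, using NLTK (the data and names are purely illustrative, and the usual tokenizer, tagger, and wordnet downloads are assumed):

    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    CONTENT_TAGS = ('NN', 'VB', 'JJ')  # nouns, verbs, adjectives

    def keywords(text):
        """Lemmatized content words (nouns, verbs, adjectives) in the text."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
        return {lemmatizer.lemmatize(w) for w, t in tagged if t.startswith(CONTENT_TAGS)}

    def best_sentence(paragraph, question):
        """Return the sentence sharing the most keywords with the question."""
        q_words = keywords(question)
        sentences = nltk.sent_tokenize(paragraph)
        return max(sentences, key=lambda s: len(keywords(s) & q_words))

    data = "The Nile is the longest river in Africa. It flows through eleven countries."
    print(best_sentence(data, "Which river is the longest in Africa?"))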

How to Grab meaning of sentence using NLP?

I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song
From above sentences, I need to extract the following sentences
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
It is not possible to extract summarized sentences directly with spaCy, but I hope the following methods might work for you:
The simplest is to extract the noun phrases or verb phrases. Most of the time that should give you the text you want (phrase structure grammar).
You can use dependency parsing and extract the center word's dependencies (dependency grammar).
You can train a sequence model where the input is the full sentence and the output is your summarized sentence (sequence models for text summarization).
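For the first option, a minimal sketch of noun-phrase extraction with spaCy (assumes the en_core_web_sm model is installed; the exact chunks depend on the parser):

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Perpetually Drifting is haunting in all the best ways.")
    # doc.noun_chunks yields the base noun phrases found by the parser
    print([chunk.text for chunk in doc.noun_chunks])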
Extracting the meaning of a sentence is quite an arbitrary task. What do you mean by "meaning"? Using spaCy you can extract the dependencies between the words (which specify the meaning of the sentence), find the POS tags to check how words are used in the sentence, and also find places, organizations, and people using the NER tagger. However, the meaning of a sentence is too general a notion even for humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)

Find 'modern' nltk words corpus [duplicate]

This question already has answers here:
nltk words corpus does not contain "okay"?
(2 answers)
Closed 5 years ago.
I'm building a text classifier that will classify text into topics.
In the first phase of my program, as part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it doesn't contain 'modern' English words such as Facebook, Instagram, etc., so those get removed too. Does anybody know of another, more 'modern' corpus that I can use to replace, or union with, the present one?
I prefer nltk corpus but I'm open to other suggestions.
Thanks in advance
Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll be removing words that are not "non-English". And to what purpose? Leave them in; they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.
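As one off-the-shelf way to do that, here is a sketch using the langdetect package (just one option for statistical language identification; any similar library would do):

    # pip install langdetect
    from langdetect import detect

    sentences = [
        "I posted this on Facebook and Instagram yesterday.",
        "Dies ist ein Satz in deutscher Sprache.",
    ]
    # Keep only the sentences classified as English, instead of filtering word by word
    english_only = [s for s in sentences if detect(s) == "en"]
    print(english_only)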
I'd use Wikipedia, but it's pretty time-consuming to tokenize the entirety of it. Fortunately, it's been done for you already. You could use a Word2Vec model trained on 100 billion words of Wikipedia and just check whether the word is in the model.
I also found this project where Chris made text files of the model's 3-million-word vocabulary.
Note that this project's list of words doesn't contain some stop words, so it would be a good idea to take the union of your list from nltk and this one.
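A sketch of the vocabulary check with gensim (the file path is an assumption; use whichever pretrained word2vec binary you download):

    from gensim.models import KeyedVectors

    # Path is illustrative: any pretrained vectors in word2vec binary format
    vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

    for word in ["Facebook", "Instagram", "qwertyuiop"]:
        # Membership test; note pretrained vocabularies are often case-sensitive
        print(word, word in vectors)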

Python NLTK PoS Tag inaccurate [duplicate]

This question already has answers here:
Python NLTK pos_tag not returning the correct part-of-speech tag
(3 answers)
Closed 6 years ago.
I've been trying to improve the POS tagger in NLTK for a few days, but I cannot figure it out. Right now, the default tagger is really inaccurate and tags most words as 'NN'. How can I improve the tagger to make it more accurate? I've already looked up training the tagger, but I can't get it to work.
Does anybody have a simple method for this? Thanks a lot.
Are you doing it one word at a time or over a large corpus? POS-tagging algorithms usually use the probability that a word is a given tag type, e.g. "NN", but they also use the surrounding sentence context to predict, so the more words you supply, the more likely the tags are to be accurate.
You can also try combining unigram, bigram, trigram, etc. taggers to get higher accuracy at the cost of performance. You can read about doing that here: http://www.nltk.org/book/ch05.html
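Here is a short sketch of that idea, training backoff taggers on the tagged Treebank sample that ships with NLTK (assumes nltk.download('treebank'); the evaluation method name varies between NLTK releases):

    import nltk
    from nltk.corpus import treebank

    tagged_sents = treebank.tagged_sents()
    train, test = tagged_sents[:3000], tagged_sents[3000:]

    # Each tagger falls back to a simpler one when it has no data for the context
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train, backoff=t0)
    t2 = nltk.BigramTagger(train, backoff=t1)

    print(t2.accuracy(test))   # NLTK >= 3.6; older releases use t2.evaluate(test)
    print(t2.tag(nltk.word_tokenize("I like tomatoes and lettuce")))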

Named Entity Recognition from personal Gazetter using Python

I am trying to do named entity recognition in Python using NLTK.
I want to extract a personal list of skills.
I have the list of skills and would like to search for them in a requisition and tag them.
I noticed that NLTK has NER tags for predefined categories like Person, Location, etc.
Is there an external gazetteer tagger in Python I can use?
Any idea how to do this in a more sophisticated way than a plain search of terms (sometimes multi-word terms)?
Thanks,
Assaf
I haven't used NLTK enough recently, but if you have words that you know are skills, you don't need to do NER; just do a text search.
Maybe use Lucene or some other search library to find the text, and then annotate it? That's a lot of work, but if you are working with a lot of data that might be OK. Alternatively, you could hack together a regex search, which will be slower but will probably work fine for smaller amounts of data and be much easier to implement.
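A minimal sketch of the regex route, with an illustrative skill list (longer terms are tried first so multi-word skills match as a unit):

    import re

    # Illustrative gazetteer of skills, including multi-word terms
    skills = ["machine learning", "project management", "python", "sql"]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in sorted(skills, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )

    requisition = "Looking for a Python developer with machine learning and SQL experience."
    print(pattern.findall(requisition))
    # ['Python', 'machine learning', 'SQL']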
Have a look at RegexpTagger and possibly RegexpParser; I think that's exactly what you are looking for.
You can create your own POS tags, i.e. map skills to a tag, and then easily define a grammar.
Some sample code for the tagger is in this pdf.
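A rough sketch of that idea (the skill patterns and the SKILL tag are illustrative, not part of NLTK):

    import nltk

    # Tag known skills with a custom SKILL tag; everything else falls through to NN
    tagger = nltk.RegexpTagger([
        (r'^(Python|SQL|Java)$', 'SKILL'),
        (r'.*', 'NN'),
    ])

    # A small grammar that groups one or more consecutive SKILL tokens into a chunk
    parser = nltk.RegexpParser('SKILLS: {<SKILL>+}')

    tokens = nltk.word_tokenize("Experience with Python and SQL required")
    print(parser.parse(tagger.tag(tokens)))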
