NLTK RegexpParser, chunk phrase by matching exactly one item - python

I'm using NLTK's RegexpParser to chunk a noun phrase, which I define with a grammar as
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
cp = RegexpParser(grammar)
This is grand, it is matching a noun phrase as:
DT if it exists
JJ in whatever number
NN or NNS, at least one
Now, what if I want the same match, but with exactly one JJ instead of any number? That is, I want to match DT if it exists, one JJ, and one or more NN/NNS. If there is more than one JJ, I want to keep only one of them, the one nearest to the noun (together with DT, if present, and the NN/NNS).
The grammar
grammar = "NP: {<DT>?<JJ><NN|NNS>+}"
would match only when there is just one JJ. The grammar
grammar = "NP: {<DT>?<JJ>{1}<NN|NNS>+}"
which I thought would work given the typical Regexp patterns, raises a ValueError.
For example, in "This beautiful green skirt", I'd like to chunk "This green skirt".
So, how would I proceed?

The grammar grammar = "NP: {<DT>?<JJ><NN|NNS>+}" is correct for the requirement as you first stated it.
In the example you gave in the comments, the DT is missing from the output:
"This beautiful green skirt is for you."
Tree('S', [('This', 'DT'), ('beautiful', 'JJ'), Tree('NP', [('green','JJ'),
('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
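In this example there are two consecutive JJs, which does not meet the requirement you stated: "I want to match DT if it exists, one JJ and 1+ NN/NNS." The DT is not directly followed by the single JJ that precedes the noun, so it is left outside the chunk.
For the updated requirement: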
I want to match DT if it exists, one JJ and 1+ NN/NNS. If there are more than one JJ, I want to match only one of them, the one nearest to the noun (and DT if there is, and NN/NNS).
Here, you will need to use
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
and post-process the NP chunks to remove the extra JJs.
Code:
from nltk import Tree
chunk_output = Tree('S', [Tree('NP', [('This', 'DT'), ('beautiful', 'JJ'), ('green','JJ'), ('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
for child in chunk_output:
    # Only NP subtrees are filtered; plain (word, tag) leaves are skipped.
    if isinstance(child, Tree):
        if child.label() == 'NP':
            for num in range(len(child)):
                # Skip a JJ that is directly followed by another JJ.
                if not (child[num][1]=='JJ' and child[num+1][1]=='JJ'):
                    print(child[num][0])
Output:
This
green
skirt
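The loop above only prints the words. If the filtered chunk is needed as data, the same rule can be wrapped in a small function. This is a minimal sketch, not part of the original answer: the function name keep_nearest_adjective is hypothetical, and it assumes the chunk is a sequence of (word, tag) pairs, which is also what an NP subtree from RegexpParser contains.

```python
def keep_nearest_adjective(chunk):
    """Drop every JJ that is directly followed by another JJ.

    `chunk` is a sequence of (word, tag) pairs, for example the leaves of
    an NP subtree. Only the JJ nearest to the noun survives.
    """
    kept = []
    for i, (word, tag) in enumerate(chunk):
        next_tag = chunk[i + 1][1] if i + 1 < len(chunk) else None
        if tag == 'JJ' and next_tag == 'JJ':
            continue  # a later JJ is closer to the noun
        kept.append((word, tag))
    return kept


np_chunk = [('This', 'DT'), ('beautiful', 'JJ'), ('green', 'JJ'), ('skirt', 'NN')]
print(keep_nearest_adjective(np_chunk))
```

Because the function only indexes and iterates, an nltk Tree labelled 'NP' can be passed in directly in place of the plain list.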

Related

How to use the universal POS tags with nltk.pos_tag() function?

I have a text and I want to count the number of 'ADJ', 'PRON', 'VERB', 'NOUN' etc. tags in it.
I know that there is the pos_tag() function, but it gives me a different tagset, and I want results such as 'ADJ', 'PRON', 'VERB', 'NOUN'.
This is my code:
import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
sentence = "this is my sample text that I want to analyze with programming language"
# tokenizing text (make a list with every word)
sample_tokenization = word_tokenize(sentence)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()
# tagging words (word_tokenize already returns a list, so no split is needed)
taged_words = nltk.pos_tag(sample_tokenization)
print(taged_words)
print()
# showing the count of every type of word for new text
count_of_word_type = Counter(word_type for word, word_type in taged_words)
count_of_word_type_list = count_of_word_type.most_common() # making a list of (tag, count) tuples
print(count_of_word_type_list)
for w_type, num in count_of_word_type_list:
    print(w_type, num)
print()
The code above works but I want to find a way to get this type of tags:
Tag   Meaning              English Examples
ADJ   adjective            new, good, high, special, big, local
ADP   adposition           on, of, at, with, by, into, under
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner, article  the, a, some, most, every, no, which
NOUN  noun                 year, home, costs, time, Africa
NUM   numeral              twenty-four, fourth, 1991, 14:24
PRT   particle             at, on, out, over per, that, up, with
PRON  pronoun              he, their, her, its, my, I, us
VERB  verb                 is, say, told, given, playing, would
.     punctuation marks    . , ; !
X     other                ersatz, esprit, dunno, gr8, univeristy
I saw that there is a chapter here: https://www.nltk.org/book/ch05.html
That says:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
But I do not know how to apply that on my sample sentence.
Thanks for your help.
From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
# Default Penntreebank tagset.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
# Universal POS tags.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
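To get the counts the question asks for, the universal tags can be fed to a Counter. The sketch below is an illustration rather than part of the original answer: it uses the tagged output shown above as literal input so that it runs without a tagger model, and the variable names are assumptions.

```python
from collections import Counter

# Tagged output copied from the example above. With NLTK and its tagger
# model installed, the same list would come from:
#   pos_tag(word_tokenize(sentence), tagset='universal')
tagged_words = [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'),
                ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'),
                ('bad', 'ADJ'), ('.', '.')]

# Count how often each universal tag occurs.
tag_counts = Counter(tag for _, tag in tagged_words)
for tag, count in tag_counts.most_common():
    print(tag, count)
```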

Extract action/task from text using nltk

Hi, I am using NLTK for the first time and I want to extract actions/tasks from text such as this:
Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july
In the text above, "the speech to action" and "UI" are the actions.
I have started with tokenization but don't know what to do next. Please guide me.
from nltk import sent_tokenize
sample_text ="""Hi prakash, how are you ?. We need to complete the speech to action demo by 8 June then you will have to finish the Ui by 15 july"""
sentences = sent_tokenize(sample_text)
print(sentences)
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
sample_text = """Hi prakash, how are you ?. We need to complete the speech to action by today
then you will have to finish the UI by 15 july after that you may go finish the mobile view"""
sample_text = "need to complete the speech to action by today"
tokens = word_tokenize(sample_text.lower())
# lowercasing is required, as "June" and "june" get different tags (NNP vs NN)
pos_tags = pos_tag(tokens)
result = []
for i in range(len(tokens)):
    if (pos_tags[i][1] == 'VB') and (pos_tags[i][0] in ['complete','finish']):
        # Here we are looking for verbs like (finish, complete, done)
        owner = ''
        for back_tag in pos_tags[:i][::-1]:
            # traverse backwards to find the owner who will (finish, complete, done)
            if back_tag[1]=='PRP':
                owner = back_tag[0]
                break
        message = ''
        date = ''
        for message_index, token in enumerate(pos_tags[i:],i):
            # traverse forwards to find what has to be done
            if token[1]=='IN':
                for date_index, date_lookup in enumerate(pos_tags[message_index:],message_index):
                    if date_lookup[1]=='NN':
                        date = pos_tags[date_index-1][0] + ' ' + pos_tags[date_index][0]
                    if date_lookup[1]=='PRP':
                        # A pronoun (i/we/you) marks the start of the next clause,
                        # so stop here to avoid reading into the next sentence
                        break
                break
            else:
                message = message + ' ' + token[0]
        result += [dict(owner=owner, message=message, date=date)]
print(result)
Please guide me on how to extract the actions (action demo, UI) from the paragraph.
If you're using NLTK, you can get the POS tags of your tokens and come up with a regex or pattern over those tags. For example, an action will be a verb. (For better tagging, you may want spaCy. There is another library called Pattern for these purposes.)
But I'm not sure this will help you much in a large-scale application.
N.B.: There are well-trained named entity recognizers available; you may try them.
Here are my thoughts:
If I tag the parts of speech in your sentence using nltk.tag.pos_tag, I get the following:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
s = 'Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july'
tokens = word_tokenize(s)
print(pos_tag(tokens))
Output:
[('Hi', 'NNP'), ('prakash', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.'), ('.', '.'), ('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'), ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'), ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'), ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'), ('15', 'CD'), ('july', 'NN')]
If you observe, every action phrase, i.e. "speech to action" or "UI", occurs after a verb tagged 'VB', i.e. 'complete' and 'finish' respectively.
I would suggest approaching this problem with the steps below:
1) Find a verb in the sentence (something like below):
tagged = pos_tag(tokens)
for i in range(len(tagged)):
    if tagged[i][1] == 'VB':
        print(tagged[i][0])  # candidate action verb
2) If one is found, fetch the following words based on their POS tags (for example, retrieve all following words until you reach an 'IN' tag).
This may work for your dataset.
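The two steps can be sketched together as follows. This is only an illustration under stated assumptions: the function name extract_actions is hypothetical, the optional verbs filter is borrowed from the asker's own code, and the input is the tagged output shown above (from "We" onwards) pasted in literally so the sketch runs without a tagger model.

```python
# Tagged tokens copied from the pos_tag output above (second sentence only).
TAGGED = [('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'),
          ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'),
          ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'),
          ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'),
          ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'),
          ('15', 'CD'), ('july', 'NN')]


def extract_actions(tagged, verbs=None):
    """Return (verb, phrase) pairs: each 'VB' plus the words up to the next 'IN'."""
    actions = []
    for i, (word, tag) in enumerate(tagged):
        if tag != 'VB':
            continue  # step 1: only base-form verbs start an action
        if verbs is not None and word.lower() not in verbs:
            continue  # optional filter on the verbs of interest
        phrase = []
        for next_word, next_tag in tagged[i + 1:]:
            if next_tag == 'IN':
                break  # step 2: stop at the first preposition
            phrase.append(next_word)
        actions.append((word, ' '.join(phrase)))
    return actions


print(extract_actions(TAGGED, verbs={'complete', 'finish'}))
```

Without the verbs filter, 'have' is also tagged 'VB' and yields the extra pair ('have', 'to finish the UI'), which is one reason a rule this simple may need tuning on real data.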

nltk tokenize measurement units

I'm trying to extract measurements from a messy dataset.
Some basic example entries would be:
1.5 gram of paracetamol
1.5g of paracetamol
1.5 grams. of paracetamol
I'm trying to extract the measurement and units for each entry so the result for all the above should be:
(1.5, g)
Some other questions proposed the use of NLTK for such a task but I'm running into trouble when doing the following:
import nltk
s1 = "1.5g of paracetamol"
s2 = "1.5 gram of paracetamol"
words_s1 = nltk.word_tokenize(s1)
words_s2 = nltk.word_tokenize(s2)
nltk.pos_tag(words_s1)
nltk.pos_tag(words_s2)
Which returns
[('1.5g', 'CD'), ('of', 'IN'), ('paracetamol', 'NN')]
[('1.5', 'CD'), ('gram', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
The problem is that the unit 'g' is being kept as part of the CD in the first example. How could I get the following result?
[('1.5', 'CD'), ('g', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
On the real data set the units are much more varied (mg, miligrams, kg, kgrams. ...)
Thanks!
You must tokenize the sentence yourself using nltk.regexp_tokenize, for example:
words_s1 = nltk.regexp_tokenize(s1, r'(?u)\d+(?:\.\d+)?|\w+')
Obviously, it needs to be improved to deal with more complicated cases.
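As one possible next step, the tokens can be turned into the (value, unit) pair the question asks for. The sketch below is an assumption-laden illustration, not the answerer's code: it applies the same pattern through the standard library's re.findall so that it runs without NLTK, and the UNIT_ALIASES table and the extract_measurement name are hypothetical and would need to be extended for the real dataset.

```python
import re

# Same pattern as in the answer: a number (optionally with decimals) or a word.
TOKEN_PATTERN = re.compile(r'(?u)\d+(?:\.\d+)?|\w+')
NUMBER_PATTERN = re.compile(r'\d+(?:\.\d+)?')

# Hypothetical alias table mapping spelling variants to a canonical unit.
UNIT_ALIASES = {
    'g': 'g', 'gram': 'g', 'grams': 'g',
    'mg': 'mg', 'milligram': 'mg', 'milligrams': 'mg',
    'kg': 'kg', 'kilogram': 'kg', 'kilograms': 'kg',
}


def extract_measurement(text):
    """Return (value, unit) for the first number followed by a known unit, else None."""
    tokens = TOKEN_PATTERN.findall(text)
    for number, unit in zip(tokens, tokens[1:]):
        if NUMBER_PATTERN.fullmatch(number) and unit.lower() in UNIT_ALIASES:
            return float(number), UNIT_ALIASES[unit.lower()]
    return None


for entry in ["1.5 gram of paracetamol", "1.5g of paracetamol", "1.5 grams. of paracetamol"]:
    print(extract_measurement(entry))
```

Misspelled units such as "miligrams" are not in the table, so they return None until an alias is added for them.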

How to add compound words to the tagger in NLTK?

So, I was wondering if anyone has any idea how to combine multiple terms into a single term when tagging with NLTK.
For example, when I do:
nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))
It gives me:
[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
How do I make it put 'Apple' and 'Incorporated' together as ('Apple Incorporated', 'NNP')?
You could try taking a look at nltk.RegexpParser. It allows you to chunk part-of-speech-tagged content based on regular expressions over the tags.
In your example, you could do something like
import nltk
pattern = "NP: {<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(repr(t))
This would give you:
Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
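If the goal is a flat list with ('Apple Incorporated', 'NNP') as a single entry, consecutive proper-noun tokens can also be joined directly in the tagged list. This is a minimal sketch and not part of the answer above: the function name merge_consecutive and the choice to merge only NNP/NNPS runs are assumptions, and the input is the pos_tag output quoted in the question.

```python
from itertools import groupby


def merge_consecutive(tagged, tags=('NNP', 'NNPS')):
    """Join runs of adjacent tokens whose tag is in `tags` into one token."""
    merged = []
    for is_target, group in groupby(tagged, key=lambda pair: pair[1] in tags):
        group = list(group)
        if is_target:
            # Join the words of the run and keep the tag of its first token.
            merged.append((' '.join(word for word, _ in group), group[0][1]))
        else:
            merged.extend(group)
    return merged


tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
print(merge_consecutive(tagged))
```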
The code is doing exactly what it is supposed to do. It is adding Part Of Speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.
I wonder if you are trying to use the wrong tool for the job. What are you trying to do / Why are you trying to do it? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look here:
collocations module

nltk custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in a way that achieves the following:
It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows when the custom phrase is "I am not interested".
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
The proper answer is to compile a large dataset tagged in the way you want, then train a machine learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."
DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')
def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []
    date_found = False
    i = 0
    while i < len(tagged):
        (w,t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date: # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []
            date_found = False
        elif date_found and i == len(tagged)-1: # end of date found
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w,t)
                phrase = []
Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
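For the CUSTOM part of the todo, one simple option is a second pass over the tagged tokens that replaces any run matching a known phrase. The sketch below is only an illustration: the function name tag_custom_phrases is hypothetical, matching is exact and case-sensitive, and the POS tags in the sample input are written by hand for the example rather than produced by a tagger.

```python
def tag_custom_phrases(tagged, phrases, label='CUSTOM'):
    """Replace token runs that spell out a known phrase with one (phrase, label) pair."""
    # Try longer phrases first so the longest match wins; skip empty phrases.
    phrase_tokens = sorted((p.split() for p in phrases if p.split()),
                           key=len, reverse=True)
    result = []
    i = 0
    while i < len(tagged):
        for tokens in phrase_tokens:
            window = [word for word, _ in tagged[i:i + len(tokens)]]
            if window == tokens:
                result.append((' '.join(tokens), label))
                i += len(tokens)
                break
        else:
            # No phrase starts here: keep the token and its POS tag as-is.
            result.append(tagged[i])
            i += 1
    return result


# Hand-written tags, for illustration only.
sample = [('but', 'CC'), ('I', 'PRP'), ('am', 'VBP'), ('not', 'RB'),
          ('interested', 'JJ'), ('.', '.')]
print(tag_custom_phrases(sample, ["I am not interested"]))
```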
You should probably do chunking with the nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
