Extract action/task from text using nltk - python

Hi, I am using NLTK for the first time and I want to extract actions/tasks from text using NLTK.
Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july
In the text above, "speech to action" and "UI" are the actions.
I have started with tokenization but don't know what to do next. Please guide.
from nltk import sent_tokenize
sample_text ="""Hi prakash, how are you ?. We need to complete the speech to action demo by 8 June then you will have to finish the Ui by 15 july"""
sentences = sent_tokenize(sample_text)
print(sentences)

import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
sample_text = """Hi prakash, how are you ?. We need to complete the speech to action by today
then you will have to finish the UI by 15 july after that you may go finish the mobile view"""
sample_text = "need to complete the speech to action by today"
tokens = word_tokenize(sample_text.lower())
# lowercasing is required, as "June" and "june" get different tags (NNP vs NN)
pos_tags = pos_tag(tokens)
result = []
for i in range(len(tokens)):
    if pos_tags[i][1] == 'VB' and pos_tags[i][0] in ['complete', 'finish']:
        # Here we are looking for verbs like (finish, complete, done)
        owner = ''
        for back_tag in pos_tags[:i][::-1]:
            # traverse backwards to find the owner who will (finish, complete, done)
            if back_tag[1] == 'PRP':
                owner = back_tag[0]
                break
        message = ''
        date = ''
        for message_index, token in enumerate(pos_tags[i:], i):
            # traverse forward to find what has to be done
            if token[1] == 'IN':
                for date_index, date_lookup in enumerate(pos_tags[message_index:], message_index):
                    if date_lookup[1] == 'NN':
                        date = pos_tags[date_index - 1][0] + ' ' + pos_tags[date_index][0]
                    if date_lookup[1] == 'PRP':
                        # Trick to stop further iteration: the next
                        # sentence starts with i/we/you
                        break
                break
            else:
                message = message + ' ' + token[0]
        result += [dict(owner=owner, message=message, date=date)]
print(result)
Please guide me on how to extract the actions ("action demo", "UI") from the paragraph.

If you're using NLTK, you can get the POS tags of your tokens and come up with a regex or pattern over those tags. For example, an action will be a verb. (For better tagging, you may want spaCy. There is also a library called Pattern for these purposes.)
But I'm not sure if this is going to help you a lot for a scaled application.
N.B.: There are well-trained Named Entity Recognizers available; you may try them.

Here are my thoughts:
If I try to identify parts of speech for your sentence using nltk.tag.pos_tag, I get the output below:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
s = 'Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july'
tokens = word_tokenize(s)
print(pos_tag(tokens))
Output:
[('Hi', 'NNP'), ('prakash', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.'), ('.', '.'), ('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'), ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'), ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'), ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'), ('15', 'CD'), ('july', 'NN')]
If you observe, every action phrase, i.e. "speech to action" or "UI", occurs after a preceding verb, i.e. 'complete' and 'finish' respectively.
I would suggest trying this problem with the steps below:
1) Find a verb in the sentence, something like below:
for i in range(len(tokens)):
    if pos_tag(tokens)[i][1] == 'VB':
2) If found, then fetch the next words based on their POS tags (perhaps retrieve all following words until you find an 'IN' tag).
This may work for your dataset.
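The two steps above can be sketched roughly as follows. To keep the example deterministic it operates on the tagged output shown earlier rather than calling pos_tag, and the function name extract_actions is made up for illustration:

```python
# Sketch of the suggested approach: find a 'VB' verb, then collect the
# following words until an 'IN' (preposition) tag is reached.
# The tagged input below is the pos_tag output shown above.

tagged = [('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'),
          ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'),
          ('by', 'IN'), ('8', 'CD'), ('June', 'NNP')]

def extract_actions(tagged_tokens):
    """Collect the words between each 'VB' verb and the next 'IN' tag."""
    actions = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag == 'VB':
            phrase = []
            for next_word, next_tag in tagged_tokens[i + 1:]:
                if next_tag == 'IN':   # stop at the preposition ("by ...")
                    break
                phrase.append(next_word)
            if phrase:
                actions.append(' '.join(phrase))
    return actions

print(extract_actions(tagged))  # ['the speech to action']
```

This is only a pattern-matching heuristic; for a scaled application a trained chunker or NER model would be more robust.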

Related

Incorrect POS Tagging with NLTK

I am trying to implement chunking using NLTK, but I am having problems assigning POS tags to the words. The tagger gives a different output when the data is passed as-is from the file, and the tagging is wrong in many cases. When I convert the text to lowercase, it assigns the tags correctly, but then the chunking is not correct.
My Code
import nltk
from nltk.corpus import state_union
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sample_text = state_union.raw("2006-GWBush.txt")
words = nltk.word_tokenize(sample_text.lower()) # it makes correct pos tags in this case
words = nltk.word_tokenize(sample_text) # it does not make correct pos tags in this case
tagged = nltk.pos_tag(words)
print(tagged)
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)
Output when POS Tagging is correct:
[('president', 'NN'), ('george', 'NN'), ('w.', 'VBD'), ('bush', 'NN'), ("'s", 'POS'), ('address', 'NN'), ('before', 'IN'), ('a', 'DT'), ('joint', 'JJ'), ('session', 'NN'),
and so on....
No chunks are printed in this case.
Output when POS Tagging is incorrect:
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
and so on...
chunks are printed in this case but due to wrong POS tags, the chunks are not correct.
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
Why does POS tagging produce different results when the words are in different cases? I am new to NLTK and therefore unable to figure out the issue. Thanks.
Data file
The data file used can be found here in the NLTK library. It is already there in NLTK when you download it in your system. You won't need to import it separately.
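As a deterministic check (no tagger download needed), applying the question's chunk grammar to the uppercase tags reported above reproduces the chunks shown; the lowercase tags contain no NNP, so the pattern never matches there:

```python
import nltk

# Tags as reported in the question for the uppercase text
tagged = [('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'),
          ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'),
          ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP')]

# The same chunk grammar as in the question
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunked = nltk.RegexpParser(chunkGram).parse(tagged)

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)
```

Since the grammar anchors on NNP, any fix has to either keep the original casing (and accept the tagger's NNP mistakes) or rewrite the grammar around the lowercase NN tags.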

How to use the universal POS tags with nltk.pos_tag() function?

I have a text and I want to find number of 'ADJs','PRONs', 'VERBs', 'NOUNs' etc.
I know that there is the .pos_tag() function, but it gives me different results, and I want to have the results as 'ADJ', 'PRON', 'VERB', 'NOUN'.
This is my code:
import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
sentence = "this is my sample text that I want to analyze with programming language"

# tokenizing text (make a list with every word)
sample_tokenization = word_tokenize(sentence)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()

# tagging words (word_tokenize already returns a list, so no .split() here)
tagged_words = nltk.pos_tag(sample_tokenization)
print(tagged_words)
print()

# showing the count of every type of word for the text
count_of_word_type = Counter(word_type for word, word_type in tagged_words)
count_of_word_type_list = count_of_word_type.most_common()  # a list of (tag, count) tuples
print(count_of_word_type_list)
for w_type, num in count_of_word_type_list:
    print(w_type, num)
print()
The code above works but I want to find a way to get this type of tags:
Tag Meaning English Examples
ADJ adjective new, good, high, special, big, local
ADP adposition on, of, at, with, by, into, under
ADV adverb really, already, still, early, now
CONJ conjunction and, or, but, if, while, although
DET determiner, article the, a, some, most, every, no, which
NOUN noun year, home, costs, time, Africa
NUM numeral twenty-four, fourth, 1991, 14:24
PRT particle at, on, out, over per, that, up, with
PRON pronoun he, their, her, its, my, I, us
VERB verb is, say, told, given, playing, would
. punctuation marks . , ; !
X other ersatz, esprit, dunno, gr8, univeristy
I saw that there is a chapter here: https://www.nltk.org/book/ch05.html
That says:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
But I do not know how to apply that on my sample sentence.
Thanks for your help.
From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
# Default Penntreebank tagset.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
# Universal POS tags.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

Python TextBlob Package - Determines POS tag for '%' symbol but do not print it as a word

I was banging my head against Python's TextBlob package, which
identifies sentences in paragraphs,
identifies words in sentences,
determines POS (part-of-speech) tags for those words, etc.
Everything was going well until I found a possible issue, if I am not wrong. It is explained below with a sample code snippet.
from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample) #Passing it to TextBlob package.
Words = blob.words #Splitting the Sentence into words.
Tags = blob.tags #Determining POS tag for each words in the sentence
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']
As seen above, the blob.tags function treats the '%' symbol as a separate word and determines a POS tag for it as well.
Whereas the blob.words function does not print the '%' symbol at all, either alone or together with its previous word.
I am creating a data frame from the output of both functions, so it fails due to a length-mismatch issue.
Here are my questions.
Is this possible issue in TextBlob package by any chance ?
And is there any way to identify '%' in the Words list ?
Stripping off punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624
They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob down to the NLTK tokenizer.
When faced with a similar issue I've replaced interesting punctuation with a non-dictionary text constant that aims to represent it, ie: replace '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
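That workaround can be sketched as follows (PUNCTPERCENT is just an arbitrary placeholder token, and plain str.split stands in for the real tokenizer):

```python
# Replace '%' with a placeholder token before tokenizing so the
# information is not lost, then map the placeholder back afterwards.
PLACEHOLDER = 'PUNCTPERCENT'

sample = 'This is greater than that by 5%.'
prepared = sample.replace('%', ' ' + PLACEHOLDER + ' ')

tokens = prepared.split()  # stand-in for the real tokenizer
tokens = ['%' if t == PLACEHOLDER else t for t in tokens]
print(tokens)  # ['This', 'is', 'greater', 'than', 'that', 'by', '5', '%', '.']
```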
EDIT: I stand corrected: on TextBlob initialization you can set a tokenizer, through the tokenizer argument of its __init__ method https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328.
So you could easily pass TextBlob a tokenizer that respects punctuation:
respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
EDIT 2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372. Notice the docstring of the words method: it says you should access the tokens property instead of the words property if you want to include punctuation.
Finally, I have found that NLTK identifies the symbols properly. The code snippet is given below for reference:
from nltk import word_tokenize
from nltk import pos_tag
sample = 'This is better than that by 5%.'
Words = word_tokenize(sample)
Tags = pos_tag(Words)
print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

Finding Proper Nouns using NLTK WordNet

Is there any way to find proper nouns using NLTK WordNet? I.e., can I tag possessive nouns using NLTK WordNet?
I don't think you need WordNet to find proper nouns, I suggest using the Part-Of-Speech tagger pos_tag.
To find Proper Nouns, look for the NNP tag:
from nltk.tag import pos_tag
sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]
propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
# ['Michael','Jackson', 'McDonalds']
You may not be very satisfied, since Michael and Jackson are split into two tokens; then you might need something more complex, such as a Named Entity tagger.
In principle, as documented by the Penn Treebank tagset, for possessive nouns you can simply look for the POS tag, http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html. But often the tagger doesn't produce POS when the word is tagged NNP.
To find Possessive Nouns, look for str.endswith("'s") or str.endswith("s'"):
from nltk.tag import pos_tag
sentence = "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'), ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'), ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]
possessives = [word for word in sentence.split() if word.endswith("'s") or word.endswith("s'")]
# ["Jackson's", "Agnes'"]
Alternatively, you can use NLTK's ne_chunk, but it doesn't add much unless you are concerned about what kind of proper noun you get from the sentence:
>>> from itertools import chain; from nltk.tree import Tree; from nltk.chunk import ne_chunk
>>> [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), Tree('PERSON', [('Daniel', 'NNP')])]
>>> [i[0] for i in list(chain(*[chunk.leaves() for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]))]
['Michael', 'Jackson', 'Daniel']
Using ne_chunk is a little verbose and it doesn't get you the possessives.
I think what you need is a tagger, a part-of-speech tagger. This tool assigns a part-of-speech tag (e.g., proper noun, possessive pronoun, etc.) to each word in a sentence.
NLTK includes some taggers:
http://nltk.org/book/ch05.html
There's also the Stanford Part-Of-Speech Tagger (open source too, better performance).
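Putting the two ideas together on a pre-tagged sentence (deterministic, no tagger download; the tags are copied from the pos_tag output shown above):

```python
# Tags as shown in the answer above for
# "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
tagged_sent = [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'),
               ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'),
               ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]

# Proper nouns: filter on the NNP tag
propernouns = [w for w, pos in tagged_sent if pos == 'NNP']

# Possessives: fall back to string suffixes, since POS is rarely emitted
possessives = [w for w, _ in tagged_sent if w.endswith("'s") or w.endswith("s'")]

print(propernouns)   # ['Michael', 'Jackson', 'Daniel', "Jackson's", "Agnes'"]
print(possessives)   # ["Jackson's", "Agnes'"]
```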

nltk custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in such a way that it allows me to achieve the following:
It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized by the default NLTK word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows, in the case where the custom phrase is "I am not interested":
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."
DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []
    date_found = False
    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:  # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []
            date_found = False
        elif date_found and i == len(tagged) - 1:  # end of date found
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []
Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
You should probably do chunking with the nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
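The chunking route can be sketched deterministically on a hand-tagged sentence; the DATE pattern below is only a made-up illustration, not a complete date grammar:

```python
import nltk

# Hand-tagged tokens so the example needs no tagger download;
# the grammar chunks a CD-NNP-CD sequence ("5th November 2010") as DATE.
tagged = [('go', 'VB'), ('there', 'RB'), ('on', 'IN'),
          ('5th', 'CD'), ('November', 'NNP'), ('2010', 'CD'),
          (',', ','), ('but', 'CC')]

grammar = r"DATE: {<CD><NNP><CD>}"
chunked = nltk.RegexpParser(grammar).parse(tagged)

dates = [' '.join(w for w, t in st.leaves())
         for st in chunked.subtrees() if st.label() == 'DATE']
print(dates)  # ['5th November 2010']
```

The same mechanism extends to a CUSTOM rule if you can express the known phrase as a tag sequence; otherwise the regex post-processing above is the simpler route.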
