nltk custom tokenizer and tagger - python

Here is my requirement. I want to tokenize and tag a paragraph in such a way that it allows me to achieve the following:
It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows, given that the custom phrase is "I am not interested":
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.

The proper answer is to compile a large dataset tagged in the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
s = "They all like to go there on 5th November 2010, but I am not interested."
DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')
def custom_tagger(sentence):
tagged = pos_tag(word_tokenize(sentence))
phrase = []
date_found = False
i = 0
while i < len(tagged):
(w,t) = tagged[i]
phrase.append(w)
in_date = DATE.match(' '.join(phrase))
date_found |= bool(in_date)
if date_found and not in_date: # end of date found
yield (' '.join(phrase[:-1]), 'DATE')
phrase = []
date_found = False
elif date_found and i == len(tagged)-1: # end of date found
yield (' '.join(phrase), 'DATE')
return
else:
i += 1
if not in_date:
yield (w,t)
phrase = []
Todo: expand the DATE regex, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether "5th" on its own should count as a date. (Probably not, so filter out length-one dates that contain only a bare or ordinal number.)
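For the CUSTOM step, a minimal post-processing sketch over the tagger output could look like this (the CUSTOM_PHRASES list and the merge_custom helper are illustrative, not part of the answer above):

CUSTOM_PHRASES = [word_tokenize("I am not interested")]

def merge_custom(tagged):
    tagged = list(tagged)
    i = 0
    while i < len(tagged):
        for phrase in CUSTOM_PHRASES:
            n = len(phrase)
            if [w for w, t in tagged[i:i + n]] == phrase:
                # collapse the matched tokens into a single CUSTOM item
                tagged[i:i + n] = [(' '.join(phrase), 'CUSTOM')]
                break
        i += 1
    return tagged

print(merge_custom(custom_tagger(s)))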

You should probably do chunking with nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
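For instance, a small chunk grammar along these lines could group the date tokens (the POS tags that pos_tag assigns to tokens like "5th" vary, so the pattern below is an assumption you may need to adjust):

import nltk

sentence = "They all like to go there on 5th November 2010, but I am not interested."
grammar = r"DATE: {<CD|JJ><NNP><CD>}"  # e.g. 5th/JJ November/NNP 2010/CD
cp = nltk.RegexpParser(grammar)
print(cp.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))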

Related

Extract action/task from text using nltk

Hi, I am using NLTK for the first time and I want to extract actions/tasks from text using it:
Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july
In the text above, "speech to action" and "UI" are the actions.
I have started with tokenization but don't know what to do next. Please guide.
from nltk import sent_tokenize
sample_text = """Hi prakash, how are you ?. We need to complete the speech to action demo by 8 June then you will have to finish the Ui by 15 july"""
sentences = sent_tokenize(sample_text)
print(sentences)

import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

sample_text = """Hi prakash, how are you ?. We need to complete the speech to action by today
then you will have to finish the UI by 15 july after that you may go finish the mobile view"""
sample_text = "need to complete the speech to action by today"
tokens = word_tokenize(sample_text.lower())
# the lower() is required, as "June" and "june" get different tags (NNP vs. NN)
pos_tags = pos_tag(tokens)
result = []
for i in range(len(tokens)):
    if (pos_tags[i][1] == 'VB') and (pos_tags[i][0] in ['complete', 'finish']):
        # here we are looking for words like finish, complete, done
        owner = ''
        for back_tag in pos_tags[:i][::-1]:
            # traverse backwards to find the owner who will finish/complete the task
            if back_tag[1] == 'PRP':
                owner = back_tag[0]
                break
        message = ''
        date = ''
        for message_index, token in enumerate(pos_tags[i:], i):
            # traverse forwards to find out what has to be done
            if token[1] == 'IN':
                for date_index, date_lookup in enumerate(pos_tags[message_index:], message_index):
                    if date_lookup[1] == 'NN':
                        date = pos_tags[date_index - 1][0] + ' ' + pos_tags[date_index][0]
                    if date_lookup[1] == 'PRP':
                        # stop further propagation: the next clause starts with i/we/you
                        break
                break
            else:
                message = message + ' ' + token[0]
        result += [dict(owner=owner, message=message, date=date)]
print(result)
Please guide me on how to extract the actions (action demo, UI) from the paragraph.
If you're using NLTK, you can get the POS tags of your tokens and come up with a regex or pattern using those tags. For example, an action will be a verb. (For better tagging you may want spaCy; there is also a library called Pattern for these purposes.)
But I'm not sure this is going to help you much in a scaled application.
N.B.: There are well-trained named entity recognizers available; you may try them.
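For instance, a minimal sketch of the POS-tag-pattern idea with NLTK's RegexpParser (the ACTION grammar is an assumption and would need tuning on real data):

import nltk

s = 'We need to complete the speech to action by 8 June'
grammar = 'ACTION: {<VB><DT>?<NN.*>+<TO>?<NN.*>*}'  # a verb followed by its object
cp = nltk.RegexpParser(grammar)
print(cp.parse(nltk.pos_tag(nltk.word_tokenize(s))))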
Here are my thoughts:
If I try to identify the parts of speech for your sentence using nltk.tag.pos_tag, I get the following:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
s = 'Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july'
tokens = word_tokenize(s)
print(pos_tag(tokens))
Output:
[('Hi', 'NNP'), ('prakash', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.'), ('.', '.'), ('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'), ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'), ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'), ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'), ('15', 'CD'), ('july', 'NN')]
If you observe, every action phrase, i.e. "speech to action" or "UI", occurs after a preceding verb, 'complete' and 'finish' respectively.
I would suggest trying this problem with the steps below:
1) Find a verb in the sentence, something like below:
pos_tags = pos_tag(tokens)
for i in range(len(tokens)):
    if pos_tags[i][1] == 'VB':
2) If found, then fetch the next words based on their POS tags (maybe retrieve all following words until you find an 'IN' tag).
This may work for your dataset.
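Putting the two steps together, a rough sketch (the tags are taken from the pos_tag output shown above):

import nltk

s = 'We need to complete the speech to action by 8 June'
pos_tags = nltk.pos_tag(nltk.word_tokenize(s))
actions = []
for i, (word, tag) in enumerate(pos_tags):
    if tag == 'VB':                       # step 1: find a verb
        phrase = []
        for next_word, next_tag in pos_tags[i + 1:]:
            if next_tag == 'IN':          # step 2: stop at the preposition
                break
            phrase.append(next_word)
        actions.append(' '.join(phrase))
print(actions)  # expected: ['the speech to action']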

Python TextBlob Package - Determines POS tag for '%' symbol but do not print it as a word

I was banging my head against Python's TextBlob package, which
identifies sentences in paragraphs
identifies words in sentences
determines POS (part of speech) tags for those words, etc.
Everything was going well until I found what may be an issue, if I am not wrong. It is explained below with a sample code snippet.
from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample) #Passing it to TextBlob package.
Words = blob.words #Splitting the Sentence into words.
Tags = blob.tags #Determining POS tag for each words in the sentence
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']
As seen above, the blob.tags function treats the '%' symbol as a separate word and determines a POS tag for it as well,
whereas the blob.words function does not print the '%' symbol at all, either alone or together with its previous word.
I am creating a data frame from the output of both functions, so it cannot be created due to the length mismatch.
Here are my questions.
Is this a possible issue in the TextBlob package, by any chance?
And is there any way to identify '%' in the Words list?
Stripping off punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624
They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob down to the NLTK tokenizer.
When faced with a similar issue, I've replaced interesting punctuation with a non-dictionary text constant that aims to represent it, i.e. replacing '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
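For example, a minimal sketch of that idea (the placeholder token is arbitrary):

text = 'This is greater than that by 5%.'
prepared = text.replace('%', ' PUNCTPERCENT')
# tokenize `prepared` as usual, then map 'PUNCTPERCENT' back to '%' afterwards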
EDIT: I stand corrected; on TextBlob initialization you can set a tokenizer through the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328
So you could easily pass TextBlob a tokenizer that respects punctuation:
respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
EDIT2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Notice the docstring of the words method: it says you should access the tokens property instead of the words property if you want to include punctuation.
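So with the question's sample, something like this should show the difference (expected output based on the snippets above):

from textblob import TextBlob

blob = TextBlob('This is greater than that by 5%.')
print(blob.words)   # punctuation stripped
# ['This', 'is', 'greater', 'than', 'that', 'by', '5']
print(blob.tokens)  # punctuation kept
# ['This', 'is', 'greater', 'than', 'that', 'by', '5', '%', '.']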
Finally, I have found that NLTK identifies the symbols properly. The code snippet is given below for reference:
from nltk import word_tokenize
from nltk import pos_tag

sample = 'This is better than that by 5%.'
Words = word_tokenize(sample)
Tags = pos_tag(Words)
print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

ne_chunk without pos_tag in NLTK

I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.
from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk

sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())
print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
print(print_chunk)
and this is the result:
[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]
My question: is it possible not to include the POS tags (like NNP above) and only include the Tree labels 'GPE' and 'PERSON'?
And what does 'GPE' mean?
Thanks in advance.
The named entity chunker will give you a tree containing both chunks and tags. You can't change that, but you can take the tags out. Starting from your tagged_sent:
chunks = nltk.ne_chunk(tagged_sent)
simple = []
for elt in chunks:
    if isinstance(elt, Tree):
        simple.append(Tree(elt.label(), [word for word, tag in elt]))
    else:
        simple.append(elt[0])
If you only want the chunks, omit the else: clause in the above. You can adapt the code to wrap the chunks any way you want. I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of multiple words (try adding "New York" to your example), so the chunk's contents must be a list, not a single element.
PS. "GPE" stands for "geo-political entity" (obviously a chunker mistake). You can see a list of the "commonly used tags" in the nltk book, here.
Most probably a slight modification of the code at https://stackoverflow.com/a/31838373/610569, keeping the tags, is what you require.
is it possible not to include pos_tag (like NNP above) and only include Tree 'GPE','PERSON'?
Yes, simply traverse the Tree object =) See How to Traverse an NLTK Tree object?
>>> from nltk import Tree, pos_tag, ne_chunk
>>> sentence = "Michael and John is reading a booklet in a library of Jakarta"
>>> tagged_sent = ne_chunk(pos_tag(sentence.split()))
>>> tagged_sent
Tree('S', [Tree('GPE', [('Michael', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('booklet', 'NN'), ('in', 'IN'), ('a', 'DT'), ('library', 'NN'), ('of', 'IN'), Tree('GPE', [('Jakarta', 'NNP')])])
>>> from nltk.sem.relextract import NE_CLASSES
>>> ace_tags = NE_CLASSES['ace']
>>> for node in tagged_sent:
... if type(node) == Tree and node.label() in ace_tags:
... words, tags = zip(*node.leaves())
... print node.label() + '\t' + ' '.join(words)
...
GPE Michael
PERSON John
GPE Jakarta
What does 'GPE' mean?
GPE means "Geo-Political Entity".
The GPE tag comes from the ACE dataset.
There are two pre-trained NE chunkers available, see https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py#L164
There are 3 tag sets that are supported: https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L31
For a detailed explanation, see NLTK relation extraction returns nothing

NLTK RegexpParser, chunk phrase by matching exactly one item

I'm using NLTK's RegexpParser to chunk a noun phrase, which I define with a grammar as
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
cp = RegexpParser(grammar)
This is grand; it matches a noun phrase as:
DT if it exists
JJ in whatever number
NN or NNS, at least one
Now, what if I want to match the same pattern but with the "whatever number" of JJ reduced to exactly one? That is, I want to match DT if it exists, one JJ, and 1+ NN/NNS. If there is more than one JJ, I want to match only the one nearest to the noun (plus DT if present, and the NN/NNS).
The grammar
grammar = "NP: {<DT>?<JJ><NN|NNS>+}"
would match only when there is exactly one JJ, while the grammar
grammar = "NP: {<DT>?<JJ>{1}<NN|NNS>+}"
which I thought would work given typical regexp patterns, raises a ValueError.
For example, in "This beautiful green skirt", I'd like to chunk "This green skirt".
So, how would I proceed?
The grammar "NP: {<DT>?<JJ><NN|NNS>+}" is correct for the requirement you mentioned.
In the example you gave in the comment section, where you are not getting DT in the output:
"This beautiful green skirt is for you."
Tree('S', [('This', 'DT'), ('beautiful', 'JJ'), Tree('NP', [('green','JJ'),
('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
Here in your example there are 2 consecutive JJs, which does not meet your requirement as you stated it: "I want to match DT if it exists, one JJ and 1+ NN/NNS."
For the updated requirement:
I want to match DT if it exists, one JJ and 1+ NN/NNS. If there are more than one JJ, I want to match only one of them, the one nearest to the noun (and DT if there is, and NN/NNS).
Here, you will need to use
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
and post-process the NP chunks to remove the extra JJs.
Code:
from nltk import Tree
chunk_output = Tree('S', [Tree('NP', [('This', 'DT'), ('beautiful', 'JJ'), ('green','JJ'), ('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
for child in chunk_output:
    if isinstance(child, Tree):
        if child.label() == 'NP':
            for num in range(len(child)):
                # skip a JJ that is directly followed by another JJ; the `and`
                # short-circuits, so child[num + 1] is never evaluated for the
                # last element (which is NN/NNS under this grammar)
                if not (child[num][1] == 'JJ' and child[num + 1][1] == 'JJ'):
                    print(child[num][0])
Output:
This
green
skirt

Extract associated values from text using NLP

I want to extract cardinal (CD) values associated with units of measurement and store them in a dictionary. For example, if the text contains a token sequence like "20 kgs", it should extract it and keep it in a dictionary.
Example:
for input text, “10-inch fry pan offers superb heat conductivity and distribution”, the output dictionary should look like, {"dimension":"10-inch"}
for input text, "This bucket holds 5 litres of water.", the output should look like, {"volume": "5 litres"}
import nltk

line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)
The lines above give this output:
[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
Is there a way to extract the CD and UOM values from the text?
Not sure how flexible you need the process to be. You can play around with nltk.RegexpParser and come up with some good patterns:
import nltk
sentence = 'This bucket holds 5 litres of water.'
parser = nltk.RegexpParser(
    """
    INDICATOR: {<CD><NNS>}
    """)
print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))
Output:
(S
This/DT
bucket/NN
holds/VBZ
(INDICATOR 5/CD litres/NNS)
of/IN
water/NN
./.)
You can also create a corpus and train a chunker.
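If you want to go from the INDICATOR chunks to the requested dictionary, a minimal sketch could post-process the parse tree; the unit-to-measure mapping below is an assumption, and hyphenated forms like "10-inch" would need an extra pattern since they come out of the tokenizer as a single token:

import nltk

UNIT_TO_MEASURE = {'litres': 'volume', 'kgs': 'weight'}  # assumed mapping, extend as needed

def extract_measurements(sentence):
    parser = nltk.RegexpParser('INDICATOR: {<CD><NNS>}')
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    result = {}
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'INDICATOR'):
        words = [w for w, t in subtree.leaves()]
        # the last word is the unit; fall back to 'dimension' for unknown units
        result[UNIT_TO_MEASURE.get(words[-1], 'dimension')] = ' '.join(words)
    return result

print(extract_measurements('This bucket holds 5 litres of water.'))
# {'volume': '5 litres'}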
Hm, not sure if it helps, but I wrote something for this in JavaScript.
Here:
http://github.com/redaktor/nlp_compromise
It may still be a bit under-documented, but the guys are porting it to a 2.0 branch now.
It should be easy to port to Python, considering:
What's different between Python and Javascript regular expressions?
And: did you check Python's NLTK? http://www.nltk.org
