nltk tokenize measurement units - python

I'm trying to extract measurements from a messy dataset.
Some basic example entries would be:
1.5 gram of paracetamol
1.5g of paracetamol
1.5 grams. of paracetamol
I'm trying to extract the measurement and units for each entry so the result for all the above should be:
(1.5, g)
Some other questions proposed the use of NLTK for such a task but I'm running into trouble when doing the following:
import nltk
s1 = "1.5g of paracetamol"
s2 = "1.5 gram of paracetamol"
words_s1 = nltk.word_tokenize(s1)
words_s2 = nltk.word_tokenize(s2)
nltk.pos_tag(words_s1)
nltk.pos_tag(words_s2)
Which returns
[('1.5g', 'CD'), ('of', 'IN'), ('paracetamol', 'NN')]
[('1.5', 'CD'), ('gram', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
The problem is that the unit 'g' is being kept as part of the CD in the first example. How could I get the following result?
[('1.5', 'CD'), ('g', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
On the real data set the units are much more varied (mg, miligrams, kg, kgrams. ...)
Thanks!

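word_tokenize keeps '1.5g' as a single token, so you need to tokenize the sentence yourself, for example with nltk.regexp_tokenize:
words_s1 = nltk.regexp_tokenize(s1, r'(?u)\d+(?:\.\d+)?|\w+')
This pattern emits each number as its own token and each run of word characters as another, so '1.5g' becomes '1.5' and 'g'. It will need to be extended to deal with more complicated cases.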
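As a rough sketch of how such tokens could be turned into the (1.5, g) result asked for in the question: the snippet below applies the same pattern with the standard-library re module, so it runs without NLTK (passing the pattern to nltk.regexp_tokenize should yield the same tokens). It then looks for a number followed by a known unit. The UNIT_ALIASES table and the extract_measurement helper are hypothetical names made up for this illustration, and a real data set would need a much larger table.

```python
import re

# Same pattern as in the answer: a number with optional decimals,
# or a run of word characters.
TOKEN_PATTERN = r'\d+(?:\.\d+)?|\w+'
NUMBER_PATTERN = r'\d+(?:\.\d+)?'

# Hypothetical alias table (an assumption, not part of NLTK): maps the
# spelling variants seen in the data to one canonical unit.
UNIT_ALIASES = {
    'g': 'g', 'gram': 'g', 'grams': 'g',
    'mg': 'mg', 'miligrams': 'mg', 'milligrams': 'mg',
    'kg': 'kg', 'kgrams': 'kg', 'kilograms': 'kg',
}


def extract_measurement(text):
    """Return (value, unit) for the first number followed by a known unit, else None."""
    tokens = re.findall(TOKEN_PATTERN, text)
    # Walk over adjacent token pairs: (tokens[0], tokens[1]), (tokens[1], tokens[2]), ...
    for number, unit in zip(tokens, tokens[1:]):
        if re.fullmatch(NUMBER_PATTERN, number) and unit.lower() in UNIT_ALIASES:
            return float(number), UNIT_ALIASES[unit.lower()]
    return None


for entry in ["1.5 gram of paracetamol", "1.5g of paracetamol", "1.5 grams. of paracetamol"]:
    print(extract_measurement(entry))
```

With this table, all three example entries from the question map to (1.5, 'g'). A lookup table is used here instead of POS tags because misspelled units such as "miligrams" are easier to list than to tag reliably.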

Related

Incorrect POS Tagging with NLTK

I am trying to implement chunking using NLTK, but I am having problems assigning POS tags to the words. When the text is passed exactly as it appears in the file, the output is different and the tagging is wrong in many cases. When I convert the text to lowercase, the tags are assigned correctly, but the chunking is not correct.
My Code
import nltk
from nltk.corpus import state_union
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sample_text = state_union.raw("2006-GWBush.txt")
words = nltk.word_tokenize(sample_text.lower())  # variant 1: lowercased text, POS tags come out correct
words = nltk.word_tokenize(sample_text)  # variant 2: original casing, POS tags come out wrong (overwrites variant 1; comment out to test variant 1)
tagged = nltk.pos_tag(words)
print(tagged)
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)
Output when POS Tagging is correct:
[('president', 'NN'), ('george', 'NN'), ('w.', 'VBD'), ('bush', 'NN'), ("'s", 'POS'), ('address', 'NN'), ('before', 'IN'), ('a', 'DT'), ('joint', 'JJ'), ('session', 'NN'),
and so on....
No chunks are printed in this case.
Output when POS Tagging is incorrect:
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
and so on...
Chunks are printed in this case, but because of the wrong POS tags they are not correct.
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
Why does POS tagging produce different results when the words are in different cases? I am new to NLTK, so I can't work out the issue. Thanks
Data file
The data file used is 2006-GWBush.txt from the state_union corpus in the NLTK library. It comes with NLTK when you download it to your system, so you won't need to import it separately.

Tokenizing and POS tagging in Python from CSV file

I am new to Python and would like to do POS tagging after importing a csv file from my local machine. I looked up some resources online and found that the following code works.
text = ('Senator Elizabeth Warren from Massachusetts announced her support of '
        'Social Security in Washington, D.C. on Tuesday. Warren joined other '
        'Democrats in support.')
import nltk
from nltk import tokenize
sentences = tokenize.sent_tokenize(text)
sentences
from nltk.tokenize import TreebankWordTokenizer
texttokens = []
for sent in sentences:
    texttokens.append(TreebankWordTokenizer().tokenize(sent))
texttokens
from nltk.tag import pos_tag
taggedsentences = []
for sentencetokens in texttokens:
    taggedsentences.append(pos_tag(sentencetokens))
taggedsentences
print(taggedsentences)
The printed result of the code above looks like this:
[[('Senator', 'NNP'), ('Elizabeth', 'NNP'), ('Warren', 'NNP'), ('from',
'IN'), ('Massachusetts', 'NNP'), ('announced', 'VBD'), ('her', 'PRP$'),
('support', 'NN'), ('of', 'IN'), ('Social', 'NNP'), ('Security', 'NNP'),
('in', 'IN'), ('Washington', 'NNP'), (',', ','), ('D.C.', 'NNP'), ('on',
'IN'), ('Tuesday', 'NNP'), ('.', '.')], [('Warren', 'NNP'), ('joined',
'VBD'), ('other', 'JJ'), ('Democrats', 'NNPS'), ('in', 'IN'), ('support',
'NN'), ('.', '.')]]
This is the result I want, but I would like to get it after importing a csv file that contains several rows, each with several sentences. For example, the csv file looks like this:
---------------------------------------------------------------
I like this product. This product is beautiful. I love it.
---------------------------------------------------------------
This product is awesome. It have many convenient features.
---------------------------------------------------------------
I went this restaurant three days ago. The food is too bad.
---------------------------------------------------------------
In the end, I would like to save the POS tagging results shown above after importing the csv file, writing each POS-tagged sentence of each row in csv format.
Two formats might be possible. The first might be as follows (no header, each POS-tagged sentence in one row).
----------------------------------------------------------------------------
[[('I', 'PRON'), ('like', 'VBD'), ('this', 'PRON'), ('product', 'NN')]]
----------------------------------------------------------------------------
[[('This', 'PRON'), ('product', 'NN'), ('is', 'VERB'), ('beautiful', 'ADJ')]]
---------------------------------------------------------------------------
[[('I', 'PRON'), ('love', 'VERB'), ('it', 'PRON')]]
----------------------------------------------------------------------------
...
The second format might look like this (no header, each (token, POS tag) pair saved in one cell):
----------------------------------------------------------------------------
('I', 'PRON') | ('like', 'VBD') | ('this', 'PRON') | ('product', 'NN')
----------------------------------------------------------------------------
('This', 'PRON') | ('product', 'NN') | ('is', 'VERB') | ('beautiful', 'ADJ')
---------------------------------------------------------------------------
('I', 'PRON') | ('love', 'VERB') | ('it', 'PRON') |
----------------------------------------------------------------------------
...
I prefer the second format to the first one.
The Python code I wrote here works perfectly, but I would like to do the same thing for a csv file and save the result on my local machine.
The final purpose is to extract only the noun-type words (e.g., NN, NNP) from the sentences.
Can somebody help me fix the Python code?
Please refer to the question already answered in the linked SO post. You can tag the text and then filter out just the nouns, as described there.
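For the saving step, here is a minimal sketch of how the second format could be written with the standard csv module. The tagged sentences are copied from the question's own output so the example runs without NLTK; in a real script they would come from pos_tag applied to each row read with csv.reader. The helper name write_tagged_rows is made up for this illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Tagged sentences copied from the question (one from the printed tagger output,
# one from the desired-output example); in practice these would come from pos_tag.
tagged_sentences = [
    [('Warren', 'NNP'), ('joined', 'VBD'), ('other', 'JJ'),
     ('Democrats', 'NNPS'), ('in', 'IN'), ('support', 'NN'), ('.', '.')],
    [('I', 'PRON'), ('love', 'VERB'), ('it', 'PRON')],
]


def write_tagged_rows(tagged, fileobj):
    """Write one CSV row per sentence, with one "(token, tag)" pair per cell."""
    writer = csv.writer(fileobj)
    for sentence in tagged:
        writer.writerow([str(pair) for pair in sentence])


# An in-memory buffer stands in for open('tagged.csv', 'w', newline='').
buffer = io.StringIO()
write_tagged_rows(tagged_sentences, buffer)
print(buffer.getvalue())

# Reading the data back gives one list of cells per sentence.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```

Each cell holds the text of a tuple, so the csv writer quotes it because of the embedded comma. If the pairs are needed as tuples again after reading, ast.literal_eval can parse each cell.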

NLTK RegexpParser, chunk phrase by matching exactly one item

I'm using NLTK's RegexpParser to chunk a noun phrase, which I define with a grammar as
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
cp = RegexpParser(grammar)
This is grand, it is matching a noun phrase as:
DT if it exists
JJ in whatever number
NN or NNS, at least one
Now, what if I want to match the same thing, but with exactly one JJ instead of any number of them? So I want to match DT if it exists, one JJ and 1+ NN/NNS. If there are more than one JJ, I want to match only one of them, the one nearest to the noun (and DT if there is, and NN/NNS).
The grammar
grammar = "NP: {<DT>?<JJ><NN|NNS>+}"
would match only when there is just one JJ, the grammar
grammar = "NP: {<DT>?<JJ>{1}<NN|NNS>+}"
which I thought would work given the typical Regexp patterns, raises a ValueError.
For example, in "This beautiful green skirt", I'd like to chunk "This green skirt".
So, how would I proceed?
The grammar grammar = "NP: {<DT>?<JJ><NN|NNS>+}" is correct for the requirement you mentioned.
For the example you gave in the comments section, where you are not getting DT in the output:
"This beautiful green skirt is for you."
Tree('S', [('This', 'DT'), ('beautiful', 'JJ'), Tree('NP', [('green','JJ'),
('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
In your example there are 2 consecutive JJs, which do not meet your requirement as you stated it: "I want to match DT if it exists, one JJ and 1+ NN/NNS."
For the updated requirement:
I want to match DT if it exists, one JJ and 1+ NN/NNS. If there are more than one JJ, I want to match only one of them, the one nearest to the noun (and DT if there is, and NN/NNS).
Here, you will need to use
grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
and post-process the NP chunks to remove the extra JJs.
Code:
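from nltk import Tree
chunk_output = Tree('S', [Tree('NP', [('This', 'DT'), ('beautiful', 'JJ'), ('green','JJ'), ('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])
for child in chunk_output:
    # Only NP subtrees need post-processing; plain (word, tag) tuples are skipped.
    if isinstance(child, Tree) and child.label() == 'NP':
        for num in range(len(child)):
            # Drop a JJ that is directly followed by another JJ, so that only
            # the adjective nearest to the noun is printed.
            next_is_jj = num + 1 < len(child) and child[num + 1][1] == 'JJ'
            if not (child[num][1] == 'JJ' and next_is_jj):
                print(child[num][0])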
Output:
This
green
skirt

How to add compound words to the tagger in NLTK?

I was wondering if anyone has any idea how to combine multiple terms into a single term in the NLTK taggers.
For example, when I do:
nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))
It gives me:
[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
How do I make it put 'Apple' and 'Incorporated' together to be ('Apple Incorporated', 'NNP')?
You could try taking a look at nltk.RegexpParser. It allows you to chunk part-of-speech-tagged content based on regular expressions.
In your example, you could do something like
pattern = "NP:{<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(repr(t))
This would give you:
Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
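If the goal is literally a single ('Apple Incorporated', 'NNP') entry, one option is to merge runs of adjacent proper-noun tokens after tagging. The sketch below does this on the tagger output quoted in the question, using only the standard library. The helper merge_proper_nouns is a made-up name for this illustration, and joining every adjacent NNP run is an assumption that can be too eager when two separate names happen to sit next to each other.

```python
from itertools import groupby

# Tagger output copied from the question.
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]


def merge_proper_nouns(tagged_tokens, tag='NNP'):
    """Join each run of adjacent tokens carrying `tag` into one (text, tag) pair."""
    merged = []
    # groupby yields one group per run of consecutive tokens with the same tag.
    for current_tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if current_tag == tag:
            merged.append((' '.join(words), tag))
        else:
            # Runs of any other tag are kept as separate tokens.
            merged.extend((word, current_tag) for word in words)
    return merged


print(merge_proper_nouns(tagged))
# [('Apple Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
```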
The code is doing exactly what it is supposed to do. It is adding Part Of Speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.
I wonder if you are trying to use the wrong tool for the job. What are you trying to do, and why? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look at NLTK's collocations module.

Extract product name from english text

I want to extract the names of products being sold from English text.
For example:
"I'm selling my xbox brand new"
"Selling rarely used 27 inch TV"
Should give me "xbox" and "27 inch TV"
The only thing I can think of at the moment is to hardcode in a giant list of important nouns and important adjectives: ['tv', 'fridge', 'xbox', 'laptop', etc]
Is there a better approach?
It looks like nltk will give you a list of words and their parts of speech. Since you are only interested in nouns, this will provide you with them:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]
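To keep only the nouns from output like this, a simple filter on the tag is enough. The sketch below works on the tagged list shown above, so it runs without NLTK. It relies on the Penn Treebank convention that noun tags start with "NN"; a multi-word product such as "27 inch TV" would need an extra chunking step that is not shown here.

```python
# Tagger output copied from the answer above.
tagged = [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
          ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['John', 'idea']
```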
