I want to extract cardinal (CD) values associated with units of measurement and store them in a dictionary. For example, if the text contains tokens like "20 kgs", it should extract them and keep them in a dictionary.
Example:
For the input text "10-inch fry pan offers superb heat conductivity and distribution", the output dictionary should look like {"dimension": "10-inch"}.
For the input text "This bucket holds 5 litres of water.", the output should look like {"volume": "5 litres"}.
line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)
The above code gives the following output:
[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
Is there a way to extract the CD and UOM values from the text?
Not sure how flexible you need the process to be. You can play around with nltk.RegexpParser and come up with some good patterns:
import nltk
sentence = 'This bucket holds 5 litres of water.'
parser = nltk.RegexpParser(
"""
INDICATOR: {<CD><NNS>}
""")
print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))
Output:
(S
This/DT
bucket/NN
holds/VBZ
(INDICATOR 5/CD litres/NNS)
of/IN
water/NN
./.)
You can also create a corpus and train a chunker.
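If you also want the dictionary shown in the question, you can walk the INDICATOR chunks and build it. Here is a rough sketch; the UNIT_TO_KEY mapping is made up and you would supply your own:
import nltk

# Hypothetical mapping from unit words to dictionary keys; extend as needed.
UNIT_TO_KEY = {'litres': 'volume', 'kgs': 'weight'}

parser = nltk.RegexpParser("INDICATOR: {<CD><NNS>}")  # same grammar as above; broaden as needed

def extract_measurements(sentence):
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    result = {}
    # Walk only the INDICATOR chunks produced by the grammar above.
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'INDICATOR'):
        words = [word for word, tag in subtree.leaves()]
        unit = words[-1]
        result[UNIT_TO_KEY.get(unit, unit)] = ' '.join(words)
    return result

print(extract_measurements('This bucket holds 5 litres of water.'))
# expected: {'volume': '5 litres'}
Note that hyphenated cases like "10-inch" come through as a single token, so they would need separate handling.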
Hm, not sure if it helps, but I wrote it in JavaScript.
Here:
http://github.com/redaktor/nlp_compromise
It might be a bit under-documented yet, but the guys are porting it to a 2.0 branch now.
It should be easy to port to Python, considering "What's different between Python and JavaScript regular expressions?"
And: did you check Python's NLTK? http://www.nltk.org
I was banging my head against Python's TextBlob package, which
identifies sentences from paragraphs
identifies words from sentences
determines POS (part-of-speech) tags for those words, etc.
Everything was going well until I found what may be an issue, if I am not wrong. It is explained below with a sample code snippet.
from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample) #Passing it to TextBlob package.
Words = blob.words #Splitting the Sentence into words.
Tags = blob.tags #Determining POS tag for each words in the sentence
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']
As seen above, blob.tags treats the '%' symbol as a separate word and assigns it a POS tag.
blob.words, on the other hand, does not include the '%' symbol at all, either on its own or together with the previous word.
I am creating a data frame from the output of both functions, so it fails to build because of the length mismatch.
Here are my questions.
Is this a possible issue in the TextBlob package, by any chance?
And is there any way to get '%' into the Words list?
Stripping off punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624
They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob down to the NLTK tokenizer.
When faced with a similar issue, I've replaced interesting punctuation with a non-dictionary text constant that aims to represent it, i.e. replacing '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
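A rough sketch of that workaround (PUNCTPERCENT is just an arbitrary placeholder):
from textblob import TextBlob

sample = 'This is greater than that by 5%.'
# Swap '%' for a placeholder word before tokenizing so it survives blob.words,
# then map it back afterwards.
protected = sample.replace('%', ' PUNCTPERCENT')
words = ['%' if w == 'PUNCTPERCENT' else w for w in TextBlob(protected).words]
print(words)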
EDIT: I stand corrected; on TextBlob initialization you can set a tokenizer through the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328.
So you could easily pass TextBlob a tokenizer that respects punctuation.
respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
EDIT2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Notice the docstring of the words method, it says you should access the tokens property instead of the words property if you want to include punctuation.
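So, based on that docstring, the quickest fix may be to read blob.tokens instead of blob.words when punctuation matters:
from textblob import TextBlob

sample = 'This is greater than that by 5%.'
blob = TextBlob(sample)
print(blob.words)   # punctuation stripped, so '%' and '.' are missing
print(blob.tokens)  # punctuation kept, so '%' and '.' are included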
Finally, I found out that NLTK identifies the symbols properly. The code snippet is given below for reference:
from nltk import word_tokenize
from nltk import pos_tag

sample = 'This is greater than that by 5%.'  # same sample sentence as above
Words = word_tokenize(sample)
Tags = pos_tag(Words)
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5', '%']
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
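Since both lists now have the same length, the data frame mentioned earlier can be built directly. A minimal sketch, assuming pandas is the data frame library in use:
import pandas as pd
from nltk import pos_tag, word_tokenize

sample = 'This is greater than that by 5%.'
words = word_tokenize(sample)
tags = [tag for _, tag in pos_tag(words)]

# Both columns have the same length, so no mismatch error here.
df = pd.DataFrame({'word': words, 'pos': tags})
print(df)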
I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.
from nltk import tag
from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk
sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())
print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
print(print_chunk)
and this is the result:
[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]
My question: is it possible not to include the POS tag (like NNP above) and only include the Tree labels 'GPE' and 'PERSON'?
And what does 'GPE' mean?
Thanks in advance.
The named entity chunker will give you a tree containing both chunks and tags. You can't change that, but you can take the tags out. Starting from your tagged_sent:
import nltk  # Tree and tagged_sent come from your code above

chunks = nltk.ne_chunk(tagged_sent)
simple = []
for elt in chunks:
    if isinstance(elt, Tree):
        simple.append(Tree(elt.label(), [word for word, tag in elt]))
    else:
        simple.append(elt[0])
If you only want the chunks, omit the else: clause in the above. You can adapt the code to wrap the chunks any way you want. I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of multiple words (try adding "New York" to your example), so the chunk's contents must be a list, not a single element.
PS. "GPE" stands for "geo-political entity" (obviously a chunker mistake). You can see a list of the "commonly used tags" in the nltk book, here.
Most probably, a slight modification of the code at https://stackoverflow.com/a/31838373/610569 to handle the tags is what you require.
is it possible not to include pos_tag (like NNP above) and only include Tree 'GPE','PERSON'?
Yes, simply traverse the Tree object =) See How to Traverse an NLTK Tree object?
>>> from nltk import Tree, pos_tag, ne_chunk
>>> sentence = "Michael and John is reading a booklet in a library of Jakarta"
>>> tagged_sent = ne_chunk(pos_tag(sentence.split()))
>>> tagged_sent
Tree('S', [Tree('GPE', [('Michael', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('booklet', 'NN'), ('in', 'IN'), ('a', 'DT'), ('library', 'NN'), ('of', 'IN'), Tree('GPE', [('Jakarta', 'NNP')])])
>>> from nltk.sem.relextract import NE_CLASSES
>>> ace_tags = NE_CLASSES['ace']
>>> for node in tagged_sent:
...     if type(node) == Tree and node.label() in ace_tags:
...         words, tags = zip(*node.leaves())
...         print(node.label() + '\t' + ' '.join(words))
...
GPE Michael
PERSON John
GPE Jakarta
What does 'GPE' mean?
GPE means "Geo-Political Entity"
The GPE tag came from the ACE dataset
There are two pre-trained NE chunkers available, see https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py#L164
There are 3 tag sets that are supported: https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L31
For a detailed explanation, see NLTK relation extraction returns nothing
I have a list of consumer product manuals (about 100,000 .pdf files) scraped from the web. Now I want to categorize the files by manufacturer/brand and the category they belong to.
For example :
Samsung -> Monitors -> [ files list ]
Samsung -> Mobile Phones -> [ files list ]
etc ...
What I have done so far:
built a list of brands/manufacturers and a list of categories
extracted all the data as text from the PDF files using pyPdf
tokenized and tagged the text data with NLTK
It looks like this:
...
('3Com', 'CD')
('Corporation', 'NNP')
('reserves', 'NNS')
('the', 'DT')
('right', 'NN')
('to', 'TO')
('revise', 'VB')
('this', 'DT')
('documentation', 'NN')
('and', 'CC')
('to', 'TO')
('make', 'VB')
('changes', 'NNS')
('in', 'IN')
('content', 'NN')
('from', 'IN')
...
The problem I face now:
How can I match the tokens against my brand/category lists?
I have never had a chance to work with NLP before, and I am still trying to wrap my brain around this.
I am not sure this is an NLP issue. Here is how I would do it:
brand_names = ['Samsung', 'Lenovo', ...]
category_names = ['Monitors', 'Mobile Phones', ...]

pdf_string = read_my_pdf('theproduct.pdf')
pdf_string_lowered = pdf_string.lower()  # everything is lowercased to ignore case differences

# Keep only the brands and categories that actually appear in the text
brand_names_in_pdf = [brand for brand in brand_names if brand.lower() in pdf_string_lowered]
category_names_in_pdf = [category for category in category_names if category.lower() in pdf_string_lowered]

import itertools
tags = itertools.product(brand_names_in_pdf, category_names_in_pdf)  # tuples of (brand, category) candidates
This will seem very simple, but I think it will work better than any NLP tool you would use (how would you know whether a specific model number belongs to a mobile phone, or whether words related to mobile phones appear in a PDF about something else?). I think an exhaustive search is more robust.
The only real drawback of this method is variation in the words you are looking for. A solution would be to use regular expressions instead of plain tokens. For instance, you could accept 'Mobile Phones' or 'Mobile Phone' and categorize both under 'Mobile Phones'.
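For example, a small sketch of the regex variant (the patterns and category names here are only illustrative):
import re

# Map each canonical category to a pattern that tolerates simple variations.
CATEGORY_PATTERNS = {
    'Mobile Phones': re.compile(r'mobile\s+phones?', re.IGNORECASE),
    'Monitors': re.compile(r'monitors?', re.IGNORECASE),
}

def categories_in(pdf_string):
    return [name for name, pattern in CATEGORY_PATTERNS.items()
            if pattern.search(pdf_string)]

print(categories_in('User guide for your new mobile phone'))  # ['Mobile Phones']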
I would suggest a hybrid approach. Use a POS tagger to find NNP proper nouns then look them up in a company name dictionary.
This saves you from looking up determiners and other unlikely words. It should increase precision by reducing false positives where someone uses a company name as a verb (xerox, google), for example. On the downside, it might reduce recall by increasing false negatives where a company name gets mistagged and never looked up in your dictionary.
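A minimal sketch of that hybrid idea (the company dictionary here is only a stand-in):
import nltk

KNOWN_BRANDS = {'samsung', 'lenovo', '3com'}  # assumed dictionary of known brands

def brands_in(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Only look up words the tagger considers proper nouns (NNP/NNPS).
    candidates = {word.lower() for word, tag in tagged if tag.startswith('NNP')}
    return candidates & KNOWN_BRANDS

print(brands_in('The Samsung monitor ships with a detachable stand.'))
As noted above, anything the tagger mislabels (the question's own output shows '3Com' tagged as CD) will slip through, which is the recall trade-off.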
So, I was wondering if anyone has any idea how to combine multiple terms into a single term with the NLTK taggers.
For example, when I do:
nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))
It gives me:
[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
How do I make it put 'Apple' and 'Incorporated' together as ('Apple Incorporated', 'NNP')?
You could try taking a look at nltk.RegexpParser. It allows you to chunk part-of-speech tagged content based on regular expressions.
In your example, you could do something like
pattern = "NP:{<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(p)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print t
This would give you:
Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
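If you then want each chunk as a single string such as 'Apple Incorporated', you can walk the resulting tree; continuing from the t above:
for subtree in t.subtrees(filter=lambda s: s.label() == 'NP'):
    # Join the words of each NP chunk, dropping the individual tags.
    print(' '.join(word for word, tag in subtree.leaves()))
# prints 'Apple Incorporated' and then 'company'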
The code is doing exactly what it is supposed to do. It is adding Part Of Speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.
I wonder if you are trying to use the wrong tool for the job. What are you trying to do / Why are you trying to do it? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look here:
collocations module
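For reference, a small sketch of what collocation finding looks like in NLTK (the text here is just a stand-in corpus):
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("Apple Incorporated is the largest company. "
        "Apple Incorporated was founded in California.")
tokens = nltk.word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Rank adjacent word pairs; on a real corpus, pairs like ('Apple', 'Incorporated')
# should score highly.
print(finder.nbest(bigram_measures.likelihood_ratio, 5))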
Here is my requirement. I want to tokenize and tag a paragraph in such a way that it allows me to achieve the following.
Should identify date and time in the paragraph and Tag them as DATE and TIME
Should identify known phrases in the paragraph and Tag them as CUSTOM
And the rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows, given that the custom phrase is "I am not interested":
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
The proper answer is to compile a large dataset tagged in the way you want, then train a machine learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
s = "They all like to go there on 5th November 2010, but I am not interested."
DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')
def custom_tagger(sentence):
tagged = pos_tag(word_tokenize(sentence))
phrase = []
date_found = False
i = 0
while i < len(tagged):
(w,t) = tagged[i]
phrase.append(w)
in_date = DATE.match(' '.join(phrase))
date_found |= bool(in_date)
if date_found and not in_date: # end of date found
yield (' '.join(phrase[:-1]), 'DATE')
phrase = []
date_found = False
elif date_found and i == len(tagged)-1: # end of date found
yield (' '.join(phrase), 'DATE')
return
else:
i += 1
if not in_date:
yield (w,t)
phrase = []
Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
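For the "search for CUSTOM phrases" part of that todo, a simple post-processing pass over the (token, tag) pairs might look like this (the phrase list is an assumption):
CUSTOM_PHRASES = ["I am not interested"]  # assumed list of known phrases

def merge_custom_phrases(tagged_pairs, phrases=CUSTOM_PHRASES):
    tagged = list(tagged_pairs)
    result = []
    i = 0
    while i < len(tagged):
        merged = False
        for phrase in phrases:
            words = phrase.split()
            # Compare the next len(words) tokens against the phrase.
            if [w for w, _ in tagged[i:i + len(words)]] == words:
                result.append((phrase, 'CUSTOM'))
                i += len(words)
                merged = True
                break
        if not merged:
            result.append(tagged[i])
            i += 1
    return result

# Usage: merge_custom_phrases(custom_tagger(s))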
You should probably do chunking with the nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
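A hedged sketch of that idea for the DATE case (the exact tags pos_tag assigns to '5th' and '2010' can vary, so the pattern below is only a starting point):
import nltk

sentence = "They all like to go there on 5th November 2010, but I am not interested."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Assumes the day ends up tagged CD or JJ, the month NNP and the year CD.
chunker = nltk.RegexpParser("DATE: {<CD|JJ><NNP><CD>}")
print(chunker.parse(tagged))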