Is there any way to find proper nouns using NLTK WordNet? I.e., can I tag possessive nouns using NLTK WordNet?
I don't think you need WordNet to find proper nouns; I suggest using the part-of-speech tagger pos_tag.
To find Proper Nouns, look for the NNP tag:
from nltk.tag import pos_tag
sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]
propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
# ['Michael','Jackson', 'McDonalds']
You may not be satisfied, since Michael and Jackson are split into two tokens; in that case you might need something more complex, such as a named entity tagger.
Normally, as documented in the Penn Treebank tagset, you could simply look for the POS tag to find possessives, http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html. But the tagger often doesn't emit a POS tag when the token is tagged NNP.
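On the other hand, if you tokenize with word_tokenize instead of str.split, the clitic 's usually becomes its own token and is tagged POS. A minimal sketch (the exact output may vary by NLTK version):
from nltk import word_tokenize, pos_tag
# word_tokenize splits off the possessive clitic, so it can receive the POS tag
print(pos_tag(word_tokenize("Daniel Jackson's hamburger")))
# e.g. [('Daniel', 'NNP'), ('Jackson', 'NNP'), ("'s", 'POS'), ('hamburger', 'NN')]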
To find Possessive Nouns, look for str.endswith("'s") or str.endswith("s'"):
from nltk.tag import pos_tag
sentence = "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'), ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'), ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]
possessives = [word for word in sentence.split() if word.endswith("'s") or word.endswith("s'")]
# ["Jackson's", "Agnes'"]
Alternatively, you can use NLTK's ne_chunk, but it doesn't add much unless you are concerned about what kind of proper noun you get from the sentence:
>>> from itertools import chain
>>> from nltk.tree import Tree; from nltk.chunk import ne_chunk
>>> [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), Tree('PERSON', [('Daniel', 'NNP')])]
>>> [i[0] for i in list(chain(*[chunk.leaves() for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]))]
['Michael', 'Jackson', 'Daniel']
Using ne_chunk is a little verbose and it doesn't get you the possessives.
I think what you need is a tagger, a part-of-speech tagger. This tool assigns a part-of-speech tag (e.g., proper noun, possessive pronoun, etc.) to each word in a sentence.
NLTK includes some taggers:
http://nltk.org/book/ch05.html
There's also the Stanford Part-Of-Speech Tagger (also open source, with better performance).
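If you want to try it from within NLTK, there is a wrapper class; a rough sketch, assuming you have downloaded the Stanford POS Tagger yourself (the model and jar paths below are placeholders for your local copies):
from nltk.tag import StanfordPOSTagger
# Point these at your own download of the Stanford POS Tagger.
st = StanfordPOSTagger('english-bidirectional-distsim.tagger',
                       path_to_jar='stanford-postagger.jar')
print(st.tag('Michael Jackson likes to eat at McDonalds'.split()))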
Related
I have a text and I want to find the number of 'ADJ', 'PRON', 'VERB', 'NOUN' tags, etc.
I know that there is a .pos_tag() function, but it gives me different tags, and I want the results as 'ADJ', 'PRON', 'VERB', 'NOUN'.
This is my code:
import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
sentence = "this is my sample text that I want to analyze with programming language"
# tokenizing text (make list with evey word)
sample_tokenization = word_tokenize(sample)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()
# tagging words
tagged_words = nltk.pos_tag(sample_tokenization)  # sample_tokenization is already a list of tokens
print(tagged_words)
print()
# showing the count of every type of word for new text
count_of_word_type = Counter(word_type for word, word_type in tagged_words)
count_of_word_type_list = count_of_word_type.most_common() # making a list of tuples counts
print(count_of_word_type_list)
for w_type, num in count_of_word_type_list:
    print(w_type, num)
print()
The code above works, but I want to find a way to get these kinds of tags:
Tag Meaning English Examples
ADJ adjective new, good, high, special, big, local
ADP adposition on, of, at, with, by, into, under
ADV adverb really, already, still, early, now
CONJ conjunction and, or, but, if, while, although
DET determiner, article the, a, some, most, every, no, which
NOUN noun year, home, costs, time, Africa
NUM numeral twenty-four, fourth, 1991, 14:24
PRT particle at, on, out, over, per, that, up, with
PRON pronoun he, their, her, its, my, I, us
VERB verb is, say, told, given, playing, would
. punctuation marks . , ; !
X other ersatz, esprit, dunno, gr8, univeristy
I saw that there is a chapter here: https://www.nltk.org/book/ch05.html
That says:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
But I do not know how to apply that to my sample sentence.
Thanks for your help.
From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
# Default Penntreebank tagset.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
# Universal POS tags.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
I am looking for a way to influence the behavior of the IOB tagging of Python's NLTK.
Consider the following piece of code:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem import PorterStemmer
from nltk.tag import untag, str2tuple, tuple2str
from nltk.chunk import tree2conllstr, conllstr2tree, conlltags2tree, tree2conlltags
import nltk
text = "Drive me from Seattle to Brussels"
# Tokenize the text into words
tokens = word_tokenize(text)
# Part of speech tagging
tagged_tokens = pos_tag(tokens)
# Create named entity tree of tagged tokens
ner_tree = ne_chunk(tagged_tokens)
# Get tag structure
iob_tagged = tree2conlltags(ner_tree)
print(iob_tagged)
This outputs the following values:
[('Drive', 'VB', 'O'), ('me', 'PRP', 'O'), ('from', 'IN', 'O'), ('Seattle', 'NNP', 'B-GPE'), ('to', 'TO', 'O'), ('Brussels', 'VB', 'O')]
Is there a way I can influence or tune the NLTK algorithm/model so that the last word (Brussels) is tagged as a geopolitical entity (GPE) instead of a verb? I understand why the verb tag appears: it follows the word 'to', which often precedes verbs.
I was banging my head against Python's TextBlob package, which
identifies sentences from paragraphs
identifies words from sentences
determines POS (part-of-speech) tags for those words, etc.
Everything was going well until I found what may be an issue. It is explained below with a sample code snippet.
from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample) #Passing it to TextBlob package.
Words = blob.words #Splitting the Sentence into words.
Tags = blob.tags #Determining POS tag for each words in the sentence
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']
As seen above, the blob.tags function treats the '%' symbol as a separate word and determines a POS tag for it as well.
Whereas the blob.words function does not include the '%' symbol at all, either on its own or together with its previous word.
I am creating a data frame with the output of both functions, so it fails to build due to the length mismatch.
Here are my questions.
Is this a possible issue in the TextBlob package, by any chance?
And is there any way to identify '%' in the Words list?
Stripping off punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624
They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob down to the NLTK tokenizer.
When faced with a similar issue, I've replaced interesting punctuation with a non-dictionary text constant that stands in for it, i.e., replacing '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
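A minimal sketch of that workaround:
from textblob import TextBlob

sample = "This is greater than that by 5%."
# Replace '%' with a placeholder word before handing the text to TextBlob.
blob = TextBlob(sample.replace('%', ' PUNCTPERCENT'))
print(blob.words)
# roughly: ['This', 'is', 'greater', 'than', 'that', 'by', '5', 'PUNCTPERCENT']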
EDIT: I stand corrected; on TextBlob initialization you can set a tokenizer through the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328.
So you could easily pass TextBlob a tokenizer that respects punctuation.
respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
EDIT2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Notice the docstring of the words method: it says you should access the tokens property instead of the words property if you want to include punctuation.
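For the sentence above, that would look roughly like this (output from the default tokenizer, so treat it as approximate):
from textblob import TextBlob

blob = TextBlob("This is greater than that by 5%.")
print(blob.tokens)   # keeps punctuation, e.g. [..., '5', '%', '.']
print(blob.words)    # strips punctuation, e.g. [..., '5']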
Finally, I found out that NLTK identifies the symbols properly. The code snippet is given below for reference:
from nltk import word_tokenize
from nltk import pos_tag
sample = "This is better than that by 5%."
Words = word_tokenize(sample)
Tags = pos_tag(Words)
print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.
from nltk import tag
from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk
sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())
print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
print(print_chunk)
and this is the result:
[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]
My question: is it possible not to include the pos_tag (like NNP above) and only include the Tree labels 'GPE' and 'PERSON'?
And what does 'GPE' mean?
Thanks in advance.
The named entity chunker will give you a tree containing both chunks and tags. You can't change that, but you can take the tags out. Starting from your tagged_sent:
import nltk
from nltk.tree import Tree

chunks = nltk.ne_chunk(tagged_sent)
simple = []
for elt in chunks:
    if isinstance(elt, Tree):
        simple.append(Tree(elt.label(), [word for word, tag in elt]))
    else:
        simple.append(elt[0])
If you only want the chunks, omit the else: clause in the above. You can adapt the code to wrap the chunks any way you want. I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of multiple words (try adding "New York" to your example), so the chunk's contents must be a list, not a single element.
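For instance, a multi-word chunk shows up as a single Tree with several leaves; a small sketch of that point (the exact entity labels may vary):
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

sent = "Michael and John is reading a booklet in a library of New York"
for elt in ne_chunk(pos_tag(sent.split())):
    if isinstance(elt, Tree):
        print(elt.label(), [word for word, tag in elt])
# e.g. GPE ['Michael'], PERSON ['John'], GPE ['New', 'York']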
PS. "GPE" stands for "geo-political entity" (obviously a chunker mistake). You can see a list of the "commonly used tags" in the nltk book, here.
Most probably, a slight modification of the code at https://stackoverflow.com/a/31838373/610569 to include the tags is what you require.
is it possible not to include pos_tag (like NNP above) and only include Tree 'GPE','PERSON'?
Yes, simply traverse the Tree object =) See How to Traverse an NLTK Tree object?
>>> from nltk import Tree, pos_tag, ne_chunk
>>> sentence = "Michael and John is reading a booklet in a library of Jakarta"
>>> tagged_sent = ne_chunk(pos_tag(sentence.split()))
>>> tagged_sent
Tree('S', [Tree('GPE', [('Michael', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('booklet', 'NN'), ('in', 'IN'), ('a', 'DT'), ('library', 'NN'), ('of', 'IN'), Tree('GPE', [('Jakarta', 'NNP')])])
>>> from nltk.sem.relextract import NE_CLASSES
>>> ace_tags = NE_CLASSES['ace']
>>> for node in tagged_sent:
...     if type(node) == Tree and node.label() in ace_tags:
...         words, tags = zip(*node.leaves())
...         print(node.label() + '\t' + ' '.join(words))
...
GPE Michael
PERSON John
GPE Jakarta
What does 'GPE' mean?
GPE means "Geo-Political Entity"
The GPE tag came from the ACE dataset
There are two pre-trained NE chunkers available, see https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py#L164
There are 3 tag sets that are supported: https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L31
For a detailed explanation, see NLTK relation extraction returns nothing
Chapter 5 of the Python NLTK book gives this example of tagging words in a sentence:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
nltk.pos_tag calls the default tagger, which uses a full set of tags. Later in the chapter a simplified set of tags is introduced.
How can I tag sentences with this simplified set of part-of-speech tags?
Also, have I understood the tagger correctly? That is, can I change the tag set that the tagger uses, as I'm asking, or should I map the tags it returns onto the simplified set, or should I create a new tagger from a new, simply-tagged corpus?
Updated, in case anyone runs across the same problem. NLTK has since upgraded to a "universal" tagset, source here. Once you've tagged your text, use map_tag to simplify the tags.
import nltk
from nltk.tag import pos_tag, map_tag
text = nltk.word_tokenize("And now for something completely different")
posTagged = pos_tag(text)
simplifiedTags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged]
print(simplifiedTags)
# [('And', u'CONJ'), ('now', u'ADV'), ('for', u'ADP'), ('something', u'NOUN'), ('completely', u'ADV'), ('different', u'ADJ')]
To simplify tags from the default tagger, you can use nltk.tag.simplify.simplify_wsj_tag, like so:
>>> import nltk
>>> from nltk.tag.simplify import simplify_wsj_tag
>>> tagged_sent = nltk.pos_tag(tokens)
>>> simplified = [(word, simplify_wsj_tag(tag)) for word, tag in tagged_sent]
You can simply pass tagset='universal' to the pos_tag function.
In [39]: from nltk import word_tokenize, pos_tag
...:
...: text = word_tokenize("Here is a simple way of doing this")
...: tags = pos_tag(text, tagset='universal')
...: print(tags)
...:
[('Here', 'ADV'), ('is', 'VERB'), ('a', 'DET'), ('simple', 'ADJ'), ('way', 'NOUN'), ('of', 'ADP'), ('doing', 'VERB'), ('this', 'DET')]