How to exclude prepositions and conjunctions while tokenizing with nltk? [duplicate] - python

I have the following sentence:
sentence="The other day I met with Juan and Mary"
I want to tokenize it but keep only the main words, that is: other, day, I, met, Juan, Mary. So far I have tokenized it with the nltk library as follows:
tokens=nltk.word_tokenize(sentence)
Which gives me the following:
['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']
I have also tried tagging the words with nltk.pos_tag(tokens), obtaining:
[('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'), ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'), ('Mary', 'NNP')]
With these tags I could delete the words I don't want, such as the ones mentioned above, simply by checking each tag and dropping the tuple. However, I am wondering whether there is a more direct way to do it, or a command in nltk that does it for me.
Any help would be appreciated! Thank you very much.
Edit: This post is not only about eliminating stopwords; it asks what other options there are for doing so, as illustrated above with nltk.pos_tag(tokens).
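The tag-based filtering described in the question can be sketched as below. This is only an illustration: the tagged list is copied from the pos_tag output shown above, and the EXCLUDED_TAGS set (determiners, prepositions, coordinating conjunctions) is an assumption about which Penn Treebank tags count as "not main words".

```python
# Tagged output copied from the question (result of nltk.pos_tag(tokens)).
tagged = [('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'),
          ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'),
          ('Mary', 'NNP')]

# Assumed set of tags to drop: determiner, preposition, coordinating conjunction.
EXCLUDED_TAGS = {'DT', 'IN', 'CC'}

# Keep only the words whose tag is not in the excluded set.
main_words = [word for word, tag in tagged if tag not in EXCLUDED_TAGS]
print(main_words)  # ['other', 'day', 'I', 'met', 'Juan', 'Mary']
```

Unlike a stop-word list, this keeps "other" and "I", which matches the output asked for in the question.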

As @BoarGules said in the comments, it seems you want to remove stopwords from your sentence and are looking for a direct way to do that. Here is a solution:
import nltk
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en'))  # around 900 stopwords
nltk_words = list(stopwords.words('english'))  # around 150 stopwords
stop_words.extend(nltk_words)
sentence = "The other day I met with Juan and Mary"  # your sentence
tokens = nltk.word_tokenize(sentence)
output = []
for word in tokens:
    if word not in stop_words:  # case-sensitive comparison against lowercase stop words
        output.append(word)
print(output)
This gives the following output:
['The', 'day', 'I', 'met', 'Juan', 'Mary']
Hope this helps!
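Note that 'The' survives in the output above because the stop-word lists are lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows; STOP_WORDS here is a small hand-written subset used only for illustration, not the real combined list.

```python
# Illustrative subset standing in for the combined stop-word list (assumption).
STOP_WORDS = {'the', 'other', 'i', 'with', 'and'}

tokens = ['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']

# Lowercase each token before the membership test so 'The' matches 'the'.
filtered = [token for token in tokens if token.lower() not in STOP_WORDS]
print(filtered)  # ['day', 'met', 'Juan', 'Mary']
```

Lowercasing also removes 'I', so whether this variant is preferable depends on which words you actually want to keep.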

Related

How to use the universal POS tags with nltk.pos_tag() function?

I have a text and I want to count how many 'ADJ', 'PRON', 'VERB', 'NOUN', etc. tags it contains.
I know there is the pos_tag() function, but it returns a different set of tags, and I want results such as 'ADJ', 'PRON', 'VERB', 'NOUN'.
This is my code:
import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
sample = "this is my sample text that I want to analyze with programming language"
# tokenizing text (make a list with every word)
sample_tokenization = word_tokenize(sample)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()
# tagging words (word_tokenize already returns a list, so no split is needed)
tagged_words = nltk.pos_tag(sample_tokenization)
print(tagged_words)
print()
# showing the count of every type of word for new text
count_of_word_type = Counter(word_type for word, word_type in tagged_words)
count_of_word_type_list = count_of_word_type.most_common()  # list of (tag, count) tuples
print(count_of_word_type_list)
for w_type, num in count_of_word_type_list:
    print(w_type, num)
print()
The code above works but I want to find a way to get this type of tags:
Tag   Meaning              English Examples
ADJ   adjective            new, good, high, special, big, local
ADP   adposition           on, of, at, with, by, into, under
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner, article  the, a, some, most, every, no, which
NOUN  noun                 year, home, costs, time, Africa
NUM   numeral              twenty-four, fourth, 1991, 14:24
PRT   particle             at, on, out, over per, that, up, with
PRON  pronoun              he, their, her, its, my, I, us
VERB  verb                 is, say, told, given, playing, would
.     punctuation marks    . , ; !
X     other                ersatz, esprit, dunno, gr8, univeristy
I saw that there is a chapter here: https://www.nltk.org/book/ch05.html
That says:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
But I do not know how to apply that on my sample sentence.
Thanks for your help.
From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
# Default Penn Treebank tagset.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
# Universal POS tags.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
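To get the per-tag counts the question asks for, the universal-tagged output can be fed to collections.Counter. A minimal sketch, with the tagged list copied from the pos_tag(..., tagset='universal') output above so that it runs without NLTK data installed:

```python
from collections import Counter

# Output of pos_tag(..., tagset='universal') copied from the example above.
tagged = [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'),
          ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'),
          ('bad', 'ADJ'), ('.', '.')]

# Count how often each universal tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

In the asker's own code, the same effect should come from passing tagset='universal' to nltk.pos_tag and leaving the Counter part unchanged.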

Python TextBlob Package - Determines POS tag for '%' symbol but do not print it as a word

I have been struggling with Python's TextBlob package, which
identifies sentences from paragraphs,
identifies words from sentences,
determines POS (Part of Speech) tags for those words, etc.
Everything was going well until I found a possible issue, unless I am mistaken. It is explained below with a sample code snippet.
from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample) #Passing it to TextBlob package.
Words = blob.words #Splitting the Sentence into words.
Tags = blob.tags #Determining POS tag for each words in the sentence
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']
As seen above, blob.tags treats the '%' symbol as a separate word and assigns it a POS tag,
whereas blob.words does not return the '%' symbol at all, either alone or together with the preceding word.
I am creating a data frame from the output of both, so it fails to be created because of the length mismatch.
Here are my questions:
Is this possibly an issue in the TextBlob package?
And is there any way to get '%' into the Words list?
Stripping punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624
They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob down to the NLTK tokenizer.
When faced with a similar issue, I replaced the punctuation of interest with a non-dictionary text constant that represents it, i.e. replaced '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
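The placeholder trick can be sketched as follows. To keep the example self-contained, a simple regex stands in for a punctuation-stripping tokenizer; it is an assumption for illustration and is not TextBlob's actual tokenizer.

```python
import re

PLACEHOLDER = 'PUNCTPERCENT'  # non-dictionary constant standing in for '%'

sample = 'This is greater than that by 5%.'

# Swap '%' for the placeholder, padded with spaces so it becomes its own token.
protected = sample.replace('%', ' ' + PLACEHOLDER + ' ')

# Stand-in tokenizer that keeps only runs of word characters (drops punctuation).
words = re.findall(r'\w+', protected)

# Map the placeholder back to '%' once tokenization is done.
restored = ['%' if word == PLACEHOLDER else word for word in words]
print(restored)  # ['This', 'is', 'greater', 'than', 'that', 'by', '5', '%']
```

The restored list has eight entries, the same length as the tags list in the question, so both can go into one data frame.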
EDIT: I stand corrected. When initializing TextBlob you can set a tokenizer through the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328
So you could easily pass TextBlob a tokenizer that respects punctuation.
respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
EDIT2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Notice the docstring of the words method: it says you should access the tokens property instead of the words property if you want to include punctuation.
Finally, I found that NLTK identifies the symbols properly. The code snippet is given below for reference:
from nltk import word_tokenize
from nltk import pos_tag
Words = word_tokenize(sample)
Tags = pos_tag(Words)
print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']
print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

Finding Proper Nouns using NLTK WordNet

Is there any way to find proper nouns using NLTK WordNet? That is, can I tag possessive nouns using NLTK WordNet?
I don't think you need WordNet to find proper nouns; I suggest using the part-of-speech tagger pos_tag.
To find proper nouns, look for the NNP tag:
from nltk.tag import pos_tag
sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]
propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
# ['Michael','Jackson', 'McDonalds']
You may not be fully satisfied, since Michael and Jackson are split into two tokens; in that case you might need something more complex, such as a named entity tagger.
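Short of a named entity tagger, one simple heuristic is to join runs of consecutive NNP tokens into a single name. This is only a sketch under that assumption (adjacent proper nouns belong together), using the tagged sentence from the comment above:

```python
from itertools import groupby

# Tagged sentence copied from the example above.
tagged_sent = [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'),
               ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]

# Group consecutive tokens by whether they are tagged NNP, then join each NNP run.
full_names = [
    ' '.join(word for word, _ in group)
    for is_proper, group in groupby(tagged_sent, key=lambda pair: pair[1] == 'NNP')
    if is_proper
]
print(full_names)  # ['Michael Jackson', 'McDonalds']
```

The heuristic will wrongly merge two distinct names that happen to be adjacent, so treat it as a rough first pass.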
In principle, as documented in the Penn Treebank tagset (http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html), you can simply look for the POS tag to find possessive nouns. But often the tagger doesn't emit a POS tag when the word is an NNP.
To find possessive nouns, look for str.endswith("'s") or str.endswith("s'"):
from nltk.tag import pos_tag
sentence = "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'), ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'), ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]
possessives = [word for word in sentence.split() if word.endswith("'s") or word.endswith("s'")]
# ["Jackson's", "Agnes'"]
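If you also want the name behind each possessive, the suffix check can be combined with the NNP tag. A minimal sketch, where strip_possessive is a hypothetical helper written for this example and the tagged sentence is copied from the comment above:

```python
# Tagged sentence copied from the example above.
tagged_sent = [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'),
               ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'),
               ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]


def strip_possessive(word):
    """Hypothetical helper: drop a trailing 's or a trailing apostrophe after s."""
    if word.endswith("'s"):
        return word[:-2]
    if word.endswith("s'"):
        return word[:-1]
    return word


# Proper nouns that carry a possessive suffix (str.endswith accepts a tuple).
possessives = [word for word, pos in tagged_sent
               if pos == 'NNP' and word.endswith(("'s", "s'"))]
owners = [strip_possessive(word) for word in possessives]
print(possessives)  # ["Jackson's", "Agnes'"]
print(owners)       # ['Jackson', 'Agnes']
```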
Alternatively, you can use NLTK's ne_chunk, but it doesn't seem to add much unless you care about what kind of proper noun you get from the sentence:
>>> from itertools import chain
>>> from nltk.tree import Tree; from nltk.chunk import ne_chunk
>>> [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), Tree('PERSON', [('Daniel', 'NNP')])]
>>> [i[0] for i in list(chain(*[chunk.leaves() for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]))]
['Michael', 'Jackson', 'Daniel']
Using ne_chunk is a little verbose and it doesn't get you the possessives.
I think what you need is a part-of-speech tagger. This tool assigns a part-of-speech tag (e.g., proper noun, possessive pronoun, etc.) to each word in a sentence.
NLTK includes some taggers:
http://nltk.org/book/ch05.html
There's also the Stanford Part-Of-Speech Tagger (open source too, better performance).

Can't get a Counter() to work in python

I'm attempting to make a counter that takes a list of POS trigrams, checks it against a large list of trigrams, and finds their frequency.
My code so far is as follows:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk
text= ["This is an example sentence."]
trigram_top= ['PRP', 'MD', 'VB']
for words in text:
    tokens = wordpunct_tokenize(words)
    tags = nltk.pos_tag(tokens)
    trigram_list = trigrams(tags)
    list_tri = Counter(t for t in trigram_list if t in trigram_top)
    print(list_tri)
I get an empty counter back. How do I mend this?
In an earlier version I did get data back, but the counts kept accumulating with every iteration (in the real program, text is a collection of different files).
Does anyone have an idea?
Let's put some print in there to debug:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk
text= ["This is an example sentence."]
trigram_top= ['PRP', 'MD', 'VB']
for words in text:
    tokens = wordpunct_tokenize(words)
    print(tokens)
    tags = nltk.pos_tag(tokens)
    print(tags)
    list_tri = Counter(t[0] for t in tags if t[1] in trigram_top)
    print(list_tri)
#['This', 'is', 'an', 'example', 'sentence', '.']
#[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('.', '.')]
#Counter()
Note that the list= part was redundant, and I've changed the generator to take just the word instead of the POS tag.
We can see that none of the POS tags directly match your trigram_top; you may want to amend your comparison to cater for VB/VBZ...
One possibility is to change the line to:
list_tri=Counter (t[0] for t in tags if t[1].startswith(tuple(trigram_top)))
# Counter({'is': 1})
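If the goal is really to count POS trigrams, as the question states, the comparison has to be between tuples of three tags. The original code compared triples of (word, tag) pairs against plain strings, which can never match. A minimal sketch, where the tags are copied from the debug output above and target_trigrams is an assumed example set:

```python
from collections import Counter

# Tagged tokens copied from the debug output above.
tags = [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
        ('sentence', 'NN'), ('.', '.')]

# Keep only the tag sequence, then build overlapping windows of three tags.
pos_sequence = [pos for _, pos in tags]
pos_trigrams = list(zip(pos_sequence, pos_sequence[1:], pos_sequence[2:]))

# Assumed trigrams of interest; each one is a tuple of three tags.
target_trigrams = {('PRP', 'MD', 'VB'), ('DT', 'VBZ', 'DT')}

# Create a fresh Counter per text so counts do not accumulate across iterations.
trigram_counts = Counter(t for t in pos_trigrams if t in target_trigrams)
print(trigram_counts)  # Counter({('DT', 'VBZ', 'DT'): 1})
```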

nltk custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in a way that achieves the following:
It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized and tagged by nltk's default word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tagged and tokenized as follows if the custom phrase is "I am not interested".
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
The proper answer is to compile a large dataset tagged in the way you want, then train a machine learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."
DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []
    date_found = False
    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:  # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []
            date_found = False
        elif date_found and i == len(tagged) - 1:  # end of date found
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []
Todo: expand the DATE regex, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
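One way to cover part of that todo list is a longest-match pass over already tagged tokens. The sketch below is not the answer's method: merge_phrases, MAX_SPAN, CUSTOM_PHRASES and the month list are all assumptions made for illustration, and the input tags are hard-coded (partly invented) so that it runs without NLTK.

```python
import re

# Assumed month list and date pattern, for illustration only.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?$' % MONTHS)
CUSTOM_PHRASES = {"I am not interested"}  # known phrases to tag as CUSTOM
MAX_SPAN = 4  # longest phrase, in tokens, that is tried at each position


def merge_phrases(tagged):
    """Merge multi-token dates and custom phrases, trying the longest span first."""
    result = []
    i = 0
    while i < len(tagged):
        # Spans shorter than two tokens are skipped, so '5th' alone is not a date.
        for n in range(min(MAX_SPAN, len(tagged) - i), 1, -1):
            text = ' '.join(word for word, _ in tagged[i:i + n])
            if DATE.match(text):
                label = 'DATE'
            elif text in CUSTOM_PHRASES:
                label = 'CUSTOM'
            else:
                continue
            result.append((text, label))
            i += n
            break
        else:
            # No span matched here: keep the tagger's own (word, tag) pair.
            result.append(tagged[i])
            i += 1
    return result


# Hard-coded stand-in for pos_tag(word_tokenize(s)); the tags are illustrative.
tagged = [('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'),
          ('go', 'VB'), ('there', 'RB'), ('on', 'IN'), ('5th', 'CD'),
          ('November', 'NNP'), ('2010', 'CD'), (',', ','), ('but', 'CC'),
          ('I', 'PRP'), ('am', 'VBP'), ('not', 'RB'), ('interested', 'JJ'),
          ('.', '.')]
merged = merge_phrases(tagged)
print(merged)
```

Trying the longest span first is what gives the longest match; TIME patterns would need their own regex added in the same way.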
You should probably do chunking with the nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
