Python and NLTK: Baseline tagger

I am writing code for a baseline tagger. Based on the Brown corpus, it assigns the most common tag to each word. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user's input sentence it would be tagged as a plural noun. If the word is not found in the corpus, it is tagged as a noun by default.
The code I have so far returns every tag for the word, not just the most frequent one. How can I get it to return only the most frequent tag per word?
import nltk
from nltk.corpus import brown

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u == item[0][0]:
                t = (u, item[0][1])
                result.append(t)
                continue
            t = (u, "NN")
            result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print(tags)

if __name__ == '__main__':
    main()

If it's English, there is a default POS tagger in NLTK that a lot of people have been complaining about, but it's a nice quick fix (more like a band-aid than paracetamol); see POS tagging - NLTK thinks noun is adjective:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
If you want to train a baseline tagger from scratch, I recommend you follow an example like this, but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger
By building a UnigramTagger as in spaghetti-tagger, you should automatically get the most common tag for every word.
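For reference, a minimal sketch of that idea trained on Brown, with a DefaultTagger('NN') backoff so unseen words get the noun default you describe (the variable names here are just for illustration):
from nltk.corpus import brown
from nltk.tag import UnigramTagger, DefaultTagger

# Unigram tagger: most frequent tag per word, backing off to NN for unseen words.
baseline = UnigramTagger(brown.tagged_sents(), backoff=DefaultTagger('NN'))
print(baseline.tag("the quick brown fox".split()))
# Tags come from the Brown tagset, e.g. 'the' -> 'AT'.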
However, if you want to do it the non-machine-learning way and simply count word:POS pairs, what you'll need is some sort of type-token ratio. Also see Part-of-speech tag without context using nltk:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain

def type_token_ratio(documentstream):
    ttr = defaultdict(list)
    for token, pos in list(chain(*documentstream)):
        ttr[token].append(pos)
    return ttr

def most_freq_tag(ttr, word):
    return Counter(ttr[word]).most_common()[0][0]

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best tag for the word.
print(Counter(documents_ttr['quick']).most_common()[0])

# Best tags for a sentence
print([most_freq_tag(documents_ttr, i) for i in sent1.split()])
NOTE: A document stream can be defined as a list of sentences where each sentence contains a list of tokens, with or without tags.

Create a dictionary called word_tags whose keys are words (unannotated) and whose values are lists of tags in descending frequency (based on your fdist).
Then:
for u in uinput:
    result.append(word_tags[u][0])
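A minimal sketch of building that dictionary, assuming fdist is the nltk.FreqDist over (word, tag) pairs from your findtags function (word_tags and per_word are just illustrative names):
from collections import defaultdict
import nltk

word_tags = defaultdict(lambda: ["NN"])   # unseen words default to "NN"
per_word = defaultdict(nltk.FreqDist)
for (word, tag), count in fdist.items():
    per_word[word][tag] += count          # collect tag counts per word
for word, dist in per_word.items():
    word_tags[word] = [t for t, _ in dist.most_common()]   # tags by descending frequency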

You can simply use Counter to find the most repeated item in a list:
from collections import Counter
default_tag = Counter(tags).most_common(1)[0][0]
If your question is "how does a unigram tagger work?", you might be interested in reading the NLTK source code:
http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger
Anyway, I suggest you read NLTK book chapter 5, especially:
http://nltk.org/book/ch05.html#the-lookup-tagger
Just like the sample in the book, you can have a conditional frequency distribution, which returns the best tag for each given word:
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
In this case cfd["fox"].max() will return the most likely tag for "fox" according to the Brown corpus. Then you can make a dictionary of the most likely tags for each word of your sentence:
likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())
Notice that for words not seen in the corpus this will raise an error. But if you understand the idea, you can make your own tagger.
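For instance, a sketch with the "NN" fallback from your question for unseen words (tag_word is a made-up helper name):
import nltk

cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())

def tag_word(word):
    # Most frequent Brown tag for the word, or "NN" if the word was never seen.
    return cfd[word].max() if cfd[word] else "NN"

print([(w, tag_word(w)) for w in "the quick brown fox".split()])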

Related

Using WordNet with nltk to find synonyms that make sense

I want to input a sentence, and output a sentence with hard words made simpler.
I'm using NLTK to tokenize sentences and tag words, but I'm having trouble using WordNet to find a synonym for the specific meaning of a word that I want.
For example:
Input:
"I refuse to pick up the refuse"
Maybe refuse #1 is the easiest word for rejecting, but refuse #2 means garbage, and there are simpler words that could go there.
NLTK might be able to tag refuse #2 as a noun, but then how do I get synonyms for refuse (trash) from WordNet?
Sounds like you want word synonyms based upon the part of speech of the word (i.e. noun, verb, etc.).
The following creates synonyms for each word in a sentence based upon its part of speech.
References:
Extract Word from Synset using Wordnet in NLTK 3.0
Printing the part of speech along with the synonyms of the word
Code
import nltk; nltk.download('popular')
from nltk.corpus import wordnet as wn

def get_synonyms(word, pos):
    ''' Gets word synonyms for part of speech '''
    for synset in wn.synsets(word, pos=pos_to_wordnet_pos(pos)):
        for lemma in synset.lemmas():
            yield lemma.name()

def pos_to_wordnet_pos(penntag, returnNone=False):
    ''' Mapping from Penn Treebank POS tag to WordNet POS tag '''
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return None if returnNone else ''
Example Usage
# Tokenize text
text = nltk.word_tokenize("I refuse to pick up the refuse")
for word, tag in nltk.pos_tag(text):
    print(f'word is {word}, POS is {tag}')
    # Filter for unique synonyms not equal to the word, and sort.
    unique = sorted(set(synonym for synonym in get_synonyms(word, tag) if synonym != word))
    for synonym in unique:
        print('\t', synonym)
Output
Note the different sets of synonyms for refuse based upon POS.
word is I, POS is PRP
word is refuse, POS is VBP
     decline
     defy
     deny
     pass_up
     reject
     resist
     turn_away
     turn_down
word is to, POS is TO
word is pick, POS is VB
     beak
     blame
     break_up
     clean
     cull
     find_fault
     foot
     nibble
     peck
     piece
     pluck
     plunk
word is up, POS is RP
word is the, POS is DT
word is refuse, POS is NN
     food_waste
     garbage
     scraps

Generate bigrams BUT only noun and verb combinations

I have some code below that generates bigrams for my data frame column.
import nltk
import collections

counts = collections.Counter()
for sent in df["message"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))

counts = {k: v for k, v in counts.items() if v > 25}
This works great for generating the most common bigrams in the 'message' column of my dataframe, BUT I only want bigrams that contain one verb and one noun per pair.
Any help doing this with spaCy or nltk would be appreciated!
With spaCy, you have access to pre-trained models in various languages. You can install them like so: python -m spacy download en_core_web_sm
Then, you can easily run something like this to do custom filtering:
import spacy

text = "The sleeping cat thought that sitting in the couch resting would be a great idea."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

for i in range(len(doc)):
    j = i + 1
    if j < len(doc):
        if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN"):
            print(doc[i].text, doc[j].text, doc[i].pos_, doc[j].pos_)
which will output
sleeping cat VERB NOUN
cat thought NOUN VERB
couch resting NOUN VERB
You have to apply pos_tag first and then bigrams. You can try it like this:
import nltk
sent = 'The thieves stole the paintings'
token_sent = nltk.word_tokenize(sent)
tagged_sent = nltk.pos_tag(token_sent)
word_tag_pairs = nltk.bigrams(tagged_sent)
##Apply conditions according to your requirement to filter the bigrams
print([(a,b) for a, b in word_tag_pairs if a[1].startswith('N') and b[1].startswith('V')])
It just gives an output of
[(('thieves', 'NNS'), ('stole', 'VBD'))]
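If you want to keep the Counter-over-a-dataframe-column shape from your original snippet, a sketch along these lines should work (df["message"] and the v > 25 threshold are taken from the question; the filtering logic mirrors the answers above):
import collections
import nltk

counts = collections.Counter()
for sent in df["message"]:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged):
        # Keep only noun/verb or verb/noun pairs.
        if (t1.startswith('N') and t2.startswith('V')) or (t1.startswith('V') and t2.startswith('N')):
            counts[(w1, w2)] += 1

counts = {k: v for k, v in counts.items() if v > 25}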

Use Python to print sentences belonging to most common words in a document

I have a text document, and I am using regex and NLTK to find the top 5 most common words in this document. I have to print out the sentences these words belong to; how do I do that? Further, I want to extend this to finding common words across multiple documents and returning their respective sentences.
import nltk
import collections
from collections import Counter
import re
import string
frequency = {}
document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string) #return all the words with the number of characters in the range [3-15]
fdist = nltk.FreqDist(match_pattern) # creates a frequency distribution from a list
most_common = fdist.max() # returns a single element
top_five = fdist.most_common(5)# returns a list
list_5=[word for (word, freq) in fdist.most_common(5)]
print(top_five)
print(list_5)
Output:
[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)]
['you', 'tuples', 'the', 'are', 'pard']
The output is the most commonly occurring words. I have to print the sentences these words belong to; how do I do that?
Although it doesn't account for special characters at word boundaries like your code does, the following would be a starting point:
for sentence in text_string.split('.'):
    if list(set(list_5) & set(sentence.split(' '))):
        print(sentence)
We first iterate over the sentences, assuming each sentence ends with a . and that the . character appears nowhere else in the text. Then we print the sentence if the intersection of its set of words with the set of words in your list_5 is not empty.
You will have to install NLTK Data if you haven't already done so.
From http://www.nltk.org/data.html :
Run the Python interpreter and type the commands:
>>> import nltk
>>> nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory.
Then install the punkt model from the models tab.
Once you have that you can tokenize all sentences and extract the ones with your top 5 words in them as such:
sent_tokenize_list = nltk.sent_tokenize(text_string)
for sentence in sent_tokenize_list:
    for word in list_5:
        if word in sentence:
            print(sentence)
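Note that word in sentence is a substring test (so 'the' would also match 'there'); a slightly stricter sketch compares whole tokens instead, reusing list_5 and text_string from above:
top_words = set(list_5)
for sentence in nltk.sent_tokenize(text_string):
    # Match against whole tokens rather than substrings.
    if top_words & set(nltk.word_tokenize(sentence)):
        print(sentence)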

How to check two POS tags are in the same category in NLTK?

Like the title says, how can I check whether two POS tags are in the same category?
For example,
go -> VB
goes -> VBZ
These two words are both verbs. Or,
bag -> NN
bags -> NNS
These two are both nouns.
So my question is: is there any function in NLTK to check whether two given tags are in the same category?
Let's take the simple case first: Your corpus is tagged with the Brown tagset (that's what it looks like), and you'd be happy with the simple tags defined in the nltk's "universal" tagset: ., ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, VERB, X, where the dot stands for "punctuation". In this case, simply load the nltk's map and use it with your data:
tagmap = nltk.tag.mapping.tagset_mapping("en-brown", "universal")

if tagmap[tag1] == tagmap[tag2]:
    print("The two words have the same part of speech")
If that's not your use case, you'll need to manually decide on a mapping from each individual tag to the simplified category you want to assign it to. If you are working with the Brown corpus tagset, you can see the tags and their meanings here, or from within python like this:
nltk.help.brown_tagset()
Study your tags and define a dictionary that maps each POS tag to your chosen category; people sometimes find it useful to just group Brown corpus tags by their first two letters, putting together "NN", "NN$", "NNS-HL", etc. You could create this particular mapping automatically like this:
from nltk.corpus import brown
alltags = set(t for w, t in brown.tagged_words())
tagmap = dict((t, t[:2]) for t in alltags)   # map each full tag to its first two letters
Then you can customize this map according to your needs; e.g., to put all punctuation tags together in the category ".":
for tag in tagmap:
    if not tag.isalpha():
        tagmap[tag] = "."
Once your tagmap is to your liking, use it like the one I imported from the nltk.
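For example, a small hypothetical helper that works with either map (it uses .get so tags missing from the map don't raise):
def same_category(tag1, tag2, tagmap):
    # Two tags are in the same category if they map to the same simplified tag.
    return tagmap.get(tag1) is not None and tagmap.get(tag1) == tagmap.get(tag2)

print(same_category("NN", "NNS", tagmap))   # True: both map to the noun category
print(same_category("VB", "NNS", tagmap))   # False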
Finally, you might find it convenient to retag your entire corpus in one go, so that you can simply compare the assigned tags. If corpus is a list of tagged sentences in the format of the nltk's <corpus>.tagged_sents() command (so not a corpus reader object), you can retag everything like this:
newcorpus = []
for sent in corpus:
    newcorpus.append([(w, tagmap[t]) for w, t in sent])
Not sure if this is what you are looking for, but you can tag with a universal tagset:
from pprint import pprint
from collections import defaultdict
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

s = "I go. He goes. This bag is brown. These bags are brown."

d = defaultdict(list)
for sent in sent_tokenize(s):
    text = word_tokenize(sent)
    for value, tag in pos_tag(text, tagset='universal'):
        d[tag].append(value)

pprint(dict(d))
Prints:
{'.': ['.', '.', '.', '.'],
'ADJ': ['brown'],
'DET': ['This', 'These'],
'NOUN': ['bag', 'bags'],
'PRON': ['I', 'He'],
'VERB': ['go', 'goes', 'is', 'brown', 'are']}
Note how bag and bags fall into the NOUN category and go and goes fall into VERB.

How compare wordnet Synsets with another word?

I need to check whether one word appears in the synsets of another word.
For example: cat and dog.
First I need to find the synsets of cat with this code:
list= wn.synsets('cat')
Then the list of synsets is returned:
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]
So now I need to check whether dog is in this list. How can I do that with NLTK in Python?
from nltk.corpus import wordnet as wn

for s in wn.synsets('cat'):
    lemmas = s.lemmas()
    for l in lemmas:
        if l.name() == 'dog':
            print(l.synset())
Notice that this code searches for a joint synset of two words that are considered synonyms (so nothing will be found with your 'cat' and 'dog' example). However, there are also other relations in WordNet. For instance, you can search for a 'cat' synset that contains 'dog' as an antonym.
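A sketch of that antonym check using lemma.antonyms() (has_antonym is a made-up helper name; 'good'/'bad' is just a pair that WordNet does mark as antonyms):
from nltk.corpus import wordnet as wn

def has_antonym(word1, word2):
    # True if some lemma of word1 lists word2 as an antonym.
    for s in wn.synsets(word1):
        for l in s.lemmas():
            for ant in l.antonyms():
                if ant.name() == word2:
                    return True
    return False

print(has_antonym('good', 'bad'))   # True in WordNet
print(has_antonym('cat', 'dog'))    # False: no antonym relation between them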
