I need to check if some word its sysnset of another words..
for example :
cat and dog ..
first i need to find synsets of cat by this code:
list= wn.synsets('cat')
then the list of synsets are returned:
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')
So, now I need to check if dog in this list ???
How can I do it by nltk Python code?
from nltk.corpus import wordnet as wn
for s in wn.synsets('cat'):
lemmas = s.lemmas()
for l in lemmas:
if l.name() == 'dog':
print l.synset()
Notice that this code searches a joint synset of two words which are considered to be synonyms (so nothing will be found with your 'cat' and 'dog' example). However, there are also other relations in wordnet. For instance, you can search for a 'cat' synset that contains 'dog' as antonym.
Related
I want to input a sentence, and output a sentence with hard words made simpler.
I'm using Nltk to tokenize sentences and tag words, but I'm having trouble using WordNet to find a synonym for the specific meaning of a word that I want.
For example:
Input:
"I refuse to pick up the refuse"
Maybe refuse #1 is the easiest word for rejecting, but the refuse #2 means garbage, and there are simpler words that could go there.
Nltk might be able to tag refuse #2 as a noun, but then how do I get synonyms for refuse (trash) from WordNet?
Sounds like you want word synonyms based upon the part of speech of the word (i.e. noun, verb, etc.)
Follows creates synonyms for each word in a sentence based upon part of speech.
References:
Extract Word from Synset using Wordnet in NLTK 3.0
Printing the part of speech along with the synonyms of the word
Code
import nltk; nltk.download('popular')
from nltk.corpus import wordnet as wn
def get_synonyms(word, pos):
' Gets word synonyms for part of speech '
for synset in wn.synsets(word, pos=pos_to_wordnet_pos(pos)):
for lemma in synset.lemmas():
yield lemma.name()
def pos_to_wordnet_pos(penntag, returnNone=False):
' Mapping from POS tag word wordnet pos tag '
morphy_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
'VB':wn.VERB, 'RB':wn.ADV}
try:
return morphy_tag[penntag[:2]]
except:
return None if returnNone else ''
Example Usage
# Tokenize text
text = nltk.word_tokenize("I refuse to pick up the refuse")
for word, tag in nltk.pos_tag(text):
print(f'word is {word}, POS is {tag}')
# Filter for unique synonyms not equal to word and sort.
unique = sorted(set(synonym for synonym in get_synonyms(word, tag) if synonym != word))
for synonym in unique:
print('\t', synonym)
Output
Note the different sets of synonyms for refuse based upon POS.
word is I, POS is PRP
word is refuse, POS is VBP
decline
defy
deny
pass_up
reject
resist
turn_away
turn_down
word is to, POS is TO
word is pick, POS is VB
beak
blame
break_up
clean
cull
find_fault
foot
nibble
peck
piece
pluck
plunk
word is up, POS is RP
word is the, POS is DT
word is refuse, POS is NN
food_waste
garbage
scraps
How to get similar words using wordnet, and not only the synonyms using synsets and their lemmas?
For example, if you search for "happy" on the wordnet online tool (http://wordnetweb.princeton.edu/). For the first synset there is only one synonym (happy) but if you click on it (on the S: link) you get additional words in "see also" and "similar to" words, like "cheerful".
How do I get these words and what are they called in wordnet terminology? I am using python with nltk and can only get the synsets and lemmas at best (excluding the hypernyms etc.)
"also_sees()" and "similar_tos()".
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets("happy")[0].also_sees()
[Synset('cheerful.a.01'), Synset('contented.a.01'), Synset('elated.a.01'), Synset('euphoric.a.01'), Synset('felicitous.a.01'), Synset('glad.a.01'), Synset('joyful.a.01'), Synset('joyous.a.01')]
>>> wn.synsets("happy")[0].similar_tos()
[Synset('blessed.s.06'), Synset('blissful.s.01'), Synset('bright.s.09'), Synset('golden.s.02'), Synset('laughing.s.01')]
If you want to see the full list of what a WordNet synset can do, try the "dir()" command. (It'll be full of objects you probably don't want, so I stripped out the underscored below.)
>>> [func for func in dir(wn.synsets("happy")[0]) if func[0] != "_"]
['acyclic_tree', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'mst', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'usage_domains', 'verb_groups', 'wup_similarity']
Like the title says, how can I check two POS tags are in the same category?
For example,
go -> VB
goes -> VBZ
These two words are both verbs. Or,
bag -> NN
bags -> NNS
These two are both nouns.
So my question is that whether there exists any function in NLTK to check if two given tags are in the same category?
Let's take the simple case first: Your corpus is tagged with the Brown tagset (that's what it looks like), and you'd be happy with the simple tags defined in the nltk's "universal" tagset: ., ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, VERB, X, where the dot stands for "punctuation". In this case, simply load the nltk's map and use it with your data:
tagmap = nltk.tag.mapping.tagset_mapping("en-brown", "universal")
if tagmap[tag1] == tagmap[tag2]:
print("The two words have the same part of speech")
If that's not your use case, you'll need to manually decide on a mapping from each individual tag to the simplified category you want to assign it to. If you are working with the Brown corpus tagset, you can see the tags and their meanings here, or from within python like this:
print(nltk.help.brown_tagset())
Study your tags and define a dictionary that maps each POS tag to your chosen category; people sometimes find it useful to just group Brown corpus tags by their first two letters, putting together "NN", "NN$", "NNS-HL", etc. You could create this particular mapping automatically like this:
from nltk.corpus import brown
alltags = set(t for w, t in brown.tagged_words())
tagmap = dict(t[:2] for t in alltags)
Then you can customize this map according to your needs; e.g., to put all punctuation tags together in the category ".":
for tag in tagmap:
if not tag.isalpha():
tagmap[tag] = "."
Once your tagmap is to your liking, use it like the one I imported from the nltk.
Finally, you might find it convenient to retag your entire corpus in one go, so that you can simply compare the assigned tags. If corpus is a list of tagged sentences in the format of the nltk's <corpus>.tagged_sents() command (so not a corpus reader object), you can retag everything like this:
newcorpus = []
for sent in corpus:
newcorpus.append( [ (w, tagmap[t]) for w, t in sent ] )
Not sure if this is what you are looking for, but you can tag with a universal tagset:
from pprint import pprint
from collections import defaultdict
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
s = "I go. He goes. This bag is brown. These bags are brown."
d = defaultdict(list)
for sent in sent_tokenize(s):
text = word_tokenize(sent)
for value, tag in pos_tag(text, tagset='universal'):
d[tag].append(value)
pprint(dict(d))
Prints:
{'.': ['.', '.', '.', '.'],
'ADJ': ['brown'],
'DET': ['This', 'These'],
'NOUN': ['bag', 'bags'],
'PRON': ['I', 'He'],
'VERB': ['go', 'goes', 'is', 'brown', 'are']}
Note how bag and bags fall into NOUN category and go and goes fall into VERB.
I am trying to get all the synonyms or similar words using nltk's wordnet but it is not returning.
I am doing:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('swap')
[Synset('barter.n.01'), Synset('trade.v.04'), Synset('swap.v.02')]
I also tried doing (from one of the stackoverflow page):
>>> for ss in wn.synsets('swap'):
for sim in ss.similar_tos():
print(' {}'.format(sim))
But I am not getting all the synonyms. I do not want to add synonyms to the wordnet.
I am expecting it to return exchange,interchange, substitute etc.
How to achieve this?
Thanks
Abhi
To get synonyms using wordnet, simply do this:
>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets('swap'):
for lemma in synset.lemmas():
print lemma.name(),
barter swap swop trade trade swap swop switch swap # note the overlap between the synsets
To obtain some of the words you mentioned, you may have to include hypernyms as well:
>>> for synset in wn.synsets('swap'):
for hypernym in synset.hypernyms():
for ss in hypernym.lemmas(): # now you need to iterate through each synset returned by synset.hypernyms()
print ss.name(),
exchange interchange exchange change interchange travel go move locomote # again, some overlap
I am writing a code for a baseline tagger. Based on the Brown corpus it assigns the most common tag to the word. So if the word "works" is tagged as verb 23 times and as a plural noun 30 times then based on that in the user input sentence it would tagged as plural noun. If the word was not found in the corpus, then it is tagged as a noun by default.
The code I have so far returns every tag for the word not just the most frequent one. How can I achieve it only returning the frequent tag per word?
import nltk
from nltk.corpus import brown
def findtags(userinput, tagged_text):
uinput = userinput.split()
fdist = nltk.FreqDist(tagged_text)
result = []
for item in fdist.items():
for u in uinput:
if u==item[0][0]:
t = (u,item[0][1])
result.append(t)
continue
t = (u, "NN")
result.append(t)
return result
def main():
tags = findtags("the quick brown fox", brown.tagged_words())
print tags
if __name__ == '__main__':
main()
If it's English, there is a default POS tagger in NLTK which a lot of people have been complaining about but it's a nice quick-fix (more like a band-aid than paracetamol), see POS tagging - NLTK thinks noun is adjective:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
If you want to train a baseline tagger from scratch, I recommend you follow an example like this but change the corpus to English one: https://github.com/alvations/spaghetti-tagger
By building a UnigramTagger like in spaghetti-tagger, you should automatically achieve the most common tag for every word.
However, if you want to do it the non machine-learning way, first to count word:POS, What you'll need is some sort of type token ratio. also see Part-of-speech tag without context using nltk:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain
def type_token_ratio(documentstream):
ttr = defaultdict(list)
for token, pos in list(chain(*documentstream)):
ttr[token].append(pos)
return ttr
def most_freq_tag(ttr, word):
return Counter(ttr[word]).most_common()[0][0]
sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]
# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])
# Best tag for the word.
print Counter(documents_ttr['quick']).most_common()[0]
# Best tags for a sentence
print [most_freq_tag(documents_ttr, i) for i in sent1.split()]
NOTE: A document stream can be defined as a list of sentences where each sentence contains a list of tokens with/out tags.
Create a dictionary called word_tags whose key is a word (unannotated) and value is a list of tags in descending frequency (based on your fdist.)
Then:
for u in uinput:
result.append(word_tags[u][0])
You can simply use Counter to find most repeated item in a list:
Python
from collections import Counter
default_tag = Counter(tags).most_common(1)[0][0]
If your question is "how does a unigram-tagger work?" you might be interested to read more NLTK source codes:
http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger
Anyways, I suggest you to read NLTK book chapter 5
specially:
http://nltk.org/book/ch05.html#the-lookup-tagger
Just like the sample in the book you can have a conditional frequency distribution, which returns the best tag for each given word.
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
In this case cfd["fox"].max() will return the most likely tag for "fox" according to brown corpus. Then you can make a dictionary of most likely tags for each word of your sentence:
likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())
Notice that, for new words in your sentence this will return errors. But if you understand the idea you can make your own tagger.