How to get synonyms from nltk WordNet in Python

WordNet is great, but I'm having a hard time getting synonyms in nltk. If you search "similar to" for the word 'small' on the WordNet web interface, it shows all of the synonyms.
Basically I just need to know the following:
wn.synsets('word')[i].option(), where option can be hypernyms or antonyms, but what is the option for getting synonyms?

If you want the synonyms in the synset (aka the lemmas that make up the set), you can get them with lemma_names():
>>> for ss in wn.synsets('small'):
...     print(ss.name(), ss.lemma_names())
small.n.01 ['small']
small.n.02 ['small']
small.a.01 ['small', 'little']
minor.s.10 ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']
little.s.03 ['little', 'small']
small.s.04 ['small']
humble.s.01 ['humble', 'low', 'lowly', 'modest', 'small']
...
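If you need the Lemma objects rather than plain strings (they also expose things like antonyms()), each synset has a lemmas() method whose entries have a name():
>>> wn.synsets('small')[2].lemmas()
[Lemma('small.a.01.small'), Lemma('small.a.01.little')]
>>> [l.name() for l in wn.synsets('small')[2].lemmas()]
['small', 'little']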

You can use wordnet.synsets() and lemma_names() in order to get all the synonyms.
Example:
from itertools import chain
from nltk.corpus import wordnet

synonyms = wordnet.synsets(word)
lemmas = set(chain.from_iterable([s.lemma_names() for s in synonyms]))
Demo:
>>> synonyms = wordnet.synsets('change')
>>> set(chain.from_iterable([word.lemma_names() for word in synonyms]))
set([u'interchange', u'convert', u'variety', u'vary', u'exchange', u'modify', u'alteration', u'switch', u'commute', u'shift', u'modification', u'deepen', u'transfer', u'alter', u'change'])

You might be interested in a Synset:
>>> wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
That's the same list of top-level entries that the web interface gave you.
If you also want the "similar to" list, that's not the same thing as the synonyms. For that, you call similar_tos() on each Synset.
So, to show the same information as the website, start with something like this:
for ss in wn.synsets('small'):
    print(ss)
    for sim in ss.similar_tos():
        print('    {}'.format(sim))
Of course the website also prints the part of speech (sim.pos()), the list of lemmas (sim.lemma_names()), the definition (sim.definition()), and the examples (sim.examples()) for each synset at both levels. It also groups them by part of speech, adds links to other things you can follow, and so forth. But that should be enough to get you started.
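For instance, here is a minimal sketch (mine, not from the answer) that prints those same fields with NLTK 3's method calls:
from nltk.corpus import wordnet as wn

def describe(ss, indent=''):
    # part of speech, lemma names and definition for one synset
    print('{}{} ({}) {}: {}'.format(indent, ss.name(), ss.pos(),
                                    ss.lemma_names(), ss.definition()))
    for example in ss.examples():
        print('{}  e.g. "{}"'.format(indent, example))

for ss in wn.synsets('small'):
    describe(ss)
    for sim in ss.similar_tos():
        describe(sim, indent='    ')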

Simplest program to print the synonyms of a given word:
from nltk.corpus import wordnet

for syn in wordnet.synsets("good"):
    for name in syn.lemma_names():
        print(name)

Here are some helper functions to make NLTK easier to use, and two examples of how those functions can be used.
import nltk
from nltk.corpus import wordnet

def download_nltk_dependencies_if_needed():
    try:
        nltk.word_tokenize('foobar')
    except LookupError:
        nltk.download('punkt')
    try:
        nltk.pos_tag(nltk.word_tokenize('foobar'))
    except LookupError:
        nltk.download('averaged_perceptron_tagger')

def get_some_word_synonyms(word):
    word = word.lower()
    synonyms = []
    synsets = wordnet.synsets(word)
    if len(synsets) == 0:
        return []
    synset = synsets[0]
    lemma_names = synset.lemma_names()
    for lemma_name in lemma_names:
        lemma_name = lemma_name.lower().replace('_', ' ')
        if lemma_name != word and lemma_name not in synonyms:
            synonyms.append(lemma_name)
    return synonyms

def get_all_word_synonyms(word):
    word = word.lower()
    synonyms = []
    synsets = wordnet.synsets(word)
    if len(synsets) == 0:
        return []
    for synset in synsets:
        lemma_names = synset.lemma_names()
        for lemma_name in lemma_names:
            lemma_name = lemma_name.lower().replace('_', ' ')
            if lemma_name != word and lemma_name not in synonyms:
                synonyms.append(lemma_name)
    return synonyms
Example 1: get_some_word_synonyms
This approach looks only at the word's first (most common) synset, so it tends to return the most relevant synonyms, but some words like "angry" won't return any synonyms because that first synset's only lemma is the word itself.
download_nltk_dependencies_if_needed()

words = ['dog', 'fire', 'erupted', 'throw', 'sweet', 'center', 'said', 'angry', 'iPhone', 'ThisIsNotARealWorddd', 'awesome', 'amazing', 'jim dandy', 'change']
for word in words:
    print('Synonyms for {}:'.format(word))
    synonyms = get_some_word_synonyms(word)
    for synonym in synonyms:
        print("    {}".format(synonym))
Example 1 output:
Synonyms for dog:
domestic dog
canis familiaris
Synonyms for fire:
Synonyms for erupted:
erupt
break out
Synonyms for throw:
Synonyms for sweet:
henry sweet
Synonyms for center:
centre
middle
heart
eye
Synonyms for said:
state
say
tell
Synonyms for angry:
Synonyms for iPhone:
Synonyms for ThisIsNotARealWorddd:
Synonyms for awesome:
amazing
awe-inspiring
awful
awing
Synonyms for amazing:
amaze
astonish
astound
Synonyms for jim dandy:
Synonyms for change:
alteration
modification
Example 2: get_all_word_synonyms
This approach will return all possible synonyms from every synset, but some may not be very relevant.
download_nltk_dependencies_if_needed()

words = ['dog', 'fire', 'erupted', 'throw', 'sweet', 'center', 'said', 'angry', 'iPhone', 'ThisIsNotARealWorddd', 'awesome', 'amazing', 'jim dandy', 'change']
for word in words:
    print('Synonyms for {}:'.format(word))
    synonyms = get_all_word_synonyms(word)
    for synonym in synonyms:
        print("    {}".format(synonym))
Example 2 output:
Synonyms for dog:
domestic dog
canis familiaris
frump
cad
bounder
blackguard
hound
heel
frank
frankfurter
hotdog
hot dog
wiener
wienerwurst
weenie
pawl
detent
click
andiron
firedog
dog-iron
chase
chase after
trail
tail
tag
give chase
go after
track
Synonyms for fire:
firing
flame
flaming
ardor
ardour
fervor
fervour
fervency
fervidness
attack
flak
flack
blast
open fire
discharge
displace
give notice
can
dismiss
give the axe
send away
sack
force out
give the sack
terminate
go off
arouse
elicit
enkindle
kindle
evoke
raise
provoke
burn
burn down
fuel
Synonyms for erupted:
erupt
break out
irrupt
flare up
flare
break open
burst out
ignite
catch fire
take fire
combust
conflagrate
come out
break through
push through
belch
extravasate
break
burst
recrudesce
Synonyms for throw:
stroke
cam stroke
shed
cast
cast off
shake off
throw off
throw away
drop
thrust
give
flip
switch
project
contrive
bewilder
bemuse
discombobulate
hurl
hold
have
make
confuse
fox
befuddle
fuddle
bedevil
confound
Synonyms for sweet:
henry sweet
dessert
afters
confection
sweetness
sugariness
angelic
angelical
cherubic
seraphic
dulcet
honeyed
mellifluous
mellisonant
gratifying
odoriferous
odorous
perfumed
scented
sweet-scented
sweet-smelling
fresh
unfermented
sugared
sweetened
sweet-flavored
sweetly
Synonyms for center:
centre
middle
heart
eye
center field
centerfield
midpoint
kernel
substance
core
essence
gist
heart and soul
inwardness
marrow
meat
nub
pith
sum
nitty-gritty
center of attention
centre of attention
nerve center
nerve centre
snapper
plaza
mall
shopping mall
shopping center
shopping centre
focus on
center on
revolve around
revolve about
concentrate on
concentrate
focus
pore
rivet
halfway
midway
Synonyms for said:
state
say
tell
allege
aver
suppose
read
order
enjoin
pronounce
articulate
enounce
sound out
enunciate
aforesaid
aforementioned
Synonyms for angry:
furious
raging
tempestuous
wild
Synonyms for iPhone:
Synonyms for ThisIsNotARealWorddd:
Synonyms for awesome:
amazing
awe-inspiring
awful
awing
Synonyms for amazing:
amaze
astonish
astound
perplex
vex
stick
get
puzzle
mystify
baffle
beat
pose
bewilder
flummox
stupefy
nonplus
gravel
dumbfound
astonishing
awe-inspiring
awesome
awful
awing
Synonyms for jim dandy:
Synonyms for change:
alteration
modification
variety
alter
modify
vary
switch
shift
exchange
commute
convert
interchange
transfer
deepen

This worked for me
wordnet.synsets('change')[0].hypernyms()[0].lemma_names()
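Note that this returns the lemma names of the first hypernym (a more general term), not strict synonyms, and it raises IndexError when a word has no synsets or when the first synset has no hypernyms. A guarded sketch of the same idea:
from nltk.corpus import wordnet

def first_hypernym_lemmas(word):
    # lemma names of the first hypernym of the first synset, or [] if either is missing
    synsets = wordnet.synsets(word)
    if not synsets or not synsets[0].hypernyms():
        return []
    return synsets[0].hypernyms()[0].lemma_names()

print(first_hypernym_lemmas('change'))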

I coded a thesaurus lookup for synonyms recently; I used this function:
from nltk.corpus import wordnet

def find_synonyms(keyword):
    synonyms = []
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            synonyms.append(lemma.name())
    return str(synonyms)
But if you prefer to host your own dictionary, you might be interested in my project for offline synonym dictionary lookup on my GitHub page:
https://github.com/syauqiex/offline_english_synonym_dictionary

Perhaps these are not synonyms in the proper WordNet terminology, but I also want my function to return all similar words, like 'weeny', 'flyspeck', etc. You can see them for the word 'small' in the link the author posted. I used this code:
from nltk.corpus import wordnet as wn

def get_all_synonyms(word):
    synonyms = []
    for ss in wn.synsets(word):
        synonyms.extend(ss.lemma_names())
        for sim in ss.similar_tos():
            synonyms_batch = sim.lemma_names()
            synonyms.extend(synonyms_batch)
    synonyms = set(synonyms)
    if word in synonyms:
        synonyms.remove(word)
    synonyms = [synonym.replace('_', ' ') for synonym in synonyms]
    return synonyms

get_all_synonyms('small')

Related

Extracting sentences containing a keyword using set()

I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that have the word 'van'. I can't get sentences with the words 'blue tinge' or 'off the road' because the code below can only handle single keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk

nltk.download('punkt')

search_words = set(["off the road", "blue tinge", "van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")
matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)
Output: ["Oh, I'd use the money for a van."]
set.intersection() compares individual word tokens, so a multi-word phrase like "blue tinge" can never match a single token in sentence.words. If you want to check for an exact match of the search keywords, this can be accomplished with a simple substring test per sentence:
from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)
The output is:
['That is the off the road vehicle I had in mind for my adventure.',
"Oh, I'd use the money for a van.",
'The one with the blue tinge.']
If you want partial matching then use fuzzywuzzy for percentage matching.
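A rough sketch (mine, not from the answer) of that partial-matching idea with fuzzywuzzy; the threshold of 80 is an arbitrary example value:
from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind. The one with a blueish tinge."
search_words = ["off the road", "blue tinge", "van"]
matches = []
for sentence in sent_tokenize(text):
    for phrase in search_words:
        # partial_ratio scores the best-matching substring on a 0-100 scale
        if fuzz.partial_ratio(phrase, sentence) >= 80:
            matches.append(sentence)
            break
print(matches)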

Using WordNet with nltk to find synonyms that make sense

I want to input a sentence, and output a sentence with hard words made simpler.
I'm using NLTK to tokenize sentences and tag words, but I'm having trouble using WordNet to find a synonym for the specific meaning of a word that I want.
For example:
Input:
"I refuse to pick up the refuse"
Maybe refuse #1 is already the simplest word for rejecting, but refuse #2 means garbage, and there are simpler words that could go there.
NLTK might be able to tag refuse #2 as a noun, but then how do I get synonyms for refuse (trash) from WordNet?
Sounds like you want word synonyms based upon the part of speech of the word (i.e. noun, verb, etc.).
The following creates synonyms for each word in a sentence based upon its part of speech.
References:
Extract Word from Synset using Wordnet in NLTK 3.0
Printing the part of speech along with the synonyms of the word
Code
import nltk; nltk.download('popular')
from nltk.corpus import wordnet as wn

def get_synonyms(word, pos):
    ' Gets word synonyms for part of speech '
    for synset in wn.synsets(word, pos=pos_to_wordnet_pos(pos)):
        for lemma in synset.lemmas():
            yield lemma.name()

def pos_to_wordnet_pos(penntag, returnNone=False):
    ' Mapping from Penn Treebank POS tag to wordnet POS tag '
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return None if returnNone else ''
Example Usage
# Tokenize text
text = nltk.word_tokenize("I refuse to pick up the refuse")

for word, tag in nltk.pos_tag(text):
    print(f'word is {word}, POS is {tag}')
    # Filter for unique synonyms not equal to word and sort.
    unique = sorted(set(synonym for synonym in get_synonyms(word, tag) if synonym != word))
    for synonym in unique:
        print('\t', synonym)
Output
Note the different sets of synonyms for refuse based upon POS.
word is I, POS is PRP
word is refuse, POS is VBP
decline
defy
deny
pass_up
reject
resist
turn_away
turn_down
word is to, POS is TO
word is pick, POS is VB
beak
blame
break_up
clean
cull
find_fault
foot
nibble
peck
piece
pluck
plunk
word is up, POS is RP
word is the, POS is DT
word is refuse, POS is NN
food_waste
garbage
scraps

Modifying corpus by inserting codewords using Python

I have a corpus of about 30,000 customer reviews in a csv file (or a txt file); each customer review is one line in the file. Some examples are:
This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad
smell from the ice maker
The food was awesome, but the water was very rude
I want to change these texts to the following:
This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE and the price is very reasonable
POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE
I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:
amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind
And, a text file contains such negative words as:
rude
poor
worst
dirty
slow
bad
So, I want a Python script that reads the customer reviews: when any of the positive words is found, insert "POSITIVE" after the positive word; when any of the negative words is found, insert "NEGATIVE" after the negative word.
Here is the code I have tested so far. This works (see my comments in the codes below), but it needs improvement to meet my needs described above.
Specifically, my_escaper works (it finds words such as "cheap" and "good" and replaces them with "cheap POSITIVE" and "good POSITIVE"), but the problem is that I have two files (lexicons), each containing about a thousand positive/negative words. So what I want is for the code to read those word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE").
#adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# my_escaper works for this hard-coded list of words, but I need the
# replacement pairs to be built from the two lexicon files instead.
my_escaper = multiple_replacer(('cheap', 'cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt", "r") as file:
    for line in file:
        review = line.strip()
        d.append(review)

for line in d:
    print my_escaper(line)
A straightforward way to code this would be to load your positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look-up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join to build the final string.
Example:
import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
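If keeping punctuation matters, one alternative (a sketch of my own, not part of the answer) is to substitute on word boundaries instead of splitting:
import re

positive_words = ['amazing', 'great', 'awesome', 'reasonable']
negative_words = ['poor', 'bad', 'rude']

# one alternation per lexicon; \b keeps the surrounding punctuation intact
pos_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, positive_words)) + r')\b', re.IGNORECASE)
neg_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, negative_words)) + r')\b', re.IGNORECASE)

def tag_sentiment(sentence):
    sentence = pos_pattern.sub(r'\1 POSITIVE', sentence)
    return neg_pattern.sub(r'\1 NEGATIVE', sentence)

print(tag_sentiment("This bike is amazing, but the brake is very poor"))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE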
If I understood correctly, you need something like:
if word in POSITIVE_LIST:
    pattern.sub(replacement_function, word + " POSITIVE")
if word in NEGATIVE_LIST:
    pattern.sub(replacement_function, word + " NEGATIVE")
Is it OK with you?
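For what it's worth, here is a rough sketch (mine, building on the OP's multiple_replacer defined above; the lexicon file names are assumptions) of generating the replacement pairs from the two word-list files:
def load_lexicon(path):
    # one word or phrase per line, blank lines ignored
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

positive = load_lexicon("positive_words.txt")   # hypothetical file name
negative = load_lexicon("negative_words.txt")   # hypothetical file name

key_values = ([(w, w + " POSITIVE") for w in positive] +
              [(w, w + " NEGATIVE") for w in negative])
my_escaper = multiple_replacer(*key_values)

with open("review.txt") as f:
    for line in f:
        print(my_escaper(line.strip()))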

RegEx: How to find all instance of a collocation?

I am trying to write a script in python to find word collocations in a text. A word collocation is a pair of words that co-occur often in various texts. For example in the collocation "lemon zest", the words lemon and zest co-occur often and thus it is a collocation. Now I want to use re.findall to find all occurences of a given collocation. Unlike "lemon zest", there are some collocations that wouldn't be next to each other in texts. For example, in the phrase "kind of funny", because "of" is stop word, it would be already removed. So given the collocation "kind funny", a program has to return "kind of funny" as output.
Can anybody tell me how to do this? I should mention that I require a scalable approach, as I am dealing with gigabytes of text.
Edit1:
inputCollocation = "kind funny"
Document1 = "This film is kind of funny"
Document2 = "It is kind of funny"
Document3 = "That film is funny"
ExpectedOutput: Document1, Document2
Thank you in advance.
You can just use string comparison:
inputCollocation = "kind funny"
documents = dict(
Document1 = "This film kind funny",
Document2 = "It kind funny",
Document3 = "That film funny",
)
def remove_stopwords(text):
...
matching = [
document for (document, text) in documents.iteritems()
if inputCollocation in remove_stopwords(text.lower())
]
print 'ExpectedOutput:', ', '.join(matching)
You could also consider using NLTK which has tools for finding collocations.
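For instance, a minimal sketch (my own, not from the answer) using NLTK's bigram collocation finder; window_size=3 lets a pair like "kind ... funny" count even when a stop word such as "of" sits between them:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("This film is kind of funny. It is kind of funny. That film is funny. "
        "Add some lemon zest. More lemon zest.")
tokens = nltk.word_tokenize(text.lower())

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)   # keep pairs seen at least twice
print(finder.nbest(bigram_measures.pmi, 5))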

NER naive algorithm

I never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given one name of a character in the story, the words that are used like it should be character names as well (bogus, I know; that is what should not work, but since I never dealt with NLP until this morning, I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper case words, and it's still pretty spot on as far as main characters goes.
It does not work that well with other characters and in other stories, though gives interesting results.
Any idea if this idea is usable, extendable or why does it work at all in this story for "Alice" ?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper

def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search(r"\w+", word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])

    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper), upper

    exclude = ["ME", "Yes", "English", "Which", "When", "WOULD", "ONE", "THAT", "That", "Here", "and", "And", "it", "It", "me"]
    exclude = [x for x in exclude if dict.has_key(x)]
    for s in exclude:
        del dict[s]

    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2: continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)

    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
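As a point of comparison, a minimal sketch (mine, not the answerer's) of NLTK's own chunker, which does attempt extent detection and PERSON/ORGANIZATION labels, with mixed quality:
import nltk
# may require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#              nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Microsoft CEO Steve Ballmer spoke on Tuesday."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree:
    # named-entity chunks are Tree nodes; plain tokens are (word, tag) tuples
    if hasattr(subtree, 'label'):
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))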
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
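A tiny sketch of the Jaccard coefficient over two context sets, which is roughly what Step 3 computes with its overlap count (the example context words are made up):
def jaccard(context_a, context_b):
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard(['said', 'went', 'thought'], ['said', 'cried', 'went']))  # 2/4 = 0.5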
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations. Nevertheless, it is a great first attempt at NLP, keep it up!
