Get best synonym for words in sentences using wordnet - python

I have done code to get synonyms from wordnet, and it is providing complete list of synonym for each word.
So, I want my code to select appropriate synonym from synonym list based on sentence.
For example:
Sentence is: "I am his older brother" and I have to find out best synonym for each word based on this sentence.
Lets select "older". Wordnet would give synonym list for "older":
['elder', 'onetime', 'former', 'sr.', 'one-time', 'erstwhile', 'honest-to-god', 'aged', 'Old', 'previous', 'sure-enough', 'older', 'senior', 'old', 'sometime', 'honest-to-goodness', 'quondam', 'elderly']
From the list best synonym based on this sentence is 'elder', so it should be selected.
How can I do this?
Code to get synonyms:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn
def tag(sentence):
words = word_tokenize(sentence)
words = pos_tag(words)
return words
def paraphraseable(tag):
return tag.startswith('NN') or tag == 'VB' or tag.startswith('JJ')
def pos(tag):
if tag.startswith('NN'):
return wn.NOUN
elif tag.startswith('V'):
return wn.VERB
def synonyms(word, tag):
lemma_lists = [ss.lemmas() for ss in wn.synsets(word, pos(tag))]
lemmas = [lemma.name() for lemma in sum(lemma_lists, [])]
return set(lemmas)
def synonymIfExists(sentence):
for (word, t) in tag(sentence):
if paraphraseable(t):
syns = synonyms(word, t)
if syns:
if len(syns) > 1:
yield [word, list(syns)]
continue
yield [word, []]
def paraphrase(sentence):
return [x for x in synonymIfExists(sentence)]
get=[]
get=paraphrase("I am his older brother")
print("paraphrase",get)

Synonyms in synsets are listed irrespective of their frequency of occurrence in natural language and in a given context.
To explore both of these missing areas more I would go for an bi-gram predictive model and check what words from the synset appear as next to the left context of the utterance you want to substitute it in. Similarly, you could explore right context as well and/or longer contexts.
Another (easier) approach would be to introduce frequency order to the WordNet based on word frequencies from a large enough corpus. Assumption would be that frequency of appearance in the corpus is a correct hint for perceived suitability of a synonym.

Related

How to get Bigram/Trigram of word from prelisted unigram from a document corpus / dataframe column

I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis and words associated with it (and later make a wordcloud and counter of occurrences) to understand topics /context associated with such keywords.
Use case:
df.text_column()
keywordlist = [coca , food, soft, aerated, soda]
lets say one of the rows of the text column has text : ' coca cola is expanding its business in soft drinks and aerated water'.
another entry like : 'lime soda is the best selling item in fast food stores'
my objective is to get Bigram/trigram like:
'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optioanlly load the nltk's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, as well as even extend the nltk's list with additional words. Following, you can use nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca' , 'food', 'soft', 'aerated', 'soda']
def find_matches(n_grams_list):
matches = []
for k in keywordlist:
matching_list = [s for s in n_grams_list if k in s]
[matches.append(m) for m in matching_list if m not in matches]
return matches
all_matching_bigrams = find_matches(bigrams_list) # find all mathcing bigrams
all_matching_trigrams = find_matches(trigrams_list) # find all mathcing trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']

Using WordNet with nltk to find synonyms that make sense

I want to input a sentence, and output a sentence with hard words made simpler.
I'm using Nltk to tokenize sentences and tag words, but I'm having trouble using WordNet to find a synonym for the specific meaning of a word that I want.
For example:
Input:
"I refuse to pick up the refuse"
Maybe refuse #1 is the easiest word for rejecting, but the refuse #2 means garbage, and there are simpler words that could go there.
Nltk might be able to tag refuse #2 as a noun, but then how do I get synonyms for refuse (trash) from WordNet?
Sounds like you want word synonyms based upon the part of speech of the word (i.e. noun, verb, etc.)
Follows creates synonyms for each word in a sentence based upon part of speech.
References:
Extract Word from Synset using Wordnet in NLTK 3.0
Printing the part of speech along with the synonyms of the word
Code
import nltk; nltk.download('popular')
from nltk.corpus import wordnet as wn
def get_synonyms(word, pos):
' Gets word synonyms for part of speech '
for synset in wn.synsets(word, pos=pos_to_wordnet_pos(pos)):
for lemma in synset.lemmas():
yield lemma.name()
def pos_to_wordnet_pos(penntag, returnNone=False):
' Mapping from POS tag word wordnet pos tag '
morphy_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
'VB':wn.VERB, 'RB':wn.ADV}
try:
return morphy_tag[penntag[:2]]
except:
return None if returnNone else ''
Example Usage
# Tokenize text
text = nltk.word_tokenize("I refuse to pick up the refuse")
for word, tag in nltk.pos_tag(text):
print(f'word is {word}, POS is {tag}')
# Filter for unique synonyms not equal to word and sort.
unique = sorted(set(synonym for synonym in get_synonyms(word, tag) if synonym != word))
for synonym in unique:
print('\t', synonym)
Output
Note the different sets of synonyms for refuse based upon POS.
word is I, POS is PRP
word is refuse, POS is VBP
decline
defy
deny
pass_up
reject
resist
turn_away
turn_down
word is to, POS is TO
word is pick, POS is VB
beak
blame
break_up
clean
cull
find_fault
foot
nibble
peck
piece
pluck
plunk
word is up, POS is RP
word is the, POS is DT
word is refuse, POS is NN
food_waste
garbage
scraps

How to find bi-grams which include pre-defined words?

I know it is possible to find bigrams which have a particular word from the example in the link below:
finder = BigramCollocationFinder.from_words(text.split())
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)
bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10) #top-10
>>>
nltk: how to get bigrams containing a specific word
But I am not sure how this can be applied if I need bigrams containing both words pre-defined.
Example:
My Sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: "who are you, man?"
Given a list:["yesterday", "other", "I", "side"]
How can I get a list of bi-grams with the given words. i.e:
[("yesterday", "I"), ("other", "side")]?
What you want is probably a word_filter function that returns False only if all the words in a particular bigram are part of the list
def word_filter(x, y):
if x in lst and y in lst:
return False
return True
where lst = ["yesterday", "I", "other", "side"]
Note that this function is accessing the lst from the outer scope - which is a dangerous thing, so make sure you don't make any changes to lst within the word_filter function
First you can create all possible bigrams for your vocabulary and feed that as the input for a countVectorizer, which can transform your given text into bigram counts.
Then, you filter the generated bigrams based on the counts given by countVectorizer.
Note: I have changed the token pattern to account for even single character. By default, it skips the single characters.
from sklearn.feature_extraction.text import CountVectorizer
import itertools
corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams=["yesterday", "other", "I", "side"]
bi_grams=[' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]
vectorizer = CountVectorizer(vocabulary=bi_grams,ngram_range=(2,2),token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
print([word for count,word in zip(X.sum(0).tolist()[0],vectorizer.get_feature_names()) if count])
output:
['yesterday i', 'other side']
This approach would be a better approach when you have more number of documents and less number of words in the vocabulary. If its other way around, you can find all the bigrams in the document first and then filter it using your vocabulary.

Python and NLTK: Baseline tagger

I am writing a code for a baseline tagger. Based on the Brown corpus it assigns the most common tag to the word. So if the word "works" is tagged as verb 23 times and as a plural noun 30 times then based on that in the user input sentence it would tagged as plural noun. If the word was not found in the corpus, then it is tagged as a noun by default.
The code I have so far returns every tag for the word not just the most frequent one. How can I achieve it only returning the frequent tag per word?
import nltk
from nltk.corpus import brown
def findtags(userinput, tagged_text):
uinput = userinput.split()
fdist = nltk.FreqDist(tagged_text)
result = []
for item in fdist.items():
for u in uinput:
if u==item[0][0]:
t = (u,item[0][1])
result.append(t)
continue
t = (u, "NN")
result.append(t)
return result
def main():
tags = findtags("the quick brown fox", brown.tagged_words())
print tags
if __name__ == '__main__':
main()
If it's English, there is a default POS tagger in NLTK which a lot of people have been complaining about but it's a nice quick-fix (more like a band-aid than paracetamol), see POS tagging - NLTK thinks noun is adjective:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
If you want to train a baseline tagger from scratch, I recommend you follow an example like this but change the corpus to English one: https://github.com/alvations/spaghetti-tagger
By building a UnigramTagger like in spaghetti-tagger, you should automatically achieve the most common tag for every word.
However, if you want to do it the non machine-learning way, first to count word:POS, What you'll need is some sort of type token ratio. also see Part-of-speech tag without context using nltk:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain
def type_token_ratio(documentstream):
ttr = defaultdict(list)
for token, pos in list(chain(*documentstream)):
ttr[token].append(pos)
return ttr
def most_freq_tag(ttr, word):
return Counter(ttr[word]).most_common()[0][0]
sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]
# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])
# Best tag for the word.
print Counter(documents_ttr['quick']).most_common()[0]
# Best tags for a sentence
print [most_freq_tag(documents_ttr, i) for i in sent1.split()]
NOTE: A document stream can be defined as a list of sentences where each sentence contains a list of tokens with/out tags.
Create a dictionary called word_tags whose key is a word (unannotated) and value is a list of tags in descending frequency (based on your fdist.)
Then:
for u in uinput:
result.append(word_tags[u][0])
You can simply use Counter to find most repeated item in a list:
Python
from collections import Counter
default_tag = Counter(tags).most_common(1)[0][0]
If your question is "how does a unigram-tagger work?" you might be interested to read more NLTK source codes:
http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger
Anyways, I suggest you to read NLTK book chapter 5
specially:
http://nltk.org/book/ch05.html#the-lookup-tagger
Just like the sample in the book you can have a conditional frequency distribution, which returns the best tag for each given word.
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
In this case cfd["fox"].max() will return the most likely tag for "fox" according to brown corpus. Then you can make a dictionary of most likely tags for each word of your sentence:
likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())
Notice that, for new words in your sentence this will return errors. But if you understand the idea you can make your own tagger.

recursive extract synonym for a new word from NLTK

Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is out of my knowledge and I am trying to figure out its sentiment via finding out its synonyms via NLTK function, if the synonyms fall out my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from last time
The start input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
print synset
print synset.lemmas
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly new words include 'advanced','forward-looking','modern','innovational','groundbreaking' are the new words and not in my dictionary, so now I should use these words as start to call synsets function again until no new lemma word appearing.
Anyone can give me a demo code how to extract these lemma words from Synset and keep them in a set strcutre?
It involves dealing with re module in Python I think but I am quite new to Python. Another point I need to address is that I need to get adjective only, so only 's' and 'a' symbol in the Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would try to calculate the similarity score for a new word with any dictionary word, I need to define the measure. This problem is difficult since adj words are not arranged in hierarchy way and no available measure according to my knowledge. Anyone can advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers an easier access to WordNet).
def get_remote_synonyms(s, pos):
if pos == 'a':
syns = en.adjective.senses(s)
if syns:
allsyns = sum(syns, [])
# if there are multiple senses, take only the most frequent two
if len(syns) >= 2:
syns = syns[0] + syns[1]
else:
syns = syns[0]
else:
return []
remote = []
for syn in syns:
newsyns = en.adjective.senses(syn)
remote.extend([r for r in newsyns[0] if r not in allsyns])
return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.

Categories

Resources