I would like to create a set of alternative words for a word. The alternative word has to be suitably different so that replacing 'dog' with 'dalmatian' is too similar- I would want to replace 'dog' with 'cat'. Although not infallible, I think I can do this by getting the hypernym for a word and ten that hypernym's hypernym (Ie that grandparent synset) and finally getting all the grandchildren words for that grandparent.
Hopefully this makes sense. In pseudocode it should read
for each i as hypernym (synset)
for each j as i.hypernym
get all the holonyms for j as s
for each s get all the holonyms as x
print x
Is this doable?
from itertools import chain
from collections import defaultdict
from nltk.corpus import wordnet as wn
gflemma_holonym = defaultdict(set)
for ss in wn.all_synsets():
if ss.part_holonyms() and ss.hypernyms() and ss.hypernyms()[0].hypernyms():
grandfather = ss.hypernyms()[0].hypernyms()[0] # grandfather concept.
holonyms = list(chain(*[i.lemma_names() for i in ss.part_holonyms()]))
for lemma in grandfather.lemma_names():
gflemma_holonym[lemma].update(holonyms)
print gflemma_holonym[u'edible_nut']
print
print gflemma_holonym[u'geographical_area']
[out]:
set([u'black_hickory', u'black_walnut', u'Juglans_nigra', u'black_walnut_tree'])
set([u'battlefield', u'fair', u'infield', u'field_of_honor', u'field_of_battle', u'battleground', u'city', u'bowl', u'field', u'stadium', u'funfair', u'outfield', u'diamond', u'urban_area', u'populated_area', u'desert', u'arena', u'carnival', u'baseball_diamond', u'sports_stadium', u'ball_field', u'baseball_field'])
Please note that wordnet inventory is limited. Especially when you are looking for relations of concepts/lemma that is so far apart (i.e. from synset's grandfather to synset's holonym)
You can use ether lists or dictionary to do this ( dictionary is more pythonic ).
for exemple with dictionnary you have something like this :
dictionnary={"dog": {"dalmatian","stuff"}, "singer": {"rihanna","eminem"}, "country": {"United states","England"}}
print(dictionnary['dog'])
Related
I've been testing different python lemmatizers for a solution I'm building out. One difficult problem I've faced is that stemmers are producing non english words which won't work for my use case. Although stemmers get "politics" and "political" to the same stem correctly, I'd like to do this with a lemmatizer, but spacy and nltk are producing different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))
returning:
political = political
politics = politics
Where I want to return:
political = politics
politics = politics
Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exist in the dictionary and for English in NLTK WordNetLemmatizer the dictionary is WordNet and as long as the dictionary entry is in WordNet it is a lemma, there are entries for "political" and "politics", so they're valid lemma:
from itertools import chain
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
[out]:
{'political'}
{'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
Maybe there are other tools out there that can do that, but I'll try this as a first.
First, stem all lemma names and group the lemmas with the same stem:
from collections import defaultdict
from wn import WordNet
from nltk.stem import PorterStemmer
porter = PorterStemmer()
wn = WordNet()
x = defaultdict(set)
i = 0
for lemma_name in wn.all_lemma_names():
if lemma_name:
x[porter.stem(lemma_name)].add(lemma_name)
i += 1
Note: pip install -U wn
Then as a sanity check, we check that the no. of lemmas > no. of groups:
print(len(x.keys()), i)
[out]:
(128442, 147306)
Then we can take a look at the groupings:
for k in sorted(x):
if len(x[k]) > 1:
print(k, x[k])
It seems to do what we need to group the words together with their "root word", e.g.
poke {'poke', 'poking'}
polar {'polarize', 'polarity', 'polarization', 'polar'}
polaris {'polarisation', 'polarise'}
pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
poleax {'poleaxe', 'poleax'}
polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
police_st {'police_state', 'police_station'}
polish {'polished', 'polisher', 'polish', 'polishing'}
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
poll {'poll', 'polls'}
But if we look closer there is some confusion:
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
So I would suggest the next step is
to loop through the groupings again and run some semantics and check the "relatedness" of the words and split the words that might not be related, maybe try something like Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (might not be a trivial task)
Or do some manual work and reorder the groupings. (The heavy lifting of the work is already done by the porter stemmer in the grouping, now it's time to do some human work)
Then you'll have to somehow find the root among each group of words (i.e. prototype/label for the cluster).
Finally using the resource of groups of words you've created, you can not "find the root word.
Using ngram in Python my aim is to find out verbs and their corresponding adverbs from an input text.
What I have done:
Input text:""He is talking weirdly. A horse can run fast. A big tree is there. The sun is beautiful. The place is well decorated.They are talking weirdly. She runs fast. She is talking greatly.Jack runs slow.""
Code:-
`finder2 = BigramCollocationFinder.from_words(wrd for (wrd,tags) in posTagged if tags in('VBG','RB','VBN',))
scored = finder2.score_ngrams(bigram_measures.raw_freq)
print sorted(finder2.nbest(bigram_measures.raw_freq, 5))`
From my code, I got the output:
[('talking', 'greatly'), ('talking', 'weirdly'), ('weirdly', 'talking'),('runs','fast'),('runs','slow')]
which is the list of verbs and their corresponding adverbs.
What I am looking for:
I want to figure out verb and all corresponding adverbs from this. For example ('talking'- 'greatly','weirdly),('runs'-'fast','slow')etc.
You already have a list of all verb-adverb bigrams, so you're just asking how to consolidate them into a dictionary that gives all adverbs for each verb. But first let's re-create your bigrams in a more direct way:
pairs = list()
for (w1, tag1), (w2, tag2) in nltk.bigrams(posTagged):
if t1.startswith("VB") and t2 == "RB":
pairs.append((w1, w2))
Now for your question: We'll build a dictionary with the adverbs that follow each verb. I'll store the adverbs in a set, not a list, to get rid of duplications.
from collections import defaultdict
consolidated = defaultdict(set)
for verb, adverb in pairs:
consolidated[verb].add(adverb)
The defaultdict provides an empty set for verbs that haven't been seen before, so we don't need to check by hand.
Depending on the details of your assignment, you might also want to case-fold and lemmatize your verbs so that the adverbs from "Driving recklessly" and "I drove carefully" are recorded together:
wnl = nltk.stem.WordNetLemmatizer()
...
for verb, adverb in pairs:
verb = wnl.lemmatize(verb.lower(), "v")
consolidated[verb].add(adverb)
I think you are losing information you will need for this. You need to retain the part-of-speech data somehow, so that bigrams like ('weirdly', 'talking') can be processed in the correct manner.
It may be that the bigram finder can accept the tagged word tuples (I'm not familiar with nltk). Or, you may have to resort to creating an external index. If so, something like this might work:
part_of_speech = {word:tag for word,tag in posTagged}
best_bigrams = finger2.nbest(... as you like it ...)
verb_first_bigrams = [b if part_of_speech[b[1]] == 'RB' else (b[1],b[0]) for b in best_bigrams]
Then, with the verbs in front, you can transform it into a dictionary or list-of-lists or whatever:
adverbs_for = {}
for verb,adverb in verb_first_bigrams:
if verb not in adverbs_for:
adverbs_for[verb] = [adverb]
else:
adverbs_for[verb].append(adverb)
I am trying to get all the synonyms or similar words using nltk's wordnet but it is not returning.
I am doing:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('swap')
[Synset('barter.n.01'), Synset('trade.v.04'), Synset('swap.v.02')]
I also tried doing (from one of the stackoverflow page):
>>> for ss in wn.synsets('swap'):
for sim in ss.similar_tos():
print(' {}'.format(sim))
But I am not getting all the synonyms. I do not want to add synonyms to the wordnet.
I am expecting it to return exchange,interchange, substitute etc.
How to achieve this?
Thanks
Abhi
To get synonyms using wordnet, simply do this:
>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets('swap'):
for lemma in synset.lemmas():
print lemma.name(),
barter swap swop trade trade swap swop switch swap # note the overlap between the synsets
To obtain some of the words you mentioned, you may have to include hypernyms as well:
>>> for synset in wn.synsets('swap'):
for hypernym in synset.hypernyms():
for ss in hypernym.lemmas(): # now you need to iterate through each synset returned by synset.hypernyms()
print ss.name(),
exchange interchange exchange change interchange travel go move locomote # again, some overlap
Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct it's closest noun form?
That is 'computer', or 'sugar' respectively
Longer version:
I'm using python and NLTK, Wordnet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so I may get an estimate of their similarity (instead of the 'NONE' that normally gets returned with adjectives).
I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?
You might want to look at this example:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'
(from this SO answer) to see if it sends you in the right direction.
First extract all the possible candidates from wordnet synsets.
Then use difflib to compare the strings against your target stem.
>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> candidates = set(chain(*[ss.lemma_names for ss in wn.all_synsets('n') if len([i for i in ss.lemma_names if target in i]) > 0]))
>>> gcm(target,candidates)[0]
A more human readable way to compute the candidates is as such:
candidates = set()
for ss in wn.all_synsets('n'):
for ln in ss.lemma_names: # get all possible lemmas for this synset.
for lemma in ln:
if target in lemma:
candidates.add(target)
Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is out of my knowledge and I am trying to figure out its sentiment via finding out its synonyms via NLTK function, if the synonyms fall out my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from last time
The start input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
print synset
print synset.lemmas
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly new words include 'advanced','forward-looking','modern','innovational','groundbreaking' are the new words and not in my dictionary, so now I should use these words as start to call synsets function again until no new lemma word appearing.
Anyone can give me a demo code how to extract these lemma words from Synset and keep them in a set strcutre?
It involves dealing with re module in Python I think but I am quite new to Python. Another point I need to address is that I need to get adjective only, so only 's' and 'a' symbol in the Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would try to calculate the similarity score for a new word with any dictionary word, I need to define the measure. This problem is difficult since adj words are not arranged in hierarchy way and no available measure according to my knowledge. Anyone can advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers an easier access to WordNet).
def get_remote_synonyms(s, pos):
if pos == 'a':
syns = en.adjective.senses(s)
if syns:
allsyns = sum(syns, [])
# if there are multiple senses, take only the most frequent two
if len(syns) >= 2:
syns = syns[0] + syns[1]
else:
syns = syns[0]
else:
return []
remote = []
for syn in syns:
newsyns = en.adjective.senses(syn)
remote.extend([r for r in newsyns[0] if r not in allsyns])
return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.