Getting the closest noun from a stemmed word - python

Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct it's closest noun form?
That is 'computer', or 'sugar' respectively
Longer version:
I'm using python and NLTK, Wordnet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so I may get an estimate of their similarity (instead of the 'NONE' that normally gets returned with adjectives).
I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?

You might want to look at this example:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'
(from this SO answer) to see if it sends you in the right direction.

First extract all the possible candidates from wordnet synsets.
Then use difflib to compare the strings against your target stem.
>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> candidates = set(chain(*[ss.lemma_names for ss in wn.all_synsets('n') if len([i for i in ss.lemma_names if target in i]) > 0]))
>>> gcm(target,candidates)[0]
A more human readable way to compute the candidates is as such:
candidates = set()
for ss in wn.all_synsets('n'):
for ln in ss.lemma_names: # get all possible lemmas for this synset.
for lemma in ln:
if target in lemma:
candidates.add(target)

Related

python lemmatizer that lemmatize "political" and "politics" to the same word

I've been testing different python lemmatizers for a solution I'm building out. One difficult problem I've faced is that stemmers are producing non english words which won't work for my use case. Although stemmers get "politics" and "political" to the same stem correctly, I'd like to do this with a lemmatizer, but spacy and nltk are producing different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))
returning:
political = political
politics = politics
Where I want to return:
political = politics
politics = politics
Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exist in the dictionary and for English in NLTK WordNetLemmatizer the dictionary is WordNet and as long as the dictionary entry is in WordNet it is a lemma, there are entries for "political" and "politics", so they're valid lemma:
from itertools import chain
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
[out]:
{'political'}
{'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
Maybe there are other tools out there that can do that, but I'll try this as a first.
First, stem all lemma names and group the lemmas with the same stem:
from collections import defaultdict
from wn import WordNet
from nltk.stem import PorterStemmer
porter = PorterStemmer()
wn = WordNet()
x = defaultdict(set)
i = 0
for lemma_name in wn.all_lemma_names():
if lemma_name:
x[porter.stem(lemma_name)].add(lemma_name)
i += 1
Note: pip install -U wn
Then as a sanity check, we check that the no. of lemmas > no. of groups:
print(len(x.keys()), i)
[out]:
(128442, 147306)
Then we can take a look at the groupings:
for k in sorted(x):
if len(x[k]) > 1:
print(k, x[k])
It seems to do what we need to group the words together with their "root word", e.g.
poke {'poke', 'poking'}
polar {'polarize', 'polarity', 'polarization', 'polar'}
polaris {'polarisation', 'polarise'}
pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
poleax {'poleaxe', 'poleax'}
polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
police_st {'police_state', 'police_station'}
polish {'polished', 'polisher', 'polish', 'polishing'}
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
poll {'poll', 'polls'}
But if we look closer there is some confusion:
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
So I would suggest the next step is
to loop through the groupings again and run some semantics and check the "relatedness" of the words and split the words that might not be related, maybe try something like Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (might not be a trivial task)
Or do some manual work and reorder the groupings. (The heavy lifting of the work is already done by the porter stemmer in the grouping, now it's time to do some human work)
Then you'll have to somehow find the root among each group of words (i.e. prototype/label for the cluster).
Finally using the resource of groups of words you've created, you can not "find the root word.

Replacement by synsets in Python pattern packatge

My goal is to create a system that will be able to take any random text, extract sentences, remove punctuations, and then, on the bare sentence (one of them), to randomly replace NN or VB tagged words with their meronym, holonym or synonim as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want but when I try to extract the i.e. hyponym from this djidja variable it proves to be impossible since it is a Synset object, and I can't manipulate it anyhow.
Any idea how to extract a the very word that is reported in hyponyms list (i.e. print(djidja[2]) displays Synset(u'bowler')...so how to extract only 'bowler' from this)?
Recall that a synset is just a list of words marked as synonyms. Given a sunset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernims and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']

Get all grandchildren holonym for a grandparent hypernym in wordnet

I would like to create a set of alternative words for a word. The alternative word has to be suitably different so that replacing 'dog' with 'dalmatian' is too similar- I would want to replace 'dog' with 'cat'. Although not infallible, I think I can do this by getting the hypernym for a word and ten that hypernym's hypernym (Ie that grandparent synset) and finally getting all the grandchildren words for that grandparent.
Hopefully this makes sense. In pseudocode it should read
for each i as hypernym (synset)
for each j as i.hypernym
get all the holonyms for j as s
for each s get all the holonyms as x
print x
Is this doable?
from itertools import chain
from collections import defaultdict
from nltk.corpus import wordnet as wn
gflemma_holonym = defaultdict(set)
for ss in wn.all_synsets():
if ss.part_holonyms() and ss.hypernyms() and ss.hypernyms()[0].hypernyms():
grandfather = ss.hypernyms()[0].hypernyms()[0] # grandfather concept.
holonyms = list(chain(*[i.lemma_names() for i in ss.part_holonyms()]))
for lemma in grandfather.lemma_names():
gflemma_holonym[lemma].update(holonyms)
print gflemma_holonym[u'edible_nut']
print
print gflemma_holonym[u'geographical_area']
[out]:
set([u'black_hickory', u'black_walnut', u'Juglans_nigra', u'black_walnut_tree'])
set([u'battlefield', u'fair', u'infield', u'field_of_honor', u'field_of_battle', u'battleground', u'city', u'bowl', u'field', u'stadium', u'funfair', u'outfield', u'diamond', u'urban_area', u'populated_area', u'desert', u'arena', u'carnival', u'baseball_diamond', u'sports_stadium', u'ball_field', u'baseball_field'])
Please note that wordnet inventory is limited. Especially when you are looking for relations of concepts/lemma that is so far apart (i.e. from synset's grandfather to synset's holonym)
You can use ether lists or dictionary to do this ( dictionary is more pythonic ).
for exemple with dictionnary you have something like this :
dictionnary={"dog": {"dalmatian","stuff"}, "singer": {"rihanna","eminem"}, "country": {"United states","England"}}
print(dictionnary['dog'])

Wordnet Synonyms not returning all values nltk

I am trying to get all the synonyms or similar words using nltk's wordnet but it is not returning.
I am doing:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('swap')
[Synset('barter.n.01'), Synset('trade.v.04'), Synset('swap.v.02')]
I also tried doing (from one of the stackoverflow page):
>>> for ss in wn.synsets('swap'):
for sim in ss.similar_tos():
print(' {}'.format(sim))
But I am not getting all the synonyms. I do not want to add synonyms to the wordnet.
I am expecting it to return exchange,interchange, substitute etc.
How to achieve this?
Thanks
Abhi
To get synonyms using wordnet, simply do this:
>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets('swap'):
for lemma in synset.lemmas():
print lemma.name(),
barter swap swop trade trade swap swop switch swap # note the overlap between the synsets
To obtain some of the words you mentioned, you may have to include hypernyms as well:
>>> for synset in wn.synsets('swap'):
for hypernym in synset.hypernyms():
for ss in hypernym.lemmas(): # now you need to iterate through each synset returned by synset.hypernyms()
print ss.name(),
exchange interchange exchange change interchange travel go move locomote # again, some overlap

recursive extract synonym for a new word from NLTK

Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is out of my knowledge and I am trying to figure out its sentiment via finding out its synonyms via NLTK function, if the synonyms fall out my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from last time
The start input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
print synset
print synset.lemmas
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly new words include 'advanced','forward-looking','modern','innovational','groundbreaking' are the new words and not in my dictionary, so now I should use these words as start to call synsets function again until no new lemma word appearing.
Anyone can give me a demo code how to extract these lemma words from Synset and keep them in a set strcutre?
It involves dealing with re module in Python I think but I am quite new to Python. Another point I need to address is that I need to get adjective only, so only 's' and 'a' symbol in the Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would try to calculate the similarity score for a new word with any dictionary word, I need to define the measure. This problem is difficult since adj words are not arranged in hierarchy way and no available measure according to my knowledge. Anyone can advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers an easier access to WordNet).
def get_remote_synonyms(s, pos):
if pos == 'a':
syns = en.adjective.senses(s)
if syns:
allsyns = sum(syns, [])
# if there are multiple senses, take only the most frequent two
if len(syns) >= 2:
syns = syns[0] + syns[1]
else:
syns = syns[0]
else:
return []
remote = []
for syn in syns:
newsyns = en.adjective.senses(syn)
remote.extend([r for r in newsyns[0] if r not in allsyns])
return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.

Categories

Resources