How to get similar words in wordnet (not just synonyms)? - python

How to get similar words using wordnet, and not only the synonyms using synsets and their lemmas?
For example, search for "happy" on the WordNet online tool (http://wordnetweb.princeton.edu/). The first synset has only one synonym (happy), but if you click on it (on the S: link) you get additional words under "see also" and "similar to", such as "cheerful".
How do I get these words, and what are they called in WordNet terminology? I am using Python with NLTK and can only get the synsets and lemmas at best (excluding the hypernyms etc.).

"also_sees()" and "similar_tos()".
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets("happy")[0].also_sees()
[Synset('cheerful.a.01'), Synset('contented.a.01'), Synset('elated.a.01'), Synset('euphoric.a.01'), Synset('felicitous.a.01'), Synset('glad.a.01'), Synset('joyful.a.01'), Synset('joyous.a.01')]
>>> wn.synsets("happy")[0].similar_tos()
[Synset('blessed.s.06'), Synset('blissful.s.01'), Synset('bright.s.09'), Synset('golden.s.02'), Synset('laughing.s.01')]
If you want to see the full list of what a WordNet synset can do, try the dir() command. (It'll be full of attributes you probably don't want, so I stripped out the underscore-prefixed names below.)
>>> [func for func in dir(wn.synsets("happy")[0]) if func[0] != "_"]
['acyclic_tree', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'mst', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'usage_domains', 'verb_groups', 'wup_similarity']
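If you want plain words rather than Synset objects, here is a minimal sketch (my addition, not from the original answer) that collects the lemma names from the also_sees() and similar_tos() results:

from nltk.corpus import wordnet as wn

def related_words(word):
    """Lemma names from the 'see also' and 'similar to' synsets of a word."""
    related = set()
    for synset in wn.synsets(word):
        for rel in synset.also_sees() + synset.similar_tos():
            related.update(rel.lemma_names())
    return sorted(related)

print(related_words("happy"))  # includes e.g. 'blissful', 'cheerful', 'glad'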

Related

python lemmatizer that lemmatize "political" and "politics" to the same word

I've been testing different Python lemmatizers for a solution I'm building out. One difficult problem I've faced is that stemmers produce non-English words, which won't work for my use case. Although stemmers get "politics" and "political" to the same stem correctly, I'd like to do this with a lemmatizer, but spaCy and NLTK are producing different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))
returning:
political = political
politics = politics
Where I want to return:
political = politics
politics = politics
Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exists in the dictionary. For English, the dictionary used by NLTK's WordNetLemmatizer is WordNet, and as long as an entry exists in WordNet it counts as a lemma. There are entries for both "political" and "politics", so they're both valid lemmas:
from itertools import chain
from nltk.corpus import wordnet as wn

print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
[out]:
{'political'}
{'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
Maybe there are other tools out there that can do this, but I'll try the following as a first attempt.
First, stem all lemma names and group the lemmas with the same stem:
from collections import defaultdict
from wn import WordNet
from nltk.stem import PorterStemmer

porter = PorterStemmer()
wn = WordNet()

x = defaultdict(set)  # stem -> set of lemma names sharing that stem
i = 0
for lemma_name in wn.all_lemma_names():
    if lemma_name:
        x[porter.stem(lemma_name)].add(lemma_name)
        i += 1
Note: pip install -U wn
Then as a sanity check, we check that the no. of lemmas > no. of groups:
print(len(x.keys()), i)
[out]:
(128442, 147306)
Then we can take a look at the groupings:
for k in sorted(x):
    if len(x[k]) > 1:
        print(k, x[k])
It seems to do what we need, grouping the words together with their "root word", e.g.:
poke {'poke', 'poking'}
polar {'polarize', 'polarity', 'polarization', 'polar'}
polaris {'polarisation', 'polarise'}
pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
poleax {'poleaxe', 'poleax'}
polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
police_st {'police_state', 'police_station'}
polish {'polished', 'polisher', 'polish', 'polishing'}
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
poll {'poll', 'polls'}
But if we look closer there is some confusion:
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
So I would suggest the next step is one of the following:
- Loop through the groupings again, run some semantics to check the "relatedness" of the words, and split up the words that might not be related; maybe try something like the Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (this might not be a trivial task)
- Or do some manual work and reorder the groupings. (The heavy lifting has already been done by the Porter stemmer in the grouping; now it's time for some human work.)
Then you'll have to somehow find the root among each group of words (i.e. prototype/label for the cluster).
Finally, using the resource of word groups you've created, you can then "find the root word" for any given word.
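As a rough illustration of that last step (my own sketch, not from the original answer, and assuming the x mapping built above), one crude heuristic is to take the shortest lemma in each group as its prototype:

root_of = {}
for stem, group in x.items():
    prototype = min(group, key=len)  # crude heuristic: shortest lemma acts as the "root"
    for lemma_name in group:
        root_of[lemma_name] = prototype

print(root_of.get('political'))  # 'polite' for the 'polit' group above, which shows why the groups still need cleaning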

Python: CountVectorizer ignores one letter word "I"

I have a list called dictionary1. I use the following code to get sparse count matrices of texts:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)
I notice however that
list(set(dictionary1)-set(cv1.get_feature_names()))
results in ['i']. So "i" is in my dictionary but CountVectorizer ignores it (presumably some default setting discards one-char words). In the documentation I could not find such an option. Can someone point me to the problem? Indeed I would like to keep "i" in my analysis, as it could refer to more personal language.
A working workaround is to pass the dictionary as the vocabulary directly (actually I don't know why I did not do that in the first place), i.e.
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=[], vocabulary=dictionary1)
cv1._validate_vocabulary()
list(set(dictionary1)-set(cv1.get_feature_names())) then returns [].
In my original post, I should have mentioned that dictionary1 already is a list of unique tokens.
The default configuration tokenizes the string by extracting words of at least 2 letters.
Check out the scikit-learn documentation on text feature extraction for more details about its vectorizers.
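Concretely, the default token_pattern is r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters. A minimal sketch (my addition, not from the original answers) is to relax that pattern so one-letter words survive:

from sklearn.feature_extraction.text import CountVectorizer

# Relax the default token_pattern (r"(?u)\b\w\w+\b") so one-letter tokens like "i" are kept.
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cv.fit_transform(["I love ragdolls", "I received a cat"])
print(cv.get_feature_names())  # 'a' and 'i' now appear in the vocabulary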
In your case, you should use a different tokenizer, not analyzer. For example, you can use TweetTokenizer from nltk library:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
corpus = [...some_texts...]
tk = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tk.tokenize)
x = vectorizer.fit_transform(corpus)
For example, if corpus is defined as below, you would get:
corpus = ['I love ragdolls',
          'I received a cat',
          'I take it as my best friend']
vectorizer.get_feature_names()
> ['a', 'as', 'best', 'cat', 'friend', 'i', 'it', 'love', 'my', 'ragdolls', 'received', 'take']

How compare wordnet Synsets with another word?

I need to check whether one word is in the synsets of another word.
For example, take cat and dog.
First I need to find the synsets of cat with this code:
list = wn.synsets('cat')
Then the list of synsets is returned:
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]
So now I need to check whether dog is in this list.
How can I do that with NLTK in Python?
from nltk.corpus import wordnet as wn

for s in wn.synsets('cat'):
    lemmas = s.lemmas()
    for l in lemmas:
        if l.name() == 'dog':
            print(l.synset())
Notice that this code searches for a joint synset of the two words, i.e. one in which they are considered synonyms (so nothing will be found with your 'cat' and 'dog' example). However, there are also other relations in WordNet. For instance, you can search for a 'cat' synset that contains 'dog' as an antonym.
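As a sketch of that antonym idea (my addition, not from the original answer), you can walk the lemmas of each 'cat' synset and inspect their antonyms; whether anything is printed depends on the antonym links WordNet actually contains:

from nltk.corpus import wordnet as wn

for s in wn.synsets('cat'):
    for lemma in s.lemmas():
        for antonym in lemma.antonyms():  # antonym links live on lemmas, not synsets
            if antonym.name() == 'dog':
                print(s, 'has dog as an antonym via', lemma)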

Wordnet Synonyms not returning all values nltk

I am trying to get all the synonyms or similar words using NLTK's WordNet, but it is not returning all of them.
I am doing:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('swap')
[Synset('barter.n.01'), Synset('trade.v.04'), Synset('swap.v.02')]
I also tried this (from one of the Stack Overflow pages):
>>> for ss in wn.synsets('swap'):
        for sim in ss.similar_tos():
            print(' {}'.format(sim))
But I am not getting all the synonyms. I do not want to add synonyms to the wordnet.
I am expecting it to return exchange, interchange, substitute, etc.
How to achieve this?
Thanks
Abhi
To get synonyms using wordnet, simply do this:
>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets('swap'):
        for lemma in synset.lemmas():
            print(lemma.name(), end=' ')
barter swap swop trade trade swap swop switch swap # note the overlap between the synsets
To obtain some of the words you mentioned, you may have to include hypernyms as well:
>>> for synset in wn.synsets('swap'):
        for hypernym in synset.hypernyms():
            for ss in hypernym.lemmas():  # iterate through the lemmas of each synset returned by synset.hypernyms()
                print(ss.name(), end=' ')
exchange interchange exchange change interchange travel go move locomote # again, some overlap
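Putting the two together, a small sketch (my addition, under the same assumptions) that collects direct synonyms plus hypernym lemmas into one candidate set:

from nltk.corpus import wordnet as wn

def synonyms_and_hypernyms(word):
    """Lemma names from a word's synsets and their direct hypernyms."""
    words = set()
    for synset in wn.synsets(word):
        words.update(synset.lemma_names())
        for hypernym in synset.hypernyms():
            words.update(hypernym.lemma_names())
    return sorted(words)

print(synonyms_and_hypernyms('swap'))  # includes 'exchange', 'interchange', 'switch', ...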

Getting the closest noun from a stemmed word

Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct its closest noun form?
That is 'computer', or 'sugar' respectively
Longer version:
I'm using Python with NLTK and WordNet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so I may get an estimate of their similarity (instead of the 'NONE' that normally gets returned with adjectives).
I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?
You might want to look at this example:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'
(from this SO answer) to see if it sends you in the right direction.
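Building on that direction, one possible next step (a sketch of my own, not from the original answers) is to hop from the lemmatized verb or adjective over to noun forms via WordNet's derivationally related forms:

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def noun_forms(word, pos='v'):
    """Nouns derivationally related to a verb/adjective, via WordNet lemma links."""
    base = WordNetLemmatizer().lemmatize(word, pos)
    nouns = set()
    for synset in wn.synsets(base, pos=pos):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == 'n':
                    nouns.add(related.name())
    return sorted(nouns)

print(noun_forms('computing', 'v'))  # expected to include e.g. 'computer', 'computation'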
First extract all the possible candidates from wordnet synsets.
Then use difflib to compare the strings against your target stem.
>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> candidates = set(chain(*[ss.lemma_names() for ss in wn.all_synsets('n') if any(target in ln for ln in ss.lemma_names())]))
>>> gcm(target, candidates)[0]
A more human-readable way to compute the candidates is as follows:
candidates = set()
for ss in wn.all_synsets('n'):
    for lemma_name in ss.lemma_names():  # all lemma names for this noun synset
        if target in lemma_name:
            candidates.add(lemma_name)
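Wrapped up as a single helper (a sketch under the same assumptions, not from the original answer), the whole stem-to-noun lookup might look like this:

from difflib import get_close_matches as gcm
from itertools import chain
from nltk.corpus import wordnet as wn

def closest_noun(stem):
    """Find the noun lemma whose spelling is closest to a stemmed word."""
    candidates = set(chain(*[ss.lemma_names() for ss in wn.all_synsets('n')
                             if any(stem in ln for ln in ss.lemma_names())]))
    matches = gcm(stem, candidates)
    return matches[0] if matches else None

print(closest_noun('comput'))  # expected to return something like 'computer'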
