Word sense disambiguation algorithm in Python [duplicate] - python

This question already has answers here:
Word sense disambiguation in NLTK Python
(6 answers)
Closed 9 years ago.
I'm developing a simple NLP project, and I'm trying, given a text and a word, to find the most likely sense of that word in the text.
Is there any implementation of WSD algorithms in Python? It's not quite clear whether there is something in NLTK that can help me. I'd be happy even with a naive implementation like the Lesk algorithm.
I've read similar questions like Word sense disambiguation in NLTK Python, but they give nothing but a reference to an NLTK book, which does not go very deep into the WSD problem.

In short: https://github.com/alvations/pywsd
In long: There are endless techniques used for WSD, ranging from mind-blasting machine learning techniques that require lots of GPU power to simply using the information in WordNet, or even just using frequencies, see http://dl.acm.org/citation.cfm?id=1459355 .
Let's start with the simple Lesk algorithm that allows optional stemming, see http://en.wikipedia.org/wiki/Lesk_algorithm:
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

ps = PorterStemmer()

def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0; lesk_sense = None
    context_sentence = context_sentence.split()
    for ss in wn.synsets(ambiguous_word):
        # If POS is specified, skip synsets with a different POS.
        if pos and ss.pos != pos:
            continue
        lesk_dictionary = []
        # Includes definition.
        lesk_dictionary += ss.definition.split()
        # Includes lemma_names.
        lesk_dictionary += ss.lemma_names
        # Optional: includes lemma_names of hypernyms and hyponyms.
        if hyperhypo == True:
            lesk_dictionary += list(chain(*[i.lemma_names for i in ss.hypernyms()+ss.hyponyms()]))
        if stem == True: # Matching exact words causes sparsity, so let's match stems.
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]
            context_sentence = [ps.stem(i) for i in context_sentence]
        overlaps = set(lesk_dictionary).intersection(context_sentence)
        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense
print "Context:", bank_sents[0]
answer = lesk(bank_sents[0],'bank')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", bank_sents[1]
answer = lesk(bank_sents[1],'bank','n')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", plant_sents[0]
answer = lesk(plant_sents[0],'plant','n', True)
print "Sense:", answer
print "Definition:",answer.definition
print
Other than Lesk-like algorithms, there are different similarity measures that people have tried; a good, somewhat dated but still useful survey: http://acl.ldc.upenn.edu/P/P97/P97-1008.pdf
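For a taste of what such similarity measures look like in practice, here is a minimal sketch using two measures that ship with NLTK's WordNet interface, path_similarity and wup_similarity (the sense numbers such as bank.n.01 and bank.n.02 assume the WordNet 3.0 data bundled with NLTK; this is only an illustration, not part of the original answer):
from nltk.corpus import wordnet as wn

# bank.n.01 is the river bank, bank.n.02 the financial institution (WordNet 3.0).
river_bank = wn.synset('bank.n.01')
money_bank = wn.synset('bank.n.02')
money = wn.synsets('money')[0]

# Shortest-path similarity over the hypernym hierarchy (higher = more similar).
print 'river bank vs money:', river_bank.path_similarity(money)
print 'money bank vs money:', money_bank.path_similarity(money)

# Wu-Palmer similarity, based on the depth of the lowest common hypernym.
print 'river bank vs money:', river_bank.wup_similarity(money)
print 'money bank vs money:', money_bank.wup_similarity(money)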

You can try getting the first sense for each word using the WordNet corpus included in NLTK, with this short code:
from nltk.corpus import wordnet as wn

def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','n')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','v')
print '%s: %s' % (best_synset.name, best_synset.definition)
Will print:
bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location
Surprisingly, this works quite well, as the first sense often dominates the other senses.
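If you want a frequency-based variant rather than just taking the first sense, a small sketch along the same lines picks the synset whose lemmas have the highest total sense-tagged count (this assumes a recent NLTK 3.x, where lemmas(), name(), definition() and Lemma.count() are methods; get_most_frequent_sense is a hypothetical helper, not an NLTK function):
from nltk.corpus import wordnet as wn

def get_most_frequent_sense(word, pos=None):
    """Pick the synset whose lemmas have the highest total sense-tagged
    frequency in WordNet; ties fall back to the first listed sense."""
    synsets = wn.synsets(word, pos)
    if not synsets:
        return None
    return max(synsets, key=lambda ss: sum(l.count() for l in ss.lemmas()))

best = get_most_frequent_sense('bank', 'n')
print '%s: %s' % (best.name(), best.definition())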

For WSD in Python you can try to use the WordNet bindings in NLTK or the Gensim library. The building blocks are there, but developing the complete algorithm is probably up to you.
For instance, using WordNet you can implement a Simplified Lesk algorithm, as described in the Wikipedia entry.
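As a starting point, here is a rough sketch of Simplified Lesk with NLTK's WordNet corpus: each sense is scored by the overlap between the context words and the sense's gloss plus example sentences, with stopwords removed (this assumes NLTK 3.x method-style accessors and the stopwords corpus; simplified_lesk is a hypothetical helper, not a library function):
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def simplified_lesk(context_sentence, ambiguous_word, pos=None):
    context = set(w.lower() for w in context_sentence.split()) - STOP
    synsets = wn.synsets(ambiguous_word, pos)
    best_sense, best_overlap = None, 0
    for ss in synsets:
        # The "signature" is the gloss plus the example sentences of this sense.
        signature = set(ss.definition().lower().split())
        for example in ss.examples():
            signature.update(example.lower().split())
        overlap = len(context & (signature - STOP))
        if overlap > best_overlap:
            best_sense, best_overlap = ss, overlap
    # Fall back to the first (most frequent) sense when nothing overlaps.
    return best_sense if best_sense else (synsets[0] if synsets else None)

answer = simplified_lesk('I went to the bank to deposit my money', 'bank', 'n')
print answer, answer.definition()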

Related

concordance for a phrase using NLTK in Python

Is it possible to get concordance for a phrase in NLTK?
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_loc = "c://temp//text//"
files = ".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)
corpus.concordance("claim")
For example, the above returns:
on okay okay okay i can give you the claim number and my information and
decide on the shop okay okay so the claim number is xxxx - xx - xxxx got
Now if I try corpus.concordance("claim number") it does not work... I do have code that does this just by using the .partition() method and some further coding on the same, but I'm wondering if it's possible to do the same using concordance.
According to this issue it is not (yet) possible to search for multiple words with the concordance() function.
If you read the discussion under the very issue that #b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:
>>> from nltk.app import concordance
>>> concordance()
I munged together this solution...
def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]
    offsets_norm = []
    # For each token in the phrase list, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])

    # We have found the offset of a phrase if the rebased values intersect
    # --
    # http://stackoverflow.com/a/3852792/454773
    # the intersection method takes an arbitrary amount of arguments
    # result = set(d[0]).intersection(*d[1:])
    # --
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    concordance_txt = ([text.tokens[map(lambda x: x - left_margin if (x - left_margin) > 0 else 0, [offset])[0]:offset + len(phraseList) + right_margin]
                        for offset in intersects])

    outputs = [''.join([x + ' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs
def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)
n_concordance_tokenised(text1,'monstrous size')
>> [u'one was of a most monstrous size . ... This came towards ',
u'; for Whales of a monstrous size are oftentimes cast up dead ']

recursive extract synonym for a new word from NLTK

Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is outside my knowledge, and I am trying to figure out its sentiment by finding its synonyms via an NLTK function. If the synonyms fall outside my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from the previous round.
The starting input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
    print synset
    print synset.lemmas
It produces output like this:
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly 'advanced', 'forward-looking', 'modern', 'innovational' and 'groundbreaking' are new words that are not in my dictionary, so now I should use these words as the starting point to call the synsets function again, until no new lemma words appear.
Can anyone give me demo code showing how to extract these lemma words from a Synset and keep them in a set structure?
I think it involves dealing with the re module in Python, but I am quite new to Python. Another point I need to address is that I only need adjectives, so only the 's' and 'a' symbols as in Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I will try to calculate a similarity score between a new word and any dictionary word, so I need to define the measure. This problem is difficult since adjectives are not arranged in a hierarchy and, to my knowledge, there is no ready-made measure. Can anyone advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers easier access to WordNet.)
def get_remote_synonyms(s, pos):
    if pos == 'a':
        syns = en.adjective.senses(s)
        if syns:
            allsyns = sum(syns, [])
            # if there are multiple senses, take only the most frequent two
            if len(syns) >= 2:
                syns = syns[0] + syns[1]
            else:
                syns = syns[0]
        else:
            return []
    remote = []
    for syn in syns:
        newsyns = en.adjective.senses(syn)
        remote.extend([r for r in newsyns[0] if r not in allsyns])
    return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.
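If you would rather stay within NLTK, here is a rough sketch of the recursive idea the question describes: a breadth-first expansion over adjective senses only ('a' and 's' POS tags), collecting lemma names in a set until no new words appear or a depth limit is reached, since transitive synonymy drifts quickly (this assumes NLTK 3.x, where pos() and lemma_names() are Synset methods; expand_adjective_synonyms is a hypothetical helper):
from nltk.corpus import wordnet as wn

ADJ_POS = ('a', 's')  # adjectives and satellite adjectives only

def expand_adjective_synonyms(word, max_depth=2):
    """Breadth-first expansion of adjective synonyms via WordNet lemma names."""
    seen = set([word])
    frontier = set([word])
    for _ in range(max_depth):
        new_words = set()
        for w in frontier:
            for ss in wn.synsets(w):
                if ss.pos() not in ADJ_POS:
                    continue
                for lemma in ss.lemma_names():
                    lemma = lemma.replace('_', ' ')
                    if lemma not in seen:
                        new_words.add(lemma)
        if not new_words:
            break
        seen |= new_words
        frontier = new_words
    return seen

print sorted(expand_adjective_synonyms('innovative', max_depth=1))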

NER naive algorithm

I never really dealt with NLP but had an idea about NER which should NOT have worked and somehow DOES exceptionally well in one case. I do not understand why it works, why it shouldn't work, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given one name of a character in the story, the words that are used like it should be character names as well (bogus, I know; that is what should not have worked, but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper case words, and it's still pretty spot on as far as main characters go.
It does not work that well with other characters and in other stories, though it gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper

def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search("\w+", word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper), upper
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [x for x in exclude if dict.has_key(x)]
    for s in exclude:
        del dict[s]
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2: continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out whether something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran it against a hand-annotated corpus you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP. Keep it up!
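For a concrete sense of the Jaccard coefficient mentioned above: your Step 3 counts the size of the intersection of two context lists, while the Jaccard coefficient additionally normalises by the size of their union, so very frequent words do not win simply by having long context lists. A small sketch on toy data (the contexts dict is hypothetical, standing in for what your mimic_dict builds):
def jaccard(context_a, context_b):
    """Jaccard coefficient between two bags of neighbouring words:
    |intersection| / |union| of the two context sets."""
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return float(len(a & b)) / len(a | b)

# Toy context lists such as the ones mimic_dict would collect.
contexts = {
    'Alice': ['said', 'thought', 'went', 'replied', 'cried'],
    'Queen': ['said', 'shouted', 'went', 'cried'],
    'garden': ['was', 'of', 'door'],
}
print jaccard(contexts['Alice'], contexts['Queen'])    # relatively high
print jaccard(contexts['Alice'], contexts['garden'])   # low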

Python: How to format large text outputs to be 'prettier' and user defined

Ahoy StackOverflow-ers!
I have a rather trivial question, but it's something that I haven't been able to find in other questions here or in online tutorials: how might we format the output of a Python program so that it fits a certain aesthetic format without any extra modules?
The aim here is that I have a block of plain text, like that from a newspaper article, and I've filtered through it earlier to extract just the words I want. Now I'd like to print it out so that each line is at most 70 characters long and no word is broken if it would normally fall on a line break.
Using .ljust(70) as in stdout.write(article.ljust(70)) doesn't seem to do anything to it.
The other thing about not having words broken is going from this:
Latest news tragic m
urder innocent victi
ms family quiet neig
hbourhood
to looking more like this:
Latest news tragic
murder innocent
victims family
quiet neighbourhood
Thank you all kindly in advance!
Check out the Python textwrap module (a standard module):
>>> import textwrap
>>> t="""Latest news tragic murder innocent victims family quiet neighbourhood"""
>>> print "\n".join(textwrap.wrap(t, width=20))
Latest news tragic
murder innocent
victims family quiet
neighbourhood
>>>
Use the textwrap module:
http://docs.python.org/library/textwrap.html
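For instance, textwrap.fill returns the wrapped text as a single string, which is equivalent to joining the textwrap.wrap output shown in the other answer:
import textwrap

text = "Latest news tragic murder innocent victims family quiet neighbourhood"
print textwrap.fill(text, width=20)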
I'm sure this can be improved on. Without any libraries:
def wrap_text(text, wrap_column=70):
    sentence = ''
    for word in text.split(' '):
        if len(sentence + word) <= wrap_column:
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence
EDIT: From the comments, if you want to use regular expressions to just pick out words, use this:
import re

def wrap_text(text, wrap_column=70):
    sentence = ''
    for word in re.findall(r'\w+', text):
        if len(sentence + word) <= wrap_column:
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence

Understanding NLTK collocation scoring for bigrams and trigrams

Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher scoring pair being the most likely.
Approach:
I coded the following in Python using NLTK (several steps and imports removed for brevity):
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams( bgm.likelihood_ratio )
print scored
Results:
I then examined the results using two word pairs, one of which should be highly likely to co-occur, and one pair which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
Questions:
Am I misunderstanding the use of collocations?
Is my code incorrect?
Is my assumption that the scores should be different wrong, and if so why?
Thank you very much for any information or help!
The NLTK collocations document seems pretty good to me. http://www.nltk.org/howto/collocations.html
You need to give the scorer some actual sizable corpus to work with. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print 'doctor', prefix_keys['doctor'][:5]
print 'baseball', prefix_keys['baseball'][:5]
print 'happy', prefix_keys['happy'][:5]
The output seems reasonable; it works well for 'baseball', less so for 'doctor' and 'happy'.
doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
('annoys', 19.009636692022365),
('had', 16.730384189212423), ('retorted', 15.190847940499127)]
baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
('park', 23.509042621473505), ('games', 23.105033513054011),
("player's", 16.227872863424668)]
happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
('family', 13.734352182441569),
(',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
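Coming back to the original question of comparing two specific pairs: once the finder is built over a sizable corpus, you can score individual bigrams with score_ngram, the per-ngram counterpart of score_ngrams (a sketch; as far as I recall, score_ngram returns None when the bigram never occurs in the corpus, which for "roasted cashews" in Brown is itself informative):
import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())

# Score individual pairs against the Brown corpus counts.
for pair in [('baseball', 'game'), ('roasted', 'cashews'), ('gasoline', 'cashews')]:
    score = finder.score_ngram(bgm.likelihood_ratio, *pair)
    print pair, score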
