Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher scoring pair being the most likely.
Approach:
I coded the following in Python using NLTK (several steps and imports removed for brevity):
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)
Results:
I then examined the results using two word pairs, one of which should be highly likely to co-occur and one which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
Questions:
Am I misunderstanding the use of collocations?
Is my code incorrect?
Is my assumption that the scores should be different wrong, and if so why?
Thank you very much for any information or help!
The NLTK collocations document seems pretty good to me. http://www.nltk.org/howto/collocations.html
You need to give the scorer a sizable corpus to work with; if the finder only sees a tiny snippet in which each bigram and each word occurs once, the contingency tables for the two pairs are identical, so presumably their likelihood-ratio scores come out identical too. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])
The output seems reasonable; it works well for "baseball", less so for "doctor" and "happy".
doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
('annoys', 19.009636692022365),
('had', 16.730384189212423), ('retorted', 15.190847940499127)]
baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
('park', 23.509042621473505), ('games', 23.105033513054011),
("player's", 16.227872863424668)]
happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
('family', 13.734352182441569),
(',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
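As a side note, if you want to score one specific pair against the corpus, the finder also has a score_ngram method. A small sketch continuing from the code above (score_ngram returns None for a bigram that never occurs in the corpus):

# Continuing from the Brown-corpus finder built above.
print(finder.score_ngram(bgm.likelihood_ratio, 'baseball', 'game'))
# An unseen bigram gives no score (score_ngram returns None):
print(finder.score_ngram(bgm.likelihood_ratio, 'gasoline', 'cashews'))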
I've been testing different Python lemmatizers for a solution I'm building. One difficult problem I've faced is that stemmers produce non-English words, which won't work for my use case. Although stemmers correctly map "politics" and "political" to the same stem, I'd like to do this with a lemmatizer, but spaCy and NLTK produce different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))
returning:
political = political
politics = politics
Where I want to return:
political = politics
politics = politics
Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exists in the dictionary. For English, in NLTK's WordNetLemmatizer the dictionary is WordNet, and as long as an entry is in WordNet, it is a lemma. There are entries for both "political" and "politics", so they're both valid lemmas:
from itertools import chain
from nltk.corpus import wordnet as wn

print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
[out]:
{'political'}
{'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
Maybe there are other tools out there that can do this, but I'll try the following as a first attempt.
First, stem all lemma names and group the lemmas that share a stem:
from collections import defaultdict
from wn import WordNet
from nltk.stem import PorterStemmer

porter = PorterStemmer()
wn = WordNet()

x = defaultdict(set)
i = 0
for lemma_name in wn.all_lemma_names():
    if lemma_name:
        x[porter.stem(lemma_name)].add(lemma_name)
        i += 1
Note: pip install -U wn
Then, as a sanity check, we verify that the number of lemmas is greater than the number of groups:
print(len(x.keys()), i)
[out]:
(128442, 147306)
Then we can take a look at the groupings:
for k in sorted(x):
    if len(x[k]) > 1:
        print(k, x[k])
It seems to do what we need, grouping the words together with their "root word", e.g.:
poke {'poke', 'poking'}
polar {'polarize', 'polarity', 'polarization', 'polar'}
polaris {'polarisation', 'polarise'}
pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
poleax {'poleaxe', 'poleax'}
polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
police_st {'police_state', 'police_station'}
polish {'polished', 'polisher', 'polish', 'polishing'}
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
poll {'poll', 'polls'}
But if we look closer there is some confusion:
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
So I would suggest the next step is
to loop through the groupings again, run some semantics to check the "relatedness" of the words, and split apart the words that might not be related; maybe try something like the Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (might not be a trivial task)
Or do some manual work and reorder the groupings. (The heavy lifting is already done by the Porter stemmer in the grouping; now it's time for some human work.)
Then you'll have to somehow find the root among each group of words (i.e. the prototype/label for the cluster).
Finally, using the resource of word groups you've created, you can now "find the root word".
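As a minimal sketch of that last step (my own heuristic, not part of the answer above): pick, say, the shortest lemma in each stem group as the group's "root" and map every member to it.

# Continuing from the stem groups collected in x above.
root_of = {}
for stem, lemmas in x.items():
    root = min(lemmas, key=len)  # crude prototype choice: the shortest lemma
    for lemma in lemmas:
        root_of[lemma] = root

print(root_of.get('political'))  # likely 'polite' here, which also shows why the relatedness check above is needed
print(root_of.get('politics'))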
Does NLTK or any other NLP tool allow constructing probability trees based on input sentences, thus storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not just probabilistically modelled on past input words (history) Wt-n, but also on lookahead words like Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or more. Are there any other libraries in Python which achieve this?
from collections import defaultdict
import nltk
import math

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources that discuss implementing such a system would also be welcome. nltk.model seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this must have been implemented somewhere already.
If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just backward context but also the forward context).
One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context is then the normalized sum over both the forward and backward contexts.
Extending your code:
from collections import defaultdict
import nltk
import numpy as np

ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int))  # reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
    for token, rev_token in zip(tokens[1:], tokens):
        ngram_rev[token][rev_token] += 1

for token in ngram:
    total = np.log(sum(ngram[token].values()))
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}
Now the context is in both ngram and ngram_rev which respectively hold the forward and backward contexts.
You should also account for smoothing: if a given phrase was not seen in your training corpus, you would just get zero probabilities. To avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
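To make the combination of the two contexts concrete, here is a minimal sketch. The helper bidirectional_logprob and the choice of averaging the available log-probabilities are my own, not part of the answer; the exact normalization is a design choice.

import numpy as np

def bidirectional_logprob(prev_word, word, next_word, ngram, ngram_rev):
    # Average the forward log-probability log P(word | prev_word) with the
    # backward log-probability log P(word | next_word), when each is available.
    fwd = ngram.get(prev_word, {}).get(word)
    bwd = ngram_rev.get(next_word, {}).get(word)
    scores = [s for s in (fwd, bwd) if s is not None]
    return np.mean(scores) if scores else float('-inf')  # unseen: needs smoothing

# Using the toy corpus above: how likely is "is" between "cat" and "cute"?
print(bidirectional_logprob('cat', 'is', 'cute', ngram, ngram_rev))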
The normal ngram algorithm traditionally works with prior context only, and for good reason: A bigram tagger makes decisions by considering the tags of the last two words, plus the current word. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing keeps you from training an ngram tagger where the ngram consists of words from both sides. And you can indeed do it easily with the NLTK.
The NLTK's ngram taggers all make tag ngrams, from the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:
import nltk
from nltk.tag import ContextTagger

class TwoSidedTagger(ContextTagger):
    left = 2
    right = 1

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right
        return tokens[index - left:index + right + 1]
This defines a tetragram tagger (2+1+1) where the current word is third in the ngram, not last as usual. You can then initialize and train a tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's see first how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:
data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)
Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training.
For simplicity I used a simple "default tagger" as the backoff tagger, but you'll probably need to use something more powerful (see the NLTK chapter again).
You can then use your tagger to tag new text, or evaluate it with an already tagged test set:
>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))
Predicting words:
The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:
data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the')) # most common word
twosidedtagger._train(train_sents)
It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:
def context(self, tokens, index, history):
    ...
    return tokens[index - left:index] + tokens[index + 1:index + right + 1]
This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and see how you like it.
Prediction:
My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.
To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier". You can use whatever features you want, including the words before and after (of course, how well it will work is your problem). The NLTK classifier API is similar to that of the ContextTagger, but the context function (aka feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code.
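For instance, a minimal sketch of that route with NLTK's ClassifierBasedTagger (the feature names and the small training slice are my own choices; training a word-level classifier like this can be slow and memory-hungry):

import nltk
from nltk.tag import ClassifierBasedTagger

def two_sided_features(tokens, index, history):
    # Hypothetical feature detector: look at words on both sides of the target.
    return {
        'prev2': tokens[index - 2] if index >= 2 else '<start>',
        'prev1': tokens[index - 1] if index >= 1 else '<start>',
        'next1': tokens[index + 1] if index + 1 < len(tokens) else '<end>',
    }

data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
tagger = ClassifierBasedTagger(feature_detector=two_sided_features,
                               train=data[400:600])  # small slice to keep it fast
print(tagger.tag("There were dogs everywhere .".split()))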
I would like to create a set of alternative words for a word. The alternative word has to be suitably different: replacing 'dog' with 'dalmatian' is too similar; I would want to replace 'dog' with 'cat'. Although not infallible, I think I can do this by getting the hypernym for a word and then that hypernym's hypernym (i.e. the grandparent synset), and finally getting all the grandchildren words of that grandparent.
Hopefully this makes sense. In pseudocode it should read
for each i as hypernym (synset)
    for each j as i.hypernym
        get all the holonyms for j as s
        for each s get all the holonyms as x
            print x
Is this doable?
from itertools import chain
from collections import defaultdict
from nltk.corpus import wordnet as wn

gflemma_holonym = defaultdict(set)

for ss in wn.all_synsets():
    if ss.part_holonyms() and ss.hypernyms() and ss.hypernyms()[0].hypernyms():
        grandfather = ss.hypernyms()[0].hypernyms()[0]  # grandfather concept.
        holonyms = list(chain(*[i.lemma_names() for i in ss.part_holonyms()]))
        for lemma in grandfather.lemma_names():
            gflemma_holonym[lemma].update(holonyms)

print(gflemma_holonym[u'edible_nut'])
print()
print(gflemma_holonym[u'geographical_area'])
[out]:
set([u'black_hickory', u'black_walnut', u'Juglans_nigra', u'black_walnut_tree'])
set([u'battlefield', u'fair', u'infield', u'field_of_honor', u'field_of_battle', u'battleground', u'city', u'bowl', u'field', u'stadium', u'funfair', u'outfield', u'diamond', u'urban_area', u'populated_area', u'desert', u'arena', u'carnival', u'baseball_diamond', u'sports_stadium', u'ball_field', u'baseball_field'])
Please note that the WordNet inventory is limited, especially when you are looking for relations between concepts/lemmas that are far apart (i.e. from a synset's grandparent down to the synset's holonyms).
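For the "grandchildren of the grandparent" variant described in the question itself, a minimal sketch (the helper cousin_lemmas is my own, restricted here to noun synsets):

from nltk.corpus import wordnet as wn

def cousin_lemmas(word):
    # Collect lemma names of the "grandchildren" of each synset's grandparent
    # hypernym, i.e. words that share a grandparent concept with the query.
    results = set()
    for ss in wn.synsets(word, pos=wn.NOUN):
        for parent in ss.hypernyms():
            for grandparent in parent.hypernyms():
                for child in grandparent.hyponyms():
                    for grandchild in child.hyponyms():
                        results.update(grandchild.lemma_names())
    return results - {word}

print(sorted(cousin_lemmas('dog'))[:20])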
You can use either lists or a dictionary to do this (a dictionary is more Pythonic).
For example, with a dictionary you would have something like this:
dictionary = {"dog": {"dalmatian", "stuff"}, "singer": {"rihanna", "eminem"}, "country": {"United states", "England"}}
print(dictionary['dog'])
Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is outside my knowledge, and I am trying to figure out its sentiment by finding its synonyms via an NLTK function. If the synonyms fall outside my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from the previous round.
The start input could be like this:
from nltk.corpus import wordnet

innovative = wordnet.synsets('innovative')
for synset in innovative:
    print(synset)
    print(synset.lemmas())
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly 'advanced', 'forward-looking', 'modern', 'innovational', and 'groundbreaking' are new words not in my dictionary, so now I should use these words as the starting point to call the synsets function again, until no new lemma words appear.
Can anyone give me demo code for extracting these lemma words from the Synsets and keeping them in a set structure?
I think it involves the re module in Python, but I am quite new to Python. Another point I need to address is that I need adjectives only, so only the 's' and 'a' symbols as in Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would try to calculate a similarity score between a new word and any dictionary word, and I need to define the measure. This problem is difficult since adjectives are not arranged hierarchically and, to my knowledge, there is no measure available. Can anyone advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers an easier access to WordNet).
import en  # NodeBox Linguistics library

def get_remote_synonyms(s, pos):
    if pos == 'a':
        syns = en.adjective.senses(s)
        if syns:
            allsyns = sum(syns, [])
            # if there are multiple senses, take only the most frequent two
            if len(syns) >= 2:
                syns = syns[0] + syns[1]
            else:
                syns = syns[0]
        else:
            return []
    remote = []
    for syn in syns:
        newsyns = en.adjective.senses(syn)
        remote.extend([r for r in newsyns[0] if r not in allsyns])
    return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.
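For reference, a minimal NLTK-only sketch of the set-building step the question asks about (the helper adjective_synonyms is mine; 'a' and 's' are WordNet's adjective and satellite-adjective tags):

from nltk.corpus import wordnet as wn

def adjective_synonyms(word):
    # Collect adjective-only lemma names for a word.
    names = set()
    for ss in wn.synsets(word):
        if ss.pos() in ('a', 's'):
            names.update(ss.lemma_names())
    return names

# Expand one extra level, as the question describes.
seeds = adjective_synonyms('innovative')
expanded = set(seeds)
for w in seeds:
    expanded |= adjective_synonyms(w)
print(expanded)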
I never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work in other cases, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given one name of a character in the story, the words that are used like it should be character names as well (bogus, I know; that is what should not work, but since I never dealt with NLP until this morning, I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for uppercase words (and receives "Alice" as the word to cluster around), originally there are ~500 uppercase words, and it's still pretty spot on as far as main characters go.
It does not work as well for other characters or in other stories, though it gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys

def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search(r"\w+", word)
        if m is None:
            continue
        word = m.group()
        if prev not in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print('usage: ./main.py file-to-read')
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])

    # Collect capitalized words longer than one character.
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print(len(upper), upper)

    # Drop common capitalized non-names.
    exclude = ["ME", "Yes", "English", "Which", "When", "WOULD", "ONE", "THAT", "That", "Here", "and", "And", "it", "It", "me"]
    exclude = [x for x in exclude if x in dict]
    for s in exclude:
        del dict[s]

    # For each word, find the word whose follower list overlaps with it the most.
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2:
                continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)

    # Keep capitalized words whose best match is "Alice".
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print(len(names), names)

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
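For comparison, NLTK ships a pretrained named-entity chunker that attempts exactly this extent-plus-class detection; a minimal sketch (it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data packages):

import nltk

sentence = "Microsoft CEO Steve Ballmer announced the deal."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)  # nested Tree with ORGANIZATION / PERSON / ... labels
print(tree)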
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
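For concreteness, a minimal sketch of the Jaccard coefficient over two context-word lists (the example lists are made up):

def jaccard(context_a, context_b):
    # |A intersect B| / |A union B| over the sets of context words; step 3 above
    # effectively counts only the overlap.
    a, b = set(context_a), set(context_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(jaccard(['said', 'thought', 'went'], ['said', 'replied', 'went']))  # 0.5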
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you run this against a hand-annotated corpus, you will find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations. Nevertheless, it is a great first attempt at NLP; keep it up!