My goal is to create a system that can take any arbitrary text, extract its sentences, strip the punctuation, and then, on one of the bare sentences, randomly replace NN- or VB-tagged words with a meronym, holonym, or synonym, or with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use the pattern and TextBlob packages. This is what I have done so far:
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string

s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])                     # strip all HTML markup
secam = tokenize(s, punctuation="")           # split the text into sentences
simica = secam[15].strip(string.punctuation)  # pick one sentence and trim surrounding punctuation
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]   # first synset of the fourth word in the sentence
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract, say, a hyponym from this djidja variable, it proves to be impossible, since each element is a Synset object that I can't manipulate directly.
Any idea how to extract the actual word reported in the hyponyms list? For example, print(djidja[2]) displays Synset(u'bowler'), so how do I get just 'bowler' out of it?
Recall that a synset is just a list of words marked as synonyms. Given a synset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernyms and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']
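Applied to the original example, the same synonyms attribute gets you the bare word out of each hyponym Synset. A small sketch, assuming djidja still holds the result of synsimica.hyponyms():
print(djidja[2].synonyms[0])                      # e.g. u'bowler' for Synset(u'bowler')
hyponym_words = [h.synonyms[0] for h in djidja]   # first synonym of every hyponym
print(hyponym_words)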
I've been testing different Python lemmatizers for a solution I'm building. One difficult problem I've faced is that stemmers produce non-English words, which won't work for my use case. Although stemmers correctly map "politics" and "political" to the same stem, I'd like to do this with a lemmatizer, but spaCy and NLTK produce different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))
returning:
political = political
politics = politics
Where I want to return:
political = politics
politics = politics
Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exists in the dictionary; for English in NLTK's WordNetLemmatizer, the dictionary is WordNet, and as long as an entry is in WordNet it is a lemma. There are entries for both "political" and "politics", so they're both valid lemmas:
from itertools import chain
from nltk.corpus import wordnet as wn

print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
[out]:
{'political'}
{'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
Maybe there are other tools out there that can do this, but here's a first attempt.
First, stem all lemma names and group the lemmas with the same stem:
from collections import defaultdict
from wn import WordNet
from nltk.stem import PorterStemmer

porter = PorterStemmer()
wn = WordNet()

x = defaultdict(set)   # Porter stem -> set of lemma names sharing that stem
i = 0
for lemma_name in wn.all_lemma_names():
    if lemma_name:
        x[porter.stem(lemma_name)].add(lemma_name)
        i += 1
Note: pip install -U wn
Then, as a sanity check, we confirm that there are fewer stem groups than lemmas:
print(len(x.keys()), i)
[out]:
(128442, 147306)
Then we can take a look at the groupings:
for k in sorted(x):
    if len(x[k]) > 1:
        print(k, x[k])
It seems to do what we need, grouping words together under their "root word", e.g.
poke {'poke', 'poking'}
polar {'polarize', 'polarity', 'polarization', 'polar'}
polaris {'polarisation', 'polarise'}
pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
poleax {'poleaxe', 'poleax'}
polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
police_st {'police_state', 'police_station'}
polish {'polished', 'polisher', 'polish', 'polishing'}
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
poll {'poll', 'polls'}
But if we look closer there is some confusion:
polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
So I would suggest that the next step is one of the following:
- Loop through the groupings again, run some semantic checks on the "relatedness" of the words, and split apart words that might not be related; maybe try something like the Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (probably not a trivial task).
- Or do some manual work and reorder the groupings. (The heavy lifting has already been done by the Porter stemmer in the grouping; now it's time for some human work.)
Then you'll have to somehow find the root among each group of words (i.e. prototype/label for the cluster).
Finally, using the resource of word groups you've created, you can now "find the root word".
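As a rough sketch of that last step (the heuristic here is my own assumption, not part of the original approach), you could pick a prototype per group, say the shortest member, and build a lookup table from every word to its group's prototype:
# x is the defaultdict(set) mapping each Porter stem to its group of lemma names (built above)
root_of = {}
for stem_key, group in x.items():
    prototype = min(group, key=len)   # crude heuristic: the shortest member stands in for the "root"
    for member in group:
        root_of[member] = prototype

print(root_of.get('political'))   # with the 'polit' group above this gives 'polite',
print(root_of.get('politics'))    # which shows why the manual clean-up step is still needed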
I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".
If I had to do it right now, I would probably just go with something like:
library(tm)
library(RWeka)

data("crude")   # example corpus shipped with tm
dictionary <- unique(unlist(lapply(crude, words)))
grep(pattern = LovinsStemmer("company"),
     ignore.case = T, x = dictionary, value = T)
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
This solution requires preprocessing your corpus. But once that is done it is a very quick dictionary lookup.
from collections import defaultdict
from stemming.porter2 import stem

with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

stems = defaultdict(list)
for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])
For the /usr/share/dict/words corpus, this produces the result
['leukocyte', "leukocyte's", 'leukocytes']
It uses the stemming module that can be installed with
pip install stemming
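If /usr/share/dict/words is not available (e.g. on Windows), a similar lookup can be built from NLTK's bundled word list instead. This is just a sketch under the assumption that the NLTK 'words' corpus has been downloaded:
from collections import defaultdict
from nltk.corpus import words as nltk_words   # may require nltk.download('words') first
from stemming.porter2 import stem

stems = defaultdict(list)
for word in nltk_words.words():
    stems[stem(word)].append(word)

print(stems[stem('leukocyte')])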
I'm trying to write a script that will look through my corpus, which contains 93,000 txt files, and find the frequency distribution of the trigrams present across all of them (so not separate frequency distributions per file, but one frequency distribution for the entire corpus). I've gotten it to produce the frequency distribution for a single file in the corpus but don't have the skills to get any further. Here's the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist

corpus_root = '/Users/jolijttamanaha/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print("Finished importing corpus")

f = speeches.open('Mr. THOMPSON of Pennsylvania.2010-12-07.2014sep17_at_233337.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
for k, v in fdist.items():
    print(k, v)
Thank you in advance for your help.
Once you define your speeches corpus with PlaintextCorpusReader as you have, you can get trigrams for the entire corpus very simply:
fdist = nltk.FreqDist(nltk.trigrams(speeches.words()))
But this has an undesirable glitch: it forms trigrams that span from the end of one file to the start of the next. Such trigrams do not represent tokens that could actually follow each other in a text; they are completely accidental. What you really want is to combine the trigram counts from each individual file, which you can get like this:
fdist = nltk.FreqDist()  # empty distribution
for filename in speeches.fileids():
    fdist.update(nltk.trigrams(speeches.words(filename)))
Your fdist now contains the cumulative statistics, which you can examine in the various available ways. E.g.,
fdist.tabulate(10)
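Or, since FreqDist behaves like a collections.Counter, you can pull out the most frequent trigrams programmatically; a small usage sketch:
for trigram, count in fdist.most_common(10):   # the ten most frequent trigrams across the corpus
    print(trigram, count)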
For NLTK's pre-packaged corpora, instead of using corpus.raw(), you can also try corpus.words(), e.g.
>>> from nltk.util import ngrams
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> trigrams = ngrams(brown.words(), 3)
>>> for i in trigrams:
...     print(i)
As #alexis pointed out, the code above should also work for custom corpora loaded with PlaintextCorpusReader, see http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
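To turn that into a frequency distribution rather than just printing each trigram, the generator can be fed straight into FreqDist; a short sketch along the same lines (outputs omitted):
>>> from nltk import FreqDist
>>> fdist = FreqDist(ngrams(brown.words(), 3))
>>> fdist.most_common(3)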
Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct its closest noun form?
That is, 'computer' or 'sugar', respectively.
Longer version:
I'm using python and NLTK, Wordnet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so I may get an estimate of their similarity (instead of the 'NONE' that normally gets returned with adjectives).
I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?
You might want to look at this example:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'
(from this SO answer) to see if it sends you in the right direction.
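For instance, applied to the words from the question (a quick sketch; I am not asserting the exact outputs, since they depend on which forms WordNet contains):
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('computing', 'v')   # lemmatize as a verb
>>> wnl.lemmatize('computing', 'n')   # lemmatize as a noun
>>> wnl.lemmatize('sugary', 'a')      # lemmatize as an adjective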
First extract all the possible candidates from wordnet synsets.
Then use difflib to compare the strings against your target stem.
>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> candidates = set(chain(*[ss.lemma_names() for ss in wn.all_synsets('n') if len([i for i in ss.lemma_names() if target in i]) > 0]))
>>> gcm(target, candidates)[0]
A more human-readable way to compute the candidates is as follows:
candidates = set()
for ss in wn.all_synsets('n'):
    for ln in ss.lemma_names():   # all lemma names for this synset
        if target in ln:          # keep lemma names that contain the stem
            candidates.add(ln)
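Putting the two steps together, a candidate-extraction helper plus the difflib ranking might look like the sketch below (the function name noun_candidates is mine, just for illustration):
from difflib import get_close_matches as gcm
from nltk.corpus import wordnet as wn

def noun_candidates(target):
    # every noun lemma name that contains the stem as a substring
    cands = set()
    for ss in wn.all_synsets('n'):
        for ln in ss.lemma_names():
            if target in ln:
                cands.add(ln)
    return cands

print(gcm('comput', sorted(noun_candidates('comput'))))   # closest noun forms, best matches first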
Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is outside my knowledge, and I am trying to figure out its sentiment by finding its synonyms via an NLTK function. If those synonyms fall outside my small dictionaries, I recursively call the NLTK function to find the synonyms of the synonyms from the last round.
The starting input could be like this:
from nltk.corpus import wordnet

innovative = wordnet.synsets('innovative')
for synset in innovative:
    print(synset)
    print(synset.lemmas())
It produces output like this:
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly 'advanced', 'forward-looking', 'modern', 'innovational', and 'groundbreaking' are new words that are not in my dictionary, so now I should use these words as the starting point and call the synsets function again until no new lemma words appear.
Can anyone give me some demo code showing how to extract these lemma words from a Synset and keep them in a set structure?
I think it involves the re module in Python, but I am quite new to Python. Another point I need to address is that I only want adjectives, so only the 's' and 'a' symbols as in Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I will try to calculate a similarity score between a new word and any dictionary word, so I need to define the measure. This problem is difficult since adjectives are not arranged hierarchically and, to my knowledge, no ready-made measure exists for them. Can anyone advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers easier access to WordNet.)
def get_remote_synonyms(s, pos):
    # assumes the NodeBox Linguistics `en` module has been imported
    if pos == 'a':
        syns = en.adjective.senses(s)
        if syns:
            allsyns = sum(syns, [])
            # if there are multiple senses, take only the most frequent two
            if len(syns) >= 2:
                syns = syns[0] + syns[1]
            else:
                syns = syns[0]
        else:
            return []
        remote = []
        for syn in syns:
            newsyns = en.adjective.senses(syn)
            remote.extend([r for r in newsyns[0] if r not in allsyns])
        return [unicode(i) for i in list(set(remote))]
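If you would rather stay within NLTK (the library used in the question), a rough equivalent that collects adjective-only lemma names (the 'a' and 's' tags the question asks for) into a set and expands them recursively could look like the sketch below. This is my own sketch; the function names and the depth limit are assumptions, not part of the original answer:
from nltk.corpus import wordnet

def adjective_synonyms(word):
    # lemma names of every adjective ('a') or satellite-adjective ('s') synset of the word
    syns = set()
    for synset in wordnet.synsets(word):
        if synset.pos() in ('a', 's'):
            for lemma in synset.lemmas():
                syns.add(lemma.name())
    return syns

def expand(word, known, max_depth=2):
    # repeatedly add synonyms-of-synonyms until nothing new appears or the depth limit is hit
    seen = set(known)
    frontier = {word}
    for _ in range(max_depth):
        new = set()
        for w in frontier:
            new |= adjective_synonyms(w) - seen
        if not new:
            break
        seen |= new
        frontier = new
    return seen

posList = ['interesting', 'novel', 'creative', 'state-of-the-art']
print(expand('innovative', posList))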
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.