A homograph is a word that has the same spelling as another word but a different sound and a different meaning, for example, lead (to go in front of) / lead (a metal).
I was trying to use spaCy word vectors to compare documents with each other by summing the word vectors for each document and then computing cosine similarity. If, for example, spaCy assigns the same vector to the two senses of 'lead' listed above, the results will probably be bad.
In the code below, why does the similarity between the two 'bank' tokens come out as 1.00?
import spacy

nlp = spacy.load('en')
str1 = 'The guy went inside the bank to take out some money'
str2 = 'The house by the river bank.'
str1_tokenized = nlp(str1.decode('utf8'))
str2_tokenized = nlp(str2.decode('utf8'))
token1 = str1_tokenized[-6]  # 'bank' (financial institution)
token2 = str2_tokenized[-2]  # 'bank' (river bank)
print 'token1 = {} token2 = {}'.format(token1, token2)
print token1.similarity(token2)
The output of the given program is:
token1 = bank token2 = bank
1.0
As kntgu already pointed out, spaCy distinguishes tokens by their characters, not by their semantic meaning. The sense2vec approach by the developers of spaCy concatenates tokens with their POS tag and can help in the case of 'lead_VERB' vs. 'lead_NOUN'. However, it will not help in your example of 'bank (river bank)' vs. 'bank (financial institution)', as both are nouns.
spaCy does not support any solution to this out of the box, but you can have a look at contextualized word representations like ELMo or BERT. Both generate word vectors for a given sentence, taking the context into account, so I would expect the vectors for the two 'bank' tokens to be substantially different.
Both are relatively recent approaches and are not as convenient to use, but they might help in your use case. For ELMo, there is a command line tool which lets you generate word embeddings for a set of sentences without having to write any code: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
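If you want to try this from Python, here is a minimal sketch of my own (not part of the original answer) using BERT via the Hugging Face transformers package; it assumes the bert-base-uncased checkpoint is available and shows that the two contextual 'bank' vectors are no longer identical:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    # Encode the sentence and return the hidden state of the 'bank' token.
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    idx = tokens.index('bank')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, idx]

v1 = bank_vector('The guy went inside the bank to take out some money')
v2 = bank_vector('The house by the river bank.')
# Cosine similarity of the two contextual vectors; expected to be well below 1.0.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())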
I use the lda package to model topics with a large set of text documents. A simplified(!) example (I removed all other cleaning steps, lemmatization, bigrams, etc.) of my code is below, and I'm happy with the results so far. But now I'm struggling to write code to predict topics for a new text. I can't find any reference in lda's documentation about save/load/predict options. I could add the new text to my set and fit the model again, but that is an expensive way of doing it.
I know I can do it with gensim, but somehow the results from the gensim model are less impressive, so I'd rather stick with my initial LDA model.
I will appreciate any suggestions!
My code:
import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english')) # nltk stopwords list
documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
'Game of Thrones stars Kit Harington and Rose Leslie to wed',
'Tony Booth: Till Death Us Do Part actor dies at 85',
'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
"Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
'How long can The Apprentice keep going?',
'Strictly Come Dancing beats X Factor for Saturday viewers',
"Joe Sugg: 8 things to know about one of YouTube's biggest stars",
'Sir Terry Wogan named greatest BBC radio presenter',
"DJs celebrate 50 years of Radio 1 and 2'"]
clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)
# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]
cvectorizer = CountVectorizer(max_features=10000, stop_words=stops)
cvz = cvectorizer.fit_transform(docs_vect)
n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 3
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))
# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'
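A hedged sketch of one way this could be done, assuming the installed lda release exposes LDA.transform() for unseen documents (recent versions document it) and reusing the fitted cvectorizer and lda_model from the code above:
# Clean the new document with the same pipeline used for training.
new_tokens = nltk.tokenize.word_tokenize(new_text.lower())
new_doc = ' '.join(t for t in new_tokens if t not in stops)

# Reuse the *fitted* vectorizer; do not fit it again.
new_vec = cvectorizer.transform([new_doc])

# Infer the topic distribution of the unseen document
# (assumes lda_model.transform() is available in your lda version).
doc_topic = lda_model.transform(new_vec)
print(doc_topic)           # shape (1, n_topics)
print(doc_topic.argmax())  # index of the most probable topic

# For saving/loading, the fitted objects can be pickled and reloaded later.
import pickle
with open('lda_model.pkl', 'wb') as f:
    pickle.dump((cvectorizer, lda_model), f)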
I want to extract all bigrams and trigrams from the given sentences.
from gensim.models import Phrases
documents = ["the mayor of new york was there", "Human Computer Interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
trigram = Phrases(bigram(sentence_stream, min_count=1, threshold=2, delimiter=b' '))
for sent in sentence_stream:
    #print(sent)
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)
The code works fine for bigrams and captures 'new york' and 'machine learning' as bigrams.
However, I get the following error when I try to add trigrams:
TypeError: 'Phrases' object is not callable
Please let me know how to correct my code.
I am following the example in the gensim documentation.
According to the docs, you can do:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
phrases = Phrases(sentence_stream)
bigram = Phraser(phrases)
trigram = Phrases(bigram[sentence_stream])
In your code, bigram is a Phrases object; calling it like a function (bigram(...)) is what raises the TypeError. Apply it with subscript syntax, bigram[sentence_stream], as shown above.
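Putting it together with the sentences from the question, a minimal corrected sketch (keeping the same min_count/threshold/delimiter arguments, which assume the gensim 3.x-style API used in the question) could look like this:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there", "Human Computer Interaction is a great and new subject", "machine learning can be useful sometimes", "new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

# First pass: detect bigrams such as 'new york' and 'machine learning'.
bigram_phrases = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram = Phraser(bigram_phrases)

# Second pass: run Phrases over the bigrammed stream to pick up trigrams.
trigram_phrases = Phrases(bigram[sentence_stream], min_count=1, threshold=2, delimiter=b' ')
trigram = Phraser(trigram_phrases)

for sent in sentence_stream:
    print(trigram[bigram[sent]])  # apply the models with [], not ()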
Input:"My favorite game is call of duty."
And I set "call of duty" as a key-words, this phrase will be one word in tokenize process.
Finally want to get the result:['my','favorite','game','is','call of duty']
So, how to set the key-words in python NLP ?
I think what you want is keyphrase extraction, and you can do it, for instance, by first tagging each word with its PoS tag and then applying some sort of regular expression over the PoS tags to join interesting words into keyphrases.
import nltk
from nltk import pos_tag
from nltk import tokenize

def extract_phrases(my_tree, phrase):
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))
    for child in my_tree:
        if type(child) is nltk.Tree:
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)
    return my_phrases

def main():
    sentences = ["My favorite game is call of duty"]
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)
    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([x[0] for x in phrase.leaves()]))

if __name__ == "__main__":
    main()
This will output the following:
Noun phrases:
(NP favorite/JJ game/NN) favorite_game
(NP call/NN) call
(NP duty/NN) duty
But you can play around with
grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
trying other types of expressions, so that you can get exactly what you want, depending on the words/tags you want to join together.
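For example (an illustrative variant of my own, reusing the code above), adding a noun-preposition-noun alternative should make the chunker join 'call of duty' into a single noun phrase:
# Hypothetical grammar variant: the extra <NN><IN><NN> pattern lets the chunker
# group 'call/NN of/IN duty/NN' into one NP.
grammar = "NP: {<NN><IN><NN>|<DT>?<JJ>*<NN>|<NNP>*}"
cp = nltk.RegexpParser(grammar)
tree = cp.parse(pos_tag(tokenize.word_tokenize("My favorite game is call of duty")))
for phrase in extract_phrases(tree, 'NP'):
    print("_".join(leaf[0] for leaf in phrase.leaves()))
# expected to include: call_of_duty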
Also if you are interested, check this very good introduction to keyphrase/word extraction:
https://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/
This is, of course, way too late to be useful to the OP, but I thought I'd put this answer here for others:
It sounds like what you might really be asking is: how do I make sure that compound phrases like 'call of duty' get grouped together as one token?
You can use NLTK's multi-word expression tokenizer, like so:
import nltk

string = 'My favorite game is call of duty'
tokenized_string = nltk.word_tokenize(string)
mwe = [('call', 'of', 'duty')]
# separator=' ' keeps the merged token as 'call of duty'
# (the default separator would produce 'call_of_duty')
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')
tokenized_string = mwe_tokenizer.tokenize(tokenized_string)
where mwe stands for multi-word expression. The value of tokenized_string will be ['My', 'favorite', 'game', 'is', 'call of duty'].
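As a small follow-up of my own, further expressions can be registered on the same tokenizer with add_mwe; matching is case-sensitive against the token sequence:
mwe_tokenizer.add_mwe(('New', 'York'))
print(mwe_tokenizer.tokenize('I moved to New York last year'.split()))
# ['I', 'moved', 'to', 'New York', 'last', 'year']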
WordNet is great, but I'm having a hard time getting synonyms in NLTK. If you search "similar to" for the word 'small', like here, it shows all of the synonyms.
Basically, I just need to know the following:
wn.synsets('word')[i].option(), where option can be hypernyms or antonyms, but what is the option for getting synonyms?
If you want the synonyms in the synset (aka the lemmas that make up the set), you can get them with lemma_names():
>>> for ss in wn.synsets('small'):
...     print(ss.name(), ss.lemma_names())
small.n.01 ['small']
small.n.02 ['small']
small.a.01 ['small', 'little']
minor.s.10 ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']
little.s.03 ['little', 'small']
small.s.04 ['small']
humble.s.01 ['humble', 'low', 'lowly', 'modest', 'small']
...
You can use wordnet.synsets and lemma_names in order to get all the synonyms:
Example:
from itertools import chain
from nltk.corpus import wordnet
synonyms = wordnet.synsets(text)  # text is the word you are looking up
lemmas = set(chain.from_iterable([word.lemma_names() for word in synonyms]))
Demo:
>>> synonyms = wordnet.synsets('change')
>>> set(chain.from_iterable([word.lemma_names() for word in synonyms]))
set([u'interchange', u'convert', u'variety', u'vary', u'exchange', u'modify', u'alteration', u'switch', u'commute', u'shift', u'modification', u'deepen', u'transfer', u'alter', u'change'])
You might be interested in a Synset:
>>> wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
That's the same list of top-level entries that the web interface gave you.
If you also want the "similar to" list, that's not the same thing as the synonyms. For that, you call similar_tos() on each Synset.
So, to show the same information as the website, start with something like this:
for ss in wn.synsets('small'):
    print(ss)
    for sim in ss.similar_tos():
        print(' {}'.format(sim))
Of course the website also prints the part of speech (sim.pos()), list of lemmas (sim.lemma_names()), definition (sim.definition()), and examples (sim.examples()) for each synset at both levels, groups them by part of speech, and adds links to other things that you can follow, and so forth. But this should be enough to get you started.
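A short sketch (my own addition) of printing those extra fields for each synset and its "similar to" entries:
from nltk.corpus import wordnet as wn

for ss in wn.synsets('small'):
    print(ss.name(), ss.pos(), ss.lemma_names())
    print('  definition:', ss.definition())
    print('  examples:  ', ss.examples())
    for sim in ss.similar_tos():
        print('    similar to:', sim.name(), sim.lemma_names())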
The simplest program to print the synonyms of a given word:
from nltk.corpus import wordnet

for syn in wordnet.synsets("good"):
    for name in syn.lemma_names():
        print(name)
Here are some helper functions to make NLTK easier to use, and two examples of how those functions can be used.
import nltk
from nltk.corpus import wordnet

def download_nltk_dependencies_if_needed():
    try:
        nltk.word_tokenize('foobar')
    except LookupError:
        nltk.download('punkt')
    try:
        nltk.pos_tag(nltk.word_tokenize('foobar'))
    except LookupError:
        nltk.download('averaged_perceptron_tagger')

def get_some_word_synonyms(word):
    word = word.lower()
    synonyms = []
    synsets = wordnet.synsets(word)
    if (len(synsets) == 0):
        return []
    synset = synsets[0]
    lemma_names = synset.lemma_names()
    for lemma_name in lemma_names:
        lemma_name = lemma_name.lower().replace('_', ' ')
        if (lemma_name != word and lemma_name not in synonyms):
            synonyms.append(lemma_name)
    return synonyms

def get_all_word_synonyms(word):
    word = word.lower()
    synonyms = []
    synsets = wordnet.synsets(word)
    if (len(synsets) == 0):
        return []
    for synset in synsets:
        lemma_names = synset.lemma_names()
        for lemma_name in lemma_names:
            lemma_name = lemma_name.lower().replace('_', ' ')
            if (lemma_name != word and lemma_name not in synonyms):
                synonyms.append(lemma_name)
    return synonyms
Example 1: get_some_word_synonyms
This approach tends to return the most relevant synonyms, but some words like "angry" won't return any synonyms.
download_nltk_dependencies_if_needed()
words = ['dog', 'fire', 'erupted', 'throw', 'sweet', 'center', 'said', 'angry', 'iPhone', 'ThisIsNotARealWorddd', 'awesome', 'amazing', 'jim dandy', 'change']
for word in words:
    print('Synonyms for {}:'.format(word))
    synonyms = get_some_word_synonyms(word)
    for synonym in synonyms:
        print("    {}".format(synonym))
Example 1 output:
Synonyms for dog:
domestic dog
canis familiaris
Synonyms for fire:
Synonyms for erupted:
erupt
break out
Synonyms for throw:
Synonyms for sweet:
henry sweet
Synonyms for center:
centre
middle
heart
eye
Synonyms for said:
state
say
tell
Synonyms for angry:
Synonyms for iPhone:
Synonyms for ThisIsNotARealWorddd:
Synonyms for awesome:
amazing
awe-inspiring
awful
awing
Synonyms for amazing:
amaze
astonish
astound
Synonyms for jim dandy:
Synonyms for change:
alteration
modification
Example 2: get_all_word_synonyms
This approach will return all possible synonyms, but some may not be very relevant.
download_nltk_dependencies_if_needed()
words = ['dog', 'fire', 'erupted', 'throw', 'sweet', 'center', 'said', 'angry', 'iPhone', 'ThisIsNotARealWorddd', 'awesome', 'amazing', 'jim dandy', 'change']
for word in words:
    print('Synonyms for {}:'.format(word))
    synonyms = get_all_word_synonyms(word)
    for synonym in synonyms:
        print("    {}".format(synonym))
Example 2 output:
Synonyms for dog:
domestic dog
canis familiaris
frump
cad
bounder
blackguard
hound
heel
frank
frankfurter
hotdog
hot dog
wiener
wienerwurst
weenie
pawl
detent
click
andiron
firedog
dog-iron
chase
chase after
trail
tail
tag
give chase
go after
track
Synonyms for fire:
firing
flame
flaming
ardor
ardour
fervor
fervour
fervency
fervidness
attack
flak
flack
blast
open fire
discharge
displace
give notice
can
dismiss
give the axe
send away
sack
force out
give the sack
terminate
go off
arouse
elicit
enkindle
kindle
evoke
raise
provoke
burn
burn down
fuel
Synonyms for erupted:
erupt
break out
irrupt
flare up
flare
break open
burst out
ignite
catch fire
take fire
combust
conflagrate
come out
break through
push through
belch
extravasate
break
burst
recrudesce
Synonyms for throw:
stroke
cam stroke
shed
cast
cast off
shake off
throw off
throw away
drop
thrust
give
flip
switch
project
contrive
bewilder
bemuse
discombobulate
hurl
hold
have
make
confuse
fox
befuddle
fuddle
bedevil
confound
Synonyms for sweet:
henry sweet
dessert
afters
confection
sweetness
sugariness
angelic
angelical
cherubic
seraphic
dulcet
honeyed
mellifluous
mellisonant
gratifying
odoriferous
odorous
perfumed
scented
sweet-scented
sweet-smelling
fresh
unfermented
sugared
sweetened
sweet-flavored
sweetly
Synonyms for center:
centre
middle
heart
eye
center field
centerfield
midpoint
kernel
substance
core
essence
gist
heart and soul
inwardness
marrow
meat
nub
pith
sum
nitty-gritty
center of attention
centre of attention
nerve center
nerve centre
snapper
plaza
mall
shopping mall
shopping center
shopping centre
focus on
center on
revolve around
revolve about
concentrate on
concentrate
focus
pore
rivet
halfway
midway
Synonyms for said:
state
say
tell
allege
aver
suppose
read
order
enjoin
pronounce
articulate
enounce
sound out
enunciate
aforesaid
aforementioned
Synonyms for angry:
furious
raging
tempestuous
wild
Synonyms for iPhone:
Synonyms for ThisIsNotARealWorddd:
Synonyms for awesome:
amazing
awe-inspiring
awful
awing
Synonyms for amazing:
amaze
astonish
astound
perplex
vex
stick
get
puzzle
mystify
baffle
beat
pose
bewilder
flummox
stupefy
nonplus
gravel
dumbfound
astonishing
awe-inspiring
awesome
awful
awing
Synonyms for jim dandy:
Synonyms for change:
alteration
modification
variety
alter
modify
vary
switch
shift
exchange
commute
convert
interchange
transfer
deepen
This worked for me:
wordnet.synsets('change')[0].hypernyms()[0].lemma_names()
I coded a thesaurus lookup for synonyms recently; I used this function:
from nltk.corpus import wordnet

def find_synonyms(keyword):
    synonyms = []
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            synonyms.append(lemma.name())
    return str(synonyms)
But if you prefer to host your own dictionary, you might be interested in my project for offline synonym dictionary lookup on my GitHub page:
https://github.com/syauqiex/offline_english_synonym_dictionary
Perhaps these are not synonyms in the proper terminology of WordNet, but I also want my function to return all similar words, like 'weeny', 'flyspeck', etc. You can see them for the word 'small' at the link in the question. I used this code:
from nltk.corpus import wordnet as wn
def get_all_synonyms(word):
    synonyms = []
    for ss in wn.synsets(word):
        synonyms.extend(ss.lemma_names())
        for sim in ss.similar_tos():
            synonyms_batch = sim.lemma_names()
            synonyms.extend(synonyms_batch)
    synonyms = set(synonyms)
    if word in synonyms:
        synonyms.remove(word)
    synonyms = [synonym.replace('_', ' ') for synonym in synonyms]
    return synonyms
get_all_synonyms('small')