Error getting trigrams using gensim's Phrases - python

I want to extract all bigrams and trigrams of the given sentences.
from gensim.models import Phrases

documents = ["the mayor of new york was there",
             "Human Computer Interaction is a great and new subject",
             "machine learning can be useful sometimes",
             "new york mayor was present",
             "I love machine learning because it is a new subject area",
             "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
trigram = Phrases(bigram(sentence_stream, min_count=1, threshold=2, delimiter=b' '))
for sent in sentence_stream:
    #print(sent)
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)
The code works fine for bigrams and captures 'new york' and 'machine learning' as bigrams.
However, I get the following error when I try to extract trigrams:
TypeError: 'Phrases' object is not callable
Please let me know how to correct my code. I am following the example in the gensim documentation.

According to the docs, you can do:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
phrases = Phrases(sentence_stream)
bigram = Phraser(phrases)
trigram = Phrases(bigram[sentence_stream])
Your bigram is a Phrases object, and a Phrases object is not callable; apply it with the indexing syntax (bigram[...]) instead of calling it like a function.
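Putting it together, here is a minimal runnable sketch on the question's data (this follows the gensim 3.x API used above; note that the default delimiter joins phrases with an underscore, e.g. new_york):
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there",
             "new york mayor was present",
             "machine learning can be useful sometimes",
             "I love machine learning because it is a new subject area"]
sentence_stream = [doc.split(" ") for doc in documents]

# Train the bigram detector, then freeze it with Phraser for fast application.
bigram = Phraser(Phrases(sentence_stream, min_count=1, threshold=2))

# Train the trigram detector on the bigram-transformed stream.
trigram = Phraser(Phrases(bigram[sentence_stream], min_count=1, threshold=2))

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)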

Extracting sentences containing a keyword using set()

I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that contain the word 'van'. I can't get sentences with the phrases 'blue tinge' or 'off the road' because the code below can only handle single-word keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk

nltk.download('punkt')
search_words = set(["off the road", "blue tinge", "van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")
matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)
Output: ["Oh, I'd use the money for a van."]
If you want to check for an exact match of the search keywords, this can be accomplished with:
from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)
The output is:
['That is the off the road vehicle I had in mind for my adventure.',
"Oh, I'd use the money for a van.",
'The one with the blue tinge.']
If you want partial matching, use fuzzywuzzy for percentage-based matching.
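For example, a minimal sketch using fuzz.partial_ratio (the cutoff of 80 is an illustrative choice, not a recommendation):
from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
for sentence in sent_tokenize(text):
    # Keep the sentence if any keyword is a close partial match.
    if any(fuzz.partial_ratio(word, sentence) >= 80 for word in search_words):
        matches.append(sentence)
print(matches)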

How can I make compound words singular using an NLP library?

Issue
I'm trying to make compound words singular (plural to singular) using spaCy.
However, I cannot fix an error when transforming plural compound words to their singular forms.
How can I get the preferred output like the one below?
cute dog
two or three word
the christmas day
Development Environment
Python 3.9.1
Error
print(str(nlp(word).lemma_))
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'
Code
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["cute dogs", "two or three words", "the christmas days"]
for word in words:
    print(str(nlp(word).lemma_))
Trial
I also tried lemmatizing token by token:
import spacy

nlp = spacy.load("en_core_web_sm")
words = ["cute dogs", "two or three words", "the christmas days"]
for word in words:
    word = nlp(word)
    for token in word:
        print(str(token.lemma_))
However, this lemmatizes every word instead of just the last one, giving:
cute
dog
two
or
three
word
the
christmas
day
As you've found out, you can't get the lemma of a Doc, only of individual tokens. Multi-word expressions don't have lemmas in English; lemmas exist only for individual words. Conveniently, though, English compound words are pluralized by pluralizing only the last word, so you can just make the last word singular. Here's an example:
import spacy

nlp = spacy.load("en_core_web_sm")

def make_compound_singular(text):
    doc = nlp(text)
    if len(doc) == 1:
        return doc[0].lemma_
    else:
        # Keep everything except the last token, then append the last token's lemma.
        return doc[:-1].text + doc[-2].whitespace_ + doc[-1].lemma_

texts = ["cute dogs", "two or three words", "the christmas days"]
for text in texts:
    print(make_compound_singular(text))
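Run on the three test strings, this should print exactly the preferred output from the question: cute dog, two or three word, the christmas day.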

Extract Noun Phrases with Stanza and CoreNLPClient

I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done with the CoreNLPClient module in Stanza.
# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
Here is an example sentence; I am using the tregex function of the client to get all the noun phrases. The tregex function returns a dict of dicts in Python, so I needed to process its output before passing it to NLTK's Tree.fromstring to correctly extract the noun phrases as strings.
pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
Hence, I came up with the method stanza_phrases, which loops through the dict of dicts returned by tregex and formats it correctly for NLTK's Tree.fromstring.
def stanza_phrases(matches):
    Nps = []
    for match in matches:
        for items in matches['sentences']:
            for keys, values in items.items():
                s = '(ROOT\n' + values['match'] + ')'
                Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
The following helper generates a tree to be used by NLTK:
from nltk.tree import Tree

def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = subtree
                t = ' '.join(t.leaves())
                phrases.append(t)
    return phrases
Here is my output:
{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
Is there a way I can make this code more efficient, with fewer lines (especially the stanza_phrases and extract_phrase methods)?
from stanza.server import CoreNLPClient

# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text, pattern, annotators=_annotators)
    print("\n".join(["\t" + sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))

# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client, englishText, _annotators="tokenize,ssplit,pos,lemma,parse")

# French example
with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
    frenchText = "Je suis John."
    print('---')
    print(frenchText)
    noun_phrases(client, frenchText, _annotators="tokenize,ssplit,mwt,pos,lemma,parse")
The Constituent-Treelib library does exactly what you want to accomplish, and with very few lines of code.
First install it via: pip install constituent-treelib
Then, perform the following steps:
import spacy
from constituent_treelib import ConstituentTree

text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
nlp_pipeline = ConstituentTree.create_pipeline(ConstituentTree.Language.English)
doc = nlp_pipeline(text)
extracted_phrases = []
for sent in doc.sents:
    sentence = sent.text
    tree = ConstituentTree(sentence, nlp_pipeline)
    extracted_phrases.append(tree.extract_all_phrases())
# --------------------------------------------------------------
# Output of extracted_phrases:
[{'S': ['Albert Einstein was a German - born theoretical physicist .'],
'ADJP': ['German - born'],
'VP': ['was a German - born theoretical physicist'],
'NP': ['Albert Einstein', 'a German - born theoretical physicist']},
{'S': ['He developed the theory of relativity .'],
'PP': ['of relativity'],
'VP': ['developed the theory of relativity'],
'NP': ['the theory of relativity', 'the theory']}]
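Since the question only asks for noun phrases, you could then keep just the 'NP' entries of each per-sentence dict, for example:
# Flatten only the noun phrases out of the per-sentence dicts.
all_nps = [np for phrases in extracted_phrases for np in phrases.get('NP', [])]
print(all_nps)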

Input a noun to get another noun that has a similar meaning (Python, NLTK)

input: noun
output: noun
example:
Input: dog, I want to get an output of cat, wolf, tiger
or
Input: computer, I want to get an output of laptop, tablet, phone, etc.
Can anyone write Python code that does this, preferably using NLTK?
Sample code:
text = nltk.Text(word.lower() for word in nltk.corpus.treebank.words())
y = text.similar('dog')
print(y)
Update: I found this and it works better:
from py_thesaurus import Thesaurus
thesaurus = Thesaurus('dog')
print(thesaurus.get_synonym())
For 'dog', I am also getting "No matches". The similar words depend on the corpus you are using; for now, I have used the same corpus you did.
Code:
import nltk
text = nltk.Text(word.lower() for word in nltk.corpus.treebank.words())
text.similar('computer')
similar_words = text._word_context_index.similar_words('computer')
print(similar_words)
Output:
industrial insurance research manufacturing market car farm trading
presence stored
['manufacturing', 'industrial', 'insurance', 'research', 'market', 'presence', 'farm', 'trading', 'stored', 'car']
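If distributional similarity over a small corpus gives poor results like this, an NLTK-only alternative is the WordNet interface: co-hyponyms (sister terms) of a noun are often the kind of neighbours the question describes (dog -> cat, wolf, ...). A minimal sketch, assuming the wordnet corpus has been downloaded (the helper name related_nouns is illustrative):
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

def related_nouns(word):
    # Collect sister terms (co-hyponyms): other hyponyms of the word's hypernyms.
    related = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for hyponym in hypernym.hyponyms():
                related.update(lemma.name() for lemma in hyponym.lemmas())
    related.discard(word)
    return related

print(sorted(related_nouns('dog'))[:10])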

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words.
    #
    # Use the NLTK tokenizer to split the paragraph into sentences.
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it;
        # otherwise, call review_to_wordlist to get a list of words.
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists).
    return sentences
However, then I will miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
Hence, I want to capture the important bigrams, trigrams etc. in my dataset and input into my word2vec model.
I am new to word2vec and struggling with how to do it. Please help me.
First of all, you should use gensim's Phrases class in order to get bigrams, which works as pointed out in the docs:
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
To get trigrams and so on, you should take the bigram model that you already have and apply Phrases to its output again, and so on.
Example:
trigram_model = Phrases(bigram_sentences)
There is also a good notebook and video that explain how to use it: the notebook, the video.
The most important part is how to use it on real-life sentences, which is as follows:
# create the bigram model
bigram_model = Phrases(unigram_sentences)

# apply the trained model to each sentence (each a list of tokens)
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# train a trigram model on top of the bigrammed sentences
trigram_model = Phrases(bigram_sentences)
Hope this helps you, but next time give us more information on what you are using, etc.
P.S.: Now that you have edited your question: you are not doing anything to get bigrams, just splitting the text. You have to use Phrases in order to get multi-word tokens like "New York" as bigrams.
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = [
    "the mayor of new york was there",
    "machine learning can be useful sometimes",
    "new york mayor was present"
]
sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
Phrases and Phraser are what you should be looking for:
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
Once you are done adding vocabulary, use Phraser for faster access and efficient memory usage. It is not mandatory, but useful.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
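A short usage sketch (assuming data_words is a list of token lists, as in the snippet above): apply the bigram model first, then the trigram model on top of its output.
# bigrams first, then trigrams over the bigrammed tokens
data_words_trigrams = [trigram_mod[bigram_mod[sent]] for sent in data_words]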
Thanks,
