I am currently using unigrams in my word2vec model as follows.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words.
    #
    # Use the NLTK tokenizer to split the paragraph into sentences.
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, this way I will miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
Hence, I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.
I am new to word2vec and struggling with how to do it. Please help me.
First of all, you should use gensim's Phrases class to get bigrams, which works as shown in the docs:
>>> phrases = Phrases(sentence_stream)  # sentence_stream: an iterable of tokenized sentences
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
To get trigrams and beyond, take the bigram-transformed sentences you already have and apply Phrases to them again, and so on.
Example:
trigram_model = Phrases(bigram_sentences)
There is also a good notebook and video that explain how to use it (see the linked notebook and video).
The most important part is how to use it on real-life sentences, which is as follows:
# create the bigram model from the unigram sentences
bigram_model = Phrases(unigram_sentences)

# apply the trained model to each sentence and collect the transformed sentences
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# train a trigram model on top of the bigram-transformed sentences
trigram_model = Phrases(bigram_sentences)
Hope this helps, but next time please give us more information about what you are using, etc.
P.S.: Now that you have edited your question: you are not doing anything to get bigrams, just splitting the text into words. You have to use Phrases in order to get phrases like "New York" as bigrams.
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = [
    "the mayor of new york was there",
    "machine learning can be useful sometimes",
    "new york mayor was present"
]

sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
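Once the phrased sentences are available, they can be fed to Word2Vec, which was the original goal. A minimal sketch (the parameter choice here is a placeholder for the tiny toy corpus, not a recommendation):

from gensim.models import Word2Vec

# Train word2vec on the phraser-transformed sentences.
# min_count=1 only because the toy corpus is tiny; tune this for real data.
phrased_sentences = [bigram_phraser[sent] for sent in sentence_stream]
w2v_model = Word2Vec(phrased_sentences, min_count=1)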
Phrases and Phraser are what you should be looking for.
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
Once you are done adding to the vocabulary, use Phraser for faster access and more efficient memory usage. It is not mandatory, but it is useful.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
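To actually transform your corpus before training word2vec, apply the models in sequence. A rough sketch, assuming data_words is a list of token lists as in the snippet above:

# Illustrative usage: the trigram model is applied on top of the bigram output.
data_bigrams = [bigram_mod[sent] for sent in data_words]
data_trigrams = [trigram_mod[bigram_mod[sent]] for sent in data_words]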
I have a data frame with sentences and the respective part-of-speech tag for each word (below is an extract of the data I'm working with; the data is taken from the SNLI corpus). For each sentence in my collection I would like to extract unigrams and the corresponding POS tag of each word.
For instance, if I have the following:
vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')
doc = {'sent' : ['Two women are embracing while holding to go packages .'], 'tags' : ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}
sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()
Then I would get the following unigrams output:
array(['embracing', 'holding', 'packages', 'women'], dtype=object)
But I don't know how to retain the part-of-speech tags after this. I tried a lookup approach with the unigrams, but since they may differ from the words in the sentence (for instance, if you do sentence.split(' ')), you don't necessarily get the same tokens. Any suggestions on how I can extract unigrams and retain the corresponding part-of-speech tag?
After reviewing the source code of sklearn's CountVectorizer class, particularly the fit function, I don't believe the class has any way of tracking the original document element indexes relative to the extracted unigram features; the unigram features do not necessarily have the same tokens as the original text. Other than the simple solution provided below, you might have to rely on some other method/library to achieve your desired result. If there is a particular case that fails, I'd suggest adding it to your question, as it might help people come up with solutions to your problem.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')

doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()

sentence_tags = []
for unigram in sentence_unigrams:
    for i in range(len(sent_token_list)):
        if sent_token_list[i] == unigram:
            sentence_tags.append(tags_token_list[i])

print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']
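A possible refinement, offered as a sketch rather than a tested solution: the fitted vectorizer exposes the exact preprocessing it applies via build_analyzer(), so you can push each raw token through the same pipeline and keep its tag whenever the token survives. This avoids mismatches caused by lowercasing or stripped punctuation.

# Sketch: align tags by running each raw token through the vectorizer's own analyzer.
analyzer = sentence.build_analyzer()
unigram_to_tag = {}
for token, tag in zip(sent_token_list, tags_token_list):
    analyzed = analyzer(token)   # [] for stop words / punctuation, else [normalized token]
    if analyzed:
        unigram_to_tag[analyzed[0]] = tag

print([unigram_to_tag.get(u) for u in sentence_unigrams])
# Expected for this example: ['VERB', 'VERB', 'NOUN', 'NOUN']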
A homograph is a word that has the same spelling as another word but has a different sound and a different meaning, for example, lead (to go in front of) / lead (a metal).
I was trying to use spaCy word vectors to compare documents with each other by summing the word vectors of each document and then computing cosine similarity. If, for example, spaCy has the same vector for the two senses of 'lead' listed above, the results will probably be bad.
In the code below, why does the similarity between the two 'bank' tokens come out as 1.0?
import spacy
nlp = spacy.load('en')
str1 = 'The guy went inside the bank to take out some money'
str2 = 'The house by the river bank.'
str1_tokenized = nlp(str1.decode('utf8'))
str2_tokenized = nlp(str2.decode('utf8'))
token1 = str1_tokenized[-6]
token2 = str2_tokenized[-2]
print 'token1 = {} token2 = {}'.format(token1,token2)
print token1.similarity(token2)
The output for given program is
token1 = bank token2 = bank
1.0
As kntgu already pointed out, spaCy distinguishes tokens by their characters, not by their semantic meaning. The sense2vec approach by the developers of spaCy concatenates tokens with their POS-tag and can help in the case of 'lead_VERB' vs. 'lead_NOUN'. However, it will not help in your example of 'bank (river bank)' vs. 'bank (financial institute)', as both are nouns.
spaCy does not support any solution to this out of the box, but you can have a look at contextualized word representations like ELMo or BERT. Both generate word vectors for a given sentence, taking the context into account. Therefore, I assume the vectors for both 'bank' tokens will be substantially different.
Both are relatively recent approaches and are not as comfortable to use, but might help in your use case. For ELMo, there is a command line tool which lets you generate word embeddings for a set of sentences without having to write any code: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
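For illustration, here is a rough sketch of the idea using BERT via the Hugging Face transformers package (the model name, the indexing, and the expectation about the resulting similarity are my assumptions, not something spaCy provides):

# Sketch: contextual vectors for the two 'bank' tokens from bert-base-uncased.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    # Return the hidden state of the first sub-token matching `word` in `sentence`.
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = tokens.index(word)  # assumes `word` survives as a single sub-token
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[idx]

v1 = token_vector("The guy went inside the bank to take out some money", "bank")
v2 = token_vector("The house by the river bank.", "bank")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
# The exact value depends on the model, but it should be well below 1.0,
# unlike the static spaCy vectors.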
I want to extract all bigrams and trigrams of the given sentences.
from gensim.models import Phrases
documents = ["the mayor of new york was there",
             "Human Computer Interaction is a great and new subject",
             "machine learning can be useful sometimes",
             "new york mayor was present",
             "I love machine learning because it is a new subject area",
             "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
trigram = Phrases(bigram(sentence_stream, min_count=1, threshold=2, delimiter=b' '))
for sent in sentence_stream:
    #print(sent)
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)
The code works fine for bigrams and captures 'new york' and 'machine learning' as bigrams.
However, I get the following error when I try to insert trigrams.
TypeError: 'Phrases' object is not callable
Please let me know how to correct my code.
I am following the example documentation of gensim.
According to the docs, you can do:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
phrases = Phrases(sentence_stream)
bigram = Phraser(phrases)
trigram = Phrases(bigram[sentence_stream])
bigram, being a Phrases object, cannot be called like a function, which is what your code is doing.
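Applied to your sentence_stream, the corrected models can then be used roughly like this (a sketch; wrapping the trigram model in a Phraser mirrors the bigram step and is optional):

# Illustrative usage with the models built above.
trigram_phraser = Phraser(trigram)
for sent in sentence_stream:
    bigrams_ = bigram[sent]              # bigram is the Phraser built from phrases
    trigrams_ = trigram_phraser[bigrams_]
    print(trigrams_)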
from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how similar "This is a book about cars, dinosaurs, and fences" is to "I like cars and birds" using cosine similarity.
The two sentences have effectively one word in common, which is "cars"; however, when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few texts don't work well, and even when they do, it's often just good luck or contrived suitability.
In particular:
- a model with only one example can't sensibly create multiple topics, as there's no contrast between documents to model
- a model presented with words it hasn't seen before ignores them, so your test doc appears to it the same as the single word 'cars' – the only word it has seen before
In this case, both your single training document and the test document get modeled by LSI as having zero contribution from the 0th topic and a positive contribution (of different magnitudes) from the 1st topic. Since cosine similarity merely compares angle, not magnitude, both docs lie along the same line from the origin, so there is no angle of difference between them, and thus similarity 1.0.
But if you had better training data, and a test doc with more than a single known word, you might start to get more sensible results. Even a few dozen training docs and a test doc with several known words might help, but hundreds, thousands, or tens of thousands of training docs would be even better.
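As a quick numeric illustration of the collinearity point above (plain numpy, independent of gensim): any two vectors that lie on the same line through the origin have cosine similarity 1.0, regardless of their lengths.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two 2-topic vectors with zero weight on topic 0 and different magnitudes on topic 1.
doc_a = np.array([0.0, 0.8])
doc_b = np.array([0.0, 0.3])
print(cosine(doc_a, doc_b))   # 1.0 -- same direction, different magnitude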
I wrote a simple document classifier and I am currently testing it on the Brown Corpus. However, my accuracy is still very low (0.16). I've already excluded stopwords. Any other ideas on how to improve the classifier's performance?
import nltk, random
from nltk.corpus import brown, stopwords

documents = [(list(brown.words(fileid)), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
random.shuffle(documents)

stop = set(stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)
word_features = list(all_words.keys())[:3000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
If that's really your code, it's a wonder you get anything at all. w.lower is not a string, it's a function (method) object. You need to add the parentheses:
>>> w = "The"
>>> w.lower
<built-in method lower of str object at 0x10231e8b8>
>>> w.lower()
'the'
(But who knows really. You need to fix the code in your question, it's full of cut-and-paste errors and who knows what else. Next time, help us help you better.)
I would start by fixing the first line of your code, which has a stray import pasted into it: change
import corpus documents = [(list(brown.words(fileid)), category)
to:
documents = [(list(brown.words(fileid)), category) ...
This is in addition to fixing w.lower, as the other answer says.
Making these changes and following the two links below, which implement a basic Naive Bayes classifier without removing stop words, gave me an accuracy of 33%, which is a lot higher than 16%.
https://pythonprogramming.net/words-as-features-nltk-tutorial/
https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/?completed=/words-as-features-nltk-tutorial/
There are lots of things you can try to see if they improve your accuracy:
1- removing stop words
2- removing punctuation
3- removing the most common words and the least common words
4- normalizing the text
5- stemming or lemmatizing the text
6- I think this feature set gives True if the word is present and False if it is not. You could use a count or a frequency instead (see the sketch after this list).
7- You can use unigrams, bigrams and trigrams, or combinations of those.
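For suggestion 6, here is a minimal sketch of a count-based variant of the question's document_features function (names reused from the question; whether raw counts actually help NLTK's NaiveBayesClassifier, which treats feature values as discrete, is something to verify empirically):

from collections import Counter

def document_features_counts(document):
    # Record how often each feature word occurs, instead of mere presence/absence.
    counts = Counter(w.lower() for w in document)
    return {'count(%s)' % word: counts[word] for word in word_features}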
Hope that helped