I am supposed to do some exercises with GloVe in Python. Most of them don't give me any problems, but now I am supposed to find the 5 most similar words to "norway - war + peace" using the "glove-wiki-gigaword-100" model. When I run my code, it just says that the 'word' is not in the vocabulary. I'm guessing this is some kind of formatting issue, but I don't know how to express the query.
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100") # download the model and return as object ready for use
bests = model.most_similar("norway - war + peace", topn= 5)
print("5 most similar words to 'norway - war + peace':")
for best in bests:
    print(best)
Gensim's word-vector models only handle words they saw during training. Here you pass an entire expression as if it were a single word. What you want to do is:
Get the vectors v1, v2 and v3 for the words "norway", "war" and "peace", respectively.
Compute v = v1 - v2 + v3.
Get the words most similar to v.
To do so, you will need model.most_similar() and model.similar_by_vector(). On a full Word2Vec model these live under model.wv; the object returned by api.load is already a KeyedVectors instance, so you call them on it directly. Note that most_similar() can do something similar to these three steps in one call, though in a slightly more complicated way, using a list of positive words and a list of negative words. See the documentation for details.
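The three steps above can be sketched without downloading the model. The vectors below are made-up 3-dimensional toy values (an assumption for illustration, not real GloVe numbers), and `analogy` is a hypothetical helper, not a gensim function:

```python
import math

# Toy vectors invented for this sketch; real GloVe vectors have 100 dimensions.
toy_vectors = {
    "norway": [0.9, 0.1, 0.3],
    "war": [0.1, 0.9, 0.1],
    "peace": [0.1, 0.1, 0.9],
    "sweden": [0.8, 0.0, 0.6],
    "battle": [0.2, 0.8, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, positive, negative, topn=5):
    """Add the positive word vectors, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave the query words themselves out of the ranking.
    excluded = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(toy_vectors, positive=["norway", "peace"], negative=["war"], topn=2)
for word, score in result:
    print(word, round(score, 3))
```

With the real model, the equivalent call should be `model.most_similar(positive=["norway", "peace"], negative=["war"], topn=5)`. Gensim normalizes the vectors before combining them, so its scores will not match this plain-arithmetic sketch exactly.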
I have a series of 100,000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adapted from spaCy 101):
import spacy
from spacy.matcher import Matcher
# Assumption: any English pipeline would do; "en_core_web_sm" is the small standard one.
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
    # Called once per match: bump the document's sentiment score.
    doc.sentiment += 0.1
myemotionalwordlist = ['you', 'superb', 'great', 'free']
sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made of minerals'
sentences = [sentence0, sentence1, sentence2]
pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
# spaCy v2 signature; in spaCy v3 this becomes matcher.add("Emotional", pattern2, on_match=set_sentiment)
matcher.add("Emotional", set_sentiment, *pattern2)  # Match one or more emotional words
for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have suggestions for a better way of doing this?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) What's the best way of scaling this up? I am thinking about adding all sentences to a pandas DataFrame and then applying the match function to each one of them.
Thanks in advance!
There are going to be two main approaches:
the one you have started with, which is keeping a list of emotional words and counting how often they appear;
showing a machine learning model examples of what you consider emotional and unemotional sentences, and letting it work out the difference.
The first way will get better as you give it more words, but you will eventually hit a limit, simply due to the ambiguity and flexibility of human language. For example, while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".
any suggestions on how I can extract emotional words from wordnet?
Take a look at SentiWordNet, which adds a measure of positivity, negativity or neutrality to each WordNet entry. For "emotional" you could extract just those entries that have either a positive or a negative score over, e.g., 0.5. (Watch out for the non-commercial-only licence.)
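The threshold idea can be sketched as follows. The lexicon entries and their scores are invented for illustration (they are not real SentiWordNet values), and both helper functions are hypothetical names:

```python
# Hypothetical (word, positive score, negative score) entries; the numbers are
# invented for this sketch and are not taken from SentiWordNet.
toy_lexicon = [
    ("superb", 0.875, 0.0),
    ("great", 0.75, 0.0),
    ("awful", 0.0, 0.875),
    ("mineral", 0.0, 0.0),
    ("free", 0.25, 0.0),
]


def emotional_words(lexicon, threshold=0.5):
    """Keep words whose positive or negative score exceeds the threshold."""
    return {word for word, pos, neg in lexicon if pos > threshold or neg > threshold}


def emotion_count(sentence, words):
    """Count how many whitespace-separated tokens are in the emotional word set."""
    return sum(token in words for token in sentence.lower().split())


selected = emotional_words(toy_lexicon)
sentences = ["You are a superb great free person", "Rocks are made of minerals"]
scores = [emotion_count(sentence, selected) for sentence in sentences]
print(sorted(selected), scores)
```

With NLTK, the real scores should be available through `nltk.corpus.sentiwordnet` (`senti_synsets(word)`, then `pos_score()` and `neg_score()` on each result) once the corpus has been downloaded.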
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be more than you can realistically annotate. Other downsides are that the models often need much more compute power and memory (a serious issue if you need to be offline or are working on a mobile device), and that they are a black box.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(spaCy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning; see https://spacy.io/usage/training#transfer-learning. Googling for "spacy sentiment analysis" will also find quite a few tutorials.)
I have some volunteer essay writings in the format of:
volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...
I want to identify similar users based on their essays. I feel like word2vec is suitable for problems like this. However, since I want to embed the user names in the model too, I am not sure how to do it. The examples I found on the internet only use the words (see the example code).
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
Given that, I am wondering whether there is a special way of doing this in word2vec, or whether I can simply treat the user names as ordinary words in the model's input. Please let me know your thoughts on this.
I am happy to provide more details if needed.
Word2vec infers a word's representation from its surrounding words: words that often appear in similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack Word2vec, you would need to make sure that the student names appear frequently enough within that window (perhaps at the beginning and at the end of each sentence, or something like that).
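One way to read that suggestion is sketched below. The `USER_` prefix and the `add_name_tokens` helper are hypothetical choices, made only so that names cannot collide with ordinary words:

```python
# Each record: (volunteer names, essay as a list of tokenized sentences).
records = [
    (["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
]


def add_name_tokens(records, prefix="USER_"):
    """Wrap every sentence in its authors' names so they fall inside the context window."""
    training_sentences = []
    for names, essay in records:
        name_tokens = [prefix + name for name in names]
        for sentence in essay:
            training_sentences.append(name_tokens + sentence + name_tokens)
    return training_sentences


training_sentences = add_name_tokens(records)
for sentence in training_sentences:
    print(sentence)
```

The result can be passed to `gensim.models.Word2Vec(training_sentences, min_count=1)` like any other list of token lists. Whether the resulting name vectors are useful depends on how much text each volunteer has.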
Alternatively, you can have a look at Doc2vec. During training, each document gets an ID and the model learns an embedding for that ID; these embeddings sit in a lookup table as if they were word embeddings. If you use student names as document IDs, you get student embeddings. If you have multiple essays from one student, I suppose you would need to hack Gensim a little bit so that it does not require a unique ID for each essay.
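A sketch of how the essays could be prepared for Doc2vec, with the volunteer names as tags. The namedtuple below is only a stand-in with the same two fields as gensim's `TaggedDocument`, so that the sketch runs without gensim installed; `to_tagged_documents` is a hypothetical helper:

```python
from collections import namedtuple

# Stand-in mirroring the (words, tags) fields of gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Each record: (volunteer names, essay as a list of tokenized sentences).
records = [
    (["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
]


def to_tagged_documents(records):
    """Build one tagged document per essay; every co-author becomes a tag."""
    documents = []
    for names, essay in records:
        # Flatten the essay's sentences into a single token list.
        words = [token for sentence in essay for token in sentence]
        documents.append(TaggedDocument(words=words, tags=list(names)))
    return documents


documents = to_tagged_documents(records)
for document in documents:
    print(document.tags, document.words)
```

Swapping the stand-in for `from gensim.models.doc2vec import TaggedDocument` and training `Doc2Vec(documents, min_count=1)` should give one vector per name. As far as I know, gensim accepts the same tag on several documents and several tags on one document, so the hack mentioned above may not be needed.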
I ran word2vec on a text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all extremely close to 1. Even the tenth-closest score is still around 0.998, so I'm not getting any meaningful differences in similarity between words, and the "similar" words are meaningless.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
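A minimal sketch of the stop-word step applied to every sentence rather than only to all_words[0]. The stop-word set and the sample sentences are hypothetical stand-ins; with NLTK the list would usually come from `nltk.corpus.stopwords.words("english")`:

```python
# Small hypothetical stop-word set, used here so the sketch runs without NLTK data.
stop_words = {"the", "is", "a", "of"}

# Stand-in for the tokenized corpus: one token list per sentence.
all_words = [
    ["the", "cat", "is", "a", "pet"],
    ["the", "price", "of", "oil", "rose"],
]

# Filter every sentence and keep one token list per sentence.
filtered = [
    [word for word in sentence if word.lower() not in stop_words]
    for sentence in all_words
]

# Quick sanity check of corpus size before training.
print(len(filtered), "sentences,", sum(len(s) for s in filtered), "tokens")
```

Printing the sentence and token counts like this right before training is a cheap way to confirm that the corpus has the expected size.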
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
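Logging can be switched on with the standard library before the model is constructed. This is a minimal sketch; the format string is just one common choice:

```python
import logging

# force=True (Python 3.8+) replaces any handlers configured earlier in the session.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)

logging.info("logging enabled; train the model after this point")
```

Gensim reports through the standard logging module, so its INFO messages about vocabulary size and training progress should then appear while Word2Vec runs.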
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already causes very frequent words like stopwords to be skipped much of the time.
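To give a feel for how strongly the sample parameter thins out frequent words, here is a sketch of the keep-probability formula as I understand it from the reference word2vec implementation. Treat the exact formula as an assumption; `keep_probability` is a hypothetical helper, not a gensim function:

```python
import math


def keep_probability(word_fraction, sample=1e-3):
    """Chance that one occurrence of a word is kept, given its share of the corpus."""
    if sample <= 0 or word_fraction <= 0:
        return 1.0  # downsampling disabled, or word absent
    ratio = word_fraction / sample
    # Words rarer than the threshold get a value above 1, which is capped at 1.
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)


# A stopword-like word making up 5% of the corpus versus a rarer word at 0.05%.
frequent = keep_probability(0.05)
rare = keep_probability(0.0005)
print(round(frequent, 3), rare)
```

Under this formula a word that makes up 5% of the corpus keeps only about one occurrence in six, while words below the threshold are never dropped.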
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW is better on smaller datasets. The skip-gram model in the official paper was trained on 1 billion words.
Use min_count=5, which is the value used in the paper, and they had 1 billion words. I think 30 is way too much for your data.
Don't remove the stop words, as that changes which words are neighbors inside the moving window.
Use more iterations, for example iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether that is applicable to your dataset.
When following these steps, your code should be:
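>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model (gensim 3.x names; gensim 4 renamed size to vector_size and iter to epochs)
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)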
I want to compare two sentences. As an example,
sentence1 = "football is good,cricket is bad"
sentence2 = "cricket is good,football is bad"
Generally, these sentences have no relationship, that is, they have different meanings. But when I compare them with Python NLTK tools, I get 100% similarity. How can I fix this issue? I need help.
Yes, wup_similarity internally uses the synsets of single tokens to calculate similarity:
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
Since the ancestor nodes for cricket and football would be the same, wup_similarity will return 1.
If you want to fix this issue, using wup_similarity is not a good choice.
The simplest token-based way would be fitting a vectorizer and then calculating similarity.
E.g.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(corpus)
x1 = vectorizer.transform(["football is good,cricket is bad"])
x2 = vectorizer.transform(["cricket is good,football is bad"])
cosine_similarity(x1, x2)
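To see why the n-gram range matters, the same idea can be sketched with the standard library only. This is a hand-rolled approximation of what the vectorizer counts, not scikit-learn's actual code; `ngram_counts` and `cosine` are hypothetical helpers:

```python
import math
import re
from collections import Counter


def ngram_counts(text, max_n):
    """Count all word n-grams of length 1..max_n (lower-cased tokens of 2+ characters)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


s1 = "football is good,cricket is bad"
s2 = "cricket is good,football is bad"

# Single words only: both sentences contain exactly the same words.
unigram_similarity = cosine(ngram_counts(s1, 1), ngram_counts(s2, 1))
# Words, pairs and triples: word order now affects the score.
trigram_similarity = cosine(ngram_counts(s1, 3), ngram_counts(s2, 3))
print(unigram_similarity, trigram_similarity)
```

With single words only, the two sentences are indistinguishable and the similarity is 1. Including bigrams and trigrams brings it down to about 0.71 in this sketch, because phrases such as "football is good" and "cricket is good" no longer match.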
There are more intelligent methods to measure semantic similarity, though. One that can be tried easily is Google's Universal Sentence Encoder (USE).
See this link
Semantic Similarity is a bit tricky this way, since even if you use context counts (which would be n-grams > 5) you cannot cope with antonyms (e.g. black and white) well enough. Before using different methods, you could try using a shallow parser or dependency parser for extracting subject-verb or subject-verb-object relations (e.g. ), which you can use as dimensions. If this does not give you the expected similarity (or values adequate for your application), use word embeddings trained on really large data.
I want to build a model that can classify news into specific categories. As I imagine it, I would put all the selected training articles into their labelled categories, then use word2vec for training to generate a model. I wonder, is that possible?
I have tried a small example to build a vocabulary in gensim, but it keeps telling me that the word doesn't exist in the vocabulary. I'm so confused.
from collections import Counter
from gensim.models import Word2Vec
randomTxt = 'loop is good. loop infinity is not good. they are good at some point.'
x = randomTxt.split()  # This finds the words in the document
a = Counter(x)
print(x)
w1 = 'so'
model1 = Word2Vec(randomTxt, min_count=0)
print(model1.wv['loop'])
I wonder if anyone has an idea of how to build this from the initial dataset and can help me with this? Or maybe point me to some good documentation.
I have read these docs: https://radimrehurek.com/gensim/models/word2vec.html
but when I follow them as above, it keeps telling me that 'loop' doesn't exist in the vocabulary that word2vec built.