Word2Vec Vocab Similarities

I ran a word2vec algorithm on a text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all very close to 1. Even the tenth-closest score is still around .998, so I'm not getting any meaningful differences in similarity between words, which makes the list of similar words meaningless.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.

Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
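The checks suggested above can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual pipeline: tokenized_sentences and stop_words are hypothetical stand-ins for the output of the nltk tokenizers and the nltk stop-word list, so that the snippet runs without any downloads. The key difference from the question's code is that the filter is applied to every sentence, not only to all_words[0].

```python
import logging

# INFO-level logging lets gensim report vocabulary size and training progress.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

# Hypothetical stand-ins for nltk.sent_tokenize / nltk.word_tokenize output
# and for nltk.corpus.stopwords.words("english").
tokenized_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "barked"],
]
stop_words = {"the", "a", "on"}


def strip_stop_words(sentences, stop_words):
    """Remove stop words from every sentence, keeping the list-of-lists shape."""
    return [[w for w in sent if w.lower() not in stop_words] for sent in sentences]


all_words = strip_stop_words(tokenized_sentences, stop_words)

# Sanity check before training: number of sentences and total tokens.
print(len(all_words), sum(len(sent) for sent in all_words))
```

If the two printed numbers are far smaller than expected (for example, a single sentence), the corpus was truncated before it ever reached Word2Vec.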

Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW is better on smaller datasets. The skip-gram model in the original paper was trained on over 1 billion words.
Use min_count=5, which is the value they used in the paper, and they had 1 billion words. I think 30 is way too high for your data.
Don't remove the stop words, as doing so changes the neighboring words in the moving window.
Use more iterations, for example iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether that is applicable to your dataset.
Following these steps, your code should be:
>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model (gensim 4+ renamed size to vector_size and iter to epochs)
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

Related

Emotional score of sentences using Spacy

I have a series of 100,000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adapted from spaCy 101):
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")  # assumed model; the original snippet did not show how nlp was created
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
    doc.sentiment += 0.1
myemotionalwordlist = ['you','superb','great','free']
sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made of minerals'
sentences = [sentence0,sentence1,sentence2]
pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
matcher.add("Emotional", set_sentiment, *pattern2) # Match one or more emotional word
for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have suggestions for a better way of doing this?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) What's the best way of scaling this up? I am thinking about adding all sentences to a pandas data frame and then applying the match function to each one of them.
Thanks in advance!

There are going to be two main approaches:
The one you have started: keep a list of emotional words and count how often they appear.
Show a machine learning model examples of what you consider emotional and unemotional sentences, and let it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
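To make that limit concrete, here is a minimal sketch of the word-list approach in plain Python. The tiny lexicon and the function name emotional_score are assumptions for illustration only (the real list has about 200 words), and the score is the fraction of tokens found in the lexicon rather than the fixed 0.1 increment used in the question.

```python
import re

# Hypothetical mini-lexicon; the real list is much longer.
EMOTIONAL_WORDS = {"you", "superb", "great", "free"}


def emotional_score(sentence, lexicon=EMOTIONAL_WORDS):
    """Fraction of tokens in the sentence that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


sentences = [
    "You are a superb great free person",
    "You are a great person",
    "Rocks are made of minerals",
    "You must submit the form by Friday",  # unemotional, but contains "you"
]

# Rank from most to least emotional according to the lexicon.
ranked = sorted(sentences, key=emotional_score, reverse=True)
for sentence in ranked:
    print(round(emotional_score(sentence), 2), sentence)
```

The last sentence gets a non-zero score purely because it contains "you", which is exactly the kind of false positive a word list cannot avoid.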
any suggestions on how I can extract emotional words from wordnet?
Take a look at SentiWordNet, which adds a measure of positivity, negativity or neutrality to each WordNet entry. For "emotional" you could extract just those entries that have a pos or neg score over, e.g., 0.5. (Watch out for the non-commercial-only licence.)
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are that the models often need much more compute power and memory (a serious issue if you need to be offline, or working on a mobile device), and that they are a black box.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
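The annotate, evaluate, annotate-again loop can be sketched with a much smaller model. To be clear, this swaps the pre-trained BERT model for a TF-IDF plus logistic regression pipeline from scikit-learn, purely so the example is short and runs anywhere; all sentences and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-annotated training sample: 1 = emotional, 0 = unemotional.
train_texts = [
    "You are a superb person",
    "What a great, wonderful day",
    "Rocks are made of minerals",
    "The meeting starts at noon",
]
train_labels = [1, 1, 0, 0]

# Hypothetical second annotated sample used only for evaluation.
eval_texts = ["I love this so much", "Water boils at 100 degrees", "You are great"]
eval_labels = [1, 0, 1]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Find the evaluation sentences the model got wrong.
predictions = model.predict(eval_texts)
wrong = [
    (text, label)
    for text, label, predicted in zip(eval_texts, eval_labels, predictions)
    if predicted != label
]

# Fold the mistakes back into the training data and retrain (one round of the loop).
train_texts += [text for text, _ in wrong]
train_labels += [label for _, label in wrong]
model.fit(train_texts, train_labels)
print(len(wrong), "sentences added to the training set")
```

In practice you would draw a fresh evaluation sample each round, since sentences moved into the training set can no longer be used to measure accuracy.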
(spaCy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning. See https://spacy.io/usage/training#transfer-learning and also try googling for "spacy sentiment analysis", which will find quite a few tutorials.)

Extracting Key-Phrases from text based on the Topic with Python

I have a large dataset with 3 columns: text, phrase and topic.
I want to find a way to extract key-phrases (the phrase column) based on the topic.
A key-phrase can be part of the text value or the whole text value.
import pandas as pd
text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing slam dunks from the best players",
        "he deserved yellow-card for this foul",
        "free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]
df = pd.DataFrame({"text":text,
                   "phrase":phrase,
                   "topic":topic})
print(df.text)
print(df.phrase)
I'm having a lot of trouble finding a way to do something like this, because I have more than 50000 rows in my dataset, around 48000 unique phrase values, and 3 different topics.
I guess that building a dataset with all football, basketball and tennis phrases is not really the best solution. So I was thinking about making some kind of ML model for this, but that means I would have 2 features (text and topic) and one result (phrase), with more than 48000 different classes in the result, and that is not a good approach.
I was thinking about using the text column as a feature and applying a classification model in order to find the sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I get only 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy if I'm using TextBlob for sentiment analysis.
Any help?

It looks like a good approach here would be to use a Latent Dirichlet Allocation (LDA) model, which is an example of what are known as topic models.
LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through one possible approach, training a model on the sentences in the text column. If the phrases are representative enough and contain the information the model needs to capture, they could also be a good (possibly better) candidate for training the model, but you are in a better position to judge that.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords and lemmatizing. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer
ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()  # despite the name, this lemmatizes rather than stems
text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))
Now we have a more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram_range parameter to specify the n-gram range you want to consider when training the model. By setting ngram_range=(1,2), for instance, you'd end up with features containing all individual words as well as the 2-grams in each sentence. Here's an example after fitting CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
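As a small self-contained illustration of the n-gram point, the sketch below fits CountVectorizer on two of the preprocessed sentences and lists the 2-gram features. It reads the features from vocabulary_ rather than get_feature_names(), because recent scikit-learn releases replaced that method with get_feature_names_out(); the two-sentence corpus is only an excerpt chosen for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two of the preprocessed sentences from above.
docs = ["amazing slam dunk best player", "best player three-point line"]

vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X = vec.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index in X.
features = sorted(vec.vocabulary_)
bigrams = [feature for feature in features if " " in feature]

print(X.shape)
print(bigrams)
```

Multi-word candidates such as "slam dunk" now exist as features of their own, so a topic model trained on X can rank them alongside single words.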
Then we can train the LDA with whatever number of topics you want. In this case I'll just select 3 topics (note that this has nothing to do with the topic column); you can consider the top words of each topic to be the Key-Phrases, or words in this case, that you mention. Here I'll be using lda, though there are several other options such as gensim.
Each topic is associated with a set of words from the vocabulary it was trained on, and each word has a score measuring its relevance to that topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain these scores associated to each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:
topic_word = model.topic_word_
vocab = vec.get_feature_names()
n_top_words = 3
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really represent much in this case, since the model has only been trained on the sample from the question. However, you should see clearer and more meaningful topics when training on your entire corpus.
Also note that for this example I've used the whole vocabulary to train the model. In your case, it seems that it would make more sense to split the text column into groups according to the topics you already have, and train a separate model on each group. But hopefully this gives you a good idea of how to proceed.
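A rough sketch of the per-group idea, under a simplifying assumption: instead of fitting an LDA model on each group, it ranks n-grams by raw frequency within each topic, which is only a stand-in to show the grouping mechanics. The small DataFrame and the helper name top_ngrams are made up for the example; an lda.LDA model could be fitted on each group's count matrix in the same place.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical excerpt of the dataset, with the existing topic labels.
df = pd.DataFrame({
    "text": [
        "great game with a lot of amazing goals from both teams",
        "he deserved yellow-card for this foul",
        "he won all four grand slam championships",
        "amazing slam dunks from the best players",
    ],
    "topic": ["football", "football", "tennis", "basketball"],
})


def top_ngrams(texts, n_top=3):
    """Return the n_top most frequent 1- and 2-grams in a group of texts."""
    vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
    counts = np.asarray(vec.fit_transform(texts).sum(axis=0)).ravel()
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(vec.vocabulary_.items(), key=lambda kv: (-counts[kv[1]], kv[0]))
    return [term for term, _ in ranked[:n_top]]


# One vocabulary and one ranking per existing topic.
top_by_topic = {
    topic: top_ngrams(group["text"]) for topic, group in df.groupby("topic")
}
print(top_by_topic)
```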

It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which represents each entry by counting the frequency of each word in your vocabulary.
TF-IDF, which does the same as above but makes words that appear in more entries less important.
n-grams (bigrams, trigrams), which essentially apply the bag-of-words method but also maintain some context around each word. So you'll have encodings for each word, but you'll also have tokens for "great_game", "game_with" and "great_game_with" etc.
Orthogonal Sparse Bigrams (OSBs), which also create features from words that are further apart, like "great__with".
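Since OSBs are the least familiar item in that list, here is a minimal sketch of how such features could be generated. The function name, the max_gap parameter and the underscore notation (one extra underscore per skipped word, following the "great__with" example) are assumptions for illustration; existing OSB implementations encode the gap differently.

```python
def orthogonal_sparse_bigrams(tokens, max_gap=2):
    """Pair each token with the tokens that follow it, up to max_gap words skipped.

    The number of underscores minus one is the number of skipped words,
    so "great_game" is adjacent and "great__with" skips one word.
    """
    features = []
    for i, left in enumerate(tokens):
        for gap in range(max_gap + 1):
            j = i + gap + 1
            if j >= len(tokens):
                break
            features.append(left + "_" * (gap + 1) + tokens[j])
    return features


tokens = "great game with a lot".split()
print(orthogonal_sparse_bigrams(tokens, max_gap=1))
# ['great_game', 'great__with', 'game_with', 'game__a', 'with_a', 'with__lot', 'a_lot']
```

The resulting strings can be counted like any other token, for example by joining them with spaces and passing them to a bag-of-words vectorizer.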
Any of these options could be ideal for your dataset (the last two are likely your best bet). If none of these options work, there are a few more options you could try:
First you could use word embeddings. These are vector representations of each word that unlike one-hot-encoding intrinsically contain word meaning. You can sum the words in a sentence together to get a new vector containing the general idea of what the sentence is about which can then be decoded.
You can also use word embeddings alongside a Bidirectional LSTM. This is the most computationally intensive option but if your other options are not working this might be a good choice. biLSTMs try to interpret sentences by looking at the context around words to try to understand what the word might mean in that context.
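The first of these two options can be sketched with NumPy. The 3-dimensional vectors below are invented for the example (real embeddings would come from word2vec or GloVe and have hundreds of dimensions), and the sketch averages rather than sums the word vectors, which gives the same direction but keeps the scale independent of sentence length.

```python
import numpy as np

# Hypothetical toy embeddings; real ones would be loaded from a trained model.
embeddings = {
    "goal": np.array([0.9, 0.1, 0.0]),
    "team": np.array([0.8, 0.2, 0.1]),
    "tennis": np.array([0.1, 0.9, 0.0]),
    "serve": np.array([0.0, 0.8, 0.2]),
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


football = sentence_vector(["great", "goal", "team"], embeddings)
racket = sentence_vector(["tennis", "serve"], embeddings)
query = sentence_vector(["goal"], embeddings)

# The query should be closer to the football sentence than to the tennis one.
print(cosine(query, football) > cosine(query, racket))
```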
Hope this helps

I think what you're looking for is called "topic modeling" in NLP.
You should try using LDA for topic modeling. It's one of the easiest methods to apply.
Also, as @Mike mentioned, there are many approaches to converting words to vectors. You should first try simple approaches like a count vectorizer and then gradually move to something like word2vec or GloVe.
I am attaching some links on applying LDA to a corpus.
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

How to embed user names in word2vec model in gensim

I have some volunteer essay writings in the format of:
volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...
I want to identify similar users based on their essay writings. I feel that word2vec is the most suitable approach for problems like this. However, since I want to embed user names in the model too, I am not sure how to do it. The examples I found on the internet only use the words (see example code).
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
In that case, I am wondering whether there is a special way of doing this in word2vec, or whether I can simply treat user names as just more words to input to the model. Please let me know your thoughts on this.
I am happy to provide more details if needed.

Word2vec infers word representations from surrounding words: words that often appear in similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack Word2vec, you would need to make sure that the student names appear frequently enough (perhaps at the beginning and at the end of each sentence or something like that).
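That hack could look something like the sketch below, which only prepares the training sentences. The USER_ prefix and the function name add_name_tokens are assumptions: the prefix is there so that a name can never collide with an ordinary word in the vocabulary.

```python
# Records in the (volunteer_names, essay) format from the question.
records = [
    (["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
]


def add_name_tokens(records, prefix="USER_"):
    """Wrap every sentence in the name tokens of the essay's authors."""
    sentences = []
    for names, essay in records:
        name_tokens = [prefix + name for name in names]
        for sentence in essay:
            # Names at both ends keep them inside the context window of most words.
            sentences.append(name_tokens + sentence + name_tokens)
    return sentences


training_sentences = add_name_tokens(records)
for sentence in training_sentences:
    print(sentence)
```

The resulting list can be passed to gensim.models.Word2Vec(training_sentences, min_count=1) exactly like the example in the question, after which the USER_ tokens have vectors that can be compared with each other.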
Alternatively, you can have a look at Doc2vec. During training, each document gets an ID and learns an embedding for the ID, they are in a lookup table as if they were word embeddings. If you use student names as document IDs, you would get student embeddings. If you have multiple essays from one student, I suppose you would need to hack Gensim a little bit not to have a unique ID for each essay.

word2Vec and abbreviations

I am working on a text classification task where my dataset contains a lot of abbreviations and proper nouns. For instance: Milka choc. bar.
My idea is to use a bidirectional LSTM model with word2vec embeddings.
Here is my problem: how do I encode words that do not appear in the dictionary?
I partially solved this problem by merging pre-trained vectors with randomly initialized ones. Here is my implementation:
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('ru.vec', binary=False, unicode_errors='ignore')
EMBEDDING_DIM=300
vocabulary_size=min(len(word_index)+1,num_words)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))
for word, i in word_index.items():
    if i>=num_words:
        continue
    try:
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # word not in the pre-trained vocabulary: fall back to a random vector
        embedding_matrix[i]=np.random.normal(0,np.sqrt(0.25),EMBEDDING_DIM)
def LSTMModel(X,words_nb, embed_dim, num_classes):
    _input = Input(shape=(X.shape[1],))
    X = embedding_layer = Embedding(words_nb,
                                    embed_dim,
                                    weights=[embedding_matrix],
                                    trainable=True)(_input)
    X = The_rest_of__the_LSTM_model()(X)
Do you think that allowing the model to adjust the embedding weights is a good idea?
Could you please tell me how I can encode words like choc? Obviously, this abbreviation stands for chocolate.

It is often not a good idea to adjust word2vec embeddings if your training corpus is not sufficiently large. To clarify, take an example where your corpus contains television but not TV. Even though both might have word2vec embeddings, after training only television will be adjusted and not TV, so you disrupt the information from word2vec.
To solve this problem you have 3 options:
You let the LSTM in the upper layer figure out what the word might mean based on its context. For example, given "I like choc." the LSTM can figure out that it is an object. This was demonstrated by Memory Networks.
The easy option: pre-process and canonicalise as much as you can before passing the text to the model. Spell checkers often capture these very well and are really fast.
You can use character encoding alongside word2vec. This is employed in many question answering models such as BiDAF, where the character representation is merged with word2vec so you have some information relating characters to words. In this case, choc might be similar to chocolate.
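To give some intuition for why a character-level view places choc near chocolate, here is a small sketch based on character trigram overlap. This is not how BiDAF builds its character representation (it uses a learned character-level network); the padded trigrams and the Jaccard score are just an assumed, simple way to show the shared substructure.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of the word, padded with boundary markers."""
    padded = "<" + word + ">"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


choc = char_ngrams("choc")
chocolate = char_ngrams("chocolate")
bar = char_ngrams("bar")

shared = choc & chocolate
print(sorted(shared))
print(jaccard(choc, chocolate), jaccard(choc, bar))
```

The abbreviation shares its leading trigrams with the full word and nothing with an unrelated word, which is the signal a character-aware model can pick up.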

One way to do this would be to add a function that maps your abbreviations to the existing vectors that are most likely to be related, i.e. initialize the choc vector to the chocolate vector in w2v. To find candidates, compare the abbreviation with the first letters of each word:
word_in_your_embedding_matrix[:len(abbreviated_word)]
There are two possible cases:
There is only one candidate that starts with the same n letters as your abbreviation. Then you can initialize your abbreviation embedding with that vector.
There are multiple items that start with the same n letters as your abbreviation. Then you can use their average as your initialization vector.
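Both cases can be handled by one small function, sketched below with NumPy. The toy 2-dimensional word_vectors dictionary stands in for the pre-trained KeyedVectors, and the function name init_abbreviation is hypothetical. When no candidate exists it falls back to the random initialization used in the question.

```python
import numpy as np

# Hypothetical pre-trained vectors; in the question they come from KeyedVectors.
word_vectors = {
    "chocolate": np.array([1.0, 0.0]),
    "chocolates": np.array([0.8, 0.2]),
    "bar": np.array([0.0, 1.0]),
}


def init_abbreviation(abbrev, word_vectors, dim=2, rng=None):
    """Average the vectors of all known words that start with the abbreviation."""
    prefix = abbrev.rstrip(".").lower()
    candidates = [
        vector
        for word, vector in word_vectors.items()
        if word.startswith(prefix) and word != prefix
    ]
    if candidates:
        # One candidate: its own vector. Several candidates: their mean.
        return np.mean(candidates, axis=0)
    # No candidate: random vector, as in the question's fallback.
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0, np.sqrt(0.25), dim)


choc_vector = init_abbreviation("choc.", word_vectors)
print(choc_vector)
```

Very short abbreviations will match many unrelated words, so in practice it may be worth requiring a minimum prefix length before averaging.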

python: How to use POS (part of speech) features in scikit learn classifiers (SVM) etc

I want to use the part-of-speech (POS) tags returned from nltk.pos_tag in an sklearn classifier. How can I convert them to a vector and use it?
e.g.
import nltk
sent = "This is POS example"
tok=nltk.tokenize.word_tokenize(sent)
pos=nltk.pos_tag(tok)
print (pos)
This returns following
[('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]
Now I am unable to apply any of the vectorizers (DictVectorizer, FeatureHasher or CountVectorizer from scikit-learn) to use this in a classifier.
Please suggest an approach.

If I'm understanding you right, this is a bit tricky. Once you tag it, your sentence (or document, or whatever) is no longer composed of words, but of pairs (word + tag), and it's not clear how to make the most useful vector-of-scalars out of that.
Most text vectorizers do something like counting how many times each vocabulary item occurs, and then making a feature for each one:
the: 4, player: 1, bats: 1, well: 2, today: 3,...
The next document might have:
the: 0, quick:5, flying:3, bats:1, caught:1, bugs:2
Both can be stored as arrays of integers so long as you always put the same key in the same array element (you'll have a lot of zeros for most documents) -- or as a dict. So a vectorizer does that for many "documents", and then works on that.
So your question boils down to how to turn a list of pairs into a flat list of items that the vectorizers can count.
The most trivial way is to flatten your data to
('This', 'POS_DT', 'is', 'POS_VBZ', 'POS', 'POS_NNP', 'example', 'POS_NN')
The usual counting would then get a vector of 8 vocabulary items, each occurring once. I renamed the tags to make sure they can't get confused with words.
That would get you up and running, but it probably wouldn't accomplish much. That's because just knowing how many occurrences of each part of speech there are in a sample may not tell you what you need -- notice that any notion of which parts of speech go with which words is gone after the vectorizer does its counting.
Running a classifier on that may have some value if you're trying to distinguish something like style -- fiction may have more adjectives, lab reports may have fewer proper names (maybe), and so on.
Instead, you could change your data to
('This_DT', 'is_VBZ', 'POS_NNP', 'example_NN')
That keeps each tag "tied" to the word it belongs with, so now the vectors will be able to distinguish samples where "bat" is used as a verb from samples where it's only used as a noun. That would tell you slightly different things -- for example, "bat" as a verb is more likely in texts about baseball than in texts about zoos.
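Both arrangements can be sketched in a few lines. The tagged documents are hard-coded here (the first is the output shown in the question, the second is invented) so that the snippet runs without nltk data; the helper names flat_tokens and joined_tokens are made up. Passing a callable as analyzer tells CountVectorizer to treat each document as a ready-made token list instead of a string.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Output of nltk.pos_tag for two documents (second one is a made-up example).
tagged_docs = [
    [("This", "DT"), ("is", "VBZ"), ("POS", "NNP"), ("example", "NN")],
    [("Bats", "NNS"), ("fly", "VBP")],
]


def flat_tokens(tagged):
    """Words and tags as separate tokens; tags are prefixed to avoid clashes."""
    tokens = []
    for word, tag in tagged:
        tokens.extend([word, "POS_" + tag])
    return tokens


def joined_tokens(tagged):
    """Each word glued to its own tag, e.g. 'This_DT'."""
    return [word + "_" + tag for word, tag in tagged]


# The callable analyzer receives each tagged document and returns its tokens.
vec = CountVectorizer(analyzer=joined_tokens)
X = vec.fit_transform(tagged_docs)

print(flat_tokens(tagged_docs[0]))
print(sorted(vec.vocabulary_))
print(X.shape)
```

X is an ordinary sparse count matrix, so it can be fed to an SVM or any other scikit-learn classifier; swapping in analyzer=flat_tokens gives the first arrangement instead.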
And there are many other arrangements you could do.
To get good results from using vector methods on natural language text, you will likely need to put a lot of thought (and testing) into just what features you want the vectorizer to generate and use. It depends heavily on what you're trying to accomplish in the end.
Hope that helps.

I know this is a bit late, but gonna add an answer here.
Depending on what features you want, you'll need to encode the POST in a way that makes sense. I've had the best results with SVM classification using ngrams when I glue the original sentence to the POST sentence so that it looks like the following:
word1 word2 word3 ... wordn POST1 POST2 POST3... POSTn
Once this is done, I feed it into a standard ngram or whatever else and feed that into the SVM.
This method keeps the information of the individual words, but also keeps the vital information of POST patterns when you give your system a word it hasn't seen before but that the tagger has encountered before.

What about merging each word and its tag like 'word/tag'? Then you can feed your new corpus to a vectorizer that counts the words (TF-IDF or bag of words) and makes a feature for each one:
import nltk
wpt = nltk.WordPunctTokenizer()
text = wpt.tokenize('Someone should have this ring to a volcano')
text_tagged = nltk.pos_tag(text)
new_text = []
for word in text_tagged:
    new_text.append(word[0] + "/" + word[1])
doc = ' '.join(new_text)
The output for this is:
Someone/NN should/MD have/VB this/DT ring/NN to/TO a/DT volcano/NN

I think a better method would be to:
Step-1: Create word/sentence embeddings for each text/sentence.
Step-2: Calculate the POS-tags. Feed the POS-tags to an embedder as in Step-1.
Step-3: Elementwise multiply the two vectors. (This is to ensure that the word embeddings in each sentence are weighted by the POS-tags associated with them.)
Thanks
