Documentation topic classification using word2vec - python

I want to build a model that can classify news articles into specific categories. My idea is to put the selected training articles into labelled categories, then use word2vec for training and generate a model. Is that possible?
I have tried a small example to build a vocabulary in gensim, but it keeps telling me that the word doesn't exist in the vocabulary. I'm confused.
from collections import Counter
from gensim.models import Word2Vec

randomTxt = 'loop is good. loop infinity is not good. they are good at some point.'
x = randomTxt.split()  # this finds the words in the document
a = Counter(x)
print(x)
w1 = 'so'
model1 = Word2Vec(randomTxt, min_count=0)
print(model1.wv['loop'])
I wonder if anyone has an idea of how to build this from a starting dataset, or can point me to some good documentation.
I have read these docs: https://radimrehurek.com/gensim/models/word2vec.html
but when I follow them as above, it keeps telling me that 'loop' doesn't exist in the vocabulary word2vec builds.
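For what it's worth, a likely cause of that error is that gensim's Word2Vec expects an iterable of tokenized sentences (a list of lists of tokens); passing a raw string makes it iterate over individual characters, so 'loop' never makes it into the vocabulary. A minimal sketch of the intended usage on the same toy text (my own addition, not part of the original question):
from gensim.models import Word2Vec

randomTxt = 'loop is good. loop infinity is not good. they are good at some point.'
# build a list of token lists, one per sentence
sentences = [s.split() for s in randomTxt.split('.') if s.strip()]
model1 = Word2Vec(sentences, min_count=0)
print(model1.wv['loop'])  # 'loop' is now in the vocabulary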

Related

Failing to Understand Doc2Vec Output

So I started down the path of attempting to learn Doc2Vec, specifically the cosine similarity output. Basically, I am getting an unexpected output when attempting to match a new sentence to the list of sentences I trained my model on. If anyone could help, that would be amazing, here's my code:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize
data = [
'I love machine learning'
,'I love coding in python'
,'I love building chatbots'
,'they chat amazingly well'
,'dog poops in my yard'
,'this is a stupid exercise'
,'I like math and statistics'
,'cox communications is a dumb face'
,'Machine learning in python is difficult'
]
tagged_data = [TaggedDocument(words = word_tokenize(d.lower()), tags = [str(i)]) for i, d in enumerate(data)]
max_epochs = 15
vec_size = 10
wndw = 2
alpha_num = 0.025
model = Doc2Vec(vector_size=vec_size,
                window=wndw,
                alpha=alpha_num,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, workers = 4, epochs = 100)
new_sent = 'machine learning in python is easy'.split(' ')
model.docvecs.most_similar(positive = [model.infer_vector(new_sent)])
The output I receive is this (and it's also random each time I run, so I'm not sure about that either):
[('2', 0.4818369746208191),
('5', 0.4623863697052002),
('3', 0.4057881236076355),
('4', 0.3984462022781372),
('8', 0.2882154583930969),
('7', 0.27972114086151123),
('6', 0.23783418536186218),
('0', 0.11647315323352814),
('1', -0.12095103412866592)]
Meaning the model is stating that 'I love coding in python' is the most similar to 'machine learning in python is easy', when I would expect 'Machine learning in python is difficult' to be the most similar. At least that's how I'm interpreting it.
You may be misunderstanding the output. The numbers are the indices of the training vectors. Therefore, it is most similar to the vector at index 2, i.e. 'I love building chatbots', and least similar to the vector at index 1, i.e. 'I love coding in python'.
That being said, don't create two models, one for creating the vectors and one for testing. Only the model you create the vectors with understands the vectors, the other one doesn't.
The wacky results are probably because there isn't enough data for the model to develop useful word embeddings. The randomness is likely because a different random seed is used each run when initializing the vectors. Try setting the seed if there is a way to do so.
Doc2Vec & similar algorithms don't work meaningfully on toy-sized datasets like this. The barest minimum to demo will be something with hundreds of texts, & tens-of-thousands of training words - and for such a (still very small) dataset, you'd again want to reduce the default vector_size to something small like your 10-20 values, rather than the default 100.
So first & foremost, test on a larger dataset. The original paper, and most other non-trivial demos, will use sets of texts in the tens-of-thousands, each at least a dozen and ideally many dozens or hundreds of words long.
Second, your current code is creating an initial Doc2Vec instance, then calling .build_vocab() on it, then... throwing that model away, and creating an all-new model in your second assignment into the model variable. You only need to create one, and it should have just the parameters you really want - not the mix of different parameters in your current code.
Turning on logging at the INFO level will provide output that will help you understand what steps are occurring - and as you learn to read the output, you may see confirmations of good progress, or indicators of problems.
Finally, min_count=1 is almost always a bad idea - these algorithms need multiple examples of a word's use for it not to be noise in the training, and it's usually better to discard singleton (& very-rare) words so the others become better. (Once you're using a larger dataset, losing words that only appear once or a few times shouldn't be a big concern.)
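Putting that advice together, here is a minimal sketch (my own addition, reusing tagged_data, vec_size, wndw and max_epochs from the question) of a single-model setup with INFO-level logging enabled; min_count=2 is only an illustrative value, and on a realistically sized corpus you'd likely raise it further:
import logging
from gensim.models.doc2vec import Doc2Vec

# show gensim's vocabulary-building and training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# one model only: passing the corpus here builds the vocabulary and trains in a single call
model = Doc2Vec(tagged_data, vector_size=vec_size, window=wndw,
                min_count=2, epochs=max_epochs, dm=1, workers=4)

new_sent = 'machine learning in python is easy'.split()
print(model.docvecs.most_similar(positive=[model.infer_vector(new_sent)]))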

Extracting Key-Phrases from text based on the Topic with Python

I have a large dataset with 3 columns, columns are text, phrase and topic.
I want to find a way to extract key-phrases (phrases column) based on the topic.
Key-Phrase can be part of the text value or the whole text value.
import pandas as pd
text = ["great game with a lot of amazing goals from both teams",
"goalkeepers from both teams made misteke",
"he won all four grand slam championchips",
"the best player from three-point line",
"Novak Djokovic is the best player of all time",
"amazing slam dunks from the best players",
"he deserved yellow-card for this foul",
"free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]
df = pd.DataFrame({"text": text,
                   "phrase": phrase,
                   "topic": topic})
print(df.text)
print(df.phrase)
I'm having big trouble finding a way to do something like this, because I have more than 50,000 rows in my dataset, around 48,000 unique phrase values, and 3 different topics.
I guess that building a dataset with all football, basketball and tennis topics is not really the best solution. So I was thinking about making some kind of ML model for this, but again that means I would have 2 features (text and topic) and one result (phrase) - yet I would have more than 48,000 different classes in my result, and that is not a good approach.
I was thinking about using the text column as a feature and applying a classification model in order to find sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I get only 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy if I'm using TextBlob for sentiment analysis.
Any help?
It looks like a good approach here would be to use a Latent Dirichlet allocation model, which is an example of what are known as topic models.
An LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through what could be an approach to solve this by training a model using the sentences in the text column. If the phrases are representative enough and contain the information the model needs to capture, they could also be a good (possibly better) candidate for training the model, but that's something you can best judge yourself.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, lemmatizing and stemming. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import lda
from sklearn.feature_extraction.text import CountVectorizer
ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()
text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))
Now we have a more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram_range parameter to specify the n-gram range you want to consider to train the model. By setting ngram_range=(1,2) for instance you'd end up with features containing all individual words as well as 2-grams in each sentence; here's an example having trained CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
Then we can train the LDA with however many topics you want; in this case I'll just select 3 topics (note that this has nothing to do with the topic column), which you can consider to be the key-phrases - or words in this case - that you mention. Here I'll be using the lda package, though there are several options such as gensim.
Each topic will have an associated set of words from the vocabulary it has been trained on, with each word having a score measuring the relevance of the word in that topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain the scores associated with each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:
import numpy as np

topic_word = model.topic_word_
vocab = vec.get_feature_names()
n_top_words = 3

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really represent much in this case, since the model has been trained with the sample from the question; however, you should see clearer and more meaningful topics when training on your entire corpus.
Also note that for this example I've used the whole vocabulary to train the model. However, it seems that in your case what would make more sense is to split the text column into groups according to the different topics you already have and train a separate model on each group, as sketched below. But hopefully this gives you a good idea of how to proceed.
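As a rough sketch of that per-topic split (my own addition, not part of the original answer), assuming the same df, lda and CountVectorizer imports as above; for brevity it uses the raw text column, though in practice you'd apply the same preprocessing first:
import numpy as np

n_top_words = 3
for topic_name, group in df.groupby('topic'):
    vec_g = CountVectorizer(analyzer='word', ngram_range=(1, 2))
    X_g = vec_g.fit_transform(group['text'])
    model_g = lda.LDA(n_topics=2, random_state=1)  # n_topics here is just an illustrative guess
    model_g.fit(X_g)
    vocab_g = np.array(vec_g.get_feature_names())  # get_feature_names_out() in newer scikit-learn
    for i, dist in enumerate(model_g.topic_word_):
        top = vocab_g[np.argsort(dist)][:-(n_top_words + 1):-1]
        print('{} - topic {}: {}'.format(topic_name, i, ', '.join(top)))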
It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which classifies by counting the frequency of each word in your vocabulary.
TF-IDF: Does what's above but makes words that appear in more entries less important
n_grams / bigrams / trigrams which essentially does the bag of words method but also maintains some context around each word. So you'll have encodings for each word but you'll also have tokens for "great_game", "game_with" and "great_game_with" etc.
Orthogonal sparse bigrams (OSBs), which also create features from words that are further apart, like "great__with".
Any of these options could be ideal for your dataset (the last two are likely your best bet). If none of these options work, there are a few more you could try:
First, you could use word embeddings. These are vector representations of each word that, unlike one-hot encoding, intrinsically carry word meaning. You can sum (or average) the word vectors in a sentence to get a new vector capturing the general idea of what the sentence is about, which can then be compared or classified.
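As a hedged illustration of that sum/average idea (my own addition, not from the answer), assuming a gensim Word2Vec model (gensim >= 4 parameter names) trained on the tokenized text list from the question:
import numpy as np
from gensim.models import Word2Vec

tokenized = [t.lower().split() for t in text]  # `text` is the list of sentences from the question
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=50)

def sentence_vector(tokens, model):
    # average the vectors of in-vocabulary tokens; zero vector if none are known
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

sentence_vectors = np.array([sentence_vector(t, w2v) for t in tokenized])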
You can also use word embeddings alongside a Bidirectional LSTM. This is the most computationally intensive option but if your other options are not working this might be a good choice. biLSTMs try to interpret sentences by looking at the context around words to try to understand what the word might mean in that context.
Hope this helps
I think what you're looking for is called "topic modeling" in NLP.
You should try using LDA for topic modeling. It's one of the easiest methods to apply.
Also, as @Mike mentioned, there are many approaches to converting words to vectors. You should first try simple approaches like a count vectorizer and then gradually move to something like word2vec or GloVe.
I am attaching some links for applying LDA to a corpus:
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

Word2Vec Vocab Similarities

I ran a word2vec algo on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (for model.wv.most_similar method) are all super close to 1. The tenth closest score is still like .998, so I feel like I'm not getting any significant differences between the similarity of words which leads to meaningless similar words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
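For reference, a hedged sketch of what the corrected preprocessing might look like - filtering stop-words from every tokenized sentence rather than only the first, and turning on gensim's INFO-level logging as suggested (keeping in mind that, per the above, stop-word removal is optional anyway):
import logging
import nltk
from nltk.corpus import stopwords

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

stops = set(stopwords.words('english'))
all_sentences = nltk.sent_tokenize(v)  # `v` is the file contents, as in the question
all_words = [[w for w in nltk.word_tokenize(sent) if w.lower() not in stops]
             for sent in all_sentences]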
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW works better on smaller datasets; in the original paper the skip-gram model was trained on about 1 billion words.
Use min_count=5, which is the value used in the paper - and that was with 1 billion words. I think 30 is way too high for your data.
Don't remove the stop words as it will change the neighboring words in the moving window.
Use more iterations, for example iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether that's applicable to your dataset.
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

Determine most similar phrase with word2vec

I created a Python script for training a doc2vec model and inferring vectors for test documents.
My problem is that when I try to determine the most similar phrase, for example ("the world"), it only shows me a list of the most similar words. It doesn't show a list of the most similar phrases.
Am I missing something in my code?
#python example to infer document vectors from trained doc2vec model
import gensim.models as g
import codecs
#parameters
model="toy_data/model.bin"
test_docs="toy_data/test_docs.txt"
output_file="toy_data/test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]
#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()
m.most_similar('the word'.split())
I get this list :
[('refutations', 0.9990279078483582),
('volume', 0.9989271759986877),
('italic', 0.9988381266593933),
('syllogisms', 0.998751699924469),
('power', 0.9987285137176514),
('alibamu', 0.9985184669494629),
("''", 0.99847412109375),
('roman', 0.9984466433525085),
('soil', 0.9984269738197327),
('plants', 0.9984176754951477)]
The Doc2Vec model collects its doc-vectors for later lookup or search in a property .docvecs. To get doc-vector results, you would perform a most_similar() on that property. If your Doc2Vec instance is held in a variable d2v_model, and doc_id holds one of the known doc-tags from training, that might be:
d2v_model.docvecs.most_similar(doc_id)
If you were inferring a vector for a new document, and looking up training docs similar to that inferred vector, your code might be like:
new_dv = d2v_model.infer_vector('some new document'.split())
d2v_model.docvecs.most_similar(positive=[new_dv])
(The Doc2Vec model class is derived from the very-similar Word2Vec class, and thus inherits a most_similar() which by default consults just the internal word-vectors. Those word-vectors might be useful, in some Doc2Vec modes, or random – but it's best to use either d2v_model.wv.most_similar() or d2v_model.docvecs.most_similar() to be clear.)
Basic Doc2Vec examples, like the notebook installed with gensim in the docs/notebooks directory doc2vec-lee.ipynb, contain useful examples.

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using Python. I used CountVectorizer and it results in a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data is:
test_data = ['Love', 'python', 'every','time']
The problem is that when I convert the above test-data tokens into a feature vector, it differs in shape, because the model expects a vector of length 4063. I know how to solve it by taking the vocabulary of the CountVectorizer, searching for each token of the test data and putting it at that index. But is there an easier way to handle this problem in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your train features, like this:
cv = CountVectorizer(parameters desired)
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer, if you really want to (not advisable, since you would be wasting space and you'd still want to use the same parameters for your CV), and reuse the same features:
cv_train = CountVectorizer(parameters desired)
X_train = cv_train.fit_transform(train_data)
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names(),desired params)
X_test = cv_test.fit_transform(test_data)
Try using:
test_features = inverse_transform(test_data)
This should return what you are looking for.
I added .toarray() to the whole command in order to see the results as a matrix.
So you would write:
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
I'm mega late to this discussion, but I just want to leave something for people who come here from a search engine.
Sorry for my bad English.
;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can think of what a count vectorizer does as building a 2D array (or an Excel-like table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the unique words will probably (if not certainly) be different. When you then call model.predict on this data, it will report some sort of error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py / .ipynb file, you can use import pickle followed by the dump function for your fitted count vectorizer. You can follow the details in this post.
If you train your model in the same .py / .ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk
pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...
# When you want to use them:
import pickle as pk
vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export classes or other objects using pickle, but when you do so, make sure the class is already defined when you load the object. I'm not sure whether this matters in this case, but I still import PCA and CountVectorizer before calling pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.
