Periodically, when I run topic analyses on data and try to visualize the results with pyLDAvis, I get a validation error: "Not all rows (distributions) in doc_topic_dists sum to 1." Here's the basic code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pyLDAvis.sklearn

tfidf_vectorizer = TfidfVectorizer(max_df=.95, min_df=1, max_features=None, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lines2)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
nmf = NMF(n_components=3, random_state=None, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
panel = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer, mds='tsne')
The culprit is the last statement there (the 'panel =' line): evidently, the matrix produced by nmf.transform(tfidf) contains some rows that are all zeroes, so the attempt to normalize the rows (by centering them on the column means) returns NaNs. No combination of model parameters seems to fix this (in fact, changing them often seems to make the problem worse by producing more rows with NaNs).
FWIW, the data involved is the text of tweets from BBC Health, so the average response length is fairly short: a little under 4,000 records, each averaging 4.8 words. Nonetheless, I've verified that the zero-weighted responses all include words that made it into the model vocabulary, so I'm unsure why the problem occurs or how to fix it.
If there is no way to fix it, would it be sensible simply to substitute in the column means in these cases?
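For concreteness, this is the kind of substitution I have in mind, applied to the transformed matrix (which would then need to be passed to pyLDAvis.prepare directly rather than through the sklearn helper); I'm not sure whether it is statistically sound:

doc_topic = nmf.transform(tfidf)
zero_rows = doc_topic.sum(axis=1) == 0
print(zero_rows.sum(), "documents have an all-zero topic distribution")

# possible workaround: replace the all-zero rows with the column means of the
# non-zero rows, then renormalize so every row sums to 1
doc_topic[zero_rows] = doc_topic[~zero_rows].mean(axis=0)
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)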
I have a basic question about the logistic regression model and corpora vectorized with TF-IDF.
Suppose I have a corpus where the word "need" appears in 10 documents. The value of the "need" column in the TF-IDF matrix will then be lower than if the word appeared in only 5 documents.
The point is that, when fitting its parameters, the model will have to give a greater weight to the "need" feature to compensate for its low value in the input. In the end, a word with little relevance in the corpus gets a very high weight, precisely because of its low relevance.
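For reference, with scikit-learn's default smooth_idf=True the idf factor is ln((1 + n_docs) / (1 + doc_freq)) + 1, so a higher document frequency does give a lower value; a quick check (my own sketch, not part of the original code):

import numpy as np

n_docs = 20
for doc_freq in (5, 10):
    # scikit-learn's smoothed idf: ln((1 + n_docs) / (1 + doc_freq)) + 1
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    print(doc_freq, round(idf, 3))   # doc_freq=5 -> 2.253, doc_freq=10 -> 1.647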
To test this I used the code below. You'll see that if you add occurrences of "need" to the sentences passed to vectorizer.fit_transform, the value of the "need" column in the TF-IDF array goes down, and the final weight goes up.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(use_idf=True, stop_words=[])
vectorizer.fit_transform(["he need to get a car", "you need to get a car", "she need to get a car", "i need a beer", "please give me a beer"])
vector = vectorizer.transform(["he need to get a car", "you need to get a car", "she need to get a car", "i love you"])
df = pd.DataFrame(vector.todense().tolist(), columns=vectorizer.get_feature_names_out())

scikit_log_reg = LogisticRegression(verbose=1, solver='liblinear', random_state=0, C=5, penalty='l2', max_iter=1000)
model = scikit_log_reg.fit(vector, ["TRUE", "TRUE", "TRUE", "FALSE"])
weights_df = pd.DataFrame(model.coef_, columns=vectorizer.get_feature_names_out())

df
weights_df
Is my reasoning correct? Is it okay for it to work like this?
I am trying to train a word2vec model on a simple toy dataset of 4 sentences.
The Word2Vec setup that I need is:
Skip-gram model
no negative sampling
no hierarchical soft-max
no removal or down-scaling of frequent words
vector size of words is 2
Window size of 4, i.e., all the words in a sentence are considered context words of each other.
epochs can be varied from 1 to 500
The problem I am facing is: no matter how I change the above parameters, the word vectors are not being updated/learned. The word vectors for epochs=1 and epochs=500 are the same.
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
import nltk
# toy dataset with 4 sentences
sents = ['what is the time',
'what is the day',
'what time is the meeting',
'cancel the meeting']
sents = [nltk.word_tokenize(string) for string in sents]
# model initialization and training
model = Word2Vec(alpha=0.5, min_alpha =0.25, min_count = 0, size=2, window=4,
workers=1, sg = 1, hs = 0, negative = 0, sample=0, seed = 42)
model.build_vocab(sents)
model.train(sents, total_examples=4, epochs=500)
# getting word vectors into array
vocab = model.wv.vocab.keys()
vocab_vectors = model.wv[vocab]
print(vocab)
print(vocab_vectors)
#plotting word vectors
plt.scatter(vocab_vectors[:,0], vocab_vectors[:,1], c ="blue")
for i, word in enumerate(vocab):
    plt.annotate(word, (vocab_vectors[i, 0], vocab_vectors[i, 1]))
The output of print(vocab) is as below:
['what', 'is', 'time', 'cancel', 'the', 'meeting', 'day']
The output of print(vocab_vectors) is as below
[[ 0.08136337 -0.05059118]
[ 0.06549312 -0.22880174]
[-0.08925873 -0.124718 ]
[ 0.05645624 -0.03120007]
[ 0.15067646 -0.14344342]
[-0.12645201 0.06202405]
[-0.22905378 -0.01489289]]
The plotted 2D vectors
Why do I think the vectors are not being learned? I change the epochs value to 1, 10, 50, 500... and rerun the whole code to check the output for each run. For every one of those values, the output (vocab, vocab_vectors, and the plot) is the same.
By providing the parameters negative=0, hs=0, you've disabled both training modes, and no training is happening.
You should either leave the default non-zero negative value in place, or enable the non-default hierarchical-softmax mode while disabling negative-sampling (with hs=1, negative=0).
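For example, keeping the constraints from the question (skip-gram, size=2, window=4, no downsampling) but enabling hierarchical softmax, a sketch like the following, reusing the sents list from the question, should actually update the vectors (same pre-4.0 gensim API as the question):

from gensim.models import Word2Vec

# hs=1 enables hierarchical softmax; negative=0 keeps negative sampling off,
# so exactly one training mode is active and training actually happens
model = Word2Vec(size=2, window=4, min_count=0, sample=0,
                 sg=1, hs=1, negative=0, workers=1, seed=42)
model.build_vocab(sents)
model.train(sents, total_examples=model.corpus_count, epochs=500)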
Other thoughts:
Enabling logging at the INFO level is often helpful, and it might have shown progress output that better hinted to you that no real training was happening.
Still, with a tiny toy dataset, the biggest hint that all training was disabled (suspiciously instant completion of training) is nearly indistinguishable from a tiny amount of training. Generally, lots of things will be weird or disappointing with tiny datasets (and tiny vector sizes), as word2vec's usual benefits really depend on large amounts of text.
Lowering min_count is usually a bad idea with any realistic dataset, as word2vec needs multiple varied examples of a word's usage to train useful vectors – and it's usually better to ignore rare words than mix their incomplete info in.
Changing the default alpha/min_alpha is also usually a bad idea – though perhaps here you were just trying extreme values to trigger any change.
BACKGROUND
At the beginning of my project, the focus was to compare the requests/questions we received in terms of how their content differs. I trained a Doc2Vec model and the results were pretty good (for reference, my data included 14 million requests).
import multiprocessing
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.models.phrases import Phrases, Phraser

class PhrasingIterable():
    def __init__(self, my_phraser, texts):
        self.my_phraser = my_phraser
        self.texts = texts
    def __iter__(self):
        return iter(self.my_phraser[self.texts])

docs = DocumentIterator()  # custom corpus iterator, defined elsewhere
bigram_transformer = Phrases(docs, min_count=1, threshold=10)
bigram = Phraser(bigram_transformer)
corpus = PhrasingIterable(bigram, docs)
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(window=5,
                vector_size=300,
                min_count=10,
                workers=multiprocessing.cpu_count(),
                epochs=10,
                compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
However, in a second stage, the focus of analysis shifted from requests to individuals per week. To measure how individuals' requests differ from week to week, I extracted all words from requests in a given week t and compared them with all words from requests in the previous week t-1 using d2v_model.wv.n_similarity. Since I need to replicate this in other areas, it occurred to me that I was wasting too much memory and time training Doc2Vec models when I could use Word2Vec to get the same measure. Thus, I trained the following Word2Vec model:
docs = DocumentIterator()
bigram_transformer = gensim.models.Phrases(docs, min_count=1, threshold=10)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
sentences = PhrasingIterable(bigram, docs)
model = Word2Vec(window=5,
size=300,
min_count=10,
workers = multiprocessing.cpu_count(),
iter = 10,
compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
I again used cosine similarity to compare the content from week to week, via w2v_model.wv.n_similarity. As a sanity check, I compared the similarities generated by Word2Vec and Doc2Vec: the correlation coefficient between them is around 0.70, and the scales differ a lot. My implicit assumption was that comparing sets of extracted words using d2v_model.wv.n_similarity was taking advantage of the word vectors inside the trained Doc2Vec model.
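For context, my understanding is that wv.n_similarity in both models just takes the cosine similarity between the mean vectors of the two word sets, roughly equivalent to this sketch (ignoring out-of-vocabulary handling; the variable names are illustrative):

import numpy as np

def mean_vector_similarity(kv, words_a, words_b):
    # cosine similarity between the averaged vectors of two word lists
    a = np.mean([kv[w] for w in words_a], axis=0)
    b = np.mean([kv[w] for w in words_b], axis=0)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. mean_vector_similarity(w2v_model.wv, words_week_t, words_week_t_minus_1)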
MY QUESTION
Should cosine similarity measures between two sets of extracted words differ as we trade Doc2Vec for Word2Vec? If so, why? If not, any suggestions on my code?
I have two files of e-mails; some are spam and some are ham. I'm trying to train a classifier using Naive Bayes and then test it on a test set, and I'm still trying to figure out how to do that.
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = DataFrame()
# data: DataFrame with 'message' and 'class' columns, loaded earlier
train = data.sample(frac=0.8, random_state=20)
test = data.drop(train.index)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)

classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)

testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think this is the right way to do it, and in addition, the last line is giving me an error.
ValueError: dimension mismatch
The idea behind CountVectorizer is that it creates a mapping that puts the count of each word in a consistent place in an array. For example, "a b a c" might become [2, 1, 1]. When you call fit_transform, it creates that index mapping (a -> 0, b -> 1, c -> 2) and then applies it to create the vector of counts. Here you call fit_transform to build a count vectorizer for your training data and then again for your testing set. Some words may be in your testing data but not your training data, and these get added to the vocabulary. To expand on the earlier example, your test set might be "d a b", which would create a vector with dimension 4 to account for d. This is likely why the dimensions don't match.
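A quick illustration of that column mapping (my own sketch, not from the question's data):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
train_counts = cv.fit_transform(["spam spam ham", "ham eggs"])
print(cv.vocabulary_)            # {'spam': 2, 'ham': 1, 'eggs': 0}: each word gets a fixed column
print(train_counts.toarray())    # [[0 1 2], [1 1 0]]

# transform() reuses the training vocabulary; unseen words ("toast") are ignored,
# so the dimensions always match the training matrix
print(cv.transform(["ham toast toast"]).toarray())   # [[0 1 0]]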
To fix this, don't use fit_transform the second time; replace:
vectorizer.fit_transform(test['message'].values)
with:
vectorizer.transform(test['message'].values)
It is important to build your vectorizer from your training data only, not all of your data, even though fitting on everything is tempting because it avoids missing features. Fitting on the training data alone makes your test more realistic, since when the model is really used it will encounter unknown words.
There is no guarantee your overall approach will work, but this is likely the source of the dimension mismatch.
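Putting it together, a minimal sketch of the corrected flow, assuming data is the same DataFrame with 'message' and 'class' columns as in the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = data.sample(frac=0.8, random_state=20)
test = data.drop(train.index)

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train['message'].values)  # learn the vocabulary on training data only
test_counts = vectorizer.transform(test['message'].values)        # reuse that same vocabulary

classifier = MultinomialNB()
classifier.fit(train_counts, train['class'].values)

predictions = classifier.predict(test_counts)
print(accuracy_score(test['class'].values, predictions))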
The TensorFlow tutorial on language modeling allows you to compute the probability of sentences:
probabilities = tf.nn.softmax(logits)
In the comments below, it also mentions a way of predicting the next word instead of probabilities, but does not specify how this can be done. So how do I output a word instead of a probability using this example?
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Your output is a TensorFlow tensor, and it is possible to get its max argument (the predicted most probable class) with a TensorFlow function. This is normally the tensor that contains the next word's probabilities.
At "Evaluate the Model" from this page, your output list is y in the following example:
First we'll figure out where we predicted the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y,1) is the label our model thinks is most likely for each input, while tf.argmax(y_,1) is the true label. We can use tf.equal to check if our prediction matches the truth.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
A different approach is to work with pre-vectorized (embedded/encoded) words. You could vectorize your words (i.e., embed them) with word2vec to accelerate learning; you might want to take a look at this. Each word could be represented as a point in a 300-dimensional space of meaning, and you could automatically find the N words closest to the point the network predicts at its output. In that case, the argmax approach no longer works; you would probably compare with cosine similarity against the words you actually wanted to predict, although I am not sure whether this could cause numerical instabilities. In that case, y would not represent words as one-hot features, but word embeddings with a dimensionality of, say, 100 to 2000, depending on the model. You could Google something like "man woman queen word addition word2vec" to understand the subject of embeddings better.
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that your training only deals with pre-embedded inputs and produces embedded outputs. The words corresponding to those outputs can then be recovered with word2vec by looking up the most similar words to each predicted vector.
Note that the approach I suggest is not exact, since it is only useful for knowing whether we predicted EXACTLY the word we wanted to predict. For a softer evaluation, you could use ROUGE or BLEU metrics in case you are evaluating sentences or something longer than a single word.
You need to find the argmax of the probabilities, and translate the index back to a word by reversing the word_to_id map. To get this to work, you must save the probabilities in the model and then fetch them from the run_epoch function (you could also save just the argmax itself). Here's a snippet:
inverseDictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

def run_epoch(...):
    decodedWordId = int(np.argmax(logits))
    print(" ".join([inverseDictionary[int(x1)] for x1 in np.nditer(x)])
          + " got: " + inverseDictionary[decodedWordId]
          + " expected: " + inverseDictionary[int(y)])
See full implementation here: https://github.com/nelken/tf
It is actually an advantage that the function returns probabilities instead of the word itself. Since it gives you the list of words with their associated probabilities, you can do further processing and increase the accuracy of your result.
To answer your question:
You can take the list of words, iterate through it, and make the program display the word with the highest probability.
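A minimal sketch of that idea, assuming probabilities holds the softmax output (one row per position) and word_to_id is the tutorial's vocabulary mapping:

import numpy as np

# invert the vocabulary mapping: id -> word
id_to_word = {i: w for w, i in word_to_id.items()}

# the most probable next word is the argmax of the last probability distribution
next_word_id = int(np.argmax(probabilities[-1]))
print(id_to_word[next_word_id])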