inverse ratio between weights and frequent words when using tfidf? - python

I have a basic question about logistic regression models and corpora vectorized with TF-IDF.
Suppose I have a corpus where the word "need" appears in 10 documents. The value of the "need" column in the TF-IDF matrix will then be lower than if the word appeared in only 5 documents.
The point is that, when fitting its parameters, the model will have to give a larger weight to the "need" feature to compensate for the low value in the input. In the end, a word with little relevance in the corpus gets a very high weight, precisely because of its low relevance.
To test this I used the code below. You'll see that if you add occurrences of "need" to the documents passed to vectorizer.fit_transform, the value of the "need" column in the TF-IDF matrix goes down and the final weight goes up.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(use_idf=True, stop_words=[])
vectorizer.fit_transform(["he need to get a car", "you need to get a car", "she need to get a car", "i need a beer", "please give me a beer"])
vector = vectorizer.transform(["he need to get a car", "you need to get a car", "she need to get a car", "i love you"])
df = pd.DataFrame(vector.todense().tolist(), columns=vectorizer.get_feature_names_out())  # TF-IDF value per document and term

scikit_log_reg = LogisticRegression(verbose=1, solver='liblinear', random_state=0, C=5, penalty='l2', max_iter=1000)
model = scikit_log_reg.fit(vector, ["TRUE", "TRUE", "TRUE", "FALSE"])
weights_df = pd.DataFrame(model.coef_, columns=vectorizer.get_feature_names_out())  # learned coefficient per term

df
weights_df
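For reference, a small added check (reusing the vectorizer fitted above) is to inspect the learned IDF values directly; the more documents a term appears in, the lower its IDF:
idf_df = pd.DataFrame({"term": vectorizer.get_feature_names_out(), "idf": vectorizer.idf_})
print(idf_df.sort_values("idf"))  # frequent terms such as "need" get the lowest IDF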
Is my reasoning correct? Is it okay for it to work like this?

How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.
import json
from collections import namedtuple
from gensim.models import doc2vec

dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
I have seen that gensim's Doc2Vec also includes word vectors. Suppose I have a word vector created for the word "deep learning". My question is: is it possible to get the documents nearest to the "deep learning" word vector in gensim in Python?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
    positive=[d2v_model.wv['deep_learning']]
)
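The result is a list of (doc_tag, cosine_similarity) pairs, so you could, for example, print the top matches like this (a small illustrative addition, assuming the tags are the document IDs you supplied at training time):
for doc_tag, similarity in similar_docs[:10]:
    print(doc_tag, round(similarity, 3))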
But:
that will only be as good as how well your model has learned 'deep_learning' as a word meaning what you think it means
a training set of known-good documents fitting the category 'deep_learning' (and other categories) could work better, whether you hand-curate those or bootstrap from other sources (say, the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust).
reducing a category to a single summary point (one vector) may not be as good as having a variety of examples, many points, that all fit the category. (Relevant docs may not form a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have a lot of good examples of each category, you could train a classifier to label, or rank in relation to the trained categories, any further uncategorized docs.
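A minimal sketch of that last suggestion, assuming the Doc2Vec model trained above plus hand-curated examples (labeled_tags and labels are hypothetical names for your own curated data):
from sklearn.linear_model import LogisticRegression

# labeled_tags / labels are placeholders for your curated per-category examples.
X = [model.docvecs[tag] for tag in labeled_tags]  # doc-vectors as features (model.dv in gensim 4+)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Score a still-uncategorized document by its inferred vector.
new_vec = model.infer_vector("some new deep learning text".split())
print(clf.predict([new_vec]))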

NMF yields all-zero weights

Periodically, when I run topic analyses on data and try to visualize the results using pyLDAvis, I get a validation error: "Not all rows (distributions) in doc_topic_dists sum to 1." Here's some basic code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pyLDAvis.sklearn

tfidf_vectorizer = TfidfVectorizer(max_df=.95, min_df=1, max_features=None, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lines2)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
nmf = NMF(n_components=3, random_state=None, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
panel = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer, mds='tsne')
The culprit is the last statement (the 'panel = ' line); evidently, the matrix produced by nmf.transform(tfidf) contains some rows that are all zeros, so the attempt to normalize the rows by centering them on the column means returns NaNs. No combination of model parameters seems to fix this (in fact, they often seem to make the problem worse by producing more rows with NaNs).
FWIW, the data is the text of tweets from BBC Health, so the records are fairly short: a little under 4,000 records, averaging 4.8 words each. Nonetheless, I've verified that the zero-weighted records all contain words that are in the model vocabulary, so I'm unsure why the problem occurs or how to fix it.
If there is no way to fix it, would it be sensible simply to substitute in the column means in these cases?
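For what it's worth, a minimal way to confirm which documents end up with an all-zero topic distribution (reusing the tfidf, nmf, and lines2 objects from the code above):
import numpy as np

doc_topic = nmf.transform(tfidf)
zero_rows = np.where(doc_topic.sum(axis=1) == 0)[0]
print(len(zero_rows), "documents have an all-zero topic distribution")
print([lines2[i] for i in zero_rows[:5]])  # inspect a few of the offending tweets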

Using Naive Bayes for spam detection

I have two files of e-mails, some spam and some ham. I'm trying to train a classifier using Naive Bayes and then test it on a test set, and I'm still trying to figure out how to do that.
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = DataFrame()
train = data.sample(frac=0.8, random_state=20)  # data is the combined spam/ham DataFrame loaded earlier
test = data.drop(train.index)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)
testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think this is the right way to do it, and on top of that, the last line gives me an error:
ValueError: dimension mismatch
The idea behind CountVectorizer is that it builds a mapping from each word to a fixed position in a count array. For example, a b a c might become [2, 1, 1]. When you call fit_transform it creates that index mapping (a -> 0, b -> 1, c -> 2) and then applies it to produce the vector of counts. Here you call fit_transform to build a vectorizer for your training data and then again for your test set. Some words may be in your test data but not in your training data, and these get added. To expand on the earlier example, your test set might be d a b, which would create a vector of dimension 4 to account for d. This is likely why the dimensions don't match.
To fix this, don't use fit_transform the second time; replace:
vectorizer.fit_transform(test['message'].values)
with:
vectorizer.transform(test['message'].values)
It is important to build your vectorizer from your training data only, not all of your data, even though using everything is tempting to avoid missing features. This makes your tests more realistic, since when the model is really used it will encounter unknown words.
This is no guarantee that your approach will work overall, but it is likely the source of the dimension mismatch.
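Putting the pieces together, here is a minimal sketch of the corrected flow, keeping the variable names from the question and adding a simple accuracy check for illustration:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)  # fit the vocabulary on training data only
classifier = MultinomialNB().fit(counts, train['class'].values)

testing_set = vectorizer.transform(test['message'].values)  # transform only, reusing the training vocabulary
predictions = classifier.predict(testing_set)
print(accuracy_score(test['class'].values, predictions))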

sklearn LogisticRegression classifier performance varies with same element values but different hash-range sparse matrix

I was trying to train a logistic regression classifier on a text dataset. Unlike the common scenario where the text is fed directly to a TF-IDF vectorizer, each original text line was first transformed into a dictionary like {a: 0.1, phrase: 0.5, in: 0.3, line: 0.8}, in which the weights were computed according to some specific rules and some words were omitted. So, in order to feed these dictionaries to the logistic regression classifier, I chose FeatureHasher to do the hashing trick. However, I found that the classifier worked extremely slowly when the n_features parameter of FeatureHasher grew large, say 10^8.
But as far as I know, both the memory cost and the computation cost of a sparse matrix should not grow with its dimensionality while the number of non-zero elements stays fixed. For example, take a two-element sparse vector [coordinates: (1, 2), values: (3, 4)] whose original dimension is 10. If we change the hash range to 20 we might get [coordinates: (3, 7), values: (3, 4)]; there is no difference in storing these two vectors, and if we compute the distance to another sparse vector we only need to traverse lists with a fixed number of elements, so the computation cost is also fixed.
I think there must be something wrong with my understanding, or I must have missed something about sklearn's LogisticRegression; I hope someone can correct me, thanks!
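For reference, here is a minimal sketch of the setup described above, using hypothetical toy data (the dict_rows and labels names are placeholders); it makes it easy to time the fit for different n_features values:
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-weighted token dictionaries, one per text line.
dict_rows = [{"a": 0.1, "phrase": 0.5, "in": 0.3, "line": 0.8},
             {"another": 0.2, "line": 0.4}]
labels = [0, 1]

hasher = FeatureHasher(n_features=2**20)  # compare fit times for 2**20 vs. 10**8
X = hasher.transform(dict_rows)           # sparse matrix with n_features columns
clf = LogisticRegression().fit(X, labels)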

Predicting next word using the language model tensorflow example

The TensorFlow tutorial on language modeling lets you compute the probability of sentences:
probabilities = tf.nn.softmax(logits)
In the comments below, it also mentions a way of predicting the next word rather than probabilities, but does not explain how this can be done. So how can I output a word instead of a probability using this example?
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Your output is a TensorFlow tensor, and it is possible to get the index of its maximum entry (the predicted most probable class) with a TensorFlow function. This is normally the tensor that holds the next word's probabilities.
In the "Evaluate the Model" section from this page, your output is y in the following example:
First we'll figure out where we predicted the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y,1) is the label our model thinks is most likely for each input, while tf.argmax(y_,1) is the true label. We can use tf.equal to check if our prediction matches the truth.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
A different approach is to work with pre-vectorized (embedded) words. You could vectorize your words with Word2vec to accelerate learning; you might want to take a look at this. Each word could be represented as a point in a 300-dimensional space of meaning, and you could automatically find the "N words" closest to the point predicted at the output of the network. In that case the argmax approach no longer works, and you would probably compare with cosine similarity against the words you actually wanted, though I am not sure how this could cause numerical instabilities. In that case y will not represent words as features, but word embeddings with a dimensionality of, say, 100 to 2000 depending on the model. You could Google something like "man woman queen word addition word2vec" to learn more about the subject of embeddings.
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that your training only has pre-embedded inputs and produces embedding outputs. The words corresponding to those outputs can then be recovered with word2vec by finding the most similar words to the predicted vector.
Notice that the approach I suggest is not exact, since it only tells you whether we predict EXACTLY the word we wanted to predict. For a softer approach, it would be possible to use ROUGE or BLEU metrics to evaluate your model if you use sentences or something longer than a single word.
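As an illustration of that embedding-based lookup, here is a minimal sketch assuming a pre-trained word2vec model loaded with gensim and a predicted_vector of matching dimensionality coming out of your network (both the model path and predicted_vector are placeholders):
from gensim.models import KeyedVectors

# Load a pre-trained word2vec model (path is a placeholder).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Words whose embeddings are closest (by cosine similarity) to the network's output vector.
print(wv.similar_by_vector(predicted_vector, topn=10))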
You need to find the argmax of the probabilities, and translate the index back to a word by reversing the word_to_id map. To get this to work, you must save the probabilities in the model and then fetch them from the run_epoch function (you could also save just the argmax itself). Here's a snippet:
inverseDictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

def run_epoch(...):
    decodedWordId = int(np.argmax(logits))
    print(" ".join([inverseDictionary[int(x1)] for x1 in np.nditer(x)])
          + " got " + inverseDictionary[decodedWordId]
          + " expected: " + inverseDictionary[int(y)])
See full implementation here: https://github.com/nelken/tf
It is actually an advantage that the function returns probabilities instead of the word itself. Since it gives you a list of words with their associated probabilities, you can do further processing and increase the accuracy of your result.
To answer your question:
You can take the list of words, iterate through it, and make the program display the word with the highest probability.
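For example, a small NumPy sketch (assuming probabilities is a 1-D array over the vocabulary and inverseDictionary is the reversed word_to_id map from the snippet above), showing the few most probable next words rather than only the single argmax:
import numpy as np

top_k = np.argsort(probabilities)[::-1][:5]  # indices of the 5 most probable words
for word_id in top_k:
    print(inverseDictionary[int(word_id)], probabilities[word_id])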
