Context
I'm trying to extract topics from a set of texts using Latent Dirichlet allocation from Scikit-Learn's decomposition module.
This works really well, except for the quality of topic words found/selected.
In an article by Li et al. (2017), the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words belonging to each of them. For these words they set the default value to a high number for the associated topic and 0 for the other topics. All other words (not manually selected for a topic) are given an equal value (1) for all topics. This matrix of values is used as input for the LDA.
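To make the matrix described above concrete, here is a minimal NumPy sketch. The vocabulary size, the seed word indices, and the value 1000.0 standing in for the "high number" are illustrative assumptions, not values taken from the article.

```python
import numpy as np

n_topics = 4          # as described in the article
n_features = 10       # hypothetical vocabulary size
high_value = 1000.0   # hypothetical "high number"

# Hypothetical seed words: word index -> index of the topic it belongs to
seed_words = {2: 0, 5: 1, 7: 3}

# Every word starts with the same value (1) for every topic
prior = np.ones((n_topics, n_features))

for word_index, topic_index in seed_words.items():
    prior[:, word_index] = 0.0                 # 0 for the other topics
    prior[topic_index, word_index] = high_value  # high value for its own topic

print(prior[:, 2])  # column of a seed word
print(prior[:, 0])  # column of an ordinary word
```

Each column of `prior` holds one word's values across the topics, which is the same orientation as the components_ matrix in scikit-learn.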
My question
How can I create a similar analysis with the LatentDirichletAllocation module from Scikit-Learn using a customized default values matrix (prior topics words) as input?
(I know there's a topic_word_prior parameter, but it only takes one float instead of a matrix with different 'default values'.)
After taking a look at the source and the docs, it seems to me that the easiest thing to do is to subclass LatentDirichletAllocation and override only the _init_latent_vars method. This is the method called in fit to create the components_ attribute, which is the matrix used for the decomposition. By re-implementing it, you can set the matrix just the way you want and, in particular, boost the prior weights for the related topics/features. That is where you would re-implement the paper's initialization logic.
Using Anis' help, I created a subclass of the original module, and edited the function that sets the starting values matrix. For all prior topic words you wish to give as input, it transforms the components_ matrix by multiplying the values with the topic values of that (prior) word.
This is the code:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.utils import check_random_state
# Private helper: this import path matches recent scikit-learn releases
# and may differ in other versions.
from sklearn.decomposition._online_lda_fast import _dirichlet_expectation_2d

# List with prior topic words as tuples
# (word index, [topic values]), with one value per topic
prior_topic_words = []
# Example (word at index 3000 belongs to topic with index 0; 5 topics in total)
prior_topic_words.append(
    (3000, [(np.finfo(np.float64).max/4), 0., 0., 0., 0.])
)

# Custom subclass for PTW-guided LDA
class PTWGuidedLatentDirichletAllocation(LatentDirichletAllocation):

    def __init__(self, n_components=10, doc_topic_prior=None,
                 topic_word_prior=None, learning_method='batch',
                 learning_decay=0.7, learning_offset=10.0, max_iter=10,
                 batch_size=128, evaluate_every=-1, total_samples=1000000.0,
                 perp_tol=0.1, mean_change_tol=0.001,
                 max_doc_update_iter=100, n_jobs=None, verbose=0,
                 random_state=None, ptws=None):
        # The deprecated n_topics argument is dropped; n_components replaces it
        super().__init__(
            n_components=n_components, doc_topic_prior=doc_topic_prior,
            topic_word_prior=topic_word_prior,
            learning_method=learning_method, learning_decay=learning_decay,
            learning_offset=learning_offset, max_iter=max_iter,
            batch_size=batch_size, evaluate_every=evaluate_every,
            total_samples=total_samples, perp_tol=perp_tol,
            mean_change_tol=mean_change_tol,
            max_doc_update_iter=max_doc_update_iter, n_jobs=n_jobs,
            verbose=verbose, random_state=random_state)
        self.ptws = ptws

    def _init_latent_vars(self, n_features, dtype=np.float64):
        """Initialize latent variables."""
        self.random_state_ = check_random_state(self.random_state)
        self.n_batch_iter_ = 1
        self.n_iter_ = 0

        if self.doc_topic_prior is None:
            self.doc_topic_prior_ = 1. / self.n_components
        else:
            self.doc_topic_prior_ = self.doc_topic_prior

        if self.topic_word_prior is None:
            self.topic_word_prior_ = 1. / self.n_components
        else:
            self.topic_word_prior_ = self.topic_word_prior

        init_gamma = 100.
        init_var = 1. / init_gamma
        # In the literature, this is called `lambda`
        self.components_ = self.random_state_.gamma(
            init_gamma, init_var, (self.n_components, n_features)
        ).astype(dtype, copy=False)

        # Transform topic values in matrix for prior topic words
        if self.ptws is not None:
            for word_index, word_topic_values in self.ptws:
                self.components_[:, word_index] *= word_topic_values

        # In the literature, this is `exp(E[log(beta)])`
        self.exp_dirichlet_component_ = np.exp(
            _dirichlet_expectation_2d(self.components_))
Instantiation is the same as for the original LatentDirichletAllocation class, but now you can provide prior topic words using the ptws parameter.
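To see what the scaling step does, independent of scikit-learn, the following NumPy sketch applies the same column-wise multiplication to a gamma-initialized matrix and then normalizes each row. The matrix sizes, the seed word index, and the boost value of 100 are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_topics, n_features = 5, 8  # hypothetical sizes

# Same style of initialization as in _init_latent_vars
components = rng.gamma(100.0, 1.0 / 100.0, (n_topics, n_features))

# Hypothetical prior topic word: word 3 belongs to topic 0
ptws = [(3, [100.0, 0.0, 0.0, 0.0, 0.0])]
for word_index, word_topic_values in ptws:
    components[:, word_index] *= word_topic_values

# Normalize each row to inspect it as a topic-word distribution
topic_word = components / components.sum(axis=1, keepdims=True)
print(topic_word[:, 3])
```

After the scaling, word 3 dominates topic 0 and has zero weight in every other topic, which is the starting point the guided initialization is meant to produce.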
Related
Why am I getting the same set of topic words in my gensim LDA model? I used these parameters, and I checked that there are no duplicate documents in my corpus.
lda_model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
id2word=WORD_AND_ID,
num_topics=4,
minimum_probability=minimum_probability,
random_state=100,
update_every=1,
chunksize=100,
passes=10,
alpha='auto', # symmetric, asymmetric
per_word_topics=True)
Results
[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]
Notice: Topic #1 and #3 are identical.
Each of the topics likely contains a large number of words weighted differently. When a topic is being displayed (e.g. using lda_model.show_topics()) you are going to get only a few words with the largest weights. This does not mean that there are no differences between topics among the remaining vocabulary.
You can steer the number of displayed words to inspect the remaining weights:
show_topics(num_topics=4, num_words=10, log=False, formatted=True)
and change the num_words parameter to include even more words.
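The point about truncated displays can be illustrated without gensim. In this NumPy sketch, two hypothetical topic-word distributions share the same top three words, so a three-word display looks identical, yet the topics differ over the rest of the vocabulary. The vocabulary, the weights, and the top_words helper are made up for illustration.

```python
import numpy as np

vocab = ["algebra", "calculation", "geometry", "proof", "ruler", "angle"]
topic_a = np.array([0.40, 0.20, 0.30, 0.08, 0.01, 0.01])
topic_b = np.array([0.40, 0.20, 0.30, 0.01, 0.01, 0.08])


def top_words(weights, num_words):
    """Return the num_words highest-weighted words, largest first."""
    order = np.argsort(-weights, kind="stable")[:num_words]
    return [vocab[i] for i in order]


# With three words the two topics look identical
print(top_words(topic_a, 3), top_words(topic_b, 3))
# With four words the difference becomes visible
print(top_words(topic_a, 4), top_words(topic_b, 4))
```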
Now, there is also a possibility that:
the number of topics should be different (e.g. 3),
or minimum_probability smaller (what is the value you use?),
or number of passes larger,
chunksize smaller,
corpus larger (what is the size?) or stripped of stop words (did you do that?).
I encourage you to experiment with different values of these parameters to check whether any combination works better.
You need to set the alpha parameter to 50/i, where i is your number of topics, and use the eta parameter (eta = 0.1).
Like this:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=4,
update_every=1,
chunksize=100,
passes=10,
alpha=50/4,
eta = 0.1,
per_word_topics=True)
The solution I found that works in my case is posted below. Hope this helps someone.
How would I concatenate the output of TF-IDF created with sklearn to be passed into a Keras model or tensor that could then be fed into a dense neural network? I'm working on the FakeNewsChallenge dataset. Any guidance would be helpful.
The FakeNewsChallenge dataset is as such:
Training Set - [Headline, Body text, label]
Training Set is split into two different CSVs (train_bodies, train_stances) and are linked by BodyIDs.
train_bodies - [Body ID (num), articleBody (text)]
train_stances - [Headline (text), Body ID (num), Stance (text)]
Test Set - [Headline, Bodytext]
Test set is split into two different CSVs (test_stances_unlabeled, test_bodies)
test_bodies - [Body ID, articleBody]
test_stances_unlabeled - [Headline, Body ID]
Distribution makes it extremely hard:
rows - 49972
unrelated - 0.73131
discuss - 0.17828
agree - 0.076012
disagree - 0.0168094
Stance - [ unrelated, discuss, agree, disagree]
What I would like to do is concatenate two separate TF-IDF vectors, as well as other features, so that I can feed them into some layer, for instance a dense layer. How would you go about that?
There was a comment prior to mine that answered the question but I do not see the comment anymore. I apparently forgot about this method, but was using it in other areas of my program.
You can use numpy.hstack(tup) or numpy.vstack(tup), where
tup - sequence of ndarrays
For hstack, the arrays must have the same shape along all but the second axis, except 1-D arrays, which can be any length.
It returns the stacked ndarray.
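As a quick illustration of the shape rule, this sketch stacks two small dense arrays column-wise. The arrays are hypothetical stand-ins for two feature blocks that have the same number of rows.

```python
import numpy as np

# Two hypothetical feature blocks for the same 3 samples
headline_features = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [0.5, 0.5]])        # shape (3, 2)
body_features = np.array([[0.1, 0.2, 0.3],
                          [0.4, 0.5, 0.6],
                          [0.7, 0.8, 0.9]])       # shape (3, 3)

# Row counts match, so the columns can be placed side by side
combined = np.hstack((headline_features, body_features))
print(combined.shape)
```

The result has one row per sample and the columns of both blocks, which is the layout a dense input layer expects.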
Here is some code just in case.
Note: I do not have cosine similarity calculation here. Do that however you want. I'm trying to do this fast but also as clear as possible. Hope this helps someone.
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

def computeTF_IDF(trainX1, trainX2, testX1, testX2):
    # Fit one vectorizer per text field, on the training data only
    vectorX1 = TfidfVectorizer()  # add your own vectorizer options here
    tfidfX1 = vectorX1.fit_transform(trainX1)
    vectorX2 = TfidfVectorizer()  # add your own vectorizer options here
    tfidfX2 = vectorX2.fit_transform(trainX2)
    # Reuse the fitted vectorizers on the test data
    tfidf_testX1 = vectorX1.transform(testX1)
    tfidf_testX2 = vectorX2.transform(testX2)
    # Optionally, you can insert code from * to ** here from below.
    return tfidfX1, tfidfX2, tfidf_testX1, tfidf_testX2

# Call TF-IDF function to compute.
trainX1_tfidf, trainX2_tfidf, testX1_tfidf, testX2_tfidf = computeTF_IDF(trainX1, trainX2, testX1, testX2)
#*
# Stack matrices horizontally (column wise) using hstack().
trainX_tfidf = scipy.sparse.hstack([trainX1_tfidf, trainX2_tfidf])
testX_tfidf = scipy.sparse.hstack([testX1_tfidf, testX2_tfidf])
# Convert the sparse matrix into an array using toarray()
trainX_tfidf_arr = trainX_tfidf.toarray()
testX_tfidf_arr = testX_tfidf.toarray()
# Concatenate TF-IDF and cosine similarity using numpy.c_[],
# which is just another column stack.
# cosine_similarity and cosine_similarity_test are assumed to be
# precomputed arrays with one row per sample (not shown here).
trainX_tfidf_cos = np.c_[trainX_tfidf_arr, cosine_similarity]
testX_tfidf_cos = np.c_[testX_tfidf_arr, cosine_similarity_test]
#**
# You can now pass this to your Keras model.
I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the sentence most similar to the one I enter, it should use the new document's vectors instead of the trained corpus.
Currently, I am using the infer_vector() to compute the vector for each one of the sentences of the new document, but I can't use the most_similar() function with the list of vectors I obtain, it has to be KeyedVectors.
I would like to know if there's any way that I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each one of the sentences of the new document and the sentence I introduce individually (in this case, is there any implementation in Gensim that allows me to compute the cosine similarity between 2 vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I can not provide the complete code, since it is a project for the university, but here are the main parts in which I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago in gensim==3.2.0 (the syntax might have changed).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec
vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))
for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]
# Write the vectors in word2vec text format:
# a header line (count and dimensionality), then one vector per line
with open("question_vectors.txt", "w") as f:
    f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
    for name, vector in vectors.items():
        line = "{} {}\n".format(name, " ".join(vector.astype(str)))
        f.write(line)
Then you can load the file and use the most_similar function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not too high) would be to convert questions to an np.array and compute the cosine similarity, e.g.
import numpy as np
questions = np.array(questions)
# Pairwise products of the vector norms, shape (n, n)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
# Pairwise dot products divided by the norms give the cosine similarities
product = np.matmul(questions, questions.T)
product = product.T / norm
# Otherwise each item is the closest to itself
np.fill_diagonal(product, -np.inf)
# Gives the top 10 most similar items to the 0th question, most similar first
np.argsort(-product[0])[:10]
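The original question was about ranking the new document's sentences against a single input phrase rather than against each other. A minimal NumPy sketch of that case follows; the small arrays are hypothetical stand-ins for the output of infer_vector(), and cosine_similarities is an illustrative helper name, not a gensim function.

```python
import numpy as np


def cosine_similarities(query, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    matrix_norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query)
    return matrix @ query / (matrix_norms * query_norm)


# Hypothetical inferred vectors for three sentences of the new document
questions = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [1.0, 1.0, 0.0]])
# Hypothetical inferred vector for the input phrase
new_vector = np.array([2.0, 0.1, 0.0])

similarities = cosine_similarities(new_vector, questions)
best = int(np.argmax(similarities))  # index of the most similar sentence
print(best, similarities[best])
```

This avoids the round trip through a file, at the cost of computing the similarities yourself.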
I have used scikit-learn's CountVectorizer to convert a collection of documents into a matrix of token counts. I have also used its max_features parameter, which keeps only the top max_features terms ordered by term frequency across the corpus.
Now I want to analyse my selected corpus; in particular, I want to know the frequency of the tokens in the selected vocabulary. But I am unable to find an easy way to do it. Kindly help me in this regard.
When you call fit_transform(), a sparse matrix is returned.
To display it, you simply have to call the toarray() method.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
spars_mat = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
# you can inspect the matrix in the interpreter by doing
spars_mat.toarray()
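With the help of @bernard's post, I was able to get the complete result, which is as follows:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
doc_term_matrix = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
doc_term_matrix = doc_term_matrix.toarray()
# Total count of each vocabulary term across all documents
term_freq_matrix = doc_term_matrix.sum(0)
min_freq = np.amin(term_freq_matrix)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
indices_name_mapping = vec.get_feature_names_out()
# Names of the terms that occur with the minimum frequency
feature_names = [indices_name_mapping[i] for i, x in enumerate(term_freq_matrix) if x == min_freq]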
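To get the frequency of every token in the vocabulary rather than only the rarest ones, the column sums can be paired with the term names. The sketch below uses a hand-written count matrix and vocabulary for the three example documents so that it runs without scikit-learn; with a fitted CountVectorizer you would use its own matrix and feature names instead.

```python
import numpy as np

# Hand-written stand-ins for the vectorizer's vocabulary and count matrix
vocabulary = ["tata", "titi", "toto"]
doc_term_matrix = np.array([
    [0, 1, 1],  # 'toto titi'
    [0, 0, 2],  # 'toto toto'
    [1, 1, 0],  # 'titi tata'
])

# Summing over the rows gives one total count per vocabulary term
term_counts = doc_term_matrix.sum(axis=0)
term_frequencies = {term: int(count) for term, count in zip(vocabulary, term_counts)}

# Sort terms from most to least frequent
ranked = sorted(term_frequencies.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```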
I am trying to do text classification using a naive Bayes classifier.
My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know SVM would be a better option here, but I want to go with naive Bayes using the sklearn library.
{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}
This is what i have tried so far,
import numpy as np
import json
from sklearn.naive_bayes import *
topic = []
question = []
excerpt = []
with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])
unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]
x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1,x2])
Y = np.array(numeric_topics)
print X[0]
clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( ['hello'] )
But, as expected, I am getting ValueError: could not convert string to float. My question is: how can I create a simple classifier to classify the question and excerpt into the related topic?
All classifiers in sklearn require input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
# X must be an iterable of strings, one per sample
X = vect.fit_transform(X)
Obviously, you'll need to vectorize your test set in the same way:
clf.predict( vect.transform(['hello']) )
See a tutorial on using sklearn with textual data.
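Putting the pieces together, here is a minimal end-to-end sketch, assuming each record's question and excerpt are simply joined into one string before vectorizing. The tiny training set is made up for illustration, and MultinomialNB is swapped in for BernoulliNB because it is a common choice for TF-IDF features; neither choice comes from the original posts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical records in the same shape as the training data
records = [
    {"topic": "electronics", "question": "effective capacitance of this circuit",
     "excerpt": "capacitor in series with a resistor"},
    {"topic": "electronics", "question": "outlet installation wires",
     "excerpt": "replacing a wall outlet with a wire nut"},
    {"topic": "android", "question": "install app on android phone",
     "excerpt": "the app store shows an error"},
    {"topic": "android", "question": "phone battery drains quickly",
     "excerpt": "after the latest android update"},
]

# One string per sample: question and excerpt joined together
texts = [r["question"] + " " + r["excerpt"] for r in records]
labels = [r["topic"] for r in records]  # string labels work directly

# The pipeline vectorizes and classifies in one step, for fit and predict alike
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["capacitance of a circuit"])
print(prediction[0])
```

Using a pipeline keeps the vectorizer and the classifier together, so the test data is always transformed with the vocabulary learned from the training data.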