LDA: gensim topic model gives the same set of topics - python

Why am I getting the same set of topic words in my gensim LDA model? I used the parameters below, and I checked that there are no duplicate documents in my corpus.
lda_model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
                                            id2word=WORD_AND_ID,
                                            num_topics=4,
                                            minimum_probability=minimum_probability,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',  # alternatives: 'symmetric', 'asymmetric'
                                            per_word_topics=True)
Results
[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]
Notice: topics #1 and #3 are identical.

Each of the topics likely contains a large number of words weighted differently. When a topic is being displayed (e.g. using lda_model.show_topics()) you are going to get only a few words with the largest weights. This does not mean that there are no differences between topics among the remaining vocabulary.
You can steer the number of displayed words to inspect the remaining weights:
lda_model.show_topics(num_topics=4, num_words=10, log=False, formatted=True)
and increase the num_words parameter to include even more words.
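For example, with formatted=False the same call returns raw (word, weight) pairs, which makes it easy to check where topics #1 and #3 actually differ further down the distribution. A quick sketch, using the lda_model from the question:
topics = dict(lda_model.show_topics(num_topics=4, num_words=25, formatted=False))
words_1 = {word for word, weight in topics[1]}
words_3 = {word for word, weight in topics[3]}
print(words_1 ^ words_3)  # words ranked highly by only one of the two topics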
Now, there is also a possibility that:
the number of topics should be different (e.g. 3),
or minimum_probability smaller (what is the value you use?),
or number of passes larger,
chunksize smaller,
corpus larger (what is the size?) or stripped of stop words (did you do that?).
I encourage you to experiment with different values of these parameters to check whether any combination works better.
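One way to make that experimentation systematic is to compare topic coherence across a few settings. A rough sketch, assuming WORD_AND_ID is a gensim Dictionary (u_mass coherence only needs the corpus and the dictionary):
from gensim.models import CoherenceModel

for k in (3, 4, 5, 6):
    lda = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS, id2word=WORD_AND_ID,
                                          num_topics=k, passes=10, random_state=100)
    cm = CoherenceModel(model=lda, corpus=MY_CORPUS, dictionary=WORD_AND_ID, coherence='u_mass')
    print(k, cm.get_coherence())
Comparing the scores across num_topics values gives a rough indication of which setting yields more coherent topics; eyeballing the resulting topic words still matters.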

You could also try changing the alpha parameter to 50/k, where k is your number of topics, and setting the eta parameter (e.g. eta=0.1), like this:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha=50/4,
                                            eta=0.1,
                                            per_word_topics=True)
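Note that, depending on your gensim version, alpha may not accept a bare float; it does accept a 1D array with one prior weight per topic, so the same heuristic can also be written as (a hedged variant of the snippet above):
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            passes=10,
                                            alpha=[50.0 / 4] * 4,  # one prior weight per topic
                                            eta=0.1,
                                            per_word_topics=True)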

How to select feature sizes

I'm trying to replicate an experiment from a paper using SVM, to improve my knowledge of machine learning. In this paper, the author extracts the features and chooses the feature sizes. He then shows a table where F represents the size of the feature vector and N represents the number of face images.
He then works with F >= 9 and N >= 15.
Now, what I want to do is actually grab the features I extract, as he does in the paper.
Basically, this is how I extract the features:
def load_image_files(fullpath, dimension=(64, 64)):
    descr = "A image classification dataset"
    images = []
    flat_data = []
    target = []
    dimension = (64, 64)
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        for person in os.listdir(path):
            personfolder = os.path.join(path, person)
            for imgname in os.listdir(personfolder):
                class_num = CATEGORIES.index(category)
                fullpath = os.path.join(personfolder, imgname)
                img_resized = resize(skimage.io.imread(fullpath), dimension, anti_aliasing=True, mode='reflect')
                flat_data.append(img_resized.flatten())
                images.append(skimage.io.imread(fullpath))
                target.append(class_num)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    print(CATEGORIES)
    return Bunch(data=flat_data,
                 target=target,
                 target_names=category,
                 images=images,
                 DESCR=descr)
How do I select the number of features extracted and stored? Or how do I manually store a vector with the number of features that I need, for instance a feature vector of size 9?
I'm trying to separate my features this way:
X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.3, random_state=109)
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X_train, y_train)
print(model.feature_importances_)
Though, my output is:
[0. 0. 0. ... 0. 0. 0.]
For SVM classification, I'm trying to use OneVsRestClassifier:
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters)
model_tunning
model_tunning.fit(X_train, y_train)
prediction = model_tunning.best_estimator_.predict(X_test)
Then, once I call prediction, I get:
Out[29]:
array([1, 0, 4, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1, 1, 0, 3, 2, 2, 2, 0, 4, 2,
2, 4])
So you've got two arrays of image information (one unprocessed, the other resized and flattened) as well as a list of corresponding class values (which we usually call labels). There are currently 2 things not quite right with the setup, however:
1) What's missing here are multiple features - these might include specific arrays from data associated with feature extraction from morphological/computer vision processes of your images, or they may be ancillary data like a list of preferences, behaviors, purchases. Basically, anything that can act as an array in either a numerical or categorical format. Technically speaking, your resized images are a second feature, but I don't think this will add much if any improvement in model performance.
2) target_names=category in your function return will store the last iteration of category in CATEGORIES. I don't know if this is what you want.
Going back to your table, N would refer to the number of images in the dataset, and F would be the number of corresponding feature arrays associated with each image. By way of example, let's say we have fifty individual wines and five features (colour, taste, alcohol content, pH, optical density). N of 5 would be five of those wines, and F of 2 would be, say, colour and taste.
If I had to guess at what your features would be, they would in fact be a single feature - the image data itself. Looking at your data structure, every label/category you have will have multiple individuals (people) each with multiple examples of images of that person. Note that multiple individuals are not separate features - the way you're structuring the data, the individuals are grouped together under a single category.
So, where to from here? Without knowing what paper you're reading it's hard to suggest what to do, but I would go back and see if you can perhaps provide us with more information about the problem.
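If the goal is literally to end up with a feature vector of a chosen size F (say F = 9) from the flattened image data, one common way (not necessarily what the paper does) is a dimensionality-reduction step such as PCA; SelectKBest works similarly. A rough sketch, reusing X_train/X_test from the question:
from sklearn.decomposition import PCA

pca = PCA(n_components=9)                # F = 9 features per image
X_train_9 = pca.fit_transform(X_train)   # learn the projection on the training data only
X_test_9 = pca.transform(X_test)         # apply the same projection to the test set

model_tunning.fit(X_train_9, y_train)    # the GridSearchCV/OneVsRestClassifier from the question
prediction = model_tunning.best_estimator_.predict(X_test_9)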

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the most similar sentence to one I introduce, it uses this new document vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute the vector for each one of the sentences of the new document, but I can't use the most_similar() function with the list of vectors I obtain; it has to be a KeyedVectors instance.
I would like to know if there's any way that I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each one of the sentences of the new document and the sentence I introduce individually (in this case, is there any implementation in Gensim that allows me to compute the cosine similarity between 2 vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I cannot provide the complete code, since it is a project for the university, but here are the main parts in which I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago in gensim==3.2.0 (the syntax might have changed).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))
for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

# Write the vectors in word2vec text format: a header line with the number
# of vectors and their dimensionality, then one "<name> <values>" line each.
with open("question_vectors.txt", "w") as f:
    f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
    for name, vector in vectors.items():
        f.write("{} {}\n".format(name, " ".join(vector.astype(str))))
Then you can load the file and use the most_similar function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not that high) would be just to convert questions to a np.array and compute cosine similarities, e.g.:
import numpy as np

questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product.T / norm
# Zero the diagonal, otherwise each item is closest to itself
for j in range(len(questions)):
    product[j, j] = 0
# Indices of the 10 items most similar to the 0th question
np.argpartition(product[0], -10)[-10:]
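If you also want to compare a newly inferred vector (like new_vector from the question) against all the stored question vectors without going through KeyedVectors, plain numpy cosine similarity works too. A small sketch, assuming questions is the np.array built above:
new_vector = doc2vec_model.infer_vector(tokens)
# cosine similarity of new_vector against every stored question vector
sims = questions @ new_vector / (np.linalg.norm(questions, axis=1) * np.linalg.norm(new_vector))
top10 = np.argsort(sims)[::-1][:10]  # indices of the 10 most similar questions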

How to interpret PCA results in supervised ML

So I have a data set of 700 texts annotated by difficulty levels. Each text has 150 features:
feature_names = ['F1','F2','F3'...] shape (1, 150)
features_x = ['0.1', '0.765', '0.543'...] shape (700, 150)
correct_answers_y = ['1','2','4'...] shape (1,700)
I want to use PCA to find out the most informative sets of features, something like:
Component1 = 0.76*F1 + 0.11*F4 - 0.22*F7
How can I do so? The code from the sklearn user guide gives some numbers as output, but I don't understand how to interpret them.
fit_xy = pca.fit(features_x,correct_answers_y)
array([ 4.01783322e-01, 1.98421989e-01, 3.08468655e-01,
4.28813755e-02, ...])
Not sure where that array comes from, but it looks like the output of the explained_variance_ or explained_variance_ratio_ attributes. They are what they say: the explained variance and the explained variance ratio of your data. Usually when doing a PCA you define a minimum ratio of variance you want to keep from the data.
Let's say you want to keep at least 90% of the variance in your data. Here's code to find how many principal components (the n_components parameter in PCA) you need:
pca_cumsum = pca.explained_variance_ratio_.cumsum()
pca_cumsum
>> np.array([.54, .79, .89, .91, .97, .99, 1])
np.argmax(pca_cumsum >= 0.9)
>> 3
And as desertnaut said, labels will be ignored, as they are not used in PCA.
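If what you are actually after is the loadings themselves (the Component1 = 0.76*F1 + 0.11*F4 - 0.22*F7 expression from the question), they are stored in pca.components_ after fitting, one row of 150 weights per component. A minimal sketch, assuming feature_names is a flat list of the 150 names:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(features_x)  # the labels are ignored by PCA

for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]  # 3 largest weights by magnitude
    terms = " + ".join("{:.2f}*{}".format(component[j], feature_names[j]) for j in top)
    print("Component{} = {}".format(i + 1, terms))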

Latent Dirichlet Allocation with prior topic words

Context
I'm trying to extract topics from a set of texts using Latent Dirichlet allocation from Scikit-Learn's decomposition module.
This works really well, except for the quality of topic words found/selected.
In an article by Li et al. (2017), the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words associated with/belonging to these topics. For these words they set the default value to a high number for the associated topic and 0 for the other topics. All other words (not manually selected for a topic) are given equal values for all topics (1). This matrix of values is used as input for the LDA.
My question
How can I create a similar analysis with the LatentDirichletAllocation module from Scikit-Learn using a customized default values matrix (prior topics words) as input?
(I know there's a topic_word_prior parameter, but it only takes one float instead of a matrix with different 'default values'.)
After taking a look at the source and the docs, it seems to me like the easiest thing to do is subclass LatentDirichletAllocation and only override the _init_latent_vars method. It is the method called in fit to create the components_ attribute, which is the matrix used for the decomposition. By re-implementing this method, you can set it just the way you want, and in particular boost the prior weights for the related topics/features. You would re-implement there the logic of the paper for the initialization.
Using Anis' help, I created a subclass of the original module, and edited the function that sets the starting values matrix. For all prior topic words you wish to give as input, it transforms the components_ matrix by multiplying the values with the topic values of that (prior) word.
This is the code:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.utils import check_random_state
from sklearn.decomposition._online_lda import _dirichlet_expectation_2d

# List with prior topic words as tuples
# (word index, [topic values])
prior_topic_words = []

# Example (word at index 3000 belongs to topic with index 0)
prior_topic_words.append(
    (3000, [(np.finfo(np.float64).max / 4), 0., 0., 0., 0.])
)

# Custom subclass for PTW-guided LDA
class PTWGuidedLatentDirichletAllocation(LatentDirichletAllocation):

    def __init__(self, n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method='batch',
                 learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1,
                 total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100,
                 n_jobs=None, verbose=0, random_state=None, n_topics=None, ptws=None):
        super(PTWGuidedLatentDirichletAllocation, self).__init__(
            n_components, doc_topic_prior, topic_word_prior, learning_method, learning_decay, learning_offset,
            max_iter, batch_size, evaluate_every, total_samples, perp_tol, mean_change_tol, max_doc_update_iter,
            n_jobs, verbose, random_state, n_topics)
        self.ptws = ptws

    def _init_latent_vars(self, n_features):
        """Initialize latent variables."""
        self.random_state_ = check_random_state(self.random_state)
        self.n_batch_iter_ = 1
        self.n_iter_ = 0
        if self.doc_topic_prior is None:
            self.doc_topic_prior_ = 1. / self.n_topics
        else:
            self.doc_topic_prior_ = self.doc_topic_prior
        if self.topic_word_prior is None:
            self.topic_word_prior_ = 1. / self.n_topics
        else:
            self.topic_word_prior_ = self.topic_word_prior
        init_gamma = 100.
        init_var = 1. / init_gamma
        # In the literature, this is called `lambda`
        self.components_ = self.random_state_.gamma(
            init_gamma, init_var, (self.n_topics, n_features))
        # Transform topic values in matrix for prior topic words
        if self.ptws is not None:
            for ptw in self.ptws:
                word_index = ptw[0]
                word_topic_values = ptw[1]
                self.components_[:, word_index] *= word_topic_values
        # In the literature, this is `exp(E[log(beta)])`
        self.exp_dirichlet_component_ = np.exp(
            _dirichlet_expectation_2d(self.components_))
Instantiation is the same as for the original LatentDirichletAllocation class, but now you can provide prior topic words using the ptws parameter.
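A short usage sketch (texts is a hypothetical list of documents here, and since the constructor mirrors an older scikit-learn signature, it may need small adjustments on recent versions):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)  # document-term matrix; the prior word indices refer to this vocabulary

ptw_lda = PTWGuidedLatentDirichletAllocation(n_components=5, ptws=prior_topic_words)
ptw_lda.fit(X)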

Pattern recognition with Pybrain

Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9], 100[1-9]] (second pattern); however, it returns [25084, 25084]. I realize that it's trying to find the best value given ALL the inputs, but I'm trying to have it distinguish certain patterns within the set, if that makes sense.
This is the example I'm working from:
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet, UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random

ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
    rand1 = random.randint(1, 9)
    rand2 = random.randint(1, 9)
    pattern_one_0 = int('2000'+str(rand1))
    pattern_one_1 = int('2000'+str(rand2))
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand2))
    ds.addSample((pattern_one_0, pattern_one_1), (0))  # pattern 1, maps to 0
    ds.addSample((pattern_two_0, pattern_two_1), (1))  # pattern 2, maps to 1

unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer, bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)

ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand1))
    ts.addSample((pattern_two_0, pattern_two_1))  # adding first part of pattern 2 to unseen data

a = [int(i) for i in net.activateOnDataset(ts)[0]]  # should map to 1
unsupervised_results.append(a[0])

print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set
Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates to a certain degree the function you sampled and provided in the dataset. Hence the result of some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description, I am assuming you have a clearly defined set of patterns that you are looking for. Then you must map your patterns to a categorical variable. For instance, pattern 1 you mention, (200[1-9], 200[1-9]),(400[1-9],400[1-9]), would be 0, pattern 2 would be 1, and so on.
Then, you train the network to output the class (0,1,...) to which the input pattern belongs.
Arguably, given the structure of your patterns, rule-based classification is probably more adequate than ANNs.
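For example, with the value ranges your code actually generates (2000[1-9] and 9000[1-9]), a plain rule-based check could look like this hypothetical helper:
def classify(pair):
    # pattern 1: both values fall in 2000[1-9]; pattern 2: both fall in 9000[1-9]
    if all(20001 <= v <= 20009 for v in pair):
        return 0
    if all(90001 <= v <= 90009 for v in pair):
        return 1
    return None  # no known pattern

print(classify((90002, 90009)))  # -> 1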
Concerning the amount of data, you need much more of it. Typically, the most basic approach is to split the dataset into two groups (70-30, for instance). You use 70% of the samples for training, and the remaining 30% as unseen data (test data) to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.
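A minimal sketch of that 70/30 split with pybrain's splitWithProportion, assuming the ds, net and trainer objects from the question:
# splitWithProportion returns (first part, remainder), so 0.3 gives a 30% test set
test_ds, train_ds = ds.splitWithProportion(0.3)
trainer = BackpropTrainer(net, train_ds)
trainer.trainEpochs(500)
predictions = net.activateOnDataset(test_ds)  # evaluate on the held-out 30%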
