doc2vec How to cluster DocvecsArray - python

I've patched the following code from examples I've found over the web:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
# random
from random import shuffle
# classifier
class LabeledLineSentence(object):
def __init__(self, sources):
self.sources = sources
flipped = {}
# make sure that keys are unique
for key, value in sources.items():
if value not in flipped:
flipped[value] = [key]
else:
raise Exception('Non-unique prefix encountered')
def __iter__(self):
for source, prefix in self.sources.items():
with utils.smart_open(source) as fin:
for item_no, line in enumerate(fin):
yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])
def to_array(self):
self.sentences = []
for source, prefix in self.sources.items():
with utils.smart_open(source) as fin:
for item_no, line in enumerate(fin):
self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
return self.sentences
def sentences_perm(self):
shuffle(self.sentences)
return self.sentences
sources = {'test.txt' : 'DOCS'}
sentences = LabeledLineSentence(sources)
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
model.train(sentences.sentences_perm())
print(model.docvecs)
my test.txt file contains a paragraph per line.
The code runs fine and generates DocvecsArray for each line of text
my goal is to have an output like so:
cluster 1: [DOC_5,DOC_100,...DOC_N]
cluster 2: [DOC_0,DOC_1,...DOC_N]
I have found the following Answer, but the output is:
cluster 1: [word,word...word]
cluster 2: [word,word...word]
How can I alter the code and get document clusters?

So it looks like you're almost there.
You are outputting a set of vectors. For the sklearn package, you have to put those into a numpy array - using the numpy.toarray() function would probably be best. The documentation for KMeans is really stellar and even across the whole library it's good.
A note for you is that I have had much better luck with DBSCAN than KMeans, which are both contained in the same sklearn library. DBSCAN doesn't require you to specify how many clusters you want to have on the output.
There are well-commented code examples in both links.

In my case I used:
for doc in docs:
doc_vecs = model.infer_vector(doc.split())
# creating a matrix from list of vectors
mat = np.stack(doc_vecs)
# Clustering Kmeans
km_model = KMeans(n_clusters=5)
km_model.fit(mat)
# Get cluster assignment labels
labels = km_model.labels_
# Clustering DBScan
dbscan_model = DBSCAN()
labels = dbscan_model.fit_predict(mat)
Where model is the pre-trained Doc2Vec model. In my case I didn't need to cluster the same documents of the training but new documents saved in the docs list

Related

NotFittedError: CountVectorizer - Vocabulary wasn't fitted. while performing sentiment analysis

while performing sentiment analysis using data -
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
The dataset contains 25K training and testing data (12.5 Positive and 12.5 Negative reviews)
I'm constantly getting -
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Code -
(Required libraries and Variable names are initialized separately)
To create training and testing data -
import glob
import os
import numpy as np
def load_texts_labels_from_folders(path, folders):
texts,labels = [],[]
for idx,label in enumerate(folders):
for fname in glob.glob(os.path.join(path, label, '*.*')):
texts.append(open(fname, 'r',encoding="utf8").read())
labels.append(idx)
# stored as np.int8 to save space
return texts, np.array(labels).astype(np.int8)
trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)
len(trn),len(trn_y),len(val),len(val_y)
len(trn_y[trn_y==1]),len(val_y[val_y==1])
np.unique(trn_y)
Count Vectorization -
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
#create term documetn matrix
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), min_df=1,max_features=80000)
trn_term_doc
trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]
Here i get Error -
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
vocab = loaded_vectorizer.get_feature_names()
loaded_vectorizer is not defined anywhere in this code, so it's not surprising that it's not initialized.
Also why do you initialize veczr twice? Apparently you don't use it the second time.

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the most similar sentence to one I introduce, it uses this new document vectors instead of the trained corpus.
Currently, I am using the infer_vector() to compute the vector for each one of the sentences of the new document, but I can't use the most_similar() function with the list of vectors I obtain, it has to be KeyedVectors.
I would like to know if there's any way that I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each one of the sentences of the new document and the sentence I introduce individually (in this case, is there any implementation in Gensim that allows me to compute the cosine similarity between 2 vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I can not provide the complete code, since it is a project for the university, but here are the main parts in which I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago in gensim==3.2.0 (the syntax might have changed).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec
vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))
for name in y_names:
# vectors[name] = doc2vec_model.docvecs[name]
vectors[str(name)] = questions[name]
f = open("question_vectors.txt".format(filename), "w")
f.write("")
f.flush()
f.close()
f = open("question_vectors.txt".format(filename), "a")
f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
for v in vectors:
line = "{} {}\n".format(v, " ".join(questions[v].astype(str)))
f.write(line)
f.close()
then you can load and use most_similar function
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (esp. if the number of questions is not so high) would be just to convert questions to a np.array and get cosine distance), e.g.
import numpy as np
questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product.T / norm
# Otherwise the item is the closest to itself
for j in range(len(questions)):
product[j, j] = 0
# Gives the top 10 most similar items to the 0th question
np.argpartition(product[0], 10)

Pattern recognition with Pybrain

Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9],100[1-9]] (second pattern) however it returns [25084, 25084]. I realize that its trying to find the best value given ALL the inputs, however I'm trying to have it distinquish certain patterns within the set if that makes sense.
This is the example I'm working from :
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet,UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random
ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
rand1 = random.randint(1,9)
rand2 = random.randint(1,9)
pattern_one_0 = int('2000'+str(rand1))
pattern_one_1 = int('2000'+str(rand2))
pattern_two_0 = int('9000'+str(rand1))
pattern_two_1 = int('9000'+str(rand2))
ds.addSample((pattern_one_0,pattern_one_1),(0))#pattern 1, maps to 0
ds.addSample((pattern_two_0,pattern_two_1),(1))#pattern 2, maps to 1
unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer,bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)
ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
pattern_two_0 = int('9000'+str(rand1))
pattern_two_1 = int('9000'+str(rand1))
ts.addSample((pattern_two_0, pattern_two_1))#adding first part of pattern 2 to unseen data
a = [int(i) for i in net.activateOnDataset(ts)[0]]#should map to 1
unsupervised_results.append(a[0])
print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set
Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates to a certain degree the function you sampled and provided in the dataset. Hence the result of some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description I am assuming you have a clearly defined set of patterns, that you are looking for. Then you must map your patterns to a categorical variable. For instance the pattern 1 you mention (200[1-9], 200[1-9]),(400[1-9],400[1-9]) would be 0, pattern 2 would be 1 and so on.
Then, you train the network to output the class (0,1,...) to which the input pattern belongs.
Arguably, given the structure of your patterns, rule-based classification is probably more adequate than ANNs.
Concerning the amount of data, you need much more of it. Tipically, the most basic approach is to split the dataset into two groups (70-30, for instance). You use 70% of the samples for training, and the remaining 30% you use as unseen data (test data), to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.

how to show the document ID after get TF-IDF ,cosine_similarity? python

I calculate TF-IDF for both query string and a few documents.
I would like to calculate the cosine similarity and display the document ID list from the most relevant to query to less relevant.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
## load the documents (around 200 txt) from path
cranInp=[]
path="D:\\Desktop\\try\\web"
for file in os.listdir(path):
textdir=path+"\\"+file
f=open(textdir).read()
# print f
cranInp.append(f)
Vcount = TfidfVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
countMatrix = Vcount.fit_transform(cranInp)
Query = "in summarizing theoretical and experimental work on the behaviour of a typical aircraft structure in a noise environment is it possible to develop a design procedure ."
queryVects = Vcount.transform(Query)
k = 50
cosMattf = cosine_similarity(queryVects,countMatrix)
how to get the list of top K ( k=50) document such as [12.txt,34.txt,89.txt,90.txt....45.txt] the size of the list is 50.
from most relevant to less relevant such as 12.txt has the lowest cosine distance and it is the most relevant document to the query.

Feature Importance extraction of Decision Trees (scikit-learn)

I've been trying to get a grip on the importance of features used in a decision tree i've modelled. I'm interested in discovering the weight of each feature selected at the nodes as well as the term itself. My data is a bunch of documents.
This is my code for the decision tree, I modified the code snippet from scikit-learn that extract (http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html):
from sklearn.feature_extraction.text import TfidfVectorizer
### Feature extraction
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords,
use_idf=True, tokenizer=None, ngram_range=(1,2))#ngram_range=(1,0)
tfidf_matrix = tfidf_vectorizer.fit_transform(data[:, 1])
terms = tfidf_vectorizer.get_features_names()
### Define Decision Tree and fit
dtclf = DecisionTreeClassifier(random_state=1234)
dt = data.copy()
y = dt["label"]
X = tfidf_matrix
fitdt = dtclf.fit(X, y)
from sklearn.datasets import load_iris
from sklearn import tree
### Visualize Devision Tree
with open('data.dot', 'w') as file:
tree.export_graphviz(dtclf, out_file = file, feature_names = terms)
file.close()
import subprocess
subprocess.call(['dot', '-Tpdf', 'data.dot', '-o' 'data.pdf'])
### Extract feature importance
importances = dtclf.feature_importances_
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print('Feature Ranking:')
for f in range(tfidf_matrix.shape[1]):
if importances[indices[f]] > 0:
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
print ("feature name: ", terms[indices[f]])
Am I correct in assuming that using terms[indices[f]] (which is the feature term vector ) will print the actual feature term used to split the tree at a certain node?
The decision tree visualised with GraphViz has for instance X[30], I'm assuming this is refers to the numerical interpretation of the feature term. How do I extract the term itself so I can validate the process I deployed in #1?
Updated code
fitdt = dtclf.fit(X, y)
with open(...):
tree.export_graphviz(dtclf, out_file = file, feature_names = terms)
Thanks in advance
For you first question you need to get the feature names out of the vectoriser with terms = tfidf_vectorizer.get_feature_names(). For your second question, you can you can call export_graphviz with feature_names = terms to get the actual names of your variables to appear in your visualisation (check out the full documentation of export_graphviz for many other options that may be useful for improving your visualisation.

Categories

Resources