I always get an error when using TfidfVectorizer for k-means clustering. There are three cases.
First, I use the tokenizer parameter of TfidfVectorizer to customize the tokenization process for my dataset. Here is my code:
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize)
X = vectorizer.fit_transform(titles)
However, I got this error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
Second, I built a vocabulary consisting of the terms produced by the tokenization, so the code becomes:
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize,vocabulary=vocab)
but I got another error:
ValueError: Vocabulary contains repeated indices.
And lastly, I removed the tokenizer and vocabulary parameters. The code becomes:
vectorizer = TfidfVectorizer(stop_words=stops)
X = vectorizer.fit_transform(titles)
terms = vectorizer.get_feature_names()
true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print "Top terms per cluster:"
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :10]:
        print ' %s' % terms[ind],
    print
Well, the program runs successfully but the clustering results are like this:
Cluster 0: bangun, rancang, lunak, perangkat, aplikasi, berbasis, menggunakan, service, sistem, pembangunan,
Cluster 1: sistem, aplikasi, berbasis, web, menggunakan, pembuatan, mobile, informasi, teknologi, pengembangan,
Cluster 2: android, berbasis, aplikasi, perangkat, rancang, bangun, bergerak, mobile, sosial, menggunakan,
Cluster 3: implementasi, algoritma, menggunakan, klasifikasi, data, game, fuzzy, vector, support, machine,
Cluster 4: metode, menggunakan, video, penerapan, implementasi, steganografi, pengenalan, berbasis, file, analisis,
Cluster 5: citra, segmentasi, menggunakan, implementasi, metode, warna, tekstur, kembali, berwarna, temu,
Cluster 6: jaringan, tiruan, protokol, voip, syaraf, saraf, menggunakan, implementasi, kinerja, streaming,
Cluster 7: studi, kasus, its, informatika, teknik, sistem, informasi, data, surabaya, jurusan,
Some terms end up in multiple clusters; for example, the term data appears in both Cluster 3 and Cluster 7.
Can you tell me how to use TfidfVectorizer and KMeans properly? Your help is my happiness :)
I am quite new to programming and word2vec models, and I would really appreciate your help.
I have performed a BoW analysis on my data and obtained the top 100 most predictive words. Now I want to run those words through a pretrained word2vec model and organize them into different clusters using k-means. The main goal is to establish the most predictive clusters. However, when I do the clustering something goes wrong: the model gives me clusters containing single letters, even though my 100 words are actual words and my data does not contain single letters.
Below is a snippet of the code I am using (note that I followed a tutorial with similar steps):
text = np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]) #these are the extracted 100 predictive words from the BoW analysis
text = str(text)
tokenized_docs = word_tokenize(text)
tokenized_docs = list(tokenized_docs)
list_of_docs = text
#Trying to make clusters
def vectorize(list_of_docs, model=wv):
    """Generate vectors for a list of documents using a word embedding.

    Args:
        list_of_docs: List of documents
        model: Gensim's word embedding

    Returns:
        List of document vectors
    """
    features = []
    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in wv:
                try:
                    vectors.append(wv[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features
vectorized_docs = vectorize(tokenized_docs,model=wv)
len(vectorized_docs), len(vectorized_docs[0])
def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print silhouette metrics using MiniBatchKMeans.

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")
    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_
clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=10,
    mb=500,
    print_silhouette_values=True,
)
df_clusters = pd.DataFrame({
    "tokens": [" ".join(text) for text in tokenized_docs],
    "cluster": cluster_labels
})
print("Most representative terms per cluster (based on centroids):")
for i in range(10):
    tokens_per_cluster = ""
    most_representative = wv.most_similar(positive=[clustering.cluster_centers_[i]], topn=5)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")
Can you please help me determine where I am going wrong with this code? It seems to me that the model does not actually take those top 100 words into account and that they are not being run through the word2vec model.
I have a KMeans clustering script that organises some documents based on the contents of their text. The documents fall into one of 3 clusters, but the assignment seems very yes-or-no; I'd like to be able to see how relevant each document is to its cluster.
E.g. Document A is in Cluster 1 with a 90% match, Document B is in Cluster 1 but with only a 45% match.
That way I could set a threshold and keep only documents that match 80% or higher.
dict_of_docs = {'Document A':'some text content',...'Document Z':'some more text content'}
# Vectorizing the data, my data is held in a Dict, so I just want the values.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
X = X.toarray()
# 3 Clusters as I know that there are 3, otherwise use Elbow method
# Then add the vectorized data to the Vocabulary
NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X)
scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component
plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Print out top terms for each cluster
terms = vectorizer.get_feature_names()
for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    prediction = km.predict(Y)
    print(prediction, doc)
I don't believe it is possible to do exactly what you want because k-means is not really a probabilistic model and its scikit-learn implementation (which is what I'm assuming you're using) just doesn't provide the right interface.
One option I'd suggest is to use the KMeans.score method, which does not provide a probabilistic output but provides a score that is larger the closer a point is to the closest cluster. You could threshold by this, such as by saying "Document A is in cluster 1 with a score of -.01 so I keep it" or "Document B is in cluster 2 with a score of -1000 so I ignore it".
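For example, building on the loop at the end of your snippet, a rough sketch (the threshold value here is an arbitrary placeholder you would tune on your own data):
# Rough sketch: score each document against the fitted KMeans model.
# For a single row, km.score returns the negative squared distance to the
# closest centroid, so values nearer to 0 mean a tighter fit.
SCORE_THRESHOLD = -0.9  # arbitrary example value, tune for your data

for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    cluster = km.predict(Y)[0]
    score = km.score(Y)
    if score >= SCORE_THRESHOLD:
        print('keep', doc, 'cluster', cluster, 'score', score)
    else:
        print('drop', doc, 'cluster', cluster, 'score', score)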
Another option is to use the GaussianMixture model instead. A Gaussian mixture is a very similar model to k-means, and it provides the probabilities you want via GaussianMixture.predict_proba.
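A minimal sketch of that, assuming X is the dense tf-idf matrix from your snippet (with high-dimensional tf-idf you may need a diagonal covariance, as below, or to fit on a reduced representation such as your PCA output):
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=NUMBER_OF_CLUSTERS, covariance_type='diag', random_state=0)
gmm.fit(X)

probs = gmm.predict_proba(X)        # shape (n_documents, n_components)
labels = probs.argmax(axis=1)       # hard cluster assignment
keep = probs.max(axis=1) >= 0.8     # only documents that are at least 80% "in" their cluster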
Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9],100[1-9]] (the second pattern); however, it returns [25084, 25084]. I realize that it's trying to find the best value given ALL the inputs, but I'm trying to have it distinguish certain patterns within the set, if that makes sense.
This is the example I'm working from:
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet,UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random
ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
    rand1 = random.randint(1,9)
    rand2 = random.randint(1,9)
    pattern_one_0 = int('2000'+str(rand1))
    pattern_one_1 = int('2000'+str(rand2))
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand2))
    ds.addSample((pattern_one_0,pattern_one_1),(0)) # pattern 1, maps to 0
    ds.addSample((pattern_two_0,pattern_two_1),(1)) # pattern 2, maps to 1
unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer,bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)
ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand1))
    ts.addSample((pattern_two_0, pattern_two_1)) # adding first part of pattern 2 to unseen data
a = [int(i) for i in net.activateOnDataset(ts)[0]] # should map to 1
unsupervised_results.append(a[0])
print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set
Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates, to a certain degree, the function you sampled and provided in the dataset. Hence the result is some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description, I am assuming you have a clearly defined set of patterns that you are looking for. Then you must map your patterns to a categorical variable. For instance, the pattern 1 you mention, (200[1-9], 200[1-9]),(400[1-9],400[1-9]), would be 0, pattern 2 would be 1, and so on.
Then, you train the network to output the class (0,1,...) to which the input pattern belongs.
Arguably, given the structure of your patterns, rule-based classification is probably more adequate than ANNs.
Concerning the amount of data, you need much more of it. Typically, the most basic approach is to split the dataset into two groups (70-30, for instance). You use 70% of the samples for training, and the remaining 30% as unseen data (test data) to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.
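A minimal sketch of that classification setup in pybrain; the hidden-layer size, learning rate and epoch count are illustrative guesses rather than tuned values, the raw inputs are left unscaled to mirror your question, and the train/test sets are simply generated separately in a 70/30 ratio:
from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
from pybrain.utilities import percentError
import random

def make_dataset(n_samples):
    # Each sample is labelled with its class: pattern 1 -> 0, pattern 2 -> 1.
    ds = ClassificationDataSet(2, 1, nb_classes=2)
    for _ in range(n_samples):
        r1, r2 = random.randint(1, 9), random.randint(1, 9)
        ds.addSample((int('2000' + str(r1)), int('2000' + str(r2))), [0])
        ds.addSample((int('9000' + str(r1)), int('9000' + str(r2))), [1])
    ds._convertToOneOfMany()  # one output unit per class, as expected by a softmax output
    return ds

train_ds = make_dataset(700)   # ~70% of the samples for training
test_ds = make_dataset(300)    # ~30% kept as unseen test data

net = buildNetwork(train_ds.indim, 5, train_ds.outdim, outclass=SoftmaxLayer, bias=True)
trainer = BackpropTrainer(net, dataset=train_ds, learningrate=0.01, momentum=0.1)
trainer.trainEpochs(50)

# Fraction of misclassified test samples; the predicted class is the arg-max output unit.
print 'test error: %.2f%%' % percentError(trainer.testOnClassData(dataset=test_ds), test_ds['class'])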
I have used scikit CountVectorizer to convert collection of documents into matrix of token counts. I have also used its max_features which considers the top max_features ordered by term frequency across the corpus.
Now I want to analyse my selected corpus; in particular, I want to know the frequency of the tokens in the selected vocabulary. I am unable to find an easy way to do this, so kindly help me in this regard.
When you call fit_transform() a sparse matrix will be returned.
To display it you simply have to call the toarray() method.
vec = CountVectorizer()
spars_mat = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
# you can observe the matrix in the interpreter by doing
spars_mat.toarray()
With the help of @bernard's post, I was able to get the complete result, which is as follows:
vec = CountVectorizer()
doc_term_matrix = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
doc_term_matrix = doc_term_matrix.toarray()
term_freq_matrix = doc_term_matrix.sum(0)
min_freq = np.amin(term_freq_matrix)
indices_name_mapping = vec.get_feature_names()
feature_names = [indices_name_mapping[i] for i, x in enumerate(term_freq_matrix) if x == min_freq]
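If the goal is simply the frequency of every token in the fitted vocabulary, a small sketch building on the same variables is to pair each feature name with its column sum:
# Map every feature name to its total count across the corpus.
term_frequencies = dict(zip(vec.get_feature_names(), term_freq_matrix))
print(term_frequencies)  # e.g. {'tata': 1, 'titi': 2, 'toto': 3}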
I am trying text classification using a Naive Bayes text classifier.
My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know SVM would be a better option here, but I want to go with Naive Bayes using the sklearn library.
{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}
This is what I have tried so far:
import numpy as np
import json
from sklearn.naive_bayes import *
topic = []
question = []
excerpt = []
with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])
unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]
x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1,x2])
Y = np.array(numeric_topics)
print X[0]
clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( ['hello'] )
But, as expected, I am getting ValueError: could not convert string to float. My question is: how can I create a simple classifier that classifies the question and excerpt into the related topic?
All classifiers in sklearn require input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.
vect = TfidfVectorizer()
X = vect.fit_transform(X)
Obviously, you'll need to vectorize your test set in the same way:
clf.predict( vect.transform(['hello']) )
See a tutorial on using sklearn with textual data.
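For a fuller picture, here is a minimal end-to-end sketch using your topic/question/excerpt lists. It is an illustration rather than your exact setup: MultinomialNB is swapped in for BernoulliNB because it is the usual choice for count/tf-idf features, LabelEncoder replaces the manual topic-to-number mapping, and the query string at the end is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Combine question and excerpt into a single text per record.
texts = [q + ' ' + e for q, e in zip(question, excerpt)]

# Encode topic strings as integer labels instead of the manual replace() chain.
encoder = LabelEncoder()
y = encoder.fit_transform(topic)

# Vectorizer and classifier chained together, so fitting and predicting stay consistent.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, y)

predicted = model.predict(['how do I wire a new outlet'])  # made-up example query
print(encoder.inverse_transform(predicted))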