I am trying to do a classification exercise on email docs (strings containing words).
I defined the distance function as following:
def distance(wordset1, wordset2):
    if len(wordset1) < len(wordset2):
        return len(wordset2) - len(wordset1)
    elif len(wordset1) > len(wordset2):
        return len(wordset1) - len(wordset2)
    elif len(wordset1) == len(wordset2):
        return 0
However, the accuracy in the end is pretty low (0.8). I guess this is because of the inaccurate distance function. How can I improve the function? Or what are other ways to calculate the "distance" between email docs?
One common measure of similarity for use in this situation is the Jaccard similarity. It ranges from 0 to 1, where 0 indicates complete dissimilarity and 1 means the two documents contain exactly the same words. It is defined as
wordSet1 = set(wordSet1)
wordSet2 = set(wordSet2)
sim = len(wordSet1.intersection(wordSet2))/len(wordSet1.union(wordSet2))
Essentially, it is the ratio of the size of the intersection of the word sets to the size of their union. This helps control for emails of different lengths while still giving a good measure of similarity.
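If you want to plug this directly into your classifier as a distance, a minimal sketch (assuming wordset1 and wordset2 are iterables of words, as in your code) is the Jaccard distance, i.e. one minus the similarity:
def distance(wordset1, wordset2):
    wordset1, wordset2 = set(wordset1), set(wordset2)
    # Jaccard distance = 1 - Jaccard similarity; 0 means identical word sets.
    return 1 - len(wordset1 & wordset2) / float(len(wordset1 | wordset2))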
You didn't mention the type of wordset1 and wordset2. I'll assume they are both strings.
You defined your distance as the difference in word counts and got a bad score. It's clear that text length is not a good dissimilarity measure: two emails of different sizes can talk about the same thing, while two emails of the same size can talk about completely different things.
So, as suggested above, you could try and check for SIMILAR WORDS instead:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())
    common_words = wordset1 & wordset2
    if common_words:
        return 1 / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
The problem with that is that two big emails are more likely to share words than two small ones, and this metric would favor those, making them "more similar to each other" in comparison to the small ones. How do we solve this? Well, we normalize the metric somehow:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())
    common_words = wordset1 & wordset2
    if common_words:
        # The distance, normalized by the total
        # number of different words in the emails.
        return 1 / len(common_words) / (len(wordset1 | wordset2))
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
This seems cool, but completely ignores the FREQUENCY of the words. To account for this, we can use the bag-of-words model: create a list of all possible words and histogram their appearances in each document. Let's use the CountVectorizer implementation from scikit-learn to make our job easier:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distance(wordset1, wordset2):
    model = CountVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()
    # Euclidean distance between the two bags of words.
    return np.linalg.norm(X[0] - X[1])
But now consider two pairs of emails. The emails in the first pair are composed of perfectly written English, full of "small" words (e.g. a, an, is, and, that) necessary for them to be grammatically correct. The emails in the second pair are different: they contain only the keywords and are extremely dry. You see, chances are the first pair will look more similar than the second one. That happens because we are currently treating all words the same, while we should be prioritizing the MEANINGFUL words in each text. To do that, let's use term frequency–inverse document frequency. Luckily, there's a ready-made implementation in scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

def distance(wordset1, wordset2):
    model = TfidfVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()
    similarity_matrix = X.dot(X.T)
    # The dissimilarity between samples wordset1 and wordset2.
    return 1 - similarity_matrix[0, 1]
Read more about this in this question.
You should now have a fairly good accuracy. Try it out. If it's still not as good as you want, then we have to go deeper... (get it? Because... Deep-learning). The first thing is that we need either a dataset to train over or an already trained model. That's required because networks have many parameters that MUST be adjusted in order to provide useful transformations.
What's been missing so far is UNDERSTANDING. We histogrammed the words, stripping them of any context or meaning. Instead, let's keep them where they are and try to recognize blocks of patterns. How can this be done?
Embed the words into numbers, which will deal with the different sizes of words.
Pad every number (word embedding) sequence to a single length.
Use convolutional networks to extract meaningful features from the sequences.
Use fully-connected networks to project the features extracted to a space that minimizes the distance between similar emails and maximizes the distance between non-similar ones.
Let's use Keras to simplify our lives. It should look something like this:
# ... imports and params definitions

model = Sequential()
model.add(Embedding(max_features, embedding_dims,
                    input_length=maxlen, dropout=0.2))
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# Pool over the whole sequence; model.output_shape is only available
# because the previous layers have already been added.
model.add(MaxPooling1D(pool_length=model.output_shape[1]))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
# ... train or load model weights.

def distance(wordset1, wordset2):
    global model
    # X = ...  # Embed both emails.
    X = sequence.pad_sequences(X, maxlen=maxlen)
    y = model.predict(X)
    # Euclidean distance between emails.
    return np.linalg.norm(y[0] - y[1])
There's a practical example of sentence processing that you can check in the Keras GitHub repo. Also, someone solves this exact same problem using a siamese recurrent network in this Stack Overflow question.
Well, I hope this gives you some direction. :-)
I have trained several word2vec models using gensim for different languages, but the vector size is different for each of them.
The vectors are obtained like this:
vec_sp = word_vectors_sp.get_vector("uno")
How can I use vec_sp as input for a different model with a different vector size:
word_vectors_en.most_similar(positive=[vec_sp], topn=1)
to obtain the corresponding word in the second model?
If the models were trained separately, even if they had the same number of dimensions, the vectors wouldn't be meaningfully comparable.
It is only the interleaved tug-of-war of training, between words that are being learned for the same model from a mix of varied contexts, that causes their end positions to have meaningful distances.
For example, even if both models are for the same language and include many similar text examples, the word 'apple' in one and the word 'apple' in the other could wind up in arbitrarily different final positions, thanks to both random initialization and the heavy use of randomization during the algorithm's operation. The distance/direction between these positions is essentially meaningless. The only consistency that should be expected is that, given training material of roughly similar quality/sufficiency, the word's neighbors should be very similar.
If two models do contain many of the same words, there is a possibility to separately learn a "translation" between the two spaces, in a separate optimization process. It takes a large number of shared anchor words, learns a mathematical transformation that does a fair job of moving words from one coordinate space to another, and then that same transformation can be applied to words that aren't in both models.
This technique has had some success in machine translation, suggesting similar words in another language, and there's some example implementing code in the gensim library's TranslationMatrix class:
https://radimrehurek.com/gensim/models/translation_matrix.html
(It's usually used between models of the same dimensionality but it might work more generally.)
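As a rough sketch of how that class is used, based on the gensim documentation linked above (word_pairs is a tiny hypothetical list of anchor pairs here; in practice you need many of them, and argument names may differ slightly between gensim versions):
from gensim.models.translation_matrix import TranslationMatrix

# Anchor words known in both models (hypothetical Spanish -> English pairs).
word_pairs = [("uno", "one"), ("dos", "two"), ("tres", "three")]

trans = TranslationMatrix(word_vectors_sp, word_vectors_en, word_pairs=word_pairs)
trans.train(word_pairs)
# Map a source-space word into the target space and list its nearest neighbour.
print(trans.translate(["uno"], topn=1))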
The systematic approach to the problem of being able to compare across n different embedded vector spaces with different dimensions d_1,...d_n is to reduce the dimensionality of the vectors in each space to a value m where m < min(d_1,...d_n).
There are many ways of doing this. By the Johnson-Lindenstrauss lemma, you could apply random projections in each space separately, i.e.,
choose a random projection matrix R_i of size m x d_i for each set of vectors X_i of size d_i x N (assuming each space has N vectors) and then compute
X'_i = R_i * X_i, whose size is (m x d_i) * (d_i x N) = m x N.
After applying this transformation in each space, you end up with n spaces, each of dimension m, which means you can compute dot products (and distances) between vectors across them.
One more approach for dimensionality reduction is to use PCA. Python's sklearn provides implementations for both random projections and PCA.
In terms of a concrete example, if you have two vector spaces of 100 and 200 dimensions each with 100,000 vectors then reduce each to 20 dimensions (arbitrarily chosen) by PCA or random projection. You would then be able to compare these 20 dimensional vectors by computing distances or inner products.
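As a concrete sketch of that reduction step with scikit-learn (random data stands in for the two embedding matrices, and the sizes are just illustrative):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

X1 = np.random.randn(10000, 100)   # vectors from a 100-dimensional space
X2 = np.random.randn(10000, 200)   # vectors from a 200-dimensional space
m = 20                             # common target dimensionality, m < min(100, 200)

# Johnson-Lindenstrauss style random projection, fitted per space.
X1_rp = GaussianRandomProjection(n_components=m).fit_transform(X1)
X2_rp = GaussianRandomProjection(n_components=m).fit_transform(X2)

# Or PCA, likewise per space.
X1_pca = PCA(n_components=m).fit_transform(X1)
X2_pca = PCA(n_components=m).fit_transform(X2)

# Both sets now live in 20 dimensions, so dot products and distances are computable.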
I have a set of documents (3000) which each contain a short description. I want to use Word2Vec model to see if I can cluster these documents based on the description.
I'm doing it in the following way, but I am not sure if this is a "good" way to do it. Would love to get feedback.
I'm using Google's trained w2v model.
wv = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary=True,encoding="ISO-8859-1", limit = 100000)
Each document is split into words where stop words are removed, and I have used stemming as well.
My initial idea was to fetch the word vector for each word in each documents description, average it, and then cluster based on this.
import numpy as np

doc2vecs = []
for i in range(0, len(documents_df['Name'])):
    vec = np.zeros(300)
    for j in range(0, len(documents_df['Description'][i])):
        if documents_df['Description'][i][j] in wv:
            vec += wv[documents_df['Description'][i][j]]
    doc2vecs.append(vec / 300)
I'm then finding similarities using
similarities = squareform(pdist(doc2vecs, 'cosine'))
which returns a matrix of the pairwise cosine distances between the document vectors.
I then try to cluster the documents.
num_clusters = 2
km = cluster.KMeans(n_clusters=num_clusters)
km.fit(doc2vecs)
So basically what I am wondering is:
Is this method of clustering the average word vector for each word in the document a reasonable way to cluster the documents?
In 2019, unless you have serious resource constraints, you don't need to vectorize documents by averaging word embeddings. You can use Universal Sentence Encoder to vectorize documents in a few lines of code.
Most clustering algorithms do better in low dimensions, so from here you want to do dimensionality reduction, then clustering. AFAIK, you'll get the best results from UMAP. Their docs explain how to do this very clearly.
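A rough sketch of that pipeline, assuming TensorFlow 2 with tensorflow_hub and the umap-learn package are installed, and that raw_descriptions is a hypothetical list of the untokenized description strings (USE works on raw text, so no stemming or stop-word removal is needed):
import tensorflow_hub as hub
import umap
from sklearn.cluster import KMeans

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(raw_descriptions).numpy()   # shape: (n_docs, 512)

# Reduce dimensionality before clustering, as suggested above.
reduced = umap.UMAP(n_components=5, metric='cosine').fit_transform(embeddings)
labels = KMeans(n_clusters=2).fit_predict(reduced)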
I have 9000 unlabeled article samples that I want to label with a binary class (0 or 1).
Additionally, I have 500 labeled samples belonging to the positive class (label=1) and no samples for the negative class (label=0).
I know it's impossible to label the 9000 samples with 0 and 1 using a model trained only on the 500 positive samples.
So I would like to implement a "similarity" approach: classify the 9000 samples based on their word similarity to the 500 positive samples, extract the similar ones and label them 1, and label the rest of the 9000 as class 0.
So the question is: is it possible to filter the data this way? If so, how can I filter it by word similarity in Python?
Thank you for your answer, I hope I find a solution. :)
Yes, it is possible. You can use doc2vec (I suggest the gensim library for Python) to build a vector space from your 500 positive documents. Using that representation, you can query the similarity between a new sample (one of your 9000 samples) and your corpus of 500 samples. If you consider the similarity "similar enough", you can label it as 1.
For a nice tutorial and code refer to:
https://markroxor.github.io/gensim/static/notebooks/doc2vec-IMDB.html
You can skip the "Predictive Evaluation Methods" section.
Probably the most interesting section for you is "Do close documents seem more related than distant ones?".
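A minimal sketch of that similarity query with gensim's Doc2Vec (positive_docs and unlabeled_docs are hypothetical lists of raw strings, the 0.6 threshold is arbitrary, and the docvecs/dv attribute name varies by gensim version):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(positive_docs)]
model = Doc2Vec(corpus, vector_size=100, epochs=20, min_count=2)

labels = []
for doc in unlabeled_docs:                       # the 9000 samples
    vec = model.infer_vector(doc.split())
    sims = model.dv.most_similar([vec], topn=1)  # [(tag, cosine similarity)]
    labels.append(1 if sims[0][1] > 0.6 else 0)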
EDIT (answering the comment): yes, I used the code some time ago (I don't remember too well whether I had errors). The implementation of the code that I used is below. Please consider that I used a machine with 8 cores.
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils
from tqdm import tqdm

def labelize_tweets_ug(tweets, label):
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result

# all_x is a list of tweets
all_x_w2v = labelize_tweets_ug(all_x, 'all')
cores = multiprocessing.cpu_count()

model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2,
                         workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])

for epoch in range(30):
    model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]),
                        total_examples=len(all_x_w2v), epochs=1)
    model_ug_cbow.alpha -= 0.002
    model_ug_cbow.min_alpha = model_ug_cbow.alpha
I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, clothes, etc.
However, I'm having trouble getting DBSCAN to output meaningful results. It seems to label almost everything in the "0" group (colored teal in the images). As I increase epsilon, the "0" group takes over everything. Here are screenshots with epsilon=10, and epsilon=12.5. With epsilon=20, almost everything is in the same group.
I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered at eps=10). I would also expect more on the order of 100 clusters, as opposed to 5-12 clusters, and to be able to control the size and number of the clusters using epsilon.
A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I know what a good clustering algorithm for my data is?
Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?
What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
Here's the code I'm using to perform DBSCAN:
import sys
import gensim
import json
from optparse import OptionParser
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# snip option parsing
model = gensim.models.Word2Vec.load(options.file)
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])
db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)
output = [{'word': w, 'label': np.asscalar(l), 'isCore': i in core_indices} for i, (l, w) in enumerate(zip(labels, words))]
print(json.dumps(output))
I'm having the same problem and tried these solutions; I'm posting them here hoping they can help you or someone else:
Adapt the min_samples value in DBSCAN to your problem; in my case the default value, 4, was too high, since some clusters could also be formed by just 2 words.
Obviously, starting from a better corpus could be the solution to your problem: if the model is badly initialized, it won't perform well.
Perhaps DBSCAN is not the best choice; I am also trying K-Means for this problem.
Iterating over the model creation also helped me understand which parameters to choose:
import operator

import numpy as np
from sklearn.cluster import DBSCAN

for eps in np.arange(0.1, 50, 0.1):
    dbscan_model = DBSCAN(eps=eps, min_samples=3, metric_params=None,
                          algorithm="auto", leaf_size=30, p=None, n_jobs=1)
    labels = dbscan_model.fit_predict(mat_words)
    clusters = {}
    for i, w in enumerate(words_found):
        clusters[w] = labels[i]
    dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = len([lab for lab in labels if lab == -1])
    print("EPS: ", eps, "\tClusters: ", n_clusters, "\tNoise: ", n_noise)
As far as I can tell from the various visualizations of word2vec, the vectors probably won't cluster well.
First of all, there is nothing in the word2vec objective that would encourage clustering. On the contrary, it optimizes words to resemble the neighbors, so nearby words will get similar vectors. That is necessary for the word substitution aim.
Secondly, based on the plots, I am not sure there are "dense" regions separated by areas of low density in there. Instead, the data usually looks more like one big blob. But when almost all the vectors are in that big blob, they will almost all be in the same cluster!
Last but not least, most words probably don't cluster. Yes, numbers will likely cluster. You'd expect verbs to cluster vs. nouns, but "to bear" and "a bear" is the same to word2vec, and so is "bar" (verb and noun) etc. - there are too many polysemies for such clusters to be well separated even if the embedding were perfect!
Your best bet is to increase min_samples (minPts) and lower epsilon until most data is noise, and you find some remaining clusters.
I'm new to SVMs, and I'm trying to use the Python interface to libsvm to classify a sample containing a mean and stddev. However, I'm getting nonsensical results.
Is this task inappropriate for SVMs or is there an error in my use of libsvm? Below is the simple Python script I'm using to test:
#!/usr/bin/env python
# Simple classifier test.
# Adapted from the svm_test.py file included in the standard libsvm distribution.
from collections import defaultdict
from svm import *
# Define our sparse data formatted training and testing sets.
labels = [1,2,3,4]
train = [ # key: 0=mean, 1=stddev
{0:2.5,1:3.5},
{0:5,1:1.2},
{0:7,1:3.3},
{0:10.3,1:0.3},
]
problem = svm_problem(labels, train)
test = [
({0:3, 1:3.11},1),
({0:7.3,1:3.1},3),
({0:7,1:3.3},3),
({0:9.8,1:0.5},4),
]
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn, kt in zip(kname, kernels):
    print kt
    param = svm_parameter(kernel_type=kt, C=10, probability=1)
    model = svm_model(problem, param)
    for test_sample, correct_label in test:
        pred_label, pred_probability = model.predict_probability(test_sample)
        correct[kn] += pred_label == correct_label

# Show results.
print '-'*80
print 'Accuracy:'
for kn, correct_count in correct.iteritems():
    print '\t', kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))
The domain seems fairly simple. I'd expect that if it's trained to know a mean of 2.5 means label 1, then when it sees a mean of 2.4, it should return label 1 as the most likely classification. However, each kernel has an accuracy of 0%. Why is this?
A couple of side notes, is there a way to hide all the verbose training output dumped by libsvm in the terminal? I've searched libsvm's docs and code, but I can't find any way to turn this off.
Also, I had wanted to use simple strings as the keys in my sparse dataset (e.g. {'mean':2.5,'stddev':3.5}). Unfortunately, libsvm only supports integers. I tried using the long integer representation of the string (e.g. 'mean' == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that'll be a pain since I'll have to maintain and persist a second file along with the serialized classifier. Does anyone see an easier way?
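For the string-key workaround you describe, a tiny sketch of the mapping (names are just illustrative):
# Maintain the string -> integer mapping in one place and persist it alongside the model.
FEATURE_INDEX = {'mean': 0, 'stddev': 1}

def to_libsvm_sample(features):
    # e.g. {'mean': 2.5, 'stddev': 3.5} -> {0: 2.5, 1: 3.5}
    return dict((FEATURE_INDEX[name], value) for name, value in features.items())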
The problem seems to be coming from combining multiclass prediction with probability estimates.
If you configure your code not to make probability estimates, it actually works, e.g.:
<snip>
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn, kt in zip(kname, kernels):
    print kt
    param = svm_parameter(kernel_type=kt, C=10)  # Here -> rm probability = 1
    model = svm_model(problem, param)
    for test_sample, correct_label in test:
        # Here -> change predict_probability to just predict
        pred_label = model.predict(test_sample)
        correct[kn] += pred_label == correct_label
</snip>
With this change, I get:
--------------------------------------------------------------------------------
Accuracy:
polynomial 1.000000 (4 of 4)
rbf 1.000000 (4 of 4)
linear 1.000000 (4 of 4)
Prediction with probability estimates does work if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.
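For reference, a sketch of that "double up the training set" workaround, reusing the names from your script (this mirrors what's described above, nothing more):
# Each training point is included twice so probability estimates can be fitted.
problem2 = svm_problem(labels * 2, train * 2)
param = svm_parameter(kernel_type=RBF, C=10, probability=1)
model = svm_model(problem2, param)
pred_label, pred_probability = model.predict_probability(test[0][0])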
If you are interested in a different way of doing this, you could do the following. This approach is theoretically more sound, though not as straightforward.
By mentioning mean and std, it seems you are referring to data that you assume to be distributed in some way, e.g. that the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and use something like k-nearest neighbours to classify.
For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric - so in order to have a proper distance measure, you can use
distance(p1, p2) = KL(p1, p2) + KL(p2, p1)
For Gaussians, KL(p1, p2) = ((μ1 - μ2)² + σ1² - σ2²) / (2σ2²) + ln(σ2/σ1). (I stole that from here, where you can also find a derivation. :)
Long story short:
Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find the d in D for which distance(d, p) is minimal and return the class of that d.
To me this feels better than the SVM approach with several kernels, since the way of classifying is not so arbitrary.
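A short sketch of that nearest-neighbour classifier, using the closed-form Gaussian KL above (the toy data reuses the numbers from the question; the labels are hypothetical class ids):
import math

def kl_gauss(p1, p2):
    # KL(p1 || p2) for univariate Gaussians, p = (mean, std), per the formula above.
    mu1, s1 = p1
    mu2, s2 = p2
    return ((mu1 - mu2) ** 2 + s1 ** 2 - s2 ** 2) / (2 * s2 ** 2) + math.log(s2 / s1)

def sym_kl_distance(p1, p2):
    # Symmetrised KL, as defined above.
    return kl_gauss(p1, p2) + kl_gauss(p2, p1)

def classify(train, p):
    # train: list of ((mean, std), class); p: (mean, std) to classify.
    best = min(train, key=lambda item: sym_kl_distance(item[0], p))
    return best[1]

train = [((2.5, 3.5), 1), ((5, 1.2), 2), ((7, 3.3), 3), ((10.3, 0.3), 4)]
print(classify(train, (3, 3.11)))   # -> 1
print(classify(train, (9.8, 0.5)))  # -> 4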