I am trying to use k-means clustering to classify text documents. Is it possible to take a set of documents tfidf vectorize them and perform the computation then add more documents to be classified?
This is what I have so far
true_k = 4
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
How would I add more documents to X? Because I would like to pickle X and save it.
Actually this is pretty simple (controrary to the accepted answer, which suggests that this is complex - it is not). Just concatenate your data, and reuse the same vectorizer (if you create new one, or refit the old one, as suggested in the accepted answer, it will change its estimations and consequently you will get different feature spaces), thus you have to pickle it too
true_k = 4
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
now you get new data, documents2 and simply do
X2 = vectorizer.transform(documents2)
X = np.vstack((X, X2))
model.fit(X) # optimally you would start from the previous solution, but sklearn does not yet support it
However, remember that this assumes that your first batch of documents was already representative for the whole dataset. In other words, you will limit yourself to words from the first documents, and also the idf normalization will not be refitted. You could actually remove both limitations, but you would have to implement your own - online tfidf vectorizer, which can update its estimates. It is not hard to do, but you would have to (after each new batch of documents) also update previous ones (as idf part would change). Easier solution would be to instead keep only countvectorizer and update it, and compute "idf" part independently and apply it on top (just before kmeans).
The problem is that your X feature matrix if of the shape [n_docs, n_features]. Hence, if you create a new feature matrix with new documents you would have to make sure that the new feature matrix (X2) has exactly the same features as X. I cannot image a application where this is feasible.
But if you know that both have the same feature space, you can use scipy.sparse.vstack to append the new documents to your feature matrix:
from scipy.sparse import vstack
X = vstack((X, X2))
EDIT: To ensure the same feature space in X2, you can use the vocabulary keyword argument in TfidfVectorizer, e.g.:
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer = vectorizer.fit(documents)
X = vectorizer.transform(documents)
# do whatever with X
new_vectorizer = TfidfVectorizer(stop_words='english', vocabulary=vectorizer.vocabulary_)
X2 = vectorizer.fit_transform(new_documents)
X = vstack((X, X2))
That means, on top of saving X you also need to store vectorizer.vocabulary_.
Related
I have two files for e-mails some are spam and some are ham, I'm trying to train a classifier using Naive Bayes and then test it on a test set, I'm still trying to figure out how to do that
df = DataFrame()
train=data.sample(frac=0.8,random_state=20)
test=data.drop(train.index)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)
testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think it's the right way to do that and in addition to that, the last line is giving me an error.
ValueError: dimension mismatch
The idea behind CountVectorizer is that it creates a function that maps word counts to identical places in an array. For example this: a b a c might become [2, 1, 1]. When you call fit_transform it creates that index mapping A -> 0, B-> 1, C -> 2 and then applies that to create the vector of counts. Here you call fit_transform to create a count vectorizer for your training and then again for your testing set. Some words may be in your testing data and not your training data and these get added. To expand on the earlier example example, your test set might be d a b which would create a vector with dimension 4 to account for d. This is likely why the dimensions don't match.
To fix this don't use fit transform the second time so replace:
vectorizer.fit_transform(test['message'].values)
with:
vectorizer.transform(test['message'].values)
It is important to make your vectorizier from your training data not all of your data, which is tempting to avoid missing features. This makes your tests more accurate since when really using the model it will encounter unknown words.
This is no guarantee your approach will work but this is likely the source of the dimensionality issue.
If I use the TfidfVectorizer from sklearn to generate feature vectors as:
features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)
How would I then generate feature vectors to classify a new document? Since you cant calculate the tf-idf for a single document.
Would it be a correct approach, to extract the feature names with:
feature_names = TfidfVectorizer.get_feature_names()
and then count the term frequency for the new document according to the feature_names?
But then I won't get the weights that have the information of a words importance.
You need to save the instance of the TfidfVectorizer, it will remember the term frequencies and vocabulary that was used to fit it. It may make things clearer sense if rather than using fit_transform, you use fit and transform separately:
vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = fec.transform(myNewDocuments)
I would rather use gensim with a Latent Semantic Indexing as a wrapper over the original corpus: bow->tfidf->lsi
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Then if you need to continue the training:
new_tfidf = models.TfidfModel(corpus)
new_corpus_tfidf = new_tfidf[corpus]
lsi.add_documents(another_tfidf_corpus) # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space
Where corpus is bag-of-words
As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!
If you like sci-kit, gensim is also compatible with numpy
I've been learning how to use machine-learning classifiers lately and got started thinking if there was anything in sklearn that could take in either an array or a distribution for each i,j cell as training data? Does such a classification algorithm exist in scikit-learn? If so, how is it used? If not, can someone provide some insight into any algorithms that are known to handle this type of data?
Somebody asked kind of a similar question: https://stats.stackexchange.com/questions/178109/linear-regression-problem-with-multi-dimensional-vectors-instead-of-scalar-value#comment443880_178109 but it was for Regression and it was also never answered.
I tried using just a RandomForestClassifier but it didn't like the array instead of the scalar. If it's more a Bayesian problem, I would be keen on using PyMC3 but I don't even know what algorithms to look at to even start the process.
from sklearn.ensemble import RandomForestClassifier
# Create 2 Distinguishable Classes, 20 of each
# Where each column has a fixed size for the array (e.g. `attr_0`=3, `attr_1`=5, `attr_2`=2)
class_A = np.vstack([[np.random.normal(loc=0, scale=1,size=3),
np.random.normal(loc=5, scale=1,size=5),
np.random.normal(loc=10, scale=1,size=2)] for k in range(20)])
class_B = np.vstack([[np.random.normal(loc=15, scale=1,size=3),
np.random.normal(loc=20, scale=1,size=5),
np.random.normal(loc=30, scale=1,size=2)] for k in range(20)])
# Merge them
Ar_data = np.concatenate([class_A,class_B], axis=0)
X = pd.DataFrame(Ar_data, columns=["attr_0","attr_1","attr_2"])
# Create target vector
y = np.array(20*[0] + 20*[1])
# Test data
X_test = [np.random.normal(loc=0, scale=1,size=3),
np.random.normal(loc=5, scale=1,size=5),
np.random.normal(loc=10, scale=1,size=2)]
X_test
# [array([-0.15510844, 0.04567395, -0.66192602]),
# array([ 4.5412568 , 4.32526163, 4.56558114, 5.48178697, 5.2559264 ]),
# array([ 9.17293292, 10.19746434])]
I tried fitting a RandomForestClassifier but it didn't work :(
Mod_rf = RandomForestClassifier()
Mod_rf.fit(X,y)
# ValueError: setting an array element with a sequence.
I have a bunch of files with articles. For each article there should be some features, like: text length, text_spam (all are ints or floats, and in most cases they should be loaded from csv). And what I want to do is - to combine these features with CountVectorizer and then classify those texts.
I have watched some tutorials, but still I have no idea how to implement this stuff. Found something here, but can't actually implement this for my needs.
Any ideas how that could be done with scikit?
Thank you.
What I came across right now is:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
measurements = [
{'text_length': 1000, 'text_spam': 4.3},
{'text_length': 2000, 'text_spam': 4.1},
]
corpus = [
'some text',
'some text 2 hooray',
]
vectorizer = DictVectorizer()
count_vectorizer = CountVectorizer(min_df=1)
first_x = vectorizer.fit_transform(measurements)
second_x = count_vectorizer.fit_transform(corpus)
combined_features = FeatureUnion([('first', first_x), ('second', second_x)])
For this bunch of code I do not understand how to load "real"-data, since training sets are already loaded. And the second one - how to load categories (y parameter for fit function)?
You're misunderstanding FeatureUnion. It's supposed to take two transformers, not two batches of samples.
You can force it into dealing with the vectorizers you have, but it's much easier to just throw all your features into one big bag per sample and use a single DictVectorizer to make vectors out of those bags.
# make a CountVectorizer-style tokenizer
tokenize = CountVectorizer().build_tokenizer()
def features(document):
terms = tokenize(document)
d = {'text_length': len(terms), 'text_spam': whatever_this_means}
for t in terms:
d[t] = d.get(t, 0) + 1
return d
vect = DictVectorizer()
X_train = vect.fit_transform(features(d) for d in documents)
Don't forget to normalize this with sklearn.preprocessing.Normalizer, and be aware that even after normalization, those text_length features are bound to dominate the other features in terms of scale. It might be wiser to use 1. / text_length or np.log(text_length) instead.
And the second one - how to load categories (y parameter for fit function)?
Depends on how your data is organized. scikit-learn has a lot of helper functions and classes, but it does expect you to write code if your setup is non-standard.
I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifer on held out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
prob_per_class = (zip(clf.classes_, prob)
+ zip(classes_not_trained, repeat(0.)))
produces a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
prob_per_class = zip(clf.classes_, prob) + zip(classes_not_trained, repeat(0.))
# put the probabilities in class order
prob_per_class = sorted(prob_per_class)
new_prob.append(i[1] for i in prob_per_class)
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.