I have already a classifier trained that I load up through pickle.
My main doubt is if there is anything that can speed up the classification task. It is taking almost 1 minute for each text (feature extraction and classification), is that normal? Should I go on multi-threading?
Here some code fragments to see the overall flow:
for item in items:
review = ''.join(item['review_body'])
review_features = getReviewFeatures(review)
normalized_predicted_rating = getPredictedRating(review_features)
item_processed['rating'] = str(round(float(normalized_predicted_rating),1))
def getReviewFeatures(review, verbose=True):
text_tokens = tokenize(review)
polarity = getTextPolarity(review)
subjectivity = getTextSubjectivity(review)
taggs = getTaggs(text_tokens)
bigrams = processBigram(taggs)
freqBigram = countBigramFreq(bigrams)
sort_bi = sortMostCommun(freqBigram)
adjectives = getAdjectives(taggs)
freqAdjectives = countFreqAdjectives(adjectives)
sort_adjectives = sortMostCommun(freqAdjectives)
word_features_adj = list(sort_adjectives)
word_features = list(sort_bi)
features={}
for bigram,freq in word_features:
features['contains(%s)' % unicode(bigram).encode('utf-8')] = True
features["count({})".format(unicode(bigram).encode('utf-8'))] = freq
for word,freq in word_features_adj:
features['contains(%s)' % unicode(word).encode('utf-8')] = True
features["count({})".format(unicode(word).encode('utf-8'))] = freq
features["polarity"] = polarity
features["subjectivity"] = subjectivity
if verbose:
print "Get review features..."
return features
def getPredictedRating(review_features, verbose=True):
start_time = time.time()
classifier = pickle.load(open("LinearSVC5.pickle", "rb" ))
p_rating = classifier.classify(review_features) # in the form of "# star"
predicted_rating = re.findall(r'\d+', p_rating)[0]
predicted_rating = int(predicted_rating)
best_rating = 5
worst_rating = 1
normalized_predicted_rating = 0
normalized_predicted_rating = round(float(predicted_rating)*float(10.0)/((float(best_rating)-float(worst_rating))+float(worst_rating)))
if verbose:
print "Get predicted rating..."
print "ML_RATING: ", normalized_predicted_rating
print("---Took %s seconds to predict rating for the review---" % (time.time() - start_time))
return normalized_predicted_rating
NLTK is a great tool and a good starting point for Natural Language Processing, but it's sometimes not very useful if speed is important as the authors implicitly said:
NLTK has been called βa wonderful tool for teaching, and working in, computational linguistics using Python,β and βan amazing library to play with natural language.β
So if your problem only lies in the speed of the classifier of the toolkit you have to use another ressource or you have to write the classifier by yourself.
Scikit might be helpful for you if you want to use a classifier which is probably faster.
It seems that you use a dictionary to build the feature vector. I strongly suspect that the problem is there.
The proper way would be using a numpy ndarray, with examples on rows and features on columns. So, something like
import numpy as np
# let's suppose 6 different features = 6-dimensional vector
feats = np.array((1, 6))
# column 0 contains polarity, column 1 subjectivity, and so on..
feats[:, 0] = polarity
feats[:, 1] = subjectivity
# ....
classifier.classify(feats)
Of course, you must use the same data structure and respect the same convention during training.
Related
I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
search_terms = "my query"
documents = ["my","list","of","docs"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file". So that, when I run a query, the only thing left to do is to vectorize the few words from the query, and then to calculate similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents and only transforming your search terms. Here is your code, modified accordingly and using the twenty_newsgroups documents (11k) in its place. You can run it as a script and interactively verify you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
news = fetch_20newsgroups()
search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data
vectorizer = TfidfVectorizer()
# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)
# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
if __name__ == "__main__":
while True:
query_str = input("\n\n\n\nquery string (return to quit): ")
if not query_str:
print("bye!")
break
search_term_vector = vectorizer.transform([query_str])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
best_idx = np.argmax(cosine_similarities)
best_score = cosine_similarities[best_idx]
best_doc = documents[best_idx]
if best_score < 0.1:
print("no good matches")
else:
max_doc = documents[np.argmax(cosine_similarities)]
print(
f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...",
)
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm find the best match(es) at best in O(n_documents) time, compared to Lucene (powers Elasticsearch) that uses skip lists that can search in O(log(n_documents)). Production search engines also have quiet a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
I am a PhD researcher and started using word2vec for my research. I just want to use it for calculating sentence similarity. I searched and found few links but I couldn't run those. I was looking at the following:
import numpy as np
from scipy import spatial
index2word_set = set(model.wv.index2word)
def avg_feature_vector(sentence, model, num_features, index2word_set):
words = sentence.split()
feature_vec = np.zeros((num_features, ), dtype='float32')
n_words = 0
for word in words:
if word in index2word_set:
n_words += 1
feature_vec = np.add(feature_vec, model[word])
if (n_words > 0):
feature_vec = np.divide(feature_vec, n_words)
return feature_vec
s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model,num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
unfortunately I couldn't run this as I don't know how I can find "index2word_set". In addition, should I assign model= ? Or, are there any easy commands or instruction to implement this?
assign model to the model that you generated or any predefined word2vec model that you want to use
and as for index2word_set, you can set it to model.wv
It should work then.
I have read a description, how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turn them on and off. And I don't know how to do it.
What I have so far is that I define a class, where I will be able to turn on and off the features and see if it brings something (for example, all unigrams and 20 most frequent unigrams, it could be then 10 most frequent adjectives, tf-idf). But for now I don't understand how to combine them together.
The code looks like this, and in the function part I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem to be really helpful the way I do it):
class FeatureGen: #for example, feat = FeatureGen(unigrams = False) creates feature set without the turned off feature
def __init__(self, unigrams = True, unigrams_freq = True)
self.unigrams = unigrams
self.unigrams_freq = unigrams_freq
def get_features(self, input):
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
tokens = input["token"]
if self.unigrams:
train_data_features = vectorizer.fit_transform(tokens)
return train_data_features
What should I do to add one more feature possibility? Like contains 10 most frequent words.
if self.unigrams
train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
#something else
return features #and this should be a combination somehow
Looks like you need np.hstack
However you need each features array to have one row per training case.
I am trying to get Highest frequency terms out of vectors in scikit-learn.
From example It can be done using this for each Categories but i want it for each files inside categories.
https://github.com/scikit-learn/scikit-learn/blob/master/examples/document_classification_20newsgroups.py
if opts.print_top10:
print "top 10 keywords per class:"
for i, category in enumerate(categories):
top10 = np.argsort(clf.coef_[i])[-10:]
print trim("%s: %s" % (
category, " ".join(feature_names[top10])))
I want to do this for each files from testing dataset instead of each categories.
Where should i be looking?
Thanks
EDIT: s/discrimitive/highest frequency/g (Sorry for the confusions)
You can use the result of transform together with get_feature_names to obtain the term counts for a given document.
X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())
terms_for_first_doc = zip(terms, X.toarray()[0])
Seems nobody know . I am answering here as other people face the same problem , i got where to look for now , have not fully implement it yet.
it lies deep inside CountVectorizer from sklearn.feature_extraction.text :
def transform(self, raw_documents):
"""Extract token counts out of raw text documents using the vocabulary
fitted with fit or the one provided in the constructor.
Parameters
----------
raw_documents: iterable
an iterable which yields either str, unicode or file objects
Returns
-------
vectors: sparse matrix, [n_samples, n_features]
"""
if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
raise ValueError("Vocabulary wasn't fitted or is empty!")
# raw_documents can be an iterable so we don't know its size in
# advance
# XXX #larsmans tried to parallelize the following loop with joblib.
# The result was some 20% slower than the serial version.
analyze = self.build_analyzer()
term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents] # <<-- added here
self.test_term_counts_per_doc=deepcopy(term_counts_per_doc)
return self._term_count_dicts_to_matrix(term_counts_per_doc)
I have added self.test_term_counts_per_doc=deepcopy(term_counts_per_doc) and it make it able to call from vectorizer outside like this :
load_files = recursive_load_files
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)
data_train = load_files(trainer_path, load_content = True, shuffle = False)
data_test = load_files(tester_path, load_content = True, shuffle = False)
print 'data loaded'
categories = None # for case categories == None
print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print
# split a training set and a test set
print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
stop_words='english',charset_error="ignore")
X_train = vectorizer.fit_transform(data_train.data)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print
print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)
print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
print counter
here is my fork , i also submitted pull requests:
https://github.com/v3ss0n/scikit-learn
Please suggest me if there any better way to do.
my first post here!
I have problems using the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 worlds and a code. I would like to use the code as label of the class and each world of the description as features.
An example:
"My name is Obama", 001
...
Training set = {[feature['My']=True,feature['name']=True,feature['is']=True,feature[Obama]=True], 001}
Unfortunately, using this approach, the training procedure NaiveBayesClassifier.train use up to 3 GB of ram..
What's wrong in my approach?
Thank you!
def document_features(document): # feature extractor
document = set(document)
return dict((w, True) for w in document)
...
words=set()
entries = []
train_set= []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while (t!=""):
t = t.split("'")
code = t[0] #class
desc = t[1] # description
words = words.union(s) #update dictionary with the new words in the description
entries.append((s,code))
t = readfile.readline()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set) # Training
Use nltk.classify.apply_features which returns an object that acts like a list but does not store all the feature sets in memory.
from nltk.classify import apply_features
More Information and a Example here
You are loading the file anyway into the memory, you will need to use some form of lazy loading method. Which will load as per need basis.
Consider looking into this