How to perform text classification with naive bayes using sklearn library? - python

I am trying text classification using a naive Bayes text classifier.
My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know SVM would be a better option here, but I want to go with Naive Bayes using the sklearn library.
{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}
This is what I have tried so far:
import numpy as np
import json
from sklearn.naive_bayes import *

topic = []
question = []
excerpt = []

with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])

unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]

x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1,x2])
Y = np.array(numeric_topics)

print X[0]

clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( ['hello'] )
But as expected I am getting ValueError: could not convert string to float. My question is: how can I create a simple classifier to classify the question and excerpt into the related topic?

All classifiers in sklearn require input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.
vect = TfidfVectorizer()
X = vect.fit_transform(X)
Obviously, you'll need to vectorize your test set in the same way:
clf.predict( vect.transform(['hello']) )
See a tutorial on using sklearn with textual data.
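Putting the pieces together, here is a minimal sketch of a full pipeline for your data. It reuses the question, excerpt and topic lists from your code; LabelEncoder is just a cleaner way of doing the topic-to-number mapping, and MultinomialNB is one reasonable choice for tf-idf features, though BernoulliNB would also work:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Combine question and excerpt into a single text per record
texts = [q + " " + e for q, e in zip(question, excerpt)]

# Encode topic strings as integer class labels (replaces the manual replace() chain)
le = LabelEncoder()
y = le.fit_transform(topic)

# Vectorize the training texts and fit the classifier
vect = TfidfVectorizer()
X = vect.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, y)

# New text must go through the same fitted vectorizer before prediction
pred = clf.predict(vect.transform(["What is the effective capacitance of this circuit"]))
print(le.inverse_transform(pred))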

Related

how to do one hot encoding for text in a paragraph at the sentence level?

I have sentences stored in a text file which looks like this:
radiologicalreport =1. MDCT OF THE CHEST History: A 58-year-old male, known case lung s/p LUL segmentectomy. Technique: Plain and enhanced-MPR CT chest is performed using 2 mm interval. Previous study: 03/03/2018 (other hospital) Findings: Lung parenchyma: The study reveals evidence of apicoposterior segmentectomy of LUL showing soft tissue thickening adjacent surgical bed at LUL, possibly post operation.
My ultimate goal is to apply LDA to classify each sentence into one topic. Before that, I want to one-hot encode the text. The problem I am facing is that I want to one-hot encode per sentence into a numpy array so that I can feed it into LDA. If I want to one-hot encode the full text, I can easily do it using these two lines:
sent_text = nltk.sent_tokenize(text)
hot_encode=pd.Series(sent_text).str.get_dummies(' ')
However, my goal is to one-hot encode per sentence into a numpy array. So, I tried the following code:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import nltk
import pandas as pd
from nltk.tokenize import TweetTokenizer, sent_tokenize

with open('radiologicalreport.txt', 'r') as myfile:
    report = myfile.read().replace('\n', '')

tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(report)]
tokens_np = array(tokens_sentences)

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(tokens_np)

# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
I get an error, "TypeError: unhashable type: 'list'", at this line:
integer_encoded = label_encoder.fit_transform(tokens_np)
and hence cannot proceed further.
Also, my tokens_sentences look like this, as shown in the image.
Please help!!
You are trying to transform labels to numerical values using fit_transform (in your example labels are lists of words -- tokens_sentences).
But non-numerical labels can be transformed only if they are hashable and comparable (see the docs). Lists are not hashable but you can convert them to tuples:
tokens_np = array([tuple(s) for s in tokens_sentences])
# also ok: tokens_np = [tuple(s) for s in tokens_sentences]
and then you can get your sentences encoded in integer_encoded
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(tokens_np)
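From there, the one-hot step from your original snippet should work on integer_encoded; a minimal sketch (note that newer scikit-learn versions use sparse_output instead of sparse):
from sklearn.preprocessing import OneHotEncoder

# One row per sentence, one column per distinct sentence label
integer_encoded = integer_encoded.reshape(-1, 1)
onehot_encoder = OneHotEncoder(sparse=False)  # sparse_output=False on newer sklearn
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded.shape)  # (n_sentences, n_unique_sentences)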

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the most similar sentence to one I introduce, it should use this new document's vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute the vector for each of the sentences of the new document, but I can't use the most_similar() function with the list of vectors I obtain; it has to be a KeyedVectors object.
I would like to know if there's any way that I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each one of the sentences of the new document and the sentence I introduce individually (in this case, is there any implementation in Gensim that allows me to compute the cosine similarity between 2 vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I cannot provide the complete code, since it is a university project, but here are the main parts where I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago in gensim==3.2.0 (the syntax might have changed).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))

for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

# word2vec text format: a header line with the vector count and dimensionality,
# then one labelled vector per line
with open("question_vectors.txt", "w") as f:
    f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
    for v in vectors:
        f.write("{} {}\n".format(v, " ".join(vectors[v].astype(str))))
Then you can load it and use the most_similar function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not so high) would be just to convert questions to a np.array and compute cosine distances, e.g.
import numpy as np

questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product.T / norm

# Otherwise the item is the closest to itself
for j in range(len(questions)):
    product[j, j] = 0

# Gives the indices of the 10 most similar items to the 0th question
np.argpartition(product[0], -10)[-10:]
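Equivalently, the same similarity matrix can be computed with scikit-learn's cosine_similarity; a small sketch, assuming questions is the array of inferred vectors built above:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(questions)      # (n_questions, n_questions) cosine matrix
np.fill_diagonal(sims, 0)                # ignore self-similarity
top10 = np.argsort(sims[0])[::-1][:10]   # indices of the 10 most similar to question 0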

scikit-learn - Tfidf on HashingVectorizer

I am using scikit-learn to perform some analytics on a large dataset (roughly 34,000 files). Now I was wondering: the HashingVectorizer aims at low memory usage. Is it possible to first convert a bunch of files to HashingVectorizer objects (using pickle.dump) and then load all these files together and convert them to tf-idf features? These features can be calculated from the HashingVectorizer output, because counts are stored and the number of documents can be deduced. I now have the following:
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
    with open(path, 'wb') as handle:
        pickle.dump(features, handle)
Then, loading the files is trivial:
data = []
for path in paths:
    with open(path, 'rb') as handle:
        data.append(pickle.load(handle))

tfidf = TfidfVectorizer()
tfidf.fit_transform(data)
But, the magic does not happen. How can I let the magic happen?
It seems the problem is that you are trying to vectorize your text twice. Once you have built a matrix of counts, you should be able to transform the counts to tf-idf features using sklearn.feature_extraction.text.TfidfTransformer instead of TfidfVectorizer.
Also, it appears your saved data is a sparse matrix. You should be stacking the loaded matrices using scipy.sparse.vstack() instead of passing a list of matrices to TfidfTransformer.
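A minimal sketch of that approach, assuming paths is the list of pickle files written in the loop above:
import pickle
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfTransformer

# Load the pickled HashingVectorizer outputs and stack them into one sparse matrix
matrices = []
for path in paths:
    with open(path, 'rb') as handle:
        matrices.append(pickle.load(handle))
counts = vstack(matrices)

# Turn the hashed counts into tf-idf features
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)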
I'm quite worried by your loop:
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
Each time you re-fit your vectoriser, it may forget its vocabulary, and so the entries in each vector won't correspond to the same words (I'm not sure about this; I guess it depends on how they do the hashing). Why not just fit it on the whole corpus, i.e.
features = vectorizer.fit_transform(texts)
For your actual question, it sounds like you are just trying to normalise the columns of your data matrix by the IDF; you should be able to do this directly on the arrays (I've converted to numpy arrays since I can't work out how the indexing works on the scipy sparse matrices). The mask DF != 0 is necessary since you used the hashing vectoriser, which has 2^20 columns:
import numpy as np
X = np.array(features.todense())
DF = (X != 0).sum(axis=0)
X_TFIDF = X[:,DF != 0]/DF[DF != 0]

Feature Importance extraction of Decision Trees (scikit-learn)

I've been trying to get a grip on the importance of features used in a decision tree I've modelled. I'm interested in discovering the weight of each feature selected at the nodes as well as the term itself. My data is a bunch of documents.
This is my code for the decision tree; I modified the feature importance snippet from scikit-learn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html):
from sklearn.feature_extraction.text import TfidfVectorizer

### Feature extraction
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True,
                                   tokenizer=None, ngram_range=(1,2))  # ngram_range=(1,0)
tfidf_matrix = tfidf_vectorizer.fit_transform(data[:, 1])
terms = tfidf_vectorizer.get_features_names()

### Define Decision Tree and fit
dtclf = DecisionTreeClassifier(random_state=1234)
dt = data.copy()
y = dt["label"]
X = tfidf_matrix
fitdt = dtclf.fit(X, y)

from sklearn.datasets import load_iris
from sklearn import tree

### Visualize Decision Tree
with open('data.dot', 'w') as file:
    tree.export_graphviz(dtclf, out_file=file, feature_names=terms)

import subprocess
subprocess.call(['dot', '-Tpdf', 'data.dot', '-o', 'data.pdf'])

### Extract feature importance
importances = dtclf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print('Feature Ranking:')
for f in range(tfidf_matrix.shape[1]):
    if importances[indices[f]] > 0:
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
        print("feature name: ", terms[indices[f]])
Am I correct in assuming that using terms[indices[f]] (which is the feature term vector) will print the actual feature term used to split the tree at a certain node?
The decision tree visualised with Graphviz shows, for instance, X[30]; I'm assuming this refers to the numerical index of the feature term. How do I extract the term itself so I can validate the process I described in #1?
Updated code
fitdt = dtclf.fit(X, y)
with open(...):
    tree.export_graphviz(dtclf, out_file=file, feature_names=terms)
Thanks in advance
For your first question, you need to get the feature names out of the vectoriser with terms = tfidf_vectorizer.get_feature_names() (note the method name; the get_features_names call in your code does not exist). For your second question, you can call export_graphviz with feature_names=terms to get the actual names of your variables to appear in your visualisation (check out the full documentation of export_graphviz for many other options that may be useful for improving your visualisation).
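A small sketch of both fixes together, assuming tfidf_vectorizer and dtclf are fitted as in your code (note that newer scikit-learn versions rename the method to get_feature_names_out()):
import numpy as np
from sklearn import tree

# 1. Map feature indices back to the actual terms
terms = tfidf_vectorizer.get_feature_names()
importances = dtclf.feature_importances_
for idx in np.argsort(importances)[::-1]:
    if importances[idx] > 0:
        print("%s (%f)" % (terms[idx], importances[idx]))

# 2. Export the tree with real term names instead of X[30]-style indices
with open('data.dot', 'w') as f:
    tree.export_graphviz(dtclf, out_file=f, feature_names=terms)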

How to combine the outputs of multiple naive bayes classifier?

I am new to this.
I have a set of weak classifiers constructed using the Naive Bayes Classifier (NBC) in the sklearn toolkit.
My problem is how to combine the output of each of the NBCs to make a final decision. I want my decision to be in probabilities and not labels.
I made the following program in Python. I assume a 2-class problem from the iris dataset in sklearn. For demo/learning, say I make 4 NBCs as follows.
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
import numpy as np
import cPickle
import math
iris = datasets.load_iris()
gnb1 = GaussianNB()
gnb2 = GaussianNB()
gnb3 = GaussianNB()
gnb4 = GaussianNB()
#Actual dataset is of 3 class I just made it into 2 class for this demo
target = np.where(iris.target, 2, 1)
gnb1.fit(iris.data[:, 0].reshape(150,1), target)
gnb2.fit(iris.data[:, 1].reshape(150,1), target)
gnb3.fit(iris.data[:, 2].reshape(150,1), target)
gnb4.fit(iris.data[:, 3].reshape(150,1), target)
#y_pred = gnb.predict(iris.data)
index = 0
y_prob1 = gnb1.predict_proba(iris.data[index,0].reshape(1,1))
y_prob2 = gnb2.predict_proba(iris.data[index,1].reshape(1,1))
y_prob3 = gnb3.predict_proba(iris.data[index,2].reshape(1,1))
y_prob4 = gnb4.predict_proba(iris.data[index,3].reshape(1,1))
#print y_prob1, "\n", y_prob2, "\n", y_prob3, "\n", y_prob4
# I just added it over all for each class
pos = y_prob1[:,1] + y_prob2[:,1] + y_prob3[:,1] + y_prob4[:,1]
neg = y_prob1[:,0] + y_prob2[:,0] + y_prob3[:,0] + y_prob4[:,0]
print pos
print neg
As you will notice, I just simply added the probabilities of each NBC as the final score. I wonder if this is correct?
If I have done it wrong, can you please suggest some ideas so I can correct myself?
First of all, why do you do this? You should have one Naive Bayes here, not one per feature. It looks like you do not understand the idea of the classifier. What you did is actually what Naive Bayes does internally: it treats each feature independently, but as these are probabilities you should multiply them, or add logarithms. So:
1. You should just have one NB: gnb.fit(iris.data, target)
2. If you insist on having many NBs, you should merge them through multiplication or addition of logarithms (which is the same from a mathematical perspective, but multiplication is less stable numerically):
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1]
or
pos = np.exp(np.log(y_prob1[:,1]) + np.log(y_prob2[:,1]) + np.log(y_prob3[:,1]) + np.log(y_prob4[:,1]))
You can also directly predict the logarithm through gnb.predict_log_proba instead of gnb.predict_proba.
However, this approach has one error: Naive Bayes will also include the prior in each of your probabilities, so you will have very skewed distributions. So you have to manually normalize:
pos_prior_ = gnb1.class_prior_[1]  # all models have the same prior so we can use the one from gnb1
pos = pos_prior_ * (y_prob1[:,1]/pos_prior_) * (y_prob2[:,1]/pos_prior_) * (y_prob3[:,1]/pos_prior_) * (y_prob4[:,1]/pos_prior_)
which simplifies to
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1] / pos_prior_**3
and for log to
pos = ... - 3 * np.log(pos_prior_)
So, once again: you should use option 1.
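For completeness, a minimal sketch of option 1 on the same two-class iris setup as in your code:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
import numpy as np

iris = datasets.load_iris()
target = np.where(iris.target, 2, 1)  # same 2-class relabelling as in the question

# One Naive Bayes over all four features at once
gnb = GaussianNB()
gnb.fit(iris.data, target)
print(gnb.predict_proba(iris.data[0].reshape(1, -1)))  # [P(class 1), P(class 2)]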
The answer by lejlot is almost correct. The one thing missing is that you need to normalize his pos result (the product of the probabilities, divided by the prior) by the sum of this pos result for both classes. Otherwise, the sum of the probabilities of all classes will not be equal to one.
Here is some sample code that tests the result of this procedure for a dataset with 6 features:
# Use one Naive Bayes for all 6 features:
gaus = GaussianNB(var_smoothing=0)
gaus.fit(X, y)
y_prob1 = gaus.predict_proba(X)
# Use one Naive Bayes on each half of the features and multiply the results:
gaus1 = GaussianNB(var_smoothing=0)
gaus1.fit(X[:, :3], y)
y_log_prob1 = gaus1.predict_log_proba(X[:, :3])
gaus2 = GaussianNB(var_smoothing=0)
gaus2.fit(X[:, 3:], y)
y_log_prob2 = gaus2.predict_log_proba(X[:, 3:])
pos = np.exp(y_log_prob1 + y_log_prob2 - np.log(gaus1.class_prior_))
y_prob2 = pos / pos.sum(axis=1)[:,None]
y_prob1 should be equal to y_prob2 apart from numerical errors (var_smoothing=0 helps reduce the error).
