Text [Multi-Level] Classification with many outputs - Python

Problem Statement:
Classify a text document into the category it belongs to, and also classify it into up to two levels of that category.
Sample Training Set:
Description                                          Category          Level1    Level2
The gun shooting that happened in Vegas killed two   Crime | High      Crime     High
Donald Trump elected as President of America         Politics | High   Politics  High
Rian won in football qualifier                       Sports | Low      Sports    Low
Brazil won in football final                         Sports | High     Sports    High
Initial Attempt:
I tried to create a classifier model that classifies the Category using the Random Forest method, and it gave me 90% overall accuracy.
Code1:
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
stop = stopwords.words('english')
data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))
train_data, test_data, train_label, test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print "Overall_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_Category.txt", "w", "utf8") as out:
for inp, pred, act in zip(test_data, Output_predict, test_label):
try:
out.write("{}\t{}\t{}\n".format(inp, pred, act))
except:
continue
Problem:
I want to add two more levels to the model: Level1 and Level2. The reason for adding them is that when I ran classification for Level1 alone, I got 96% accuracy. I am stuck at splitting the training and test datasets and at training a model that has three outputs.
Is it possible to create a single model with three outputs, or should I create three models? How do I split the train and test data?
Edit1:
import string
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
stop = stopwords.words('english')
data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print test_data_feature
Output_predict = RF.predict(test_data_feature)
print "BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
for inp, pred, act in zip(test_data, Output_predict, test_label):
try:
out.write("{}\t{}\t{}\n".format(inp, pred, act))
except:
continue

The scikit-learn Random Forest Classifier natively supports multiple outputs (see this example). Therefore, you do not need to create three separate models.
From the documentation of RandomForestClassifier.fit, the inputs to the fit function are:
X : array-like or sparse matrix of shape = [n_samples, n_features]
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
Therefore, you need an array y (your labels) of size N x 3 as your input to your RandomForestClassifier. In order to split your training and test set, you can do:
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)
Your train_label and test_label should be arrays of size N x 3 that you can use to fit your model and compare your predictions against (NB: I have not tested it here, you might need to do some transposes).
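To make the shapes concrete, here is a minimal sketch of the multi-output setup (not taken from the question or the answer; it assumes the same tab-separated file and the column names Category, Level1 and Level2 shown in the sample data):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
data = pd.read_csv("Training_dataset_70k", sep="\t", quoting=3, encoding="utf8").dropna()
train_data, test_data, train_label, test_label = train_test_split(
    data.Description, data[["Category", "Level1", "Level2"]],
    test_size=0.3, random_state=100)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# an (n_samples, 3) label array makes the forest a multi-output classifier
rf = RandomForestClassifier(n_estimators=10)
rf.fit(train_features, train_label)
predictions = rf.predict(test_features)                  # shape (n_samples, 3)
per_level_accuracy = (predictions == test_label.values).mean(axis=0)
print(dict(zip(["Category", "Level1", "Level2"], per_level_accuracy)))
With a label matrix of shape (n_samples, 3), predict returns one column per output, so each level's accuracy can be read off separately.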

Related

sklearn vectorizer.get_feature_names_out() error

I am working on a LogisticRegression text classifier. The classifier's job is to label data as spam or ham.
Initially I have one feature (just the text), but later I add three more features:
The length of the document (number of characters)
The number of digits in the document
The number of non-word characters in the document
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
import re
from varname import nameof
##-----------------------------------------------------------------------------
#
def add_feature(X, feature_to_add):
    X_modified = hstack([X, csr_matrix(feature_to_add).T], 'csr')
    return X_modified
##-----------------------------------------------------------------------------
#
def feature_extractor(series_data):
    series_doc_len = []
    series_digits = []
    series_non_alphas = []
    for (idx, text) in enumerate(series_data):
        text_length = len(text)
        text_digits = sum(c.isdigit() for c in text)
        text_non_alphas = re.findall(r'\W+', text)
        text_non_alphas_count = len(text_non_alphas)
        series_doc_len.append(text_length)
        series_digits.append(text_digits)
        series_non_alphas.append(text_non_alphas_count)
    series_doc_len_series = pd.Series(series_doc_len)
    series_digits_series = pd.Series(series_digits)
    series_non_alphas_series = pd.Series(series_non_alphas)
    series_doc_len_renamed = series_doc_len_series.rename('length_of_doc')
    series_digits_renamed = series_digits_series.rename('digit_count')
    series_non_alphas_renamed = series_non_alphas_series.rename('non_word_char_count')
    return series_doc_len_renamed, series_digits_renamed, series_non_alphas_renamed
##-----------------------------------------------------------------------------
#
def load_csv_data(file_name):
    spam_data_df = pd.read_csv(file_name)
    spam_data_df['target'] = np.where(spam_data_df['target'] == 'spam', 1, 0)
    X_train, X_test, y_train, y_test = train_test_split(spam_data_df['text'],
                                                        spam_data_df['target'],
                                                        test_size=0.3,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test
##-----------------------------------------------------------------------------
file_name = "../data/spam-dummy.csv"
X_train, X_test, y_train, y_test = load_csv_data(file_name)
vectorizer = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb')
X_train_vectorized = vectorizer.fit_transform(X_train)
(X_train_doclen, X_train_numdigits, X_train_nonalpha) = feature_extractor(X_train)
for feature in (X_train_doclen, X_train_numdigits, X_train_nonalpha):
    X_train_vectorized = add_feature(X_train_vectorized, feature)
X_test_vectorized = vectorizer.transform(X_test)
(X_test_doclen, X_test_numdigits, X_test_nonalpha) = feature_extractor(X_test)
for feature in (X_test_doclen, X_test_numdigits, X_test_nonalpha):
    X_test_vectorized = add_feature(X_test_vectorized, feature)
classifier = LogisticRegression(C=100, solver='liblinear')
classifier.fit(X_train_vectorized, y_train)
y_predicted = classifier.predict(X_test_vectorized)
feature_names = np.array(vectorizer.get_feature_names_out() + ['length_of_doc', 'digit_count', 'non_word_char_count'])
sorted_coef_index = classifier.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:10]]
largest = feature_names[sorted_coef_index[:-11:-1]]
After running the prediction, I am trying to pull smallest/largest coefficients from the model, including the additional three features along with their names.
File "/Users/ukhan/Development/github/education.git/coursera/applied_text_mining_in_python/labs/lab-3/supplimental/code/tfidf-kavitha.py", line 92, in <module>
feature_names = np.array(vectorizer.get_feature_names_out() + ['length_of_doc', 'digit_count', 'non_word_char_count'])
ValueError: operands could not be broadcast together with shapes (15569,) (3,)
What is the correct way to approach this?
I then added the following code to see if the feature names I added were actually there, but I don't see them:
feature_names = np.array(vectorizer.get_feature_names_out())
for feature_name in feature_names:
    print(f" Inspecting feature: {feature_name}")
    if feature_name == 'length_of_doc':
        print(f' Feature name: {feature_name} has been found')
    elif feature_name == 'digit_count':
        print(f' Feature name: {feature_name} has been found')
    elif feature_name == 'non_word_char_count':
        print(f' Feature name: {feature_name} has been found')
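For context (not part of the original post): get_feature_names_out() returns a NumPy array, so using + with a plain Python list attempts element-wise addition of arrays with shapes (15569,) and (3,), which is exactly the broadcast error above. Also, the vectorizer only knows about its own vocabulary, so columns appended manually with add_feature will never appear in its feature names. A sketch of one way to build the combined name array:
# Sketch (an assumption, not from the post): concatenate the vectorizer's names
# with the manually added column names, in the order the columns were stacked.
added = ['length_of_doc', 'digit_count', 'non_word_char_count']
feature_names = np.concatenate([vectorizer.get_feature_names_out(), added])
sorted_coef_index = classifier.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:10]]
largest = feature_names[sorted_coef_index[:-11:-1]]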

Why does SMOTE raise "Found input variables with inconsistent numbers of samples"?

I am trying to classify emotion from tweets with a dataset of 4401 tweets. When I use a smaller sample of the data (around 15 tweets) everything works fine, but when I use the full dataset it raises the error:
Found input variables with inconsistent numbers of samples: [7, 3520]
The error happens when I try to oversample the data using SMOTE, after transforming the data with CountVectorizer.
This is the code where the error is raised:
# N-gram Feature and Term Frequency
vectorizer = CountVectorizer(ngram_range=(1,3))
x_train_tf = vectorizer.fit_transform(str(x_train).split('\n')).toarray()
x_test_tf = vectorizer.transform(str(x_test).split('\n')).toarray()
df_output = pd.DataFrame(data =x_train_tf, columns = vectorizer.get_feature_names_out())
display(df_output)
# the print shape is (7 rows × 250 columns)
smote = SMOTE(random_state=42, k_neighbors=5)
x_smote, y_smote = smote.fit_resample(x_train_tf, y_train)
print("Total Train Data SMOTE : ",x_smote.shape), print("Total Train Label SMOTE : ",y_smote)
I do not understand why this is happening, so some explanation would really help.
I have already tried to solve it using answers from other similar questions, but nothing has worked.
This is the full code:
import nltk
import re
#nltk.download()
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk import everygrams
from collections import Counter
from sklearn import preprocessing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
dataset = pd.read_csv("G:/TA/Program/dataset/Twitter_Emotion_Dataset.csv", encoding='latin-1')
# Preprocessing
dataset['case_folding_tweet'] = dataset['tweet'].str.casefold()
dataset['only_alphabet_tweet'] = [re.sub('[^a-zA-Z]+\s*', ' ', s) for s in dataset['case_folding_tweet']]
dataset['data_cleaning_tweet'] = dataset['only_alphabet_tweet'].str.replace(r'\b\w{1}\b','').str.replace(r'\s+', ' ')
slangword_dictionary = ("G:/TA/Program/dataset/kamus_singkatan.csv")
deslang = {}
list_slangword = open(slangword_dictionary).readlines()
for line in list_slangword:
    slang, unslang = line.strip().split(';')
    deslang[slang] = unslang
deslang[slang] = {r"\b{}\b".format(k): v for k, v in deslang.items()}
dataset['data_cleaning_tweet'] = dataset['data_cleaning_tweet'].replace(deslang[slang], regex=True)
dataset['convert_slang_tweet'] = dataset['data_cleaning_tweet']
replace_dictionary = {'tidak ': 'tidak', 'bukan ': 'bukan', 'jangan ': 'jangan', 'belum ': 'belum'}
dataset['convert_negation_tweet'] = dataset['convert_slang_tweet'].replace(replace_dictionary, regex=True)
dataset['tokenization_tweet'] = dataset['convert_negation_tweet'].apply(word_tokenize)
list_stopwords = set(stopwords.words("indonesian"))
list_stopwords.add('username')
list_stopwords.add('url')
dataset['stopword_removal_tweet'] = dataset['tokenization_tweet'].apply(lambda x: [item for item in x if item not in list_stopwords])
factory = StemmerFactory()
stemmer = factory.create_stemmer()
dataset['stemmed_tweet'] = dataset['stopword_removal_tweet'].apply(lambda x: [stemmer.stem(y) for y in x])
# Split data
x = dataset["stemmed_tweet"].values
y = dataset["label"].values
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state= 42)
# Get N-gram and TF
vectorizer = CountVectorizer(ngram_range=(1,3))
x_train_tf = vectorizer.fit_transform(str(x_train).split('\n')).toarray()
x_test_tf = vectorizer.transform(str(x_test).split('\n')).toarray()
# Oversampling
smote = SMOTE(random_state=42, k_neighbors=5)
x_smote, y_smote = smote.fit_resample(x_train_tf, y_train)
print("Total Train Data SMOTE : ",x_smote.shape), print("Total Train Label SMOTE : ",y_smote)
gnb_classifier = GaussianNB()
gnb_classifier.fit(x_smote, y_smote)
print(gnb_classifier)
y_pred = gnb_classifier.predict(x_test_tf)
print("Emotion Predicted :", y_pred)
Link to the dataset
I cannot solve it precisely because I don't have your data, but here are a few observations which should help:
Apparently x_train_tf has only 7 rows? That is not enough to train a model, and it is not 80% of 4401, which is what you are supposed to obtain from train_test_split.
Note that y_train has 3520 rows = 4401 * 80%, the correct number of rows.
I suspect that the line x_train_tf = vectorizer.fit_transform(str(x_train).split('\n')).toarray() is not doing what you think it does. Try to decompose the str(x_train).split('\n') part.
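To illustrate that last point (a sketch, not from the original answer, assuming x_train comes from dataset["stemmed_tweet"].values as above): str(x_train) is the array's truncated printed representation, so splitting it on newlines yields only a handful of display lines rather than one document per tweet.
lines = str(x_train).split('\n')
print(len(lines))    # a few repr lines, nowhere near 3520
print(lines[:3])     # fragments of the printed array, not actual tweets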
I fixed the problem using the answer from this post, by joining the tokens in the training data column before vectorizing.
df_train = pd.DataFrame(data=x_train)
df_test = pd.DataFrame(data=x_test)
series = pd.Series(df_train['stemmed_tweet'])
corpus = series.apply(lambda series: ' '.join(series))
vectorizer = CountVectorizer(ngram_range=(1,3), lowercase=False)
x_train_tf = vectorizer.fit_transform(corpus).toarray()
x_test_tf = vectorizer.transform(str(df_test.values).split("\n")).toarray()

Using Sklearn, Category Predictions not working on the test data

Dataset: I created a very simple dataset with "Supplier" and "Item Description" columns. The dataset has a list of item descriptions and the preferred supplier for each item.
Requirement: I would like to write a program that will take an "Item Description" and predict the "Supplier". To keep it very simple, I have only 5 unique supplier/item-description combinations across the 950 rows in the .txt file.
Issue: The accuracy shows up as 1.0 and the confusion matrix shows no false positives. But when I give it new data, the prediction is wrong.
Steps Done
Read .txt for "Supplier" and "Item Description"
Label Encoder applied on the "Item Description"
train test and split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
Created a Pipeline for applying the TfidfVectorizer and MultinomialNB
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
Fit the model and predict:
y_pred=model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc= accuracy_score(y_test,y_pred)
# acc is 1.0 and the cm shows no false positives/negatives
So far, things look ok
dumped the pickle
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
Tried prediction on an Item Description = 'Lego, Barbie and other Toy Items'; I was expecting "Toys R Us".
The prediction was wrong; it came up as "Office Depot".
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
Can you please let me know what I am doing wrong here to get the wrong prediction (new_y_pred) for the test item description passed in (new_X)?
This is my first ML code. I have tried debugging this by looking at various articles, but no luck.
Thanks
== Complete Code, if it is helpful ==
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import re # library for cleaning data
import nltk # library for NLP
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import pickle
df=pd.read_csv('git_suppliers.txt', sep='\t')
# Prep the data - Item Description
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = PorterStemmer()
words = stopwords.words("english")
df['ITEM_DESCRIPTION'] = df['ITEM_DESCRIPTION'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z0-9]", " ", x).split() if i not in words]).lower())
# Feature Generation using the TF-IDF
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(df['ITEM_DESCRIPTION']).toarray()
final_features.shape
# final_features shows only 43 features - not going to use SelectKBest for such such less features count
#
# Split into training and test data
#
X = df['ITEM_DESCRIPTION']
y = df['SUPPLIER']
from sklearn.preprocessing import LabelEncoder
labelObj = LabelEncoder()
y=labelObj.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
y_test_decoded=labelObj.inverse_transform(y_test)
#
# Create a pipeline, fit the model, predict for test data and save in pickle
#
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
# Predict for test data
y_pred=model.predict(X_test)
# Accuracy shows up as 1.0 and the confusion matrix shows no false positives/negatives
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc= accuracy_score(y_test,y_pred)
print(acc)
# Dump the model and lets predict for one item description,
# for which i expect Toys R Us as the supplier/Seller
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
labelObj.inverse_transform(new_y_pred)
### Shows Office Depot
My bad - the input to predict was the wrong type. Passing in a Series made it work fine.
new_items = pd.Series(new_items)
new_y_pred=loadedModel.predict(new_items)
labelObj.inverse_transform(new_y_pred)
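For context (not part of the original post): the TfidfVectorizer inside the pipeline expects an iterable of raw strings, and iterating over a DataFrame yields its column names rather than its rows, so predicting on new_X scored the literal string 'ITEM_DESCRIPTION' instead of the item text. A minimal sketch of passing the new description directly:
# Sketch: the pipeline wants a list/Series of document strings, one per item.
# Note: the question stems and removes stopwords before fitting, so applying the
# same preprocessing to new descriptions would keep the inputs consistent.
new_descriptions = ['Lego, Barbie and other Toy Items']
new_y_pred = loadedModel.predict(new_descriptions)
print(labelObj.inverse_transform(new_y_pred))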

How to compile fitted model for coremltools?

I have made my machine learning model and need to bring it into Xcode using coremltools. I originally used a scikit-learn ensemble as my model, but Core ML does not support it, so I decided to use either LinearSVC or LogisticRegression, which have the highest accuracy after training.
import numpy as np
import pandas as pd
#load the dataset of the file
#FYI:use quotation marks to escape comma or just not use the sentences
df = pd.read_csv('RhetoricalDevices1.csv', error_bad_lines=False, delimiter= ',', engine='python')
#print useful information about the data set
df.info()
df.head()
#check class distribution--number of each device uploaded
classes1 = df['Rhetorical Devices1']
classes2 = df['Rhetorical Devices2']
from sklearn.preprocessing import LabelEncoder
encoder1 = LabelEncoder()
encoder2 = LabelEncoder()
Y1 = encoder1.fit_transform(classes1.fillna('0'))
Y2 = encoder2.fit_transform(classes2.fillna('0'))
print(encoder1.inverse_transform([6]))
import nltk
from nltk.tokenize import word_tokenize
#creating a bag-of-words model
all_words = []
for sentences in processed:
    words = word_tokenize(sentences)
    for w in words:
        all_words.append(w)
all_words = nltk.FreqDist(all_words)
# use the 2000 most common words as features
word_features = list(all_words.keys())[:2000]
#define a find_feature function
def find_features(sentence):
    words = word_tokenize(sentence)
    features = {}
    for word in word_features:
        features[word] = (word in words)
    return features
#find features for all sentences
sentences = list(zip(processed, Y1))
#define a seed for reproducibility
seed = 1
np.random.seed(seed)
np.random.shuffle(sentences)
#call find_features function for each sentence
featuresets = [(find_features(text), label) for (text, label) in sentences]
# split training and testing data sets using sklearn
from sklearn import model_selection
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state = seed)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
names = ['K Nearest Neighbors','Decision Tree','Random Forest','Logistic Regression','SGDClassifier','Multinomial','One Vs Rest Classifier']
classifiers = [
    KNeighborsClassifier(n_jobs = -1),
    DecisionTreeClassifier(class_weight = 'balanced'),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100, class_weight = 'balanced', n_jobs = -1),
    MultinomialNB(),
    #GaussianProcessClassifier(),
    LinearSVC()
]
models = list(zip(names, classifiers))
from nltk.classify.scikitlearn import SklearnClassifier
for name, model in models:
    nltk_model = SklearnClassifier(model)
    nltk_model.train(training)
    accuracy = nltk.classify.accuracy(nltk_model, testing)*100
    print("{} Accuracy: {}".format(name, accuracy))
When I tried the following code, I got the error "TypeError: Expected a 'fitted' model for conversion". How should I fix this?
model = LinearSVC()
coreml_model = coremltools.converters.sklearn.convert(model, 'Samples','Rhetorical Devices')
You should call fit() on your model with your training data, before converting it to CoreML.
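A sketch of what that could look like (an assumption, not from the answer: since the features in the question are NLTK-style dictionaries, DictVectorizer is used here to turn them into the numeric matrix a scikit-learn estimator needs before fitting and converting):
# Sketch, not from the original post: fit the estimator on numeric features
# first, then convert the *fitted* model.
import coremltools
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
train_feats, train_labels = zip(*training)      # 'training' as built in the question
vec = DictVectorizer()
X_train_vec = vec.fit_transform(train_feats)    # boolean feature dicts -> 0/1 matrix
model = LinearSVC()
model.fit(X_train_vec, train_labels)            # must be fitted before conversion
coreml_model = coremltools.converters.sklearn.convert(model, 'Samples', 'Rhetorical Devices')
coreml_model.save('RhetoricalDevices.mlmodel')  # example output filename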

ValueError: too many values to unpack (NLTK classifier)

I'm doing classification analysis using NLTK's Naive Bayes classifier. I load a TSV file containing records and labels.
But the model doesn't get trained due to an error. Here's my Python code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('tweets.txt', delimiter ='\t', quoting = 3)
dataset.isnull().any()
dataset = dataset.fillna(method='ffill')
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 16004):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet
             if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,
random_state = 0)
train_set, test_set = X_train[500:], y_train[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
The error is:
File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
The NLTK classifier doesn't work like scikit-learn estimators. It requires X and y together in a single structure, which is then passed to train().
But in your code you are only supplying X_train, so it tries to unpack y from that, hence the error.
The NaiveBayesClassifier requires the input to be a list of tuples where list denotes the training samples and the tuple has the feature dictionary and label inside. Something like:
X = [({feature1:'val11', feature2:'val12' .... }, class1),
({feature1:'val21', feature2:'val22' .... }, class2),
...
... ]
You need to change your input to this format.
feature_names = cv.get_feature_names()
train_set = []
for i, single_sample in enumerate(X):
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    train_set.append((single_feature_dict, y[i]))
Note: The above for loop can be shortened by using dict comprehension but I'm not that fluent there.
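For completeness, here is the same conversion written as a dict comprehension (a sketch, assuming cv, X and y from the question; newer scikit-learn releases name the method get_feature_names_out()):
feature_names = cv.get_feature_names()
train_set = [
    ({feature_names[j]: value for j, value in enumerate(row)}, y[i])
    for i, row in enumerate(X)
]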
Then you can do this:
nltk.NaiveBayesClassifier.train(train_set)
