I am struggling with a dimension error when I try to predict using a naive Bayes classifier.
The data consists of a column of sentences and a column of sentiments (aka labels). I want to use a naive Bayes classifier to predict the sentiment of each sentence.
I start off by separating out the training, testing and validation data sets:
import pandas as pd
from sklearn.feature_extraction.text import (CountVectorizer, TfidfVectorizer, TfidfTransformer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
training_set,sentence_split_further,training_set_sentiments,sentiments_split_further=train_test_split(sentence_data.Sentence,sentence_data.Sentiment,test_size=.5, train_size=.5, random_state=1)
testing_set,validation_set,testing_set_sentiments,validation_set_sentiments=train_test_split(sentence_split_further,sentiments_split_further,test_size=.5, train_size=.5, random_state=1)
Then I create a feature matrix, apply tf-idf and select the best k words. I did this all in a function I created called feature_selection_vector:
tfidf_training_feature_matrix=feature_selection_vector(training_set,training_set_sentiments)
tfidf_testing_feature_matrix=feature_selection_vector(testing_set,testing_set_sentiments)
tfidf_validation_feature_matrix=feature_selection_vector(validation_set,validation_set_sentiments)
Here is the code for the feature_selection_vector function
def feature_selection_vector(sentence_data, sentiments):
    # creates the feature vector and calculates tf-idf
    vectorizer = CountVectorizer(analyzer='word',
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',
                                 ngram_range=(1, 1)
                                 )
    count_vectorized = vectorizer.fit_transform(sentence_data)
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    vectorized = tfidf_transformer.fit_transform(count_vectorized)
    vector = pd.DataFrame(vectorized.toarray(),
                          index=['sentence '+str(i)
                                 for i in range(1, 1+len(sentence_data))],
                          columns=vectorizer.get_feature_names())
    selector = SelectKBest(chi2, k=1000)
    selector.fit(vector, sentiments)
    return vector
Now I want to fit the Naive Bayes Classifier with training data and then use the model to predict using testing data.
naive_bayes = MultinomialNB()
naive_bayes.fit(tfidf_training_feature_matrix,training_set_sentiments)
NBC_tfidf_sentiment_predicted=naive_bayes.predict(tfidf_testing_feature_matrix)
However I keep getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 892 is different from 348)
The two sizes it is complaining about are the number of columns of the training set (892) and the number of columns of the testing set (348).
You cannot use fit_transform to get features for the validation and test sets, as you do here (using your feature_selection_vector() function).
fit_transform is used only once, with the training data; for the validation and test sets, a simple transform should be used instead, reusing the existing CountVectorizer and TfidfTransformer that have already been fitted to the training data.
In your code, both the CountVectorizer and the TfidfTransformer are fitted again on the validation and test data, leading to a different number of features in each split, and eventually to the error you report.
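As a minimal sketch of that idea (the helper names below are mine, not from the original post), the vectorizer, transformer and selector can be fitted once on the training data and then only reused on the other splits:
# Sketch only: fit everything on the training split, then reuse with transform().
def fit_feature_pipeline(train_sentences, train_sentiments):
    vectorizer = CountVectorizer(analyzer='word',
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',
                                 ngram_range=(1, 1))
    tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
    selector = SelectKBest(chi2, k=1000)
    X_train = tfidf.fit_transform(vectorizer.fit_transform(train_sentences))
    X_train = selector.fit_transform(X_train, train_sentiments)
    return vectorizer, tfidf, selector, X_train

def transform_features(vectorizer, tfidf, selector, sentences):
    # no fitting here: only transform with the already-fitted objects
    return selector.transform(tfidf.transform(vectorizer.transform(sentences)))

vectorizer, tfidf, selector, tfidf_training_feature_matrix = fit_feature_pipeline(training_set, training_set_sentiments)
tfidf_testing_feature_matrix = transform_features(vectorizer, tfidf, selector, testing_set)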
For more details, see What is the difference between fit_transform and transform in sklearn countvectorizer?
You should seriously think about wrapping up all the stages in a pipeline.
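For illustration, a rough sketch of such a pipeline built from the same components (my sketch, not code from the question):
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('counts', CountVectorizer(analyzer='word',
                               token_pattern=r'\b[a-zA-Z]{3,}\b',
                               ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(smooth_idf=True, use_idf=True)),
    ('select', SelectKBest(chi2, k=1000)),
    ('clf', MultinomialNB()),
])

# the pipeline handles the fit_transform/transform bookkeeping internally
text_clf.fit(training_set, training_set_sentiments)
predicted = text_clf.predict(testing_set)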
Related
I am loading a linear SVM model and then predicting new data using the stored trained SVM model. I used TF-IDF while training, such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at the time of prediction:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the prediction of new data:
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model without re-applying TF-IDF to the training data when I give data to the model for prediction. When I use the new data for prediction, the prediction line gives the error. Is there any way to remove this error?
The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have the exact same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
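For example, a hedged sketch of persisting both fitted objects with joblib (the file names here are placeholders):
import joblib

# at training time: save the fitted vectorizer next to the trained classifier
joblib.dump(vector, "tfidf_vectorizer.sav")
joblib.dump(Linear_SVC_classifier, "Linear_SVC_classifier.sav")

# at inference time: load both, and only transform (never fit) the new text
vector = joblib.load("tfidf_vectorizer.sav")
Linear_SVC_classifier = joblib.load("Linear_SVC_classifier.sav")
new_text = ["some new document to classify"]  # transform expects an iterable of strings
SVM_Prediction_NewData = Linear_SVC_classifier.predict(vector.transform(new_text))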
I would like to train a classifier considering both NLP features extracted via CountVectorizer and linguistic features manually engineered on the original dataset. For instance:
#suppose df_train_multi is our original dataframe object
# split the dataset into training and validation datasets
from sklearn import model_selection
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df_train_multi['Text'], df_train_multi['label'])
# create a count vectorizer object
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df_train_multi['Text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
Suppose I would like to use the number of characters in each text as a feature for the classification.
df_train_multi['char_count'] = df_train_multi['Text'].apply(len)
Then I would like to train my classifier considering both the features present in the scipy sparse matrix and the 'char_count' feature:
from sklearn.svm import LinearSVC
svm_bin = LinearSVC()  # linear SVM with default parameters
svm_bin_clf = svm_bin.fit(xtrain_count, train_y)
But how can I combine the xtrain_count features with the df_train_multi['char_count'] one?
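One common way to do this (a sketch, not from the original thread) is to append the extra column to the sparse matrix with scipy.sparse.hstack:
from scipy.sparse import hstack

# character counts aligned with the train/validation splits
train_char_count = train_x.apply(len).to_numpy().reshape(-1, 1)
valid_char_count = valid_x.apply(len).to_numpy().reshape(-1, 1)

# stack the hand-engineered column next to the CountVectorizer features
xtrain_combined = hstack([xtrain_count, train_char_count])
xvalid_combined = hstack([xvalid_count, valid_char_count])

svm_bin_clf = LinearSVC().fit(xtrain_combined, train_y)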
This question already has an answer here: Testing text classification ML model with new data fails (1 answer). Closed 2 years ago.
Below is the code I am trying for a text classification model:
from sklearn.feature_extraction.text import TfidfVectorizer
ifidf_vectorizer = TfidfVectorizer()
X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape
(3, 16)
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)
Till now only the training set has been vectorized into a full vocabulary. In order to perform analysis on the test set I need to submit it to the same procedures. So I did:
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
And finally, when trying to predict, it shows an error:
predictions = clf.predict(X_test_tfidf)
ValueError: X has 12 features per sample; expecting 16
But when I use a Pipeline (from sklearn.pipeline import Pipeline), then it works fine.
Can’t I code the way I was trying?
The error is with fit_transform of the test data. You fit_transform the training data and only transform the test data:
# change this
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
# to
X_test_tfidf = ifidf_vectorizer.transform(X_test)
X_test_tfidf.shape
Reasons:
When you do fit_transform, the vectorizer learns its vocabulary (and idf weights) from the data with fit; that learned vocabulary is then used to transform data. You use the train data to learn the vocabulary, then you apply it to both train and test with transform.
If you do a fit_transform on the test data, you discard what was learned from the training data and replace it with what is learned from the test data. Given that your test data is smaller than your train data, you will likely get two different vectorizations.
A Better Way
The best way to do what you want is to use a Pipeline, which will make your flow easy to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
clf = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('model', LinearSVC()),
])
# train
clf.fit(X_train,y_train)
# predict
clf.predict(X_test)
This is easier as the transformations are taken care of for you. You don’t have to worry about fit_transform when fitting the model or transform when predicting or scoring.
You can access the steps independently if you wish, with
clf.named_steps['vectorizer']  # or 'model'
Under the hood, when you do clf.fit, your data will pass through your vectorizer using fit_transform and then to the model. When you predict or score, your data will pass through your vectorizer with transform before reaching your model.
Your code fails as you are refitting the vectorizer with .fit_transform() on the test set X_test again. However, you should only transform the data with the vectorizer:
X_test_tfidf = ifidf_vectorizer.transform(X_test)
Now it should work as expected. You only fit the ifidf_vectorizer according to X_train and transform all data according to this. It ensures that the same vocabulary is used and that you get outputs of the same shape.
I'm trying to predict the number of updates ('sys_mod_count') based on the text description ('eng').
I have predefined 'sys_mod_count' into two classes: 1 if >= 17, 0 if < 17.
But I want to remove this condition, as this value is not available at decision time in the real world.
I'm thinking of doing this with a decision tree / random forest method to train the classifier on the feature set.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    # return metrics.accuracy_score(predictions, valid_y)
    return predictions
import pandas as pd
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
df_3 =pd.read_csv('processedData.csv', sep=";")
st_new = df_3[['sys_mod_count','eng','ger']]
st_new['updates_binary'] = st_new['sys_mod_count'].apply(lambda x: 1 if x >= 17 else 0)
st_org = st_new[['eng','updates_binary']]
st_org = st_org.dropna(axis=0, subset=['eng'])  # drop rows where column 'eng' contains missing values
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(st_org['eng'], st_org['updates_binary'],stratify=st_org['updates_binary'],test_size=0.20)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(st_org['eng'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("NB, WordLevel TF-IDF: ", metrics.accuracy_score(accuracy, valid_y))
This seems to be a threshold-setting problem: you would like to set a threshold at which a certain classification is made. No supervised classifier can set the threshold for you, because without training data with binary classes you cannot train the classifier, and to create that training data you need to set the threshold to begin with. It's a chicken-and-egg problem.
If you have some way of identifying which binary label is correct, then you can vary the threshold and measure errors, similarly to how it's suggested here. Then you can either run a classifier on your binary labels based on the threshold, or a regressor on sys_mod_count and convert to binary based on the identified threshold.
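As a rough sketch of that threshold sweep (true_binary below is hypothetical ground-truth labels, which only exist if you have such a source):
import numpy as np
from sklearn.metrics import f1_score

thresholds = range(5, 30)
scores = [f1_score(true_binary, (st_new['sys_mod_count'] >= t).astype(int))
          for t in thresholds]
best_threshold = list(thresholds)[int(np.argmax(scores))]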
The above approach does not work if you have no way to identify what the correct binary label should be. Then, the problem you are trying to solve is creating some boundary between points based on the value of your sys_mod_count variable. This is unsupervised learning. So, techniques like clustering will be helpful here. You can cluster your data into two clusters based on the distance of points from each other, and then label each cluster, which becomes your binary label.
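A minimal sketch of that clustering idea (my example, not from the original answer):
from sklearn.cluster import KMeans

counts = st_new[['sys_mod_count']]  # single numeric feature: the update count
kmeans = KMeans(n_clusters=2, random_state=0).fit(counts)
st_new['updates_cluster'] = kmeans.labels_  # cluster id used as the binary label
# note: which cluster is 0 vs 1 is arbitrary; relabel if needed so the
# high-count cluster maps to 1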
I'm using scikit-learn's (sklearn) linear SVM (LinearSVC) and I'm currently trying to remove the 10% most predictive features for doing sentiment analysis on 3 classes (positive, negative and neutral), in order to see if I can prevent overfitting while working on domain adaptation. I know that it's possible to access the feature weights by using svm.LinearSVC().coef_, but I'm not sure how to remove the 10% most predictive features. Does anyone know how to proceed? In advance, thanks for your help. Here's my code:
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer as cv
# Using linear SVM classifier
clf = svm.LinearSVC()
# Count vectorizer used by the SVM classifier (select which ngrams to use here)
vec = cv(lowercase=True, ngram_range=(1,2))
# Fit count vectorizer with training text data
vec.fit(trainStringList)
# X represents the text data from the respective datasets
# Transforms text into vectors for the training set
X_train = vec.transform(trainStringList)
#transforms text into vectors for the test set
X_test = vec.transform(testStringList)
# Y represents the labels from the respective datasets
# Converting labels from the respective data sets to integers (0="positive", 1= "neutral", 2= "negative")
Y_train = trainLabels
Y_test = testLabels
# Fitting the training data to the linear SVM classifier
clf.fit(X_train,Y_train)
for feature_vector in clf.coef_:
    ???
Coefficients with the largest absolute values indicate the features with the greatest influence on predictions. You can eliminate the features associated with these parameters. I do not advise this. If your goal is to reduce overfitting, C is the regularization parameter in this model; smaller values of C apply stronger regularization. Provide a lower C value when initiating the LinearSVC object (default is 1):
clf = svm.LinearSVC(C=0.1)
You should be doing some sort of cross-validation to determine the best values for hyperparameters, such as C.
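For instance, a small grid search over C could look like this (the candidate values here are arbitrary):
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(svm.LinearSVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)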