I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error: "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again? I would have thought that the point of being able to save the model would be to not have to do this.
The code I am using is below excluding the data cleaning section. Any help to get the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print (result)
Your training has 3 parts, each of which is fitted on the data:
CountVectorizer: Learns the vocabulary of the training data and returns counts
TfidfTransformer: Learns the inverse document frequencies (IDF weights) from the counts produced by the previous part, and returns tfidf
LogisticRegression: Learns the coefficients for features for optimum classification performance.
Since each part learns something from the data and uses it to output the transformed data, you need all 3 parts when testing on new data. But you are only saving the lr with joblib, so the other two are lost, and with them the training vocabulary and IDF weights.
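To make the "each part learns something" point concrete, here is a minimal sketch on made-up toy data (the texts and labels are assumptions for illustration, not the asker's data). It fits the three stages and prints the size of the state each one holds after fitting; that state is what disappears if only the classifier is saved.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration only
train_texts = ["good movie", "bad movie", "good plot", "bad acting"]
train_labels = [1, 0, 1, 0]

cvec = CountVectorizer()
counts = cvec.fit_transform(train_texts)         # learns cvec.vocabulary_

tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)  # learns tfidf_transformer.idf_

lr = LogisticRegression()
lr.fit(tfidf, train_labels)                      # learns lr.coef_

# All three fitted attributes are sized by the training vocabulary
n_vocab = len(cvec.vocabulary_)
print(n_vocab, tfidf_transformer.idf_.shape, lr.coef_.shape)
```

The three shapes agree on the vocabulary size, which is why the classifier cannot be used with a vectorizer that has learned a different vocabulary.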
Now in your testing part, you are initializing a new CountVectorizer and TfidfTransformer. Calling transform() on them fails with NotFittedError, because they have never been fitted. If you instead call fit() (or fit_transform()) on them, they will learn the vocabulary only from this new data, so there will be fewer words than in training. But the previously saved LR model expects the same features as the training data, hence an error like this:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing:
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you won't get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you don't have to use two objects all the time.
Along with that, you should use a Pipeline to combine the two steps, TfidfVectorizer and LogisticRegression, so that you only have to use a single object (which is easier to save, load and handle generically).
So edit the training part like this:
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(x_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_pipe.predict(new_x)
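As a self-contained check of the pipeline approach, the following sketch trains a small pipeline on invented toy data, saves it with joblib into a temporary directory, loads it back and predicts on raw strings. The texts, labels and the file name pipeline_demo.joblib are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
train_texts = ["great product", "terrible service", "great service", "terrible product"]
train_labels = [1, 0, 1, 0]
new_texts = ["great experience", "terrible experience overall"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("lr", LogisticRegression())])
pipe.fit(train_texts, train_labels)
expected = pipe.predict(new_texts)

# Save and reload the whole pipeline as a single object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_demo.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

# The loaded pipeline takes raw text; vectorization happens internally
result = loaded_pipe.predict(new_texts)
same_predictions = bool((result == expected).all())
print(result, same_predictions)
```

Because the fitted vectorizer travels inside the pipeline, no separate vectorizer file has to be kept in sync with the classifier.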
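You have not fitted the CountVectorizer, which is what raises the NotFittedError.
Calling fit_transform() as below removes that error, but note that fitting on the new data learns a new vocabulary, so the features will not match the ones the saved model was trained on: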
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidfTransformer must be used like this:
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)
I am loading Linear SVM model and then predicting new data using the stored trained SVM Model. I used TFIDF while training such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model without re-applying TFIDF to the training data each time I give data to the model for prediction. When I use the new data for prediction, the prediction line gives the error above. Is there any way to remove this error?
The problem is that you create a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have exactly the same number of features.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
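A minimal sketch of that idea, assuming invented toy texts and labels: the vectorizer and classifier are fitted once, serialized with pickle, then restored and used on a single user-supplied string. pickle.dumps/loads are used here to keep the example in memory; pickle.dump/load with files (or joblib) would work the same way.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
train_texts = ["cheap pills online", "meeting at noon", "win cheap prizes", "lunch meeting today"]
train_labels = [1, 0, 1, 0]

# Training side: fit once, then serialize BOTH the vectorizer and the classifier
vectorizer = TfidfVectorizer(ngram_range=(1, 3)).fit(train_texts)
classifier = LinearSVC().fit(vectorizer.transform(train_texts), train_labels)
vectorizer_bytes = pickle.dumps(vectorizer)
classifier_bytes = pickle.dumps(classifier)

# Inference side: restore both objects and never call fit() again
loaded_vectorizer = pickle.loads(vectorizer_bytes)
loaded_classifier = pickle.loads(classifier_bytes)

user_text = "cheap prizes today"  # stands in for the string returned by input()
# transform() expects an iterable of documents, so wrap the single string in a list
features = loaded_vectorizer.transform([user_text])
prediction = loaded_classifier.predict(features)

print(features.shape[1] == loaded_classifier.n_features_in_, prediction)
```

The feature count of the transformed input matches what the classifier expects because both come from the same fitted vectorizer.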
I have several logs that are manipulated by two different TfIdfVectorizer objects.
The first one reads and splits the log in ngrams
with open("my/log/path.txt", "r") as test:
    corpus = [test.read()]
# the with block closes the file, so no explicit close() is needed
tf = TfidfVectorizer(ngram_range=ngr)
corpus_transformed = tf.fit_transform(corpus)
The resulting data is written to a Pandas dataframe that has 4 columns
(score [float], review [ngrams of text], isbad [0/1], kfold [int]).
The initial kfold value is -1.
The dataframe is built with:
my_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf.get_feature_names()).transpose()
For cross-validation I split the dataset in test and train with StratifiedKFolds by doing a simple:
ngr=(1,2)
for fold_ in range(5):
    # 'reviews' column has short sentences expressing an opinion
    train_df = df[df.kfold != fold_].reset_index(drop=True)
    test_df = df[df.kfold == fold_].reset_index(drop=True)
    new_tf = TfidfVectorizer(ngram_range=ngr)
    new_tf.fit(train_df.reviews)
    xtrain = new_tf.transform(train_df.reviews)
    xtest = new_tf.transform(test_df.reviews)
And only after this double tfidf transformation I fit my SVC model with:
model.fit(xtrain, train_df.isbad) # where 'isbad' column is 0 if negative and 1 if positive
preds = model.predict(xtest)
accuracy = metrics.accuracy_score(test_df.isbad, preds)
So at the end of the day I have my model that classifies reviews in both classes (negative-0 or positive-1), I dump my model and both tfidf vectorizers (tf and new_tf) but when it comes to new data, even if I do:
with open("never/seen/data.txt", "r") as unseen: # load REAL SAMPLE data
    corpus = [unseen.read()]
# to transform the unseen data I use one of the dumped tfidf's obj
corpus_transformed = tf_dump.transform(corpus)
my_unseen_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf_dump.get_feature_names()).transpose()
my_unseen_df = my_unseen_df.sample(frac=1).reset_index(drop=True) # randomize rows
# to transform reviews' data that are going to be classified I use the new_tf dump, like before
X = new_tf_dump.transform(my_unseen_df.reviews)
# use the previously loaded model and make predictions
res = model_dump.predict(X)
#print(res)
I get ValueError: X has 604,969 features, but SVC is expecting 605,424 as input. How is that possible if I manipulate the data with the same objects? What am I doing wrong here?
I want to use my trained model as a classifier for new, unseen data. Isn't this the right way to go?
Thank you.
I am using the LinearSVC() available in scikit-learn to classify texts into a maximum of seven labels, so it is a multilabel classification problem. I am training on a small amount of data and testing it. Now I want to add more data (retrieved from a pool based on some criteria) to the fitted model and evaluate on the same test set. How can this be done?
Question:
Is it necessary to merge the previous data set with the new data set, preprocess everything, and then retrain to see if the performance improves with the old + new data?
My code so far is below:
def preprocess(data, x, y):
    global Xfeatures
    global y_train
    global labels
    porter = PorterStemmer()
    multilabel = MultiLabelBinarizer()
    y_train = multilabel.fit_transform(data[y])
    print("\nLabels are now binarized\n")
    data[multilabel.classes_] = y_train
    labels = multilabel.classes_
    print(labels)
    data[x].apply(lambda x: nt.TextFrame(x).noise_scan())
    print("\nEnglish stop words were extracted\n")
    data[x].apply(lambda x: nt.TextExtractor(x).extract_stopwords())
    corpus = data[x].apply(nfx.remove_stopwords)
    # stem the text from which the stop words were removed
    corpus = corpus.apply(lambda x: porter.stem(x))
    tfidf = TfidfVectorizer()
    Xfeatures = tfidf.fit_transform(corpus).toarray()
    print('\nThe text is now vectorized\n')
    return Xfeatures, y_train
Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')
Xfeatures_train=Xfeatures[:300]
y_train_features = y_train[:300]
X_test=Xfeatures[300:400]
y_test=y_train[300:400]
X_pool=Xfeatures[400:]
y_pool=y_train[400:]
def model(modelo, tipo):
    svc = modelo
    clf = tipo(svc)
    clf.fit(Xfeatures_train, y_train_features)
    clf_predictions = clf.predict(X_test)
    return clf_predictions
preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
It depends on what your previous dataset was like. If it was a good representation of the problem at hand, then adding more data will not improve your model performance by much, so you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, in which case more data will increase your classification accuracy. In that case it is better to include all the data and preprocess it again, because preprocessing generally includes parameters that are computed on the dataset as a whole, for example TF-IDF (which I can see you use) or a mean, both of which are sensitive to the dataset at hand.
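The comparison described above can be sketched as follows. This is a simplified illustration on invented single-label toy data (the texts, labels and the helper fit_and_score are assumptions, and the real problem is multilabel): the TF-IDF vectorizer is refitted on whatever training set is used, and both models are scored on the same held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
old_texts = ["stock prices fall", "team wins match",
             "market rally continues", "player scores goal"]
old_labels = ["finance", "sport", "finance", "sport"]
new_texts = ["bank raises interest rates", "coach praises striker"]
new_labels = ["finance", "sport"]
test_texts = ["interest rates and stock market", "striker scores in match"]
test_labels = ["finance", "sport"]


def fit_and_score(train_texts, train_labels):
    """Refit TF-IDF and the classifier on the given training set,
    then score on the fixed test set (hypothetical helper)."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF depend on this set
    X_test = vectorizer.transform(test_texts)        # test set is only transformed
    clf = LinearSVC(class_weight="balanced").fit(X_train, train_labels)
    accuracy = accuracy_score(test_labels, clf.predict(X_test))
    return X_train.shape[1], accuracy


n_old, acc_old = fit_and_score(old_texts, old_labels)
n_all, acc_all = fit_and_score(old_texts + new_texts, old_labels + new_labels)
print(n_old, acc_old)
print(n_all, acc_all)
```

The feature count grows when the new texts are added, which is why the vectorizer has to be refitted together with the classifier rather than reused from the first round.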
I am trying to do some bad case analysis on my product categorization model using SHAP. My data looks something like this:
corpus_train, corpus_test, y_train, y_test = train_test_split(data['Name_Description'],
                                                              data['Category_Target'],
                                                              test_size = 0.2,
                                                              random_state=8)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
X_train = vectorizer.fit_transform(corpus_train)
X_test = vectorizer.transform(corpus_test)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
X_train_sample = shap.sample(X_train, 100)
X_test_sample = shap.sample(X_test, 20)
masker = shap.maskers.Independent(data=X_test_sample)
explainer = shap.LinearExplainer(model, masker=masker)
shap_values = explainer.shap_values(X_test_sample)
X_test_array = X_test_sample.toarray()
shap.summary_plot(shap_values, X_test_array, feature_names=vectorizer.get_feature_names(), class_names=data['Category'].unique())
Now to save space I didn't include the actual summary plot, but it looks fine. My issue is that I want to be able to analyze a single prediction and get something more along these lines:
In other words, I want to know which specific words contribute the most to the prediction. But when I run the code in cell 36 in the image above I get an
AttributeError: 'numpy.ndarray' object has no attribute 'output_names'
I'm still confused on the indexing of shap_values. How can I solve this?
I was unable to find a solution with SHAP, but I found a solution using LIME. The following code displays a very similar output where it's easy to see how the model made its prediction and how much certain words contributed.
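from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
# chain the fitted vectorizer and the fitted classifier (named `model` in the question)
c = make_pipeline(vectorizer, model)
# saving a list of strings version of the X_test object
ls_X_test= list(corpus_test)
# saving the class names in a list to increase interpretability
class_names = list(data.Category.unique())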
# Create the LIME explainer
# add the class names for interpretability
LIME_explainer = LimeTextExplainer(class_names=class_names)
# explain the chosen prediction
# use the probability results of the logistic regression
# can also add num_features parameter to reduce the number of features explained
LIME_exp = LIME_explainer.explain_instance(ls_X_test[idx], c.predict_proba)
LIME_exp.show_in_notebook(text=True, predict_proba=True)
Using KernelSHAP, you first need to compute the Shapley values and then pick out the single instance, as follows:
#convert your training and testing data using the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_train = tfidf_vectorizer.fit_transform(IV_train)
tfidf_test = tfidf_vectorizer.transform(IV_test)
model=LogisticRegression()
model.fit(tfidf_train, DV_train)
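#shap apply
#first shorten the data & convert to data frame
X_train_sample = tfidf_train[0:20].toarray()
# assumption: the test sample is the first 20 test rows, mirroring the training sample
X_test_sample = tfidf_test[0:20].toarray()
sample_text = pd.DataFrame(X_test_sample)
SHAP_explainer = shap.KernelExplainer(model.predict, X_train_sample)
shap_vals = SHAP_explainer.shap_values(X_test_sample)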
#print it.
print(df_test.iloc[7].Text , df_test.iloc[7].Label)
shap.initjs()
shap.force_plot(SHAP_explainer.expected_value, shap_vals[7,:],sample_text.iloc[7,:], feature_names=tfidf_vectorizer.get_feature_names_out())
For this instance, the original text is "good article interested natural alternatives treat ADHD" and the label is "1".
I am trying to run a classifier on some movie review data. The data had already been separated into reviews_train.txt and reviews_test.txt. I then loaded the data in and separated each into review and label (either positive (0) or negative (1)) and then vectorized this data. Here is my code:
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
#read the reviews and their polarities from a given file
def loadData(fname):
    reviews=[]
    labels=[]
    f=open(fname)
    for line in f:
        review,rating=line.strip().split('\t')
        reviews.append(review.lower())
        labels.append(int(rating))
    f.close()
    return reviews,labels
rev_train,labels_train=loadData('reviews_train.txt')
rev_test,labels_test=loadData('reviews_test.txt')
#vectorizing the input
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors_train, labels_train)
#prediction
pred=clf.predict(vectors_test)
#print accuracy
print (accuracy_score(pred,labels_test))
However I keep getting this error:
ValueError: Number of features of the model must match the input.
Model n_features is 118686 and input n_features is 34169
I am pretty new to Python so I apologize in advance if this is a simple fix.
The problem is right here:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
You call fit_transform on both the training and the testing data. fit_transform fits the model stored in vectorizer (its vocabulary and IDF weights) and then uses that model to create the vectors. Because you call it twice, vectors_train is created from a model fitted on the training data, and the second call then refits the model on the test data, overwriting it. vectors_test therefore has a different number of features from the vectors the decision tree was trained on, which causes the error.
When performing testing, you must transform the data with the same model that was used for training. Therefore, don't call fit_transform on the testing data - just use transform instead to use the already created model:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.transform(rev_test) # Change here
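The difference can be seen directly in the matrix shapes. The sketch below uses a few invented reviews (an assumption for illustration) and a second, separate vectorizer to show what refitting on the test data would produce, so the correctly fitted one is not overwritten.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration only
rev_train = ["a great film", "a dull film", "great acting and great plot"]
rev_test = ["dull plot"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectors_train = vectorizer.fit_transform(rev_train)
n_train_features = vectors_train.shape[1]

# Correct: reuse the vocabulary learned from the training data
vectors_test = vectorizer.transform(rev_test)

# Incorrect: refitting on the test data learns a different, smaller vocabulary
refit_test = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(rev_test)

print(n_train_features, vectors_test.shape[1], refit_test.shape[1])
```

Only the matrix produced by transform() has the same number of columns as the training matrix, so only that one can be passed to a classifier trained on vectors_train.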