Is there any way to train a model in one file, with one list, and then run predictions on another list in a different file? As written below, this does not work because vectorizer is not defined in the second file. I do not want to retrain the model every time I run my program. Is it possible to save the model, load it in another file, and predict on new data there?
# training file
import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

document = df.stack().tolist()  # df is an existing DataFrame of documents
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)
true_k = 20
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
joblib.dump(model, 'model.joblib')
# prediction file
import joblib

model = joblib.load('model.joblib')
documentnew = df.stack().tolist()
print("\n")
print("Prediction")
X = vectorizer.transform([documentnew[2]])  # fails: vectorizer is not defined in this file
predicted = model.predict(X)
When I load the model and transform documentnew, I get this error:
ValueError: Incorrect number of features. Got 7 features, expected 39
The approach above works if everything is in the same file, but I don't want the model to change and be retrained every time I run the program.
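One way to make this work (a minimal sketch, assuming the same df and setup as above; file names are illustrative) is to persist the fitted vectorizer alongside the model, then load both in the prediction file and call transform() only:

# in the training file: save both artifacts
import joblib
joblib.dump(model, 'model.joblib')
joblib.dump(vectorizer, 'vectorizer.joblib')  # the fitted vocabulary must be saved too

# in the prediction file: load both, never re-fit
import joblib
model = joblib.load('model.joblib')
vectorizer = joblib.load('vectorizer.joblib')
documentnew = df.stack().tolist()
X = vectorizer.transform([documentnew[2]])  # transform() reuses the training vocabulary
predicted = model.predict(X)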
Related
I am loading a linear SVM model and then predicting new data using the stored, trained SVM model. I used TF-IDF while training, like so:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the prediction of new data:
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model without having to re-apply TF-IDF fitting to the training data every time I predict. When I use new data for prediction, the prediction line gives the error above. Is there any way to remove this error?
The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have exactly the same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
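A minimal sketch of that workflow, assuming vector is the TfidfVectorizer fitted on the training data as above (file paths are illustrative):

import joblib

# at training time: persist the fitted vectorizer next to the classifier
joblib.dump(vector, 'tfidf_vectorizer.sav')

# at inference time: load both, and only call transform() on new data
vector = joblib.load('tfidf_vectorizer.sav')
Linear_SVC_classifier = joblib.load('Linear_SVC_classifier.sav')
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform([test_data])  # note the list: one document
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)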
I have created a ridge regression model to predict sales of an item, say X. My final model contains around 180 features. I have pickled this model, and now I want to see how it performs on a new dataset that contains the same features but covers a different timeframe.
I have to pass the entire new dataset (dataframe) into the existing model and check a relevant score, say R-squared or any other score relevant to a regression model.
Below is the code I'm using:
# loading library
import pickle

# save the trained model
with open('ridge_model', 'wb') as p:
    pickle.dump(ridge, p)

y_test = df.pop('Target')
x_test = df

# load saved model
with open('ridge_model', 'rb') as p:
    new_reg = pickle.load(p)

r_squared = new_reg.score(x_test, y_test)
print(r_squared)
Here:
ridge is the regression model I created
df is the new data on which predictions need to be made
y_test contains the target variable
x_test contains all features in the dataset except the target
I want to understand:
1. Can we pass an entire new dataset into the existing model for prediction and see how the model performs?
2. In the line r_squared = new_reg.score(x_test, y_test), does .score calculate R-squared as it was calculated for the existing model, or some other score?
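For reference, scikit-learn regressors' score() returns the coefficient of determination (R-squared) computed on whatever data you pass in. A minimal check, reusing new_reg, x_test, and y_test from the code above:

from sklearn.metrics import r2_score

# .score() on a regressor computes R^2 on the data you pass in,
# so it should agree with r2_score computed by hand
r_squared = new_reg.score(x_test, y_test)
manual = r2_score(y_test, new_reg.predict(x_test))
print(r_squared, manual)  # the two values should match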
I am trying to run a classifier on some movie review data. The data had already been separated into reviews_train.txt and reviews_test.txt. I then loaded the data in and separated each into review and label (either positive (0) or negative (1)) and then vectorized this data. Here is my code:
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
# read the reviews and their polarities from a given file
def loadData(fname):
    reviews = []
    labels = []
    f = open(fname)
    for line in f:
        review, rating = line.strip().split('\t')
        reviews.append(review.lower())
        labels.append(int(rating))
    f.close()
    return reviews, labels
rev_train,labels_train=loadData('reviews_train.txt')
rev_test,labels_test=loadData('reviews_test.txt')
#vectorizing the input
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors_train, labels_train)
#prediction
pred=clf.predict(vectors_test)
#print accuracy
print (accuracy_score(pred,labels_test))
However I keep getting this error:
ValueError: Number of features of the model must match the input.
Model n_features is 118686 and input n_features is 34169
I am pretty new to Python so I apologize in advance if this is a simple fix.
The problem is right here:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
You call fit_transform on both the training and testing data. fit_transform builds the model stored in vectorizer (here, the vocabulary) and then uses that model to create the vectors. Because you call it twice, vectors_train is created first from the training vocabulary, and then the second call overwrites the model with a vocabulary learned from the test data. This produces the difference in vector size: the decision tree was trained on features of a different length than those of the test data.
When performing testing, you must transform the data with the same model that was used for training. Therefore, don't call fit_transform on the testing data - just use transform instead to use the already created model:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.transform(rev_test) # Change here
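To see why the shapes differ, note that each fit learns a vocabulary from whatever corpus it is given. A small illustration (toy sentences; the exact widths will vary with the corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

train = ["a great movie", "a terrible movie"]
test = ["an awful film with bad acting"]

v = TfidfVectorizer(ngram_range=(1, 2))
print(v.fit_transform(train).shape)        # width = vocabulary learned from train
print(v.fit_transform(test).shape)         # different width: vocabulary re-learned from test
print(v.fit(train).transform(test).shape)  # same width as the training matrix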
I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error: "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again? I would have thought that the point of being able to save the model was to not have to do this.
The code I am using is below excluding the data cleaning section. Any help to get the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print (result)
Your training stage has three parts, each of which is fitted on the data:
CountVectorizer: Learns the vocabulary of the training data and returns counts
TfidfTransformer: Learns the IDF weights from the counts produced by the previous part, and returns TF-IDF values
LogisticRegression: Learns the coefficients for features for optimum classification performance.
Since each part learns something about the data and uses it to output the transformed data, you need all three parts when testing on new data. But you are only saving the lr with joblib, so the other two are lost, and with them the training vocabulary and counts.
In your testing part, you initialize a new CountVectorizer and TfidfTransformer and call fit() (or fit_transform()), which learns a vocabulary only from this new data. So there are fewer words than in the training vocabulary. But then you load the previously saved LR model, which expects data with the same features as the training data. Hence this error:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing:
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you won't get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you don't have to juggle two objects all the time.
Along with that, you should use a Pipeline to combine the two steps, TfidfVectorizer and LogisticRegression, so that you only have to handle a single object (which is easier to save, load, and use generically).
So edit the training part like this:
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(x_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_pipe.predict(new_x)
You have not fit the CountVectorizer.
You should do it like this:
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidfTransformer must be used like this:
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)
I'm building a Machine Learning model using Pandas, but having a hard time applying my model to test data inputted by the user. My data is basically a dataframe with 2 columns: text and sentiment. I want to be able to predict the sentiment that the user inputs. Here's what I do:
1. Training/testing model
# reading dataset
df = pd.read_csv('dataset/dataset.tsv', sep='\t')
# splitting training/test set
test_size = 0.1
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['sentiment'], test_size=test_size)
# label encode the target variable (i.e. negative = 0, positive = 1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
# function to train the model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, name):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # save the trained model in the "models" folder
    joblib.dump(classifier, 'models/' + name + '.pkl')
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(predictions, valid_y)
# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count, 'NB-COUNT')
print("NB, Count Vectors: ", accuracy)
Everything works fine; I get an accuracy of about 80%.
2. Testing the model on user input
Then I read the saved model again, get the user input and try to make a prediction (the user input is hardcoded right now in input_text):
clf = joblib.load('models/NB-COUNT.pkl')
dataset_df = pd.read_csv('dataset/dataset.tsv', sep='\t')
input_text = 'stackoverflow is the best' # the sentence I want to predict the sentiment for
test_df = pd.Series(data=input_text)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(dataset_df['text']) # fit the count vectorizer again so we can extract features from test_df
features = count_vect.transform(test_df)
result = clf.predict(features)[0]
print(result)
But the error I get is 'dimension mismatch':
Traceback (most recent call last):
File "C:\Users\vdvax\iCloudDrive\Freelance\09. Arabic Sentiment Analysis\test.py", line 20, in <module>
result = clf.predict(features)[0]
File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
jll = self._joint_log_likelihood(X)
File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T) +
File "C:\Python36\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Python36\lib\site-packages\scipy\sparse\base.py", line 515, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
You're getting the dimension mismatch error because the output of the CountVectorizer transformation does not match the shape expected by the fitted estimator. This is because you fit a separate CountVectorizer on your test data.
Scikit-learn provides a handy interface called a Pipeline that will allow you to stack your pre-processors and estimator together in a single estimator class. You should put all of your transformers into a Pipeline before your estimator, and then your test data will be transformed by the pre-fit transformer classes. Here's how you could fit a pipelined version of your estimator:
from sklearn.pipeline import Pipeline
# takes a list of tuples where the first arg is the step name,
# and the second is the estimator itself.
pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB())
])
# you can fit a pipeline in the same way you would any other estimator,
# and it will go sequentially through every stage
pipe.fit(train_x, train_y)
# you can produce predictions by feeding your test data into the pipe
pipe.predict(valid_x)
Note that you don't have to create numerous copies of your data in various stages of pre-processing this way either, since the output of one stage is fed directly into the next stage.
Now, for your persistence problem. Pipelines can be persisted in the same way as other models:
joblib.dump(pipe, 'models/NB-COUNT.pkl')
loaded_model = joblib.load('models/NB-COUNT.pkl')
loaded_model.predict(test_df)
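Since the loaded pipeline accepts raw text, the hardcoded user input from the question becomes a one-liner (a usage sketch reusing the names above):

result = loaded_model.predict(['stackoverflow is the best'])[0]
print(result)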