I have created a ridge regression model to predict sales of an item, say X. My final model contains around 180 features. I have pickled this model, and now I want to see how it performs on a new dataset that contains the same features but covers a different timeframe.
I have to pass the entire new dataset (a dataframe) into the existing model and check a relevant score, say R-squared or any other score relevant to a regression model.
Below is the code I'm using:
import pickle

# save the trained model
with open('ridge_model', 'wb') as p:
    pickle.dump(ridge, p)

# split the new data into features and target
y_test = df.pop('Target')
x_test = df

# load the saved model
with open('ridge_model', 'rb') as p:
    new_reg = pickle.load(p)

r_squared = new_reg.score(x_test, y_test)
print(r_squared)
Here, ridge is the regression model I created,
df is the new data on which predictions need to be made,
y_test contains the target variable, and
x_test contains all features in the dataset except the target variable.
I want to understand:
Can we pass an entire new dataset into an existing model for prediction and see how the model is performing?
In the line r_squared = new_reg.score(x_test, y_test), does .score calculate R-squared the same way it was calculated for the existing model, or does it calculate some other score?
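For what it's worth, scikit-learn regressors (Ridge included) implement .score() as the coefficient of determination R-squared, computed on whatever data you pass in. A short sketch showing the equivalence via sklearn.metrics.r2_score:
from sklearn.metrics import r2_score

y_pred = new_reg.predict(x_test)
print(r2_score(y_test, y_pred))  # same value as new_reg.score(x_test, y_test)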
Related
I am loading a Linear SVM model and then predicting on new data using the stored trained SVM model. I used TF-IDF while training, such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the prediction of new data:
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model, without having to re-apply TF-IDF to the training data every time I give data to the model for prediction. When I use the new data for prediction, the prediction line gives the error above. Is there any way to remove this error?
The problem is due to your creating a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have exactly the same dimensions.
To achieve that, you need to transform your test dataset with the same vectorizer that was used during training, rather than fit a new one on the test set.
The vectorizer fitted on the training set can be pickled and stored for later use, which avoids any re-fitting at inference time.
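A minimal sketch of that workflow, using the names from the question (the file name tfidf_vectorizer.sav is hypothetical, and joblib is assumed to be available, with Linear_SVC_classifier loaded as in the question):
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# training time: fit once on the training text, then persist the fitted vectorizer
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
joblib.dump(vector, "tfidf_vectorizer.sav")

# inference time: load the SAME fitted vectorizer and only call transform()
vector = joblib.load("tfidf_vectorizer.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform([test_data])  # transform expects an iterable of documents, not a bare string
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)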
I am working on a flight ticket price prediction dataset using PySpark MLlib, which contains both a train and a test dataset. I have successfully fitted my model on the train dataset and predicted the price (the label column), but I don't know how to apply the same model to the test dataset to predict the ticket price.
The following code trains the model on the train dataset (which contains both the features and the label column):
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(featuresCol="features", labelCol="Price", maxIter=10)
gbtModel = gbt.fit(training_data)

predictions_gbt = gbtModel.transform(testing_data)
predictions_gbt.select("features", "Price", "prediction").show()
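If the separate test dataset has no Price column, the same fitted model can still be applied with transform(). A minimal sketch, assuming real_test_data has gone through the same feature-assembly steps as the training data (the variable name and save path below are mine):
# apply the already-fitted model to the unlabeled test set
unlabeled_predictions = gbtModel.transform(real_test_data)
unlabeled_predictions.select("features", "prediction").show()

# optionally persist the fitted model for reuse in another session
gbtModel.save("gbt_price_model")
# later:
# from pyspark.ml.regression import GBTRegressionModel
# gbtModel = GBTRegressionModel.load("gbt_price_model")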
Is there any possible way to train a model in one file with one list of documents, and then predict on another list in a different file? As written, this does not work because vectorizer is not defined in the second file. I do not want to re-train the model every time I run my program. Is it possible to save the model, load it in another file, and predict on another list there?
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

# training
document = df.stack().tolist()
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)
true_k = 20
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
joblib.dump(model, 'model.joblib')

# prediction (intended to live in a separate file)
model = joblib.load('model.joblib')
documentnew = df.stack().tolist()
print("\n")
print("Prediction")
X = vectorizer.transform([documentnew[2]])
predicted = model.predict(X)
When I load the model and transform documentnew, I get this error:
ValueError: Incorrect number of features. Got 7 features, expected 39
The approach above works if everything is in the same file, but I don't want the model to change and be re-trained every time I run the program.
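One common fix, sketched under the assumption that both files can import joblib (the file names are placeholders): persist the fitted vectorizer alongside the model, then load both in the second file instead of re-fitting.
import joblib

# training file: save BOTH fitted objects
joblib.dump(vectorizer, 'vectorizer.joblib')
joblib.dump(model, 'model.joblib')

# prediction file: load them and only transform, never re-fit
vectorizer = joblib.load('vectorizer.joblib')
model = joblib.load('model.joblib')
X_new = vectorizer.transform([documentnew[2]])
predicted = model.predict(X_new)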
I am trying to run a classifier on some movie review data. The data had already been separated into reviews_train.txt and reviews_test.txt. I then loaded the data in and separated each into review and label (either positive (0) or negative (1)) and then vectorized this data. Here is my code:
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# read the reviews and their polarities from a given file
def loadData(fname):
    reviews = []
    labels = []
    f = open(fname)
    for line in f:
        review, rating = line.strip().split('\t')
        reviews.append(review.lower())
        labels.append(int(rating))
    f.close()
    return reviews, labels

rev_train, labels_train = loadData('reviews_train.txt')
rev_test, labels_test = loadData('reviews_test.txt')

# vectorizing the input
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors_train, labels_train)

# prediction
pred = clf.predict(vectors_test)

# print accuracy
print(accuracy_score(pred, labels_test))
However, I keep getting this error:
ValueError: Number of features of the model must match the input.
Model n_features is 118686 and input n_features is 34169
I am pretty new to Python so I apologize in advance if this is a simple fix.
The problem is right here:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
You call fit_transform on both the training and testing data. fit_transform simultaneously fits the model stored in vectorizer and then uses that model to create the vectors. Because you call it twice, vectors_train is created first along with its feature vectors, and then the second call to fit_transform on the test data overwrites the fitted model. This produces the difference in vector size: you trained the decision tree on features of a different length than those of the test data.
When performing testing, you must transform the data with the same model that was used for training. Therefore, don't call fit_transform on the testing data; just use transform, which reuses the already fitted model:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.transform(rev_test) # Change here
I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error: "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again? I would have thought that the point of being able to save the model was to avoid having to do this.
The code I am using is below, excluding the data cleaning section. Any help getting the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print(result)
Your training part has 3 components which are fitted to the data:
CountVectorizer: learns the vocabulary of the training data and returns counts
TfidfTransformer: learns the idf weights from the counts produced by the previous part, and returns tf-idf values
LogisticRegression: learns the coefficients of the features for optimum classification performance
Since each component learns something about the data and uses it to transform the data, you need all 3 of them when testing on new data. But you are only saving the lr with joblib, so the other two are lost, and with them the training-data vocabulary and counts.
Now, in your testing part you are initializing a new CountVectorizer and TfidfTransformer; if you call fit() (or fit_transform()) on them, they will learn the vocabulary only from this new data, so there will be fewer words than during training. But you then load the previously saved LR model, which expects the data to have the same features as the training data. Hence this error:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing:
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you won't get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you don't have to juggle two objects all the time.
Along with that, you should use a Pipeline to combine the two steps, TfidfVectorizer and LogisticRegression, so that you only have to handle a single object (which is easier to save, load, and use generically).
So edit the training part like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(x_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_pipe.predict(new_x)
You have not fitted the CountVectorizer.
You should do it like this:
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidfTransformer must be used like this:
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)