How to use a multiclass classification model to make predictions on an entire dataframe - python

I have trained multiclass classification models on my training and test sets and achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and a text column with all the texts (including those that have not been labeled).
data={'categories':['1','NaN','3', 'NaN'], 'documents':['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df=pd.DataFrame(data)
First, I drop the rows with NaN values in the 'categories' column. Then I create the document-term matrix, define the y, and split into training and test sets.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model and get good results:
from sklearn import metrics
from sklearn.svm import SVC

svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that my X_entire_df has more features than the SVC model is expecting as input. I imagine this is because I am now applying the model to the whole 'documents' column, but I do not know how to fix this.
I would appreciate your help!

These issues usually come from feeding the model unknown or unseen data (more or fewer features than were used for training). Here, you fitted a second CountVectorizer on the full 'documents' column, so it learned a larger vocabulary (36976 features) than the one the SVC was trained on (8989 features). You should reuse the original vectorizer and call its transform() method rather than fitting a new one.
I would strongly suggest using sklearn.pipeline to combine the preprocessing (CountVectorizer) and the machine learning model (SVC) in a single object.
From experience, this helps a lot to avoid tedious preprocessing/fitting mismatches like this one.
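A minimal sketch of such a pipeline, assuming the df from the question and that the unlabeled rows hold real NaN values (so dropna can find them):
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ('tf', CountVectorizer(tokenizer=word_tokenize)),
    ('svm', SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)),
])

# Fit only on the manually labeled rows.
labeled = df.dropna(subset=['categories'])
pipe.fit(labeled['documents'], labeled['categories'])

# predict() reuses the vocabulary fitted above, so the feature counts always match.
df['predicted_category'] = pipe.predict(df['documents'])
Because the fitted vectorizer travels inside the pipeline, you can also pickle it together with the model as a single object.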

Related

Building a basic prediction model with the output being the sum of the two inputs but accuracy score is significantly low

I have a csv of size 12500 x 3. The first two columns (A and B) are inputs and the final column (C) is the sum of the two columns.
I wanted to build a prediction model to get the value of C for a given A and B. This is just a basic model to improve my understanding of machine learning.
The accuracy score is almost zero (0.00032), and the model is way too simple to get the predictions wrong. The code is below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('Dataset.csv') #importing dataset
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score
I did not even include outliers in the data, and I created the csv using Excel formulae. I used a Jupyter notebook to build this prediction model. Can someone please point out if/what I'm doing wrong?
Before you build your model, you should understand its behavior and main function. A Decision Tree classifier assigns data to discrete classes based on criteria extracted from the data, and accuracy_score only counts exact class matches, so a continuous target like a sum will almost never match exactly. For this purpose you should choose a simple Linear Regression model, not a Decision Tree classifier.
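A minimal sketch with LinearRegression instead, assuming the same Dataset.csv with input columns A and B and target C = A + B:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv('Dataset.csv')
X = data[['A', 'B']]
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
# score() returns R^2 here, which should be close to 1.0 for an exactly linear target.
print(model.score(X_test, y_test))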

Pre train a model (classifier) in scikit learn

I would like to pre-train a model and then train it further with another model.
I have a DecisionTreeClassifier and would like to continue training it with an LGBMClassifier. Is there a way to do this in scikit-learn?
I have already read this post about it: https://datascience.stackexchange.com/questions/28512/train-new-data-to-pre-trained-model. The post says:
As per the official documentation, calling fit() more than once will overwrite what was learned by any previous fit()
import lightgbm as lgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Train the Decision Tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
lgbm = lgb.LGBMClassifier()
lgbm = lgbm.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = lgbm.predict(X_test)
Perhaps you are looking for stacked classifiers.
In this approach, the predictions of earlier models are available as features for later models.
Look into StackingClassifier.
Adapted from the documentation:
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

estimators = [
    ('dtc_model', DecisionTreeClassifier()),
]
clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LGBMClassifier()
)
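The stacked model is then used like any other estimator; a hypothetical usage, assuming the X_train/X_test split from the question:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)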
Unfortunately this is not possible at present. According to the docs at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html?highlight=init_model, you can only continue training a model via init_model if that model came from LightGBM itself.
I did try this setup with:
import pickle
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier

# dtc
dtc_model = DecisionTreeClassifier()
dtc_model = dtc_model.fit(X_train, y_train)
# save
dtc_fn = 'dtc.pickle.db'
pickle.dump(dtc_model, open(dtc_fn, 'wb'))
# lgbm: try to warm-start from the pickled sklearn tree
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train_2, y_train_2, init_model=dtc_fn)
And I get:
LightGBMError: Unknown model format or submodel type in model file dtc.pickle.db
As @Ferdy explained in his post, there is no simple way to perform this operation, and that is understandable.
Scikit-learn's DecisionTreeClassifier takes only numerical features and cannot handle NaN values, whereas LGBMClassifier can handle those.
By looking at the decision function of scikit-learn you can see that all it can perform is splits based on feature <= threshold.
On the contrary, LGBM can perform the following splits:
feature is na
feature <= threshold
feature in categories
Splits in a decision tree are selected at each step so that they best split the set of items, minimizing node impurity (Gini) or entropy.
The risk of further training a DecisionTreeClassifier is that you cannot be sure the splits performed in the original tree are still the best, since LGBM's additional split capabilities might well lead to better performance.
I would recommend retraining the model with LGBMClassifier only, as its splits will likely differ from those of the original scikit-learn tree.
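A minimal sketch of that recommendation, assuming the X_train/y_train from the question:
from lightgbm import LGBMClassifier

# Train from scratch with LGBM only, rather than warm-starting from the sklearn tree.
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train, y_train)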

How to make prediction on the new data in Pandas DataFrame with some extra columns?

I have used a random forest classifier to build a model. The model works fine; I am able to output the score as well as probability values on train and test.
The challenge is:
I used 29 variables as features with 1 target.
When I score X_test, it works fine.
When I bring in a new dataset that has the 29 variables plus my unique ID / primary key, the model errors out, saying it expects 29 variables.
How do I retain my ID and get predictions for the new file?
What I tried so far -
data = pd.read_csv('learn2.csv')
y=data['Target'] # Labels
X=data[[
'xsixn', 'xssocixtesDegreeOnggy', 'xverxgeeeouseeeoggdIncome', 'BxceeeggorsDegreeOnggy', 'Bggxckorxfricxnxmericxn',
'Ceeiggdrenxteeome', 'Coggggege', 'Eggementxry', 'GrxduxteDegree', 'eeigeeSceeoogg', 'eeigeeSceeooggGrxduxte', 'eeouseeeoggdsEst',
'MedixneeouseeeoggdIncome', 'NoVeeeicgges', 'Oteeerxsixn', 'OteeersRxces', 'OwnerOccupiedPercent', 'PercentBggueCoggggxrWorkers',
'PercentWeeiteCoggggxr', 'PopuggxtionEst', 'PopuggxtionPereeouseeeoggd', 'RenterOccupiedPercent', 'RetiredOrDisxbggePersons',
'TotxggDxytimePopuggxtion', 'TotxggStudentPopuggxtion', 'Unempggoyed', 'VxcxnteeousingPercent', 'Weeite', 'WorkpggxceEstxbggiseements'
]]
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% training
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier
#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)
#Train the model using the training sets
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
Predicting on the new file:
data1=pd.read_csv('score.csv')
y_pred2=clf.predict(data1)
ValueError: Number of features of the model must match the input. Model n_features is 29 and input n_features is 30
You can exclude the 'ID' column while generating predictions on the new dataset by using pandas' difference function:
data1=pd.read_csv('score.csv')
For ease of further use I am storing the predictions in a new dataframe:
y_pred2 = pd.DataFrame(clf.predict(data1[data1.columns.difference(['ID'])]),columns = ['Predicted'], index = data1.index)
To map the predictions against the 'ID' use pd.concat:
pred = pd.concat([data1['ID'], y_pred2['Predicted']], axis = 1)

Format of train/test for random forest classifier with categorical variables

Updated: How do I set up my train/test dataframes for scikit-learn's RandomForestClassifier with multiple categories? How do I predict?
My training dataset has a categorical Outcome column with 4 classes, and I want to predict which of those four is most likely for my test data. Looking at other questions, I tried using pandas get_dummies to encode four new columns into the original df in place of the original Outcome column, but I wasn't sure how to indicate to the classifier that those four columns were the categories, so I used y = df_raw['Outcomes'].values.
I then split the training set 80/20 and called fit() with these X_train, X_valid and y_train, y_valid:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()
n_valid = 10000
n_trn = len(df_raw_dumtrain)-n_valid
raw_train, raw_valid = split_vals(df_raw_dumtrain, n_trn)
X_train, X_valid = split_vals(df_raw_dumtrain, n_trn)
y_train, y_valid = split_vals(y, n_trn)
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, y_train)
Y_prediction = random_forest.predict(X_train)
I then tried predicting on the test set:
test_pred = random_forest.predict(df_test)
But I get an error:
ValueError: Number of features of the model must match the input.
Model n_features is 27 and input n_features is 28
How should I be configuring my test set?
You have to remove the target variable from the test data and then give the remaining columns of the dataframe as the input to the prediction function. That will resolve the feature-count mismatch.
Try this!
random_forest.predict(df_test.drop('Outcomes',axis=1))
Note: you don't have to create dummy variables of the target variable when using random forest or any decision-tree-based model.
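A hypothetical sketch of that setup, with the column name Outcomes assumed from the question:
import pandas as pd

# Dummy-encode only the feature columns; keep the 4-class target as plain labels.
X = pd.get_dummies(df_raw.drop('Outcomes', axis=1))
y = df_raw['Outcomes'].values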

SKLearn Predicting using new Data

I've tried out Linear Regression using SKLearn. I have data along the lines of Calories Eaten | Weight:
150 | 150
300 | 190
350 | 200
Basically made up numbers but I've fit the dataset into the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test) ??
But how would I go about making only my 10 new data numbers of Calories Eaten and make it the Test Set I want the regressor to predict?
You are correct, you simply call the predict method of your model and pass in the new unseen data for prediction. Now it also depends on what you mean by new data. Are you referencing data that you do not know the outcome of (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
You know the respective weight values and you want to evaluate model:
Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then calculate the model's performance using one of several metrics. The most common is root mean squared error, and you simply pass y_test and y_pred as parameters. Here is a list of all the regression performance metrics supplied by sklearn.
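For instance, a hypothetical RMSE calculation, assuming the y_test and y_pred variables above:
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(y_test, y_pred) ** 0.5  # square root of the mean squared error
print('RMSE:', rmse)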
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split
# random_state can be any number (to ensure the same split), and test_size indicates a 25% cut;
# note that train_test_split returns the sets in the order X_train, X_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size = 0.25, random_state = 42)
Train model by fitting x_train and y_train. Then evaluate model's training performance by predicting on x_test and comparing these predictions with the actual results from y_test. This way you would have an idea of how the model performs. Furthermore, you can then predict the weight values for the 10 new data points accordingly.
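When the 10 new calorie counts arrive, the prediction step might look like this hypothetical sketch (made-up numbers):
import numpy as np

# predict() expects a 2-D array: one row per sample, one column for the single feature.
new_calories = np.array([[150], [275], [480]])
predicted_weights = regressor.predict(new_calories)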
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
You have to split the data using model_selection in sklearn, then train and fit the regressor on the training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
Yes, Calories Eaten represents the independent variable while Weight represents the dependent variable.
After you split the data into training set and test set the next step is to fit the regressor using X_train and y_train data.
After the model is trained you can predict the results for X_test, which gives us y_pred.
Now you can compare y_pred (predicted data) with y_test which is real data.
You can also use score method for your created linear model in order to get the performance of your model.
score is calculated using the R^2 (R squared) metric, or coefficient of determination.
score = regressor.score(x_test, y_test)
For splitting the data you can use train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size = 0.2, random_state = 0)
