Say I have created a random forest regression model using the train/test data available to me.
The preprocessing includes feature scaling and categorical data encoding.
Now if I get a new dataset on a new day, and I need to use this model to predict the outcome of the new dataset and compare the predictions with the labels the new dataset already has, do I need to apply feature scaling and categorical data encoding to this dataset as well?
For example: on day 1 I have 10K rows with 6 features and 1 label -- a regression problem.
I built a model using this.
On day 2, I get 2K rows with the same features and label, but of course the data within them will be different.
Now I first want to predict, using this model and the day 2 features, what the label should be according to my model.
Secondly, I want to compare the model's output against the original day 2 labels that I have.
So in order to do this, when I pass the day 2 features as the test set to the model, do I need to first apply feature scaling and categorical data encoding to them?
This is essentially about making predictions and validating them against the received data, in order to assess the quality of that data.
You always need to pass data to the model in the format it expects. If the model has been trained on scaled, encoded, ... data, you need to perform all of these transformations every time you push new data into the trained model (for whatever reason).
The easiest solution is to use sklearn's Pipeline to bundle all of those transformations with the model, and then use the pipeline, instead of the model itself, to make predictions for new entries, so that the transformations are applied automatically.
Example - automatically applying StandardScaler's scaling before passing data into the model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# then
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
pipe.predict(X_new)
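Here pipe.fit learns the scaling parameters from the training data and then fits the SVC on the scaled output; pipe.predict and pipe.score automatically re-apply that stored scaling to whatever new data you pass in (X_train, X_test, and X_new above are placeholder names), so day-2 data goes through exactly the same transformation as day-1 data.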
The same holds for the dependent variable. If you scaled it before you trained your model, you will need to scale the new labels as well, or apply the inverse operation to the model's output before you compare it with the original dependent variable values.
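For instance, a minimal sketch of the inverse-transform route (model, y_train, and X_day2_transformed are placeholder names, not from the question):

from sklearn.preprocessing import StandardScaler

# Fit the target scaler on the day-1 labels only
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# ... train the model on y_train_scaled ...

# Undo the scaling on the model's output before comparing with the raw day-2 labels
y_pred_scaled = model.predict(X_day2_transformed)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()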
I am training a CNN in Python and I have a question. I know that data normalization is important to scale the data in my dataframe (say, between 0 and 1), but suppose I perform z-score normalization on my dataframe VERTICALLY (that is, within the scope of each feature). After I deploy the model and want to use it in real-world scenarios, I only have one row of data in my dataframe (but with the same number of features), so I can no longer perform the normalization: with a single value per feature the standard deviation is 0, and division by 0 in the z-score is not applicable.
I want to confirm: do I still need to perform data normalization in real-world scenarios? If I do not, will the results differ because I normalized during model training?
If you are using a StandardScaler from scikit-learn, you need to save the scaler object (fitted on the training data) and use it to transform new data after deployment.
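A minimal sketch of that workflow (scaler.joblib, X_train, and x_new_row are placeholder names):

import joblib
from sklearn.preprocessing import StandardScaler

# At training time: fit on the full training set and persist the scaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
joblib.dump(scaler, 'scaler.joblib')

# At inference time: reload it and transform, even for a single row;
# transform() uses the stored training mean/std, so no std of 0 occurs
scaler = joblib.load('scaler.joblib')
x_new_scaled = scaler.transform(x_new_row.reshape(1, -1))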
I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels with the same dimensionality as the training data. But the raw test data has different dimensions, so I do it like this:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)  # learn the vocabulary from the train data
testdata_bow = vectorizer.transform(testdata)        # reuse that vocabulary
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
The test data has its own set of dimensions/"characteristics".
So by transforming it using the vectorizer, aren't we losing some of the test data's characteristics?
I am asking because my model's accuracy ends up in the 99.9% range and I worry something is being calculated incorrectly (though my dataset is quite easy).
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I instead run:
testdata_bow = vectorizer.fit_transform(testdata)
I get:
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
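For what it's worth, a toy sketch of what transform does with tokens that were not in the training vocabulary (the corpus here is made up):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(["red blue green"])   # vocabulary: blue, green, red

# 'yellow' is not in the fitted vocabulary, so it is silently dropped;
# only counts of known tokens survive the transform
row = vectorizer.transform(["red yellow yellow"])
print(row.toarray())   # [[0 0 1]]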
I have this pre-trained saved model, where I specified my categorical features, and it works fine if I predict right after training. Now I want to use it again in another context, but I don't know how to properly specify the categorical features. I tried this:
from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=var_categ)
model.load_model('catmod.cat')
but when I try to predict:
model.predict(base)
I get this error:
CatBoostError: features data: pandas.DataFrame column 'cod_var1' has dtype 'category' but is not in cat_features list
Yes, I double checked the column is in var_categ.
First of all, you don't need to pass cat_features to the CatBoostClassifier constructor, because the model already gets this information from load_model.
I would guess from your error that when you call predict on the new data set, your features are shifted by one position, which produces the error.
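In other words, something like this should be enough (a minimal sketch reusing the names from the question):

from catboost import CatBoostClassifier

# load_model restores the categorical-feature information,
# so the constructor needs no cat_features argument
model = CatBoostClassifier()
model.load_model('catmod.cat')

# base must have the same columns, in the same order, as at training time
preds = model.predict(base)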
Without seeing the code used to process the data both for training and for predicting, there isn't quite enough to go on. The error means that when the model was trained, 'cod_var1' was not in the categorical features list. It may be in var_categ now, but the model is telling you it was not in the categorical features list used to train it.
In your dataset base, cod_var1 has the pandas 'category' dtype. Since pandas does not set that dtype automatically on DataFrame creation, you presumably have some code between data loading and predicting that sets it. I'd hypothesize that something changed in those processing steps between training time and now, so that the prediction input is no longer exactly the same (same columns in the same order with the same dtypes).
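A quick way to check for such a drift (a hedged sketch, reusing model, base, and var_categ from above):

# Compare the feature names/order the model saw at training time
# with what the new frame provides, and inspect the declared categoricals
print(model.feature_names_)       # training-time column order
print(list(base.columns))         # column order in the new data
print(base.dtypes[var_categ])     # dtypes of the declared categorical columns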
I have a pre-trained XGBoost model read from a pickle file. When I was trying to make predictions on a new dataset with some columns outside of the feature set of the model, I received the error message:
training data did not have the following fields: column1, column2,...
I am okay with excluding these columns that don't exist in the training data. Instead of hard-coding the column names (there are many), I would like to just find the intersection between the columns of the training and the prediction datasets.
Is there a way I can extract the feature names from the trained model (apparently the model recorded the field names) without having to go back to my training dataset?
You can retrieve feature names from a pickled model as follows:
fitted_model.get_booster().feature_names
It's mandatory that the prediction dataset contain only those columns that are present in the training dataset. It even makes sense not to include extra columns, because the weights were learnt from your training dataset: including any extra column provides no value and cannot improve your accuracy, since predicting just multiplies the learnt weights of the model with the new values. Make sure not to include any extra features when predicting.
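Putting the two together, a minimal sketch of the column-intersection approach (fitted_model and new_data are placeholder names):

# Feature names recorded by the booster at training time
feature_names = fitted_model.get_booster().feature_names

# Select (and order) only those columns before predicting
preds = fitted_model.predict(new_data[feature_names])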
Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution that works, but it is not very elegant. This is an old post with no existing solutions, so I suppose there aren't any.
Create and fit your model. For example:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can attach a 'feature_names' attribute, since you know the names at training time:
model.feature_names = list(X_train.columns.values)
I typically then dump the model into a binary file to pass it around, but you can skip this:
import joblib

joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
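Note that recent scikit-learn releases (1.0 and later) do this for you: estimators fitted on a pandas DataFrame expose the column names as the feature_names_in_ attribute.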
Based on the documentation and previous experience, there is no way to get a list of the features that were considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant, then training a new model with only the relevant features and using those same features for prediction as well.
You don't need to know which features were selected during training. Just make sure to give the fitted classifier, during the prediction step, the same features you used during the learning phase.
The Random Forest classifier will only use the features on which it makes its splits; those will be the same ones it learnt during the first phase, and the others won't be considered.
If the shape of your test data is not the same as that of the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since random forests make a random selection of features for each of your decision trees (called estimators in sklearn), all of the features are likely to be used at least once.
However, if you want to know the features used, you can just inspect the n_features_ and feature_importances_ attributes of your classifier once it is fitted.
You can look here to see how you can retrieve the names of the most important features you used.
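For example, a small sketch pairing importances with column names (clf and X_train are placeholder names):

import pandas as pd

# Map each importance back to its DataFrame column and sort descending
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))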
You can extract feature names from a trained XGBoost model as follows:
model.get_booster().feature_names