extract feature names from trained model - python

I have a pre-trained XGBoost model read from a pickle file. When I was trying to make predictions on a new dataset with some columns outside of the feature set of the model, I received the error message:
training data did not have the following fields: column1, column2,...
I am okay with excluding these columns, which don't exist in the training data. Instead of hard-coding the column names (there are many), I would like to just find the intersection between the columns of the training and the prediction datasets.
Is there a way I can extract the feature names from the trained model (apparently the model recorded the field names) without having to go back to my training dataset?

You can retrieve feature names from a pickled model as follows:
fitted_model.get_booster().feature_names

It's mandatory that the prediction dataset contain only the columns present in the training dataset. It also makes sense not to include extra columns, because the weights were learnt from your training data: including any column beyond the training set provides no value and doesn't improve your accuracy, since prediction simply multiplies the learnt weights by the new values. Make sure not to include any extra feature when predicting.
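For example, a minimal sketch of the intersection approach (assuming the pickled file is model.pkl and the new data is a pandas DataFrame called df_pred; both names are placeholders):
import pickle

with open('model.pkl', 'rb') as f:
    fitted_model = pickle.load(f)

# Feature names recorded by XGBoost at training time
train_features = fitted_model.get_booster().feature_names

# Keep only those columns, in the order the model expects
preds = fitted_model.predict(df_pred[train_features])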

Related

How to solve mismatch in train and test set after categorical encoding?

I have many categorical variables that exist in my test set but not in my train set. They are important, so I can't drop them. Should I combine the train and test sets, or is there another solution?
You have some options in this case: you can use a splitting technique other than holdout, such as k-fold cross-validation or leave-one-out.
When using holdout, it is necessary to stratify your data so that every subset (train/test/validation) contains all classes, and NEVER use your test or validation set to fit your model; if you do, the model will learn that data and you will probably overfit.
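For instance, a minimal sketch of a stratified holdout split with scikit-learn (X and y are placeholder names for your features and labels):
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions identical in both subsets,
# so no class ends up only in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)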
How did you end up in this situation? Normally you take a dataset and divide it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
But it is clear that, because they originate from the same original dataset, they both have the same categorical variables.

Prediction using a model in machine learning python

Say I have created a randomforest regression model using test/train data available with me.
This contains feature scaling and categorical data encoding.
Now, if I get a new dataset on a new day and need to use this model to predict the outcome of this new dataset and compare it with the actual outcome I have, do I need to apply feature scaling and categorical data encoding to this dataset as well?
For example: on day 1 I have 10K rows with 6 features and 1 label -- a regression problem.
I built a model using this.
On day 2, I get 2K rows with the same features and label, but of course the data within would be different.
Now I want to first predict, using this model and the day 2 data, what the label should be as per my model.
Secondly, using this result, I want to compare the model's outcome against the original day 2 labels that I have.
So in order to do this, when I pass the day 2 features as the test set to the model, do I need to first do feature scaling and categorical data encoding on them?
This has to do with making predictions and validating them against the received data, in order to assess the quality of that data.
You always need to pass data to the model in the format it expects. If the model was trained on scaled, encoded, ... data, you need to perform all of these transformations every time you push new data into the trained model (for whatever reason).
The easiest solution is to use sklearn's Pipeline to create a pipeline with all those transformations included, and then use it instead of the model itself to make predictions for new entries, so that all the transformations are applied automatically.
Example: automatically applying StandardScaler's scaling before passing data into the model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# then use the pipeline exactly as you would the model itself
# (X_train, y_train, X_test, y_test, X_new are your own data)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
pipe.predict(X_new)
The same holds for the dependent variable. If you scaled it before training your model, you will need to scale the new values as well, or apply the inverse operation to the model's output before comparing it with the original dependent-variable values.
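A minimal sketch of that inverse operation, assuming the target was scaled with a StandardScaler (y_train, X_day2 and y_day2 are placeholder NumPy arrays; X_day2 is assumed to be already transformed like the training features):
import numpy as np
from sklearn.preprocessing import StandardScaler

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))
# ... train the model on the scaled target ...

# Undo the target scaling before comparing with the raw day-2 labels
preds_scaled = model.predict(X_day2).reshape(-1, 1)
preds = y_scaler.inverse_transform(preds_scaled).ravel()
print(np.mean(np.abs(preds - y_day2)))  # e.g. mean absolute error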

How to retain Scikit-learn OneHotEncoding from model generation to use on new data?

I'm using OneHotEncoding to generate dummies for a classification problem. When used on the training data, I get ~300 dummy columns, which is fine. However, when I input new data (which is fewer rows), the OneHotEncoding only generates ~250 dummies, which isn't surprising considering the smaller dataset, but then I can't use the new data with the model because the features don't align.
Is there a way to retain the OneHotEncoding schema to use on new incoming data?
I think you are using fit_transform on both the training and test datasets, which is not the right approach: the encoding schema has to be consistent on both datasets for the model to understand the information from the features.
The correct way is to:
fit_transform on the training data
transform on the test data
This way, you will get a consistent number of columns.
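A minimal sketch of that pattern with scikit-learn's OneHotEncoder (X_train_cat and X_new_cat are placeholder names for the categorical columns; handle_unknown='ignore' additionally encodes categories never seen during training as all zeros instead of raising an error):
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')

# Learn the encoding schema from the training data only
X_train_enc = enc.fit_transform(X_train_cat)

# Reuse the same schema on new data: same columns, same order
X_new_enc = enc.transform(X_new_cat)
assert X_train_enc.shape[1] == X_new_enc.shape[1]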

Catboost: how do I pass cat_features to a saved model in Python?

I have this pre-trained saved model, where I specified my categorical features, and it works fine if I predict right after training. Now I want to use it again in another context, but I don't know how to properly pass the categorical features. I tried this:
model = CatBoostClassifier(cat_features=var_categ)
model.load_model('catmod.cat')
but when I try to predict:
modelo.predict(base)
I get this error:
CatBoostError: features data: pandas.DataFrame column 'cod_var1' has dtype 'category' but is not in cat_features list
Yes, I double checked the column is in var_categ.
First of all, you don't need to specify cat_features in the CatBoostClassifier, because the model already has this information from load_model.
I would guess from your error that when you use predict on the new dataset, your features are shifted by one position, which gives you the error.
Without seeing the code used to process data both for training and predicting, there's not quite enough to go on. The error means that when the model was trained, 'cod_var1' was not in the categorical features list. It may be in var_categ, but the model is indicating that it was not in the categorical features list used to train the model.
In your dataset base, cod_var1 has the pandas "category" dtype. Since pandas does not set this dtype automatically on DataFrame creation, you evidently have some code between data loading and predicting that sets it. I'd hypothesize that something changed in those data-processing steps between when you trained the model and now, so that the prediction input is no longer exactly the same (same columns in the same order with the same dtypes).
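A minimal sketch for checking what the loaded model actually expects (the file and variable names are taken from the question):
from catboost import CatBoostClassifier

model = CatBoostClassifier()      # no cat_features needed here
model.load_model('catmod.cat')    # categorical info is restored from the file

# Indices the model treated as categorical at training time
print(model.get_cat_feature_indices())

# Compare against the columns you believe are categorical
print([base.columns.get_loc(c) for c in var_categ])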

Retrieve list of training features names from classifier

Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution that works but is not very elegant. This is an old post with no existing solutions, so I suppose there aren't any.
Create and fit your model. For example:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add a 'feature_names' attribute, since you know the names at training time:
model.feature_names = list(X_train.columns.values)
I typically then dump the model to a binary file to pass it around, but you can skip this step:
import joblib
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest listing feature_importances_ after fitting and eliminating the features that do not seem relevant, then training a new model with only the relevant features and using those same features for prediction, as sketched below.
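A minimal sketch of that pruning step (the 0.01 cutoff is an arbitrary assumption; X_train, X_pred and y_train are placeholder names):
# Importances are aligned with the training DataFrame's columns
importances = model.feature_importances_
keep = X_train.columns[importances > 0.01]  # tune the cutoff as needed

# Retrain on the relevant features only, and predict with the same subset
model.fit(X_train[keep], y_train)
preds = model.predict(X_pred[keep])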
You don't need to know which features were selected during training. Just make sure to give the fitted classifier, during the prediction step, the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as the ones learnt during the first phase; others won't be considered.
If the shape of your test data is not the same as the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since Random Forests make a random selection of features for each decision tree (called estimators in sklearn), all the features are likely to be used at least once.
However, if you want to know which features were used, you can inspect the attributes n_features_ and feature_importances_ on your classifier once it is fitted. Pairing the importances with your column names shows which features mattered most, as in the sketch below.
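A short sketch of that pairing (clf and X_train are placeholder names for a fitted classifier and its training DataFrame):
import numpy as np

# Sort feature indices from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for i in order[:10]:  # top 10, adjust as you like
    print(X_train.columns[i], clf.feature_importances_[i])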
You can extract feature names from a trained XGBoost model as follows:
model.get_booster().feature_names
