Is there a way to retrieve, from a fitted xgboost object, the hyper-parameters used to train the model? More specifically, I would like to know the number of estimators (i.e. trees) used in the model. Since I am using early stopping, the n_estimators parameter would not give me the resulting number of estimators in the model.
If you are trying to get the parameters of your model:
print(model.get_xgb_params())
model.get_params(deep=True) should show n_estimators. Note that this is the value you requested, not the number of trees kept after early stopping.
Then use model.get_xgb_params() for XGBoost-specific parameters.
I am trying to improve my classification model. Using logistic regression from statsmodels, I noticed that some features that did not pass the t-test, and have little influence in that model, become very important when I change the model. For example, I looked at the feature_importances_ of a RandomForestClassifier, and its most important feature had no influence in the logistic regression.
With this in mind, I thought of fitting the logistic regression without this feature and using predict_proba to get the probabilities, then building another model with RandomForest that uses all the features plus the logistic regression probabilities. Alternatively, I could take the probabilities from many models and use them as features of another model. Does any of this make sense? I don't know whether I am introducing bias by doing this, or why.
I found that what I was doing was stacking, but instead of using another model's response as a feature, I was using the probability of being 1 (predict_proba).
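As a rough sketch of that stacking idea, assuming scikit-learn 0.22 or later: StackingClassifier can feed the predict_proba output of a LogisticRegression into a RandomForestClassifier together with the original features (passthrough=True). It builds the probability feature from cross-validated (out-of-fold) predictions, which is the usual guard against the leakage asked about. The synthetic data and the choice of models are illustrative assumptions, and unlike the plan described in the question, the logistic regression here sees all the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    # Base model whose probabilities become an extra feature.
    estimators=[("logreg", LogisticRegression(max_iter=1000))],
    # Second-level model that consumes the probabilities.
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    stack_method="predict_proba",
    passthrough=True,  # also give the final model the original features
    cv=5,              # base-model features come from out-of-fold predictions
)
stack.fit(X_train, y_train)

# Features seen by the final estimator: the probability column(s) plus X itself.
meta_features = stack.transform(X_test)
accuracy = stack.score(X_test, y_test)
print(meta_features.shape, accuracy)
```

If the probabilities were instead computed on the same rows the logistic regression was trained on, the second model would see optimistic inputs during training, which is one way bias can creep in.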
I built a random forest classifier model.
It has about 100 X variables (features) to classify Y (angle).
I want to give weights to some important features.
How can I do this?
Note that the class_weight parameter of sklearn.ensemble.RandomForestClassifier weights the classes of Y, not the X features. It takes the form
class_weight = {class_label_0: class_weight_0,
                class_label_1: class_weight_1}
so that samples of the more heavily weighted classes count more during training. RandomForestClassifier has no parameter for weighting individual features.
You can find more detail in the class_weight entry of the parameters section of the scikit-learn documentation.
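To make the shape of that dictionary concrete, here is a small sketch on made-up, imbalanced data. The keys are class labels of Y (0 and 1 here), not feature names, and the weights 1.0 and 5.0 are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: roughly 90% of samples in class 0.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Keys are class labels; class 1 samples count five times as much as class 0.
clf = RandomForestClassifier(
    n_estimators=50,
    class_weight={0: 1.0, 1: 5.0},
    random_state=0,
)
clf.fit(X, y)

print(clf.classes_)        # the class labels the weights refer to
print(clf.n_features_in_)  # number of X columns; none of them is weighted
```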
Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution which works but is not very elegant. This is an old post with no existing solutions, so I suppose there are none.
Create and fit your model. For example:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add a feature_names attribute, since you know the names at training time:
model.feature_names = list(X_train.columns.values)
I typically then save the model to a binary file to pass it around, but you can skip this step:
import joblib
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model and use them when you predict:
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
Based on the documentation and previous experience, there is no direct way to get a list of the features used in at least one split.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.
You don't need to know which features were selected during training. Just make sure that, at prediction time, you give the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the number of columns in your test data is not the same as in the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since random forests consider a random subset of features at each split of each decision tree (the trees are called estimators in sklearn), all the features are likely to be used at least once.
However, if you want to know the features used, you can read the attributes n_features_in_ (n_features_ in older scikit-learn versions) and feature_importances_ on your classifier once it is fitted.
You can look here to see how you can retrieve the names of the most important features you used.
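As an illustration, the fitted trees can also be inspected directly. This is only a sketch on synthetic data with invented feature names (f0, f1, ...). It relies on each tree's tree_.feature array, where negative entries mark leaf nodes, to collect the features used in at least one split, and it ranks features by feature_importances_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are hypothetical.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
names = [f"f{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# tree_.feature holds the feature index used at each node (negative for leaves).
split_features = np.concatenate([est.tree_.feature for est in clf.estimators_])
used_indices = np.unique(split_features[split_features >= 0])
used_names = [names[i] for i in used_indices]

# Rank all features by importance, highest first.
ranked = sorted(
    zip(names, clf.feature_importances_), key=lambda pair: pair[1], reverse=True
)

print("features used in at least one split:", used_names)
print("most important feature:", ranked[0][0])
```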
You can extract feature names from a trained XGBoost model as follows:
model.get_booster().feature_names
I am building a Random Forest model using a grid search with the H2O Python API. I split the data into training and validation sets and use k-fold cross-validation to select the best model in the grid search.
I am able to retrieve the model with the best MSE on the training set but I want to retrieve the model with the highest AUC on the validation set.
I could code everything in Python but I was wondering whether there is an H2O approach to solve this. Any suggestions on how I could do this?
If g is your grid object, then:
g.sort_by('auc', False)
will give you the models ordered by AUC. The second parameter, False, means the highest AUC comes first. It returns an H2OTwoDimTable object, so you can select the first model (the best model by AUC) from it.
I believe it should be sorting based on scores on the validation set, not training set. However you can specify it explicitly with:
g.sort_by('auc(valid=True)', False)
GridSearchCV implements a fit method that performs n-fold cross-validation to determine the best parameters. After this, we can directly apply the best estimator to the test data using predict(), following this example: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
It says there: "The model is trained on the full development set".
However, we have only applied n-fold cross-validation here. Is the classifier somehow also trained on the entire data? Or does predict just use the best estimator, with the best parameters, among those trained on the n folds?
If you want to use predict, you'll need to set 'refit' to True. From the documentation:
refit : boolean
Refit the best estimator with the entire dataset.
If “False”, it is impossible to make predictions using
this GridSearchCV instance after fitting.
refit is True by default, so in the example predict uses the best estimator refit on the whole development set.
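A quick way to check this yourself is sketched below. The digits data, the SVC model, and the tiny parameter grid are illustrative choices. With the default refit=True, the best estimator's shape_fit_ (an SVC attribute holding the shape of the data it was trained on) should match the full development set rather than a single fold. With refit=False there is no best_estimator_ at all.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Small illustrative grid; the values are arbitrary.
param_grid = {"C": [1, 10], "gamma": [0.001, 0.0001]}

# Default refit=True: the best parameters are refit on all of X_dev.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_dev, y_dev)

best = search.best_estimator_
n_rows_refit = best.shape_fit_[0]  # rows seen by the refit model
predictions = search.predict(X_test)

# refit=False: cross-validation results only, no refit estimator.
search_no_refit = GridSearchCV(SVC(), param_grid, cv=3, refit=False)
search_no_refit.fit(X_dev, y_dev)
has_best_estimator = hasattr(search_no_refit, "best_estimator_")

print(n_rows_refit, len(X_dev), has_best_estimator)
```

Even without a refit, best_params_ is still available, so you could fit a fresh estimator yourself with those parameters.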