F1/F0.5 score as eval_metric in XGBClassifier - python

I'm performing a classification task using XGBClassifier, and I want to reuse sklearn's functionality as much as possible. In particular, I'm interested in defining a custom scorer with the fbeta_score function, using beta=0.5 (the F0.5 score).
When I run the following code:
import xgboost as xgb
from sklearn.metrics import f1_score, fbeta_score, make_scorer

clf = xgb.XGBClassifier(max_depth=5,
                        learning_rate=0.25,
                        objective='binary:logistic',
                        use_label_encoder=False,
                        eval_metric=make_scorer(fbeta_score(beta=0.5)),
                        )
I get the following error:
TypeError: fbeta_score() missing 2 required positional arguments: 'y_true' and 'y_pred'
Also, following this part of the XGBoost documentation, I simplified the case to use a predefined, ready-made metric, eval_metric=f1_score, but XGBClassifier falls back to the log-loss metric.
How can I implement my customised metric in the appropriate way?

If you check the documentation, you cannot pass a metric you created yourself to eval_metric; only the metrics listed in the documentation are accepted.
But if you want to optimize for a custom metric, I think you can specify one in GridSearchCV via the scoring parameter.
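For example, a minimal sketch of that GridSearchCV route, assuming a toy binary dataset; the make_classification data and the parameter grid are illustrative, not from the question:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Wrap fbeta_score so GridSearchCV optimizes the F0.5 score.
f05_scorer = make_scorer(fbeta_score, beta=0.5)

grid = GridSearchCV(
    xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False),
    param_grid={'max_depth': [3, 5], 'learning_rate': [0.1, 0.25]},
    scoring=f05_scorer,
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)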

Related

xgboost and gridsearchcv in python

I have a question about this tutorial.
The author is doing hyperparameter tuning. The first code block shows different values of the hyperparameters.
Then he initializes GridSearchCV with cv=3 and scoring='roc_auc'.
Then he fits GridSearchCV using eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How are they used along with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning with GridSearchCV? Please suggest one or provide a link.
GridSearchCV performs cross-validation for hyperparameter tuning using only the training data. Since refit=True by default, the best estimator is then refit on the full training data and validated on the eval set provided (a true test score).
You can use any metric for cross-validation and testing. However, it would be odd to use one metric for cv hyperparameter optimization and a different one for the testing phase, so the same metric is used. If you are wondering about the slightly different metric naming, I think it's just because xgboost is a sklearn-interface-compliant package but is not developed by the same people as sklearn. Both should do the same thing (compute the area under the receiver operating characteristic curve for the predictions). Take a look at the sklearn docs: auc and roc_auc.
I don't think there is a better way.
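A minimal sketch of that workflow, assuming a held-out split named X_train/X_eval; the toy data and parameter grid are illustrative, not from the tutorial:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)

# cv splits are built only from the training data; scoring='roc_auc' drives the search.
grid = GridSearchCV(
    xgb.XGBClassifier(objective='binary:logistic'),
    param_grid={'max_depth': [3, 5], 'learning_rate': [0.1, 0.3]},
    scoring='roc_auc',
    cv=3,
)
grid.fit(X_train, y_train)

# refit=True (the default) refits the best estimator on all of X_train;
# the eval set then gives a true held-out AUC.
eval_auc = roc_auc_score(y_eval, grid.best_estimator_.predict_proba(X_eval)[:, 1])
print(grid.best_params_, eval_auc)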

Sklearn - fit, scale and transform

The fit() method in sklearn appears to serve different purposes within the same interface.
When applied to the training set, like so:
model.fit(X_train, y_train)
fit() is used to learn parameters that will later be used on the test set with predict(X_test).
However, there are cases when there is no 'learning' involved with fit(), but only some normalization to transform the data, like so:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
which will simply scale feature values between, say, 0 and 1, to avoid some features with higher variance to have a disproportional influence on the model.
To make things even less intuitive, sometimes the fit() call that scales (and already appears to be transforming) needs to be followed by a further transform() call before fit() is called again to actually learn and build the model, like so:
from sklearn.neighbors import KNeighborsClassifier

X_train2 = min_max_scaler.transform(X_train)
X_test2 = min_max_scaler.transform(X_test)
# the model being used
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# learn parameters
knn.fit(X_train2, y_train)
# predict
y_pred = knn.predict(X_test2)
Could someone please clarify the use, or multiple uses, of fit(), as well as the difference of scaling and transforming the data?
The fit() function provides a common interface that is shared among all scikit-learn objects.
This function takes X (and sometimes a y array) as arguments and computes the object's statistics. For example, calling fit on a MinMaxScaler transformer will compute its statistics (data_min_, data_max_, data_range_, ...).
Therefore we should see the fit() function as a method that computes the necessary statistics of an object.
This common interface is really helpful, as it allows transformers and estimators to be combined using a Pipeline. This makes it possible to fit and predict through all the steps in one go, as follows:
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000)

# The pipeline fits the scaler, transforms X, then fits the final estimator, all in one call.
model = make_pipeline(MinMaxScaler(), NearestNeighbors())
model.fit(X, y)
This also offers the possibility of serializing the whole model into a single object.
Without this composition mechanism, I agree with you that it would not be very practical to work with independent transformers and estimators.
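For instance, a quick sketch of serializing the whole fitted pipeline with joblib, assuming the model object from the snippet above:
import joblib

# Persist the fitted pipeline (scaler + estimator) as one object ...
joblib.dump(model, 'pipeline.joblib')

# ... and load it back later, ready to use.
restored = joblib.load('pipeline.joblib')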
In scikit-learn there are three kinds of objects that share this interface: estimators, transformers and predictors.
Estimators have a fit() function, which always serves the same purpose: it estimates parameters based on the dataset.
Transformers have a transform() function, which returns the transformed dataset. Some estimators are also transformers, e.g. MinMaxScaler().
Predictors have a predict() function, which returns predictions on new instances, e.g. KNeighborsClassifier().
Both MinMaxScaler() and KNeighborsClassifier() have a fit() method, because they share the estimator interface.
However, there are cases when there is no 'learning' involved with fit()
There is 'learning' involved: the transformer MinMaxScaler() has to 'learn' the min and max values of each numerical feature.
When you call min_max_scaler.fit(X_train), your scaler estimates those values for each numerical column in your train set. min_max_scaler.transform(X_train) then scales your train set based on those estimates, and min_max_scaler.transform(X_test) scales the test set with the estimates learned from the train set. It is important to scale both the train and test sets with the same estimates.
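A small sketch of that fit/transform split with toy numbers (the arrays here are made up for illustration):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [9.0]])
X_test = np.array([[3.0], [11.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)               # learns data_min_=1.0, data_max_=9.0 from the train set
print(scaler.data_min_, scaler.data_max_)

print(scaler.transform(X_train))  # [[0.], [0.5], [1.]]
print(scaler.transform(X_test))   # [[0.25], [1.25]] -- scaled with the *train* statistics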
For further reading, you can check this: https://arxiv.org/abs/1309.0238

How to get the most important feature coefficients when I used a pipeline to preprocess, train and test a LinearSVC?

I am using a LinearSVC. I pre-processed the numeric and categorical data using a ColumnTransformer, then used a Pipeline. I used GridSearchCV to get the best parameters for the model, which I later put into the pipeline, as you can see.
I fit, tested and got the score as well, but I want to know the most important feature coefficients.
So far, I have tried clf.coef_, as the classifier step is named clf in the pipeline, but I get a message saying clf is not defined.
I also tried gridf.coef_ and pipefinal.steps[1].coef_, but nothing worked.
So any help in this regard will be highly appreciated. Thanks.
preprocessing = ColumnTransformer([('hot', OneHotEncoder(), categ),
                                   ('scale', StandardScaler(), num)], n_jobs=-1)
pipefinal = Pipeline([('pre', preprocessing), ('clf', LinearSVC(max_iter=100000, C=0.1))])
gridf = GridSearchCV(pipefinal, param_grid={}, cv=10)
gridf.fit(X_train, y_train)
gridf.score(X_val, y_val)
GridSearchCV will make the best estimator available through its best_estimator_ attribute after you have called the fit() method. Since your estimator is a Pipeline object, you have to further subscript it to access the classifier. Then, you can access its coef_ attribute. In your case, that would be:
gridf.best_estimator_['clf'].coef_
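As a follow-up, a hedged sketch of pairing those coefficients with the expanded feature names, assuming scikit-learn >= 1.0 (where ColumnTransformer provides get_feature_names_out) and a binary target, so coef_ has a single row:
import numpy as np

best_pipe = gridf.best_estimator_

# Feature names after one-hot encoding and scaling.
feature_names = best_pipe['pre'].get_feature_names_out()
coefs = best_pipe['clf'].coef_.ravel()

# Show the ten largest coefficients by absolute value.
top = np.argsort(np.abs(coefs))[::-1][:10]
for name, coef in zip(feature_names[top], coefs[top]):
    print(name, coef)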

cross_val_score behaves differently with different classifiers in sklearn

I'm having some difficulty with cross_val_score() in sklearn.
I have instantiated a KNeighborsClassifier with the following code:
clf = KNeighborsClassifier(n_neighbors=28)
I am then using cross validation to understand the accuracy of this classifier on my df of features (x) and target series (y) with the following:
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=5))
I was hoping to achieve a different result each time I run the script; however, there is no option to set random_state=None like there is with RandomForestClassifier(), for example. Is there a way to achieve a different result with each run, or am I going to have to manually shuffle my data randomly before running cross_val_score on my KNeighborsClassifier model?
There seems to be some misunderstanding on your part here; the random_state argument in Random Forest refers to the algorithm itself, and not to the cross-validation part. Such an argument is necessary there, since RF does indeed include some randomness in model building (a lot of it, in fact, as already implied by the very name of the algorithm); but knn, in contrast, is a deterministic algorithm, so in principle there is no need for it to use any random_state.
That said, your question is indeed valid; I have commented in the past on this annoying and inconvenient absence of a shuffling argument in cross_val_score. Digging into the documentation, we see that under the hood, the function uses either StratifiedKFold or KFold to build the folds:
cv : int, cross-validation generator or an iterable, optional
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
and both of these functions, as you can easily see from the linked documentation pages, use shuffle=False as the default value.
Anyway, the solution is simple, requiring only a single additional line of code; you just need to replace cv=5 with a previously defined StratifiedKFold object that has shuffle=True:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True)
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=skf))

Pyspark Linear Regression Gradient Descent CrossValidation

I am attempting to perform cross-validation on an SGD model in pyspark. I am working with LinearRegressionWithSGD from pyspark.mllib.regression, and with ParamGridBuilder and CrossValidator, both from the pyspark.ml.tuning library.
After following the documentation from the Spark website, I was hoping that running this would work:
lr = LinearRegressionWithSGD()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder() \
    .addGrid(lr.stepSize, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=10)
But LinearRegressionWithSGD() does not have the attribute stepSize (I tried others with no luck either).
I can set lr to LinearRegression, but then I am unable to use SGD for my model while cross-validating.
There is a kFold method in the Scala API, but I am not sure how to access it from pyspark.
You can use the step parameter of LinearRegressionWithSGD to define your step size, but that will not make your code work, because you are mixing incompatible libraries. Unfortunately, I do not know how to do cross-validation with the ml library while using SGD optimization (I would like to know myself), but you are mixing the pyspark.ml and pyspark.mllib libraries. Specifically, you cannot use LinearRegressionWithSGD with the pyspark.ml library; you have to use pyspark.ml.regression.LinearRegression.
The good news is that you can set the solver attribute of pyspark.ml.regression.LinearRegression to use 'gd'. Therefore, you can probably set the parameters of the 'gd' optimizer to run as SGD, but I am not sure where the solver documentation is or how to set the solver attributes (e.g. the batch size). The API shows the LinearRegression object calling Param(), but I am not sure if it is using the pyspark.mllib optimizer. If anyone knows how to set the solver attributes, that could answer your question by allowing you to use the Pipeline, ParamGridBuilder, and CrossValidator ml packages for model selection with LinearRegression using SGD optimization for parameter tuning.
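As a minimal sketch of the pyspark.ml route (without SGD), assuming a DataFrame named train_df with a 'features' vector column and a 'label' column; the column names and grid values are assumptions, not from the question:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LinearRegression(featuresCol='features', labelCol='label')
pipeline = Pipeline(stages=[lr])

# Grid over ml Params (regParam, elasticNetParam) instead of mllib's stepSize.
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0.0, 0.5])
             .build())

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol='label'),
                          numFolds=10)

cvModel = crossval.fit(train_df)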
Respectfully,
Shane
