I am using the sklearn wrapper for xgboost. I would like to generate a plot of AUC for both my train and test samples for each iteration as shown in the plot below.
In sklearn you can use warm_start to iterate one at a time so you can easily stop to evaluate performance. Is there a way to do the same thing using the xgboost sklearn wrapper?
I have a question about this tutorial.
The author is doing hyperparameter tuning. The first window shows different values of the hyperparameters.
Then he initializes GridSearchCV with cv=3 and scoring='roc_auc'.
Then he fits GridSearchCV, passing eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How do they interact with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning with GridSearchCV? Please suggest one or provide a link.
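GridSearchCV performs cross-validation for hyperparameter tuning using only the training data, and scoring='roc_auc' is the metric it uses to rank the candidates. The eval_set and eval_metric arguments are fit parameters: GridSearchCV forwards them to the estimator's fit, so xgboost reports AUC on the eval set while it trains. Since refit=True by default, the best parameters are then refit on all the training data, and the eval-set AUC from that final fit serves as a held-out test score. Keep in mind that fit parameters are forwarded on every cross-validation fit as well, not only on the refit.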
You can use any metric for cross-validation and for testing. However, it would be odd to use different metrics for the hyperparameter optimization and testing phases, so the same metric is used. If you are wondering about the slightly different metric names, I think it's just because xgboost is a sklearn-interface-compliant package that is not developed by the same people as sklearn. Both should compute the same thing (the area under the receiver operating characteristic curve of the predictions). Take a look at the sklearn docs: auc and roc_auc.
I don't think there is a better way.
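To make the split between the two scores concrete, here is a minimal sketch. It assumes synthetic data from make_classification and uses sklearn's LogisticRegression as a stand-in for the xgboost classifier, so the parameter grid and variable names are illustrative only. The cross-validated AUC comes from the training data alone, and the held-out AUC is computed separately on data the search never saw.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# LogisticRegression stands in for the xgboost estimator from the tutorial.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",  # metric used to rank parameter combinations
    cv=3,               # 3-fold cross-validation on the training data only
)
search.fit(X_train, y_train)

# Mean cross-validated AUC of the best parameter combination.
cv_auc = search.best_score_

# AUC of the refit best model on data that was held out from the search.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("best params:", search.best_params_)
print("cv AUC:", cv_auc, "held-out AUC:", test_auc)
```

With xgboost the held-out number would instead come from the eval_set log, but the role of each score should be the same.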
I'm trying to run a multinomial LogisticRegression in sklearn with a clustered dataset (that is, there is more than one observation per individual, where only some features change and others remain constant per individual).
I am aware that in statsmodels it is possible to account for this in the following way:
mnl = MNLogit(x,y).fit(cov_type="cluster", cov_kwds={"groups": cluster_groups})
Is there a way to replicate this with the sklearn package instead?
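To run a multinomial logistic regression in sklearn, you can use the LogisticRegression class and set its multi_class parameter to "multinomial". Note that this only covers fitting the model: cov_type="cluster" in statsmodels adjusts the standard errors, which sklearn's LogisticRegression does not report.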
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I am trying to compare the validation set performance of an ensemble classifier with the individual predictors that make up the ensemble.
I've been following the code for Exercise 8 from this notebook to build a hard VotingClassifier with a LinearSVC, RandomForestClassifier, ExtraTreesClassifier, and MLPClassifier for version 1 of the MNIST Digits dataset using sklearn's fetch_openml API.
I trained the ensemble and evaluated it by calling its score function with validation data, and got a score of 0.97. So I'm certain the ensemble and, by extension, the individual predictors have been trained/fit.
But when I try using a list comprehension to call score on the individual fitted estimators_ in this ensemble, like so:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]
I always get a result of 0.0 for each predictor, even if I evaluate on the training data.
I've confirmed the sub-estimators in estimators_ have been fit using the predict method as described in this StackOverflow post.
I have also trained the same estimators individually and evaluated them with the same method. This seems to work as scores are similar to the ones in the tutorial notebook.
Am I referencing the wrong list of sub-estimators in the ensemble object?
You can try adding
mnist.target = mnist.target.astype(np.uint8)
after loading the MNIST dataset.
It works for me.
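One likely reason the cast helps, assuming fetch_openml returned the MNIST labels as strings: VotingClassifier label-encodes the targets before fitting its sub-estimators, so each entry in estimators_ predicts integer class indices, and an integer never equals a string label. The sketch below reproduces that mismatch in plain Python with made-up labels and predictions; it does not call sklearn.

```python
# Hypothetical validation labels, stored as strings.
y_val = ["5", "0", "4", "1"]

# Hypothetical sub-estimator predictions, stored as integers.
predictions = [5, 0, 4, 1]


def accuracy(y_true, y_pred):
    """Fraction of positions where the label equals the prediction."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Strings never compare equal to integers, so every prediction counts as wrong.
score_with_strings = accuracy(y_val, predictions)

# After casting the labels to integers the same predictions all match.
score_with_ints = accuracy([int(t) for t in y_val], predictions)

print(score_with_strings, score_with_ints)
```

Casting mnist.target to np.uint8 right after loading puts the labels and the sub-estimator predictions in the same integer space.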
I'm trying to understand how Perceptron from sklearn.linear_model performs its fit() function (Documentation). The question comes from this piece of code:
from sklearn.linear_model import Perceptron
clf = Perceptron()
clf.fit(train_data, train_answers)
print('accuracy:', clf.score(train_data, train_answers))
accuracy: 0.7
I thought the goal of fitting is to create a classification function that gives answers with 100% accuracy on the test data, but in the example above it gives only 70%. I have tried one more data set, where the accuracy was 60%.
What do I misunderstand about the fitting process?
It depends on how your training data is distributed. In the graph shown below, could you find a straight line that separates blue from red? Obviously not, and this is the point. Your data must be linearly separable for the Perceptron Learning Algorithm to achieve 100% accuracy on the training data. Otherwise, no straight line can separate it perfectly.
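The same point can be checked on two tiny data sets. This is a minimal pure-Python sketch of the classic perceptron update rule, not sklearn's implementation; the function names and the AND/XOR toy data are chosen here for illustration. AND is linearly separable, XOR is not.

```python
def predict(w, b, x1, x2):
    """Return 1 if the point lies on the positive side of the line, else 0."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, labels, epochs=100):
    """Classic perceptron rule: adjust the weights only on misclassified points."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in zip(samples, labels):
            error = target - predict(w, b, x1, x2)
            if error != 0:
                w[0] += error * x1
                w[1] += error * x2
                b += error
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: the data is separated
            break
    return w, b


def accuracy(samples, labels, w, b):
    hits = sum(predict(w, b, x1, x2) == t for (x1, x2), t in zip(samples, labels))
    return hits / len(labels)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

and_acc = accuracy(points, and_labels, *train_perceptron(points, and_labels))
xor_acc = accuracy(points, xor_labels, *train_perceptron(points, xor_labels))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

On AND the loop stops after a few passes with every point classified correctly. On XOR it runs through all epochs and still misclassifies at least one point, because no single line puts both positive points on one side and both negative points on the other.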
Is there any way to visualize an SVM model from OpenCV using matplotlib in Python, like this one: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html ?
To visualize the decision boundary of an SVM, your data has to be two-dimensional. If this is the case, you can just use scikit-learn's code and substitute the call to .predict with the analogous predict from your own library (like OpenCV).
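A rough sketch of that substitution, assuming only numpy: the helper below (decision_grid, a name made up here) evaluates any predict callable on a regular grid over the 2-D data. The toy_predict function is a hypothetical stand-in for a trained model, and a real OpenCV model may need a thin wrapper around its own predict, for example to convert dtypes or unpack its return value.

```python
import numpy as np


def decision_grid(predict, X, step=0.02, margin=1.0):
    """Evaluate `predict` on a regular grid covering the 2-D points in X."""
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # One row per grid point, in the (n_samples, 2) layout predict expects.
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = np.asarray(predict(grid)).reshape(xx.shape)
    return xx, yy, Z


def toy_predict(points):
    """Hypothetical stand-in for a trained classifier: class 1 above the line x + y = 0."""
    return (points[:, 0] + points[:, 1] > 0).astype(int)


X = np.array([[-1.0, -1.0], [1.0, 1.0], [-0.5, 0.5], [0.5, -0.5]])
xx, yy, Z = decision_grid(toy_predict, X, step=0.5)
print(Z)
```

The returned arrays can then be handed to matplotlib, for instance plt.contourf(xx, yy, Z), with the training points drawn on top using plt.scatter.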