I am dealing with multi-label classification with OneVsRestClassifier and SVC:
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

L = 3
X, y = make_multilabel_classification(n_classes=L, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=1, return_indicator=True)
model_to_set = OneVsRestClassifier(SVC())
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}
# plain 'f1' is binary-only in current releases, so name the averaging explicitly
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1_weighted')
model_tunning.fit(X, y)
print(model_tunning.best_score_)
print(model_tunning.best_params_)
#0.855175822314
#{'estimator__kernel': 'poly', 'estimator__C': 1, 'estimator__degree': 3}
1st question
What does the number 0.85 represent? Is it the best score among the L classifiers, or an average over them? Similarly, does the set of parameters belong to the best scorer among the L classifiers?
2nd question
If I am right that OneVsRestClassifier builds one classifier per label (L in total), one should be able to inspect the performance of EACH LABEL. How, in the above example, can I obtain the L scores from the GridSearchCV object?
EDIT
To simplify the problem and help myself understand more about OneVsRestClassifier, before tuning model,
model_to_set.fit(X,y)
gp = model_to_set.predict(X) # the "global" prediction
fp = model_to_set.estimators_[0].predict(X) # the first-class prediction
sp = model_to_set.estimators_[1].predict(X) # the second-class prediction
tp = model_to_set.estimators_[2].predict(X) # the third-class prediction
It can be shown that gp.T[0]==fp, gp.T[1]==sp and gp.T[2]==tp. So the "global" prediction is simply the L individual predictions placed side by side, and the 2nd question is solved.
But it still confuses me: if one meta-classifier OneVsRestClassifier contains L classifiers, how can GridSearchCV return only ONE best score, corresponding to one of the 4*2*4 sets of parameters?
Any comments would be much appreciated.
GridSearchCV builds a grid from your parameter values and evaluates your OneVsRestClassifier as an atomic classifier (i.e. GridSearchCV doesn't know what is inside this meta-classifier).
First: 0.85 is the best score of the OneVsRestClassifier among all possible combinations of the parameters ("estimator__C", "estimator__kernel", "estimator__degree"); in your case there are 4*2*4 = 32 of them. That is, GridSearchCV evaluates 32 (again, only in this particular case) candidate OneVsRestClassifiers, each of which contains L SVCs. All L classifiers inside one OneVsRestClassifier share the same parameter values (but each of them learns to recognize its own class out of the L possible).
I.e. from the set of
{OneVsRestClassifier(SVC(C=1, kernel="poly", degree=1)),
OneVsRestClassifier(SVC(C=1, kernel="poly", degree=2)),
...,
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=3)),
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=4))}
it chooses the one with the best score.
model_tunning.best_params_ here holds the parameters with which OneVsRestClassifier(SVC()) achieves model_tunning.best_score_.
You can get that best OneVsRestClassifier from the model_tunning.best_estimator_ attribute.
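To make the size of the search concrete, here is a minimal sketch that enumerates the grid from the question with itertools.product. The variable names `names` and `candidates` are only for illustration; scikit-learn's ParameterGrid produces the same set of combinations.

```python
from itertools import product

# Same grid as in the question
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}

# Cartesian product of all value lists: one dict per candidate setting
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

print(len(candidates))  # 4 * 2 * 4 = 32
print(candidates[0])    # the first candidate setting
```

Each of these candidates is applied to the whole OneVsRestClassifier, so every one of its L inner SVCs gets the same C, kernel and degree.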
Second: There is no ready-to-use code to obtain separate scores for the L classifiers from OneVsRestClassifier, but you can look at the implementation of the OneVsRestClassifier.fit method, or take this (should work :) ):
# Here X, y - your dataset
one_vs_rest = model_tunning.best_estimator_
# One row per label after the transpose
yT = one_vs_rest.label_binarizer_.transform(y).toarray().T
# Iterate through all L classifiers
for classifier, is_ith_class in zip(one_vs_rest.estimators_, yT):
    print(classifier.score(X, is_ith_class))
Inspired by @Olologin's answer, I realized that 0.85 is the best weighted average of the F1 scores (in this example) obtained from the L predictions. In the following code, I evaluate the model on the training data itself, using the macro average of the F1 score:
import numpy as np
from sklearn.metrics import f1_score

# Case A, inspect F1 score using the meta-classifier
F_A = f1_score(y, model_tunning.best_estimator_.predict(X), average='macro')
# Case B, inspect F1 scores of each label (binary task) and collect them by macro average
F_B = []
for label, clf in zip(y.T, model_tunning.best_estimator_.estimators_):
    F_B.append(f1_score(label, clf.predict(X)))
F_B = np.mean(F_B)
np.isclose(F_A, F_B) # True
So this implies that GridSearchCV applies one of the 4*2*4 sets of parameters to build the meta-classifier, which in turn makes a prediction for each label with one of the L classifiers. The outcome is L F1 scores for the L labels, each of which measures performance on a binary task. Finally, a single score is obtained by averaging the L F1 scores (macro or weighted average, as specified by the average parameter of f1_score).
GridSearchCV then chooses the best averaged F1 score among the 4*2*4 sets of parameters, which is 0.85 in this example.
Though the wrapper is convenient for multi-label problems, it can only maximize the averaged F1 score with one shared set of parameters used to build all L classifiers. If one wants to optimize the performance of each label separately, one seems to have to build L classifiers without using the wrapper.
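A minimal sketch of that last idea, tuning one SVC per label instead of using the wrapper, could look like the following. The reduced grid, `cv=3` and the `per_label` list are my own choices for illustration, not part of the original setup.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_multilabel_classification(n_classes=3, n_labels=2,
                                      allow_unlabeled=True, random_state=1)

# Deliberately small grid to keep the sketch fast (an assumption, not the original grid)
param_grid = {"C": [1, 8], "kernel": ["poly", "rbf"]}

per_label = []
for j in range(y.shape[1]):
    # Column j is an ordinary binary target, so plain 'f1' scoring applies
    search = GridSearchCV(SVC(), param_grid=param_grid, scoring="f1", cv=3)
    search.fit(X, y[:, j])
    per_label.append(search)
    print(j, search.best_params_, round(search.best_score_, 3))
```

Each label now gets its own best_params_ and best_score_, at the cost of running L separate searches and of having to combine the L fitted models by hand at prediction time.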
As for your second question, you might want to use GridSearchCV with scikit-multilearn's BinaryRelevance classifier. Like OneVsRestClassifier, Binary Relevance creates L single-label classifiers, one per label. For each label, the training target is 1 if the label is present and 0 if it is not. The best classifier set selected is the BinaryRelevance instance in the best_estimator_ property of GridSearchCV. To predict probabilities as floats, use the predict_proba method of the BinaryRelevance object. An example can be found in the scikit-multilearn docs for model selection.
In your case I would run the following code:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from skmultilearn.problem_transform import BinaryRelevance

model_to_set = BinaryRelevance(SVC())
parameters = {
    # BinaryRelevance exposes its base estimator as `classifier`
    "classifier__C": [1, 2, 4, 8],
    "classifier__kernel": ["poly", "rbf"],
    "classifier__degree": [1, 2, 3, 4],
}
# multilabel targets need an averaged F1 scorer
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1_weighted')
model_tunning.fit(X, y)
# for some X_test testing set
predictions = model_tunning.best_estimator_.predict(X_test)
# average=None gives per label score
f1_score(y_test, predictions, average=None)
Please note that there are much better methods for multi-label classification than Binary Relevance :) You can find them in madjarov's comparison or my recent paper.
I have used permutation_importance to find which features are the most important:
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

columns = ['progresion', 'tipo']
X = df_cat.drop(columns, axis=1)
y = df_cat['progresion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
results = permutation_importance(knn, X, y, scoring='accuracy')
importance = results.importances_mean
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
But what I want to do is evaluate the KNN classifier for each pair of variables to find which pair of variables is more relevant to achieve a better performance of the model.
kNN treats every independent variable (feature) the same: each one contributes to the distance computation in the same way. This makes it pretty difficult to isolate a feature using kNN or to assign it a different weight.
Also, since kNN is a non-parametric algorithm (it makes no assumptions about the form of the data distribution), it learns no per-feature parameters, so unlike with Naive Bayes you can't get any meaningful per-feature probability output.
In this case I would suggest taking a look at decision-tree-based algorithms such as random forests, which in scikit-learn come with a built-in feature_importances_ attribute. This gives you the importance of each feature after fitting the model.
There is a great example here:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
Also see the RF feature_importances_ section here:
Random Forest feature_importances_
If you really want to go against the conventional wisdom and identify feature importance by using kNN algorithm one option can be to construct the model with different features and compare the overall accuracy later.
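As a rough sketch of that option applied to pairs of variables: fit kNN on every two-column subset and compare cross-validated accuracy. The iris data here is only a stand-in for your own table, and the `pair_scores` dictionary and `cv=5` are my choices; with a DataFrame you would select the two columns by name instead of by index.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; replace with your own feature matrix and target
data = load_iris()
X, y, names = data.data, data.target, data.feature_names

pair_scores = {}
for i, j in combinations(range(X.shape[1]), 2):
    knn = KNeighborsClassifier()
    # Mean 5-fold accuracy using only columns i and j
    scores = cross_val_score(knn, X[:, [i, j]], y, cv=5, scoring="accuracy")
    pair_scores[(names[i], names[j])] = scores.mean()

# Pairs sorted from best to worst
for pair, score in sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

With d features this fits d*(d-1)/2 models per fold, which is fine for a handful of columns but grows quickly.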
I know this may or may not be directly addressing your question. But it's what comes to my mind at the moment. Maybe there will be other answers with different angles than mine.
I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination of a carried out grid search.
For the grid search, I used the GridSearchCV class and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same as the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
cv=StratifiedKFold(n_splits=10, random_state=2018),
param_grid={'nu': [0.4]},
scoring=['f1_macro'],
refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])
# Recompute score for exact same input
result = cross_validate(clf,
X,
y,
cv=StratifiedKFold(n_splits=10, random_state=2018),
scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
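This is because mean_test_f1_macro is not a simple average over the folds; it is a weighted average, with the weights being the sizes of the test folds. This applies to scikit-learn releases in which GridSearchCV's iid parameter defaulted to True; the default became False in 0.22 and the parameter was removed in 0.24, after which the plain mean is reported. To learn more about the actual implementation, refer to this answer.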
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights= [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X,y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
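For intuition, here is a small, library-free sketch of why the two averages differ whenever the test folds are not all the same size. The scores and fold sizes are made up for illustration and are not taken from the run above.

```python
# Hypothetical per-fold scores and test-fold sizes
fold_scores = [0.50, 0.40, 0.30]
fold_sizes = [12, 10, 8]

# Plain mean: every fold counts the same
simple_mean = sum(fold_scores) / len(fold_scores)

# Weighted mean: every fold counts in proportion to its number of test samples
weighted_mean = (sum(s * n for s, n in zip(fold_scores, fold_sizes))
                 / sum(fold_sizes))

print(simple_mean)
print(weighted_mean)
```

Here the largest fold also has the highest score, so the weighted mean comes out above the simple mean; with equal fold sizes the two would coincide.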
In the sklearn.model_selection.cross_val_predict page it is stated:
Generate cross-validated estimates for each input data point. It is
not appropriate to pass these predictions into an evaluation metric.
Can someone explain what this means? If this gives an estimate of y (the prediction) for every true y, why can't I calculate metrics such as RMSE or the coefficient of determination using these results?
It seems to be based on how samples are grouped and predicted. From the user guide linked in the cross_val_predict docs:
Warning Note on inappropriate usage of cross_val_predict
The result of
cross_val_predict may be different from those obtained using
cross_val_score as the elements are grouped in different ways. The
function cross_val_score takes an average over cross-validation folds,
whereas cross_val_predict simply returns the labels (or probabilities)
from several distinct models undistinguished. Thus, cross_val_predict
is not an appropriate measure of generalisation error.
cross_val_score computes the metric separately on each fold and then averages over the folds, whereas cross_val_predict pools the predictions made by the different per-fold models into a single array, so a metric computed on that pooled array is not the same quantity and need not match the fold average. For example, using the sample code from the sklearn page:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=3)
print("Cross Val Prediction score:{}".format(mean_squared_error(y,y_pred)))
print("Cross Val Score:{}".format(np.mean(cross_val_score(lasso, X, y, cv=3, scoring = make_scorer(mean_squared_error)))))
Cross Val Prediction score:3993.771257795029
Cross Val Score:3997.1789145156217
Just to add a little more clarity: the difference is easier to see with a scoring function that is not an average over samples, such as the maximum absolute error, rather than something like the mean absolute error.
cross_val_score() would compute the maximum absolute error on each of the 3 folds (assuming a 3-fold cross-validator) and report an aggregate (say, the mean) of those 3 scores. That is, something like mean(a, b, c), where a, b and c are the max-abs-errors of the 3 folds respectively. I guess it is safe to read the returned value as the max-absolute-error of your estimator in the average or general case.
With cross_val_predict() you would get 3 sets of predictions corresponding to the 3 folds, and taking the maximum absolute error over the aggregate (concatenation) of these 3 sets is certainly not the same as above. Even if the predicted values are identical in both scenarios, what you end up with here is max(a, b, c). Also, max(a, b, c) would be an unreasonable and overly pessimistic characterization of the max-absolute-error score of your model.
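The mean(a, b, c) versus max(a, b, c) distinction can be checked with a few made-up numbers. The errors below are hypothetical absolute errors of held-out predictions, not output from any real model.

```python
# Hypothetical absolute errors of the held-out predictions in each of 3 folds
fold_errors = [
    [0.2, 1.5, 0.7],
    [0.4, 0.9, 0.3],
    [3.0, 0.1, 0.6],
]

# cross_val_score style: score each fold, then average the fold scores
per_fold_max = [max(errs) for errs in fold_errors]
mean_of_maxes = sum(per_fold_max) / len(per_fold_max)

# cross_val_predict style: pool all predictions, then score once
pooled = [e for errs in fold_errors for e in errs]
max_of_pooled = max(pooled)

print(per_fold_max)    # a, b, c
print(mean_of_maxes)   # roughly 1.8
print(max_of_pooled)   # the single worst error across all folds
```

The pooled score is driven entirely by the one worst prediction, while the per-fold average also reflects the two folds that behaved better.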
I am trying to find best K value for KNeighborsClassifier.
This is my code for iris dataset:
k_loop = np.arange(1, 30)
k_scores = []
for k in k_loop:
    knn = KNeighborsClassifier(n_neighbors=k)
    cross_val = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(cross_val.mean())
I have taken mean of cross_val_score in each loop and plotted it.
plt.style.use('fivethirtyeight')
plt.plot(k_loop, k_scores)
plt.show()
This is the result.
You can see the accuracy is higher when k is between 14 and 20.
1) How can I choose the best value of k?
2) Are there any other ways to calculate and find the best value of k?
3) Any other improvement suggestions are also appreciated. I'm new to ML.
Let's first define what K is.
K is the number of voters that the algorithm consults to decide which class a given data point belongs to.
In other words, K shapes the boundaries between the classes.
Accordingly, the boundaries become smoother as K increases.
So, logically speaking, if we increase K towards the size of the training set, every point ends up assigned to whichever class holds the overall majority! That leads to what is called high bias (i.e. underfitting).
In contrast, if we set K to 1, the error on the training sample will always be zero, because the closest point to any training data point is itself. However, we end up overfitting the boundaries (i.e. high variance), so the model cannot generalize to new, unseen data!
Unfortunately, there is no rule of thumb. The choice of K is driven partly by the end application and partly by the dataset.
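The two extremes described above can be reproduced with a tiny hand-written kNN. This is only a sketch on a made-up 1-D dataset with distinct points; `knn_predict` and `training_error` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

# Toy 1-D training set with distinct points (made up for illustration)
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0]
train_y = ["a", "a", "a", "b", "b", "b", "b"]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def training_error(k):
    """Fraction of training points that kNN with this k misclassifies."""
    wrong = sum(knn_predict(x, k) != y for x, y in zip(train_x, train_y))
    return wrong / len(train_x)

print(training_error(1))             # 0.0: each point is its own nearest neighbour
print(training_error(len(train_x)))  # every point gets the majority class "b"
```

With K equal to the number of training points, all three "a" points are misclassified, which is the underfitting end of the trade-off.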
Suggested Solution
Use GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator, to try to find the best value of K.
As for the upper limit of K, I don't let it exceed the number of elements in the largest class, and this hasn't let me down so far (see the example below).
Example:
import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
X, y = iris.data, iris.target
# get the max class with respect to the number of elements
max_class = np.max(np.bincount(y))
# you can add other parameters after doing your homework research
# for example, you can add 'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
grid_param = {'n_neighbors': range(1, max_class)}
model = KNeighborsClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2)
clf = GridSearchCV(model, grid_param, cv=cv, scoring='accuracy')
clf.fit(X, y)
print("Best Estimator: \n{}\n".format(clf.best_estimator_))
print("Best Parameters: \n{}\n".format(clf.best_params_))
print("Best Score: \n{}\n".format(clf.best_score_))
Result
Best Estimator:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=17, p=2,
weights='uniform')
Best Parameters:
{'n_neighbors': 17}
Best Score:
0.98
An Update Regarding RepeatedStratifiedKFold
In simple words, it's a KFold that is repeated n_repeats times. Why? Because averaging over several different splits gives a more stable estimate of performance.
It is also stratified, i.e. it seeks to ensure that each class is approximately equally represented across each test fold (each fold is representative of all strata of the data).
Based on the graph, I would say 13.
I assume this is a classification job. In that case, do not set k to an even number. E.g. if you have 2 classes, A and B, and k is set to 4, the new data point may fall between 2 points of class A and 2 points of class B. You will then have 2 votes to classify the new point as A and 2 votes to classify it as B. Setting k to an odd number avoids this situation.
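A short sketch of the voting step makes the tie visible. The `vote` helper is hypothetical and only flags a tie; how a real library resolves one is implementation-specific.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return the winning label, or None when the top vote is tied."""
    counts = Counter(neighbor_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

print(vote(["A", "A", "B", "B"]))       # None: k=4 splits 2-2
print(vote(["A", "A", "B", "B", "B"]))  # B: k=5 cannot tie with two classes
print(vote(["A", "B", "C"]))            # None: odd k can still tie with 3 classes
```

So the odd-k rule guarantees a winner only in the two-class case.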
I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors - one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm

clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    # predict expects a 2-D array, so keep row i as a 1-row slice
    preds[i] = clf.predict(observations[i:i + 1, :])[0]
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score  # formerly sklearn.cross_validation

clf = svm.SVC(...)
loo = LeaveOneOut()
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
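If the unit to leave out is a whole experiment rather than a single vector, one possible sketch uses LeaveOneGroupOut from sklearn.model_selection, with an array that records which experiment each row belongs to. The data below is random and the name `experiment_id` is my own; the layout of 10 experiments with 6 vectors each is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Made-up data: 10 experiments with 6 vectors each, 4 features per vector
n_experiments, per_experiment = 10, 6
observations = rng.normal(size=(n_experiments * per_experiment, 4))
labels = rng.randint(2, size=len(observations))
# experiment_id[i] says which experiment row i came from
experiment_id = np.repeat(np.arange(n_experiments), per_experiment)

clf = SVC()
# Each split holds out all rows of exactly one experiment
scores = cross_val_score(clf, observations, labels,
                         groups=experiment_id, cv=LeaveOneGroupOut())

print(len(scores))    # one score per held-out experiment
print(scores.mean())
```

Each entry of `scores` is then the accuracy on one held-out experiment, so vectors from the same experiment never appear on both the training and the test side of a split.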