I've been trying to figure out scikit-learn's Random Forest sample_weight usage and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.
In particular, I was expecting that if I used a sample_weight array of all 1's I would get the same result as with sample_weight=None. Additionally, I was expecting that any array of equal weights (i.e. all 1s, all 10s or all 0.8s...) would give the same result. Perhaps my intuition of weights is wrong in this case.
Here's the code:
import numpy as np
from sklearn import ensemble, metrics, datasets

# create a synthetic dataset with unbalanced classes
X, y = datasets.make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9],
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=0)

model = ensemble.RandomForestClassifier()
w0 = 1  # weight assigned to class 0
w1 = 1  # weight assigned to class 1

# I should split train and validation, but for the sake of understanding sample_weight I'll skip this step
model.fit(X, y, sample_weight=np.array([w0 if r == 0 else w1 for r in y]))
preds = model.predict(X)
probas = model.predict_proba(X)

ACC = metrics.accuracy_score(y, preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y, preds)

print("ACCURACY:", ACC)
print("ROC:", ROC)
print("F1 Score:", metrics.f1_score(y, preds))
print("TP:", cm[1, 1], cm[1, 1] / (cm.sum() + 0.0))
print("FP:", cm[0, 1], cm[0, 1] / (cm.sum() + 0.0))
print("Precision:", cm[1, 1] / (cm[1, 1] + cm[0, 1] * 1.0))
print("Recall:", cm[1, 1] / (cm[1, 1] + cm[1, 0] * 1.0))
With w0=w1=1 I get, for instance, F1=0.9456.
With w0=w1=10 I get, for instance, F1=0.9569.
With sample_weight=None I get F1=0.9474.
With the Random Forest algorithm there is, as the name implies, some randomness to it.
You are getting different F1 scores because the Random Forest algorithm uses a random subset of your data to grow each decision tree and then averages across all of the trees. Since random_state is not fixed on the classifier, every call to fit draws different bootstrap samples and feature subsets, so I am not surprised that you get similar (but non-identical) F1 scores for each of your runs, even with identical weights.
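As a quick sanity check (a minimal sketch, not from the original post): with random_state fixed on the classifier, sample_weight=None and any constant-weight array should produce exactly the same fitted forest, because uniformly scaling the weights does not change any split decision.

import numpy as np
from sklearn import ensemble, datasets, metrics

# Same kind of unbalanced dataset as in the question
X, y = datasets.make_classification(n_samples=10000, n_features=20,
                                     weights=[0.9], random_state=0)

def f1_with_weights(sample_weight):
    # Fixing random_state removes the run-to-run randomness of the forest
    model = ensemble.RandomForestClassifier(random_state=0)
    model.fit(X, y, sample_weight=sample_weight)
    return metrics.f1_score(y, model.predict(X))

print(f1_with_weights(None))                   # no weights
print(f1_with_weights(np.ones(len(y))))        # all 1s
print(f1_with_weights(np.full(len(y), 10.0)))  # all 10s -- should match the two above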
I have tried balancing the weights before. You may want to try balancing the weights by the size of each class in the population. For example, if you were to have two classes as such:
Class A: 5 members
Class B: 2 members
You may wish to balance the weights by assigning 2/7 for each of Class A's members and 5/7 for each of Class B's members. That's just an idea as a starting place, though. How you weight your classes will depend on the problem you have.
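As a concrete sketch of this idea (not from the original answer), scikit-learn can compute inverse-frequency weights for you via sklearn.utils.class_weight.compute_sample_weight, which is equivalent to the proportional scheme above up to a constant factor:

from sklearn import ensemble
from sklearn.utils.class_weight import compute_sample_weight

# y is the label vector from the question; 'balanced' weights each sample by
# n_samples / (n_classes * count_of_its_class)
sw = compute_sample_weight(class_weight='balanced', y=y)

model = ensemble.RandomForestClassifier(random_state=0)
model.fit(X, y, sample_weight=sw)

# The same effect without building the array explicitly:
model = ensemble.RandomForestClassifier(class_weight='balanced', random_state=0)
model.fit(X, y)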
Related
I'm building a decision tree model based on data from the "Give me some credit" Kaggle competition (https://www.kaggle.com/competitions/GiveMeSomeCredit/overview). I'm trying to train this model on the competition's training dataset and then apply it to my own dataset for research.
The problem I'm facing is that the f1 score my model gets and the results presented by the confusion matrix do not seem to correlate: the higher the f1 score, the worse the label prediction becomes. Currently my best parameters for maximizing f1 are the following (the way I measure the score is included):
from sklearn.model_selection import RandomizedSearchCV
import xgboost

classifier = xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree', importance_type='gain')

params = {
    "colsample_bytree": [0.3],
    "gamma": [0.3],
    "learning_rate": [0.1],
    "max_delta_step": [1],
    "max_depth": [4],
    "min_child_weight": [9],
    "n_estimators": [150],
    "num_parallel_tree": [1],
    "random_state": [0],
    "reg_alpha": [0],
    "reg_lambda": [0],
    "scale_pos_weight": [4],
    "validate_parameters": [1],
    "n_jobs": [-1],
    "subsample": [1],
}

clf = RandomizedSearchCV(classifier, param_distributions=params, n_iter=100, scoring='f1', cv=10, verbose=3)
clf.fit(X, y)
These parameters give me an f1 score of ≈0.46.
However, when I plot this model's confusion matrix, the label prediction accuracy for label "1" is only 50% (picture below).
When tuning the parameters to achieve better label prediction instead, I can improve the label prediction accuracy to 97% for both labels, but that decreases the f1 score to about 0.3. Here's the code I use for creating the confusion matrix (the parameters included are the ones that give the f1 score of 0.3):
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
from numpy import nan
import numpy as np
import matplotlib.pyplot as plt

final_model = XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                            colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
                            early_stopping_rounds=None, enable_categorical=False,
                            eval_metric=None, gamma=0.2, gpu_id=0, grow_policy='depthwise',
                            importance_type='gain', interaction_constraints='',
                            learning_rate=1.5, max_bin=256, max_cat_to_onehot=4,
                            max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=9,
                            missing=nan, monotone_constraints='()', n_estimators=800,
                            n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0,
                            reg_alpha=0, reg_lambda=1, scale_pos_weight=5)

final_model.fit(X, y)
pred_xgboost = final_model.predict(X)

cm = confusion_matrix(y, pred_xgboost)
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]  # row-normalized

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(cm_norm, classes=final_model.classes_)  # plot_confusion_matrix is the plotting helper used in my code
And here's the confusion matrix for these parameters:
I don't understand why there is seemingly no correlation between these two metrics (f1 score and confusion matrix accuracy). Perhaps a different scoring system would prove more useful?
Would you kindly show the absolute values?
Technically, cm_norm = cm/cm.sum(axis=1)[:, np.newaxis] represents recall, not accuracy. You can easily get a matrix with good recall but poor precision for the positive class (e.g. [[9000, 300], [1, 30]]); you can check your precision using the same code with axis=0. (F1 is the harmonic mean of your positive class's recall and precision.)
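A quick illustration with that hypothetical matrix (a sketch, not part of the original answer):

import numpy as np

cm = np.array([[9000, 300],
               [1, 30]])  # rows = true class, columns = predicted class

recall_per_class = cm / cm.sum(axis=1)[:, np.newaxis]     # row-normalized: recall
precision_per_class = cm / cm.sum(axis=0)[np.newaxis, :]  # column-normalized: precision

print(recall_per_class.diagonal())     # ~[0.968, 0.968]: both classes look fine
print(precision_per_class.diagonal())  # ~[1.000, 0.091]: positive-class precision is poor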
If you wish to optimize for F1, you should also look for an optimal classification threshold using sklearn.metrics.precision_recall_curve().
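For example (a minimal sketch, assuming X, y and the fitted final_model from the question; ideally you would do this on a validation set rather than the training data), you can pick the threshold that maximizes F1 instead of the default 0.5:

import numpy as np
from sklearn.metrics import precision_recall_curve

proba = final_model.predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, proba)

# F1 at each candidate threshold; the last precision/recall pair has no threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

preds = (proba >= best_threshold).astype(int)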
There is a relationship, although it is not so obvious. It helps to understand it better if you generate a classification report.
Also, a higher max_rate can change the value of recall (specificity), which affects one class's f1-score in the classification report but not the f1-score derived from f1_score(y_valid, predictions). Oversampling can also affect recall.
from sklearn.metrics import classification_report
ClassificationReport = classification_report(y_valid,predictions.round(),output_dict=True)
The f1-score is the balance between precision and recall, and the confusion matrix below (with percentages computed over the predicted labels) shows the precision of both classes. With the classification report, I can see the relationship, as in the example below.
Classification Report
     precision    recall  f1-score   support
0     0.722292  0.922951  0.810385   23167.0
1     0.982273  0.923263  0.951854  107132.0
Confusion Matrix using Validation Data (y_valid)
True Negative : CHGOFF (0) was predicted 21382 times correctly (72.23 %)
False Negative : CHGOFF (0) was predicted 8221 times incorrectly (27.77 %)
True Positive : P I F (1) was predicted 98911 times correctly (98.23 %)
False Positive : P I F (1) was predicted 1785 times incorrectly (1.77 %)
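To make the link explicit (using the numbers above): the 98.23 % for class 1 is its precision, 98911 / (98911 + 1785) ≈ 0.9823, and the 72.23 % for class 0 is 21382 / (21382 + 8221) ≈ 0.7223, exactly the precision column of the classification report. Recall for class 1 is 98911 / 107132 ≈ 0.9233, and its f1-score is the harmonic mean 2 · (0.9823 · 0.9233) / (0.9823 + 0.9233) ≈ 0.9519, again matching the report.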
I'm using a diabetes dataset whose target variable has 3 classes. I have used a DecisionTreeClassifier, optimized its hyperparameters with scikit-learn's RandomizedSearchCV, and fitted the model to the training data. I have then computed the predicted probabilities for the test data, which give the probability of assigning each observation to each of the 3 classes. Now I want to calculate a cutoff value that I can use to assign the classes, and for this purpose I'm using the F1 score to find the appropriate cutoff value.
I'm stuck on how to compute the F1 score here. Will the F1 score metric help me find the cutoff?
Here is the dataset
After preprocessing the data, I have split it into training and testing sets.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

dtree = DecisionTreeClassifier()
params = {'class_weight': [None, 'balanced'],
          'criterion': ['entropy', 'gini'],
          'max_depth': [None, 5, 10, 15, 20, 30, 50, 70],
          'min_samples_leaf': [1, 2, 5, 10, 15, 20],
          'min_samples_split': [2, 5, 10, 15, 20]}

grid_search = RandomizedSearchCV(dtree, cv=10, n_jobs=-1, n_iter=10, scoring='roc_auc_ovr',
                                 verbose=20, param_distributions=params)
grid_search.fit(X_train, y_train)

mdl = grid_search.best_estimator_  # the tuned tree (this assignment was not shown in my original snippet)
mdl.fit(X_train, y_train)
test_score = mdl.predict_proba(X_test)
Here is the formula I created to find the cutoff for a binary classifier:
import numpy as np

cutoffs = np.linspace(0.01, 0.99, 99)
true = y_train
train_score = mdl.predict_proba(X_train)[:, 1]

F1_all = []
for cutoff in cutoffs:
    pred = (train_score > cutoff).astype(int)
    TP = ((pred == 1) & (true == 1)).sum()
    FP = ((pred == 1) & (true == 0)).sum()
    TN = ((pred == 0) & (true == 0)).sum()
    FN = ((pred == 0) & (true == 1)).sum()
    F1 = TP / (TP + 0.5 * (FP + FN))
    F1_all.append(F1)

F1_all = np.array(F1_all)
my_cutoff = cutoffs[np.argmax(F1_all)]              # first cutoff with the highest F1
preds = (test_score[:, 1] > my_cutoff).astype(int)  # positive-class column of predict_proba(X_test)
There is no cutoff value for the softmax output of a multiclass classifier in the same sense as the cutoff value for a binary classifier.
When your output is normalized probabilities for multiple classes and you want to convert this into class labels, you just take the label with the highest assigned probability.
Technically you could design some custom scheme such as:
if class1 has a probability of 10% or more, choose the class1 label; otherwise
pick the class with the highest assigned probability
which would be a sort of cutoff for class 1, but this is rather arbitrary and I have not seen anyone do it in practice. If you have some deep insight into your problem suggesting that something like this may be useful, then go ahead and build your own "cutoff" formula; otherwise you should just stick with the general approach (argmax of the normalized probabilities).
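A minimal sketch of both options (assuming mdl and X_test from the question; the 10% floor is just the illustrative number from above):

import numpy as np

proba = mdl.predict_proba(X_test)  # shape (n_samples, 3) for the three classes

# General approach: take the class with the highest probability
labels_argmax = proba.argmax(axis=1)

# Hand-made "cutoff" scheme as described above: pick class 1 whenever its
# probability reaches 10%, otherwise fall back to the argmax rule
labels_custom = np.where(proba[:, 1] >= 0.10, 1, proba.argmax(axis=1))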
I am looking to perform a backward feature selection process on a logistic regression with the AUC as a criterion. For building the logistic regression I used the scikit-learn library, but unfortunately it does not seem to have a method for backward feature selection. My dependent variable is a binary banking crisis variable and I have 13 predictors. Does anybody have any suggestions on how to handle this?
The code below shows how I compute the AUC. The problem is that I do not know how to decide which feature to prune, i.e. which one is less important than the others.
from sklearn import metrics
from sklearn.model_selection import train_test_split

def cv_loop(X, y, model, N):
    mean_auc = 0.
    for i in range(N):
        X_train, X_cv, y_train, y_cv = train_test_split(
            X, y, test_size=.20,
            random_state=i * SEED)  # SEED is a constant defined elsewhere in my script
        model.fit(X_train, y_train)
        preds = model.predict_proba(X_cv)[:, 1]
        fpr, tpr, _ = metrics.roc_curve(y_cv, preds)
        auc = metrics.auc(fpr, tpr)
        print("AUC (fold %d/%d): %f" % (i + 1, N, auc))
        mean_auc += auc
    return mean_auc / N
If you need more background information let me know!
Many thanks in advance,
Joris
scikit-learn has Recursive Feature Elimination (RFE) in its feature_selection module, which almost does what you described.
Given an external estimator that assigns weights to features (e.g.,
the coefficients of a linear model), the goal of recursive feature
elimination (RFE) is to select features by recursively considering
smaller and smaller sets of features. First, the estimator is trained
on the initial set of features and the importance of each feature is
obtained either through a coef_ attribute or through a
feature_importances_ attribute. Then, the least important features are
pruned from current set of features. That procedure is recursively
repeated on the pruned set until the desired number of features to
select is eventually reached.
This doesn't explicitly optimize for AUC, however; it does the pruning by looking at the coefficients of the logistic regression.
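If you want the elimination criterion itself to be AUC, one option is to code the backward pass by hand with cross_val_score(scoring='roc_auc'). This is only a sketch, assuming X is a pandas DataFrame holding the 13 predictors and y is the binary crisis label:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_select(X, y, min_features=1, cv=5):
    features = list(X.columns)
    model = LogisticRegression(max_iter=1000)
    while len(features) > min_features:
        # Mean cross-validated AUC obtained when each remaining feature is dropped in turn
        scores = {f: cross_val_score(model, X[[c for c in features if c != f]], y,
                                     scoring='roc_auc', cv=cv).mean()
                  for f in features}
        worst, auc_without = max(scores.items(), key=lambda kv: kv[1])
        print("dropping %s -> mean AUC without it: %.4f" % (worst, auc_without))
        features.remove(worst)  # dropping this feature hurts AUC the least
    return features

You could also stop early once dropping any remaining feature starts to lower the AUC noticeably, rather than going all the way down to min_features.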
I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination from a grid search I had carried out.
For the grid search I used the GridSearchCV class, and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same, since the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
                    cv=StratifiedKFold(n_splits=10, random_state=2018),
                    param_grid={'nu': [0.4]},
                    scoring=['f1_macro'],
                    refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])

# Recompute score for exact same input
result = cross_validate(clf,
                        X,
                        y,
                        cv=StratifiedKFold(n_splits=10, random_state=2018),
                        scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
It is because mean_test_f1_macro is not a simple average over the folds; it is a weighted average, with the weights being the sizes of the test folds. (This is the behaviour of the old iid=True default of GridSearchCV; recent scikit-learn versions have removed the iid parameter and use a plain mean.) To know more about the actual implementation, refer to this answer.
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights= [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X,y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
I am trying to find best K value for KNeighborsClassifier.
This is my code for iris dataset:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

k_loop = np.arange(1, 30)
k_scores = []
for k in k_loop:
    knn = KNeighborsClassifier(n_neighbors=k)
    cross_val = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(cross_val.mean())
I have taken the mean of cross_val_score in each iteration and plotted it.
plt.style.use('fivethirtyeight')
plt.plot(k_loop, k_scores)
plt.show()
This is the result.
You can see the accuracy is higher when k is between 14 and 20.
1) How can I choose the best value of k?
2) Are there any other ways to calculate and find the best value of K?
3) Any other improvement suggestions are also appreciated. I'm new to ML.
Let's first define what K is.
K is the number of voters the algorithm consults to decide which class a given data point belongs to.
In other words, it uses K to draw the boundaries of each class. These boundaries segregate each class from the others.
Accordingly, the boundary becomes smoother as K increases.
So logically speaking, if we increased K to infinity, every point would end up in whichever class holds the overall majority. However, that leads to what is called high bias (i.e. underfitting).
In contrast, if we make K equal to 1, the error on the training sample will always be zero, because the closest point to any training data point is itself. Nevertheless, we would end up overfitting the boundaries (i.e. high variance), so the model cannot generalize to new, unseen data.
Unfortunately, there is no rule of thumb. The choice of K is driven by the end application as well as the dataset.
Suggested Solution
Use GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. We use it to try to find the best value of K.
Personally, when setting the upper bound for K, I don't exceed the number of elements in the largest class, and it hasn't let me down so far (see the example below to see what I mean).
Example:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
X, y = iris.data, iris.target
# get the max class with respect to the number of elements
max_class = np.max(np.bincount(y))
# you can add other parameters after doing your homework research
# for example, you can add 'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
grid_param = {'n_neighbors': range(1, max_class)}
model = KNeighborsClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2)
clf = GridSearchCV(model, grid_param, cv=cv, scoring='accuracy')
clf.fit(X, y)
print("Best Estimator: \n{}\n".format(clf.best_estimator_))
print("Best Parameters: \n{}\n".format(clf.best_params_))
print("Best Score: \n{}\n".format(clf.best_score_))
Result
Best Estimator:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=17, p=2,
weights='uniform')
Best Parameters:
{'n_neighbors': 17}
Best Score:
0.98
An Update Regarding RepeatedStratifiedKFold
In simple words, it's a KFold that is repeated n_repeats times. Why? Because it may lower bias and give you a better estimate in statistical terms.
It is also Stratified, that is, it seeks to ensure that each class is approximately equally represented across each test fold (i.e. each fold is representative of all strata of the data).
Based on the graph, I would say 13.
I assume this is a classification job. In that case, do not set k to an even number.
For example, if you have two classes, A and B, and k is set to 4, there is a possibility that a new data point lies between 2 points of class A and 2 points of class B. You would then have 2 votes to classify the new point as A and 2 votes to classify it as B.
Setting k to an odd number avoids this tie situation.
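A tiny illustration of that tie (a sketch, not from the original answer; the neighbour labels are made up):

from collections import Counter

# Labels of the 5 nearest neighbours of some new point, closest first
neighbour_labels = ['A', 'A', 'B', 'B', 'B']

print(Counter(neighbour_labels[:4]))  # Counter({'A': 2, 'B': 2}) -- a 2-2 tie with k=4
print(Counter(neighbour_labels[:5]))  # Counter({'B': 3, 'A': 2}) -- clear majority with k=5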