Can SHAP values be trusted as an explanation? - python

I am having trouble using SHAP values to interpret a tree-based model.
I have around 30 input features, two of which have a high positive correlation with each other.
I trained an XGBoost model (Python) and looked at the SHAP values of these two features: the SHAP values are negatively correlated.
Could you explain why the SHAP values of the two features don't show the same correlation as the inputs? And can I trust the SHAP output or not?
The correlation between input: 0.91788
The correlation between SHAP values: -0.661088
The 2 features are
1) Population in province and
2) Number of families in province.
Model Performance
Train AUC: 0.73
Test AUC: 0.71
Scatter plot
Input scatter plot (x: Number of families in province, y: Population in province)
SHAP values output scatter plot (x: Number of families in province, y: Population in province)

You can have correlated variables that have opposite effects on the model output.
As an example, let's take the case of predicting risk of mortality given two features: 'age' and 'trips to doctors'. Although these two variables are positively correlated, their effects are different. All other things held constant, a higher 'age' leads to a higher risk of mortality (according to the trained model), and a higher number of 'trips to doctors' leads to a smaller risk of mortality.
XGBoost (and SHAP) isolates the effect of these two correlated variables by conditioning on the other variable: e.g. splitting on the 'trips to doctors' feature after splitting on the 'age' feature. The assumption here is that they are not perfectly correlated.
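The effect can be reproduced with a small sketch. It uses a linear model instead of XGBoost, because for a linear model with features treated as independent the SHAP value of feature i has the closed form beta_i * (x_i - mean(x_i)). The data, the coefficients and all variable names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: two positively correlated features.
age = rng.normal(60, 10, size=1000)
doctor_trips = 0.1 * age + rng.normal(0, 0.5, size=1000)

# Hypothetical linear risk model with opposite-signed effects.
beta_age, beta_trips = 0.05, -0.30

# Closed-form SHAP values of a linear model (independent-feature assumption):
# phi_i = beta_i * (x_i - mean(x_i))
shap_age = beta_age * (age - age.mean())
shap_trips = beta_trips * (doctor_trips - doctor_trips.mean())

input_corr = np.corrcoef(age, doctor_trips)[0, 1]
shap_corr = np.corrcoef(shap_age, shap_trips)[0, 1]
print(f"input correlation: {input_corr:.3f}")
print(f"SHAP correlation:  {shap_corr:.3f}")
```

Because the two coefficients have opposite signs, the correlation between the SHAP values is the exact negative of the correlation between the inputs. A tree model is not this tidy, but the same mechanism can flip the sign.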

XGBoost is not a linear model, i.e. the relationship between the input features X and the predictions y is not linear. SHAP values build an additive explanation of y: each prediction is decomposed into a base value plus one contribution per feature. Therefore, it is entirely expected that the correlation between two input features and the correlation between their SHAP values do not match.


Training xgboost with soft labels

I'm trying to distill the predictions of another classifier model, "C" using xgboost. Thus, instead of labels, I have the probabilities predicted by C for the samples being positive.
I've tried doing the most obvious thing, using the probabilities output by C as if they were labels
distill_model = XGBClassifier(learning_rate=0.1, max_depth=10, n_estimators=100).fit(X, probabilities)
but it seems that in that case XGBoost just translates each distinct probability value to its own class. So if C outputs 72 distinct values, XGBoost considers them 72 different classes. I've tried changing the objective function to multi:softmax/multi:softprob but that didn't help.
Any suggestions?
There is probably an XGBoost-specific method using a custom loss. A generic solution, however, is to split each training row into two rows, one for each label, and assign each row the original probability of that label as its weight.
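A minimal sketch of the row-splitting idea, using only NumPy. The helper name expand_soft_labels and the toy arrays are hypothetical. Each row appears once with label 1 (weight p) and once with label 0 (weight 1 - p), so the weighted log loss matches the cross-entropy against the soft label.

```python
import numpy as np

def expand_soft_labels(X, p):
    """Duplicate every row: label 1 with weight p, label 0 with weight 1 - p."""
    X = np.asarray(X)
    p = np.asarray(p, dtype=float)
    n = len(p)
    X_rep = np.concatenate([X, X], axis=0)
    y_rep = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    w_rep = np.concatenate([p, 1.0 - p])
    return X_rep, y_rep, w_rep

# Toy data: three samples and the probabilities predicted by model C.
X = np.array([[0.1, 2.0], [0.4, 1.0], [0.9, 3.0]])
probabilities = np.array([0.72, 0.10, 0.95])

X_rep, y_rep, w_rep = expand_soft_labels(X, probabilities)
print(X_rep.shape, y_rep, w_rep)
```

The expanded arrays can then be passed to the classifier, e.g. XGBClassifier(...).fit(X_rep, y_rep, sample_weight=w_rep).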

Different F1 scores for different preprocessing techniques- sklearn

I am building a classification model using sklearn's GradientBoostingClassifier. For the same model, I tried different preprocessing techniques: StandardScaler, scale, and Normalizer on the same data, but I am getting a different f1_score each time. It is highest for StandardScaler and lowest for Normalizer. Why is that? Is there any other technique that would give an even higher score?
The difference lies in their respective definitions:
StandardScaler: Standardize features by removing the mean and scaling to unit variance
Normalizer: Normalize samples individually to unit norm.
scale: Standardize a dataset along any axis. Center to the mean and component-wise scale to unit variance.
Each technique transforms the data differently, so the data used to fit your model changes and so does the F1 score.
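To make the definitions concrete, here is a small NumPy sketch of what the two transformers compute with their default settings (column-wise standardization versus row-wise L2 normalization). The toy matrix is made up. It also checks whether the per-feature ordering of the samples survives, which is what threshold-based splits in a tree model depend on.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

# Column-wise: what StandardScaler (and scale with axis=0) computes.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Row-wise: what Normalizer computes with the L2 norm.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

def keeps_feature_order(original, transformed):
    """True if every column keeps the same ranking of samples."""
    return all(
        (np.argsort(original[:, j]) == np.argsort(transformed[:, j])).all()
        for j in range(original.shape[1])
    )

same_order_std = keeps_feature_order(X, standardized)
same_order_norm = keeps_feature_order(X, normalized)
print(same_order_std, same_order_norm)
```

Standardization only shifts and rescales each feature, so the ranking within a feature is unchanged. Normalizer mixes the features of each sample, which can change the ranking and therefore gives the model genuinely different data.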

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I split the dataset 70%-30% randomly into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests on the whole X_train, on undersampled training X_und and on oversampled X_sm respectively. X_und was generated by simply cutting down at random the rows of X_train labelled by 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn's GridSearchCV I tuned the three models to get the best parameters:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15], 'max_features': [0.5, 'sqrt', 'log2']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss, verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
max_features=0.5, n_jobs=-1, oob_score=True, class_weight={0:1,1:5})
trained on the whole X_train and giving classification report on the test set
              precision    recall  f1-score   support
           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668
   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
              precision    recall  f1-score   support
           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668
For oversampling:
              precision    recall  f1-score   support
           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668
I played with the class_weight parameter to give more weight to the 1s, and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. When running GridSearchCV to tune the forests, I used different scores, focusing especially on f1 and roc_auc, hoping to decrease the false negatives. I got great scores with SMOTE oversampling, but this model did not generalize well to the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of false negatives and increase the recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in sklearn's cross-validation methods like GridSearchCV.
I selected only the most important features, but this did not improve the result; on the contrary, it got worse. I noticed that the feature importances obtained from training a RF after SMOTE were completely different from those of the original sample.
I don't know exactly what to do with the oob_score other than treating it as a free validation score obtained when training the forests. With oversampling I get the highest oob_score = 0.9535, but this is to be expected since the training set is balanced in that case. The problem remains that it does not generalize well to the test set.
I have run out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

Which averaging should be used when computing the ROC AUC on an imbalanced data set?

I am doing a binary classification task on an imbalanced data set, and right now I am computing the ROC AUC using:
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro')
I have two questions:
I am not sure whether macro averaging is influenced by the class imbalance here. What is the best averaging in this situation (when classifying imbalanced classes)?
Is there a reference that shows how scikit-learn calculates the ROC AUC for the different values of the average argument?
If your target variable is binary, then average does not make sense and is ignored, as the documentation for this parameter also notes.
average='weighted' is your choice for the problem of imbalanced classes.
Using average='macro' is the reasonable way to go. Hopefully, you already trained your model with consideration of the data's imbalance. So now, when evaluating performance, you want to give both classes the same weight.
For example, if your set consists of 90% positive examples, and let's say the roc auc for the positive label is 0.8, and the roc auc for the negative label is 0.4. Using average='weighted' will produce an average roc auc of 0.8 * 0.9 + 0.4 * 0.1 = 0.76. Obviously, it is mostly affected by the positive label's score. Using average='macro' will result in a score that gives the minority label (0) equal weight. In this case, 0.6.
To conclude, if you don't care much about precision and recall relating to the negative label, use average='weighted'. Otherwise, use average='macro'.
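The arithmetic of the example above can be checked in a few lines. The per-label scores and class frequencies are the illustrative numbers from the answer, not the output of a real model.

```python
# Per-label ROC AUC and class frequencies from the example above.
scores = {"positive": 0.8, "negative": 0.4}
support = {"positive": 0.9, "negative": 0.1}

# Weighted average: each label counts in proportion to its frequency.
weighted = sum(scores[label] * support[label] for label in scores)

# Macro average: every label counts equally.
macro = sum(scores.values()) / len(scores)

print(f"weighted: {weighted:.2f}, macro: {macro:.2f}")
# weighted: 0.76, macro: 0.60
```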

Using RandomForestClassifier.predict_proba vs RandomForestRegressor.predict

I have a data set comprising a vector of features, and a target - either 1.0 or 0.0 (representing two classes). If I fit a RandomForestRegressor and call its predict function, is it equivalent to using RandomForestClassifier.predict_proba()?
In other words if the target is 1.0 or 0.0 does RandomForestRegressor output probabilities?
I think so, and the results I am getting suggest so, but I would like to get a second opinion...
There is a major conceptual difference between the two, based on the different tasks being addressed:
Regression: continuous (real-valued) target variable.
Classification: discrete target variable (classes).
For a general classification method, the notion of the probability of an observation belonging to class X may not be defined, as some classification methods, kNN for example, do not deal with probabilities.
However, for Random Forest (and some other classification methods), classification is reduced to regression of the class probability distribution. The predicted class is then taken as the argmax of the computed "probabilities". In your case, you feed the same input and you get the same result. And yes, it is OK to treat the values returned by RandomForestRegressor as probabilities.
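A small sketch of why the two views coincide for a 0/1 target. The leaf contents below are hypothetical: they stand for the training labels in the leaf that one sample falls into, in each of three trees.

```python
# Hypothetical 0/1 training labels in the leaf reached by one sample, per tree.
leaf_labels_per_tree = [[1, 0, 1, 1], [1, 1], [0, 1, 1, 0, 1]]
n_trees = len(leaf_labels_per_tree)

# Regressor view: each tree predicts the mean target in the leaf,
# and the forest averages the trees.
regressor_pred = sum(sum(leaf) / len(leaf) for leaf in leaf_labels_per_tree) / n_trees

# Classifier view: each tree predicts the fraction of class-1 samples in the leaf,
# and the forest averages these fractions.
proba_class_1 = sum(leaf.count(1) / len(leaf) for leaf in leaf_labels_per_tree) / n_trees

print(regressor_pred, proba_class_1)
```

For identical trees the two numbers are the same. In scikit-learn the two estimators have different defaults (for example max_features), so the fitted trees, and therefore the outputs, can differ somewhat unless the hyperparameters are matched.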

