Training xgboost with soft labels


I'm trying to distill the predictions of another classifier model, "C", using xgboost. Thus, instead of labels, I have the probabilities predicted by C that each sample is positive.
I've tried the most obvious thing, using the probabilities output by C as if they were labels:
distill_model = XGBClassifier(learning_rate=0.1, max_depth=10, n_estimators=100)
distill_model.fit(X, probabilities)
but in that case XGBoost treats each distinct probability value as its own class. So if C outputs 72 distinct values, XGBoost treats them as 72 different classes. I've tried changing the objective function to multi:softmax/multi:softprob, but that didn't help.
Any suggestions?

There is probably an xgboost-specific method using a custom loss. But a generic solution is to split each training row into two rows, one with each label, and give each row the probability of its label as its weight: the positive copy gets weight p and the negative copy gets weight 1 - p.
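A minimal sketch of the row-splitting idea, using NumPy only. The helper name expand_soft_labels and the toy arrays are made up for illustration; they are not part of xgboost.

```python
import numpy as np

def expand_soft_labels(X, p):
    """Duplicate every row: one copy labelled 1 with weight p,
    one copy labelled 0 with weight 1 - p (hypothetical helper)."""
    X = np.asarray(X)
    p = np.asarray(p, dtype=float)
    X2 = np.concatenate([X, X], axis=0)
    y2 = np.concatenate([np.ones(len(X), dtype=int),
                         np.zeros(len(X), dtype=int)])
    w2 = np.concatenate([p, 1.0 - p])
    return X2, y2, w2

# Toy data: three samples and the probabilities predicted by C.
X = np.array([[0.1, 1.0], [0.4, 0.2], [0.9, 0.7]])
probabilities = np.array([0.72, 0.10, 0.95])

X2, y2, w2 = expand_soft_labels(X, probabilities)
print(y2)  # hard labels for the doubled data set
print(w2)  # per-row weights
```

The expanded arrays could then be passed to the classifier with something like distill_model.fit(X2, y2, sample_weight=w2). The weighted log loss over the two copies of a row equals the cross-entropy against that row's soft label, which is why this works with any classifier that accepts sample weights.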

Related

Accuracy for each probability cutoff in a binary classification problem (python sklearn accuracy)

Imagine a binary classification problem. Let's say I have 800,000 predicted probabilities stored in pred_test. I define a cutoff as any value in pred_test such that the values that are greater than or equal to cutoff are assigned the value 1 and the values that are smaller than cutoff are assigned the value 0.
Is there a function in sklearn that returns the accuracy of the model for each cutoff in pred_test? I would like to see the accuracy of the model as a function of the cutoff so that I can pick a cutoff systematically.
I tried the following:
_list = []
for cutoff in np.unique(pred_test):
    binary_prediction = np.where(pred_test >= cutoff, 1, 0)
    _list.append((cutoff, (binary_prediction == y_test).sum() / len(pred_test)))
Here, y_test is the ground truth (an array with the observed outcomes for each of the 800,000 rows). This code returns a list where each value contains the cutoff and its corresponding accuracy score.
The object pred_test has around 600,000 different values, so I am iterating 600,000 or so times. The above code is working, but it's taking a very long time to finish. Is there a more efficient way to do this? My bet is that sklearn already has a function that does this.

Here is a similar thread worth checking: Getting the maximum accuracy for a binary probabilistic classifier in scikit-learn
There is no built-in function for that in scikit-learn. I think the reason it is not implemented is the risk of overfitting: you would be tuning the cutoff to the data at hand, and a cutoff chosen that way may not carry over to unseen data.
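For the speed problem in the question, one possible approach is to sort the predictions once and use cumulative sums, so that all cutoffs are evaluated in a single pass instead of one pass per cutoff. The sketch below uses NumPy only; the function name accuracy_per_cutoff and the random toy data are assumptions made for illustration.

```python
import numpy as np

def accuracy_per_cutoff(y_true, scores):
    """Accuracy of the rule `score >= cutoff -> 1` for every distinct score.

    Returns (cutoffs, accuracies), with cutoffs in descending order.
    """
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # sort by descending score
    s, y = scores[order], y_true[order]
    tp_cum = np.cumsum(y)                        # positives among the top-k rows
    # Index of the last row of each run of equal scores.
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    cutoffs = s[last]
    tp = tp_cum[last]                            # true positives at this cutoff
    fp = (last + 1) - tp                         # predicted 1 but actually 0
    tn = (len(y) - y.sum()) - fp                 # negatives left below the cutoff
    return cutoffs, (tp + tn) / len(y)

# Toy data standing in for pred_test and y_test.
rng = np.random.default_rng(0)
pred_test = rng.integers(0, 10, size=50) / 10
y_test = rng.integers(0, 2, size=50)

cutoffs, accuracies = accuracy_per_cutoff(y_test, pred_test)
best = accuracies.argmax()
print(cutoffs[best], accuracies[best])
```

The cost is dominated by the sort, so it grows as n log n rather than as the number of distinct cutoffs times n.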

Can I trust SHAP values as an explanation?

I am having a problem using SHAP values to interpret a tree-based model.
(https://github.com/slundberg/shapsd)
I have around 30 input features, two of which are highly positively correlated.
I trained an XGBoost model (Python) and looked at the SHAP values of these two features: the SHAP values are negatively correlated.
Could you explain why the SHAP values of the two features do not have the same correlation as the inputs? And can I trust the SHAP output or not?
=========================
The correlation between input: 0.91788
The correlation between SHAP values: -0.661088
The 2 features are
1) Population in province and
2) Number of families in province.
Model Performance
Train AUC: 0.73
Test AUC: 0.71
Scatter plot
Input scatter plot (x: Number of families in province, y: Population in province)
SHAP values output scatter plot (x: Number of families in province, y: Population in province)

You can have correlated variables that have opposite effects on the model output.
As an example, let's take the case of predicting risk of mortality given two features: 'age' and 'trips to doctors'. Although these two variables are positively correlated, their effects are different. All other things held constant, a higher 'age' leads to a higher risk of mortality (according to the trained model), and a higher number of 'trips to doctors' leads to a smaller risk of mortality.
XGBoost (and SHAP) isolates the effects of these two correlated variables by conditioning on the other variable, e.g. splitting on the 'trips to doctors' feature after splitting on the 'age' feature. The assumption here is that they are not perfectly correlated.
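The effect can be reproduced in a simplified setting. The sketch below does not use XGBoost or the shap package; it assumes a hypothetical linear risk model with made-up coefficients, for which the SHAP value of a feature (treating features as independent) is its coefficient times the feature's deviation from its mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 10, size=n)
# "trips" is constructed to be strongly positively correlated with "age".
trips = 0.2 * age + rng.normal(0, 1, size=n)

# Hypothetical linear risk model: age raises risk, trips to doctors lower it.
w_age, w_trips = 0.05, -0.10

# SHAP values of a linear model with independently treated features:
# phi_i = w_i * (x_i - mean(x_i)).
shap_age = w_age * (age - age.mean())
shap_trips = w_trips * (trips - trips.mean())

corr_inputs = np.corrcoef(age, trips)[0, 1]
corr_shap = np.corrcoef(shap_age, shap_trips)[0, 1]
print(f"input correlation: {corr_inputs:.3f}")
print(f"SHAP correlation:  {corr_shap:.3f}")
```

Because the two coefficients have opposite signs, the SHAP correlation is the input correlation with its sign flipped. A tree model is not linear, so the numbers will not mirror each other this exactly, but the mechanism is the same.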

XGBoost is not a linear model, i.e. the relationship between the input features X and the predictions y is not linear. SHAP values build a linear (additive) explanation model of y. Therefore, it is entirely expected that the correlation between the input features does not match the correlation between their SHAP values.

When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only 'transform'?

When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only 'transform'?
from random import seed, randint
import numpy as np
from sklearn.preprocessing import StandardScaler
SAMPLE_COUNT = 5000
TEST_COUNT = 20000
seed(0)
sample = list()
test_sample = list()
# Randomly sample training rows and test rows while streaming the file
for index, line in enumerate(open('covtype.data')):
    if index < SAMPLE_COUNT:
        sample.append(line)
    else:
        r = randint(0, index)
        if r < SAMPLE_COUNT:
            sample[r] = line
        else:
            k = randint(0, index)
            if k < TEST_COUNT:
                if len(test_sample) < TEST_COUNT:
                    test_sample.append(line)
                else:
                    test_sample[k] = line
# Parse each sampled line into a list of floats; the last column is the label
for n, line in enumerate(sample):
    sample[n] = list(map(float, line.strip().split(',')))
y = np.array(sample)[:, -1]
scaling = StandardScaler()
X = scaling.fit_transform(np.array(sample)[:, :-1])  ## here use fit and transform
for n, line in enumerate(test_sample):
    test_sample[n] = list(map(float, line.strip().split(',')))
yt = np.array(test_sample)[:, -1]
Xt = scaling.transform(np.array(test_sample)[:, :-1])  ## why here only use transform
As the comments say, why does Xt only use transform and not fit?

We use fit_transform() on the train data so that we learn the scaling parameters on the train data and at the same time scale the train data.
We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.
This is the standard procedure for scaling. You always learn your scaling parameters on the train data and then use them on the test data. Here is an article that explains it very well: https://sebastianraschka.com/faq/docs/scale-training-test.html
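To make the split between learning and applying concrete, here is a hand-rolled stand-in for a standard scaler, written with NumPy only. The class MiniStandardScaler and the toy arrays are hypothetical; the real StandardScaler does more (for example, handling constant columns), so read this as a sketch of the idea rather than its implementation.

```python
import numpy as np

class MiniStandardScaler:
    """Illustrative stand-in for a standard scaler (not the sklearn class)."""

    def fit(self, X):
        # Learn the parameters: per-column mean and standard deviation.
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Apply the stored parameters; nothing is learned here.
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = MiniStandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std, then scales
X_test_scaled = scaler.transform(X_test)        # reuses the train mean/std

print(scaler.mean_)      # parameters come from the train data only
print(X_test_scaled)
```

The test row is expressed relative to the training mean and standard deviation, which is the scale the downstream model was trained on.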

We have two datasets: the training and the test dataset. Imagine we have just 2 features:
'x1' and 'x2'.
Now consider this (a very hypothetical example):
A sample in the training data has the values 'x1' = 100 and 'x2' = 200.
When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value for this sample is 100. The scaled values have been calculated w.r.t. the training data's mean and std only.
A sample in the test data has the values 'x1' = 50 and 'x2' = 100. When scaled according to the test data's values, 'x1' = 0.1 and 'x2' = 0.1. This means that our function will predict a response variable value of 100 for this sample too. But this is wrong. It shouldn't be 100. It should predict something else, because the unscaled feature values of the two samples above are different and thus point to different response values. We will only know what the correct prediction is when we scale the test sample according to the training data, because those are the values that our linear regression function has learned.
I have tried to explain the intuition behind this logic below:
We decide to scale both features in the training dataset before fitting the linear regression function. When we scale the features of the training dataset, all 'x1' values get adjusted according to the mean and standard deviation of 'x1' across the training samples. The same thing happens for the 'x2' feature.
This essentially means that every feature has been transformed into a new number based on just the training data. It's as if every feature value has been given a relative position, relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values depend on the mean and std of the training data only.
Now, when we fit the linear regression function, it learns the parameters (i.e. learns to predict the response values) based on the scaled features of our training dataset. That means it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' in the training dataset. So the value of the predictions depends on the:
* learned parameters, which in turn depend on the
* values of the features of the training data (which have been scaled), which because of the scaling depend on the
* training data's mean and std.
If we now fit the StandardScaler() to the test data, the test data's 'x1' and 'x2' will have their own mean and std. This means that the new values of both features will be relative only to the test data and thus will have no connection whatsoever to the training data. It's almost as if a random value had been subtracted from them and they had been divided by another random value, giving new values that do not convey how they relate to the training data.
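The hypothetical example above can be checked numerically. In this sketch the toy test set is simply the training set divided by two (an assumption chosen to make the effect obvious), and the standardize helper is defined here for illustration.

```python
import numpy as np

def standardize(X, mean, std):
    return (X - mean) / std

X_train = np.array([[100.0, 200.0], [300.0, 400.0], [500.0, 600.0]])
X_test = X_train / 2.0  # different raw values from the training rows

train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)
test_mean, test_std = X_test.mean(axis=0), X_test.std(axis=0)

train_scaled = standardize(X_train, train_mean, train_std)
test_scaled_refit = standardize(X_test, test_mean, test_std)    # scaler refit on test
test_scaled_reuse = standardize(X_test, train_mean, train_std)  # train statistics reused

# Refitting on the test set makes different raw rows look identical.
print(np.allclose(train_scaled, test_scaled_refit))  # True
# Reusing the train statistics keeps them distinguishable.
print(np.allclose(train_scaled, test_scaled_reuse))  # False
```

With the refit scaler, a model would produce the same predictions for the test rows as for the training rows, even though every raw value is half as large.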

Any transformation you apply to the data must use the parameters generated from the training data.
Simply put, the fit() method creates a model that extracts the various parameters from your training samples so that the necessary transformation can be done later on. transform(), on the other hand, performs the actual transformation on the data itself, returning a standardized or scaled form.
fit_transform() is just a faster way of doing the operations of fit() and transform() consecutively.
The important thing here is that when you divide your dataset into train and test sets, what you are trying to achieve is to simulate a real-world application. In a real-world scenario you will only have training data; you will develop a model from it and predict unseen instances of similar data.
If you transform the entire data with fit_transform() and then split into train and test, you violate that simulation and do the transformation according to the unseen examples as well. This will inevitably result in an optimistic model, as you have already partly prepared your model using the unseen samples' statistics.
If you split the data into train and test and apply fit_transform() to both, you will also be mistaken, as the transformation of the train data will be done with the train split's statistics only and the transformation of the test data with the test split's statistics only.
The right way to do this preprocessing is to fit any transformer on the train data only and then apply the transformation to the test data. Only then can you be sure that your resulting model represents a real-world solution.
Following this, it actually doesn't matter whether you
fit(train) then transform(train) then transform(test) OR
fit_transform(train) then transform(test)

fit() computes the parameters needed for the transformation, and transform() scales the data into the standard format the model expects.
fit_transform() is a combination of the two that does the above work efficiently.
Since fit_transform() has already computed the parameters and transformed the training data, only the transformation of the testing data is left. The parameters needed for the transformation are already computed and stored, so only transform() is used on the testing data instead of fit_transform().

There could be two approaches:
1st approach: fit and transform on the train data, transform only on the test data
2nd approach: fit and transform on the whole set: train + test
Think about how the model will handle scaling when it goes live: when new data arrives, it will behave just like the unseen test data in your backtest.
In the 1st case, new data will just be transformed with the existing scaler, and the scaled values from your backtest remain unchanged.
But in the 2nd case, when new data comes in you will need to fit and transform the whole dataset again, which means the scaled values from the backtest will no longer be the same and you need to re-train the model. If this can be done quickly then I guess it is OK,
but the 1st case requires less work...
And if there are big differences between the scaling in train and test, then the data is probably non-stationary and ML is probably not a good idea.

The same fit() and transform() methods are also used to handle missing values in a dataset (imputation). The missing values can be filled by computing the mean or the median of the data and filling the empty places with that mean or median.
fit() is used to calculate the mean or the median.
transform() is used to fill in missing values with the calculated mean or median.
fit_transform() performs the above 2 tasks in a single step.
fit_transform() is used on the training data to perform the above. When it comes to the validation set, only transform() is required, since you don't want to change the way you handle missing values there. Doing so may take your model by surprise, and it may then fail to perform as expected.
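The imputation case follows the same pattern. Below is a hand-rolled mean imputer in NumPy, written only to show which step learns and which step applies; the class name MiniMeanImputer and the toy arrays are hypothetical and this is not the sklearn imputer itself.

```python
import numpy as np

class MiniMeanImputer:
    """Illustrative mean imputer (hypothetical, not the sklearn class)."""

    def fit(self, X):
        # Learn the per-column mean, ignoring missing entries.
        self.statistics_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        # Fill missing entries with the means learned in fit().
        X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.statistics_[cols]
        return X

X_train = np.array([[1.0, 10.0], [np.nan, 30.0], [3.0, np.nan]])
X_valid = np.array([[np.nan, 5.0], [100.0, np.nan]])

imputer = MiniMeanImputer().fit(X_train)
X_train_filled = imputer.transform(X_train)
X_valid_filled = imputer.transform(X_valid)  # filled with the train means

print(imputer.statistics_)
print(X_valid_filled)
```

The validation gaps are filled with the training means (2.0 and 20.0 here), not with means computed from the validation rows.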

We use fit() or fit_transform() to learn (to train the model) on the train data set. transform() can then apply the trained model to the test data set.

fit_transform() - learns the scaling parameters and applies them (train data)
transform() - applies the learned scaling parameters (test data)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # learns the scaling parameters from the train data and scales it
X_test = ss.transform(X_test)  # uses the learned parameters to transform the test data

Leave-one-out cross-validation

I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).
Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?
Unsuccessful code:
from sklearn import ensemble, cross_validation, datasets
dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)
loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))
Resulting exception:
ValueError: r2_score can only be computed given more than one sample.
[EDIT, to clarify]:
I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how good a single sample fits the trained model?

cross_validation.LeaveOneOut(x.shape[0]) is creating as many folds as the number of rows. This results in each validation run getting only one instance.
Now, to draw a "line" you need two points, whereas with your one instance, you only have one point. That's what your error message says, that it needs more than one instance (or sample) to draw the "line" that will be used to calculate the r^2 value.
Generally, in the ML world, people report 10-fold or 5-fold cross-validation results. So I would recommend setting the number of folds to 10 or 5 accordingly.
Edit: After a quick discussion with #banana, we realized that the question was not understood correctly initially. Since it is not possible to get the R2 score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using
numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
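A minimal sketch of the per-sample distance idea as a full leave-one-out loop. To stay self-contained it uses NumPy only: the data is synthetic, and an ordinary least-squares fit stands in for the RandomForestRegressor from the question. The function names and the fit_predict callable are made up for illustration.

```python
import numpy as np

def least_squares_fit_predict(X_train, y_train, X_test):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def leave_one_out_errors(X, y, fit_predict):
    """For every sample, the distance between its held-out prediction
    and its true target."""
    n = len(X)
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # boolean mask: every row except i
        pred = fit_predict(X[train], y[train], X[i:i + 1])
        errors[i] = np.linalg.norm(pred[0] - y[i])
    return errors

# Synthetic multi-output data with one deliberately corrupted sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.array([[1.0, 0.5], [0.0, -1.0], [2.0, 0.0]])
y = X @ W + rng.normal(scale=0.1, size=(20, 2))
y[7] += 5.0

errors = leave_one_out_errors(X, y, least_squares_fit_predict)
print(errors.argmax())  # 7
```

Any model can be plugged in by passing a different fit_predict callable, for example a small wrapper that fits a random forest on the training rows and returns its predictions for the held-out row. How to turn these distances into a significance threshold is a separate question that this sketch does not address.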

using RandomForestClassifier.predict_proba vs RandomForestRegressor.predict

I have a data set comprising a vector of features, and a target - either 1.0 or 0.0 (representing two classes). If I fit a RandomForestRegressor and call its predict function, is it equivalent to using RandomForestClassifier.predict_proba()?
In other words if the target is 1.0 or 0.0 does RandomForestRegressor output probabilities?
I think so, and the results I am getting suggest so, but I would like to get a second opinion...
Thanks
Weasel

There is a major conceptual difference between those, based on the different tasks being addressed:
Regression: continuous (real-valued) target variable.
Classification: discrete target variable (classes).
For a general classification method, the notion of the probability of an observation belonging to class X may not be defined, as some classification methods, kNN for example, do not deal with probabilities.
However, for Random Forest (and some other classification methods), classification is reduced to regression of the class probability distribution. The predicted class is then taken as the argmax of the computed "probabilities". In your case, you feed the same input, you get the same result. And yes, it is OK to treat the values returned by RandomForestRegressor as probabilities.
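A small numeric illustration of why the two views coincide for 0/1 targets, using NumPy only. The leaf contents are made up; the point is that a leaf's mean target equals its class-1 fraction, and that the variance impurity is half the Gini impurity, so both criteria rank candidate splits the same way.

```python
import numpy as np

leaf_targets = np.array([1.0, 0.0, 1.0, 1.0])  # hypothetical 0/1 targets in one leaf

# Regression tree leaf value: the mean of the targets.
regression_output = leaf_targets.mean()
# Classification tree leaf value: the fraction of samples in class 1.
class1_fraction = np.count_nonzero(leaf_targets == 1.0) / len(leaf_targets)
print(regression_output, class1_fraction)  # 0.75 0.75

p = class1_fraction
variance = leaf_targets.var()          # impurity behind squared-error splits
gini = 1.0 - p**2 - (1.0 - p)**2       # impurity behind Gini splits
print(np.isclose(variance, gini / 2))  # True
```

Two actual forests can still differ in practice, for example through different default hyperparameters or random seeds, so this is intuition for the answer above rather than a guarantee of identical outputs.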
