Get prediction and distance with Scikit KNeighborsClassifier


According to the doc, Scikit's KNeighborsClassifier offers these two methods to get predictions:
predict(X) : Returns class labels.
kneighbors(X) : Returns distances and indices of the nearest points in the training data.
I need a mix of both: the class label and the distance for that prediction. I'd like to avoid having to look up the training data when using the kneighbors method (which returns only indices). Is there any way to do that?

After you get the indices from kneighbors(X), you can look up the class label for each of those indices directly. Note that _y is a private attribute, so it may change between scikit-learn versions:
class_label = clf.classes_[clf._y[index]]
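A minimal sketch of combining the two calls, assuming a small made-up 1-d dataset (X_train, y_train and X_query are illustrative names, not from the question). It looks the neighbor labels up in the original y_train array rather than through the private _y attribute, and the "mean distance to agreeing neighbors" is only one possible way to attach a distance to a prediction:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (hypothetical): two well separated groups on a line
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

X_query = np.array([[1.5], [10.5]])
predictions = clf.predict(X_query)

# distances and indices both have shape (n_queries, n_neighbors)
distances, indices = clf.kneighbors(X_query)

# Labels of the nearest neighbors, looked up in the training targets
neighbor_labels = y_train[indices]

# Mean distance to the neighbors that voted for the predicted class
agree = neighbor_labels == predictions[:, None]
mean_dist = (distances * agree).sum(axis=1) / agree.sum(axis=1)

for label, dist in zip(predictions, mean_dist):
    print(label, dist)
```

If keeping y_train around is not an option, the clf.classes_[clf._y[indices]] lookup from the answer should give the same labels for a single-output classifier, at the cost of relying on a private attribute.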

Related

Training xgboost with soft labels

I'm trying to distill the predictions of another classifier model, "C" using xgboost. Thus, instead of labels, I have the probabilities predicted by C for the samples being positive.
I've tried doing the most obvious thing, using the probabilities output by C as if they were labels
distill_model = XGBClassifier(learning_rate=0.1, max_depth=10, n_estimators=100)
distill_model.fit(X, probabilities)
but it seems that in that case XGBoost just translates each distinct probability value to its own class. So if C outputs 72 distinct values, XGBoost treats them as 72 different classes. I've tried changing the objective function to multi:softmax/multi:softprob, but that didn't help.
Any suggestions?

There is probably an xgboost-specific method using a custom loss. A generic solution, though, is to split each training row into two rows, one with each label, and assign each row the original probability for that label as its weight.
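The row-splitting idea can be sketched with plain NumPy. The helper name expand_soft_labels is hypothetical, and the sketch assumes binary soft labels where p is the probability of the positive class:

```python
import numpy as np

def expand_soft_labels(X, p):
    """Duplicate every row: one copy labelled 1 with weight p,
    one copy labelled 0 with weight 1 - p."""
    X = np.asarray(X)
    p = np.asarray(p, dtype=float)
    n = len(X)
    X2 = np.concatenate([X, X], axis=0)
    y2 = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    w2 = np.concatenate([p, 1.0 - p])
    return X2, y2, w2

# Tiny illustration with made-up numbers
X = np.array([[1.0], [2.0]])
probabilities = np.array([0.72, 0.10])
X2, y2, w2 = expand_soft_labels(X, probabilities)
print(y2)  # labels of the expanded rows
print(w2)  # matching weights

# The expanded data can then be passed to any classifier that accepts
# per-sample weights, for example (not executed here):
# distill_model.fit(X2, y2, sample_weight=w2)
```

Rows whose weight is exactly 0 contribute nothing and could be dropped before fitting.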

How to get the top predictions of isolationforest in sklearn

I am using IsolationForest as follows to detect outlier data points of my dataset.
clf = IsolationForest(max_samples='auto',
                      random_state=42,
                      behaviour="new",
                      contamination=.01)
clf.fit(X)
y_pred_train = clf.predict(X)
outliers = []
for item in np.where(y_pred_train == -1)[0]:
    outliers.append(df_nodes[item])
I want the predicted outliers as a ranked list. That is, I want to know which point is the most likely outlier, which is the next, and so on (perhaps sorted by some probability of the prediction). I tried to find a way to do this in sklearn but could not. Please let me know a suitable way of doing this.
I am happy to provide more details if needed.

Instead of using predict, use decision_function.
From the docs:
Methods
decision_function(self, X) Average anomaly score of X of the base classifiers.
Then, you can rank them based on their anomaly score. The lower this value, the more abnormal the observation is.
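A short sketch of that ranking, assuming synthetic data with one obviously distant point appended at index 200 (the data and variable names are made up for illustration). Sorting the scores in ascending order puts the most abnormal observations first:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (hypothetical): a Gaussian blob plus one far-away point
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])

clf = IsolationForest(random_state=42, contamination=0.01).fit(X)

# Lower score = more abnormal; negative scores are the predicted outliers
scores = clf.decision_function(X)

# Row indices ordered from most to least abnormal
ranking = np.argsort(scores)

# Keep only the rows that predict() would flag as outliers, still ranked
outlier_ranking = ranking[scores[ranking] < 0]

for idx in outlier_ranking:
    print(idx, scores[idx])
```

The same ranking array can be used to index the original records (df_nodes in the question) instead of printing the scores.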

How to generate a custom cross-validation generator in scikit-learn?

I have an unbalanced dataset, so I have a strategy for oversampling that I only apply during training. I'd like to use scikit-learn classes like GridSearchCV or cross_val_score to explore or cross-validate some parameters of my estimator (e.g. SVC). However, I see that you either pass the number of cv folds or a standard cross-validation generator.
I'd like to create a custom cv generator so that I get a stratified 5-fold split, oversample only my training data (4 folds), and let scikit-learn search through the grid of parameters of my estimator and score using the remaining fold for validation.

A cross-validation generator is an iterable of length n_folds, each element of which is a 2-tuple of 1-d numpy arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    # Assumes myDf has the default 0..n-1 integer index, since cv expects positional indices
    trainIndices = myDf[ myDf['cvLabel']!=i ].index.values.astype(int)
    testIndices = myDf[ myDf['cvLabel']==i ].index.values.astype(int)
    myCViterator.append( (trainIndices, testIndices) )
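Once built, such a list of (train, test) index pairs can be passed directly as the cv argument. The sketch below assumes synthetic data and uses a plain NumPy array of fold labels (fold_labels, a stand-in for the cvLabel column) so that it runs without pandas:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data and made-up fold labels 0..4 (stand-in for myDf['cvLabel'])
X, y = make_classification(n_samples=100, random_state=0)
nFolds = 5
fold_labels = np.arange(len(X)) % nFolds

# One (train_indices, test_indices) tuple per fold
myCViterator = []
for i in range(nFolds):
    trainIndices = np.where(fold_labels != i)[0]
    testIndices = np.where(fold_labels == i)[0]
    myCViterator.append((trainIndices, testIndices))

# Any iterable of index pairs is accepted as cv
scores = cross_val_score(SVC(), X, y, cv=myCViterator)
print(scores)
```

Because the pairs are ordinary index arrays, the training indices of each pair could be oversampled before appending while the test indices are left untouched.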

I had a similar problem and this quick hack is working for me:
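import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        # Assumes y is a numpy array with labels 0 (majority) and 1 (minority)
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            nix = np.where(y[rx] == 0)[0]  # positions of class 0 within the training fold
            pix = np.where(y[rx] == 1)[0]  # positions of class 1 within the training fold
            # Resample class 1 with replacement to match the class-0 count
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits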
This upsamples (with replacement) the minority class to give a balanced training set from k-1 folds, but leaves the k-th (test) fold unbalanced. It appears to play well with sklearn.model_selection.GridSearchCV and other similar classes that require a CV generator.
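A usage sketch for the splitter above, assuming a synthetic imbalanced binary problem where class 1 is the minority. The class is repeated so the example runs on its own, and the parameter grid is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            nix = np.where(y[rx] == 0)[0]
            pix = np.where(y[rx] == 1)[0]
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            yield rx[np.append(nix, pixu)], tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits

# Synthetic data (hypothetical): roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

cv = UpsampleStratifiedKFold(n_splits=3)

# Class counts in each upsampled training fold; the two entries should match
fold_counts = [np.bincount(y[train_idx]) for train_idx, _ in cv.split(X, y)]
print(fold_counts)

# The splitter object is passed straight to GridSearchCV
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
print(search.best_params_)
```

Since np.random.choice is unseeded here, the upsampled rows differ between runs; seeding NumPy beforehand would make the folds reproducible.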

Scikit-Learn provides a workaround for this, with their Label k-fold iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
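In later scikit-learn releases LabelKFold was renamed GroupKFold and the cross_validation module was replaced by model_selection. A hedged sketch of the same idea with that newer API, assuming synthetic features and a simple duplication of the first ten rows as a stand-in for the real oversampling step:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic data (hypothetical)
rng = np.random.RandomState(0)
features = rng.normal(size=(60, 3))
labels = (features[:, 0] > 0).astype(int)
cv_labels = np.arange(60)  # one group id per original row

# Stand-in for oversampling: duplicate the first 10 rows, copying their group ids
dup = np.arange(10)
features = np.vstack([features, features[dup]])
labels = np.concatenate([labels, labels[dup]])
cv_labels = np.concatenate([cv_labels, cv_labels[dup]])

gkf = GroupKFold(n_splits=5)

# Check that no original row (or its copy) appears on both sides of a split
leaks = [
    bool(set(cv_labels[tr]) & set(cv_labels[te]))
    for tr, te in gkf.split(features, labels, groups=cv_labels)
]
print(leaks)

clf = SVC(C=1)
predicted = cross_val_predict(clf, features, labels, cv=gkf, groups=cv_labels)
print(predicted.shape)
```

The groups only keep copies of a row together; the test folds still contain oversampled rows, so scores computed this way are not on the original class balance.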

class own_custom_CrossValidator:  # like those in source sklearn/model_selection/_split.py
    def __init__(self):  # coordinates, meter
        pass  # self.coordinates = coordinates # self.meter = meter

    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict, cross_val_score
        for i in range(0,len(X)): yield tuple((np.array(list(range(0,len(X))))

Leave-one-out cross-validation

I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).
Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?
Unsuccessful code:
from sklearn import ensemble, cross_validation, datasets
dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)
loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))
Resulting exception:
ValueError: r2_score can only be computed given more than one sample.
[EDIT, to clarify]:
I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how good a single sample fits the trained model?

cross_validation.LeaveOneOut(x.shape[0]) is creating as many folds as the number of rows. This results in each validation run getting only one instance.
The R^2 score compares the squared prediction error with the variance of the true target values. With a single test sample that variance is zero, so the score is undefined; this is what the error message means when it asks for more than one sample.
Generally, in the ML world, people report 10-fold or 5-fold cross-validation results, so I would recommend using 10 or 5 folds instead.
Edit: After a quick discussion with #banana, we realized that the question was not understood correctly initially. Since it is not possible to get the R2 score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using
numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
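Put together with the current sklearn.model_selection API, a per-sample leave-one-out error could be collected roughly as follows. This is a sketch, not the original poster's final code; the smaller n_estimators and the fixed random_state are choices made only to keep the example quick and repeatable:

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

dataset = load_linnerud()
x, y = dataset.data, dataset.target

clf = RandomForestRegressor(n_estimators=50, random_state=0)

# One error value per sample: Euclidean distance between the predicted
# and the actual target vector of the held-out sample
errors = np.empty(len(x))
for train_i, test_i in LeaveOneOut().split(x):
    clf.fit(x[train_i], y[train_i])
    prediction = clf.predict(x[test_i])[0]
    errors[test_i[0]] = np.linalg.norm(prediction - y[test_i][0])

for i, err in enumerate(errors):
    print('Sample %d error: %f' % (i, err))
```

Because the targets in this dataset are on different scales, standardizing them before taking the norm may give a fairer per-sample comparison.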

using RandomForestClassifier.predict_proba vs RandomForestRegressor.predict

I have a data set comprising a vector of features, and a target - either 1.0 or 0.0 (representing two classes). If I fit a RandomForestRegressor and call its predict function, is it equivalent to using RandomForestClassifier.predict_proba()?
In other words if the target is 1.0 or 0.0 does RandomForestRegressor output probabilities?
I think so, and the results I am getting suggest so, but I would like to get a second opinion...
Thanks
Weasel

There is a major conceptual difference between the two, based on the different tasks being addressed:
Regression: continuous (real-valued) target variable.
Classification: discrete target variable (classes).
For a general classification method, the probability of an observation belonging to class X may not be defined, as some classification methods, knn for example, do not deal with probabilities.
However, for Random Forest (and some other classification methods), classification is reduced to regression of the class probability distribution. The predicted class is then taken as the argmax of the computed "probabilities". In your case, you feed in the same input and get the same result. And yes, it is OK to treat the values returned by RandomForestRegressor as probabilities.
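One way to check this on your own data is to fit both models and compare their outputs side by side. The sketch below assumes a synthetic binary problem; the two forests are not guaranteed to agree exactly, since their defaults (for example the number of features tried per split) can differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic binary problem (hypothetical); the regressor sees targets 0.0 / 1.0
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train.astype(float))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

reg_scores = reg.predict(X_test)              # averages of 0/1 leaf values
clf_scores = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# Largest disagreement between the two score vectors on the test set
max_gap = np.abs(reg_scores - clf_scores).max()
print(max_gap)
```

The regressor's outputs stay within [0, 1] because each tree predicts an average of 0/1 targets, which is what makes reading them as probabilities reasonable.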
