How to generate a custom cross-validation generator in scikit-learn? - python

I have an unbalanced dataset, so I have a strategy for oversampling that I apply only during training. I'd like to use scikit-learn classes such as GridSearchCV or cross_val_score to explore or cross-validate some parameters of my estimator (e.g. SVC). However, I see that you can either pass the number of CV folds or a standard cross-validation generator.
I'd like to create a custom CV generator so that I get a stratified 5-fold split, oversample only my training data (4 folds), and let scikit-learn search the parameter grid of my estimator, scoring on the remaining fold for validation.

The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of 1-d numpy arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which is a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[ myDf['cvLabel']!=i ].index.values.astype(int)
    testIndices = myDf[ myDf['cvLabel']==i ].index.values.astype(int)
    myCViterator.append( (trainIndices, testIndices) )
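A list built this way can be passed directly as the cv argument. Here is a minimal sketch of that, assuming a small synthetic dataframe; the column names x1, x2 and target are made up for illustration, and np.flatnonzero is used so that the indices are positional even if the dataframe index is not 0..n-1:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
nFolds = 5

# Hypothetical data: 100 rows, two features, a binary target and a fold label.
myDf = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "target": rng.randint(0, 2, size=100),
    "cvLabel": np.arange(100) % nFolds,
})

# Build one (train, test) pair of positional index arrays per fold label.
fold_labels = myDf["cvLabel"].to_numpy()
myCViterator = []
for i in range(nFolds):
    trainIndices = np.flatnonzero(fold_labels != i)
    testIndices = np.flatnonzero(fold_labels == i)
    myCViterator.append((trainIndices, testIndices))

# The list of tuples is accepted wherever a cv argument is expected.
scores = cross_val_score(SVC(), myDf[["x1", "x2"]], myDf["target"], cv=myCViterator)
print(scores)
```

Because myCViterator is a plain list, it can be reused for several calls; a generator would be exhausted after the first one.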

I had a similar problem and this quick hack is working for me:
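import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        # Written for y as a NumPy array of 0/1 labels, with 1 as the minority class.
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            nix = np.where(y[rx] == 0)[0]  # positions of class 0 within rx
            pix = np.where(y[rx] == 1)[0]  # positions of class 1 within rx
            # Resample class 1 (with replacement) up to the size of class 0.
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx  # the test fold tx is left untouched

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits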
This upsamples (with replacement) the minority class to give a balanced (k-1)-fold training set, but leaves the kth (test) fold unbalanced. This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.

Scikit-Learn provides a workaround for this, with their Label k-fold iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
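In newer scikit-learn releases LabelKFold was renamed to GroupKFold and lives in sklearn.model_selection, where the groups are passed to split() rather than to the constructor. A minimal sketch of the same idea under that API, assuming a toy dataframe and a naive "repeat the minority rows" oversampling (the column names f1, f2 and label are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.svm import SVC

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "f1": rng.normal(size=60),
    "f2": rng.normal(size=60),
    "label": np.r_[np.zeros(50, dtype=int), np.ones(10, dtype=int)],
})
df["cv_label"] = df.index  # remember which original row each sample came from

# Naive oversampling: append four extra copies of every minority row.
minority = df[df["label"] == 1]
df_over = pd.concat([df] + [minority] * 4, ignore_index=True)

cv_labels = df_over["cv_label"].to_numpy()
features = df_over[["f1", "f2"]]  # cv_label is deliberately left out
labels = df_over["label"]

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(features, labels, groups=cv_labels):
    # Copies of the same original row never end up on both sides of a split.
    assert not set(cv_labels[train_idx]) & set(cv_labels[test_idx])

clf = SVC(C=1)
predicted = cross_val_predict(clf, features, labels, groups=cv_labels, cv=gkf)
```

Note that with this approach the test folds are drawn from the oversampled frame too, so they also contain duplicated rows; it only guarantees that duplicates of one row stay on the same side of each split.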

class own_custom_CrossValidator:  # like those in source sklearn/model_selection/
    def __init__(self):  # coordinates, meter
        pass  # self.coordinates = coordinates # self.meter = meter

    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict, cross_val_score
        for i in range(0,len(X)): yield tuple((np.array(list(range(0,len(X))))


Why do I get different results when I do a manual split of test and train data as opposed to using the Python splitting function

If I run a simple decision tree regression model on data split with the train_test_split function, I get good r2 scores and low MSE values.
training_data = pandas.read_csv('data.csv',usecols=['y','x1','x2','x3'])
y = training_data.iloc[:,0]
x = training_data.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state = 0)
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Yet if I split the data file manually into two files, 2/3 train and 1/3 test, I get negative r2 scores and a high MSE. There is a column called human, with a value from 1 to 9 indicating which human a row belongs to; I use humans 1-6 for training and 7-9 for testing.
training_data = pandas.read_csv("train"+".csv",usecols=['y','x1','x2','x3'])
testing_data = pandas.read_csv("test"+".csv", usecols=['y','x1','x2','x3'])
y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))]
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I was expecting more or less the same results, and all the data types seem the same for both calls.
What am I missing?
I'm assuming that both methods actually do what you intend and that the shapes of your X_train/X_test and y_train/y_test are the same in both cases. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).
Plot the distributions (i.e. bar charts or density plots) of the labels (y) in the initial train/test sets against those in the second ones (from the manual implementation). You can dig deeper and also plot the other columns, to see whether anything about the distributions differs between the sets produced by the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, it could be that your labels (or other columns) are actually sorted in your manual implementation, so you get very different distributions in the datasets you're comparing.
Also, if you want to check that your manual split yields a 'representative' set (one that would generalise well) based on model results rather than on the underlying data distributions, I would compare it against the results of a cross-validated model, not a single set of results.
Although the probability is small and train_test_split does some shuffling, you could get a train/test pair that performs well purely by luck. (To reduce the chance of that without doing cross-validation, I'd suggest using the stratify argument of train_test_split; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
If you decide to cross-validate (with train_test_split), you get an average model score across the folds and a confidence interval around it, and you can check whether your second model's results fall within that interval. If they don't, it again suggests that your split is 'corrupted' somehow (e.g. by having sorted values).
P.S. I'd also add that decision trees are models known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results thanks to bootstrapping/bagging, which acts similarly to cross-validation in reducing the chance of overfitting.)
1 -
The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit, as per the documentation, which means it randomizes your data when splitting.
When you split manually, you didn't randomize the data, so if your labels are not spread evenly throughout your dataset, you will of course see worse performance: the model won't generalize well because the training data doesn't contain enough samples of the other labels.
If my suspicion is correct, you should get similar results by passing shuffle=False to train_test_split.
Suppose your dataset contains this data:
1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0
Suppose you want a 50% train split. train_test_split shuffles the rows, so the training set might contain this, and the model generalizes better:
2 - 2 = 0
So it knows what to do when it sees this data:
4 - 4  # since it learned both addition and subtraction
But when you split it manually like this:
1 + 1 = 2
2 + 2 = 4  # only learned addition
It doesn't know what to do when it sees this data:
2 - 2
4 - 4  # test data is subtraction
Hope this answers your question.
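The effect of shuffling described above can be checked directly. A minimal sketch, assuming a toy dataset whose target is sorted by row position (the arrays here are hypothetical, not the asker's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical ordered data: the target grows with the row position.
x = np.arange(30).reshape(-1, 1)
y = np.arange(30)

# Default behaviour: rows are shuffled before splitting.
_, _, y_train_shuf, y_test_shuf = train_test_split(
    x, y, test_size=0.33, random_state=0
)

# shuffle=False: the first rows become the training set, the last rows the test set.
_, _, y_train_seq, y_test_seq = train_test_split(
    x, y, test_size=0.33, shuffle=False
)

print("sequential test targets:", y_test_seq)
print("shuffled test targets:  ", np.sort(y_test_shuf))
```

With shuffle=False every test target lies outside the range seen in training, which is the situation a manual "cut the file at the 2/3 mark" split produces when the file is ordered.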
It may sound like a simple check but..
In the first example you are reading data from 'data.csv', in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3's mark and saved as 'train.csv' and the remaining as 'test.csv' then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why the train_test_split randomizes the rows. If you haven't already done it, try to randomize the rows first and then write to your train and test csv file to ensure you have a homogeneous dataset.
The other line that might be out of place is line 6:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
Perhaps the l_vars contains something other than what you expect. Maybe it should read the following to be more consistent.
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))]
good luck, let us know if this helps.

Nested cross-validation example on Scikit-learn

I'm trying to work my head around the example of Nested vs. Non-Nested CV in Sklearn. I checked multiple answers but I am still confused on the example.
To my knowledge, a nested CV aims to use a different subset of data to select the best parameters of a classifier (e.g. C in SVM) and validate its performance. Therefore, from a dataset X, the outer 10-folds CV (for simplicity n=10) creates 10 training sets and 10 test sets:
(Tr0, Te0), ..., (Tr9, Te9)
Then, the inner 10-fold CV splits EACH outer training set into 10 training and 10 test sets:
From Tr0: (Tr0_0, Te0_0), ..., (Tr0_9, Te0_9)
From Tr9: (Tr9_0, Te9_0), ..., (Tr9_9, Te9_9)
Now, using the inner CV, we can find the best values of C for every single outer Training set. This is done by testing all the possible values of C with the inner CV. The value providing the highest performance (e.g. accuracy) is chosen for that specific outer Training set. Finally, having discovered the best C values for every outer Training set, we can calculate an unbiased accuracy using the outer Test sets. With this procedure, the samples used to identify the best parameter (i.e. C) are not used to compute the performance of the classifier, hence we have a totally unbiased validation.
The example provided in the Sklearn page is:
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_
# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()
From what I understand, the code simply calculates the scores using two different cross-validations (i.e. different splits into training and test sets). Both of them use the entire dataset. GridSearchCV identifies the best parameters using one of the two CVs, then cross_val_score calculates, with the second CV, the performance when using the best parameters.
Am I interpreting a Nested CV in the wrong way? What am I missing from the example?
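One way to see what the example does is to write the nested part out by hand. cross_val_score clones the estimator it is given and fits the clone on each outer training set, and here that estimator is the GridSearchCV object itself, so the whole grid search (with inner_cv) is repeated inside every outer training set. A minimal sketch, assuming a small parameter grid and a fixed random_state of 0 instead of the loop variable i:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}  # assumed grid, for illustration
svm = SVC(kernel="rbf")

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Non-nested: the inner splits are drawn from the whole dataset, and the
# reported number is the best mean inner-CV score.
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Nested: the grid search is refitted on each outer training set only and
# then scored on the matching outer test set.
nested_scores = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)

# The same nested procedure written as an explicit loop.
manual_scores = []
for train_idx, test_idx in outer_cv.split(X_iris):
    search = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    search.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(search.score(X_iris[test_idx], y_iris[test_idx]))

print(non_nested_score, nested_scores.mean(), np.mean(manual_scores))
```

So the best parameters can differ from one outer fold to the next, and no outer test sample is used to choose them.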

NDCG as scoring function with GridSearchCV and stratified data?

I'm working on a learning-to-rank task; the dataset has a column thread_id, which is a group label (stratified data).
In the evaluation phase I must take these groups into account, as my scoring function works on a per-thread basis (e.g. nDCG).
Now, if I implement nDCG with the signature scorer(estimator, X, y), I can easily pass it to GridSearchCV as the scoring function, as in the example below:
def my_nDCG(estimator, X, y):
    # group by X['thread_id']
    # compute the result
    return result
splitter = GroupShuffleSplit(...).split(X, groups=X['thread_id'])
cv = GridSearchCV(clf, cv=splitter, scoring=my_nDCG)
GridSearchCV selects the model by calling my_nDCG().
Unfortunately, inside my_nDCG, X doesn't have the thread_id column, as it must be dropped before passing X to fit(); otherwise I'd train the model using thread_id as a feature.
'best_answer', axis=1), y)
How can I do this without the terrible workaround of keeping thread_id apart as global and merging it with X inside my_nDCG()?
Is there any other way to use nDCG with scikit-learn? I see that scikit-learn supports stratified data, but proper support seems to be missing when it comes to model evaluation with stratified data.
Just noticed accepts a groups parameter, in my case it'd still be X['thread_id'].
At this point I only need to read that param within my custom scoring function. How to do it?
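One possible way around this, offered as a sketch rather than an official recipe: keep thread_id in the dataframe index instead of in a column. The row index travels with X when GridSearchCV slices it for each split, so a custom scorer can read the groups from X.index while the estimator never sees thread_id as a feature. The data, the Ridge estimator and the parameter grid below are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import ndcg_score
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

rng = np.random.RandomState(0)
n_threads, per_thread = 20, 5

# Hypothetical data: thread_id lives in the index, not in the feature columns.
thread_id = np.repeat(np.arange(n_threads), per_thread)
X = pd.DataFrame(
    rng.normal(size=(n_threads * per_thread, 3)),
    columns=["f1", "f2", "f3"],
    index=pd.Index(thread_id, name="thread_id"),
)
# Non-negative graded relevance labels, one per row.
y = pd.Series(rng.randint(0, 3, size=len(X)), index=X.index)


def my_nDCG(estimator, X, y):
    """Mean per-thread nDCG; the thread of each row is read from X.index."""
    pred = np.asarray(estimator.predict(X))
    true = np.asarray(y)
    groups = X.index.to_numpy()
    per_thread_scores = []
    for g in np.unique(groups):
        mask = groups == g
        # ndcg_score expects 2-d input: one row per query (here, per thread).
        per_thread_scores.append(ndcg_score([true[mask]], [pred[mask]]))
    return float(np.mean(per_thread_scores))


splitter = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
search = GridSearchCV(
    Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=splitter, scoring=my_nDCG
)
# groups is forwarded to the splitter, so whole threads stay together.
search.fit(X, y, groups=X.index.to_numpy())
print(search.best_params_, search.best_score_)
```

This relies on X being a pandas dataframe all the way through; if a pipeline step converts it to a plain array before scoring, the index is lost and another mechanism is needed.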

Exclude certain indices from a KFold split in Python SKlearn

I am using SKlearn KFold as follows:
kf = KFold(10000, n_folds=5, shuffle=True, random_state=88)
However, I want to exclude certain indices from the training folds (only). How can this be achieved? Thanks.
I wonder if this can be achieved by using sklearn.cross_validation.PredefinedSplit?
Update: The KFold instance will be used with XGBoost for the folds parameter of The Python API here states that folds should be "a KFold or StratifiedKFold instance".
However, I will try generating the KFolds as above, iterating over the train fold indices, modifying them, and then defining a custom_cv by hand like this:
custom_cv = zip(train_indices, test_indices)
If you want to remove indices from the training set, but it is ok if they are in the testing set, then this approach will work:
kf_list = list(kf)
This will return a list of tuples that can be iterated over in the same way as the KFold instance. You can then simply modify the indices as you see fit, and your KFold instance will stay untouched. You can think of a KFold object as an array of integers, representing the indices, and methods that let you generate the folds on the fly.
Here's the source code, which is pretty straightforward, for the meaty part of how the iterator protocol is implemented :
def _iter_test_indices(self):
    n = self.n
    n_folds = self.n_folds
    fold_sizes = (n // n_folds) * np.ones(n_folds, dtype=np.int)
    fold_sizes[:n % n_folds] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield self.idxs[start:stop]
        current = stop
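With the newer sklearn.model_selection API, where KFold takes n_splits and the folds come from split(), the same idea might look like the sketch below. The sample count and the excluded indices are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
excluded = np.array([0, 5, 13])  # hypothetical indices to keep out of training

kf = KFold(n_splits=5, shuffle=True, random_state=88)

custom_cv = []
for train_idx, test_idx in kf.split(np.arange(n_samples)):
    # Drop the excluded indices from the training part only;
    # the test part is kept exactly as KFold produced it.
    train_idx = np.setdiff1d(train_idx, excluded)
    custom_cv.append((train_idx, test_idx))

for train_idx, test_idx in custom_cv:
    print(len(train_idx), len(test_idx))
```

custom_cv is a plain list of (train, test) tuples, so it can be passed as cv to scikit-learn functions that accept an iterable of splits. Whether a given third-party API accepts such a list for its folds argument is something to check in that library's documentation.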

how to implement walk forward testing in sklearn?

In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross-validation. However, the usual cross-validation is like this:
To cross-validate time series data, the training and testing data are often split like this:
That is to say, the testing data should always be ahead of the training data in time.
My thought is:
Write my own version of a k-fold class and pass it to GridSearchCV, so I can enjoy the convenience of the pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices for the training and testing data.
Write a new class GridSearchWalkForwardTest, similar to GridSearchCV. I am studying the source code and find it a little complicated.
Any suggestion is welcome.
I think you could use a Time Series Split either instead of your own implementation or as a basis for implementing a CV method which is exactly as you describe it.
After digging around a bit, it seems like someone added a max_train_size to the TimeSeriesSplit in this PR which seems like it does what you want.
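As a minimal sketch of what TimeSeriesSplit with max_train_size produces, assuming ten observations already sorted in time order (the data and the parameter values are illustrative only):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten observations in time order

# Each test block follows its training block; the training window is
# capped at the 4 most recent observations.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=4)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Since tscv is a regular cross-validation splitter, it can be passed as cv=tscv to GridSearchCV together with a pipeline, which covers the first idea in the question without writing a new search class.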
I did some work regarding all this some months ago.
You could check it out in this question/answer:
Rolling window REVISITED - Adding window rolling quantity as a parameter- Walk Forward Analysis
My opinion is that you should try to implement your own GridSearchWalkForwardTest. I once used GridSearch for the training, then implemented the same grid search myself, and I didn't get the same results, even though I should have.
What I did at the end is using my own function. You have more control over the training and test set and you have more control over the parameters you train.
I've written some code that I hope could be helpful to someone.
'sequence' is the period of the time series. I am training a model on sequence up to 40, predicting 41, then training up to 41 to predict 42, and so on...up until the max. 'quantity' is the target variable. And then my average of all of the errors will be my metric of evaluation
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values
    mdl = LinearRegression().fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    # mean_squared_error returns the MSE, so take the square root to get the RMSE
    error = np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred))
    RMSE.append(error)
print('Mean RMSE = %.5f' % np.mean(RMSE))
Leveraging scikit-learn's TimeSeriesSplit to define fixed-size rolling train and test windows. Note that the first training window may include some excess data (I prefer to keep it rather than clip it):
import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    folds = math.floor(len(X) / test_size)
    splitter = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in splitter.split(X):
        if len(train_index) < train_size:
            continue  # not enough history for a full training window yet
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            pass  # keep the small excess rather than clipping it
        else:
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:
X / V are the training / validation sets.
"||" indicates a gap (parameter n_gap: int>0) truncated at the beginning of the validation set, in order to prevent leakage effects.
You could easily extend it to get longer lookback windows for the training sets.
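from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):
    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        self._cv = StratifiedKFold(n_splits=self.n_splits+1, shuffle=False)

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])  # keep only the (disjoint) test indices of each fold
        for i in range(1, len(_ixs)):
            # train on block i-1; validate on block i, dropping the indices
            # that fall within n_gap of the end of the training block
            yield tuple((_ixs[i-1], _ixs[i][_ixs[i] > _ixs[i-1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits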
Note that the datasets may not be perfectly stratified afterwards, because of the truncation with n_gap.

