I am using scikit-learn's KFold as follows:
kf = KFold(10000, n_folds=5, shuffle=True, random_state=88)
However, I want to exclude certain indices from the training folds (only). How can this be achieved? Thanks.
I wonder if this can be achieved by using sklearn.cross_validation.PredefinedSplit?
Update: The KFold instance will be used with XGBoost for the folds parameter of xgb.cv. The Python API here states that folds should be "a KFold or StratifiedKFold instance".
For now, I will try generating the folds as above, iterating over the train-fold indices, modifying them, and then defining a custom_cv by hand like this:
custom_cv = zip(train_indices, test_indices)
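A minimal sketch of that plan with the newer sklearn.model_selection API (the excluded indices below are purely illustrative):
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10000
exclude = {5, 17, 42}                    # hypothetical indices to keep out of the training folds only
kf = KFold(n_splits=5, shuffle=True, random_state=88)
custom_cv = []
for train_idx, test_idx in kf.split(np.arange(n_samples)):
    train_idx = np.array([i for i in train_idx if i not in exclude])
    custom_cv.append((train_idx, test_idx))
# custom_cv is a list of (train, test) index tuples; recent XGBoost versions also accept such a list via xgb.cv(..., folds=custom_cv)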
If you want to remove indices from the training set, but it is ok if they are in the testing set, then this approach will work:
kf_list = list(kf)
This will return a list of tuples that can be iterated over in the same way as the KFold instance. You can then simply modify the indices as you see fit, and your KFold instance will stay untouched. You can think of a KFold object as an array of integers representing the indices, plus methods that generate the folds on the fly.
Here's the source code for the meaty part of how the iterator protocol is implemented; it is pretty straightforward:
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L254
def _iter_test_indices(self):
    n = self.n
    n_folds = self.n_folds
    fold_sizes = (n // n_folds) * np.ones(n_folds, dtype=np.int)
    fold_sizes[:n % n_folds] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield self.idxs[start:stop]
        current = stop
Related
I have Python code that works well for performing k-fold CV on a dataset. It looks like this:
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle
# Load the dataset.
dataset = pandas.read_csv('values.csv')
# Preprocessing the dataset.
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8] # The class value is the last column and is called Outcome.
# Scale all values to 0,1.
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
# 3-fold CV computation.
scores = []
svr_rbf = SVR(kernel='rbf', gamma='auto')
cv = KFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    svr_rbf.fit(X_train, Y_train)
    scores.append(svr_rbf.score(X_test, Y_test))
Now, I want to rewrite the same thing in R, and I tried to do something like this:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("values.csv", col_names=TRUE)
results <- train(Outcome ~ .,
                 data = dataset,
                 method = "svmLinear",
                 trControl = trainControl(
                   method = "cv",
                   number = 3,
                   savePredictions = TRUE,
                   verboseIter = TRUE
                 ))
print(results)
print(results$pred)
My data is similar to this one: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data
Except that dataset has 12 attributes and the 13th column is the class; in my case there are 8 attributes and the 9th is the class. But value-wise it is similar.
Now, I can see the results printing; however, a few things are unclear to me.
1) In my Python code I scaled the values to [0, 1] with MinMaxScaler; how can I do that in R?
2) I used SVR with an RBF kernel; how can I use SVR with that kernel in R instead of a linear SVM?
3) Also, in the Python version I use random_state=42 (just an arbitrary number) when generating the splits for the folds, so it uses different folds but is consistent across executions. How do I do this in R?
4) Lastly, in Python I do the training inside a for loop, per fold. I need something like this in R too, as after every fold I want to perform some other statistics and computations. How do I do this in R?
5) Should I stick with caret or use the mlr package? Does mlr do k-fold CV too? If so, how?
EDIT:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
print(dataset)
X = dataset[, 1:8]
print(X)
Y = dataset$X9
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = nfolds,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            verboseIter = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
rfCaret <- caret::train(X, Y, method = "svmLinear", trControl = fit.control)
print(rfCaret)
Check out createFolds in the caret package for fixed folds.
Here's some code that you can amend to fit your particular modelling case; this example builds a random forest model, but you can switch the model for an SVM. If you follow the package guide there's a link (copied here for ease: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines); section 7.0.47 lists all the available SVM models and their parameters. Note that you may need to install some additional packages, like kernlab, to use specific models.
There is a package called rngtools that is supposed to allow you to create reproducible models across multiple cores (parallel processing), but if you want to be sure then single core is probably the best way in my experience.
folds <- 3
set.seed(42)
cvIndex <- createFolds(your_data_y, folds, returnTrain = TRUE)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
search.grid <- expand.grid(.mtry = c(seq.int(1:sqrt(length(your_data))) + 1))
rfCaret <- train(your_data_x, your_data_y, method = "rf",
                 metric = "ROC", ntree = 500,
                 trControl = fit.control, tuneGrid = search.grid)
In my experience, caret covers pretty much all bases. If you also want to preprocess your data (e.g. centre, scale), then you want the preProcess function; again, the details are in the caret package if you type ?train, but for example you would want:
preProcess(yourData, method = c("center", "scale"))
Caret is clever in that it understands if it has taken a preprocessed input, and applies the same scaling to your test data sets.
edit - additional: unused parameters issue
To answer your follow up question about unused parameters - it's probably because you're using mtry which is a random forest parameter.
Here's a version for a simple SVM:
folds <- 3
set.seed(42)
cvIndex <- createFolds(dataset$Outcome, folds, returnTrain = TRUE)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
SVMCaret <- train(Outcome ~ ., data = dataset, method = "svmLinear",
                  metric = "ROC",
                  trControl = fit.control)
You don't need a tuning grid; caret will generate a default one. Of course, if you want to test specific cost values, then create one yourself in much the same way as I did with the .mtry parameter for random forests.
1) The caret::train function has a preProcess argument that allows you to choose a preprocessing method. See ?caret::train for more details.
2) There is svmRadial available for caret. You can look at examples and all available algorithms at caret/train-models-by-tag.
3) Fix the random seed with set.seed(123) for consistency. You can access the training folds in the train object (results$trainingData here).
4) Don't loop; access your folds directly via your train object and calculate your stats if needed (see results$resample).
5) mlr does cross-validation too; it depends which flavor you like.
tl;dr: Is there any way to call .get_feature_names() on the fit and transformed data from the previous step of the pipeline to use as a hyperparameter in the next step of the pipeline?
I have a Pipeline that includes fitting and transforming text data with TfidfVectorizer, and then runs a RandomForestClassifier. I want to GridSearchCV across various levels of max_features in the classifier, based on the number of features that the transformation produced from the text.
#setup pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])
#setup parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()), 1)
}
Instantiating returns the following error:
NameError: name 'vect' is not defined
Edit:
This would be more relevant (and is not shown in the sample code) if I were modulating a parameter of the TfidfVectorizer such as ngram_range; one can see how that could change the number of features output to the next step...
The parameter grid gets populated before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the gridsearch, like here, but I'd expect it to be substantially harder since your second parameter depends on the results of fitting the first step.
I think the best approach, while it won't produce exactly what you're after, is to just use fractional values for max_features, i.e. a percentage of the columns coming out of the vectorizer.
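For instance, a minimal sketch (the grid values are illustrative; pipe is the pipeline defined above). RandomForestClassifier treats a float max_features as a fraction of the available features, so the grid no longer needs to know the vectorizer's output width:
import numpy as np
from sklearn.model_selection import GridSearchCV

params = {
    'rf__max_features': np.linspace(0.1, 1.0, 10)   # fractions of the columns produced by the vectorizer
}
search = GridSearchCV(pipe, params, cv=5)            # X_text, y are your (hypothetical) text data and labels
# search.fit(X_text, y)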
If you really want a score for every integer max_features, I think the easiest way may be to have two nested grid searches, the inner one only instantiating the parameter space when its fit is called:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

estimator = RandomForestClassifier(
    random_state=1,
    criterion='entropy',
    n_estimators=800
)

class MySearcher(GridSearchCV):
    def fit(self, X, y):
        # build the real grid only once the vectorizer output (and hence the column count) is known
        m = X.shape[1]
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y)

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', MySearcher(estimator=estimator,
                      param_grid={'fake': ['passes', 'check']}))
])
Now the search results will be awkwardly nested (best values of, say, ngram_range give you a refitted copy of pipe, whose second step will itself have a best value of max_features and a corresponding refitted random forest). Also, the data available for the inner search will be a bit smaller.
In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross-validation. However, the usual cross-validation splits the data into folds without regard to time order.
To cross-validate time series data, the training and testing data should instead be split in time order: the testing data should always be ahead of the training data.
My thoughts are:
Write my own k-fold class and pass it to GridSearchCV so I can enjoy the convenience of a pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices of training and testing data.
Write a new class GridSearchWalkForwardTest, similar to GridSearchCV. I am studying the source code grid_search.py and find it a little complicated.
Any suggestion is welcome.
I think you could use TimeSeriesSplit, either instead of your own implementation or as a basis for implementing a CV method that works exactly as you describe.
After digging around a bit, it seems someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
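A minimal sketch of plugging TimeSeriesSplit into GridSearchCV (the data, pipeline steps and parameter grid are illustrative):
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.random.rand(200, 5)                               # hypothetical time-ordered features
y = np.random.rand(200)
tscv = TimeSeriesSplit(n_splits=5, max_train_size=50)    # max_train_size caps the rolling training window
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
search = GridSearchCV(pipe, {'svr__C': [0.1, 1, 10]}, cv=tscv)
search.fit(X, y)                                         # each test block comes after its training block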
I did some work regarding all this some months ago.
You could check it out in this question/answer:
Rolling window REVISITED - Adding window rolling quantity as a parameter- Walk Forward Analysis
My opinion is that you should try to implement your own GridSearchWalkForwardTest. I used GridSearch once to do the training and implemented the same GridSearch myself, and I didn't get the same results, even though I should have.
What I did in the end was use my own function. You have more control over the training and test sets, and more control over the parameters you train.
I've written some code that I hope could be helpful to someone.
'sequence' is the period of the time series. I train a model on sequences up to 40 and predict 41, then train up to 41 to predict 42, and so on, up until the max. 'quantity' is the target variable. The average of all the errors is then my evaluation metric:
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values
    mdl = LinearRegression()
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    error = sklearn.metrics.mean_squared_error(y_test, y_pred)
    RMSE.append(error)
print('Mean RMSE = %.5f' % np.mean(RMSE))
Leveraging scikit-learn's TimeSeriesSplit, this defines fixed-size rolling training and test windows. Note that the first training window may include additional excess data (I prefer to keep it rather than clip it):
import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    folds = math.floor(len(X) / test_size)
    tscv = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in tscv.split(X):
        if len(train_index) < train_size:
            # not enough history yet for a full training window
            continue
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            # keep the small excess at the start rather than clipping it
            pass
        else:
            # trim to a fixed-size rolling training window
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
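A quick usage sketch (sizes and estimator are illustrative); the returned list of (train, test) index pairs can be passed directly as cv to GridSearchCV or cross_val_score:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X = np.random.rand(530, 4)          # hypothetical time-ordered data
y = np.random.rand(530)
splits = tscv(X, train_size=100, test_size=20)
search = GridSearchCV(SVR(), {'C': [0.1, 1, 10]}, cv=splits)
search.fit(X, y)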
I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:
|X||V|O|O|O|
|O|X||V|O|O|
|O|O|X||V|O|
|O|O|O|X||V|
X / V are the training / validation sets.
"||" indicates a gap (parameter n_gap: int>0) truncated at the beginning of the validation set, in order to prevent leakage effects.
You could easily extend it to get longer lookback windows for the training sets.
import numpy as np
from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):

    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        self._cv = StratifiedKFold(n_splits=self.n_splits + 1, shuffle=False)

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])           # keep only the test indices of each stratified fold
        for i in range(1, len(_ixs)):
            # train on fold i-1, validate on fold i truncated by the gap
            yield tuple((_ixs[i - 1], _ixs[i][_ixs[i] > _ixs[i - 1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
Note that the folds may not be perfectly stratified afterwards, because of the truncation with n_gap.
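A quick usage sketch (the synthetic data and LogisticRegression are placeholders); since the class exposes split and get_n_splits, it can be passed as cv to cross_val_score or GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 3)
y = np.array([0, 1] * 100)                       # hypothetical binary target
cv = StratifiedWalkForward(n_splits=4, n_gap=5)
print(cross_val_score(LogisticRegression(), X, y, cv=cv))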
I have an unbalanced dataset, so I have a strategy for oversampling that I only apply during training of my data. I'd like to use scikit-learn classes like GridSearchCV or cross_val_score to explore or cross-validate some parameters on my estimator (e.g. SVC). However, I see that you either pass the number of CV folds or a standard cross-validation generator.
I'd like to create a custom CV generator that gives me a stratified 5-fold split, oversamples only my training data (4 folds), and lets scikit-learn search through the grid of parameters of my estimator, scoring with the remaining fold for validation.
The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[myDf['cvLabel'] != i].index.values.astype(int)
    testIndices = myDf[myDf['cvLabel'] == i].index.values.astype(int)
    myCViterator.append((trainIndices, testIndices))
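This list can then be passed straight to the cv argument of cross_val_score or GridSearchCV; a minimal sketch (SVC and the 'target' column name are placeholders):
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = myDf.drop(columns=['target', 'cvLabel'])     # hypothetical feature columns
y = myDf['target']                               # hypothetical label column
scores = cross_val_score(SVC(), X, y, cv=myCViterator)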
I had a similar problem and this quick hack is working for me:
import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            nix = np.where(y[rx] == 0)[0]                                    # positions of negatives in the training fold
            pix = np.where(y[rx] == 1)[0]                                    # positions of positives in the training fold
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)    # upsample positives with replacement
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
This upsamples (with replacement) the minority class for a balanced (k-1)-fold training set, but leaves the k-th test set unbalanced. This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.
Scikit-Learn provides a workaround for this, with their Label k-fold iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
import numpy as np

class own_custom_CrossValidator:  # like those in sklearn/model_selection/_split.py
    def __init__(self):  # coordinates, meter
        pass  # self.coordinates = coordinates  # self.meter = meter

    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict, cross_val_score
        for i in range(0, len(X)):
            # assumed completion of the truncated original: train on all rows, test on row i
            yield np.arange(len(X)), np.array([i])
Is
class sklearn.cross_validation.ShuffleSplit(
    n,
    n_iterations=10,
    test_fraction=0.10000000000000001,
    indices=True,
    random_state=None
)
the right way for 10*10-fold CV in scikit-learn? (By changing the random_state to 10 different numbers.)
Because I didn't find any random_state parameter in StratifiedKFold or KFold, and the splits from KFold are always identical for the same data.
If ShuffleSplit is the right approach, one concern is that the documentation mentions:
Note: contrary to other cross-validation strategies, random splits do not
guarantee that all folds will be different, although this is still
very likely for sizeable datasets
Is this always the case for 10*10 fold CV?
I am not sure what you mean by 10*10 cross-validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. You could call this 10 times by explicitly using an outer loop, or directly call it 100 times with 10% of the data reserved for testing in a single pass if you use instead:
>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
... random_state=42)
If you want to do 10 runs of StratifiedKFold with k=10, you can shuffle the dataset between the runs (that would lead to a total of 100 calls to the fit method, with a 90% train / 10% test split for each call to fit):
>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
... X, y = shuffle(X_orig, y_orig, random_state=i)
... skf = StratifiedKFold(y, 10)
... print cross_val_score(clf, X, y, cv=skf)
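For reference, a minimal sketch of the same 10x10 scheme with the modern sklearn.model_selection API (the data and estimator are placeholders); RepeatedStratifiedKFold repeats stratified 10-fold CV with a different shuffle each repetition:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(100, 4)
y = np.array([0, 1] * 50)                         # hypothetical binary target
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(SVC(), X, y, cv=rskf)    # 100 fits, each with a 90% train / 10% test split
print(scores.mean())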