How to perform k-fold CV in R? - python

I have Python code that works well for performing k-fold CV on a dataset. It looks like this:
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle
# Load the dataset.
dataset = pandas.read_csv('values.csv')
# Preprocessing the dataset.
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8] # The class value is the last column and is called Outcome.
# Scale all values to 0,1.
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
# 3-fold CV computation.
scores = []
svr_rbf = SVR(kernel='rbf', gamma='auto')
cv = KFold(n_splits=3, shuffle=True, random_state=42)  # random_state only takes effect when shuffle=True
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    svr_rbf.fit(X_train, Y_train)
    scores.append(svr_rbf.score(X_test, Y_test))
Now, I want to rewrite the same thing in R, and I tried to do something like this:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("values.csv", col_names=TRUE)
results <- train(Outcome ~ .,
                 data = dataset,
                 method = "svmLinear",
                 trControl = trainControl(
                   method = "cv",
                   number = 3,
                   savePredictions = TRUE,
                   verboseIter = TRUE
                 ))
print(results)
print(results$pred)
My data is similar to this one: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data
Except that one has 12 attributes and the 13th column is the class; in my case there are 8 attributes and the 9th one is the class. But value-wise it is similar.
Now, I can see the results printing; however, there are a few things that are unclear to me.
1) In my Python code I scale all values to the range [0, 1]; how can I do that in R?
2) I have used SVR with an RBF kernel; how can I use SVR with that kernel in R instead of an SVM?
3) Also, in the Python version I use random_state=42 (just an arbitrary number) when generating the splits for the folds, so the folds are shuffled but stay consistent across different executions. How do I do this in R?
4) Lastly, in Python I do the training inside a for loop, one iteration per fold. I need something like this in R too, because after every fold I want to perform some other statistics and computations. How do I do this in R?
5) Should I stick with caret or use the mlr package? Does mlr do k-fold CV too? If yes, how?
EDIT:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
print(dataset)
X = dataset[, 1:8]
print(X)
Y = dataset$X9
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = nfolds,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            verboseIter = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
rfCaret <- caret::train(X, Y, method = "svmLinear", trControl = fit.control)
print(rfCaret)

Check out createFolds in the caret package for fixed folds.
Here's some code that you can amend to fit your particular modelling case; this example builds a random forest model, but you can switch the model for an SVM. The package guide lists the available models (copied here for ease: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines) - section 7.0.47 lists all the available SVM models and their parameters. Note that you may need to install additional packages, such as kernlab, to use specific models.
There is a package called rngtools that is supposed to let you create reproducible models across multiple cores (parallel processing), but if you want to be sure, then single-core is probably the safest way in my experience.
folds <- 3
set.seed(42)
cvIndex <- createFolds(your_data, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
search.grid <- expand.grid(.mtry = c(seq.int(1:sqrt(length(your_data))) + 1))
rfCaret <- train(your_data_x, your_data_y, method = "rf",
                 metric = "ROC", ntree = 500,
                 trControl = fit.control, tuneGrid = search.grid)
In my experience, caret covers pretty much all bases. If you also want to preprocess your data (e.g. centre and scale it), then you want the preProcess function - again, details are in the caret package if you type ?train - but for example you would want
preProcess(yourData, method = c("center", "scale"))
Caret is clever in that it understands if it has taken a preprocessed input, and applies the same scaling to your test data sets.
edit - additional: unused parameters issue
To answer your follow-up question about unused parameters - it's probably because you're using mtry, which is a random forest parameter.
Here's a version for a simple SVM:
folds <- 3
set.seed(42)
cvIndex <- createFolds(dataset$Outcome, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
SVMCaret <- train(Outcome ~ ., data = dataset, method = "svmLinear",
                  metric = "ROC",
                  trControl = fit.control)
You don't need a tuning grid; caret will generate one for you. Of course, if you want to test specific cost values, then create one yourself in much the same way as I did with the .mtry parameter for random forests.

1) The caret::train function has a preProcess argument that lets you choose a preprocessing method. See ?caret::train for more details.
2) There is svmRadial available for caret. You can look at examples and all available algorithms at caret/train-models-by-tag.
3) Fix the random seed with set.seed(123) for consistency. You may access training folds in the train object (results$trainingData here).
4) Don't loop; access your folds directly via your train object and calculate your stats if needed (see results$resample).
5) mlr has cross-validation too, it depends which flavor you like.

Related

Feature selection: after or during nested cross-validation?

I have managed to write some code doing a nested cross-validation using lightGBM as my regressor and wrapping everything with sklearn.pipeline.
Ultimately, I would now want to do feature selection (or really just get the feature importances for the final model), but I am wondering what the best path to take from here is. I guess there would be two possibilities:
1# Use this methodology to build a model (using .fit and .predict) with the best hyperparameters, then check the importance of the features for this model.
2# Do feature selection in the inner fold of the nested CV, but I am unsure how to do this exactly.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops the cross_val_score altogether, meaning that it isn't nested cross-validation anymore (I would still like to perform the CV on the outer folds after getting the best hyperparameters on the inner folds).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am
aware that if I have 5 folds, I will get 5 different sets of feature
importance)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each
fold (how?) and build a new model without CV on the whole dataset,
based on these hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st
# Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'
#### Nested CV with Random gridsearch ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state = state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont',continuous_transformer, continuous_variables)], remainder = 'passthrough')
for reg in regressors:
    steps = [('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)

    # Inner and outer folds to be used
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)

    # Hyperparameters of the regressor to be optimized using randomized search
    params = {
        'regressor__max_depth': (3, 5, 7, 10),
        'regressor__lambda_l1': st.uniform(0, 5),
        'regressor__lambda_l2': st.uniform(0, 3)
    }

    # Pass the RandomizedSearchCV to cross_val_score
    regression = RandomizedSearchCV(estimator=pipeline, param_distributions=params, scoring=inner_scoring, cv=inner_cv, n_iter=200, verbose=3, n_jobs=-1)
    nested_score = cross_val_score(regression, X=X, y=y, cv=outer_cv, scoring=outer_scoring)

    print('\n MAE for lightGBM model predicting age: %.3f' % (abs(nested_score.mean())))
    print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the lightGBM module, so I couldn't run your code. But here is a post explaining why you cannot get the "winning" or optimal hyperparameters (or the feature_importances_) out of nested cross-validation with cross_val_score. Briefly, the reason is that cross_val_score only returns the evaluation scores.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
The answer is no with cross_val_score. But if you follow the code from that post, you'll be able to get the feature importances simply via GSCV.best_estimator_.feature_importances_ (or, when the estimator is a pipeline, GSCV.best_estimator_.named_steps['regressor'].feature_importances_) inside the for loop, after GSCV.fit().
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is talking about: getting the "best" hyperparameters by nested CV. Ideally, you'll observe one combination of hyperparameters that wins all the time, and those are the hyperparameters you'll use for the final model (fit on the entire training set). But when different "best" hyperparameter combinations appear across the outer folds, there is no standard way to deal with it as far as I know.
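To make that manual outer loop concrete, here is a minimal sketch. It is not your exact setup: RandomForestRegressor stands in for lightGBM, and the synthetic data, step names, and search space are illustrative assumptions only. Each outer fold runs its own inner RandomizedSearchCV, so you still get a nested outer score, plus one set of best hyperparameters and feature importances per fold:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('regressor', RandomForestRegressor(random_state=13))])
params = {'regressor__max_depth': [3, 5, 7, 10],
          'regressor__n_estimators': [50, 100, 200]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

outer_scores, fold_best_params, fold_importances = [], [], []
for train_idx, test_idx in outer_cv.split(X):
    search = RandomizedSearchCV(pipeline, param_distributions=params,
                                scoring='neg_mean_absolute_error',
                                cv=inner_cv, n_iter=10, random_state=42)
    search.fit(X[train_idx], y[train_idx])          # inner CV picks the hyperparameters
    y_pred = search.best_estimator_.predict(X[test_idx])
    outer_scores.append(mean_absolute_error(y[test_idx], y_pred))
    fold_best_params.append(search.best_params_)    # one winning combination per outer fold
    fold_importances.append(
        search.best_estimator_.named_steps['regressor'].feature_importances_)

print('Outer-CV MAE: %.3f' % np.mean(outer_scores))
print(fold_best_params)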

Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve an R2 score of 0.98 (average of 3 trials) while Program 2 only achieves an R2 score of 0.84 (average of 3 trials). Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. That is incorrect usage.
You should call fit_transform() on X_train and then call only transform() on X_test, because that is what the Pipeline does.
The Pipeline, as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
To elaborate my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on the test data wrongly scales it with different parameters than those learned from the train data, and that is a source of the change in predictions.
Edit: To answer the question in comment:
See, fit_transform() is just a shortcut for doing fit() and then transform(). For StandardScaler, fit() just learns the mean and standard deviation of the data, and transform() then applies that learning to return newly scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (basically equal to Scenario 1, reversing the scaling and splitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try either of these scenarios and maybe it will increase the performance of your model.
But there is one very important thing missing in them. When you scale the whole dataset and then divide it into train and test, it is assumed that you have already seen the test (unseen) data, which will not be true in real-world cases, and it will give you results that do not reflect real-world performance; in the real world, all of the available data is training data. It may also lead to over-fitting, because the model already has some information about the test data.
So when evaluating the performance of machine learning models, it is recommended that you set the test data aside before performing any operations on it; it is our unseen data and we should know nothing about it. So the ideal order of operations is the one I answered, i.e. (a runnable sketch follows the steps below):
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
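A minimal runnable sketch of those four steps (make_regression and LinearRegression are illustrative stand-ins for your data and the KerasRegressor, not your actual setup):
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data; replace with your own X and y
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# 1) Split first, so the scaler never sees the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)   # 2) learn mean and SD on train only
X_test_scaled = scale.transform(X_test)         # 3) reuse those parameters on test

model = LinearRegression().fit(X_train_scaled, y_train)   # 4) train and evaluate
print('R2 on test:', model.score(X_test_scaled, y_test))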
Hope it makes sense to you.

Decision boundary changing in sklearn each time I run code

In Udacity's Intro to Machine Learning class, I am finding that the result of my code can change each time I run it. The correct values are acc_min_samples_split_2 = .908 and acc_min_samples_split_50 = .912, but when I run my script, sometimes the value for acc_min_samples_split_2 comes out as .912 as well. This happens on both my local machine and the web interface within Udacity. Why might this be happening?
The program uses the SciKit Learn library for python.
Here is the part of the code that I wrote:
def classify(features, labels, samples):
    # Creates a new Decision Tree Classifier and fits it on the sample data
    # with a specified min_samples_split value
    from sklearn import tree
    clf = tree.DecisionTreeClassifier(min_samples_split = samples)
    clf = clf.fit(features, labels)
    return clf

# Create a classifier with a min sample split of 2, and test its accuracy
clf2 = classify(features_train, labels_train, 2)
acc_min_samples_split_2 = clf2.score(features_test, labels_test)

# Create a classifier with a min sample split of 50, and test its accuracy
clf50 = classify(features_train, labels_train, 50)
acc_min_samples_split_50 = clf50.score(features_test, labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

print(submitAccuracies())
Some classifiers within scikit-learn are stochastic in nature, using a PRNG to generate random numbers internally.
DecisionTreeClassifier is one of them. Check the docs and use the random_state argument to make that random behaviour deterministic.
Just create your fit-object like:
clf = tree.DecisionTreeClassifier(min_samples_split = samples, random_state=0) # or any other constant
If you don't provide a random_state or some seed/integer like in my example above, the PRNG will be seeded by some external source (most probably based on the system time), resulting in different results across runs of that script.
Two runs sharing the same code and the same constant will behave identically (ignoring some pathological architecture/platform issues).
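A quick way to see this (a hypothetical check on synthetic data, not the Udacity dataset): two fits with the same random_state give identical accuracies.
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf_a = tree.DecisionTreeClassifier(min_samples_split=2, random_state=0).fit(features_train, labels_train)
clf_b = tree.DecisionTreeClassifier(min_samples_split=2, random_state=0).fit(features_train, labels_train)
print(clf_a.score(features_test, labels_test) == clf_b.score(features_test, labels_test))  # True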

how to implement walk forward testing in sklearn?

In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross validation. However, the usual cross validation splits the data into folds without regard to time order, whereas to cross-validate time series data the training and testing data are usually split chronologically (the question originally illustrated both schemes with diagrams). That is to say, the testing data should always be ahead of the training data.
My thought is:
Write my own k-fold class and pass it to GridSearchCV, so I can enjoy the convenience of a pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices of training and testing data.
Write a new class GridSearchWalkForwardTest which is similar to GridSearchCV. I am studying the source code of grid_search.py and find it a little complicated.
Any suggestion is welcome.
I think you could use a TimeSeriesSplit either instead of your own implementation or as a basis for implementing a CV method that works exactly as you describe.
After digging around a bit, it seems someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
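As a rough sketch (the Ridge model, grid, and window sizes below are assumptions, not part of the original answer), TimeSeriesSplit can be passed straight to GridSearchCV as the cv argument, and max_train_size caps the look-back window so the scheme behaves like walk-forward testing:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic time-ordered data as a stand-in
rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.randn(300)

pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
cv = TimeSeriesSplit(n_splits=5, max_train_size=100)  # each training window ends before its test window

search = GridSearchCV(pipe, param_grid={'model__alpha': [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X, y)
print(search.best_params_)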
I did some work regarding all this some months ago.
You could check it out in this question/answer:
Rolling window REVISITED - Adding window rolling quantity as a parameter- Walk Forward Analysis
My opinion is that you should try to implement your own GridSearchWalkForwardTest. I used GridSearch once to do the training and also implemented the same grid search myself, and I didn't get the same results, even though I should have.
What I did at the end is using my own function. You have more control over the training and test set and you have more control over the parameters you train.
I've written some code that I hope could be helpful to someone.
'sequence' is the period of the time series. I am training a model on sequences up to 40, predicting 41, then training up to 41 to predict 42, and so on, up until the max. 'quantity' is the target variable, and the average of all the errors is my evaluation metric.
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values
    mdl = LinearRegression()
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    error = sklearn.metrics.mean_squared_error(test['quantity'].values, y_pred)
    RMSE.append(error)
print('Mean RMSE = %.5f' % np.mean(RMSE))
Leveraging TimeSeriesSplit (from sklearn.model_selection), this defines fixed rolling train and test windows. Note that the first training window may include additional excess data (I prefer to keep it rather than clip it):
import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    folds = math.floor(len(X) / test_size)
    tscv = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in tscv.split(X):
        if len(train_index) < train_size:
            # Not enough history yet for a full training window
            continue
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            # First full window: keep the small excess rather than clipping it
            pass
        else:
            # Fixed rolling window: keep only the most recent train_size points
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
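The returned list of (train, test) index pairs can be fed directly to the cv argument of cross_val_score or GridSearchCV. A hypothetical usage, assuming X and y are ordered by time and the window sizes fit your data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

splits = tscv(X, train_size=100, test_size=20)
scores = cross_val_score(LinearRegression(), X, y, cv=splits, scoring='neg_mean_squared_error')
print(scores)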
I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:
|X||V|O|O|O|
|O|X||V|O|O|
|O|O|X||V|O|
|O|O|O|X||V|
X / V are the training / validation sets.
"||" indicates a gap (parameter n_gap: int>0) truncated at the beginning of the validation set, in order to prevent leakage effects.
You could easily extend it to get longer lookback windows for the training sets.
from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):

    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        self._cv = StratifiedKFold(n_splits=self.n_splits + 1, shuffle=False)
        return

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])
        for i in range(1, len(_ixs)):
            # Train on fold i-1; validate on fold i minus the first n_gap indices after the training set
            yield tuple((_ixs[i - 1], _ixs[i][_ixs[i] > _ixs[i - 1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
Note that the datasets may not be perfectly stratified afterwards, because of the truncation with n_gap.
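Since the class exposes split() and get_n_splits(), it can be passed as the cv argument to the usual sklearn helpers. A hypothetical usage on synthetic data (the dataset and classifier are assumptions):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
cv = StratifiedWalkForward(n_splits=4, n_gap=5)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv))  # one score per walk-forward split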

faster data fitting ( or learn) function in python scikit

I am using scikit-learn for my machine learning purposes. I followed the steps exactly as mentioned in its official documentation, but I encounter two problems. Here is the main part of the code:
1) trdata is training data created using sklearn.train_test_split.
2) ptest and ntest are the test data of positives and negatives respectively
## Preprocessing
scaler = StandardScaler(); scaler.fit(trdata);
trdata = scaler.transform(trdata)
ptest = scaler.transform(ptest); ntest = scaler.transform(ntest)
## Building Classifier
# setting gamma and C for grid search optimization, RBF Kernel and SVM classifier
crange = 10.0**np.arange(-2,9); grange = 10.0**np.arange(-5,4)
pgrid = dict(gamma = grange, C = crange)
cv = StratifiedKFold(y = tg, n_folds = 3)
## Threshold Ranging
clf = GridSearchCV(SVC(),param_grid = pgrid, cv = cv, n_jobs = 8)
## Training Classifier: Semi Supervised Algorithm
clf.fit(trdata,tg,n_jobs=8)
Problem 1) When I use n_jobs = 8 in GridSearchCV, the code runs up to the GridSearchCV line but then hangs, or at least takes an exceptionally long time without producing a result, when executing clf.fit, even for a very small dataset. When I remove it, both lines execute, but clf.fit takes a very long time to converge for large datasets. My data is a 600 x 12 matrix for both positives and negatives. Can you tell me what exactly n_jobs does and how it should be used? Also, is there any faster fitting technique or modification to the code that can be applied to make it faster?
Problem 2) Also, should StandardScaler be used on the positive and negative data combined, or separately for each? I suppose it has to be fitted on the combined training data, because only then can we apply the same scaler parameters to the test sets.
SVC seems to be very sensitive to data that is not normalized; you may try to normalize the data with:
from sklearn import preprocessing
trdata = preprocessing.scale(trdata)
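Beyond that, one common pattern is sketched below (the synthetic dataset and the reduced parameter ranges are assumptions, not from the question): put the scaler and the SVC in a Pipeline so the scaling parameters are learned only from the training folds, and set n_jobs on GridSearchCV itself rather than on fit(), which does not accept it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the positive/negative data
X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
pgrid = {'svc__C': 10.0 ** np.arange(-2, 3),       # smaller ranges keep the grid search tractable
         'svc__gamma': 10.0 ** np.arange(-3, 1)}

cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(pipe, param_grid=pgrid, cv=cv, n_jobs=2)  # parallelism goes here, not in fit()
clf.fit(X, y)
print(clf.best_params_, clf.best_score_)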
