After K-fold cross validation, which evaluation metric is averaged: precision and recall, or the F-measure?
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
KFold(n_splits=2, random_state=None, shuffle=False)
sklearn.model_selection.KFold is a class that provides the folds but does not actually perform k-fold cross validation itself. You have to implement that part yourself!
See documentation description:
Provides train/test indices to split data in train/test sets. Split
dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as a validation while the k - 1 remaining
folds form the training set.
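Since you write the loop yourself, you also choose which metric gets averaged. Here is a minimal sketch of averaging the F-measure across folds; the logistic regression classifier and the toy dataset are placeholders, not anything prescribed by KFold:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# toy data standing in for your own X and y
X, y = make_classification(n_samples=200, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_index, test_index in kf.split(X):
    clf = LogisticRegression(max_iter=1000)               # placeholder classifier
    clf.fit(X[train_index], y[train_index])
    y_pred = clf.predict(X[test_index])
    fold_scores.append(f1_score(y[test_index], y_pred))   # F-measure on this fold

print("mean F1 over folds:", np.mean(fold_scores))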
I am learning machine learning and I have a doubt. Can anyone tell me what the difference is between:
from sklearn.model_selection import cross_val_score
and
from sklearn.model_selection import KFold
I think both are used for k-fold cross validation, but I am not sure why there are two different pieces of code for the same purpose.
If there is something I am missing, please let me know (and, if possible, please explain the difference between these two methods).
Thanks,
cross_val_score is a function which evaluates a model on data and returns the per-fold scores.
KFold, on the other hand, is a class which lets you split your data into K folds.
So these are quite different things. You can create K folds of the data and use them for cross validation like this:
# imports for this example (XGBRegressor comes from the xgboost package)
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, cross_val_score

# create a splitter object
kfold = KFold(n_splits=10)

# define your model (any model); params, X and y are your own
model = XGBRegressor(**params)

# pass your model and the KFold object to cross_val_score
# to fit the model and get the (negative) RMSE of each fold
cv_score = cross_val_score(model,
                           X, y,
                           cv=kfold,
                           scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())
cross_val_score evaluates a model using cross validation: it splits the training data into distinct subsets called folds, then trains and evaluates the model on the folds, picking a different fold for evaluation each time and training on the remaining folds.
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
KFold procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
where model is the model you want to evaluate,
data is the training data,
target is the target variable,
the scoring parameter controls which metric is applied to the estimator, and cv is the cross-validation splitter (or simply the number of splits).
I am using a Random Forest Classifier and I want to perform k-fold cross validation.
My dataset is already split into 10 different subsets, so I'd like to use them to do k-fold cross validation, without using automatic functions that randomly split the dataset.
Is it possible in Python?
Random Forest doesn't have the partial_fit() method, so I can't do an incremental fit.
try kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123) to split your data while keeping the class proportions the same in each fold
try kf = TimeSeriesSplit(n_splits=5) to split by time stamp
try kf = KFold(n_splits=5, random_state=123, shuffle=True) to shuffle your training data before splitting.
for train_index, test_index in kf.split(df):
    cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]
    # fit the classifier on cv_train and evaluate it on cv_test
You can also stratify by groupings or categories and compute mean scores for these groupings using k-fold. It is super powerful for understanding your data.
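For instance, here is a minimal sketch of getting the per-fold scores and their average with a stratified splitter; the random forest and the toy data are placeholders for your own model and dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# placeholder data; substitute your own features and labels
X, y = make_classification(n_samples=300, random_state=123)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
clf = RandomForestClassifier(random_state=123)

scores = cross_val_score(clf, X, y, cv=skf, scoring='accuracy')
print(scores)                       # one accuracy value per fold
print(scores.mean(), scores.std())  # averaged across folds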
It is best to join all the subsets and then split them for k-fold, but here is the other way:
for i in range(10):
    model = what_model_you_want
    model.fit(dataset.drop(i_th_subset))              # train on the other 9 subsets
    prediction = model.predict(i_th_subset)           # predict on the held-out subset
    test_result = compute_accuracy(i_th_subset.target, prediction)
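If you can concatenate the subsets, another option worth noting is scikit-learn's PredefinedSplit, which lets you tell cross_val_score exactly which fold each sample belongs to. A minimal sketch, using a toy dataset cut into 10 pieces as a stand-in for your pre-made subsets:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, cross_val_score

# stand-in for your 10 pre-made subsets: a toy dataset cut into 10 pieces
X, y = make_classification(n_samples=500, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["target"] = y
subsets = np.array_split(df, 10)

# concatenate the subsets and record which subset each row came from
data = pd.concat(subsets, ignore_index=True)
test_fold = np.concatenate([np.full(len(s), i) for i, s in enumerate(subsets)])

ps = PredefinedSplit(test_fold)     # each subset is used exactly once as the test fold
clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, data.drop(columns="target"), data["target"], cv=ps)
print(scores, scores.mean())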
I created a text classifier and I'm trying to use K-fold cross-validation. I can't figure out why my first fold has an accuracy of 55% while my other folds are overfitting at 99-100% accuracy. My data set is a 5109x2 dataframe with df["Features"] as the features column and df["Labels"] as the labels column. df["Features"] contains descriptors based on some product-mapping keywords, separated by commas. I'm creating indicator variables from the sub-features with CountVectorizer(). Below is the code that produces the 5-fold CV results.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
def train(classifier, X, y):
    count_vect = CountVectorizer(min_df=1, lowercase=False)
    y = pd.Series(y)
    X = count_vect.fit_transform(X)
    y = count_vect.fit_transform(y)
    kf = KFold(n_splits=5, shuffle=True)
    k_fold = pd.Series(np.zeros(5))
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        print("Train", train_index, "Test", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # fit on the training fold, score on the held-out fold
        k_fold[i] = classifier.fit(X_train, y_train).score(X_test, y_test)
        print("For K=", i + 1, " Classifier accuracy= ", k_fold[i], " n = ", X_train.shape[0])

train(MLPClassifier(hidden_layer_sizes=(100,), activation='relu', random_state=2,
                    max_iter=100, warm_start=True),
      df["Features"], df["Labels"])
It is entirely possible that this is just a result of the data. There is no reason to implement this by hand, though: scikit-learn has the functionality built in. If you want to test your implementation, try running the experiment with the shuffle parameter turned off and see whether you get the same results.
It is best practice to shuffle your data anyway prior to running cross validation.
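For reference, here is a minimal sketch of the built-in route, assuming the df["Features"] and df["Labels"] columns from the question and passing the labels directly rather than vectorizing them; wrapping the vectorizer and classifier in a Pipeline also ensures the vectorizer is fit only on the training fold of each split:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# df["Features"] holds the comma-separated descriptors, df["Labels"] the class labels
pipe = make_pipeline(
    CountVectorizer(min_df=1, lowercase=False),
    MLPClassifier(hidden_layer_sizes=(100,), activation='relu',
                  random_state=2, max_iter=100),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, df["Features"], df["Labels"], cv=cv)
print(scores)                       # per-fold accuracy
print(scores.mean(), scores.std())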
Reading the documentation for k-fold cross validation (http://scikit-learn.org/stable/modules/cross_validation.html), I'm attempting to understand the training procedure for each of the folds.
Is this correct:
In generating the cross_val_score, each fold contains a new training and test set, and these training and test sets are used by the passed-in classifier clf in the code below to evaluate each fold's performance?
Does this imply that the number of folds can affect accuracy, since changing the number of folds changes the size of each fold and hence the amount of training data available for each run?
From the docs, cross_val_score is used as follows:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
I don't think the statement "each fold contains a new training and test set" is correct.
By default, cross_val_score uses KFold cross-validation. This works by splitting the data set into K equal folds. Say we have 3 folds (fold1, fold2, fold3), then the algorithm works as follows:
Use fold1 and fold2 as your training set in svm and test performance on fold3.
Use fold1 and fold3 as your training set in svm and test performance on fold2.
Use fold2 and fold3 as your training set in svm and test performance on fold1.
So each fold is used for both training and testing.
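A quick way to see this behaviour (with a tiny toy array) is to print the indices KFold produces for each run:
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)        # six toy samples
kf = KFold(n_splits=3)
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"run {run}: train on {train_idx}, test on {test_idx}")
# run 1: train on [2 3 4 5], test on [0 1]
# run 2: train on [0 1 4 5], test on [2 3]
# run 3: train on [0 1 2 3], test on [4 5]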
Now to the second part of your question. If you increase the number of rows of data in a fold, you do reduce the number of training samples for each of the runs (above, that would be runs 1, 2, and 3), but the total size of the dataset is unchanged.
Generally, selecting the right number of folds is both art and science. For some heuristics on how to choose your number of folds, I would suggest this answer. The bottom line is that accuracy can be slightly affected by your choice of the number of folds. For large data sets, you are relatively safe with a large number of folds; for smaller data sets, you should run the exercise multiple times with new random splits.
Is
class sklearn.cross_validation.ShuffleSplit(n, n_iterations=10, test_fraction=0.1,
                                            indices=True, random_state=None)
the right way to do 10*10-fold CV in scikit-learn (by changing random_state to 10 different numbers)?
I ask because I didn't find any random_state parameter in StratifiedKFold or KFold, and the splits from KFold are always identical for the same data.
If ShuffleSplit is the right choice, one concern is this note from the documentation:
Note: contrary to other cross-validation strategies, random splits do not
guarantee that all folds will be different, although this is still
very likely for sizeable datasets
Is this always the case for 10*10-fold CV?
I am not sure what you mean by 10*10 cross validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. You can either call this 10 times yourself with an explicit outer loop, or directly call fit 100 times, with 10% of the data reserved for testing on each iteration, in a single loop if you use instead:
>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
... random_state=42)
If you want to do 10 runs of StratifiedKFold with k=10, you can shuffle the dataset between the runs (that leads to a total of 100 calls to the fit method, with a 90% train / 10% test split for each call to fit):
>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
... X, y = shuffle(X_orig, y_orig, random_state=i)
... skf = StratifiedKFold(y, 10)
... print cross_val_score(clf, X, y, cv=skf)
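As an aside, in current scikit-learn (the sklearn.model_selection module) there is a ready-made splitter for exactly this repeated scheme. A minimal sketch of 10*10-fold CV with RepeatedStratifiedKFold, using the iris data and a linear SVM as placeholders:
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)      # placeholder data
clf = SVC(kernel='linear', C=1)

# 10 repetitions of stratified 10-fold CV = 100 fits,
# each with a 90% train / 10% test split
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=rskf)
print(scores.shape, scores.mean(), scores.std())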