Repeated holdout method - python

How can I implement a "repeated" holdout method? I implemented the holdout method and got an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I have seen a lot of code for repeated methods, but only for k-fold cross-validation, nothing for the holdout method.

To use a repeated holdout you could use the ShuffleSplit class from sklearn. A minimal working example (following the naming conventions that you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on; this can be replaced by your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result*100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples that is held out as the test set. In this case 75% of the samples go to the training set and 25% to the test set. For reproducible results you can set random_state (any number suffices, as long as you use the same number consistently).
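If you want a single summary over the 30 repeats instead of 30 printed lines, the same ShuffleSplit object can be passed as the cv argument of cross_val_score. The following is a minimal sketch under assumptions: the data is artificial (n_samples=200 and max_iter=1000 are illustrative choices, not values from the question), and your own X and Y would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Artificial data as a stand-in for your own X and Y (assumption)
X, Y = make_classification(n_samples=200, random_state=0)

# 30 independent random 75/25 holdout splits
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)

# cross_val_score clones the model, fits it on each training part
# and scores it on the matching test part
scores = cross_val_score(LogisticRegression(max_iter=1000), X, Y,
                         cv=rs, scoring="accuracy")

print("Repeats: %d" % len(scores))
print("Mean accuracy: %.2f%% (std %.2f%%)"
      % (scores.mean() * 100, scores.std() * 100))
```

The scores array holds one accuracy per repeat, so the mean and standard deviation describe how much the holdout estimate varies from split to split.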

Related

ValueError: Found input variables with inconsistent numbers of samples: [164309, 109541]

I've built a naive Bayes model from two data frames called df_test and df_train. When I run the code below in PyCharm, it returns:
ValueError: Found input variables with inconsistent numbers of samples: [164309, 109541].
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns = ['Interest_Rate']), df_test, test_size=1.0,random_state=109) # 70% training and 30% test
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
Where have I gone wrong?
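The error is raised because train_test_split received two inputs with different numbers of rows: the features from df_train (164309 rows) and df_test (109541 rows). Both arguments must come from the same data, so pass the target column of df_train as the second argument (assuming Interest_Rate is the target, since it is the column dropped from the features). Also, you want a 70-30 split of your data, but test_size=1.0 asks for 100% test data. Change test_size to 0.3 (30%) instead of 1.0 (100%).
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns=['Interest_Rate']), df_train['Interest_Rate'], test_size=0.3, random_state=109) # 70% training and 30% test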
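As a rough illustration, the sketch below builds a small hypothetical stand-in for df_train (the feature_a and feature_b columns and the random values are invented for the example) and splits features and target taken from the same frame, so both inputs have the same number of rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical stand-in for df_train: two features plus the target column
df_train = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "Interest_Rate": rng.integers(1, 4, size=100),
})

# Features and target come from the same frame, so their lengths match
X = df_train.drop(columns=["Interest_Rate"])
y = df_train["Interest_Rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=109)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(X_train.shape, X_test.shape, y_pred.shape)
```

A separate df_test without labels would then only be passed to gnb.predict, after applying the same column selection as for the training features.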

Python: I want to perform 5 fold cross validation for logistic regression and report scores. Do I use LogisticRegressionCV() or cross_val_score()?

cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this.
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy obtained with LogisticRegressionCV?
Also, I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
There are a few points to take into consideration:
Cross-validation is generally used when you need to simulate a validation set (for instance, when the training set is too small to be divided into training, validation and test sets), and it only uses training data. In your case you are computing the accuracy of the model on test data, which makes it impossible to compare the results exactly.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at this snippet, you'll see that's what happens indeed:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                            cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would suggest having a look here, too, for the details of lrcv.scores_[1].mean(axis=0).max().
Finally, to get the same result with cross_val_score, you should write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154
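To see where best_score_ comes from, and to get a standard deviation across folds, the cv_results_ attribute of GridSearchCV can be inspected. This is a minimal sketch under assumptions: the data is artificial, and the four C values and max_iter=1000 are illustrative rather than taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Artificial data as a stand-in for the training set (assumption)
X, y = make_classification(n_samples=300, random_state=0)

gs = GridSearchCV(LogisticRegression(max_iter=1000),
                  param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                  scoring="accuracy", cv=5)
gs.fit(X, y)

# One entry per candidate C: mean and std of the accuracy over the 5 folds
mean_scores = gs.cv_results_["mean_test_score"]
std_scores = gs.cv_results_["std_test_score"]
for params, m, s in zip(gs.cv_results_["params"], mean_scores, std_scores):
    print("C=%g: mean accuracy %.3f (std %.3f)" % (params["C"], m, s))

# best_score_ is the mean over folds for the best C, not the best single fold
print(np.isclose(gs.best_score_, mean_scores.max()))
```

This is why comparing np.amax(scores), the best single fold, with an averaged or held-out accuracy tends to make the single fold look better.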

F1 score for multiclass labeling cross validation

I want to get the F1 score for each of the classes (I have 4 classes) and for each of the cross-validation folds. clf is my trained model, X_test is the features and y_test the labels of the test set. Since I am doing 5-fold cross-validation, I am supposed to get 4 F1 scores for each class on the first fold, 4 on the second... total of 20. Can I do this in python in a simple way?
The following line gives me the average F1 over all the classes, so just 5 values, one per fold. I checked the options for the scoring parameter of cross_val_score (https://scikit-learn.org/stable/modules/model_evaluation.html) and it seems like I cannot get the F1 score for each class in each fold (or maybe I am lost somewhere).
scores = cross_val_score(clf, X_test, y_test, cv=5, scoring='f1_macro')
Ok, I found a solution. X is my dataframe of the features and y the labels.
f1_score(y_test, y_pred, average=None) gives the F1 scores for each class, without aggregation. So for each of the folds, we train the model and evaluate it on the test set.
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=False)
for train_index, test_index in cv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f1_score(y_test, y_pred, average=None))
Then, the result will be:
[0.99320793 0.79749478 0.34782609 0.44243792]
[0.99352309 0.82583622 0.34615385 0.48873874]
[0.99294785 0.78794403 0.28571429 0.42403628]
[0.99324611 0.79236813 0.31654676 0.43778802]
[0.99327615 0.79136691 0.32704403 0.42410197]
where each line has the F1 scores for each fold and each value represents the F1 score of each class.
If there is a shorter & simpler solution to this, please, feel free to post it.
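One possibly shorter route, offered as a sketch rather than a drop-in replacement, is to give cross_validate one scorer per class. Each scorer restricts f1_score to a single label through its labels argument. The data, the classifier and the scorer names (f1_class_0 and so on) are assumptions made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_validate

# Artificial 4-class data as a stand-in for your own X and y (assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
clf = LogisticRegression(max_iter=1000)
classes = [0, 1, 2, 3]

# One scorer per class: macro-averaging over a single label is that label's F1
scoring = {
    "f1_class_%d" % k: make_scorer(f1_score, labels=[k], average="macro")
    for k in classes
}

results = cross_validate(clf, X, y, cv=KFold(n_splits=5), scoring=scoring)

# Rows are folds, columns are classes
per_fold = np.column_stack(
    [results["test_f1_class_%d" % k] for k in classes])
print(per_fold)
```

The resulting array has one row per fold and one column per class, which matches the 5 x 4 layout printed by the loop above.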

How to shuffle data each time when using cross_val_score?

When training a Ridge Classifier, I'm able to perform 10 fold cross validation like so:
clf = linear_model.RidgeClassifier()
n_folds = 10
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
scores
array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])
If I want to perform 10 fold cross validation again, and I use:
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
I end up with the same results.
Thus, it seems the data is being split the same way both times.
Is there a way to randomly partition the data into n_folds every time I perform cross validation?
What you will want to do is create your own instance of the StratifiedKFold object and pass it to the cv argument of cross_val_score. This way you can supply a different random seed each time for splitting the data. Note that shuffle=True is required for random_state to have any effect.
from sklearn import linear_model
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf = linear_model.RidgeClassifier()
for n in range(5):
    # shuffle=True so that each seed produces a different partition
    strat_k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=n)
    scores = cross_val_score(clf, X_train, y_train, cv=strat_k_fold)
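An alternative to looping over seeds by hand is RepeatedStratifiedKFold, which repeats the shuffled stratified split a given number of times. The sketch below assumes artificial data in place of X_train and y_train, and the choice of 5 repeats is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Artificial data as a stand-in for your own X_train and y_train (assumption)
X_train, y_train = make_classification(n_samples=500, random_state=0)

clf = RidgeClassifier()

# 10 folds, repeated 5 times with a different shuffle in each repeat
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=rskf)

# Scores come back repeat by repeat, so each row below is one 10-fold run
per_repeat = scores.reshape(5, 10)
print(per_repeat.mean(axis=1))
```

Leaving random_state unset gives a different partition on every call, at the cost of reproducibility.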

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.
TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two modules for Stratified Splitting:
StratifiedKFold : This module is useful as a direct k-fold cross-validation operator: as in it will set up n_folds training/testing sets such that classes are equally balanced in both.
Here's some code (directly from the above documentation):
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     #fit and predict with X_train/test. Use accuracy metrics to check validation performance
StratifiedShuffleSplit : This module creates a single training/testing set with equally balanced (stratified) classes. With n_iter=1, this is essentially what you want. You can specify the test size here, the same as in train_test_split.
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
You can simply do it with train_test_split() method available in Scikit learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
Here's an example for continuous/regression data (until this issue on GitHub is resolved).
y_min = np.amin(y)
y_max = np.amax(y)
# 5 edges (4 bins) may be too few for larger datasets.
# Keep only the interior edges so the extreme values do not end up in bins of their own.
bins = np.linspace(start=y_min, stop=y_max, num=5)[1:-1]
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)
Here start is the min and stop is the max of your continuous target.
If you pass the outer edges to np.digitize as well, the maximum value (with right=False) or the minimum value (with right=True) lands in a bin of its own, and the split fails because that extra bin has too few samples.
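Equal-width bins can still end up nearly empty when the target is skewed. As an alternative sketch, not part of the answer above, the bin edges can be taken from quantiles of y so that every bin holds roughly the same number of samples. The data here is artificial, and using quartiles is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Artificial features and a skewed continuous target (assumption)
X = rng.normal(size=(1000, 3))
y = rng.exponential(scale=2.0, size=1000)

# Interior quantile edges: the quartiles give four roughly equal-sized bins
edges = np.quantile(y, [0.25, 0.5, 0.75])
y_binned = np.digitize(y, edges, right=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y_binned, random_state=0)

# Bin sizes, and a quick check that train and test targets look alike
print(np.bincount(y_binned))
print(np.median(y_train), np.median(y_test))
```

Only y_binned is used for stratification. The split itself still returns the original continuous y values.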
In addition to the accepted answer by @Andreas Mueller, I just want to add that, as @tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data
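The repeated, stratified behaviour described in the points above can be checked on a toy example. This sketch uses made-up imbalanced labels (80 samples of class 0 and 20 of class 1) and prints the class counts of each split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy labels: 80 samples of class 0 and 20 of class 1 (assumption)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# Three independent stratified 75/25 splits
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

split_counts = []
for i, (train_index, test_index) in enumerate(sss.split(X, y)):
    train_counts = np.bincount(y[train_index])
    test_counts = np.bincount(y[test_index])
    split_counts.append((train_counts, test_counts))
    print(i, train_counts, test_counts)
```

Every split keeps the 80/20 class ratio in both parts, while the samples that land in each part differ from split to split.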
StratifiedShuffleSplit is applied after we choose the column that should be evenly represented in all the smaller datasets we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we've got a dataset 'data' with a column 'season' and we want to get an even representation of 'season'. Then it looks like this:
from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]
As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
# train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15
# First split off the validation set, stratified on the target column
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop('y', axis=1), df.y, test_size=vld_size, stratify=df.y, random_state=13903)
# Then split the remainder; rescale test_size so the test set is tst_size of the full data
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size / (1 - vld_size), stratify=y_train_test, random_state=13903)
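To check the effect of stratify on a three-way split, the following sketch runs a stratified train/validation/test split on an artificial data frame and prints the size and positive-class share of each part. The column names x1, x2 and y and the 90/10 class balance are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Artificial imbalanced data: about 10% of the rows have y == 1 (assumption)
df = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "y": rng.choice([0, 1], size=1000, p=[0.9, 0.1]),
})

tst_size = 0.15
vld_size = 0.15

# Split off the validation set first, stratified on the target
X_rest, X_valid, y_rest, y_valid = train_test_split(
    df.drop(columns="y"), df["y"],
    test_size=vld_size, stratify=df["y"], random_state=13903)

# Split the remainder; rescale so the test set is about 15% of all rows
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest,
    test_size=tst_size / (1 - vld_size), stratify=y_rest, random_state=13903)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

The share of y == 1 should be close to the overall share in all three parts. Without stratify it can drift noticeably when the minority class is small.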
Updating @tangy's answer from above to the current version of scikit-learn: 0.23.2 (StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit
n_splits = 1 # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
