I have a program that uses KFold with LogisticRegressionCV. I set a seed and use it in both KFold and LogisticRegressionCV. Even with the seed set, I get different values for all my metrics every time I re-run the kernel. Here is the code:
rs = random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.1,
random_state=42)
kf = KFold(n_splits=15, shuffle=False, random_state=42)
logistic = LogisticRegressionCV(Cs=2, fit_intercept=True, cv=kf, verbose =1, random_state=42)
logistic.fit(X_train, y_train)
print("Train Coefficient:" , logistic.coef_) #weights of each feature
print("Train Intercept:" , logistic.intercept_) #value of intercept
print("\n \n \n ")
logistic.predict(X_test)
test_precision = metrics.precision_score(y_test, logistic.predict(X_test))
test_avg_precision = metrics.average_precision_score(y_test, logistic.predict(X_test))
What could be the reason for this, and is there a simple solution?
According to the scikit-learn documentation: Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
However, the seed might only fix the random state of the folds and not the shuffling itself. Try setting shuffle=False and see if you still get different results.
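One way to check whether the splitter and the model are the source of the variation is to run the whole pipeline twice with the same seed and compare the results. The sketch below is only an illustration: it uses synthetic data from make_classification in place of X_smt and y_smt, a hypothetical helper fit_once, and shuffle=True so that random_state actually matters. If this reproduces but your own run does not, the randomness is probably coming from an unseeded step earlier in the pipeline (for example, whatever produced X_smt).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold, train_test_split


def fit_once(seed=42):
    """Hypothetical helper: build data, split, and fit with one fixed seed."""
    # Synthetic stand-in for the asker's X_smt / y_smt (an assumption).
    X, y = make_classification(n_samples=300, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    # random_state only influences the folds when shuffle=True.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    model = LogisticRegressionCV(Cs=2, cv=kf, random_state=seed, max_iter=1000)
    model.fit(X_train, y_train)
    return model.coef_.copy(), model.score(X_test, y_test)


coef_a, acc_a = fit_once()
coef_b, acc_b = fit_once()

# True when both runs give the same coefficients and the same test accuracy.
runs_match = bool(np.allclose(coef_a, coef_b)) and acc_a == acc_b
print("Runs identical:", runs_match)
```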
How can I implement a "repeated" holdout method? I implemented the holdout method and get an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I see many code examples for repeated methods, but only for k-fold cross-validation, nothing for the holdout method.
So to use a repeated holdout you could use the ShuffleSplit method from sklearn. A minimum working example (following the name conventions that you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on; this can be replaced by your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result*100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples that is held out as the test set. In this case 75% of the samples go to the train set and 25% go to the test set. For reproducible results you can set random_state (any number suffices, as long as you use the same number consistently).
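If you only need the 30 accuracies and a summary of them, the same repeated holdout can be written more compactly by passing the ShuffleSplit object as the cv argument of cross_val_score. This is a sketch on artificial data; the sample size and max_iter value are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Artificial data; replace with your own X and Y.
X, Y = make_classification(n_samples=200, random_state=0)

# 30 independent random 75% / 25% train/test splits.
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)

# One accuracy per repetition of the holdout.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, Y, cv=rs)

print("Repetitions: %d" % len(scores))
print("Mean accuracy: %.2f%% (std %.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Reporting the mean together with the standard deviation gives a rough idea of how much a single holdout estimate depends on the particular split.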
cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this.
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy obtained with LogisticRegressionCV?
Also, I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
For me, there might be some points to take into consideration:
Cross-validation is generally used whenever you need to simulate a validation set (for instance, when the training set is not big enough to be divided into training, validation and test sets), and it only uses training data. In your case you are computing the accuracy of the model on test data, which makes it impossible to compare the results exactly.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at this snippet, you'll see that's what happens indeed:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would suggest to have a look here, too, so as to get the details of lrcv.scores_[1].mean(axis=0).max().
Finally, to get the same results with cross_val_score you should write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154
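On the question of spread: if what you are after is the standard deviation of the cross-validated accuracy across folds (rather than standard errors of the coefficients, which scikit-learn's logistic regression does not report), it can be read from the same scores_ attribute used above. The sketch below assumes a binary problem, so the per-fold scores sit under the key 1 as in the snippet above; the train_test_split seed is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lrcv = LogisticRegressionCV(
    Cs=10, cv=5, scoring='accuracy', solver='liblinear', random_state=42
)
lrcv.fit(X_train, y_train)

# For a binary problem, scores_[1] has shape (n_folds, n_Cs).
fold_scores = lrcv.scores_[1]
mean_per_C = fold_scores.mean(axis=0)
best = mean_per_C.argmax()

cv_mean = mean_per_C[best]           # cross-validated accuracy (training data only)
cv_std = fold_scores[:, best].std()  # spread of that accuracy across the folds
test_acc = lrcv.score(X_test, y_test)  # accuracy on the held-out test set

print("Best C: %g" % lrcv.Cs_[best])
print("CV accuracy: %.4f (std %.4f)" % (cv_mean, cv_std))
print("Test accuracy: %.4f" % test_acc)
```

The cross-validated accuracy and the test accuracy are computed on different data, so there is no reason to expect them to coincide.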
When I want to evaluate my model with cross-validation, should I perform cross-validation on the original data (data that's not split into train and test) or on the train / test data?
I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting)
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training | validation | test sets.
You fit your model on the training part, and then you check different parameter combinations on the validation part. Finally, when you're sure which classifier/parameter combination obtains the best result on the validation part, you evaluate on the test part to get the final result.
Once you have evaluated on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test sets, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data set is quite large, I recommend the first way; if it is small, the second.
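The two workflows above can be sketched as follows. This is only an illustration under assumptions of my own: synthetic data, a small hypothetical grid of C values (candidate_Cs), and a 60/20/20 split for the first option.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid, for illustration only

# Hold out the test part first; it is touched only once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Option 1: training | validation | test
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
val_scores = {
    C: LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
    for C in candidate_Cs
}
best_C = max(val_scores, key=val_scores.get)
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc_1 = final_model.score(X_test, y_test)

# Option 2: cross-validation on the training part, then one test evaluation
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {'C': candidate_Cs}, cv=5, scoring='accuracy'
)
search.fit(X_rest, y_rest)
test_acc_2 = search.score(X_test, y_test)

print("Option 1: best C=%g, test accuracy=%.3f" % (best_C, test_acc_1))
print("Option 2: best C=%g, test accuracy=%.3f" % (search.best_params_['C'], test_acc_2))
```

In both options the test part plays no role in choosing C, which is what keeps the final number an honest estimate.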
When training a Ridge Classifier, I'm able to perform 10 fold cross validation like so:
clf = linear_model.RidgeClassifier()
n_folds = 10
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
scores
array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])
If I want to perform 10 fold cross validation again, and I use:
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
I end up with the same results.
Thus, it seems the data is being split the same way both times.
Is there a way to randomly partition the data into n_folds every time I perform cross validation?
What you will want to do is create your own instances of the StratifiedKFold object and pass them to the cv argument of cross_val_score. This way you can supply a different random seed each time the data is split.
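from sklearn import linear_model
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf = linear_model.RidgeClassifier()
for n in range(5):
    # shuffle=True is required for random_state to change the folds
    strat_k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=n)
    scores = cross_val_score(clf, X_train, y_train, cv=strat_k_fold)
    print(scores)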
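To see that the seed really changes the partition, you can compare the test indices of the first fold for different seeds. A small sketch on artificial data follows; first_test_fold is a hypothetical helper, and the seed only has an effect because shuffle=True.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, random_state=0)


def first_test_fold(seed):
    """Hypothetical helper: test indices of the first of 10 shuffled folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    train_idx, test_idx = next(skf.split(X, y))
    return test_idx


# Same seed: identical fold. Different seed: a different fold.
same_seed_matches = np.array_equal(first_test_fold(0), first_test_fold(0))
different_seed_matches = np.array_equal(first_test_fold(0), first_test_fold(1))

print("Same seed gives same fold:", same_seed_matches)
print("Different seed gives same fold:", different_seed_matches)
```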
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y,
test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
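The effect is easy to reproduce on synthetic data. The sketch below uses the current sklearn.model_selection API rather than the 0.16 cross_validation module from the question, and made-up data: a single feature sorted in increasing order, with a noisy linear target. With consecutive folds each test fold covers only a narrow range of the target, so the R² scores are much lower than with shuffled folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up ordered data: x is sorted, y is linear in x plus noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=500))
y = 2.0 * x + rng.normal(scale=3.0, size=500)
X = x.reshape(-1, 1)

model = LinearRegression()

# Consecutive folds (what cv=5 does for a regressor).
ordered_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))

# Shuffled folds.
shuffled_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

print("Consecutive folds R^2:", np.round(ordered_scores, 3))
print("Shuffled folds R^2:   ", np.round(shuffled_scores, 3))
```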
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.