Why can't I match LGBM's cv score?

I'm unable to match LGBM's cv score by hand.
Here's an MCVE:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
folds = KFold(5, random_state=42)
params = {'random_state': 42}
results = lgb.cv(params, lgb.Dataset(X_train, y_train), folds=folds, num_boost_round=1000, early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])
clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))
val_scores = []
for train_idx, val_idx in folds.split(X_train):
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], clf.predict_proba(X_train.iloc[val_idx])[:, 1]))
print('Manual score: ', np.mean(np.array(val_scores)))
I was expecting the two CV scores to be identical - I have set random seeds, and done exactly the same thing. Yet they differ.
Here's the output I get:
LGBM's cv score: 0.9851513530737058
Manual score: 0.9903622177441328
Why? Am I not using LGBM's cv module correctly?

You are splitting X into X_train and X_test.
For cv you split X_train into 5 folds, while manually you split X into 5 folds, i.e. you use more points manually than with cv.
Change results = lgb.cv(params, lgb.Dataset(X_train, y_train), ...) to results = lgb.cv(params, lgb.Dataset(X, y), ...).
Furthermore, there can be differences in parameters. For example, the number of threads used by LightGBM changes the result. During cv the models are fitted in parallel, so the number of threads used might differ from your manual sequential training.
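If the thread count is a concern, a minimal sketch (just an illustration, not part of the original code; num_threads is a standard LightGBM parameter) is to pin it so that lgb.cv and the manual loop use the same parallelism:
# Pin LightGBM to a single thread so lgb.cv and the manual loop are comparable
# (deterministic, but slower).
params = {'random_state': 42, 'num_threads': 1}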
EDIT after 1st correction:
You can achieve the same results using manual splitting / cv using this code:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
folds = KFold(5, random_state=42)
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
}
data_all = lgb.Dataset(X_train, y_train)
results = lgb.cv(params, data_all,
                 folds=folds.split(X_train),
                 num_boost_round=1000,
                 early_stopping_rounds=100)
print('LGBM\'s cv score: ', results['auc-mean'][-1])
val_scores = []
for train_idx, val_idx in folds.split(X_train):
    data_trd = lgb.Dataset(X_train.iloc[train_idx],
                           y_train.iloc[train_idx],
                           reference=data_all)
    gbm = lgb.train(params,
                    data_trd,
                    num_boost_round=len(results['auc-mean']),
                    verbose_eval=100)
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], gbm.predict(X_train.iloc[val_idx])))
print('Manual score: ', np.mean(np.array(val_scores)))
yields
LGBM's cv score: 0.9914524426410262
Manual score: 0.9914524426410262
What makes the difference is the line reference=data_all. During cv, the binning of the variables (see the LightGBM docs) is constructed using the whole dataset (X_train), while in your manual for loop it was built on the training subset (X_train.iloc[train_idx]). By passing a reference to the Dataset containing all the data, LightGBM reuses the same binning, giving the same results.

Related

cross validation for split test and train datasets

Unlike the standard setup, my dataset comes already split into separate train, test1 and test2 sets. I implemented ML algorithms and got performance metrics, but when I apply cross validation it gets complicated. Maybe someone can help me. Thank you.
Here's my code:
train = pd.read_csv('train-alldata.csv',sep=";")
test = pd.read_csv('test1-alldata.csv',sep=";")
test2 = pd.read_csv('test2-alldata.csv',sep=";")
X_train = train_pca_son.drop('churn_yn',axis=1)
y_train = train_pca_son['churn_yn']
X_test = test_pca_son.drop('churn_yn',axis=1)
y_test = test_pca_son['churn_yn']
X_test_2 = test2_pca_son.drop('churn_yn',axis=1)
y_test_2 = test2_pca_son['churn_yn']
For example, KNN Classifier.
knn_classifier = KNeighborsClassifier(n_neighbors =7,metric='euclidean')
knn_classifier.fit(X_train, y_train)
For K-Fold.
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
dtc = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits = 5)
scores = cross_val_score(dtc, X, y, cv = k_folds)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
This is a variation on the "holdout test data" pattern (see also: Wikipedia: Training, Validation, Test / Confusion in terminology). For churn prediction, this may arise if you have two types of customers, or are evaluating on two time frames.
X_train, y_train ← perform training and hyperparameter tuning with this
X_test1, y_test1 ← test on this
X_test2, y_test2 ← test on this as well
Cross validation estimates holdout error using the training data; it may come up if you estimate hyperparameters with GridSearchCV (a tuning sketch follows the snippet below). Final evaluation involves estimating performance on the two test sets, separately or averaged over the two:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_test1, X_test2, y_test1, y_test2 = train_test_split(X_test, y_test, test_size=.5)
print(y_train.shape, y_test1.shape, y_test2.shape)
# (600,) (200,) (200,)
clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print(f1_score(y_test1, clf.predict(X_test1)))
print(f1_score(y_test2, clf.predict(X_test2)))
# 0.819
# 0.805
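For the cross-validated tuning step mentioned above, a minimal sketch (illustrative only; it reuses X_train/y_train from the snippet, and the parameter grid is just an example) could look like this:
from sklearn.model_selection import GridSearchCV
# Cross-validate the choice of n_neighbors on the training data only;
# the two test sets stay untouched until the final evaluation.
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 9]}, cv=5, scoring='f1')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
clf = grid.best_estimator_  # then evaluate on X_test1 / X_test2 as above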

Applying Cross validation in Naive bayes

My dataset is a Filipino spam-and-ham message dataset.
I divided my dataset into 60% training, 20% testing and 20% validation.
Split data into training, testing and validation
from sklearn.model_selection import train_test_split
data['label'] = data['label'].replace({'ham': 0, 'spam': 1})
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print('Total: {} rows'.format(data.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))
Train a MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
naive_bayes = MultinomialNB().fit(train_data, y_train)
predictions = naive_bayes.predict(test_data)
Evaluate the Model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score = accuracy_score(y_test, predictions)
precision_score = precision_score(y_test, predictions)
recall_score = recall_score(y_test, predictions)
f1_score = f1_score(y_test, predictions)
My problem is in the validation step. The error says
warnings.warn("Estimator fit failed. The score on this train-test"
This is how I coded my validation; I don't know if I'm doing the right thing:
from sklearn.model_selection import cross_val_score
mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
I did not get any error or warning when running the code below, so maybe it will work for you.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("https://raw.githubusercontent.com/jeffprosise/Machine-Learning/master/Data/ham-spam.csv")
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
x = vectorizer.fit_transform(df['Text'])
y = df['IsSpam']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print('Total: {} rows'.format(df.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))
naive_bayes = MultinomialNB().fit(X_train, y_train)
predictions = naive_bayes.predict(X_test)
accuracy_score = accuracy_score(y_test,predictions)
precision_score = precision_score(y_test, predictions)
recall_score = recall_score(y_test, predictions)
f1_score = f1_score(y_test, predictions)
mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
Result:
Total: 1000 rows
Train: 600 rows
Test: 200 rows
Validation: 200 rows
Cross-validation scores:[1. 0.95 0.85 1. 1. 0.9 0.9 0.8 0.9 0.9 ]
First, it is worth noting that just because it's called cross validation doesn't mean you have to use a validation set, as you have done in your code, to do the cross-validation. There are a number of reasons why you would perform cross validation, which include:
Ensuring that all of your dataset is used in training as well as in evaluating the performance of your model
Performing hyperparameter tuning.
Hence, your case here leans toward the first use case. As such you don't need to first perform a split into train, val and test. Instead you can perform the 10-fold cross validation on your entire dataset.
If you are doing hyperparameter tuning, then you can have a hold-out set of, say, 30% and use the remaining 70% for cross validation. Once the best parameters have been determined, you can then use the hold-out set to evaluate the model with those parameters, as sketched below.
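A rough sketch of that second workflow (illustrative only; it reuses the vectorized x and y from the snippet above, and the alpha grid is just an example, not something from the original post):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# 30% hold-out for the final evaluation, 70% for cross-validated tuning.
X_cv, X_hold, y_cv, y_hold = train_test_split(x, y, test_size=0.3, random_state=1)
# Tune MultinomialNB's smoothing parameter with 10-fold CV on the 70% split.
grid = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 0.5, 1.0]}, cv=10, scoring='accuracy')
grid.fit(X_cv, y_cv)
# Evaluate the tuned model once on the untouched hold-out set.
print(grid.best_params_)
print('Hold-out accuracy:', accuracy_score(y_hold, grid.best_estimator_.predict(X_hold)))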
Some refs:
https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79
https://www.analyticsvidhya.com/blog/2021/11/top-7-cross-validation-techniques-with-python-code/
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

How to evaluate the effect of different methods of handling missing values?

I am a total beginner and I am trying to compare different methods of handling missing data. In order to evaluate the effect of each method (drop rows with missing values, drop columns with missingness over 40%, impute with the mean, impute with KNN), I compare the LDA accuracy and LogReg accuracy on the training set between a dataset with 10% missing values and one with 20% missing values, against the results on the original complete dataset. Unfortunately, I get pretty much the same results, even between the complete dataset and the dataset with 20% missingness. I don't know what I am doing wrong.
from numpy import nan
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#dataset = read_csv('telecom_churn_rev10.csv')
dataset = read_csv('telecom_churn_rev20.csv')
dataset = dataset.replace(nan, 0)
values = dataset.values
X = values[:,1:11]
y = values[:,0]
dataset.fillna(dataset.mean(), inplace=True)
#dataset.fillna(dataset.mode(), inplace=True)
print(dataset.isnull().sum())
imputer = SimpleImputer(missing_values = nan, strategy = 'mean')
transformed_values = imputer.fit_transform(X)
print('Missing: %d' % isnan(transformed_values).sum())
model = LinearDiscriminantAnalysis()
cv = KFold(n_splits = 3, shuffle = True, random_state = 1)
result = cross_val_score(model, X, y, cv = cv, scoring = 'accuracy')
print('Accuracy: %.3f' % result.mean())
#print('Accuracy: %.3f' % result.mode())
print(dataset.describe())
print(dataset.head(20))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test,y_pred)
from sklearn import metrics
# make predictions on X
expected = y
predicted = classifier.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
# make predictions on X test
expected = y_test
predicted = classifier.predict(X_test)
# summarize the fit of the model
print(metrics.confusion_matrix(expected, predicted))
print(metrics.classification_report(expected, predicted))
You replace all your missing values with 0 on this line: dataset = dataset.replace(nan, 0). After this line you have a full dataset without missing values, so the .fillna() and the SimpleImputer() are useless after that point.
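A rough sketch of how the comparison could be set up so the imputation actually has an effect (illustrative only, not the original code: it drops the replace(nan, 0) line, keeps the NaNs until the imputer runs, and wraps the imputer in a Pipeline so it is fitted inside each CV fold):
from pandas import read_csv
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
dataset = read_csv('telecom_churn_rev20.csv')  # keep the missing values; do not replace them with 0
X = dataset.values[:, 1:11]
y = dataset.values[:, 0]
cv = KFold(n_splits=3, shuffle=True, random_state=1)
for imputer in (SimpleImputer(strategy='mean'), KNNImputer()):
    # The pipeline fits the imputer on each training fold only, avoiding leakage.
    model = make_pipeline(imputer, LinearDiscriminantAnalysis())
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(type(imputer).__name__, 'accuracy: %.3f' % scores.mean())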

Why Score from LinearRegression is giving different result than r2_score from sklearn.metrics

Ideally I should get the same result, since score is nothing but R-squared. But I'm not sure why the results come out different.
from sklearn.datasets import california_housing
data = california_housing.fetch_california_housing()
data.data.shape
data.feature_names
data.target_names
import pandas as pd
house_data = pd.DataFrame(data.data, columns=data.feature_names)
house_data.describe()
house_data['Price'] = data.target
X = house_data.iloc[:, 0:8].values
y = house_data.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
#Check R-square on training data
from sklearn.metrics import mean_squared_error, r2_score
y_pred = linear_model.predict(X_test)
print(linear_model.score(X_test, y_test))
print(r2_score(y_pred, y_test))
Output
0.5957643114594776
0.34460597952465033
From the docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
sklearn.metrics.r2_score(y_true, y_pred,...)
You are passing y_true and y_pred the wrong way around. If you switch them you get the correct result.
print(linear_model.score(X_test, y_test))
print(r2_score(y_test, y_pred))
0.5957643114594777
0.5957643114594777
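The asymmetry comes from the definition R^2 = 1 - SS_res / SS_tot, where SS_tot is the total sum of squares of the first argument (y_true), so swapping the arguments changes the denominator. A quick illustration (toy numbers, not from the question):
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
print(r2_score(y_true, y_pred))  # denominator uses the variance of y_true
print(r2_score(y_pred, y_true))  # swapped arguments generally give a different value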

Difference between sklearn KFold with and without using shuffle

I am working with KFold using sklearn version 0.22. It has a parameter shuffle.
According to the documentation
shuffle boolean, optional Whether to shuffle the data before splitting
into batches.
I ran a simple comparison of using KFold with shuffle set to False (default) and True:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics
X, y = load_digits(return_X_y=True)
def run_nfold(X, y, classifier, scorer, cv, n_repeats):
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))
    return results
classifier = SGDClassifier(loss='hinge', penalty='elasticnet', fit_intercept=True)
scorer = metrics.accuracy_score
n_splits = 5
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))
kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))
produces
KFold mean = 0.9119255648406066
KFold Shuffled mean = 0.9505304859176724
Using Kolmogorov-Smirnov test:
from scipy.stats import ks_2samp
print('Compare KFold with KFold shuffled results')
ks_2samp(results_kf, results_kf_shuffle)
shows that the default non-shuffled KFold produces statistically significantly lower results than the shuffled KFold:
Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)
I don't understand the difference between the shuffled and non-shuffled results. Why does shuffling change the distribution of the output so drastically?
