Scoring multiple output variables with scikit-learn

I am writing a regressor that is supposed to output two continuous variables, a and b.
The problem is: when you use cross_val_score from scikit-learn to evaluate the performance, you get, by default, a single score averaged across the output variables. I want a score for each one; specifically, I want the R2 measure of both a and b.
I haven't been able to find out how to do this yet. Can anyone help?
Reproducible code below:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import ElasticNet
from sklearn.neural_network import MLPRegressor
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings("ignore")
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
def trainModel():
    # Array to store scores
    nested_scores = np.zeros(NUM_TRIALS)
    # Loop over trials
    for i in range(NUM_TRIALS):
        # Choose the inner and outer cross-validation splitters
        inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
        outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
        # Hyper-parameter search in the inner cross-validation
        clf = GridSearchCV(estimator=reg, param_grid=p_grid, cv=inner_cv)
        clf.fit(X_values, y_values)
        # Generalization score from the outer cross-validation
        nested_score = cross_val_score(clf, X=X_values, y=y_values, cv=outer_cv, scoring='r2')
        nested_scores[i] = nested_score.mean()
        print("nested_score", nested_score)
    print("R2:", nested_scores)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 6)))
df.columns=['a','b','c','d','e','f']
#PRE-PROCESSING
NUM_COLUMNS = df.shape[1]
X_values = np.array(df.iloc[:,[0,1,2,3]])
y_values = np.array(df.iloc[:,[NUM_COLUMNS-2,NUM_COLUMNS-1]])
print("pre-processing done!")
#MODEL TRAINING
NUM_TRIALS = 10
#ELASTIC NET
print("\nELASTIC NET")
p_grid = {"alpha": [0.2, 0.5, 1, 1.5, 2, 3],
"l1_ratio": [0.2, 0.3, 0.4, 0.5, 1]}
reg = ElasticNet(random_state=0)
trainModel()
#NEURAL NETWORK
print("\nNEURAL NETWORK")
p_grid = {"alpha": [0.2, 0.5, 1, 1.5, 2, 3],
"hidden_layer_sizes": list(range(1,6))}
reg = MLPRegressor(solver='lbfgs', random_state=1)
trainModel()
So basically I have two y-values, and I want the R2 statistic for each of the variables instead of one statistic across both. Let me know if you have any questions.
Dummy data:
X_values (the four input variables)
y_values (the two output variables)
Output for 10 trials of 4-fold cross-validation for the Elastic Net and neural network models; the last line, prefixed "R2:", shows the average over the folds for each trial.
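To make the goal concrete, here is a rough sketch of the kind of per-output scoring I mean: r2_score accepts multioutput='raw_values', which returns one R2 per output column instead of an average, so applying it to cross-validated predictions gives a score per variable (illustrative only, reusing the names from the code above):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

# out-of-fold predictions for both outputs at once (reg is the ElasticNet above)
cv = KFold(n_splits=4, shuffle=True, random_state=0)
y_pred = cross_val_predict(reg, X_values, y_values, cv=cv)

# 'raw_values' gives one R2 per output column instead of the default average
r2_a, r2_b = r2_score(y_values, y_pred, multioutput='raw_values')
print("R2 for a:", r2_a)
print("R2 for b:", r2_b)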

Related

Train/test score discrepancy with data size

I'm trying to apply ML to atomic structures using descriptors. My problem is that I get very different score values depending on the data size I use; I suspect something is wrong with my model, and any suggestions would be appreciated. I used the dataset from this paper (Dataset MoS2(single)).
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ase
from dscribe.descriptors import SOAP
from dscribe.descriptors import CoulombMatrix
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from ase.io import read
materials = read('structures.xyz', index=':')
materials = materials[:5000]
energies = pd.read_csv('Energy.csv')
energies = np.array(energies['b'])
energies = energies[:5000]
species = ["H", 'Mo', 'S']
rcut = 8.0
nmax = 1
lmax = 1
# Setting up the SOAP descriptor
soap = SOAP(
    species=species,
    periodic=False,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
)
# note: despite the variable name, these are SOAP descriptors, not Coulomb matrices
coulomb_matrices = soap.create(materials, positions=[[51]] * len(materials))
nsamples, nx, ny = coulomb_matrices.shape
d2_train_dataset = coulomb_matrices.reshape((nsamples, nx * ny))
df = pd.DataFrame(d2_train_dataset)
df['target'] = energies
from sklearn.preprocessing import StandardScaler
X = df.iloc[:, 0:12].values
y = df.iloc[:, 12:].values
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(X)
y = st_y.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# krr = GridSearchCV(
#     KernelRidge(kernel="rbf", gamma=0.1),
#     param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3], "gamma": np.logspace(-2, 2, 5)},
# )
svr = GridSearchCV(
    SVR(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
)
svr = svr.fit(X_train, y_train.ravel())
print("Training set score: {:.4f}".format(svr.score(X_train, y_train)))
print("Test set score: {:.4f}".format(svr.score(X_test, y_test)))
and score:
Training set score: 0.0414
Test set score: 0.9126
I don't have a full answer to your problem, as recreating it would be very cumbersome, but here are some things to check:
a) You are training on 5 cross-validation folds (the default). First, check the results of all parameter combinations right after fitting with svr.best_score_ (or in more detail with the svr.cv_results_ dict) and see what mean score your folds actually produced. If the score really is as low as 0.04 (I assume higher is better, as is usual for these scores), taking the reciprocal prediction would actually be really accurate! If you know you're always wrong, it's really easy to be right. ;D
b) You could go ahead and use svr.best_params_ to train again on the whole X_train set instead of the folds (this can also be achieved with the refit option of GridSearchCV or RandomizedSearchCV), and then check against the test set again. The actual error could also be here: the documentation for the score method of GridSearchCV reads "Return the score on the given data, if the estimator has been refit." This is not the case in your grid search! Try turning the refit option on. Maybe that works? ... sorry, your code was too cumbersome to replicate quickly, so I didn't check myself ...
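A minimal sketch of the checks in a) and b), assuming svr is the fitted GridSearchCV from the question:
# a) inspect what the grid search actually found during cross-validation
print(svr.best_params_)
print(svr.best_score_)  # mean CV score of the best parameter combination

# b) refit the best parameters on the whole training set, then score the test set
from sklearn.svm import SVR
best = SVR(kernel="rbf", **svr.best_params_)
best.fit(X_train, y_train.ravel())
print("Test set score: {:.4f}".format(best.score(X_test, y_test)))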

Increasing SVM model accuracy using hyperparameters

I'm building a language detection model based on how the letters appear within a single word rather than how words appear in a sentence, so the model is expected to predict the language of one word at a time. The languages in my training data are English, Hungarian and Latin, and the data is arranged so that every word is on its own line.
I first tried to accomplish this with logistic regression and the accuracy was around 80%, so I then tried an SVM. After that I did a grid search, and the results weren't significantly improved.
I need the accuracy to be around 96%. Should I consider a different model, or can this be improved?
You can find the data files [Here][1] and the notebook at [Google_Colab_notebook][2].
Here are the confusion matrices: first SVM [SVM][3], first logistic regression [LR][4], optimized SVM [Optimized SVM][5].
import string
import re
import codecs
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn import linear_model
from sklearn import pipeline
from sklearn import feature_extraction
from sklearn.model_selection import train_test_split
from google.colab import drive
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix
import time
from sklearn.utils import resample
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
#load the drive
drive.mount('/content/gdrive/',force_remount=True)
#load raw training files
hun_df = pd.read_csv("/content/gdrive/MyDrive/language_detection_files/hun_words.txt", encoding="utf-8", header=None, names=["Hungarian"])
eng_df = pd.read_csv("/content/gdrive/MyDrive/language_detection_files/eng_words.txt", encoding="utf-8", header=None, names=["English"])
lat_df = pd.read_csv("/content/gdrive/MyDrive/language_detection_files/lat_words.txt", encoding="utf-8", header=None, names=["Latin"])
#regular expression to clean the data
string_pattern = r'\W|\d+|[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]'
regex_pattern = re.compile(string_pattern) # compile string pattern to re.Pattern object
#natural language processing toolkit
!pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('hungarian'))
#cleaning and putting the words into a list
data_hun=[]
lang_hun=[]
for i, line in hun_df.iterrows():
    line = line['Hungarian']
    line = line.lower()
    line = re.sub(regex_pattern, '', line)
    if line not in stop_words and 1 < len(line) < 14:  # trying to downsample to balance the classes
        data_hun.append(line)
        lang_hun.append("Hungarian")  # len(data_hun) is 254211
stop_words = set(stopwords.words('english'))
data_eng=[]
lang_eng=[]
for i, line in eng_df.iterrows():
    line = line['English']
    line = line.lower()
    line = re.sub(regex_pattern, '', line)  # consider compiling the regex in a different cell
    if line not in stop_words and 1 < len(line) < 14:  # trying to downsample to balance the classes
        data_eng.append(line)
        lang_eng.append("English")  # len(data_eng) 336610
data_lat=[]
lang_lat=[]
for i, line in lat_df.iterrows():
    line = line['Latin']
    line = line.lower()
    line = re.sub(regex_pattern, '', line)
    if 1 < len(line) < 14:
        data_lat.append(line)
        lang_lat.append("Latin")  # len(data_lat) 185249
number_of_samples=10000
#downsample and balance data
lat = resample(data_lat,
               replace=True,
               n_samples=number_of_samples,
               random_state=42)
hun = resample(data_hun,
               replace=True,
               n_samples=number_of_samples,
               random_state=42)
eng = resample(data_eng,
               replace=True,
               n_samples=number_of_samples,
               random_state=42)
lang_lat = lang_lat[0:number_of_samples]
lang_hun = lang_hun[0:number_of_samples]
lang_eng = lang_eng[0:number_of_samples]
df = pd.DataFrame({"Text": eng + hun + lat,
                   "Language": lang_eng + lang_hun + lang_lat})
print(df.shape)
# splitting the data to train and test
X, y = df.iloc[:, 0], df.iloc[:, 1]
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# vectorize the cleaned text: character n-grams (1-5) are the features
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 5), analyzer='char')
LR_pipeline = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', linear_model.LogisticRegression(solver='newton-cg', max_iter=1000))
    # other available solvers: 'newton-cg', 'lbfgs', 'sag', 'saga'
])
# logistic regression model: prepare and fit
start = time.time()
LR_pipeline.fit(X_train,y_train)
end = round(time.time()-start,2)
end
y_predicted=LR_pipeline.predict(X_test)
acc=(metrics.accuracy_score(y_test,y_predicted))*100
print(acc,'%')
# logistic regression model metrics
print(classification_report(y_test, y_predicted))
#prepare data for the SVM model and fit it
Tfidf_vect = feature_extraction.text.TfidfVectorizer(ngram_range=(1,5),analyzer='char')
Tfidf_vect.fit(df.iloc[:,0])
X_train_Tfidf = Tfidf_vect.transform(X_train)
X_test_Tfidf = Tfidf_vect.transform(X_test)
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', verbose=True)
start = time.time()
SVM.fit(X_train_Tfidf, y_train)
end = round(time.time() - start, 2)
end
y_predicted=SVM.predict(X_test_Tfidf)
acc=(metrics.accuracy_score(y_test,y_predicted))*100
print(acc,'%')
#SVM classification metrics
print(classification_report(y_test, y_predicted))
# The two confusion matrices side by side
plot_confusion_matrix(SVM, X_test_Tfidf, y_test)
plot_confusion_matrix(LR_pipeline, X_test, y_test)
plt.show()
# defining parameter ranges to perform a grid search and optimize hyperparameters
# (note: gamma is ignored by the linear kernel)
param_grid = {'C': [100, 1000],  # [0.1, 1, 10, 100, 1000]
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
# fitting the model for the grid search
grid.fit(X_train_Tfidf, y_train)
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
# fitting the model for the grid search
grid.fit(X_train_Tfidf, y_train)
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
# fitting the model for the grid search
grid.fit(X_train_Tfidf, y_train)
# print best parameter after tuning
print(grid.best_params_)
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
'''
best parameters
{'C': 1, 'gamma': 1, 'kernel': 'linear'}
SVC(C=1, gamma=1, kernel='linear')
'''
SVM_opt = svm.SVC(C=1, kernel='linear', degree=3, gamma=1, verbose=True)
start = time.time()
SVM_opt.fit(X_train_Tfidf, y_train)
end = round(time.time() - start, 2)
end
y_predicted=SVM_opt.predict(X_test_Tfidf)
acc=(metrics.accuracy_score(y_test,y_predicted))*100
print(acc,'%')
# metrics and confusion matrix
print(classification_report(y_test, y_predicted))
plot_confusion_matrix(SVM_opt, X_test_Tfidf, y_test)
plt.show()
[1]: https://drive.google.com/drive/folders/1nKINXfDG4IbEw4YlUzGTaRpJh4CFwGlC?usp=sharing
[2]: https://colab.research.google.com/drive/1FJNd7vXcHnrNZYz9aM56U1TalhVBCvO7?usp=sharing
[3]: https://i.stack.imgur.com/SeDh7.png
[4]: https://i.stack.imgur.com/Ksb7m.png
[5]: https://i.stack.imgur.com/Iw35J.png

Why is my MSE so high when the difference between test and prediction values are so close?

In Python, I have built a small multiple linear regression model to explain house prices in different areas based on other variables (all of which are percentages multiplied by 100), such as the percentage of people with bachelor degrees or the percentage of people who work from home in an area. I have done this in R and it works fine, but I am new to ML in Python. I have shown the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data, where avgincome, PctSingleDetached and PctDrivetoWork are X, and AvgHousingPrice is Y.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
sample data:
avgincome PctSingleDetached PctDrivetoWork AvgHousingPrice
0 44388.0 61.528497 81.151832 448954
1 40650.0 54.372197 77.882798 349758
2 43350.0 68.393782 79.553265 428740
X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # replaces missing values with the column mean
# Fit connects the imputer to our matrix of features; transform applies the replacement
imputer.fit(X[:, :])  # the slice selects every column here
X[:, :] = imputer.transform(X[:, :])
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
# X = np.array(ct.fit_transform(X))
print(X)
print(Y)
## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)
### Feature Scaling ###
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  # standardization: (x - mean) / std for each feature
X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])
# fit learns the scaling parameters, transform applies them; fit_transform does both
X_test[:, 0:] = sc.transform(X_test[:, 0:])
print(X_train)
print(X_test)
## Training ##
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()  # plain ordinary least squares; it does not select variables for you
regressor.fit(X_train, Y_train)
### Predicting Test Set results ###
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # display numerical values with only 2 digits after the decimal point
print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), axis=1))  # predictions next to the true values, one pair per row
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, y_pred)
print(mse)
OUTPUT:
[[489066.76 300334. ]
[227458.2 200352. ]
[928249.59 946729. ]
[339032.27 350116. ]
[689668.21 600322. ]
[489179.58 577936. ]]
...
...
MSE = 2375985640.8102403
You can calculate the MSE yourself to check whether something is wrong. In my opinion the result you obtained is coherent. Anyway, I built a simple my_mse function to check the result output by sklearn, using your example data:
from sklearn.metrics import mean_squared_error
list_ = [[489066.76, 300334.],
         [227458.2, 200352.],
         [928249.59, 946729.],
         [339032.27, 350116.],
         [689668.21, 600322.],
         [489179.58, 577936.]]
y_true = [y[0] for y in list_]
y_pred = [y[1] for y in list_]
mse = mean_squared_error(y_true, y_pred)
print(mse)
# 8779930962.14985

def my_mse(y_true, y_pred):
    diff = 0
    for couple in zip(y_true, y_pred):
        diff += pow(couple[0] - couple[1], 2)
    return diff / len(y_true)

print(my_mse(y_true, y_pred))
# 8779930962.14985
Remember that MSE is the mean squared error: each error is squared before being summed, so the result is in squared target units and will be large when the target is in the hundreds of thousands.
If you are asking whether your model is bad or good, that depends on the main objective. Anyway, I think your model is performing poorly because it is a linear model; a model with more capacity could handle the problem and produce better results.
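For a quick sense of scale, take the square root: the RMSE is in the same units as the house prices (a minimal check; the MSE value below is copied from the question):
import numpy as np

mse = 2375985640.8102403  # value reported in the question
rmse = np.sqrt(mse)
print(rmse)  # ~48744, i.e. a typical error of roughly $49k on prices of several hundred thousand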

How to remove outliers

I'm working on a regression problem with 10 independent variables, using SVR. Despite doing feature selection and tuning the SVR parameters with grid search, I get a huge MAPE of 15%. So I'm trying to remove outliers, but after removing them I cannot split the data. My question is: do outliers affect the accuracy of regression?
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import Normalizer
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
import pandas as pd
from sklearn import preprocessing
features=pd.read_csv('selectedData.csv')
target = features['SYSLoad']
features= features.drop('SYSLoad', axis = 1)
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(features))
print(z)
threshold = 3
print(np.where(z > 3))
features2 = features[(z < 3).all(axis=1)]
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(features2, target, test_size = 0.25, random_state = 42)
While executing the code above I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [33352, 35064]
You get the error because, while your target variable is of equal length with features (presumably 35064) due to:
target = features['SYSLoad']
your features2 variable is of lesser length (presumably 33352), i.e. it is a subset of features, due to:
features2 = features[(z < 3).all(axis=1)]
and your train_test_split justifiably complains that the lengths of your features & labels are not equal.
So, you should also subset your target accordingly, and use this target2 in your train_test_split:
target2 = target[(z < 3).all(axis=1)]
train_input, test_input, train_target, test_target = train_test_split(features2, target2, test_size = 0.25, random_state = 42)
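Equivalently, you can compute the mask once, which guarantees that features and target are filtered by exactly the same rows (a small variant of the same fix):
mask = (z < 3).all(axis=1)  # rows where no feature is more than 3 standard deviations out
features2, target2 = features[mask], target[mask]
train_input, test_input, train_target, test_target = train_test_split(
    features2, target2, test_size=0.25, random_state=42)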

Confusion Matrix for Leave-One-Out Cross Validation in sklearn

I know how to draw a confusion matrix when I use the train/test split from sklearn, but I do not know how to create one when using leave-one-out cross-validation, as in this example:
# Evaluate using Leave One Out Cross Validation
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
How should I create the confusion matrix for LOOCV in order to visualize the per-class accuracy?
Borrowing your method from here, you can work around the problem by creating a custom scorer that receives the metadata during the iterations. That metadata can be used to find the F1 score, precision, recall and accuracy, as well as the confusion matrix!
We need one more trick here: GridSearchCV accepts a custom scorer, so here we go.
Here is an example that you can adapt further to your exact requirements:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Your method from the link you provided
def cm_analysis(y_true, y_pred, labels, ymap=None, figsize=(10, 10)):
    if ymap is not None:
        y_pred = [ymap[yi] for yi in y_pred]
        y_true = [ymap[yi] for yi in y_true]
        labels = [ymap[yi] for yi in labels]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_sum = np.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = cm_sum[i]
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=labels, columns=labels)
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(cm, annot=annot, fmt='', ax=ax)
    # plt.savefig(filename)
    plt.show()
# Custom scorer
def my_scorer(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    # You can either save y_true, y_pred and the accuracy to a file
    # for later use with the info in clf.cv_results_,
    # or plot the confusion matrix right here!
    # For labels, you could create a class attribute to make this more dynamic,
    # i.e. it would change automatically with every new dataset.
    cm_analysis(y_true, y_pred, labels=[0, 1], ymap=None, figsize=(10, 10))
    # N.B. as long as you have y_true and y_pred from every round here, you can
    # compute any metric you want, such as F1 score, precision, recall,
    # accuracy, and the confusion matrix!
    return acc
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
array = df.values
X = np.array(array[:,0:8])
Y = np.array(array[:,8]).astype(int)
# I'll use two folds just so the result fits here!
num_folds = 2
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0)  # shuffle=True so random_state takes effect
# this is just a trick because the list contains
# the default parameter only (i.e. useless)
param_grid = {'C': [1.0]}
model = LogisticRegression()
# create custom scorer
custom_scorer = make_scorer(my_scorer)
# pass it to the GridSearchCV
clf = GridSearchCV(model, param_grid, scoring=custom_scorer, cv=skf, return_train_score=True)
# Fit and Go
clf.fit(X,Y)
# cv_results_ is a dict with all CV results during the iterations!
# IDK, you may need it to combine its content with the metrics ..etc
print(clf.cv_results_)
Result
{'mean_score_time': array([0.09023476]), 'split0_train_score':
array([0.79166667]), 'mean_train_score': array([0.77864583]),
'params': [{'C': 1.0}], 'std_test_score': array([0.01953125]),
'mean_fit_time': array([0.00235796]),
'param_C': masked_array(data=[1.0], mask=[False], fill_value='?',
dtype=object), 'rank_test_score': array([1], dtype=int32),
'split1_test_score': array([0.7734375]),
'std_fit_time': array([0.00032902]), 'mean_test_score': array([0.75390625]),
'std_score_time': array([0.00237632]), 'split1_train_score': array([0.765625]),
'split0_test_score': array([0.734375]), 'std_train_score': array([0.01302083])}
[Confusion-matrix plot for split 0]
[Confusion-matrix plot for split 1]
EDIT
If you strictly want LOOCV, you can apply it in the above code: just replace StratifiedKFold with the LeaveOneOut function. But bear in mind that LeaveOneOut iterates once per sample (768 times for this dataset), so it's computationally very expensive. However, it would give you the confusion matrix in detail during the iterations (i.e. the metadata), as sketched below.
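A sketch of that drop-in replacement, reusing model, param_grid and custom_scorer from the code above:
from sklearn.model_selection import LeaveOneOut

# each iteration holds out exactly one sample, so my_scorer is called once per sample
clf = GridSearchCV(model, param_grid, scoring=custom_scorer, cv=LeaveOneOut(), return_train_score=True)
clf.fit(X, Y)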
Nevertheless, if you are seeking the confusion matrix of the overall (i.e. final) process, you will still need GridSearchCV, but as follows:
......
loocv = LeaveOneOut()
clf = GridSearchCV(model, param_grid, scoring='accuracy', cv=loocv)
clf.fit(X,Y)
y_pred = clf.best_estimator_.predict(X)
cm_analysis(Y, y_pred, labels=[0, 1], ymap=None, figsize=(10,10))
Result: [overall confusion-matrix plot]
