Alternative to Using Repeated Stratified K Fold with Multiple Outputs? - python

I am exploring how many features would be best to use for my models. I understand that RepeatedStratifiedKFold requires a 1D target array, but I am trying to evaluate the number of features for a target that has multiple outputs. Is there a way to use RepeatedStratifiedKFold with multiple outputs, or is there an alternative that accomplishes what I need?
from sklearn import datasets
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, KFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
def get_models():
    models = dict()
    for i in range(4, 20):
        rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=i)
        model = DecisionTreeClassifier()
        models[str(i)] = Pipeline(steps=[('s', rfe), ('m', model)])
    return models
from sklearn.utils.multiclass import type_of_target
x = imp_data.iloc[:,:34]
y = imp_data.iloc[:,39]
model = DecisionTreeClassifier()
def evaluate_model(model, x, y):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
    scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
models = get_models()
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, x, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))

As far as I know, you can use cross_validate() as an alternative to cross_val_score for a model with multiple outputs. You can define the cross-validation technique and the scoring metrics as you prefer. You can check the link below for more detail; a sketch follows it.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
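Below is a minimal sketch (not from the original answer) of calling cross_validate with a 2D target; the synthetic multilabel data and the plain shuffled KFold are assumptions, since stratified splitters need a 1D y:
from numpy import mean, std
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_validate, KFold
from sklearn.tree import DecisionTreeClassifier
# DecisionTreeClassifier natively supports multi-output targets
X, Y = make_multilabel_classification(n_samples=200, n_features=20, n_classes=3, random_state=0)  # Y is 2D
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # stratification needs a 1D y, so plain KFold here
res = cross_validate(DecisionTreeClassifier(), X, Y, cv=cv)  # default scorer = subset accuracy for multilabel Y
print('%.3f (%.3f)' % (mean(res['test_score']), std(res['test_score'])))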

Related

Adding feature scaling to a nested cross-validation with randomized search and recursive feature elimination

I have a classification task and want to use a repeated nested cross-validation to simultaneously perform hyperparameter tuning and feature selection. For this, I am running RandomizedSearchCV on RFECV using Python's sklearn library, as suggested in this SO answer.
However, I additionally need to scale my features and impute some missing values first. Those two steps should also be included in the CV framework to avoid information leakage between training and test folds. I tried to create a Pipeline to get there, but I think it "destroys" my CV-nesting (i.e., performs the RFECV and random search separately from each other):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
# create example data with missings
Xtrain, ytrain = make_classification(n_samples=500,
                                     n_features=150,
                                     n_informative=25,
                                     n_redundant=125,
                                     random_state=1897)
c = 10000 # number of missings
Xtrain.ravel()[np.random.choice(Xtrain.size, c, replace = False)] = np.nan # introduce random missings
folds = 5
repeats = 5
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897)
n_iter = 100
scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
sgdc = SGDClassifier(loss = 'log', penalty = 'elasticnet', class_weight = 'balanced', random_state = 1897)
sel = RFECV(sgdc, cv = folds)
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold, scoring = 'accuracy', random_state = 1897, verbose = 1, n_jobs = -1)
rskfold_search.fit(Xtrain, ytrain)
Does anyone know how to include scaling and imputation into the CV framework without losing the nesting of my RandomizedSearchCV and RFECV?
Any help is highly appreciated!
You haven't lost the nested cv.
You have a search object at the top level; when you call fit, it splits the data into multiple folds. Let's focus on one such training fold. Your pipeline gets fitted on it, so the data is scaled and imputed, then RFECV splits the transformed data into inner folds. Finally a new estimator is fitted on the outer training fold and scored on the outer testing fold.
That means the RFE is getting perhaps a little leakage, since scaling and imputing happen before its splits. You can add them in a pipeline before the estimator, and use that pipeline as the RFE estimator. And since RFECV refits its estimator using the discovered optimal number of features and exposes that for predict and so on, you don't really need the second copy of sgdc; using just the one copy has the side effect of hyperparameter-tuning the selection as well:
scl = StandardScaler()
imp = KNNImputer(n_neighbors=5, weights='uniform')
sgdc = SGDClassifier(loss='log', penalty='elasticnet', class_weight='balanced', random_state=1897)
base_pipe = Pipeline([
    ('scaler', scl),
    ('imputer', imp),
    ('clf', sgdc),
])
# the pipeline itself exposes no coef_, so tell RFECV where to find the coefficients
sel = RFECV(base_pipe, cv=folds, importance_getter='named_steps.clf.coef_')
param_rand = {'estimator__clf__l1_ratio': stats.uniform(0, 1),
              'estimator__clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(sel, param_rand, n_iter=n_iter, cv=rskfold, scoring='accuracy', random_state=1897, verbose=1, n_jobs=-1)
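As a brief follow-up sketch (assuming the objects above are defined as shown), after fitting you can read off the tuned hyperparameters and the number of features the refit RFECV kept:
rskfold_search.fit(Xtrain, ytrain)
print(rskfold_search.best_params_)                 # tuned l1_ratio and alpha
print(rskfold_search.best_estimator_.n_features_)  # number of features kept by the refit RFECV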

Newbie : How to evaluate a model to increase its classification accuracy

Given my data, how do I increase the accuracy of the model when some of my models produce results like the ones below?
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6780893042575286
Random Forest Classifier : Accuracy: 0.6780893042575286
There are several ways to achieve this:
Look at the data. Is it in the best shape for the algorithm, e.g. regarding NaNs, covariance and so on? Is it normalized, and are the categorical features encoded well? This question is too far-reaching for a forum.
Look at the problem and the different algorithms suitable for it. Maybe
Logistic Regression
SVM
XGBoost
....
Try hyperparameter tuning with RandomizedSearchCV or GridSearchCV.
This is quite high-level.
In terms of model selection, you can use a function like the one below to find a good model that suits the problem.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.utils import class_weight
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import pandas as pd  # used to collect the cross-validation results
def multi_model(X_train, y_train, X_test, y_test):
    """Function to determine the best model architecture."""
    dfs = []
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC()),
        ('GNB', GaussianNB()),
        ('XGB', XGBClassifier(eval_metric="error"))
    ]
    results = []
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    target_names = ['App_Status_1', 'App_Status_2']
    for name, model in models:
        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred, target_names=target_names))
        results.append(cv_results)
        names.append(name)
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)
    final = pd.concat(dfs, ignore_index=True)
    return final
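For reference, a hedged example of how the helper might be called; X_train, X_test, y_train, y_test are assumed to come from your own train/test split:
results_df = multi_model(X_train, y_train, X_test, y_test)
print(results_df.groupby('model')['test_accuracy'].mean())  # mean CV accuracy per model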
After model selection, you can do hyperparameter tuning, which will further increase the model's performance; see the sketch below.
If you want to improve the model further, you can apply techniques like data augmentation and revisit the cleaning phase of your data.
If, after all that, it still doesn't improve, you could try collecting more data or refocusing the problem statement.
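A minimal sketch of such tuning for the decision tree above; the grid values are illustrative assumptions, not recommendations, and X_train / y_train are taken from the question:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)  # best settings and their CV accuracy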

How do I incorporate SelectKBest in an SKlearn pipeline

I'm trying to build a text classifier with sklearn. The idea is to:
Vectorize the training corpus using TfidfVectorizer
Select the 20,000 best resulting features (or all features if fewer than 20,000 result) using SelectKBest
Feed these features into a Logistic Regression classifier
I've set it up successfully as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(df_train["input"])
selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
selector.fit(x_train, df_train["label"].values)
x_train = selector.transform(x_train)
classifier = LogisticRegression()
classifier.fit(x_train, df_train["label"])
I would now like to wrap all this up into a pipeline and share that pipeline so others can use it on their own text data. However, I can't figure out how to get SelectKBest to achieve the same behavior as above, i.e. to accept min(20000, n_features from the vectorizer output) as k. If I simply leave it as k=20000, as below, the pipeline doesn't work (it throws an error) when fitting new corpora with fewer than 20k vectorized features.
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("selector", SelectKBest(f_classif, k=20000)),
    ("clf", LogisticRegression())])
As @vivek kumar pointed out, you need to override the _check_params method of SelectKBest and add your logic to it, as shown below:
class MySelectKBest(SelectKBest):
    def _check_params(self, X, y):
        if self.k >= X.shape[1]:
            warnings.warn("Less than %d number of features found, so setting k as %d" % (self.k, X.shape[1]),
                          UserWarning)
            self.k = X.shape[1]
        if not (self.k == "all" or 0 <= self.k):
            raise ValueError("k should be >=0, <= n_features = %d; got %r. "
                             "Use k='all' to return all features."
                             % (X.shape[1], self.k))
I have also added a warning in case the number of features found is less than the threshold set. Now let's see a working example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import warnings
categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target
# newsgroups result in 47K odd features after performing TFIDF vectorizer
# Case 1: When K < No. of features - the regular case
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("selector", MySelectKBest(f_classif, k=30000)),
    ("clf", LogisticRegression())])
pipe.fit(newsgroups.data, y_true)
pipe.score(newsgroups.data, y_true)
#0.968
# Case 2: When K > No. of features - the one with an issue
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("selector", MySelectKBest(f_classif, k=50000)),
    ("clf", LogisticRegression())])
pipe.fit(newsgroups.data, y_true)
UserWarning: Less than 50000 number of features found, so setting k as 47407
pipe.score(newsgroups.data, y_true)
#0.9792

Failure to generate an AUC result for a classification model using Python 3.7 in Jupyter notebook

Here is the error message I get when the Python code below is run. The code came from a machine learning ebook and I simply altered it to match my data set: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
# Cross Validation Classification ROC AUC
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
df = read_csv('Diabetes_Classification.csv')
array = df.values
X = array[:,0:14]
Y = array[:,14]
kfold = KFold(n_splits=10, random_state=7)
model = LogisticRegression(solver='liblinear')
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

Cross validation with multiple parameters using f1-score

I am trying to do feature selection using SelectKBest and to find the best tree depth for binary classification, using the f1-score. I have created a scorer function to select the best features and to evaluate the grid search. An error of "__call__() missing 1 required positional argument: 'y_true'" pops up when the classifier tries to fit the training data.
#Define scorer
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f1_scorer, k=all)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
#initialize a grid search with features to be optimized
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
#order best selected features into a single variable
selector = SelectKBest(score_func=f1_scorer, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
On the fit line I get a TypeError: __call__() missing 1 required positional argument: 'y_true'.
The problem is in the score_func you have used for SelectKBest. score_func is a function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores; in your code you fed it the callable f1_scorer, which instead takes y_true and y_pred and computes the f1 score. You can use one of chi2, f_classif or mutual_info_classif as your score_func for a classification task. Also, there is a minor bug in the k parameter for SelectKBest: it should have been "all" instead of all. I have modified your code to incorporate these changes:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import f_classif
from sklearn.metrics import f1_score, make_scorer
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_classes=2,
                           n_informative=4, weights=[0.7, 0.3],
                           random_state=0)
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f_classif)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
gs.best_params_
OUTPUT
{'dt__max_depth': 6, 'kbest__k': 9}
Also modify your last two lines as below:
selector = SelectKBest(score_func=f_classif, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
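As a hedged follow-up, not part of the original answer: because refit=True, the fitted search already holds a pipeline trained with the best parameters, so you can also take the selected features or a test-set score directly from it instead of refitting SelectKBest:
best_pipe = gs.best_estimator_  # pipeline refit on X_train with the best params
X_new = best_pipe.named_steps['kbest'].transform(X_train)  # features kept by the tuned SelectKBest
print(X_new.shape, gs.score(X_test, y_test))  # f1 on the held-out test set (scoring=f1_scorer)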
