I am using a OneVsRestClassifier wrapped around an MLPClassifier to classify text data, where X is a set of questions in a pd.DataFrame and Y holds multi-label, multi-class strings. Please see the code snippets below:
text_clf = Pipeline([('scale',StandardScaler(with_mean=False)),('clf',OneVsRestClassifier(MLPClassifier(learning_rate = 'adaptive', solver = 'lbfgs',random_state=9000)))])
parameters = {'clf__alpha':[10.0 ** ~ np.arange(1, 7).any()],'clf__hidden_layer_sizes': [(100,),(50,)],'clf__max_iter': [1000,500],'clf__activation':('relu','tanh')}
grid = GridSearchCV(text_clf, parameters, cv=3, n_jobs=-1, scoring= 'accuracy')
with parallel_backend('threading'):
    grid.fit(X,Y)
I am getting the following error
ValueError: Invalid parameter activation for estimator OneVsRestClassifier(estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='adaptive',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=9000, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False),
n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.
As I understand it, MLPClassifier supports multi-label classification. Does the error indicate that the parameters need to be re-examined? If so, can anybody give me a clue about where to change them?
Any help would really be appreciated.
Your MLPClassifier is nested inside OneVsRestClassifier as its estimator.
In other words, the parameter names must indicate that alpha, hidden_layer_sizes, and so on belong to the nested estimator, not to OneVsRestClassifier itself.
Changing your parameters as follows should do the job:
parameters = {'clf__estimator__alpha':[10.0 ** ~ np.arange(1,7).any()],
              'clf__estimator__hidden_layer_sizes': [(100,),(50,)],
              'clf__estimator__max_iter': [1000,500],
              'clf__estimator__activation':('relu','tanh')}
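One way to check which names a grid accepts is to list the pipeline's parameters, as the error message suggests. The following is a minimal sketch, assuming a pipeline shaped like the one in the question; the variable names param_names and nested_names are only illustrative.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same nesting as in the question: Pipeline -> OneVsRestClassifier -> MLPClassifier
text_clf = Pipeline([
    ('scale', StandardScaler(with_mean=False)),
    ('clf', OneVsRestClassifier(MLPClassifier(solver='lbfgs', random_state=9000))),
])

# Every key returned here is a valid name for a GridSearchCV parameter grid
param_names = sorted(text_clf.get_params().keys())

# Parameters of the nested MLPClassifier carry the 'clf__estimator__' prefix
nested_names = [name for name in param_names if name.startswith('clf__estimator__')]
for name in nested_names:
    print(name)
```

Names such as clf__alpha do not appear in this list, which is why the grid search rejects them.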
Related
I'm trying to use GridSearchCV to optimise the hyperparameters in a custom model built with Keras. My code so far:
https://pastebin.com/ujYJf67c#9suyZ8vM
The model definition:
def build_nn_model(n, hyperparameters, loss, metrics, opt):
    model = keras.Sequential([
        keras.layers.Dense(hyperparameters[0], activation=hyperparameters[1],  # number of outputs to next layer
                           input_shape=[n]),  # number of features
        keras.layers.Dense(hyperparameters[2], activation=hyperparameters[3]),
        keras.layers.Dense(hyperparameters[4], activation=hyperparameters[5]),
        keras.layers.Dense(1)  # 1 output (redshift)
    ])
    model.compile(loss=loss,
                  optimizer=opt,
                  metrics=metrics)
    return model
and the grid search:
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
epochs = [10, 50, 100]
param_grid = dict(epochs=epochs, optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', n_jobs=-1, refit='boolean')
grid_result = grid.fit(X_train, y_train)
throws an error:
TypeError: Cannot clone object '<keras.engine.sequential.Sequential object at 0x0000028B8C50C0D0>' (type <class 'keras.engine.sequential.Sequential'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method.
How can I get GridSearchCV to play nicely with the model as it's defined?
I'm assuming you are training a classifier, so you have to wrap it in KerasClassifier:
from scikeras.wrappers import KerasClassifier
...
model = KerasClassifier(build_nn_model)
# Do grid search
Remember to provide for each of build_nn_model's parameters either a default value or a grid in GridSearchCV.
For a regression model use KerasRegressor instead.
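To see why a wrapper is needed at all, it may help to reproduce the cloning step that GridSearchCV performs. This is only a sketch: PlainModel and WrappedModel are hypothetical stand-ins (no Keras involved), used to show that sklearn's clone requires a get_params method, which BaseEstimator provides.

```python
from sklearn.base import BaseEstimator, clone


class PlainModel:
    """Hypothetical stand-in for an object without get_params (e.g. a bare Keras model)."""

    def __init__(self, units=8):
        self.units = units


class WrappedModel(BaseEstimator):
    """BaseEstimator derives get_params/set_params from the __init__ signature."""

    def __init__(self, units=8):
        self.units = units


# GridSearchCV clones the estimator for every parameter combination and fold
try:
    clone(PlainModel())
    plain_is_clonable = True
except TypeError as err:
    plain_is_clonable = False
    print("clone failed:", err)

# An object that implements get_params can be cloned with its parameters intact
wrapped_copy = clone(WrappedModel(units=16))
print(wrapped_copy.get_params())
```

A wrapper such as KerasClassifier plays the role of WrappedModel here: it exposes the build function's arguments through get_params so that the grid search can copy and reconfigure the model.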
I am using the sklearn RandomForestClassifier as my classifier. I could not figure out how to evaluate overfitting and underfitting for sklearn models.
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
model.fit(X_train, y_train)
Currently, I am using other tools to evaluate my model, such as cross_val_score, confusion_matrix, classification_report, and PermutationImportance. Could someone please help me with this?
There are multiple ways to test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate. According to the documentation, it returns a dictionary with the test scores for the metrics you supply, plus the train scores if you pass return_train_score=True.
Sample code:
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare the score on the training data with the score on the held-out test data.
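As a rough illustration of the train/test comparison, the sketch below runs cross_validate on synthetic data. The dataset from make_classification, the smaller forest, and the names train_mean, test_mean and gap are assumptions made for the example, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data stands in for the asker's X and y
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1)
cv_dict = cross_validate(model, X, y, cv=5, return_train_score=True)

train_mean = cv_dict['train_score'].mean()
test_mean = cv_dict['test_score'].mean()
gap = train_mean - test_mean

# A train score far above the test score hints at overfitting;
# low scores on both hint at underfitting.
print("train accuracy: {:.3f}".format(train_mean))
print("test accuracy:  {:.3f}".format(test_mean))
print("gap:            {:.3f}".format(gap))
```

How large a gap counts as overfitting depends on the problem, so the numbers are best read relative to other models or settings tried on the same data.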
I am trying to cross-validate a k-NN classifier, and I am confused about which of the following two methods conducts cross-validation correctly.
training_scores = defaultdict(list)
validation_f1_scores = defaultdict(list)
validation_precision_scores = defaultdict(list)
validation_recall_scores = defaultdict(list)
validation_scores = defaultdict(list)
def model_1(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)
    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    scores = model_selection.cross_validate(model, X, Y, cv=kfold, scoring=scoring, return_train_score=True)
    print(scores['train_accuracy'])
    training_scores['KNeighbour'].append(scores['train_accuracy'])
    print(scores['test_f1_macro'])
    validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
    print(scores['test_precision_macro'])
    validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
    print(scores['test_recall_macro'])
    validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
    print(scores['test_accuracy'])
    validation_scores['KNeighbour'].append(scores['test_accuracy'])
    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    # rest of print statements
It seems that for loop in the second model is redundant.
def model_2(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)
    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    for train, test in kfold.split(X, Y):
        scores = model_selection.cross_validate(model, X[train], Y[train], cv=kfold, scoring=scoring, return_train_score=True)
        print(scores['train_accuracy'])
        training_scores['KNeighbour'].append(scores['train_accuracy'])
        print(scores['test_f1_macro'])
        validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
        print(scores['test_precision_macro'])
        validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
        print(scores['test_recall_macro'])
        validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
        print(scores['test_accuracy'])
        validation_scores['KNeighbour'].append(scores['test_accuracy'])
    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    # rest of print statements
I am using StratifiedKFold, and I am not sure whether I need the for loop as in the model_2 function, or whether cross_validate already uses the split, since we pass cv=kfold as an argument.
I am not calling the fit method; is this OK? Does cross_validate call it automatically, or do I need to call fit before calling cross_validate?
Finally, how can I create a confusion matrix? Do I need to create it for each fold? If yes, how can the final/average confusion matrix be calculated?
The documentation is arguably your best friend in such questions; from the simple example there it should be apparent that you should use neither a for loop nor a call to fit. Adapting the example to use KFold as you do:
from sklearn.model_selection import KFold, cross_validate
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
model = DecisionTreeRegressor()
scoring=('r2', 'neg_mean_squared_error')
cv_results = cross_validate(model, X, y, cv=kf, scoring=scoring, return_train_score=False)
cv_results
Result:
{'fit_time': array([0.00901461, 0.00563478, 0.00539804, 0.00529385, 0.00638533]),
'score_time': array([0.00132656, 0.00214362, 0.00134897, 0.00134444, 0.00176597]),
'test_neg_mean_squared_error': array([-11.15872549, -30.1549505 , -25.51841584, -16.39346535,
-15.63425743]),
'test_r2': array([0.7765484 , 0.68106786, 0.73327311, 0.83008371, 0.79572363])}
how can I create confusion matrix? Do I need to create it for each fold
No one can tell you if you need to create a confusion matrix for each fold - it is your choice. If you choose to do so, it may be better to skip cross_validate and do the procedure "manually" - see my answer in How to display confusion matrix and report (recall, precision, fmeasure) for each cross validation fold.
if yes, how can the final/average confusion matrix be calculated?
There is no "final/average" confusion matrix; if you want to calculate anything further than the k ones (one for each k-fold) as described in the linked answer, you need to have available a separate validation set...
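The "manual" procedure with one confusion matrix per fold could look roughly like the sketch below. It assumes the iris dataset as a stand-in for the asker's data, and the names fold_matrices and labels are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
labels = [0, 1, 2]  # fix the label order so every matrix has the same layout

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_matrices = []

for train_idx, test_idx in kfold.split(X, y):
    # A fresh model per fold, fitted on the training part only
    model = KNeighborsClassifier(n_neighbors=13)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_matrices.append(confusion_matrix(y[test_idx], y_pred, labels=labels))

for fold, matrix in enumerate(fold_matrices):
    print("Fold {}:".format(fold))
    print(matrix)
```

Each matrix describes only the samples held out in that fold, so the five matrices together cover every sample exactly once.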
model_1 is correct.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
cross_validate(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score='warn', return_estimator=False, error_score='raise-deprecating')
where
estimator is an object implementing ‘fit’. It will be called to fit the model on the train folds.
cv is a cross-validation generator that is used to generate train and test splits.
If you follow the example in the sklearn docs
cv_results = cross_validate(lasso, X, y, cv=3, return_train_score=False)
cv_results['test_score']
array([0.33150734, 0.08022311, 0.03531764])
You can see that the lasso model is fitted 3 times, once for each fold on the train splits, and validated 3 times on the test splits. The scores reported are those obtained on the held-out test splits.
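The "fitted once per fold" behaviour can be made visible by asking cross_validate to return the fitted estimators. This is a sketch in the spirit of the docs example, assuming the bundled diabetes dataset; the name fitted is illustrative.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()

# return_estimator=True keeps the model fitted on each training split
cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)
fitted = cv_results['estimator']

print("number of fitted models:", len(fitted))
print("test scores:", cv_results['test_score'])

# cross_validate works on clones, so the original object stays unfitted
print("original lasso fitted:", hasattr(lasso, 'coef_'))
```

This also shows why no explicit call to fit is needed beforehand: the fitting happens on copies of the estimator inside cross_validate.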
Cross validation of Keras models
Keras provides a wrapper that makes Keras models compatible with sklearn's cross_validate function. You have to wrap the Keras model using KerasClassifier:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_validate
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
def get_model():
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=get_model, epochs=10, batch_size=8, verbose=0)
kf = KFold(n_splits=3, shuffle=True)
X = np.random.rand(10, 2)
y = np.random.randint(0, 2, size=10)  # binary labels, as binary_crossentropy expects
cv_results = cross_validate(model, X, y, cv=kf, return_train_score=False)
print(cv_results)
In the example below,
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(), is this the correct way to apply it to test set as well?
Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.
What happens can be described as follows:
Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: the scaler is fitted on the TRAINING data
Step 2: the scaler transforms TRAINING data
Step 3: the models are fitted/trained using the transformed TRAINING data
Step 4: the scaler is used to transform the TEST data
Step 5: the trained models predict using the transformed TEST data
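The steps above can be sketched by hand for a single split and compared with what the Pipeline does. This is a minimal illustration, assuming the iris dataset and one KFold split; the variable names are not from the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 0: take one train/test split, as the cv parameter would produce
train_idx, test_idx = next(KFold(n_splits=3, shuffle=True, random_state=0).split(X))

scaler = StandardScaler().fit(X[train_idx])            # Step 1: fit scaler on TRAINING data
X_train_scaled = scaler.transform(X[train_idx])        # Step 2: transform TRAINING data
clf = SVC(kernel='linear', C=1).fit(X_train_scaled, y[train_idx])  # Step 3: train the model
X_test_scaled = scaler.transform(X[test_idx])          # Step 4: transform TEST data
manual_pred = clf.predict(X_test_scaled)               # Step 5: predict on TEST data

# The same split handled by a Pipeline
pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC(kernel='linear', C=1))])
pipe_pred = pipe.fit(X[train_idx], y[train_idx]).predict(X[test_idx])

print("same predictions:", np.array_equal(manual_pred, pipe_pred))
```

The key point is that the scaler never sees the test part of the split when it computes its mean and variance.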
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure and the best_params_ describes the combination of parameters that achieved the best results.
IMPORTANT EDIT 1: if you want to hold out a validation dataset from the original dataset, use this:
from sklearn.model_selection import train_test_split
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
Quick answer: Your methodology is correct.
Although the above answer is very good, I just would like to point out some subtleties:
best_score_ [1] is the best cross-validation metric, and not the generalization performance of the model [2]. To evaluate how well the best found parameters generalize, you should call score on the test set, as you've done. Therefore, you need to start by splitting the data into training and test sets, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
Deep Dive:
A threefold split of data into training set, validation set and test set is one way to prevent overfitting in the parameters during grid search. On the other hand, GridSearchCV uses Cross-Validation in the training set, instead of having both training and validation set, but this does not replace the test set. This can be verified in [2] and [3].
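The difference between the cross-validation score and the test-set score can be illustrated with a small sketch. It assumes the iris dataset and a reduced parameter grid, both chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
param_grid = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['rbf', 'linear']}

# Cross-validation happens inside the training set only
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)

cv_score = grid.best_score_                # best mean cross-validation score
test_score = grid.score(X_test, y_test)    # estimate of generalization performance

print("best params:", grid.best_params_)
print("cv score:   {:.3f}".format(cv_score))
print("test score: {:.3f}".format(test_score))
```

Because refit is enabled by default, grid.score uses the best pipeline refitted on the whole training set, so the scaler statistics still come from training data only.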
References:
[1] GridSearchCV
[2] Introduction to Machine Learning with Python
[3] 3.1 Cross-validation: evaluating estimator performance
I am learning how to develop a backpropagation neural network using scikit-learn. I am still confused about how to implement k-fold cross-validation for my neural network. I hope you can help me out. My code is as follows:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
f = open("seeds_dataset.txt")
data = np.loadtxt(f)
X=data[:,0:]
y=data[:,-1]
kf = KFold(n_splits=10)
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False,
epsilon=1e-08, hidden_layer_sizes=(5, 2), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
Do not split your data into train and test. This is automatically handled by the KFold cross-validation.
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
kf = KFold(n_splits=10)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
for train_indices, test_indices in kf.split(X):
    clf.fit(X[train_indices], y[train_indices])
    print(clf.score(X[test_indices], y[test_indices]))
KFold cross-validation partitions your dataset into k folds of (nearly) equal size. Each fold is used once as the test set while the remaining k-1 folds form the training set. Averaging the k scores gives a more reliable measure of your model's accuracy than a single train/test split, since every sample is used for testing exactly once.
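To make the splitting concrete, the sketch below prints the indices KFold produces for a tiny array and counts how often each sample lands in a test fold. The toy array and the name test_counts are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
kf = KFold(n_splits=5)

# Count how many times each sample is used for testing
test_counts = np.zeros(len(X), dtype=int)
fold_sizes = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("fold", fold, "train:", train_idx, "test:", test_idx)

print("times each sample was tested:", test_counts)
```

With 10 samples and 5 splits, every fold trains on 8 samples and tests on the remaining 2.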
In case you are looking for a built-in method to do this, you can take a look at cross_validate.
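from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
# scoring takes a metric name (or a scorer callable), not the bound model.score method
cv_results = cross_validate(model, X, y, cv=10,
                            return_train_score=False,
                            scoring='accuracy')
print("Test scores: {}".format(cv_results['test_score']))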
The thing I like about this approach is that it gives you access to the fit_time, score_time, and test_score. It also allows you to supply your choice of scoring metrics and cross-validation generator/iterable (e.g. KFold). Another good resource is Cross Validation.
Kudos to @COLDSPEED's answer.
If you'd like to have the predictions from n-fold cross-validation, cross_val_predict() is the way to go.
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
# Scramble the data frame and keep the first 80% for train + validation, leaving the rest for testing
df = df.sample(frac=1).reset_index(drop=True)
train_index = 0.8
df_train = df.iloc[: int(len(df) * train_index)]  # slice bounds must be integers
# convert dataframe to ndarray, since kf.split returns ndarray indices
feature = df_train.iloc[:, 0: -1].values
target = df_train.iloc[:, -1].values
solver = MLPClassifier(activation='relu', solver='adam', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1, verbose=True)
y_pred = cross_val_predict(solver, feature, target, cv = 10)
Basically, the cv option sets the number of cross-validation folds. y_pred has the same size as target.
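A self-contained version of the same idea is sketched below, assuming the iris dataset in place of the data frame and adding a StandardScaler step (an addition for this example, not part of the answer above) to help the small network converge.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline keeps it fitted on the training folds only
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2),
                  random_state=1, max_iter=500),
)

# Each prediction comes from a model that did not see that sample while fitting
y_pred = cross_val_predict(model, X, y, cv=10)

print("predictions:", y_pred.shape, "targets:", y.shape)
```

Since every sample appears in exactly one test fold, the result lines up element by element with the targets and can be passed directly to metrics such as confusion_matrix.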