Checking for Overfitting and Underfitting in sklearn models - python

I am using the sklearn RandomForestClassifier as my classifier. I could not figure out how to evaluate overfitting and underfitting for sklearn models.
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
model.fit(X_train, y_train)
Currently, I am using other metrics to evaluate my model, like cross_val_score, confusion_matrix, classification_report, and PermutationImportance. Could someone please help me with this?

There are multiple ways you can test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate. If you read the documentation, you will see that it returns a dictionary with train scores (if you pass return_train_score=True) and test scores for the metrics that you supply.
Sample code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
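With the returned dictionary you can then compare the average train and test scores (a minimal sketch, assuming the cv_dict from the sample code above):
import numpy as np

# cross_validate returns arrays of per-fold scores
mean_train = np.mean(cv_dict['train_score'])
mean_test = np.mean(cv_dict['test_score'])
print(f"train: {mean_train:.3f}  test: {mean_test:.3f}")
# train >> test suggests overfitting; low scores on both suggest underfitting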
You can also simply create a hold-out test set with train_test_split and compare your training and test scores on it.
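For example, a minimal sketch of that hold-out comparison (assuming X and y are already defined):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=1000, random_state=1)
model.fit(X_train, y_train)
# score() returns mean accuracy for a classifier; compare train vs. test
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))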

Related

Scoring metric for multi-class cross validation

I have a DataFrame X with a column called target that has 10 different labels: [0,1,2,3,4,5,6,7,8,9]. I have a machine learning model, let's say model=AdaBoostClassifier(), that I would like to fit to the data and then use to predict the labels again, training it through a cross-validation process. I use two metrics for the cross-validation, accuracy and neg_mean_squared_error, to evaluate the performance and compute the ratio neg_mean_squared_error/accuracy. The lines are like:
model.seed = 42
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring=('accuracy', 'neg_mean_squared_error')
scores = cross_validate(model, X.drop(target,axis=1), X[target], cv=outer_cv, n_jobs=-1, scoring=scoring)
scores = abs(np.sqrt(np.mean(scores['test_neg_mean_squared_error'])*-1))/np.mean(scores['test_accuracy'])
score_description = [model,'{model}'.format(model=model.__class__.__name__),"%0.5f" % scores]
However, whenever I run it, I get the following error message:
ValueError: Samplewise metrics are not available outside of multilabel classification.
How can I solve this issue with the metrics and perform the corresponding classification? Which metrics could I use to evaluate the performance of the model in the multi-class case?

Should Cross Validation Score be performed on original or split data?

When I want to evaluate my model with cross-validation, should I perform cross-validation on the original data (data that is not split into train and test) or on the train / test data?
I know that training data is used for fitting the model and testing data for evaluating it. If I use cross-validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do it like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv=5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless.
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting).
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do final evaluation, I recommend you to split your data into training|val|test.
You fit your model using the training part, and then you check different parameter combinations on the val part. Finally, when you're sure which classifier/parameters obtain the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way, they split their data into training and test, and they finetune their model using cross-validation on the training part and at the end they evaluate it on the test part.
If your data is quite large, I recommend the first way, but if your data is small, the second (sketched below).
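A minimal sketch of the second workflow (tuning with cross-validation on the training part, then a single final evaluation on the test part), assuming the features and results variables from the question and a hypothetical param_grid:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10]}  # hypothetical grid, for illustration only
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)  # cross-validation happens on the training part only
print("best params:", search.best_params_)
print("test accuracy:", search.score(x_test, y_test))  # evaluate once on the held-out test part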

fit randomForest model with cross validation

I want to cross-validate a random forest model. I did this, but I don't know how to fit it:
classifier= RandomForestClassifier(n_estimators=100, random_state=0)
from sklearn.model_selection import cross_val_score
val = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
Now, to fit the model with cross validation, shall I do
val.fit(X, y)
You don't need to fit it yourself; cross_val_score does its own fitting internally, and you should already have the scores of the 5 folds in val.
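If you also want a single fitted model to use for predictions afterwards, fit the classifier separately after inspecting the scores (a hedged sketch, reusing the classifier and training data from the question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
val = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)  # fits internal copies per fold
print(val, val.mean())
classifier.fit(X_train, y_train)  # fit once on the full training data to get a usable model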

Python: Evaluating an Isolation Forest

I am doing isolation forest clustering on the mulcross database with 2 classes. I divide my data into a training and a test set and try to calculate the accuracy score, the roc_auc_score and the confusion_matrix on my test set. But there are two problems: the first one is that in a clustering method I should not use the labels in the training phase, meaning that y_train should not be used, but I did not find another way to evaluate my model. Moreover, the results I get are wrong.
My problem is how to evaluate a clustering model like isolation forest.
Here is my code:
df = pd.read_csv('db.csv')
y_true=df['Target']
df_data=df.drop('Target',1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
alg=IsolationForest(n_estimators=100, max_samples= 256 , contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0, behaviour="new")
model = alg.fit(X_train, y_train)
preds = alg.predict(X_test)
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")
I do not understand why you are clustering and then dividing the data into training/testing sets. It seems to me like you are mixing classification and clustering. If you have labels, try a supervised method; easy wins are xgboost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want small, well-separated clusters. You can also look at a metric called the silhouette score.
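A minimal sketch of that last suggestion, assuming the X_test and preds variables from the question's code:
from sklearn.metrics import silhouette_score

# silhouette ranges from -1 to 1; higher means denser, better-separated groups
print("silhouette:", silhouette_score(X_test, preds))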
You can also try
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])
also, look here for some more details.

How to run SVC classifier after running 10-fold cross validation in sklearn?

I'm relatively new to machine learning and would like some help with the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
def get_scores(features, target, classifier):
    X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                         test_size=0.3)
    scores = cross_val_score(
        classifier,
        X_train,
        y_train,
        cv=10,
        scoring='accuracy',
        n_jobs=-1)
    return scores
get_scores(features_from_df,target_from_df,svm.SVC())
Now, how can I use my classifier (after running the 10-fold CV) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross-validation process.
I noticed that sklearn has cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html Should I replace my cross_val_score with cross_val_predict? Just FYI: my target data column is binarized (it has values of 0s and 1s).
If my approach is wrong, please advise me on the best way to proceed.
Thanks!
You only need your X and y; you do not need to split them into train and test sets.
Then you can pass your classifier (in your case the SVM) to the cross_val_score function to get the accuracy for each fold.
In just 3 lines of code:
from sklearn import svm
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
