fit randomForest model with cross validation - python

I want to cross-validate a random forest model. I did the following, but I don't know how to fit it:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
val = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
Now, to fit the model with cross validation, should I do this?
val.fit(X, y)

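You don't need to call fit on val. cross_val_score fits a fresh copy of the classifier on each of the 5 folds internally and returns only the scores, so val is already a NumPy array holding the 5 fold scores and has no fit method. If you also want a fitted model, call classifier.fit(X_train, y_train) separately.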
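A minimal sketch of what the scores look like and where the explicit fit goes. The synthetic data from make_classification is an assumption standing in for the asker's X_train and y_train:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption for illustration)
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

classifier = RandomForestClassifier(n_estimators=100, random_state=0)

# cross_val_score clones and fits the classifier once per fold
val = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
print(type(val).__name__, val.shape)  # ndarray (5,)
print(val.mean(), val.std())

# The original classifier object is still unfitted here, so fit it explicitly
classifier.fit(X_train, y_train)
```

The mean and standard deviation of val are the usual way to summarize the folds.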

Related

K-fold Cross Validation Implementation

I am currently using an LGBM regressor model to predict unit sales of products across different stores, and I want to perform K-fold cross validation.
I am using the following code:
import lightgbm as lgb
from sklearn.model_selection import KFold, cross_val_score
model_lgb_base = lgb.LGBMRegressor(objective='regression')
model_lgb_base.fit(x_train, y_train)
# 10 consecutive folds; random_state only applies when shuffle=True, so it is omitted
cv = KFold(n_splits=10, shuffle=False)
accuracy = cross_val_score(estimator=model_lgb_base, X=x_train, y=y_train, cv=cv, scoring='r2')
print(accuracy)
accuracy.std()
# Predicting the test set results
y_pred = model_lgb_base.predict(x_valid)
Output:
[0.77916146 0.78209908 0.78333999 0.78286847 0.7831663 0.78029406
0.78534152 0.77421018 0.78810446 0.77937227]
Am I using this correctly? I want to train the model using 10-fold cross validation and return the best model.
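One possible reading of this workflow: cross validation estimates performance rather than returning a "best" model, so a common pattern is to score with K folds and then fit once on all the training data. A minimal sketch, assuming synthetic data and scikit-learn's GradientBoostingRegressor as a stand-in for the LGBM regressor (both are assumptions, not the asker's setup):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for the sales data (assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Stand-in regressor (assumption); any scikit-learn compatible regressor fits here
model = GradientBoostingRegressor(n_estimators=50, random_state=42)

# Shuffled 10-fold split; random_state is only meaningful with shuffle=True
cv = KFold(n_splits=10, shuffle=True, random_state=42)
r2_scores = cross_val_score(model, x_train, y_train, cv=cv, scoring="r2")
print(r2_scores.mean(), r2_scores.std())

# cross_val_score works on clones, so fit the model itself before predicting
model.fit(x_train, y_train)
y_pred = model.predict(x_valid)
```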

Should Cross Validation Score be performed on original or split data?

When I want to evaluate my model with cross validation, should I perform cross validation on the original data (data that's not split into train and test) or on the train/test data?
I know that training data is used for fitting the model and test data for evaluating it. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)
Or maybe something different?
Both of your approaches are wrong.
In the first one, you apply cross validation to the test set only, which is meaningless: cross_val_score refits on folds of whatever data it is given, so the model you fitted on the training data is never evaluated.
In the second one, you first fit the model on your whole data and then perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting).
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
either with a separate test set,
or with cross validation.
First way (test set):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model on the training part, and then you check different parameter combinations on the val part. Finally, once you're sure which classifier/parameter combination obtains the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way; if your data is small, the second.
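The second way can be sketched with GridSearchCV, which runs the cross-validated parameter search on the training part. This is only an illustration: the synthetic data and the grid over C are assumptions, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grid over the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")

# Tuning uses cross validation on the training part only
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)

# The test part is touched once, after the parameters are fixed
test_accuracy = search.score(x_test, y_test)
print(test_accuracy)
```

With the default refit=True, the search object refits the best parameter combination on the whole training part, so search.score evaluates that refitted model.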

Checking for Overfitting and Underfitting in sklearn models

I am using the sklearn RandomForestClassifier as my classifier. I could not figure out how to evaluate overfitting and underfitting for sklearn models.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
model.fit(X_train, y_train)
Currently, I am using other metrics and tools to evaluate my model, such as cross_val_score, confusion_matrix, classification_report, and PermutationImportance. Could someone please help me with this?
There are multiple ways to test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate. According to the documentation, it returns a dictionary with test scores for the metrics you supply, plus train scores if you pass return_train_score=True.
Sample code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare your scores on the training set and the test set.
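A small sketch of reading the dictionary that cross_validate returns and comparing the two sets of scores. The synthetic data and the smaller forest are assumptions chosen to keep the example quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1)
cv_dict = cross_validate(model, X, y, cv=5, scoring="accuracy", return_train_score=True)

train_mean = cv_dict["train_score"].mean()
test_mean = cv_dict["test_score"].mean()
# A large positive gap hints at overfitting; low scores on both hint at underfitting
print(f"train={train_mean:.3f} test={test_mean:.3f} gap={train_mean - test_mean:.3f}")
```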

How to use SMAPE evaluation metric on train dataset?

I am using the SMAPE (symmetric mean absolute percentage error) evaluation metric.
Formula: https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
import numpy as np
def smape(A, F):
    return 100/len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
I am using the above function to calculate SMAPE.
Now I am trying to evaluate my model with the SMAPE code above, but I don't understand how to use it on the train dataset for evaluation and then predict values for the test dataset.
My code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Train and test data split 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Establish model
model = RandomForestRegressor(n_jobs=-1)
model.fit(X_train, y_train)
Now, how do I use SMAPE with the above random forest regressor? Should I use model.score, i.e. model.score(X_test, y_test), or model.smape(X_test, y_test)?
If I use model.score(X_test, y_test), I get a score of -0.4678402626438. Please suggest how to use the SMAPE metric with my random forest regressor model.
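After model.fit(X_train, y_train), predict on the test set and pass the true and predicted values to your function (model.score returns R², not SMAPE, and there is no model.smape method):
y_pred = model.predict(X_test)
print(smape(y_test, y_pred))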
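If SMAPE is also wanted during cross validation, one option is to wrap the function with make_scorer. The sketch below assumes synthetic data shifted to be positive, which is an illustration and not the asker's dataset. It first checks the function on two values that can be verified by hand:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def smape(A, F):
    # A: actual values, F: forecasts; the result is a percentage
    A, F = np.asarray(A, dtype=float), np.asarray(F, dtype=float)
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))

# Hand-checkable case: (2*10/210 + 2*20/380) * 100 / 2
toy = smape([100, 200], [110, 180])
print(toy)

# Synthetic data (assumption); targets are shifted to stay positive
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)
y = y - y.min() + 1.0

# greater_is_better=False makes scikit-learn negate the value, since lower SMAPE is better
smape_scorer = make_scorer(smape, greater_is_better=False)
model = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring=smape_scorer)
print(-scores.mean())  # flip the sign back to read it as SMAPE
```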

How to run SVC classifier after running 10-fold cross validation in sklearn?

I'm relatively new to machine learning and would like some help in the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
from sklearn import svm
from sklearn.model_selection import cross_val_score, train_test_split
def get_scores(features, target, classifier):
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
    scores = cross_val_score(
        classifier,
        X_train,
        y_train,
        cv=10,
        scoring='accuracy',
        n_jobs=-1)
    return scores
get_scores(features_from_df, target_from_df, svm.SVC())
Now, how can I use my classifier (after running the 10-folds cv) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross validation process.
I noticed that sklearn has cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html Should I replace my cross_val_score with cross_val_predict? Just FYI: my target data column is binarized (it has values of 0 and 1).
If my approach is wrong, please advise me on the best way to proceed.
Thanks!
You only need to separate your data into X (features) and y (target). Do not split it into train and test.
Then you can pass your classifier, in your case svm, to the cross_val_score function to get the accuracy for each fold.
In just 3 lines of code:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
from sklearn import svm
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
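Putting the pieces together, the whole flow from the question might look like the sketch below: cross-validate on the training part, then fit once and score on the held-out part. The synthetic data is an assumption standing in for features_from_df and target_from_df:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary data standing in for features_from_df / target_from_df (assumption)
features, target = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0)

classifier = svm.SVC()

# Step 1: estimate performance with 10-fold CV on the training part only
cv_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring="accuracy")
print(cv_scores.mean())

# Step 2: fit on the whole training part, then score once on the held-out test part
classifier.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(test_accuracy)
```

If the test accuracy is far below the cross-validated mean, that is usually worth investigating before trusting either number.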
