I want to create the Negative Predictive Value (NPV) metric to evaluate Data inside a gridsearchCV.
I prepared an example with the iris dataset.
from sklearn import svm, datasets
import numpy as np
import pandas as pd
from sklearn.metrics import make_scorer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
target = iris.target
names = iris.target_names
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target
#df['species'] = df['species'].replace(to_replace= [0, 1, 2], value = ['setosa', 'versicolor', 'virginica'])
indexNames = df[ df['species'] == 2 ].index
df.drop(indexNames , inplace=True)
x_data = pd.DataFrame({'sepal length': df[df.columns[0]],
'sepal width': df[df.columns[1]],
'petal length': df[df.columns[2]],
'petal width': df[df.columns[3]]})
y_data = pd.DataFrame(df['species']).astype(float)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.5)
parameters = {
'kernel':('linear', 'rbf'),
'C':[1, 10]
}
svc = svm.SVC()
scoring = ['accuracy']
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid_search = GridSearchCV(estimator=svc,
param_grid=parameters,
scoring=scoring,
refit='accuracy',
n_jobs=-1,
cv=kfold,
verbose=0)
grid_result = grid_search.fit(x_train, y_train.values.ravel())
grid_result
print(f'best precision score train set {grid_result.best_score_:.4f}')
print(f'best hyperparameters{grid_result.best_params_}')
print(f'precision score test set{grid_search.score(x_test, y_test):.4f}')
Unfortunately the exemplaric code is not working, accuracy is always 1. Thats one issue i got when i adapted my real code to an example with iris data.
Another Code works, but its for accuracy only and i need a binary classifier for metrics like precision or NPV.
from sklearn import svm, datasets
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
scoring = ['accuracy']
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid_search = GridSearchCV(estimator=svc,
param_grid=parameters,
scoring=scoring,
refit='accuracy',
n_jobs=-1,
cv=kfold,
verbose=0)
grid_result = grid_search.fit(iris.data, iris.target)
print(f'best accuracy score {grid_result.best_score_:.4f}')
print(f'best hyperparameters{grid_result.best_params_}')
print(f'accuracy score{grid_search.score(iris.data, iris.target):.4f}')
To sum up, whys is the first code block not working like intended and how would an implementation with the metric NPV look like?
Ideas are welcome.
Also, in other classification tasks, I wrote columns with my predicted value and I could compute TP, FP, TN, and FN directly and got my metrics. Inside the gridCV I couldn't figure out how to access the predictions.
In the first code above, the data is really well separated. To remove doubts, print the confusion_matrix in it, you can see that this is so.
from sklearn.metrics import confusion_matrix
y_pred = grid_search.best_estimator_.predict(x_test)
print(confusion_matrix(y_test, y_pred))
Output
[[22 0]
[ 0 28]]
As for custom criteria, you can use: make_scorer. In which the custom_scorer function is fed, which considers what you want. I took the confusion_matrix and extracted the values from it to calculate the score. What she thinks, namely the formula, you need to check, because I can be wrong.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
def custom_scorer(y_true, y_pred, **kwargs):
cnm = confusion_matrix(y_true, y_pred)
return cnm[1, 1] / (cnm[1, 1] + cnm[0, 1])
mysc = make_scorer(custom_scorer, greater_is_better=True)
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid_search = GridSearchCV(estimator=svc,
param_grid=parameters,
scoring=mysc,
refit='accuracy',
n_jobs=-1,
cv=kfold,
verbose=0)
grid_result = grid_search.fit(x_train, y_train.values.ravel())
Related
I got error while using SVM and MLP classifiers from SkLearn package. The error is C:\Users\cse_s\anaconda3\lib\site-packages\sklearn\metrics_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Code for splitting dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
Code for SVM classifier
from sklearn import svm
SVM_classifier = svm.SVC(kernel="rbf", probability = True, random_state=1)
SVM_classifier.fit(X_train, y_train)
SVM_y_pred = SVM_classifier.predict(X_test)
print(classification_report(y_test, SVM_y_pred))
Code for MLP classifier
from sklearn.neural_network import MLPClassifier
MLP = MLPClassifier(random_state=1, learning_rate = "constant", learning_rate_init=0.3, momentum = 0.2 )
MLP.fit(X_train, y_train)
R_y_pred = MLP.predict(X_test)
target_names = ['No class', 'Yes Class']
print(classification_report(y_test, R_y_pred, target_names=target_names))
The error is same for both classifiers
I hope, it could help.
Classification_report:
Sets the value to return when there is a zero division. You can provide 0 or 1 if zero division occur. by the precision or recall formula
classification_report(y_test, R_y_pred, target_names=target_names, zero_division=0)
I don't know what's your data look like. Here's an example
Features of cancer dataset
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
cancer = load_breast_cancer()
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df_feat.head()
Target of dataset:
df_target = pd.DataFrame(cancer['target'],columns=['Cancer'])
np.ravel(df_target) # convert it into a 1-d array
Generate classification report:
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.3, random_state=101)
SVM_classifier = svm.SVC(kernel="rbf", probability = True, random_state=1)
SVM_classifier.fit(X_train, y_train)
SVM_y_pred = SVM_classifier.predict(X_test)
print(classification_report(y_test, SVM_y_pred))
Generate classification report for MLP Classifier:
MLP = MLPClassifier(random_state=1, learning_rate = "constant", learning_rate_init=0.3, momentum = 0.2 )
MLP.fit(X_train, y_train)
R_y_pred = MLP.predict(X_test)
target_names = ['No class', 'Yes Class']
print(classification_report(y_test, R_y_pred, target_names=target_names, zero_division=0))
My dataset is Spam and Ham Filipino Message
I divided my dataset into 60% training, 20% testing and 20%validation
Split data into testing, training and Validation
from sklearn.model_selection import train_test_split
data['label'] = (data['label'].replace({'ham' : 0,
'spam' : 1}))
X_train, X_test, y_train, y_test = train_test_split(data['message'],
data['label'], test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print('Total: {} rows'.format(data.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))
Train a MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
naive_bayes = MultinomialNB().fit(train_data,
y_train)
predictions = naive_bayes.predict(test_data)
Evaluate the Model
from sklearn.metrics import (accuracy_score,
precision_score,
recall_score,
f1_score)
accuracy_score = accuracy_score(y_test,
predictions)
precision_score = precision_score(y_test,
predictions)
recall_score = recall_score(y_test,
predictions)
f1_score = f1_score(y_test,
predictions)
My problem is in Validation. The error says
warnings.warn("Estimator fit failed. The score on this train-test"
this is how I code my validation, don't know if I'm doing the right thing"
from sklearn.model_selection import cross_val_score
mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
I did not get any error or warning. Maybe it can be worked.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.metrics import (accuracy_score,
precision_score,
recall_score,
f1_score)
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("https://raw.githubusercontent.com/jeffprosise/Machine-Learning/master/Data/ham-spam.csv")
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
x = vectorizer.fit_transform(df['Text'])
y = df['IsSpam']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print('Total: {} rows'.format(data.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))
naive_bayes = MultinomialNB().fit(X_train, y_train)
predictions = naive_bayes.predict(X_test)
accuracy_score = accuracy_score(y_test,predictions)
precision_score = precision_score(y_test, predictions)
recall_score = recall_score(y_test, predictions)
f1_score = f1_score(y_test, predictions)
mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
Result:
Total: 1000 rows
Train: 600 rows
Test: 200 rows
Validation: 200 rows
Cross-validation scores:[1. 0.95 0.85 1. 1. 0.9 0.9 0.8 0.9 0.9 ]
First, it is worth noting that because it's called cross validation doesn't mean you have to use a validation set as you have done in your code, to do the crossval. There are a number of reasons why you would perform cross validation which include:
Ensuring that all your dataset is used in training as well as evaluating the performance of your model
To perform hyperparameter tuning.
Hence, your case here lean toward the first use case. As such you don't need to first perform a split of train, val, and test. Instead you can perform the 10-fold cross validation on your entire dataset.
If you are doing hyparameterization, then you can have a hold-out set of say 30% and use the remaining 70% for cross validation. Once the best parameters have been determined, you can then use the hold-out set to perform an evaluation of the model with the best parameters.
Some refs:
https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79
https://www.analyticsvidhya.com/blog/2021/11/top-7-cross-validation-techniques-with-python-code/
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
Ideally I should get same result as score is nothing but R-Square. But not sure why results are coming different.
from sklearn.datasets import california_housing
data = california_housing.fetch_california_housing()
data.data.shape
data.feature_names
data.target_names
import pandas as pd
house_data = pd.DataFrame(data.data, columns=data.feature_names)
house_data.describe()
house_data['Price'] = data.target
X = house_data.iloc[:, 0:8].values
y = house_data.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
#Check R-square on training data
from sklearn.metrics import mean_squared_error, r2_score
y_pred = linear_model.predict(X_test)
print(linear_model.score(X_test, y_test))
print(r2_score(y_pred, y_test))
Output
0.5957643114594776
0.34460597952465033
from the docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
sklearn.metrics.r2_score(y_true, y_pred,...)
You are passing y_true and y_pred the wrong way around. If you switch them you get the correct result.
print(linear_model.score(X_test, y_test))
print(r2_score(y_test, y_pred))
0.5957643114594777
0.5957643114594777
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
X = data['Review']
y = data['Category']
tfidf = TfidfVectorizer(ngram_range=(1,1))
classifier = LinearSVC()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
clf = Pipeline([
('tfidf', tfidf),
('clf', classifier)
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
accuracy_score(y_test, y_pred)
This is the code to train a model and prediction. I need to know my model performance. so where should I change to become cross_val_score?
use this:(it is an example from my previous project)
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
kfolds = KFold(n_splits=5, shuffle=True, random_state=42)
def cv_f1(model, X, y):
score = np.mean(cross_val_score(model, X, y,
scoring="f1",
cv=kfolds))
return (score)
model = ....
score_f1 = cv_f1(model, X_train, y_train)
you can have multiple scoring. you should just change scoring="f1".
if you want to see score for each fold just remove np.mean
from sklearn documentation
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
In your case it will be
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores)
For the code below, my r-squared score is coming out to be negative but my accuracies score using k-fold cross validation is coming out to be 92%. How's this possible? Im using random forest regression algorithm to predict some data. The link to the dataset is given in the link below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are doing a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting and the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of the correctly classified examples - check the Wikipedia entry for more details.
The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
Knowing that, it is good practice to include explicitly the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051