I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).
These concepts are totally new to me and am not very sure if am doing it right. I would be grateful if anyone could advise me on the right steps to take where I have gone wrong. Part of my code is shown below.
Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?
Thank you
import pandas as pd
Data=pd.read_csv ('C:\\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]
Y=Data['Status']
Y1=Data['Status1'] # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted)
from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())
from nltk import ConfusionMatrix
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))
# sensitivity:
print (metrics.recall_score(y, predicted) )
import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
plt.hist(probs)
plt.show()
# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print (ConfusionMatrix(list(y), list(preds)))
# check accuracy, sensitivity, specificity
print (metrics.accuracy_score(y, predicted))
#ROC CURVES and AUC
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate)')
plt.show()
# calculate AUC
print (metrics.roc_auc_score(y, probs))
# use AUC as evaluation metric for cross-validation
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
You got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset. You just need to remove logreg.fit earlier in the code. Specifically, what it does is the following:
It divides your dataset in to n folds and in each iteration it leaves one of the folds out as the test set and trains the model on the rest of the folds (n-1 folds). So, in the end you will get predictions for the entire data.
Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y
In [15]: iris['data'].shape
Out[15]: (150, 4)
To get predictions on the entire set with cross validation you can do the following:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, cross_validation
from sklearn import datasets
iris = datasets.load_iris()
predicted = cross_validation.cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print metrics.accuracy_score(iris['target'], predicted)
Out [1] : 0.9537
print metrics.classification_report(iris['target'], predicted)
Out [2] :
precision recall f1-score support
0 1.00 1.00 1.00 50
1 0.96 0.90 0.93 50
2 0.91 0.96 0.93 50
avg / total 0.95 0.95 0.95 150
So, back to your code. All you need is this:
from sklearn import metrics, cross_validation
logreg=LogisticRegression()
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
print metrics.accuracy_score(y, predicted)
print metrics.classification_report(y, predicted)
For plotting ROC in multi-class classification, you can follow this tutorial which gives you something like the following:
In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.
Related
I am using sklearn to compute the average precision and roc_auc of a classifier and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the packages give different scores in both metrics and I do not know which one is the correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# provides de data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()
The code produces the following output:
As it can be seen above, the metrics give different values depending the package. In the print statement are the values computed by scikit-learn whereas in the plots appear annotated the values computed by yellowbrick.
Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score: array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as
returned by “decision_function” on some classifiers).
where non-thresholded means exactly not hard classes. Similar is the case for the roc_auc_score (docs).
Correcting this with the following code, makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_pred = clf_lr.predict_proba(X_test) # get probabilities
y_prob = np.array([x[1] for x in y_pred]) # keep the prob for the positive class 1
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
Results:
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one migh expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88
I've created a model is trained on the titanic dataset, and I want to see an accuracy percentage for my model. I've done this before, but sadly, I do not remember. I looked at the internet, and I couldn't find anything. Either I just entered the wrong words, or there isn't anything their.
# the tts function is `train_test_split` from `sklearn.model_selection`
train_X, val_X, train_y, val_y = tts(X, y, random_state = 0) # y is the state of survival
forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
val_predictions = forest_model.predict(val_X)
How can I calculate the accuracy?
I wonder why are you using RandomForestRegressor, as titanic dataset can be formulated as a binary-classification problem. Assuming it is a mistake, to measure accuracy you can of a RandomForestClassifier, you can do:
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(val_y, val_predictions)
However, it is often better to use K-fold cross-validation, which gives you more reliable accuracy:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(forest_model, X, y, cv=10, scoring='accuracy') #10-fold cross validation, notice that I am giving X,y not X_train, y_train
K-fold cross-validation gives you 10 accuracy values of accuracy, as it divides the data into 10 folds (i.e. parts). Then, you can get a mean and standard deviation the accuracy values as follows:
>>> import numpy as np
>>> accuracy = np.array(cross_val_score(forest_model, X, y, cv=10, scoring='accuracy'))
>>> accuracy.mean() #mean of accuracies
0.81
>>> accuracy.std() #standard deviation of accuracies
0.04
You can also use other scoring metric such as F1-score, Precision,
Recall,cohen_kappa_score etc.
You can calculate the model score to know the performance accuracy
forest_model.score(val_y, val_predictions)
I am trying to predict Boston Housing prices. When I choose polynomial regression degree 1 or 2, R2 score is OK. But 3rd degree decreases R2 score.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2) # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
poly_reg.fit(X_poly, y_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)
y_pred = lin_reg_2.predict(poly_reg.fit_transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree=2):
Prediction Score is: 0.6903318065831567
Output (degree=3):
Prediction Score is: -12898.308114085281
It is called overfitting the model.What you are doing is fitting the model perfectly on the training set that will lead to high variance.When you fit your hypothesis well on the training set it will then fail on the test set. You can check your r2_score for your training set using r2_score(X_train,y_train). It will be high. You need to balance the trade-off between bias and variance.
You can try other regression models like lasso and ridge and can play with their alpha value in case you are looking for a high r2_score. For better understanding, I am putting up an image that will show how hypothesis line gets affected on increasing the degree of the polynomial.
I have been playing around with lasso regression on polynomial functions using the code below. The question I have is should I be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and plot as outlined in the code I have pasted below suggests not. Appreciate any advice on why this is not the case or if I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics.regression import r2_score
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
global X_train, X_test, y_train, y_test
degrees = 12
poly = PolynomialFeatures(degree=degrees)
X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
#Lasso Regression Model
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
#No feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
y_test_lassopredict = linlasso.predict(X_test_poly)
Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
#With feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
%matplotlib notebook
plt.figure()
plt.scatter(X_test, y_test, label='Test data')
plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()```
Your X range is around [0,10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their larger values), so Lasso will not need to set them to zero. If you scale them, their weights will be much larger, and Lasso will set most of them to zero. That's why it has a poor prediction for the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.
I have written a code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have implemented GridSearchCV() and train_test_split() to sort parameters and split the input data.
My goal is to find the overall (average) accuracy over the 10 folds with a standard error on the test data. Additionally, I try to predict correctly predicted class labels, creating a confusion matrix and preparing a classification report summary.
Please, advise me in the following:
(1) Is my code correct? Please, check each part.
(2) I have tried two different Sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summaries are not included).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
# Load any n x m data and label column. No missing or NaN values.
# I am skipping loading data part. One can load any data to test below code.
sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}
if __name__ == '__main__':
clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)
# Train the classifier on data1's feature and target data
clf.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}% \n".format((clf.score(X_train, y_train))*100))
print("Accuracy on test set: {:.2f}%\n".format((clf.score(X_test, y_test))*100))
print("Best Parameters: ")
print(clf.best_params_)
# Alternately using cv_results_
print("Accuracy on training set: {:.2f}% \n", (clf.cv_results_['mean_train_score'])*100))
print("Accuracy on test set: {:.2f}%\n", (clf.cv_results_['mean_test_score'])*100))
# Predict class labels
y_pred = clf.best_estimator_.predict(X_test)
# Confusion Matrix
class_names = ['Positive', 'Negative']
confMatrix = confusion_matrix(y_test, y_pred)
print(confMatrix)
# Accuracy Report
classificationReport = classification_report(labels, y_pred, target_names=class_names)
print(classificationReport)
I will appreciate any advise.
First of all, the desired metrics, i. e. the accuracy metrics, is already considered a default scorer of LogisticRegression(). Thus, we may omit to define scoring='accuracy' parameter of GridSearchCV().
Secondly, the parameter score(X, y) returns the value of the chosen metrics IF the classifier has been refit with the best_estimator_ after sorting all possible options taken from param_grid. It works like so as you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print out the averaged metrics but rather the best metrics.
Thirdly, the parameter cv_results_ is a much broader summary as it includes the results of each fit. However, it prints out the averaged results obtained by averaging the batch results. These are the values that you wish to store.
Quick Example
Let me hereby introduce a toy example for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, 0.2)
param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
param_grid=param_grid)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_train, y_train)
print('____')
clf.cv_results_
This code yields the following:
0.98107957707289928 # which is the best possible accuracy score
{'mean_fit_time': array([ 0.15465896, 0.23701136]),
'mean_score_time': array([ 0.0006465 , 0.00065773]),
'mean_test_score': array([ 0.934335 , 0.9376739]),
'mean_train_score': array([ 0.96475625, 0.98225632]),
'param_C': masked_array(data = [0.001 0.01],
'params': ({'C': 0.001}, {'C': 0.01})
mean_train_score has two mean values as I grid over two options for C parameter.
I hope that helps!