I am trying to use SKLearn's MLPRegressor model; however, I am unable to replicate the final loss value that the model reports.
I have turned off regularisation by setting alpha to 0, and turned off early-stopping and validation. But my loss calculation differs from what SKLearn is providing by about 1%.
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
def squared_loss(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean() / 2
X, y = make_regression(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
regr = MLPRegressor(random_state=1, max_iter=500, alpha=0, validation_fraction=0).fit(X_train, y_train)
# sklearn loss calc
regr.loss_ # 354.05262744103453
# manual loss calc
squared_loss(y_train, regr.predict(X_train)) # 350.25399153165534
Any pointers would be great. I've taken my squared_loss function directly from the SKLearn docs and, going through the code, I can't find anything else that could be causing the difference, but I must be missing something. My guess is that there is some sample-size-related adjustment somewhere, because whatever value I change the first 'random_state' to, the difference between the two calculations is always around 1%.
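A hedged diagnostic sketch (assuming the default adam solver, which records a per-epoch history in loss_curve_): loss_ is simply the last entry of that history, and the history is accumulated over mini-batches while the weights are still being updated within the epoch, so it need not equal a loss recomputed from the final weights.
# diagnostic: compare the reported loss, the per-epoch history,
# and the loss recomputed from the final weights
print(regr.loss_ == regr.loss_curve_[-1])            # True: loss_ is the last recorded epoch loss
print(regr.loss_curve_[-3:])                         # last few per-epoch losses
print(squared_loss(y_train, regr.predict(X_train)))  # loss using the final weights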
Hi everyone, I have a problem printing my results with sklearn's MLPClassifier. I want my result to be a plot of MSE vs. epoch for training vs. testing, and this is my code:
# Split the data into features and target
x = Dt[['new_tests','people_vaccinated','people_fully_vaccinated','total_boosters','new_vaccinations']]
y = Dt[['new_cases','new_deaths']]
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Split the dataset into a training set (70%) and a test set (30%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
# Import MLPClassifier and mean_squared_error
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error
# Create the model object
flc = MLPClassifier(hidden_layer_sizes=(15, 10),
                    random_state=5,
                    verbose=True,
                    max_iter=50,
                    learning_rate_init=0.1)
# Fit the model to the training data
flc.fit(x_train, y_train)
# Make predictions on the test dataset
ypred = flc.predict(x_test)
# Compute the MSE (this must come after ypred is defined)
mse = mean_squared_error(y_test, ypred)
# Import accuracy score
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy_score(y_test, ypred)
I was expecting the graph to display MSE vs. epoch for training and testing. Is anyone able to run it? Any ideas or suggestions? What could be causing this issue? Thank you.
Mean square error is a regression metric. To visualize the loss per iteration, you can plot the clf.loss_curve_:
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
X, y = make_classification()
clf = MLPClassifier(max_iter=50).fit(X, y)
plt.plot(clf.loss_curve_)
plt.show()
If you want to plot something more complicated (e.g. error vs. epoch for training and testing), you'll need to write a training loop yourself, as described here: Training and validation loss history for MLPRegressor.
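A minimal sketch of such a loop (assumptions: synthetic data from make_classification, one partial_fit call per epoch, and log-loss as the per-epoch metric, since this is a classifier):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(15, 10), random_state=5)
train_hist, test_hist = [], []
for epoch in range(50):
    # one partial_fit call performs a single pass over the training data
    clf.partial_fit(X_train, y_train, classes=np.unique(y))
    train_hist.append(log_loss(y_train, clf.predict_proba(X_train)))
    test_hist.append(log_loss(y_test, clf.predict_proba(X_test)))

plt.plot(train_hist, label="train")
plt.plot(test_hist, label="test")
plt.xlabel("epoch")
plt.ylabel("log-loss")
plt.legend()
plt.show()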
I am using sklearn to compute the average precision and roc_auc of a classifier, and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the packages give different scores for both metrics, and I do not know which one is correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# provides the data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()
The code produces the following output:
As can be seen above, the metrics give different values depending on the package. The print statements show the values computed by scikit-learn, whereas the values computed by Yellowbrick are annotated in the plots.
Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score: array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as
returned by “decision_function” on some classifiers).
where non-thresholded means exactly not hard classes. Similar is the case for the roc_auc_score (docs).
Correcting this with the following code, makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_prob = clf_lr.predict_proba(X_test)[:, 1] # keep the probability of the positive class 1
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
Results:
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
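Incidentally, since both ROC AUC and average precision depend only on the ranking of the scores, the non-thresholded decision_function output mentioned in the docs quoted above would work just as well here (a sketch; for LogisticRegression it is a monotonic transform of the positive-class probability):
# same ranking, hence the same metric values as above
y_dec = clf_lr.decision_function(X_test)
print(roc_auc_score(y_test, y_dec))
print(average_precision_score(y_test, y_dec))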
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one might expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Scale/normalize the independent variables
X = StandardScaler().fit_transform(X)
# Split data into train and test sets at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=42)
gpc = GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train, y_train)
y_proba = gpc.predict_proba(X_test)
# Classify as 1 if the predicted probability is greater than 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
The above code runs as expected. However, in order to explain the model with something like 'a 1-unit change in Beta1 will result in a .7% improvement in the probability of success', I need to be able to see the theta. How do I do this?
Thanks for the assist. BTW, this is for a homework assignment
Very nice question. You can indeed access the thetas; however, the documentation does not make it clear how to do this.
Use the following. Here I use the iris dataset.
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data
X, y = load_iris(return_X_y=True)
# Scale/normalize the independent variables
X = StandardScaler().fit_transform(X)
# Split data into train and test sets at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=42)
gpc = GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train, y_train)
y_proba = gpc.predict_proba(X_test)
# Classify as 1 if the predicted probability is greater than 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
# thetas
gpc.kernel_.theta
Results:
array([7.1292252 , 1.35355145, 5.54106817, 0.61431805, 7.00063873,
1.3175175 ])
An example from the documentation that accesses the thetas can be found HERE
Hope this helps.
It looks as if the theta value you are looking for is a property of the kernel object that you pass into the classifier. You can read more in this section of the sklearn documentation. You can access the log-transformed values of theta for the classifier kernel by using classifier.kernel_.theta where classifier is the name of your classifier object.
Note that the kernel object also has a method clone_with_theta(theta) that might come in handy if you're making modifications to theta.
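Since theta holds log-transformed values, here is a small hedged sketch (assuming the fitted gpc from above and a recent sklearn version) for mapping them back to the actual kernel hyperparameters:
import numpy as np

# theta is stored on the log scale; exponentiate to recover the actual values
print(np.exp(gpc.kernel_.theta))

# for the multiclass iris example the classifier is fit one-vs-rest, so the six
# values should correspond to (constant_value, length_scale) for each of the
# three underlying binary classifiers; their fitted kernels can be inspected
# individually:
for est in gpc.base_estimator_.estimators_:
    print(est.kernel_)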
I have written code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have used GridSearchCV() and train_test_split() to tune parameters and split the input data.
My goal is to find the overall (average) accuracy over the 10 folds, with a standard error, on the test data. Additionally, I want to predict the class labels, create a confusion matrix, and prepare a classification report summary.
Please, advise me in the following:
(1) Is my code correct? Please, check each part.
(2) I have tried two different Sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summaries are not included).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load any n x m data and label column. No missing or NaN values.
# I am skipping the data-loading part. One can load any data to test the code below.

sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}

if __name__ == '__main__':
    clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
    X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)

    # Train the classifier on the training data
    clf.fit(X_train, y_train)
    print("Accuracy on training set: {:.2f}%\n".format(clf.score(X_train, y_train) * 100))
    print("Accuracy on test set: {:.2f}%\n".format(clf.score(X_test, y_test) * 100))
    print("Best Parameters: ")
    print(clf.best_params_)

    # Alternately, using cv_results_ (mean CV scores per parameter setting)
    print("Accuracy on training set:", clf.cv_results_['mean_train_score'] * 100)
    print("Accuracy on test set:", clf.cv_results_['mean_test_score'] * 100)

    # Predict class labels
    y_pred = clf.best_estimator_.predict(X_test)

    # Confusion matrix
    class_names = ['Positive', 'Negative']
    confMatrix = confusion_matrix(y_test, y_pred)
    print(confMatrix)

    # Classification report
    classificationReport = classification_report(y_test, y_pred, target_names=class_names)
    print(classificationReport)
I will appreciate any advice.
First of all, the desired metric, i.e. accuracy, is already the default scorer of LogisticRegression(). Thus, we may omit defining the scoring='accuracy' parameter of GridSearchCV().
Secondly, the method score(X, y) returns the value of the chosen metric for the classifier refit with the best_estimator_ after searching all possible options taken from param_grid. It works like this because you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print the averaged metric but rather the best one.
Thirdly, the attribute cv_results_ is a much broader summary, as it includes the results of each fit. However, it reports results averaged over the folds. These are the values that you wish to store.
Quick Example
Let me hereby introduce a toy example for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
param_grid=param_grid)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_train, y_train)
print('____')
clf.cv_results_
This code yields the following:
0.98107957707289928 # which is the best possible accuracy score
{'mean_fit_time': array([ 0.15465896, 0.23701136]),
'mean_score_time': array([ 0.0006465 , 0.00065773]),
'mean_test_score': array([ 0.934335 , 0.9376739]),
'mean_train_score': array([ 0.96475625, 0.98225632]),
'param_C': masked_array(data = [0.001 0.01],
'params': ({'C': 0.001}, {'C': 0.01})
mean_train_score has two mean values because the grid contains two options for the C parameter.
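Since the question also asks for a standard error over the 10 folds, here is a hedged sketch of how that could be pulled from cv_results_ (assuming cv=10 as above):
import numpy as np

n_folds = 10
mean_scores = clf.cv_results_['mean_test_score']
std_scores = clf.cv_results_['std_test_score']   # std across the 10 folds
std_err = std_scores / np.sqrt(n_folds)          # standard error of the mean

for params, m, se in zip(clf.cv_results_['params'], mean_scores, std_err):
    print(params, "{:.4f} +/- {:.4f}".format(m, se))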
I hope that helps!
I have the following code using linear_model.Lasso:
from sklearn import cross_validation, linear_model

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = linear_model.Lasso()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # for Lasso this is the R^2 score, not classification accuracy
print(accuracy)
I want to perform k-fold cross-validation (10 folds, to be specific). What would be the right code to do that?
Here is the code I use to perform cross-validation on a linear regression model, and also to get the details:
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_Train, Y_Train, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
As explained in this book on page 108, this is the reason why we use -scores:
Scikit-Learn cross-validation features expect a utility function
(greater is better) rather than a cost function (lower is better), so
the scoring function is actually the opposite of the MSE (i.e., a
negative value), which is why the preceding code computes -scores
before calculating the square root.
And to visualize the results, use this simple function:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
You can run 10-fold cross-validation using the model_selection module:
# for scikit-learn 0.18 or newer, use:
from sklearn.model_selection import cross_val_score
# for pre-0.18 versions of scikit-learn, use instead:
# from sklearn.cross_validation import cross_val_score
from sklearn import linear_model

X = ...  # some features
y = ...  # some targets
clf = linear_model.Lasso()
scores = cross_val_score(clf, X, y, cv=10)
This code will return 10 different scores. You can easily get the mean:
scores.mean()
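Note that Lasso's default score is R^2 rather than classification accuracy; if you prefer an error metric, a hedged variant (same data placeholders as above) is to pass a scoring argument:
import numpy as np

# per-fold RMSE instead of the default R^2
mse_scores = cross_val_score(clf, X, y, cv=10, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-mse_scores)
print(rmse_scores.mean(), rmse_scores.std())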