I know how to draw confusion matrix when I use the train and test split using sklearn but I do not know how to create the confusion matrix when I am using the leave-one-out cross validation as shown in this example:
# Evaluate using Leave One Out Cross Validation
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
How should I create the confusion matrix for LOOCV in order to visualize the per-class accuracy?
Borrowing your method from here, you can work around the problem via creating a custom scorer that receives the metadata during the iterations. These metadata can be used to find: F1 Score, Precision, Recall, Accuracy as well as the Confusion Matrix!
Here we need another trick that is using GridSearchCV which accepts a custom scorer, so here we go!
Here is an example that you can work on more according to your absolute requirements:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Your method from the link you provided
def cm_analysis(y_true, y_pred, labels, ymap=None, figsize=(10,10)):
if ymap is not None:
y_pred = [ymap[yi] for yi in y_pred]
y_true = [ymap[yi] for yi in y_true]
labels = [ymap[yi] for yi in labels]
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_sum = np.sum(cm, axis=1, keepdims=True)
cm_perc = cm / cm_sum.astype(float) * 100
annot = np.empty_like(cm).astype(str)
nrows, ncols = cm.shape
for i in range(nrows):
for j in range(ncols):
c = cm[i, j]
p = cm_perc[i, j]
if i == j:
s = cm_sum[i]
annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
elif c == 0:
annot[i, j] = ''
else:
annot[i, j] = '%.1f%%\n%d' % (p, c)
cm = pd.DataFrame(cm, index=labels, columns=labels)
cm.index.name = 'Actual'
cm.columns.name = 'Predicted'
fig, ax = plt.subplots(figsize=figsize)
sns.heatmap(cm, annot=annot, fmt='', ax=ax)
#plt.savefig(filename)
plt.show()
# Custom Scorer
def my_scorer(y_true, y_pred):
acc = accuracy_score(y_true, y_pred)
# you can either save y_true, y_pred and accuracy into a file
# for later use with the info in clf.cv_results_
# or plot the confusion matrix right here!
# for labels, you can create a class attribute to make it more dynamic
# i.e. changes automatically with every new dataset!
cm_analysis(y_true, y_pred, labels=[0,1], ymap=None, figsize=(10, 10))
# N.B as long as you have y_true and y_pred from every round here, you can
# do with them all the metrics that want such as F1 Score, Precision, Recall, A
# ccuracy and the Confusion Matrix!
return acc
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
array = df.values
X = np.array(array[:,0:8])
Y = np.array(array[:,8]).astype(int)
# I'll make it two just for submitting the result here!
num_folds = 2
skf = StratifiedKFold(n_splits=num_folds, random_state=0)
# this is just a trick because the list contains
# the default parameter only (i.e. useless)
param_grid = {'C': [1.0]}
model = LogisticRegression()
# create custom scorer
custom_scorer = make_scorer(my_scorer)
# pass it to the GridSearchCV
clf = GridSearchCV(model, param_grid, scoring=custom_scorer, cv=skf, return_train_score=True)
# Fit and Go
clf.fit(X,Y)
# cv_results_ is a dict with all CV results during the iterations!
# IDK, you may need it to combine its content with the metrics ..etc
print(clf.cv_results_)
Result
{'mean_score_time': array([0.09023476]), 'split0_train_score':
array([0.79166667]), 'mean_train_score': array([0.77864583]),
'params': [{'C': 1.0}], 'std_test_score': array([0.01953125]),
'mean_fit_time': array([0.00235796]),
'param_C': masked_array(data=[1.0], mask=[False], fill_value='?',
dtype=object), 'rank_test_score': array([1], dtype=int32),
'split1_test_score': array([0.7734375]),
'std_fit_time': array([0.00032902]), 'mean_test_score': array([0.75390625]),
'std_score_time': array([0.00237632]), 'split1_train_score': array([0.765625]),
'split0_test_score': array([0.734375]), 'std_train_score': array([0.01302083])}
Split 0
Split 1
EDIT
If you strictly want LOOCV, then you can apply it in the above code, just replace StratifiedKFold by LeaveOneOut function; but bear in mind that LeaveOneOut will iterate around 684 times! so it's computationally very expensive. However, that would give you the confusion matrix in details during the iterations (i.e. metadata).
Nevertheless, if you are seeking the confusion matrix of the overall (i.e. final) process then you will still need to use the GridSearchCV but like follow:
......
loocv = LeaveOneOut()
clf = GridSearchCV(model, param_grid, scoring='accuracy', cv=loocv)
clf.fit(X,Y)
y_pred = clf.best_estimator_.predict(X)
cm_analysis(Y, y_pred, labels=[0, 1], ymap=None, figsize=(10,10))
Result
Related
I have create this Support Vector machine model but I get errors in "Fit the PCA transformer on the training data and transform the data line" and "model = svm.SVC(kernel='linear')"
The first error is:
NameError: name 'x_train' is not defined
The second error is:
NameError: name 'svm' is not defined
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import seaborn as sns
import seaborn as sns # for data visualization
train_df = pd.read_csv('diabetes_data_upload.csv')
train_df.head()
# checking total of rows and columns
train_df.shape
# Transforming the Gender into 0 and 1
train_df["Gender"] = train_df["Gender"].map({"Male": 0, "Female": 1}).astype(int)
#Rounding the Age
train_df["Age"] = train_df.Age.round()
# Separating the data to predict the missing ages
X_train = train_df[train_df.Age.notnull()][['Age','Gender','weakness','Obesity', 'class']]
X_test = train_df[train_df.Age.isnull()][['Age','Gender','weakness','Obesity', 'class']]
y = train_df.Age.dropna()
# Just confirming if there is no more ages missing
train_df.Age.isnull().sum()
# Taking only the features that is important for now
X = train_df[['Gender', 'Age', 'weakness']]
# Taking the labels
Y = train_df['class']
# Spliting into 80% for training set and 20% for testing set so we can see our accuracy
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# Declaring the SVC with no tunning
print(X_train.shape)
print(Y_train.shape)
from sklearn.decomposition import PCA
# Create a PCA transformer with 3 components
pca = PCA(n_components=3)
# Fit the PCA transformer on the training data and transform the data
x_train_pca = pca.fit_transform(x_train)
# Transform the test data using the PCA transformer fitted on the training data
x_test_pca = pca.transform(x_test)
# Fit the classifier on the transformed training data
classifier.fit(x_train_pca, y_train)
# Predict the labels for the transformed test data
predictions = classifier.predict(x_test_pca)
# Calculate the accuracy of the model
accuracy = classifier.score(x_test_pca, y_test)
print("Accuracy:", accuracy)
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x so that np.dot(svc.coef_[0], x) + b = 0.
# Solve for w3 (z)
z = lambda x,y: (-clf.intercept_[0]-clf.coef_[0][0]*x -clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
The two errors you mention can be solved by the following -
1. NameError: name 'x_train' is not defined
This is because you are using x_train instead of the X_train variable you have defined right above. Remember, variables names are case sensitive.
x_train_pca = pca.fit_transform(x_train) # your code
x_train_pca = pca.fit_transform(X_train) # change it to this
2. NameError: name 'svm' is not defined
This is because you are already importing the SVC class using from svm import SVC. But while trying to instantiating the class the model you are using svm.SVC.
model = svm.SVC(kernel='linear') # your code
model = SVC(kernel='linear') # change it to this
i'm using a kaggle dataset (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) to make a prediction on house prices.
This is the code I used and so far so good.
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
#import dataset
dataset = pd.read_csv(path_to_dataset)
dataset.head()
dataset['date'] = pd.to_datetime(dataset['date']) #convert date in datetime
#house_age is a new feature
dataset["house_age"] = dataset["date"].dt.year - dataset['yr_built']
#drop useful features
dataset=dataset.drop('date', axis=1)
dataset=dataset.drop('yr_built', axis=1)
dataset = dataset.drop(["id"],axis=1)
train, test = train_test_split(dataset, test_size=0.3, random_state=43)
xtrain = train.drop(['price'], axis = 1) #train array without price
ytrain = train['price'] #train array with price
xtest = test.drop(['price'], axis = 1) #test array without price
ytest = test['price'] #test array with price
reg = RandomForestRegressor()
reg.fit(xtrain,ytrain)
pred = reg.predict(xtest)
print("Score: ",r2_score(ytrain, reg.predict(xtrain)))
print("Score: ",r2_score(ytest, pred))
print('MSE: ', metrics.mean_squared_error(ytest, pred))
Now, however, I would like to calculate and draw a confidence interval for the predictions made with my model.
I have already tried to look at many articles and libraries for several hours but I have not yet been able to find a solution that works for my case.
These are a couple of the references I followed but with little success:
http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html#sphx-glr-auto-examples-plot-mpg-py
https://scikit-garden.github.io/examples/QuantileRegressionForests/#quantile-regression-forests_1
Does anyone know how to create a confidence interval for this situation?
To construct confidence intervals, you can use the quantile-forest package. Using the RandomForestQuantileRegressor method in the package, you can specify quantiles to estimate during training, which can then be used to construct intervals.
Here's an example that extends your code with the above package to do this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# import dataset
dataset = pd.read_csv(path_to_dataset)
dataset.head()
dataset['date'] = pd.to_datetime(dataset['date']) # convert date in datetime
# house_age is a new feature
dataset['house_age'] = dataset["date"].dt.year - dataset['yr_built']
# drop useful features
dataset = dataset.drop('date', axis=1)
dataset = dataset.drop('yr_built', axis=1)
dataset = dataset.drop(['id'], axis=1)
train, test = train_test_split(dataset, test_size=0.3, random_state=43)
x_train = train.drop(['price'], axis=1) # train array without price
y_train = train['price'] # train array with price
x_test = test.drop(['price'], axis=1) # test array without price
y_test = test['price'] # test array with price
reg = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
reg.fit(x_train, y_train)
# Get predictions at 95% prediction intervals and median.
y_pred = reg.predict(x_test, quantiles=[0.025, 0.5, 0.975])
def plot_intervals(y_true, y_pred_lower, y_pred_upper):
fig = plt.figure(figsize=(10, 4))
y_pred_interval = y_pred_upper - y_pred_lower
sort_idx = np.argsort(y_pred_interval)
y_true = y_true[sort_idx]
y_pred_lower = y_pred_lower[sort_idx]
y_pred_upper = y_pred_upper[sort_idx]
# Center data, with the mean of the prediction interval at 0.
mean = (y_pred_lower + y_pred_upper) / 2
y_true -= mean
y_pred_lower -= mean
y_pred_upper -= mean
plt.plot(y_true, marker='.', ms=5, c='r', lw=0)
plt.fill_between(
np.arange(len(y_pred_upper)),
y_pred_lower,
y_pred_upper,
alpha=0.2,
color='gray',
)
plt.plot(np.arange(len(y_true)), y_pred_lower, marker='_', c='0.2', lw=0)
plt.plot(np.arange(len(y_true)), y_pred_upper, marker='_', c='0.2', lw=0)
plt.xlim([0, len(y_true)])
plt.xlabel('Ordered Samples')
plt.ylabel('Observed Values and Prediction Intervals (Centered)')
plt.show()
plot_intervals(y_test.values, y_pred[:, 0], y_pred[:, 2])
print('Score: ', r2_score(y_train, reg.predict(x_train)))
print('Score: ', r2_score(y_test, y_pred[:, 1]))
print('MSE: ', mean_squared_error(y_test, y_pred[:, 1]))
The code plots the generated intervals from smallest to largest along with the observed values:
I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. So I need to output Shap values in probability, instead of normal Shap values. It does not appear to have any options to output in term of probability.
The example code below is what I use to generate dataframe of Shap values and do a force_plot for the first data sample. Does anyone know how I should modify the code to change the output?
I'm new to Shap value and the Shap package. Thanks a lot in advance.
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]
shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
TL;DR:
You can achieve plotting results in probability space with link="logit" in the force_plot method:
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :],
link="logit",
)
Expected output:
Alternatively, you may achieve the same with the following, explicitly specifying model_output="probability" you're interested in to explain:
explainer = shap.TreeExplainer(
model,
data=X_train,
feature_perturbation="interventional",
model_output="probability",
)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
shap_value = shap_values.values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :]
)
Expected output:
However, it might be more interesting for understanding what's happening here to find out where these figures come from:
Our target proba for the point of interest:
model_proba= model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])
Base case raw from model given X_train as background (note, LightGBM outputs raw for class 1):
model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577
Base case raw from SHAP (note, they are symmetric):
bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519, 2.48397519])
Raw SHAP values for the point of interest:
sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584, 3.40619584])
Proba inferred from SHAP values (via sigmoid):
shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])
Check:
assert np.allclose(model_proba, shap_proba)
Please ask questions if something is not clear.
Side notes
Proba might be misleading if you're analyzing raw size effect of different features because sigmoid is non-linear and saturates after reaching certain threshold.
Some people expect to see SHAP values in probability space as well, but this is not feasible because:
SHAP values are additive by construction (to be precise SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions)
exp(a + b) != exp(a) + exp(b)
You may find useful:
Feature importance in a binary classification and extracting SHAP values for one of the classes only answer
How to interpret base_value of GBT classifier when using SHAP? answer
You can consider running your output values through a softmax() function. For reference, it is defined as :
def get_softmax_probabilities(x):
return np.exp(x) / np.sum(np.exp(x)).reshape(-1, 1)
and there is a scipy implementation as well:
from scipy.special import softmax
The output from softmax() will be probabilities proportional to the (relative) values in vector x, which are your shop values.
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)
# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
# verification
def verification(index_number,class_idx):
print('-----------------------------------')
print('index_number: ', index_number)
print('class_idx: ', class_idx)
print('')
y_base = explainer.expected_value[class_idx]
print('y_base: ', y_base)
player_explainer = pd.DataFrame()
player_explainer['feature_value'] = X_train.iloc[j].values
player_explainer['shap_value'] = shap_values[class_idx][j]
print('verification: ')
print('y_base + sum_of_shap_values: %.2f'%(y_base + player_explainer['shap_value'].sum()))
print('y_pred: %.2f'%(y_train[j]))
j = 10 # index
verification(j,0)
verification(j,1)
# show:
# X_train: (455, 30)
# X_test: (114, 30)
# -----------------------------------
# index_number: 10
# class_idx: 0
# y_base: -2.391423081639827
# verification:
# y_base + sum_of_shap_values: -9.40
# y_pred: 1.00
# -----------------------------------
# index_number: 10
# class_idx: 1
# y_base: 2.391423081639827
# verification:
# y_base + sum_of_shap_values: 9.40
# y_pred: 1.00
# -9.40,9.40 takes the maximum value(class_idx:1 = y_pred), and the result is obviously correct.
I helped you achieve it and verified the reliability of the results.
In Python, I have conducted a small multiple linear regression model to explain house prices in areas based on other variables (all of which are percentages multiplied by 100) such as percentage of people with bachelor degrees in an area, percentage of people who work from home. I have conducted this in R and it works fine, but I am new to Python ML. I have shown the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data where avgincome PctSingleDetached and PctDrivetoWork are X, and AvgHousingPrice is the Y.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
sample data:
avgincome PctSingleDetached PctDrivetoWork AvgHousingPrice
0 44388.0 61.528497 81.151832 448954
1 40650.0 54.372197 77.882798 349758
2 43350.0 68.393782 79.553265 428740
X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # This is an object of the imputer class. It will help us find that average to infer.
# Instructs to find missing and replace it with mean
# Fit method in SimpleImputer will connect imputer to our matrix of features
imputer.fit(X[:,:]) # We exclude column "O" AKA Country because they are strings
X[:, :] = imputer.transform(X[:,:])
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
# X = np.array(ct.fit_transform(X))
print(X)
print(Y)
## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)
### Feature Scaling ###
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # this does STANDARDIZATION for you. See data standardization formula
X_train[:, 0:] = sc.fit_transform(X_train[:,0:])
# Fit changes the data, Transform applies it! Here we have a method that does both
X_test[:, 0:] = sc.transform(X_test[:, 0:])
print(X_train)
print(X_test)
## Training ##
from sklearn.linear_model import LinearRegression
regressor = LinearRegression() # This class takes care of selecting the best variables. Very convenient
regressor.fit(X_train, Y_train)
### Predicting Test Set results ###
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2) # Display any numerical value with only 2 numebrs after decimal
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1 )), axis=1)) # this just simply makes everything vertical
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, y_pred)
print(mse)
OUTPUT:
[[489066.76 300334. ]
[227458.2 200352. ]
[928249.59 946729. ]
[339032.27 350116. ]
[689668.21 600322. ]
[489179.58 577936. ]]
...
...
MSE = 2375985640.8102403
You can calculate mse yourself to check if there is something wrong. In my opinion the obtained result is coherent. Anyway I built a simple my_mse function to check the result output by sklearn, with your example data
from sklearn.metrics import mean_squared_error
list_ = [[489066.76, 300334.],
[227458.2, 200352. ],
[928249.59, 946729. ],
[339032.27, 350116. ],
[689668.21, 600322. ],
[489179.58, 577936. ]]
y_true = [y[0] for y in list_]
y_pred = [y[1] for y in list_]
mse = mean_squared_error(y_true, y_pred)
print(mse)
# 8779930962.14985
def my_mse(y_true, y_pred):
diff = 0
for couple in zip(y_true, y_pred):
diff+=pow(couple[0]-couple[1], 2)
return diff/len(y_true)
print(my_mse(y_true, y_pred))
# 8779930962.14985
Remember the mse is the mean squared error. (Each error is squared in the sum)
If you are asking if your model is bad or good, it depends on the main objective. Anyway, I think that your model is performing poor because it's a linear model. A model with more complexity could handle the problem and output better results
I am currently trying to tune hyperparameters using GridSearchCV in scikit-learn using a 'Precision at k' scoring metric which will give me precision if I classify the top kth percentile of my classifier's score as the positive class. I know it is possible to create a custom scorer using make_scorer and creating a score function. This is what I have now:
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
def precision_at_k(y_true, y_score, k):
df = pd.DataFrame({'true': y_true, 'score': y_score}).sort('score')
threshold = df.iloc[int(k*len(df)),1]
y_pred = pd.Series([1 if i >= threshold else 0 for i in df['score']])
return metrics.precision_score(y_true, y_pred)
custom_scorer = metrics.make_scorer(precision_at_k, needs_proba=True, k=0.1)
X = np.random.randn(100, 10)
Y = np.random.binomial(1, 0.3, 100)
train_index = range(0, 70)
test_index = range(70, 100)
train_x = X[train_index]
train_Y = Y[train_index]
test_x = X[test_index]
test_Y = Y[test_index]
clf = LogisticRegression()
params = {'C': [0.01, 0.1, 1, 10]}
clf_gs = GridSearchCV(clf, params, scoring=custom_scorer)
clf_gs.fit(train_x, train_Y)
However, attempting to call fit gives me Exception: Data must be 1-dimensional and I'm not sure why. Can anyone help? Thanks in advance.
Arguments for pd.DataFrame should be 'list' not 'numpy.arrays'
So, just try converting y_true to python list...
df = pd.DataFrame({'true': y_true.tolist(), 'score': y_score.tolist()}).sort('score')