How to visualize the dependence of model performance on alpha with matplotlib?

I fit a Ridge regression with GridSearchCV, but I am having trouble using matplotlib to plot the model performance against the regularization strength (alpha).
Could anyone please help?
My code:
from sklearn.datasets import fetch_california_housing
cal=fetch_california_housing()
X = cal.data
y = cal.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
import matplotlib.pyplot as plt
alphas = np.logspace(-3, 3, 13)
plt.semilogx(alphas, grid.fit(X_train, y_train), label='Train')
plt.semilogx(alphas, grid.fit(X_test, y_test), label='Test')
plt.legend(loc='lower left')
plt.ylim([0, 1.0])
plt.xlabel('alpha')
plt.ylabel('performance')
# the error code I got was "ValueError: x and y must have same first dimension"
Basically, I want to see something like the following:

When plotting model selection results from a grid search, it is typical to plot the mean and standard deviation of the training and test scores across the cross-validation folds.
Care should also be taken to identify which scoring criterion the grid search uses to select the best model. For regression this is typically R-squared.
The grid search returns a dictionary (accessible through .cv_results_) containing the train and test scores for each fold, as well as the time it took to fit and score each fold. It also includes a summary of that data as means and standard deviations.
PS. In newer versions of scikit-learn you need to pass return_train_score=True to get the training scores.
PPS. When using grid search, splitting the data into train/test sets is not necessary for model selection, because the grid search splits the data automatically (cv=10 means the data is split into 10 folds).
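To make the structure of .cv_results_ concrete, here is a minimal sketch on synthetic data. The dataset, the variable names (X_demo, demo_grid, results) and the smaller grid are assumptions chosen only to keep the example fast; they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, used here only to keep the sketch small and fast.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_grid = GridSearchCV(
    Ridge(),
    {"alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="r2",
    return_train_score=True,  # without this, the *_train_score keys are absent
)
demo_grid.fit(X_demo, y_demo)

results = demo_grid.cv_results_
# Per-fold keys (split0_test_score, ...) plus mean_/std_ summaries.
print(sorted(k for k in results if k.endswith("_score")))
# Every array has one entry per candidate alpha.
print(results["mean_test_score"].shape)
```

Each array in the dictionary is ordered like the parameter grid, which is why it can be plotted directly against the list of alphas.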
Given the above, I modified the code to:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
cal = fetch_california_housing()
X = cal.data
y = cal.target
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
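# Note: the normalize parameter was removed from Ridge in scikit-learn 1.2;
# on newer versions, scale the features in a Pipeline instead.
grid = GridSearchCV(Ridge(normalize=True), param_grid,
                    cv=10, return_train_score=True, scoring='r2')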
grid.fit(X, y)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
alphas = np.logspace(-3, 3, 13)
train_scores_mean = grid.cv_results_["mean_train_score"]
train_scores_std = grid.cv_results_["std_train_score"]
test_scores_mean = grid.cv_results_["mean_test_score"]
test_scores_std = grid.cv_results_["std_test_score"]
plt.figure()
plt.title('Model')
plt.xlabel('$\\alpha$ (alpha)')
plt.ylabel('Score')
# plot train scores
plt.semilogx(alphas, train_scores_mean, label='Mean Train score',
             color='navy')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       train_scores_mean - train_scores_std,
                       train_scores_mean + train_scores_std,
                       alpha=0.2,
                       color='navy')
# plot test scores
plt.semilogx(alphas, test_scores_mean,
             label='Mean Test score', color='darkorange')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       test_scores_mean - test_scores_std,
                       test_scores_mean + test_scores_std,
                       alpha=0.2,
                       color='darkorange')
plt.legend(loc='best')
plt.show()
The resulting figure is shown below

You should plot the scores, not the result of grid.fit().
First, pass return_train_score=True:
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10, return_train_score=True)
Then, after fitting the model, plot the scores as follows:
plt.semilogx(alphas, grid.cv_results_['mean_train_score'], label='Train')
plt.semilogx(alphas, grid.cv_results_['mean_test_score'], label='Test')
plt.legend()
Result:

Related

How to solve ValueError: x and y must be the same size issue on Python?

I'm trying to do a linear regression, however I keep running into the same problem of "ValueError: x and y must be the same size". I'm very confused, and have been on every single website there is to try to fix it. If anyone would know that would be a massive help. I don't understand what to do.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
#load datatset
df = pd.read_csv('Real_estate.csv')
X = df[['transaction date', 'house age', 'distance to the nearest MRT station','number of convenience stores', 'latitude','longitude']]
y = df['house price of unit area']
x= df.iloc[:,0:-7].values
y= df.iloc[:,1:].values
x, y = np.array(x), np.array(y)
model = LinearRegression()
model.fit(x, y)
model = LinearRegression().fit(x, y)
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.4)
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
regr = linear_model.LinearRegression()
regr.fit(x_train_std, y_train)
y_pred = regr.predict(x_test)
r_sq = model.score(x, y)
print("Intercept: ", regr.intercept_)
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
##Model evaluation
print("Mean absolute error: %.2f" % mean_absolute_error(y_test,y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
y_pred = model.predict(x)
print('predicted response:', y_pred, sep='\n')
plt.scatter(x_test,y_test, color="black")
plt.plot(x_test, y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is my code, but I don't understand what's going wrong. I'm trying to use 7 columns, including the y value. I'm a beginner in Python, so I apologize if this is a very silly question. Thank you.
plt.plot(x_test, y_pred, color="blue", linewidth=3)
Both arguments need to have the same shape, but y_pred is the prediction over the entire x instead of over x_test.
Change
y_pred = model.predict(x)
to
y_pred = model.predict(x_test)
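The length mismatch can be reproduced in isolation. The following is a minimal sketch with synthetic single-feature data; the names x_all, line, pred_all and pred_te are made up for this illustration and do not come from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_all = rng.uniform(0, 10, size=(50, 1))   # one feature, so it fits on an x-axis
y_all = 3.0 * x_all[:, 0] + rng.normal(size=50)

x_tr, x_te, y_tr, y_te = train_test_split(x_all, y_all,
                                          test_size=0.4, random_state=0)
line = LinearRegression().fit(x_tr, y_tr)

pred_all = line.predict(x_all)   # one value per row of x_all
pred_te = line.predict(x_te)     # one value per row of x_te

# Only pred_te has the same first dimension as x_te, so only
# plt.plot(x_te, pred_te) or plt.scatter(x_te, pred_te) would be accepted.
print(x_te.shape[0], pred_all.shape[0], pred_te.shape[0])
```

Note that the sketch uses a single feature. With several feature columns, as in the question, x_test cannot be passed directly as the x argument of a scatter plot against a 1-D y either; one column has to be selected first.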

Why is the polynomial regression returning the same results for different degrees?

I have this dataframe and I want to compute a polynomial regression for ozone. I pass o3 as the y value and the dates as the x value. Why does my polynomial regression look the same for degrees 2 to 15? I have compared degree 4 to degree 15 and there is no difference. I have also compared the resulting regressions to those from the CurveExpert software, and they are entirely different. How can I solve this and see the differences between degree 4 and degree 15?
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv')
dataset['day'] = pd.to_datetime(dataset['day'], dayfirst=True)
dataset = dataset.sort_values(by=['readable time'])
print(dataset.head())
group_by_df = pd.DataFrame([name, group.mean()["o3"]] for name, group in dataset.groupby('day'))
group_by_df.columns = ['day', "o3"]
group_by_df['day'] = pd.to_datetime(group_by_df['day'])
group_by_df['day'] = group_by_df['day'].map(dt.datetime.toordinal)
X = group_by_df[['day']].values
y = group_by_df[['o3']].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualizing the Linear Regression results
def viz_linear():
    plt.scatter(X, y, color='red')
    plt.plot(X, lin_reg.predict(X), color='blue')
    plt.title('Linear Regression')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_linear()
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('poly Regression grade 15')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_polymonial()
You are so close. Nice job, you have a lot going on here.
I think you want to fit on the test sets, like this for the linear model:
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_test, y_test)
and like this for Polynomial:
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X_test)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y_test)
Now the curves look visibly different.
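A separate factor that may be worth checking, and which is not confirmed by anything in this thread, is numerical conditioning: ordinal dates are numbers around 7e5, so their high powers are extremely large and the polynomial columns can become nearly indistinguishable to the solver. The sketch below is only an illustration under that assumption. It uses synthetic data in place of the real ozone series and standardizes the day column before expanding it; the names days, o3 and r2_by_degree are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the ordinal dates and ozone values (not the real data).
days = (737000 + np.arange(120)).reshape(-1, 1).astype(float)
rng = np.random.default_rng(0)
o3 = np.sin(np.arange(120) / 10.0) + 0.1 * rng.normal(size=120)

r2_by_degree = {}
for degree in (2, 4, 10):
    # Scale first, then build the polynomial terms, then fit a linear model.
    model = make_pipeline(StandardScaler(),
                          PolynomialFeatures(degree),
                          LinearRegression())
    model.fit(days, o3)
    r2_by_degree[degree] = model.score(days, o3)  # R^2 on the fitted data

print(r2_by_degree)
```

With the scaled input, the fits for different degrees are expected to differ, which gives a quick way to tell whether conditioning plays a role in a given dataset.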

What should the parameters be in seaborn's residplot?

I have made a simple linear regression model:
LR = LinearRegression()
kfold = model_selection.KFold(n_splits=10, random_state=12)
result_kfold = model_selection.cross_val_score(LR, X_train, Y_train, cv=kfold, scoring = 'r2')
print("Accuracy: %.2f%%" % (result_kfold.mean()*100.0))
LR.fit(X_train,Y_train)
Y_pred = LR.predict(X_test)
print("Y_pred:", Y_pred)
I want to plot the residual errors, and I've used residplot for that, but I'm not sure I've passed the right arguments. According to the documentation, we have to pass the predictor variable and the response variable.
Here's the code:
sns.set(style="whitegrid")
sns.residplot(Y_test, Y_pred, lowess=True, color="g")
Can anyone please tell me whether this is right? Also, what should the labels of the X and Y axes be?
Thank you in advance for your help.
You are plotting something very weird, so let's use an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
X_train, X_test, Y_train, Y_test = train_test_split(iris.iloc[:,:3], iris.iloc[:,3],random_state=11)
LR = LinearRegression()
LR.fit(X_train,Y_train)
Y_pred = LR.predict(X_test)
If you just want to plot the residuals, you can do:
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize =(5,5))
sns.regplot(x=Y_pred,y=Y_test-Y_pred,ax=ax,lowess=True)
ax.set(ylabel='residuals',xlabel='fitted values')
What you are getting with sns.residplot() is the y variable regressed onto the x variable, with the residuals of that fit being plotted, which makes no sense in your case. I illustrate below how that plot is obtained. First, you fit the prediction (y variable) to the actual values (x variable) and get the residuals:
plotfit = LinearRegression()
plotfit.fit(Y_test.to_numpy().reshape(-1,1),Y_pred)
residual = Y_pred - plotfit.predict(Y_test.to_numpy().reshape(-1,1))
Then plotting it gives you exactly the same thing as your sns.residplot:
sns.set(style="whitegrid")
fig, ax = plt.subplots(1,2,figsize =(10,5))
sns.residplot(x=Y_test, y=Y_pred, lowess=True, color="g", ax=ax[0])
ax[0].set_xlim(0,2.5)
sns.regplot(x=Y_test, y=residual, lowess=True, ax=ax[1])
ax[1].set_xlim(0,2.5)
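For comparison, a residuals-versus-fitted plot can also be built by hand with plain matplotlib. This is a minimal sketch on synthetic data; the feature matrix, the coefficients and the names fitted and residuals are assumptions made up for the example, and the Agg backend is selected only so that it runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X_syn = rng.normal(size=(150, 3))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=11)
fitted = LinearRegression().fit(X_tr, y_tr).predict(X_te)
residuals = y_te - fitted          # observed minus predicted

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(fitted, residuals)
ax.axhline(0.0, color="gray", linestyle="--")  # reference line at zero residual
ax.set(xlabel="fitted values", ylabel="residuals")
```

The x-axis holds the fitted values and the y-axis the residuals, which matches the axis labels used in the regplot call above.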

How to draw the best-fit plane for multiple linear regression in scikit-learn?

I don't have a software background, but I am learning regression techniques to predict motor data.
I have 3D data, for which I used multiple linear regression.
The result is fine, but now I want to visualize the best-fit plane for this data.
The following is code that I copied and pasted from different sites to try to visualize my data.
X_final=df3[['Ampere','Voltage']]
y_final=df3[['ReactivePower']].copy() #copy column data in to y_final
X_final=X_final.dropna()
y_final=y_final.dropna()
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
lr = LinearRegression().fit(X_train,y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
#print score
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print('lr train score %.3f, lr test score: %.3f' % (
lr.score(X_train,y_train),
lr.score(X_test, y_test)))
# Visualize the Data for Multiple Linear Regression
x_surf, y_surf = np.meshgrid(np.linspace(df3.Voltage.min(), df3.Voltage.max()),np.linspace(df3.Ampere.min(), df3.Ampere.max()))
y_train_pred_random= y_train_pred[np.random.choice(y_train_pred.shape[0], 2500, replace=False), :]
y_train_pred_random=np.array(y_train_pred_random)
y_train_pred1=y_train_pred_random.reshape(x_surf.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df3['Voltage'],df3['Ampere'],df3['ReactivePower'],c='red', marker='o', alpha=0.5)
ax.plot_surface(x_surf,y_surf,y_train_pred1,rstride=1, cstride=1, color='b', alpha=0.3)
ax.set_xlabel('Voltage')
ax.set_ylabel('Ampere')
ax.set_zlabel('Reactive Power')
plt.show()
When I run the visualization code, I get the following graph:
Please help
I solved it myself with some references online.
Here is the code:
# Train/test split for multiple linear regression
X_final=df3[['Ampere','Voltage']]
y_final=df3[['ReactivePower']].copy() #copy column data in to y_final
X_final=X_final.dropna()
y_final=y_final.dropna()
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
lr = LinearRegression().fit(X_train,y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
#print score
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print('lr train score %.3f, lr test score: %.3f' % (
lr.score(X_train,y_train),
lr.score(X_test, y_test)))
# Visualize the Data for Multiple Linear Regression
x_surf, y_surf = np.meshgrid(np.linspace(df3.Ampere.min(), df3.Ampere.max()),np.linspace(df3.Voltage.min(), df3.Voltage.max()))
z_surf=lr.coef_[0,0]*x_surf+lr.coef_[0,1]*y_surf+lr.intercept_
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df3['Ampere'],df3['Voltage'],df3['ReactivePower'],c='red', marker='o', alpha=0.5)
ax.plot_surface(x_surf,y_surf,z_surf,rstride=1, cstride=1, color='b', alpha=0.3)
ax.set_xlabel('Ampere')
ax.set_ylabel('Voltage')
ax.set_zlabel('Reactive Power')
plt.show()
Here is the plot,
Thanks,
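As a cross-check on the plane equation used above, the same surface can be obtained by letting the fitted model predict every grid point. This is a minimal sketch with made-up two-feature data standing in for Ampere and Voltage; the names features, target and plane are hypothetical. Because the target here is a 1-D array, coef_ is also 1-D, unlike the 2-D coef_ in the answer, where y is a one-column DataFrame.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical two-feature data standing in for Ampere and Voltage.
features = rng.uniform(0, 10, size=(100, 2))
target = (2.0 * features[:, 0] - 1.5 * features[:, 1] + 4.0
          + rng.normal(scale=0.1, size=100))

plane = LinearRegression().fit(features, target)

x_surf, y_surf = np.meshgrid(np.linspace(0, 10, 20), np.linspace(0, 10, 20))
# Column order must match the feature order used when fitting.
grid_points = np.column_stack([x_surf.ravel(), y_surf.ravel()])

# Option 1: predict every grid point, then restore the grid shape.
z_pred = plane.predict(grid_points).reshape(x_surf.shape)
# Option 2: evaluate the plane equation directly from the coefficients.
z_formula = plane.coef_[0] * x_surf + plane.coef_[1] * y_surf + plane.intercept_

print(np.allclose(z_pred, z_formula))
```

The predict-based variant has the advantage that it keeps working if the model is later replaced by one that is not a plain plane, as long as the column order of the grid matches the training features.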

Deviance loss scores on the training data don't match clf.train_score_

TL;DR: I'm trying to understand the meaning of the train_score_ attribute of a GradientBoostingClassifier, and specifically why it doesn't match my following attempt to calculate it directly:
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
More details: I'm interested in the loss scores for both the test and the train data during the different fit stages of the classifier. I can use staged_predict and loss_ to calculate the loss scores for the test data:
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
I'm okay with that. My problem is with the train loss scores. The documentation suggests to use clf.train_score_:
The i-th score train_score_[i] is the deviance (= loss) of the model
at iteration i on the in-bag sample. If subsample == 1 this is the
deviance on the training data.
yet these clf.train_score_ values do not match my attempt to calculate them directly in my_train_scores above. What am I missing here?
The code I used:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GradientBoostingClassifier(n_estimators=5, loss='deviance')
clf.fit(X_train, y_train)
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
print test_scores
print clf.train_score_
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
print my_train_scores, '<= NOT the same values as in the previous line. Why?'
Producing e.g. this output...
[0.71319004170311229, 0.74985670836977902, 0.79319004170311214, 0.55385670836977885, 0.32652337503644546]
[ 1.369166 1.35366377 1.33780865 1.32352935 1.30866325]
[0.65541226392533436, 0.67430115281422309, 0.70807893059200089, 0.51096781948088987, 0.3078567083697788] <= NOT the same values as in the previous line. Why?
...where the last two rows do not match.
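staged_predict returns class labels, but the loss has to be evaluated on the raw decision-function values. Scores comparable to train_score_ can be computed in the following way:
test_dev = []
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_dev.append(clf.loss_(y_test, pred))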
ax = plt.gca()
ax.plot(np.arange(clf.n_estimators) + 1, test_dev, color='#d7191c', label='Test', linewidth=2, alpha=0.7)
ax.plot(np.arange(clf.n_estimators) + 1, clf.train_score_, color='#2c7bb6', label='Train', linewidth=2, alpha=0.7, linestyle='--')
ax.set_xlabel('n_estimators')
plt.legend()
plt.show()
See the result below. Note that the curves are on top of each other as the training and test data are the same data.
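The loss_ attribute used above may not be available in every scikit-learn version, so here is a hedged alternative sketch that relies only on staged_predict_proba and sklearn.metrics.log_loss. It assumes that train_score_ is a fixed multiple of the log loss on the training data when subsample is 1; the exact multiple is treated as version-dependent and is not asserted here. The names X_gb, gb, stage_log_loss and ratio are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X_gb, y_gb = make_hastie_10_2(n_samples=2000, random_state=0)
gb = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X_gb, y_gb)

# Log loss of the predicted probabilities after each boosting stage,
# evaluated on the same data the model was trained on.
stage_log_loss = np.array([
    log_loss(y_gb, proba, labels=gb.classes_)
    for proba in gb.staged_predict_proba(X_gb)
])

# If train_score_ is a fixed multiple of the log loss, this ratio is the
# same at every stage.
ratio = gb.train_score_ / stage_log_loss
print(ratio)
```

Computing the same quantity from staged_predict instead would feed hard class labels into the loss, which is why the values in the question do not line up with train_score_.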
