What should the parameters be in seaborn's residplot? (Python)

I have made a simple linear regression model:
from sklearn.linear_model import LinearRegression
from sklearn import model_selection

LR = LinearRegression()
kfold = model_selection.KFold(n_splits=10, random_state=12)
result_kfold = model_selection.cross_val_score(LR, X_train, Y_train, cv=kfold, scoring = 'r2')
print("Accuracy: %.2f%%" % (result_kfold.mean()*100.0))
LR.fit(X_train,Y_train)
Y_pred = LR.predict(X_test)
print("Y_pred:", Y_pred)
I want to plot the residual errors. I've used 'residplot' for this, but I'm not sure if I've passed the right arguments. According to the documentation, we have to pass the predictor variable and the response variable.
Here's the code:
sns.set(style="whitegrid")
sns.residplot(Y_test, Y_pred, lowess=True, color="g")
Can anyone please tell me if this is right? Also, what should the labels of the X and Y axes be?
Thank you in advance for the help.

You are plotting something very weird, so let's use an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
X_train, X_test, Y_train, Y_test = train_test_split(iris.iloc[:,:3], iris.iloc[:,3],random_state=11)
LR = LinearRegression()
LR.fit(X_train,Y_train)
Y_pred = LR.predict(X_test)
If you just want to plot the residuals, you can do:
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize =(5,5))
sns.regplot(x=Y_pred,y=Y_test-Y_pred,ax=ax,lowess=True)
ax.set(ylabel='residuals',xlabel='fitted values')
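That also answers the axis-label question: the x axis holds the fitted (predicted) values and the y axis holds the residuals, and for a well-behaved model the residuals should scatter roughly symmetrically around zero with no obvious trend.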
What you are getting with sns.residplot() is the y variable regressed onto the x variable, with the residuals of that fit being plotted, which makes no sense in your case. I illustrate below how that plot is obtained: first you fit the prediction (the y variable) to the actual values (the x variable) and compute the residuals:
plotfit = LinearRegression()
plotfit.fit(Y_test.to_numpy().reshape(-1,1),Y_pred)
residual = Y_pred - plotfit.predict(Y_test.to_numpy().reshape(-1,1))
Then plotting it gives you exactly the same thing as your sns.residplot:
sns.set(style="whitegrid")
fig, ax = plt.subplots(1,2,figsize =(10,5))
sns.residplot(Y_test, Y_pred, lowess=True, color="g", ax=ax[0])
ax[0].set_xlim(0, 2.5)
sns.regplot(x=Y_test, y=residual, lowess=True, ax=ax[1])
ax[1].set_xlim(0, 2.5)
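Note that recent seaborn releases (roughly 0.12 and later) expect the data to be passed as keyword arguments, so the positional residplot call above may raise an error there. A minimal sketch of the keyword form, assuming the same variables:
sns.set(style="whitegrid")
sns.residplot(x=Y_test, y=Y_pred, lowess=True, color="g")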

Related

How to solve ValueError: x and y must be the same size issue on Python?

I'm trying to do a linear regression; however, I keep running into the same problem: "ValueError: x and y must be the same size". I'm very confused and have been on every website there is trying to fix it. If anyone knows what's going on, that would be a massive help. I don't understand what to do.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
#load datatset
df = pd.read_csv('Real_estate.csv')
X = df[['transaction date', 'house age', 'distance to the nearest MRT station','number of convenience stores', 'latitude','longitude']]
y = df['house price of unit area']
x= df.iloc[:,0:-7].values
y= df.iloc[:,1:].values
x, y = np.array(x), np.array(y)
model = LinearRegression()
model.fit(x, y)
model = LinearRegression().fit(x, y)
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.4)
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
regr = linear_model.LinearRegression()
regr.fit(x_train_std, y_train)
y_pred = regr.predict(x_test)
r_sq = model.score(x, y)
print("Intercept: ", regr.intercept_)
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
##Model evaluation
print("Mean absolute error: %.2f" % mean_absolute_error(y_test,y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
y_pred = model.predict(x)
print('predicted response:', y_pred, sep='\n')
plt.scatter(x_test,y_test, color="black")
plt.plot(x_test, y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is my code but I don't understand what's going wrong. I'm trying to use 7 columns, including the y value. I'm a beginner to Python, so I apologize if this is a very silly question. Thank you.
plt.plot(x_test, y_pred, color="blue", linewidth=3)
Both arguments need to have the same shape, but y_pred is the prediction over the entire x rather than over x_test.
change
y_pred = model.predict(x)
to
y_pred = model.predict(x_test)
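As a self-contained sketch of the shape rule (using hypothetical single-feature data, not the Real_estate.csv columns), the arrays handed to plt.scatter/plt.plot must have matching first dimensions, so the prediction has to come from the same split that supplies the x values:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))        # a single feature so it can be scattered directly
y = 3 * x.ravel() + rng.normal(size=100)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)               # predict on x_test, not on all of x
plt.scatter(x_test, y_test, color="black")   # x_test and y_test have the same length
plt.plot(x_test, y_pred, color="blue", linewidth=3)  # and so do x_test and y_pred
plt.show()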

Scatter Plot of predicted vs actual value with regression curve

I am trying to create scatter plots with regression curves using the following code. I am using different algorithms such as linear regression, SVM, and Gaussian process regression. I have tried different options for plotting the data, mentioned below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
df = pd.read_excel('coded.xlsx')
dfnew=df[['FL','FW','TL','LL','KH']]
Y = df['KH']
X = df[['FL']]
X=X.values.reshape(len(X),1)
Y=Y.values.reshape(len(Y),1)
# Split the data into training/testing sets
X_train = X[:-270]
X_test = X[-270:]
# Split the targets into training/testing sets
Y_train = Y[:-270]
Y_test = Y[-270:]
#regressor = SVR(kernel = 'rbf')
#regressor.fit(X_train, np.ravel(Y_train))
#training the algorithm
regressor = GaussianProcessRegressor(random_state=42)
regressor.fit(X_train, Y_train)
y_pred = regressor.predict(X_test)
mse = np.sum((y_pred - Y_test)**2)
# root mean squared error
# m is the number of training examples
rmse = np.sqrt(mse/270)
print(rmse)
#X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
#X_grid = np.arange(0, 15, 0.01) #this step required because data is feature scaled.
#X_grid = X_grid.reshape((len(X_grid), 1))
#plt.scatter(X, Y, color = 'red')
print('size of Y_train = {0}'.format(Y_train.size))
print('size of y_pred = {0}'.format(y_pred.size))
#plt.scatter(Y_train, y_pred, color = 'red')
#plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
#plt.title('GPR')
#plt.xlabel('Measured')
#plt.ylabel('Predicted')
#plt.show()
fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(X[:, 0], Y_train, marker='o', color='black', linewidth=0)
plt.plot(X[:, 0], y_pred, marker='x', color='steelblue')
plt.suptitle("$GaussianProcessRegressor(kernel=RBF)$ [default]", fontsize=20)
plt.axis('off')
pass
But I am getting error like:
ValueError: x and y must have same first dimension, but have shapes (540,) and (270, 1)
What is the possible solution?
This code splits X and Y into training/testing sets, but then tries to plot a column from all of X with Y_train and y_pred, which have only half as many values as X. Try creating plots with X_train and X_test instead.
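Building on the variables from the question (so this is only a sketch, not tested against the original data), the plotting calls would then pair each y array with the x split of the same length:
fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(X_train[:, 0], Y_train, marker='o', color='black', linewidth=0)   # 270 training targets vs X_train
plt.plot(X_test[:, 0], y_pred, marker='x', color='steelblue', linewidth=0) # 270 predictions vs X_test
plt.suptitle("$GaussianProcessRegressor(kernel=RBF)$ [default]", fontsize=20)
plt.show()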

Draw figures using k-nn with different values of k

I want to plot figures with different values of k for a k-NN classifier.
My problem is that the figures all seem to use the same value of k.
What I have tried so far is to change the value of k on each run of the loop:
clf = KNeighborsClassifier(n_neighbors=counter+1)
But all the figures seem to be for k=1.
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
import numpy as np
from sklearn.model_selection import train_test_split
c = np.array([1 if y > np.median(data['target']) else 0 for y in data['target']])
X_train, X_test, c_train, c_test = train_test_split(data['data'], c, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
import mglearn
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
for counter in range(3):
    clf = KNeighborsClassifier(n_neighbors=counter+1)
    clf.fit(X_test, c_test)
    plt.tight_layout()  # this will help create proper spacing between the plots.
    mglearn.discrete_scatter(X_test[:,0], X_test[:,1], c_test, ax=ax[counter])
    plt.legend(["Class 0", "Class 1"], loc=4)
    plt.xlabel("First feature")
    plt.ylabel("Second feature")
    #plt.figure()
The reason why all the plots look the same is that you are simply plotting the test set every time instead of plotting the model predictions on the test set. You probably meant to do the following for each value of k:
Fit the model to the training set, in which case you should replace clf.fit(X_test, c_test) with clf.fit(X_train, c_train).
Generate the model predictions on the test set, in which case you should add c_pred = clf.predict(X_test).
Plot the model predictions on the test set, in which case you should replace c_test with c_pred in the scatter plot, i.e. use mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], c_pred, ax=ax[counter]) instead of mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], c_test, ax=ax[counter]).
Updated code:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import mglearn
import matplotlib.pyplot as plt
data = fetch_california_housing()
c = np.array([1 if y > np.median(data['target']) else 0 for y in data['target']])
X_train, X_test, c_train, c_test = train_test_split(data['data'], c, random_state=0)
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
for counter in range(3):
    clf = KNeighborsClassifier(n_neighbors=counter+1)
    # fit the model to the training set
    clf.fit(X_train, c_train)
    # extract the model predictions on the test set
    c_pred = clf.predict(X_test)
    # plot the model predictions
    plt.tight_layout()
    mglearn.discrete_scatter(X_test[:,0], X_test[:,1], c_pred, ax=ax[counter])
    plt.legend(["Class 0", "Class 1"], loc=4)
    plt.xlabel("First feature")
    plt.ylabel("Second feature")
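Optionally, to make it obvious which k each panel uses, you could also set a title inside the loop, e.g. something like:
ax[counter].set_title("k = {}".format(counter + 1))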

Why is the polynomial regression returning the same results for different degrees?

I have this dataframe and I want to calculate the polynomial regression for ozone. I pass o3 as the y value and the dates as the x value. Why does my polynomial regression look the same for degrees 2 through 15? I have compared degree 4 to degree 15 and there is no difference. I have also compared the obtained regressions to the CurveExpert software, and they are entirely different. How can I fix this and see the difference between degree 4 and degree 15?
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv')
dataset['day'] = pd.to_datetime(dataset['day'], dayfirst=True)
dataset = dataset.sort_values(by=['readable time'])
print(dataset.head())
group_by_df = pd.DataFrame([name, group.mean()["o3"]] for name, group in dataset.groupby('day'))
group_by_df.columns = ['day', "o3"]
group_by_df['day'] = pd.to_datetime(group_by_df['day'])
group_by_df['day'] = group_by_df['day'].map(dt.datetime.toordinal)
X = group_by_df[['day']].values
y = group_by_df[['o3']].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualizing the Linear Regression results
def viz_linear():
    plt.scatter(X, y, color='red')
    plt.plot(X, lin_reg.predict(X), color='blue')
    plt.title('Linear Regression')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_linear()
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('Polynomial Regression (degree 15)')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_polymonial()
You are so close. Nice job; you have a lot going on here.
I think you want to fit the test sets like this for the linear model:
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_test, y_test)
and like this for Polynomial:
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X_test)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y_test)
Now the curves look visually quite different.
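A sketch of how the degree-15 curve could then be redrawn against the test split (assuming the rest of the question's code has already run; X_test is sorted first so the line is drawn left to right):
import numpy as np
X_sorted = np.sort(X_test, axis=0)
plt.scatter(X_test, y_test, color='red')
plt.plot(X_sorted, pol_reg.predict(poly_reg.fit_transform(X_sorted)), color='blue')
plt.title('Polynomial Regression (degree 15, test split)')
plt.xlabel('Date')
plt.ylabel('O3 levels')
plt.show()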

Plot multiple confusion matrices with plot_confusion_matrix

I am using plot_confusion_matrix from sklearn.metrics. I want to show those confusion matrices next to each other, like subplots. How could I do this?
Let's use the good ol' iris dataset to reproduce this, and fit several classifiers so we can plot their respective confusion matrices with plot_confusion_matrix:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import plot_confusion_matrix
data = load_iris()
X = data.data
y = data.target
Set up -
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifiers = [LogisticRegression(solver='lbfgs'),
               AdaBoostClassifier(),
               GradientBoostingClassifier(),
               SVC()]
for cls in classifiers:
    cls.fit(X_train, y_train)
So the way you can compare all matrices at a glance is by creating a set of subplots with plt.subplots. Then iterate over both the axes objects and the trained classifiers (plot_confusion_matrix expects the fitted estimator as input) and plot the individual confusion matrices:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
for cls, ax in zip(classifiers, axes.flatten()):
    plot_confusion_matrix(cls,
                          X_test,
                          y_test,
                          ax=ax,
                          cmap='Blues',
                          display_labels=data.target_names)
    ax.title.set_text(type(cls).__name__)
plt.tight_layout()
plt.show()
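Note that plot_confusion_matrix was deprecated and later removed from scikit-learn; on newer versions the same loop can be written with ConfusionMatrixDisplay.from_estimator instead (a sketch using the same variables as above):
from sklearn.metrics import ConfusionMatrixDisplay
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
for cls, ax in zip(classifiers, axes.flatten()):
    ConfusionMatrixDisplay.from_estimator(cls,
                                          X_test,
                                          y_test,
                                          ax=ax,
                                          cmap='Blues',
                                          display_labels=data.target_names)
    ax.title.set_text(type(cls).__name__)
plt.tight_layout()
plt.show()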
If your desired output looks like that, this is my way to see multiple confusion matrices (via confusion_matrix) side by side with ConfusionMatrixDisplay.
Note: substitute your own test and train data names in the metrics.confusion_matrix() calls.
from sklearn import metrics  # confusion_matrix and ConfusionMatrixDisplay live here

fig, ax = plt.subplots(1, 2)
ax[0].set_title("test")
ax[1].set_title("train")
metrics.ConfusionMatrixDisplay(
    confusion_matrix=metrics.confusion_matrix(y_test, y_pred),
    display_labels=[False, True]).plot(ax=ax[0])
metrics.ConfusionMatrixDisplay(
    confusion_matrix=metrics.confusion_matrix(y_train, y_train_pred),
    display_labels=[False, True]).plot(ax=ax[1])
