display a plot with linear regression sklearn - python

I have a dataset where X = data[['x', 'y']], the first two columns, and the target is data['class'].
I don't know how to display a linear regression plot in this case, because I get the error "x and y must be the same size".
How can I plot a linear regression and make predictions when X is the first two columns of my dataset and the target is the last column?
Thanks so much for the help. Here is my code:
data = pd.read_csv('data.csv')
X = data[['x', 'y']]
data['class'] = np.where(data['class']=='P', 1, 0)
Y = data['class']
plt.scatter(X, Y, color='blue')
plt.xlabel('x')
plt.ylabel('y')
plt.plot(X, Y, color='red', linewidth=2)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Based on the official documentation:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)  # the prediction step was missing
# X_test has two columns, so plot against one feature at a time;
# passing the whole frame raises "x and y must be the same size"
plt.scatter(X_test['x'], y_test, color='black')  # observed values
plt.scatter(X_test['x'], y_pred, color='red')  # predicted values
plt.xticks(())
plt.yticks(())
plt.show()
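The "x and y must be the same size" error appears because a two-column X holds twice as many values as the target. A minimal sketch of one way around it, assuming synthetic data in place of data.csv (the arrays, coefficients, and the continuous target below are made up for illustration), draws one panel per feature:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data[['x', 'y']] and the target (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# One panel per feature: each single column has the same length as y_test
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(X_test[:, i], y_test, color="black", label="observed")
    ax.scatter(X_test[:, i], y_pred, color="red", s=10, label="predicted")
    ax.set_xlabel(f"feature {i}")
axes[0].set_ylabel("target")
axes[0].legend()
```

Call plt.show() afterwards to display the figure. With two features the predictions do not fall on a single line in either panel, which is why they are drawn as points here.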

Related

How to solve ValueError: x and y must be the same size issue on Python?

I'm trying to do a linear regression, but I keep running into "ValueError: x and y must be the same size". I have searched widely for a fix without success, and I don't understand what to do. Any help would be appreciated.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
# load dataset
df = pd.read_csv('Real_estate.csv')
X = df[['transaction date', 'house age', 'distance to the nearest MRT station','number of convenience stores', 'latitude','longitude']]
y = df['house price of unit area']
x= df.iloc[:,0:-7].values
y= df.iloc[:,1:].values
x, y = np.array(x), np.array(y)
model = LinearRegression()
model.fit(x, y)
model = LinearRegression().fit(x, y)
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.4)
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
regr = linear_model.LinearRegression()
regr.fit(x_train_std, y_train)
y_pred = regr.predict(x_test)
r_sq = model.score(x, y)
print("Intercept: ", regr.intercept_)
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
##Model evaluation
print("Mean absolute error: %.2f" % mean_absolute_error(y_test,y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
y_pred = model.predict(x)
print('predicted response:', y_pred, sep='\n')
plt.scatter(x_test,y_test, color="black")
plt.plot(x_test, y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is my code, but I don't understand what's going wrong. I'm trying to use 7 columns, including the y value. I'm a beginner in Python, so I apologize if this is a very silly question. Thank you.
plt.plot(x_test, y_pred, color="blue", linewidth=3)
Both arguments must have the same length, but y_pred holds predictions for the entire x rather than for x_test.
Change
y_pred = model.predict(x)
to
y_pred = model.predict(x_test)
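A minimal sketch of the length mismatch, assuming a single synthetic feature rather than the Real_estate.csv columns (all data below is made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic single-feature data (assumption, not the original dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

y_pred_all = model.predict(x)   # 100 values: cannot be plotted against x_test
y_pred = model.predict(x_test)  # 40 values: same length as x_test

# Sort by x so the fitted line is drawn from left to right
order = np.argsort(x_test[:, 0])
fig, ax = plt.subplots()
ax.scatter(x_test[:, 0], y_test, color="black")
ax.plot(x_test[order, 0], y_pred[order], color="blue", linewidth=3)
```

Comparing len(x_test) with len(y_pred) before plotting is a quick way to catch this kind of error.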

Training and Testing Model PollyPlot

I'm trying to plot a Polynomial Plot with Matplotlib/Seaborn. I am new to Data Science and thus I'm having trouble with this bit of code:
def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    #training data
    #testing data
    # lr: linear regression object
    #poly_transform: polynomial transformation object
    xmax=max([xtrain.values.max(), xtest.values.max()])
    xmin=min([xtrain.values.min(), xtest.values.min()])
    x=np.arange(xmin, xmax, 0.1)
    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
This is the function that plots the polynomial function. However when I call the function with:
PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly, pr)
I get the following error:
InvalidIndexError: (slice(None, None, None), None)
Any assistance given would be greatly appreciated.
I made two changes.
In the function, I removed .values from the max and min calculations, because xtrain and xtest are now passed as NumPy arrays:
def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    #training data
    #testing data
    # lr: linear regression object
    #poly_transform: polynomial transformation object
    xmax=max([xtrain.max(), xtest.max()])
    xmin=min([xtrain.min(), xtest.min()])
    x=np.arange(xmin, xmax, 0.1)
    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
In the call, I converted x_train, x_test, y_train and y_test to NumPy arrays:
PollyPlot(x_train[['horsepower']].to_numpy(), x_test[['horsepower']].to_numpy(), y_train.to_numpy(), y_test.to_numpy(), poly,pr)
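For reference, here is a self-contained sketch of the same idea that works on NumPy arrays from the start. The horsepower and price values are synthetic, and the degree-2 fit is an assumption made only for this illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for the horsepower and price columns (assumption)
rng = np.random.default_rng(2)
horsepower = rng.uniform(50, 250, size=80)
price = 5000 + 0.5 * horsepower**2 + rng.normal(scale=500, size=80)

# scikit-learn expects features with shape (n_samples, n_features)
x_2d = horsepower.reshape(-1, 1)
pr = PolynomialFeatures(degree=2)
poly = LinearRegression().fit(pr.fit_transform(x_2d), price)

# Dense grid for a smooth curve; transform() reuses the fitted transformer
grid = np.arange(horsepower.min(), horsepower.max(), 0.1)
curve = poly.predict(pr.transform(grid.reshape(-1, 1)))

fig, ax = plt.subplots(figsize=(12, 10))
ax.plot(horsepower, price, "ro", label="Training Data")
ax.plot(grid, curve, label="Predicted Function")
ax.set_ylabel("Price")
ax.legend()
```

Passing plain 1-D arrays to the plotting calls avoids handing a DataFrame to matplotlib, which appears to be what triggered the InvalidIndexError in the question.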

Representation of a training and validation metric in a pipeline

I want to plot my training and validation RMSE. However, I now use a pipeline, because I use cross-validation and other steps such as feature selection.
Is there a way to produce this plot from the pipeline without training the model a second time? In other words, how can I display the training and validation RMSE in one diagram when using the pipeline?
Pipeline
dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")
d = {True: 1, False: 0, np.nan : np.nan}
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression[
    'host_is_superhost'].map(d).astype('int')
X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price']) # Features
y = dfListingsFeature_regression['price'] # Target variable
print(dfListingsFeature_nor.shape)
steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=10000))),
         ('lasso', Lasso(alpha=0.4))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)
parameteres = { }
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))
r2 = metrics.r2_score(y_test, y_pred)
print(r2)
Plot
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, 500 + 1):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.figure( figsize=(10,10))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
%%time
lin_reg = Lasso(alpha=0.1)
plot_learning_curves(lin_reg, X, y)
#plt.axis([0, 80, 0, 3])
plt.show()
You don't have to call fit() on your model again in plot_learning_curves. You can simply use your fitted pipeline to predict values for both the training and validation sets, and then plot your learning curve.
Your function should look as follows, without the model.fit() call:
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, 500 + 1):
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.figure( figsize=(10,10))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
Then you should call this function using your fitted model as parameter.
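A minimal sketch of that workflow, assuming synthetic data and a hypothetical two-step pipeline (StandardScaler followed by Lasso) in place of the pipeline from the question:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the listings dataset (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=10
)

pipeline = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.4))])
pipeline.fit(X_train, y_train)  # the only call to fit()


def rmse(y_true, y_hat):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


# Score the already fitted pipeline on growing prefixes of the training set
sizes = range(10, len(X_train) + 1, 10)
train_rmse = [rmse(y_train[:m], pipeline.predict(X_train[:m])) for m in sizes]
val_rmse = rmse(y_val, pipeline.predict(X_val))

fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(list(sizes), train_rmse, "r-+", linewidth=2, label="train")
ax.axhline(val_rmse, color="b", linewidth=3, label="val")
ax.set_xlabel("Training set size", fontsize=14)
ax.set_ylabel("RMSE", fontsize=14)
ax.legend(loc="upper right", fontsize=14)
```

Because the model is never refitted, the validation RMSE is a single value, and the training curve only shows how one fitted model scores on subsets of its own training data. That is not the same thing as a learning curve obtained by refitting at each size.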

How to draw best fit plane for multi variant regression in scikit learn?

I do not have a software background, but I am learning regression techniques to predict motor data.
I have 3D data, on which I have used multivariate regression.
The result is fine, but now I want to visualize the best-fit plane for this data.
The following code, which I copied from different sites, is my attempt to visualize the data:
X_final=df3[['Ampere','Voltage']]
y_final=df3[['ReactivePower']].copy() #copy column data in to y_final
X_final=X_final.dropna()
y_final=y_final.dropna()
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
lr = LinearRegression().fit(X_train,y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
#print score
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print('lr train score %.3f, lr test score: %.3f' % (
lr.score(X_train,y_train),
lr.score(X_test, y_test)))
# Visualize the Data for Multiple Linear Regression
x_surf, y_surf = np.meshgrid(np.linspace(df3.Voltage.min(), df3.Voltage.max()),np.linspace(df3.Ampere.min(), df3.Ampere.max()))
y_train_pred_random= y_train_pred[np.random.choice(y_train_pred.shape[0], 2500, replace=False), :]
y_train_pred_random=np.array(y_train_pred_random)
y_train_pred1=y_train_pred_random.reshape(x_surf.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df3['Voltage'],df3['Ampere'],df3['ReactivePower'],c='red', marker='o', alpha=0.5)
ax.plot_surface(x_surf,y_surf,y_train_pred1,rstride=1, cstride=1, color='b', alpha=0.3)
ax.set_xlabel('Voltage')
ax.set_ylabel('Ampere')
ax.set_zlabel('Reactive Power')
plt.show()
When I run the visualization code, I get the following graph.
Please help.
I solved it myself with the help of some online references.
Here is the code:
# Train/test split for multivariate regression
X_final=df3[['Ampere','Voltage']]
y_final=df3[['ReactivePower']].copy() #copy column data in to y_final
X_final=X_final.dropna()
y_final=y_final.dropna()
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
lr = LinearRegression().fit(X_train,y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
#print score
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print('lr train score %.3f, lr test score: %.3f' % (
lr.score(X_train,y_train),
lr.score(X_test, y_test)))
# Visualize the Data for Multiple Linear Regression
x_surf, y_surf = np.meshgrid(np.linspace(df3.Ampere.min(), df3.Ampere.max()),np.linspace(df3.Voltage.min(), df3.Voltage.max()))
z_surf=lr.coef_[0,0]*x_surf+lr.coef_[0,1]*y_surf+lr.intercept_
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df3['Ampere'],df3['Voltage'],df3['ReactivePower'],c='red', marker='o', alpha=0.5)
ax.plot_surface(x_surf,y_surf,z_surf,rstride=1, cstride=1, color='b', alpha=0.3)
ax.set_xlabel('Ampere')
ax.set_ylabel('Voltage')
ax.set_zlabel('Reactive Power')
plt.show()
Here is the plot,
Thanks,
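An alternative sketch evaluates the fitted model on the grid with predict() instead of writing the plane out from coef_ and intercept_. It assumes synthetic Ampere, Voltage and ReactivePower values (df3 is not available here) and a 1-D target, so the names and numbers below are illustrative only:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the df3 columns (assumption)
rng = np.random.default_rng(3)
ampere = rng.uniform(0, 10, size=200)
voltage = rng.uniform(200, 240, size=200)
reactive = 2.0 * ampere + 0.5 * voltage + rng.normal(scale=1.0, size=200)

X = np.column_stack([ampere, voltage])  # column order: Ampere, Voltage
lr = LinearRegression().fit(X, reactive)

# Grid over the feature ranges, in the same column order used for fitting
x_surf, y_surf = np.meshgrid(
    np.linspace(ampere.min(), ampere.max(), 30),
    np.linspace(voltage.min(), voltage.max(), 30),
)
grid_points = np.column_stack([x_surf.ravel(), y_surf.ravel()])
z_surf = lr.predict(grid_points).reshape(x_surf.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(ampere, voltage, reactive, c="red", marker="o", alpha=0.5)
ax.plot_surface(x_surf, y_surf, z_surf, color="b", alpha=0.3)
ax.set_xlabel("Ampere")
ax.set_ylabel("Voltage")
ax.set_zlabel("Reactive Power")
```

Using predict() on the grid keeps the plotting code unchanged if the estimator is later swapped for a non-linear one, at which point the surface is no longer a plane.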

how to visualize dependence of model performance & alpha with matplotlib?

I fit a Ridge Regression with GridSearchCV but am having trouble using matplotlib to show the model performance versus regularizer(alpha)
Could anyone please help?
My code:
from sklearn.datasets import fetch_california_housing
cal=fetch_california_housing()
X = cal.data
y = cal.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
import matplotlib.pyplot as plt
alphas = np.logspace(-3, 3, 13)
plt.semilogx(alphas, grid.fit(X_train, y_train), label='Train')
plt.semilogx(alphas, grid.fit(X_test, y_test), label='Test')
plt.legend(loc='lower left')
plt.ylim([0, 1.0])
plt.xlabel('alpha')
plt.ylabel('performance')
# the error code I got was "ValueError: x and y must have same first dimension"
Basically, I want to see something like the following:
When plotting model-selection performance from a grid search, it is typical to plot the mean and standard deviation of the training and test scores across the cross-validation folds.
Care should also be taken over which scoring criterion the grid search uses to select the best model; for regression this is typically R-squared.
The fitted grid search exposes a dictionary (accessible through .cv_results_) containing the train and test scores for each fold, as well as the time it took to fit and score each fold. It also includes a summary of those values as means and standard deviations.
P.S. In newer versions of scikit-learn you need to pass return_train_score=True to get the training scores.
P.P.S. When using grid search, splitting the data into train and test sets is not necessary for model selection, because the grid search splits the data automatically (cv=10 means the data is split into 10 folds).
Given the above, I modified the code to:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
cal = fetch_california_housing()
X = cal.data
y = cal.target
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
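# Note: the normalize parameter was removed from Ridge in scikit-learn 1.2;
# on newer versions, drop it and scale the features beforehand instead.
grid = GridSearchCV(Ridge(normalize=True), param_grid,
                    cv=10, return_train_score=True, scoring='r2')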
grid.fit(X, y)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
alphas = np.logspace(-3, 3, 13)
train_scores_mean = grid.cv_results_["mean_train_score"]
train_scores_std = grid.cv_results_["std_train_score"]
test_scores_mean = grid.cv_results_["mean_test_score"]
test_scores_std = grid.cv_results_["std_test_score"]
plt.figure()
plt.title('Model')
plt.xlabel('$\\alpha$ (alpha)')
plt.ylabel('Score')
# plot train scores
plt.semilogx(alphas, train_scores_mean, label='Mean Train score',
             color='navy')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       train_scores_mean - train_scores_std,
                       train_scores_mean + train_scores_std,
                       alpha=0.2,
                       color='navy')
plt.semilogx(alphas, test_scores_mean,
             label='Mean Test score', color='darkorange')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       test_scores_mean - test_scores_std,
                       test_scores_mean + test_scores_std,
                       alpha=0.2,
                       color='darkorange')
plt.legend(loc='best')
plt.show()
The resulting figure is shown below
You should plot the scores, not the result of grid.fit().
First, pass return_train_score=True:
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10, return_train_score=True)
Then, after fitting the model, plot the scores as follows:
plt.semilogx(alphas, grid.cv_results_['mean_train_score'], label='Train')
plt.semilogx(alphas, grid.cv_results_['mean_test_score'], label='Test')
plt.legend()
Result:
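On scikit-learn versions where Ridge no longer accepts normalize, one possible variant scales the features inside a pipeline and tunes alpha through the step name. This is a sketch under assumptions: it uses synthetic data so that it runs offline, and StandardScaler is not numerically identical to the old normalize=True behaviour.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of the California housing set (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=0)

alphas = np.logspace(-3, 3, 13)
# make_pipeline names the step "ridge", so the parameter is "ridge__alpha"
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(
    pipe,
    {"ridge__alpha": alphas},
    cv=5,
    scoring="r2",
    return_train_score=True,
)
grid.fit(X, y)

# One mean score per alpha, in the same order as the alphas array
mean_train = grid.cv_results_["mean_train_score"]
mean_test = grid.cv_results_["mean_test_score"]

fig, ax = plt.subplots()
ax.semilogx(alphas, mean_train, label="Train")
ax.semilogx(alphas, mean_test, label="Test")
ax.set_xlabel("alpha")
ax.set_ylabel("R-squared")
ax.legend()
```

Keeping the scaler inside the pipeline means it is refitted on the training part of every fold, so no information from the held-out fold leaks into the scaling.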
