I am trying to predict Boston housing prices. With polynomial regression of degree 1 or 2 the R2 score is OK, but degree 3 makes the R2 score much worse.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2) # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
poly_reg.fit(X_poly, y_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)
y_pred = lin_reg_2.predict(poly_reg.transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree=2):
Prediction Score is: 0.6903318065831567
Output (degree=3):
Prediction Score is: -12898.308114085281
This is called overfitting. What you are doing is fitting the model almost perfectly to the training set, which leads to high variance: a hypothesis that fits the training set very closely will then fail on the test set. You can check this by computing the R2 score on the training set, e.g. r2_score(y_train, lin_reg_2.predict(X_poly)); it will be very high. You need to balance the trade-off between bias and variance.
You can try other regression models such as lasso and ridge, and tune their alpha values if you are looking for a higher r2_score. Keep in mind that the higher the polynomial degree, the more closely the hypothesis curve hugs the training points, which is exactly what drives the overfitting.
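As a rough sketch (not part of the original answer, and reusing the X_train/X_test split from the question), you can make the train/test gap visible by computing R2 on both sets for each degree:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    model = LinearRegression().fit(X_train_poly, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train_poly))
    test_r2 = r2_score(y_test, model.predict(X_test_poly))
    # a widening gap between train and test R2 as the degree grows indicates overfitting
    print('degree', degree, '| train R2:', round(train_r2, 3), '| test R2:', round(test_r2, 3))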
I saw this post that trained an LGBM model, but I would like to know how to adapt it for Lasso. I know the predictions may not necessarily be between 0 and 1, but I would like to try this model. I have tried the following, but it doesn't work:
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8)

class Lasso(Lasso):
    def predict(self, X, threshold=0.5):
        result = super(Lasso, self).predict_proba(X)
        predictions = [1 if p > threshold else 0 for p in result[:, 0]]
        return predictions

clf = Lasso(alpha=0.05)
clf.fit(X, y)
precision = cross_val_score(Lasso(), X, y, cv=5, scoring='precision')
I get
UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan
The specific model class you chose (Lasso()) is actually used for regression problems, as it minimizes penalized square loss, which is not appropriate in your case. Instead, you can use LogisticRegression() with L1 penalty to optimize a logistic function with a Lasso penalty. To control the regularization strength, the C= parameter is used (see docs).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(
n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8
)
class LassoLR(LogisticRegression):
    def predict(self, X, threshold=0.5):
        result = super(LassoLR, self).predict_proba(X)
        predictions = [1 if p > threshold else 0 for p in result[:, 0]]
        return predictions

clf = LassoLR(penalty='l1', solver='liblinear', C=1.)
clf.fit(X, y)
precision = cross_val_score(LassoLR(), X, y, cv=5, scoring='precision')
print(precision)
# array([0.04166667, 0.08163265, 0.1010101 , 0.125 , 0.05940594])
This is my dataset and Median_Price is my target variable. The RMSE values before and after GridSearchCV parameter tuning are shown in the code below. How can I decrease the RMSE for my dataset?
The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline
dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')
dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)
dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)
dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)
X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]
y = dataset['Median_Price']
# train/test split (the split parameters below are assumed; the original post does not show this step)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# function to get cross validation scores
from sklearn.model_selection import cross_val_score

def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')
    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# get cross val scores
get_cv_scores(regressor)
# Train Ridge model with default alpha=1
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1).fit(X_train, y_train)
# get cross val scores
get_cv_scores(ridge)
# find optimal alpha with grid search
alpha = [9, 10, 11, 12, 13, 14, 15, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
# evaluate on the test set (this y_pred line is assumed; the original post does not show where y_pred is computed)
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Dataset CSV download link
Well, there is already a certain decrease in the RMSE value after using GridSearchCV.
You can try feature selection, feature engineering, scaling your data, transformations, or other algorithms; these might help you decrease the RMSE to some extent.
Also, the RMSE value depends entirely on the scale and context of the data. Your target values appear to be spread far apart, which is why the RMSE is so large in absolute terms. The techniques mentioned above will only reduce it to a limited extent.
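For instance, here is a rough sketch (not part of the original answer) of trying a scaled Ridge pipeline and a non-linear model, each evaluated with cross-validated RMSE, reusing the X and y from the question:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge on standardized features
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=10))
ridge_rmse = np.sqrt(-cross_val_score(ridge_pipe, X, y, cv=5, scoring='neg_mean_squared_error'))

# a non-linear alternative
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf_rmse = np.sqrt(-cross_val_score(rf, X, y, cv=5, scoring='neg_mean_squared_error'))

print('Ridge (scaled) RMSE per fold:', ridge_rmse)
print('Random forest RMSE per fold:', rf_rmse)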
I want to get a confidence interval of the result of a linear regression. I'm working with the boston house price dataset.
I've found this question:
How to calculate the 99% confidence interval for the slope in a linear regression model in python?
However, this doesn't quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# splits the training and test data set in 80% : 20%
# assign random_state to any value. This ensures consistency.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for instance, the 95% or 99% confidence interval from this? Is there some sort of in-built function or piece of code?
If you're looking to compute the confidence interval of the regression parameters, one way is to manually compute it using the results of LinearRegression from scikit-learn and numpy methods.
The code below computes the 95%-confidence interval (alpha=0.05). alpha=0.01 would compute 99%-confidence interval etc.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# inverse of the variance of the parameters
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
Using the Boston housing dataset, the above code produces a dataframe with one row per coefficient (the constant plus LSTAT and RM) and 'lower'/'upper' columns holding the bounds of the confidence interval.
If this is too much manual code, you can always resort to the statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as above.
Convenient wrapper function:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
def get_conf_int(alpha, lr, X=X_train, y=Y_train):
    """
    Returns (1-alpha) 2-sided confidence intervals
    for sklearn.LinearRegression coefficients
    as a pandas DataFrame
    """
    coefs = np.r_[[lr.intercept_], lr.coef_]
    X_aux = X.copy()
    X_aux.insert(0, 'const', 1)
    dof = -np.diff(X_aux.shape)[0]
    mse = np.sum((y - lr.predict(X)) ** 2) / dof
    var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
    t_val = stats.t.isf(alpha/2, dof)
    gap = t_val * np.sqrt(mse * var_params)
    return pd.DataFrame({
        'lower': coefs - gap, 'upper': coefs + gap
    }, index=X_aux.columns)
# for 95% confidence interval; use 0.01 for 99%-CI.
alpha = 0.05
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
get_conf_int(alpha, lin_model, X_train, Y_train)
Stats reference
I am not sure if there is any built-in function for this purpose, but what I do is loop over a number of random train/test splits, compare the accuracy of all the models, save the model with the highest accuracy with pickle, and reuse it later.
Here goes the code:
import pickle
import sklearn.model_selection
from sklearn import linear_model

best = 0  # best test score seen so far
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
print("The best Accuracy: ", best)
You can always make changes to the given variables, as I know the variables you have provided vary greatly from mine. And if you want to predict class probabilities, you can use predict_proba. Refer to this link for the difference between predict and predict_proba: https://www.kaggle.com/questions-and-answers/82657
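Note that predict_proba only exists on classifiers; LinearRegression (and Lasso) have no such method. A minimal illustration of the difference, using LogisticRegression on toy data (the X_demo/y_demo names are just for this example):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)
print(clf.predict(X_demo[:3]))        # hard class labels
print(clf.predict_proba(X_demo[:3]))  # per-class probabilities, shape (3, 2)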
I have been playing around with lasso regression on polynomial functions using the code below. My question is whether I should be doing feature scaling as part of the lasso regression when attempting to fit a polynomial function. The R^2 results and the plot produced by the code I have pasted below suggest not. I'd appreciate any advice on why this is the case, or whether I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.fit_transform(X_test.reshape(-1, 1))
    # Lasso regression model
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)
    # No feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
    # With feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)

answer_regression()
Your X range is around [0, 10], so the polynomial features have a much wider range. Without scaling, those features take large values, so the fitted weights are already small and Lasso does not need to shrink them to zero. After MinMax scaling, fitting the same trend would require much larger weights, which the L1 penalty punishes, so Lasso sets most of them to zero. That is why the prediction is poor in the scaled case: those features are needed to capture the true trend of y.
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.
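A rough sketch (not part of the original answer) of letting cross-validation choose alpha, and of checking how many coefficients survive in each case, reusing X_train_poly, X_train_scaled and y_train from inside answer_regression():
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# let cross-validation pick alpha on the scaled polynomial features
lasso_cv = LassoCV(cv=3, max_iter=100000).fit(X_train_scaled, y_train)
print('chosen alpha:', lasso_cv.alpha_)

# count how many coefficients Lasso keeps at alpha=0.01, with and without scaling
for name, X_tr in [('no scaling', X_train_poly), ('scaled', X_train_scaled)]:
    model = Lasso(alpha=0.01, max_iter=100000).fit(X_tr, y_train)
    print(name, '-> non-zero coefficients:', np.sum(model.coef_ != 0))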
This is the code I built to apply a multiple linear regression. I added a StandardScaler to fix the Y-intercept p-value, which was not significant, but now the cross-validation RMSE results at the end no longer make sense, and I get an error in the code that plots the correlation matrix: AttributeError: 'numpy.ndarray' object has no attribute 'corr'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.stats.stats import pearsonr
# Import Excel File
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\Multiple_Linear_Regression\\SP Level Reasons Excels\\SP000273701_PL14_IPC_03_09_2018_Reasons.xlsx",'Sheet1') #Import Excel file
# Replace null values of the whole dataset with 0
data1 = data.fillna(0)
print(data1)
# Extraction of the independent and dependent variables
X = data1.iloc[0:len(data1),[1,2,3,4,5,6,7]] #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),9] #Extract the column of the PAUS SP
# Data Splitting to train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.25,random_state=1)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Statistical Analysis of the training set with Statsmodels
X = sm.add_constant(X_train) # add a constant to the model
est = sm.OLS(Y_train, X).fit()
print(est.summary()) # print the results
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm = LinearRegression() # create an lm object of LinearRegression Class
lm.fit(X_train,Y_train) # train our LinearRegression model using the training set of data - dependent and independent variables as parameters. Teaching lm that Y_train values are all corresponding to X_train.
print(lm.intercept_)
print(lm.coef_)
mse_test = mean_squared_error(Y_test, lm.predict(X_test))
print(math.sqrt(mse_test))
# Data Splitting to train and test set of the reduced data
X_1 = data1.iloc[0:len(data1),[1,2]] #Extract the column of the COPCOR SP we are going to check its impact
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_1, Y, test_size =0.25,random_state=1)
X_train2 = ss.fit_transform(X_train2)
X_test2 = ss.transform(X_test2)
# Statistical Analysis of the reduced model with Statsmodels
X_reduced = sm.add_constant(X_train2) # add a constant to the model
est_reduced = sm.OLS(Y_train2, X_reduced).fit()
print(est_reduced.summary()) # print the results
# Fitting a Linear Model for the reduced model with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm1 = LinearRegression() #create an lm object of LinearRegression Class
lm1.fit(X_train2, Y_train2)
print(lm1.intercept_)
print(lm1.coef_)
mse_test1 = mean_squared_error(Y_test2, lm1.predict(X_test2))
print(math.sqrt(mse_test1))
#Cross Validation and Training again the model
from sklearn.model_selection import KFold
from sklearn import model_selection
kf = KFold(n_splits=6, random_state=1)
for train_index, test_index in kf.split(X_train2):
    print("Train:", train_index, "Validation:", test_index)
    X_train1, X_test1 = X[train_index], X[test_index]
    Y_train1, Y_test1 = Y[train_index], Y[test_index]
results = -1 * model_selection.cross_val_score(lm1, X_train1, Y_train1,scoring='neg_mean_squared_error', cv=kf)
print(np.sqrt(results))
#RMSE values interpretation
print(math.sqrt(mse_test1))
print(math.sqrt(results.mean()))
# Good model: no overfitting or underfitting (test and training RMSE are nearly the same, which is the goal of cross-validation), but prediction accuracy is low, i.e. the RMSE value is large
import seaborn
Corr=X_train2.corr(method='pearson')
mask=np.zeros_like(Corr)
mask[np.triu_indices_from(mask)]=True
seaborn.heatmap(Corr,cmap='RdYlGn_r',vmax=1.0,vmin=-1.0,mask=mask, linewidths=2.5)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
Do you have an idea how to fix the issue?
I'm guessing the problem lies with:
Corr=X_train2.corr(method='pearson')
.corr is a pandas dataframe method but X_train2 is a numpy array at that stage. If a dataframe/series is passed into StandardScaler, a numpy array is returned. Try replacing the above with:
Corr=pd.DataFrame(X_train2).corr(method='pearson')
or use numpy.corrcoef directly on the NumPy array (with rowvar=False so that the columns are treated as variables).
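For reference, a minimal sketch of the NumPy route (not part of the original answer):
import numpy as np
import pandas as pd

# columns of X_train2 are the variables, so rowvar=False; wrapping in a DataFrame keeps the seaborn heatmap call working
Corr = pd.DataFrame(np.corrcoef(X_train2, rowvar=False))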