Why is my RMSE value so large in linear regression modeling? - python

This is my dataset; Median_Price is my target variable.
The RMSE values before and after GridSearchCV parameter tuning are included in the code below. How can I decrease the RMSE for my dataset?
The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline
dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')
dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)
dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)
dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)
X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]
y = dataset['Median_Price']
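from sklearn.model_selection import cross_val_score
# The train/test split is not shown in the original post;
# test_size and random_state below are assumed values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')
    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')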
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# get cross val scores
get_cv_scores(regressor)
from sklearn.linear_model import Ridge
# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)
# get cross val scores
get_cv_scores(ridge)
# find optimal alpha with grid search
alpha = [9,10,11,12,13,14,15,100,1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
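# predictions on the held-out test set (this step is missing from the original post)
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))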
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The RMSE does decrease somewhat after tuning with GridSearchCV.
To reduce it further, you can try feature selection, feature engineering, scaling your data, transformations, or other algorithms; these may lower your RMSE to some extent.
Also, RMSE is expressed in the units of the target, so whether a value is large depends entirely on the context of the data. Your data points seem to be spread far apart, which produces a very high RMSE. The techniques mentioned above can reduce RMSE only to a limited extent.
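As a rough illustration of the scaling and transformation suggestions, here is a minimal sketch that compares cross-validated RMSE with and without a log transform of the target, and reports RMSE relative to the mean target so it can be read in context. The data is synthetic and the helper name cv_rmse is hypothetical; none of this comes from the original dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 5 features, right-skewed price-like target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.exp(13 + 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=300))

def cv_rmse(model, X, y):
    """Square root of the mean 5-fold cross-validated MSE, in the units of y."""
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    return float(np.sqrt(-scores.mean()))

# Scaled features, raw target
plain = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# Same model, but fitted on log1p(y) and mapped back with expm1 at predict time
logged = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    func=np.log1p, inverse_func=np.expm1)

rmse_plain = cv_rmse(plain, X, y)
rmse_logged = cv_rmse(logged, X, y)
# Relative RMSE puts the error in the context of the target's scale
print('RMSE (raw target):', rmse_plain, 'relative:', rmse_plain / y.mean())
print('RMSE (log target):', rmse_logged, 'relative:', rmse_logged / y.mean())
```

Whether a target transform helps depends on the data; on a real dataset it is worth checking both variants with the same cross-validation folds.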

Related

How to output feature names with XGBOOST feature selection

My model uses feature importance for feature selection with XGBOOST. But, at the end, it outputs all the confusion matrices/results and how many features the model includes. That now works successfully, but I also need to have the feature names that were used in each model outputted as well.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to be added to have them be in the model before I can output them, but I'm not sure how to handle either of those steps. I found several old questions about this, but I wasn't able to successfully implement any of them to my particular code. I'd really appreciate any ideas you have. Thank you!
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
from sklearn.metrics import confusion_matrix
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test,y_pred)
    print("Thresh= {} , n= {}\n {}" .format(thresh,select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
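One possible way to recover the feature names at each threshold is to read the boolean mask from SelectFromModel.get_support() and index the DataFrame columns with it. The sketch below is only an illustration under assumptions: it uses synthetic data with hypothetical column names, and RandomForestClassifier stands in for XGBClassifier (any fitted estimator exposing feature_importances_ should behave the same way here).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with named columns (column names are hypothetical)
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

# Stand-in for XGBClassifier: a fitted estimator with feature_importances_
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

selected_by_threshold = []
for thresh in sorted(model.feature_importances_):
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    mask = selection.get_support()      # boolean mask over the input columns
    names = list(X.columns[mask])       # names of the retained features
    selected_by_threshold.append((thresh, names))
    print(f'Thresh={thresh:.4f}, n={len(names)}, features={names}')
```

Selecting columns with the mask directly (for example X.loc[:, mask]) instead of calling transform on the DataFrame keeps the names attached to the data and avoids going through the selector's feature-name check.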

Get confidence interval from sklearn linear regression in python

I want to get a confidence interval of the result of a linear regression. I'm working with the boston house price dataset.
I've found this question:
How to calculate the 99% confidence interval for the slope in a linear regression model in python?
However, this doesn't quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# splits the training and test data set in 80% : 20%
# assign random_state to any value.This ensures consistency.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for instance, the 95% or 99% confidence interval from this? Is there some sort of in-built function or piece of code?
If you're looking to compute the confidence interval of the regression parameters, one way is to manually compute it using the results of LinearRegression from scikit-learn and numpy methods.
The code below computes the 95%-confidence interval (alpha=0.05). alpha=0.01 would compute 99%-confidence interval etc.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1, which scales the variance of each parameter estimate
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
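For reference, the computation above corresponds to the textbook two-sided interval for an ordinary least squares coefficient. The notation here is introduced for illustration and is not part of the original answer: n is the number of training rows, p the number of columns in X_aux (including the constant), and X the matrix X_aux.

```latex
\hat{\beta}_j \;\pm\; t_{1-\alpha/2,\; n-p}\,
\sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}},
\qquad
\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```

In the code, dof is n - p, mse is the estimate of sigma squared, var_params holds the diagonal entries of the inverse matrix, and t_val is the upper alpha/2 quantile returned by stats.t.isf.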
Using the Boston housing dataset, the above code produces the dataframe below:
If this is too much manual code, you can always resort to statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as above.
Convenient wrapper function:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
def get_conf_int(alpha, lr, X=X_train, y=Y_train):
    """
    Returns (1-alpha) 2-sided confidence intervals
    for sklearn.LinearRegression coefficients
    as a pandas DataFrame
    """
    coefs = np.r_[[lr.intercept_], lr.coef_]
    X_aux = X.copy()
    X_aux.insert(0, 'const', 1)
    dof = -np.diff(X_aux.shape)[0]
    mse = np.sum((y - lr.predict(X)) ** 2) / dof
    var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
    t_val = stats.t.isf(alpha/2, dof)
    gap = t_val * np.sqrt(mse * var_params)
    return pd.DataFrame({
        'lower': coefs - gap, 'upper': coefs + gap
    }, index=X_aux.columns)
# for 95% confidence interval; use 0.01 for 99%-CI.
alpha = 0.05
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
get_conf_int(alpha, lin_model, X_train, Y_train)
Stats reference
I am not sure whether there is a built-in function for this purpose, but what I do is run a loop n times, compare the accuracy of all the models, save the model with the highest accuracy using pickle, and reuse it later.
Here goes the code:
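import pickle
import sklearn.model_selection
from sklearn import linear_model
# best score seen so far (the initial value is an assumption; R^2 can be negative)
best = float("-inf")
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    # for a regressor, score() returns R^2 on the held-out split
    acc = linear.score(x_test, y_test)
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
print("The best Accuracy: ", best)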
You can adapt the variable names, since the ones you provided differ considerably from mine. If you want to predict class probabilities, you can use predict_proba. Refer to this link for the difference between predict and predict_proba: https://www.kaggle.com/questions-and-answers/82657

Polynomial Regression Degree Increasing Error

I am trying to predict Boston Housing prices. When I choose polynomial regression degree 1 or 2, R2 score is OK. But 3rd degree decreases R2 score.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2) # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
poly_reg.fit(X_poly, y_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)
y_pred = lin_reg_2.predict(poly_reg.fit_transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree=2):
Prediction Score is: 0.6903318065831567
Output (degree=3):
Prediction Score is: -12898.308114085281
This is called overfitting. You are fitting the model almost perfectly to the training set, which leads to high variance: when the hypothesis fits the training set too closely, it fails on the test set. You can check the R2 score on your training set using r2_score(y_train, lin_reg_2.predict(X_poly)). It will be high. You need to balance the trade-off between bias and variance.
You can try other regression models such as lasso and ridge and tune their alpha value if you are looking for a higher r2_score. For better understanding, I am putting up an image that shows how the hypothesis line is affected as the degree of the polynomial increases.
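A minimal sketch of the check described above: fit polynomial degrees 1 to 3 and compare training and test R2 for plain least squares and for ridge. The data is synthetic (13 features, about 500 rows, chosen to resemble the shape of the housing data), and the helper fit_and_score and the value alpha=10.0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): 13 features, 506 rows
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.5, size=506)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_and_score(degree, regressor):
    """Fit scaler -> polynomial features -> regressor; return (train R2, test R2)."""
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=degree), regressor)
    model.fit(X_train, y_train)
    return (r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)))

results = {}
for degree in (1, 2, 3):
    results[('ols', degree)] = fit_and_score(degree, LinearRegression())
    results[('ridge', degree)] = fit_and_score(degree, Ridge(alpha=10.0))
    print('degree', degree,
          'ols (train, test):', results[('ols', degree)],
          'ridge (train, test):', results[('ridge', degree)])
```

A training R2 that sits far above the test R2 at a given degree is the overfitting signature described above. Putting PolynomialFeatures inside a pipeline also ensures the transform is fitted on the training data only.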

ValueError in kNN metrics

I have a project that applies the kNN algorithm to a CSV file and shows selected metrics. But when I try to display some of the metrics, it throws a few errors.
When trying to use: sensitivity, f1_Score and Precision:
sensitivity - print(metrics.recall_score(y_test, y_pred_class))
F1_score - print(metrics.f1_score(y_test, y_pred_class))
Precision - print(metrics.precision_score(y_test, y_pred_class))
Pycharm throws the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting
The error when trying to print the ROC curve is a little different:
ValueError: multiclass format is not supported
DATASET
LINK TO DATASET: https://www.dropbox.com/s/yt3n1eqxlsb816n/Testfile%20-%20kNN.csv?dl=0
Program
import matplotlib
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib.dviread import Text
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Tools for testing
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
def main():
    dataset = pd.read_csv('filetestKNN.csv')
    X = dataset.drop(columns=['Label'])
    y = dataset['Label'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.34)
    Classifier = KNeighborsClassifier(n_neighbors=2, p=2, metric='euclidean')
    Classifier.fit(X_train, y_train)
    y_pred_class = Classifier.predict(X_test)
    y_pred_prob = Classifier.predict_proba(X_test)[:, 1]
    accuracy = Classifier.score(X_test, y_test)
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    print()
    print("Accuracy")
    print(metrics.accuracy_score(y_test, y_pred_class))
    print()
    print("Classification Error")
    print(1 - metrics.accuracy_score(y_test, y_pred_class))
    print()
    print("Confusion matrix")
    print(metrics.confusion_matrix(y_test, y_pred_class))
    #error
    print(metrics.recall_score(y_test, y_pred_class))
    #error
    print(metrics.roc_curve(y_test, y_pred_class))
    #error
    print(metrics.f1_score(y_test, y_pred_class))
    #error
    print(metrics.precision_score(y_test, y_pred_class))
I just wanted to show the algorithm metrics on the screen.
You need to set the average keyword argument of these sklearn.metrics functions. For example, look at the documentation of f1_score. Here is the part corresponding to the average keyword argument:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
    This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
    'binary':
        Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
    'micro':
        Calculate metrics globally by counting the total true positives, false negatives and false positives.
    'macro':
        Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    'weighted':
        Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
    'samples':
        Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
Here we can see that this describes how results are aggregated over the different labels on your multiclass task. I'm not sure which one you'd like to use, but micro seems nice. Here's how your call to f1_score would look with this choice:
print(metrics.f1_score(y_test, y_pred_class, average='micro'))
You can adjust the other metrics similarly. Hope this helps.
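As a sketch of how the other metrics might be adjusted, the example below sets average on recall, precision and F1, and uses roc_auc_score with multi_class='ovr' as one possible multiclass alternative to roc_curve. The Iris dataset stands in for the CSV from the question, and n_neighbors=5 is an arbitrary choice; both are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris (3 classes) stands in for the CSV from the question
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.34)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred_class = clf.predict(X_test)
# Keep every class column of the probabilities, not just [:, 1]
y_pred_prob = clf.predict_proba(X_test)

acc = accuracy_score(y_test, y_pred_class)
f1_micro = f1_score(y_test, y_pred_class, average='micro')
recall_macro = recall_score(y_test, y_pred_class, average='macro')
# average=None returns one score per class instead of a single number
precision_per_class = precision_score(y_test, y_pred_class, average=None)
# One-vs-rest ROC AUC accepts multiclass targets when given per-class probabilities
auc_ovr = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')

print('accuracy:', acc)
print('f1 (micro):', f1_micro)
print('recall (macro):', recall_macro)
print('precision per class:', precision_per_class)
print('ROC AUC (one-vs-rest):', auc_ovr)
```

For single-label multiclass problems, micro-averaged F1 equals accuracy, so macro or weighted averaging may be more informative when classes are imbalanced.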

How to get accuracy for all the predicted class labels

How can I find the overall accuracy of the outputs produced by a decision tree algorithm? I am able to get the top five class labels for the active user input, but I am only getting the accuracy for the X_train and Y_train dataset using accuracy_score(). Suppose I am getting the top five recommendations. I want to get the accuracy for each class label and, with the help of these, the overall accuracy of the output. Please suggest some ideas.
My python script is here:
Here, event holds the different class labels.
DTC= DecisionTreeClassifier()
DTC.fit(X_train_one_hot,y_train)
print("output from DTC:")
res=DTC.predict_proba(X_test_one_hot)
new=list(chain.from_iterable(res))
#Here I got the index value of top five probabilities
index=sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
    print(event[i])
Here is the sample code I tried in order to get the accuracy for the predicted class labels:
Here, index holds the indices of the top five class probabilities and event holds the different class labels.
for i in index:
    DTC.fit(X_train_one_hot,y_train)
    y_pred=event[i]
    AC=accuracy_score((event,y_pred)*100)
    print(AC)
Since you have a multi-class classification problem, you can calculate the accuracy of the classifier by using the confusion_matrix function from scikit-learn.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now, to calculate overall accuracy, use the confusion matrix:
# confusion_matrix expects (y_true, y_pred); the diagonal is the same either way
conf_mat = confusion_matrix(y_test, y_pred)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
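Since the question also asks for accuracy per class label, here is a minimal sketch that extends the same Iris example: dividing each diagonal entry by its row sum gives the fraction of each true class that was predicted correctly (per-class recall). The variable names per_class_acc and overall_acc are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    random_state=0)
y_pred = svm.SVC(kernel='linear', C=0.01).fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes
conf_mat = confusion_matrix(y_test, y_pred)
# Fraction of each true class predicted correctly (per-class recall)
per_class_acc = conf_mat.diagonal() / conf_mat.sum(axis=1)
# Correct predictions over all samples
overall_acc = conf_mat.diagonal().sum() / conf_mat.sum()

for name, value in zip(iris.target_names, per_class_acc):
    print('{}: {:.1f} %'.format(name, value * 100))
print('Overall accuracy: {:.1f} %'.format(overall_acc * 100))
```

The overall figure computed this way should agree with accuracy_score(y_test, y_pred), which is a quick sanity check on the confusion matrix orientation.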
