Multivariate Linear Regression, coefficients don't match - python

I'm facing a problem with different linear models from scikit-learn.
Here is my code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_train).reshape(-1)
print(f"R2 on train set:{reg.score(X_train, y_train)}")
print(f"R2 on test set:{reg.score(X_test, y_test)}")
print(f"MSE on train set:{mean_squared_error(y_train, y_pred)}")
print(f"MSE on test set:{mean_squared_error(y_test, reg.predict(X_test))}")
output:
>R2 on train set:0.5810258473777401
>R2 on test set:0.5908396388537969
>MSE on train set:0.023576848498732563
>MSE on test set:0.02378699441936436
Model is fitted, now I want to get the slope coefficient and the intercept from my model:
A, B = reg.coef_[0], reg.intercept_[0]
A, B
output:
>(array([ 0.14373081, -1.8211677 , 1.81493948, 1.39041689, -0.14027746]),
> 0.060286931992710735)
Since I used 5 features to fit the model I also have 5 slope coefficients, ok.
But when I try to visualize y_true, y_pred and the regression line (ax + b), the line for the second feature (total rooms) looks wrong. Since its slope coefficient is about -1.82 the downward line is logical, but if the model's predictions look fine, how can this regression line look so bad? It makes no sense, right?
I suspect that reg.coef_ is not returned in the same order as the features the model was fitted with. But as far as I have seen, it should be the same order, so I don't know.
Here is the part of the code that plots the regression line, just in case:
sns.lineplot(x=X[:, i], y=(a[i]*X[:, i])+b, label="regression", color=c3, alpha=1, ci=None, ax=axes[i])
Any ideas?
I keep in mind that there may be no problem at all, but visually it hurts a bit.

y_pred is a quantifier for listreg. We introduce N as an other variable which cannot be quantified or consecutive.
N=ax/k-b of a scatter plot. This helps to find the total shape or size of the bedroom; b, b=l.
5 is right. 5 is independent of the regression. I mean of an independent variable.
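One possible reading of the plot in the question, offered as a sketch rather than a diagnosis of the original data: reg.coef_ does follow the column order of X, but in a model with several features each coefficient is a partial slope, i.e. the effect of that feature with the others held fixed. The plotted line a[i]*X[:, i] + b drops the contribution of the other four features, so it need not pass through the point cloud. A common alternative is to hold the other features at their mean, which keeps the slope and only shifts the intercept. The data below is synthetic, and partial_line is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data: two strongly correlated features
# (think "total rooms" and "bedrooms") whose effects have opposite signs.
n = 500
x0 = rng.normal(5.0, 1.0, n)
x1 = x0 + rng.normal(0.0, 0.2, n)  # highly correlated with x0
X = np.column_stack([x0, x1])
y = -1.8 * x0 + 2.0 * x1 + 0.5 + rng.normal(0.0, 0.1, n)

reg = LinearRegression().fit(X, y)
a, b = reg.coef_, reg.intercept_  # a[i] belongs to column i of X


def partial_line(i, x_values):
    """Line for feature i with every other feature held at its mean (hypothetical helper)."""
    others = [j for j in range(X.shape[1]) if j != i]
    offset = b + sum(a[j] * X[:, j].mean() for j in others)
    return a[i] * x_values + offset


i = 0
naive = a[i] * X[:, i] + b           # the line plotted in the question
adjusted = partial_line(i, X[:, i])  # same slope, intercept shifted by the other features

# Average vertical gap between each line and the data
naive_gap = abs(np.mean(y - naive))
adjusted_gap = abs(np.mean(y - adjusted))

# Slope of y against feature i alone, ignoring the other feature
marginal_slope = np.polyfit(X[:, i], y, 1)[0]

print("partial slope:", a[i], "marginal slope:", marginal_slope)
print("naive gap:", naive_gap, "adjusted gap:", adjusted_gap)
```

In this toy setup the partial slope of the first feature is negative while the raw scatter of y against that feature trends upward, because the second feature rises along with it. A negative coefficient next to an upward-looking scatter is therefore not by itself a sign that the coefficients are in the wrong order.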

Related

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used sklearn.linear_model.Ridge as my baseline and, after some basic data cleaning, got an abysmal R^2 score of 0.12 on my test set. Then I thought the linear model might be too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it would take too much time to fit (>1hr)! To counter that, I used sklearn.kernel_approximation.Nystroem to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit when I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
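For readers who want to see the approach described in the question, here is a minimal sketch of a Nystroem kernel approximation followed by a ridge model, compared against plain ridge. The data is synthetic (a sine curve), and the values chosen for gamma, n_components and alpha are assumptions for illustration, not tuned settings for the Airbnb data.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic non-linear target: a plain linear model cannot follow it
X = rng.uniform(-3.0, 3.0, size=(2000, 1))
y = np.sin(2.0 * X[:, 0]) + rng.normal(0.0, 0.1, 2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: ridge regression on the raw feature
linear = Ridge(alpha=1.0)

# Approximate RBF kernel map with 100 landmark points, then a linear model on top
approx_kernel = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    Ridge(alpha=1.0),
)

r2_linear = linear.fit(X_train, y_train).score(X_test, y_test)
r2_nystroem = approx_kernel.fit(X_train, y_train).score(X_test, y_test)
print("R2 linear:", r2_linear, "R2 Nystroem + ridge:", r2_nystroem)
```

The cost of the transform grows with n_components, which is the trade-off the question runs into: more components give a closer approximation of the kernel at the price of a longer fit.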
This is a great question, but as often happens, there is no simple answer to complex problems. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes a couple of university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regression:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs, etc. Many of these features are also moving targets, as they tend to change over time. Even an R2 of 0.12 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that the dependent variable (price) is linearly related to the independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot but can still be addressed by other models, such as non-parametric regressions and neural networks.
So why do people still use (linear) regression?
It is the simplest and fastest model. This has many implications for real-time systems and statistical analysis, so it does matter.
It is often used as a baseline model. Before trying a fancy neural network architecture, it is helpful to know how much we improve compared to a naive method.
Sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables.
To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.
UPD, to illustrate the point about non-linearity.
After building a regression you calculate the residuals, i.e. the regression error predicted_value - true_value. Then, for each feature, you make a scatter plot where the horizontal axis is the feature value and the vertical axis is the error value. Ideally, the residuals have a normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and look similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as the mean of the residuals depending on the feature value. Other problems include, but are not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is a great read on regression diagnostics for beginners.
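Since the pictures did not survive here, the residual check can also be sketched numerically. The example below is an illustration on synthetic data with a hidden quadratic term: it bins the residuals by feature value and looks at the mean residual per bin, which is a rough numeric stand-in for the scatter plots described above. The bin edges and data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with a squared term that a purely linear model misses
x = rng.uniform(-3.0, 3.0, 1000)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0.0, 0.3, 1000)

# Assign every sample to one of 6 equal-width bins over the feature range
edges = np.linspace(-3.0, 3.0, 7)
bin_idx = np.digitize(x, edges[1:-1])  # values 0..5


def binned_residual_means(features):
    """Fit a linear model and return the mean residual in each feature bin."""
    model = LinearRegression().fit(features, y)
    residuals = model.predict(features) - y  # predicted_value - true_value
    return np.array([residuals[bin_idx == k].mean() for k in range(6)])


# Linear feature only: the mean residual depends strongly on the feature value
means_linear = binned_residual_means(x.reshape(-1, 1))

# After adding the squared feature the bin means should be close to zero
means_squared = binned_residual_means(np.column_stack([x, x**2]))

spread_linear = means_linear.max() - means_linear.min()
spread_squared = means_squared.max() - means_squared.min()
print("spread of bin means, linear only:", spread_linear)
print("spread of bin means, with x^2:", spread_squared)
```

A large spread between bins corresponds to the curved residual plots, and a spread near zero corresponds to the "how it should look" case.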
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
# load_boston is not imported here: it is unused below and was removed in scikit-learn 1.2
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# keep only the neighbourhood field and one-hot encode it (non-numeric to numeric)
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
# price is a continuous target, so a regressor is used here rather than a classifier
from sklearn.ensemble import RandomForestRegressor
features = df_new.columns.values
clf = RandomForestRegressor()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to work with numeric data, so let's try the lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
# note: sm.OLS does not add an intercept on its own; use sm.add_constant(X) if one is wanted
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat, is performing slightly better, but this too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you to predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did make a mistake, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice that all variables are numeric, and the results are pretty decent (around 70%, give or take, but still much better than what we're seeing with the Airbnb data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

Seaborn Regplot and Scikit-Learn Logistic Models Calculated Differently?

I'm using both the Scikit-Learn and Seaborn logistic regression functions: the former for extracting model info (i.e. log-odds, parameters, etc.) and the latter for plotting the resulting sigmoidal curve fit to the probability estimations.
Maybe my intuition is incorrect for how to interpret this plot, but I don't seem to be getting results as I'd expect:
#Build and visualize a simple logistic regression
ap_X = ap[['TOEFL Score']].values
ap_y = ap['Chance of Admit'].values
ap_lr = LogisticRegression()
ap_lr.fit(ap_X, ap_y)
def ap_log_regplot(ap_X, ap_y):
    plt.figure(figsize=(15,10))
    # recent seaborn versions require keyword x=/y= arguments and 1-D inputs
    sns.regplot(x=ap_X.ravel(), y=ap_y, logistic=True, color='green')
    return None
ap_log_regplot(ap_X, ap_y)
plt.xlabel('TOEFL Score')
plt.ylabel('Probability')
plt.title('Logistic Regression: Probability of High Chance by TOEFL Score')
plt.show()
Seems alright, but then I attempt to use the predict_proba function in Scikit-Learn to find the probabilities of Chance to Admit given some arbitrary value for TOEFL Score (in this case 108, 104, and 112):
# predict_proba expects a 2-D array of shape (n_samples, n_features)
eight = ap_lr.predict_proba([[108]])[:, 1]
four = ap_lr.predict_proba([[104]])[:, 1]
twelve = ap_lr.predict_proba([[112]])[:, 1]
print(eight, four, twelve)
Where I get:
[0.49939019] [0.44665597] [0.55213799]
To me, this seems to indicate that a TOEFL Score of 112 gives an individual a 55% chance of being admitted based on this data set. If I were to extend a vertical line from 112 on the x-axis to the sigmoid curve, I'd expect the intersection at around .90.
Am I interpreting/modeling this correctly? I realize that I'm using two different packages to calculate the model coefficients but with another model using a different data set, I seem to get correct predictions that fit the logistic curve.
Any ideas or am I completely modeling/interpreting this inaccurately?
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('log: ', metrics.accuracy_score(y_test, y_pred))
You can easily find the model accuracy like this and decide which model to use for your data.
After some searching, Cross-Validated provided the correct answer to my question. Although it already exists on Cross-Validated, I wanted to provide this answer on Stack Overflow as well.
Simply put, Scikit-Learn automatically adds a regularization penalty to the logistic model that shrinks the coefficients. Statsmodels does not add this penalty. There is apparently no way to turn this off so one has to set the C= parameter within the LogisticRegression instantiation to some arbitrarily high value like C=1e9.
After trying this and comparing the Scikit-Learn predict_proba() to the sigmoidal graph produced by regplot (which uses statsmodels for its calculation), the probability estimates align.
Link to full post: https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels
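The effect described in this answer can be reproduced on a small scale. The sketch below is an illustration on synthetic data, not the original admissions data set: it fits LogisticRegression once with the default C=1.0 and once with C=1e9, as the answer suggests, and compares the slope and the predicted probability at a score of 112. Rescaling the score is an assumption made here so that the optimizer behaves well and the shrinkage is easy to see. Newer scikit-learn releases also document a penalty=None option; the large-C route from the answer is used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: scores in [90, 120], admission more likely above ~105
score = rng.uniform(90.0, 120.0, 300)
p_true = 1.0 / (1.0 + np.exp(-0.5 * (score - 105.0)))
admit = (rng.uniform(size=300) < p_true).astype(int)

# Rescale the score to roughly [-1, 1] (an assumption of this sketch)
X = ((score - 105.0) / 15.0).reshape(-1, 1)

# Default settings apply an L2 penalty with C=1.0
default_lr = LogisticRegression(max_iter=1000).fit(X, admit)

# A very large C makes the penalty negligible, as suggested in the answer
weak_penalty_lr = LogisticRegression(C=1e9, max_iter=1000).fit(X, admit)

# Predicted probability of admission for a score of 112 under both fits
x_112 = np.array([[(112.0 - 105.0) / 15.0]])
p_default = default_lr.predict_proba(x_112)[0, 1]
p_weak = weak_penalty_lr.predict_proba(x_112)[0, 1]

print("slope, default C:", default_lr.coef_[0, 0])
print("slope, C=1e9:   ", weak_penalty_lr.coef_[0, 0])
print("P(admit | 112):", p_default, "vs", p_weak)
```

The penalized slope is smaller in magnitude, which flattens the sigmoid and pulls predicted probabilities toward 0.5, consistent with the gap the question saw between predict_proba and the regplot curve.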

Unable to obtain accuracy score for my linear

I am working on a regression model based on the IMDB data, to predict IMDB value. With my linear regression, I was unable to obtain the accuracy score.
My line of code:
metrics.accuracy_score(test_y, linear_predicted_rating)
Error :
ValueError: continuous is not supported
If I change that line to obtain the R2 score,
metrics.r2_score(test_y,linear_predicted_rating)
I am able to obtain R2 without any error.
Any clue why I am seeing this?
Thanks.
Edit:
One thing I found out is that test_y is a pandas DataFrame whereas linear_predicted_rating is a NumPy array.
metrics.accuracy_score measures classification accuracy. It can't be used for a regression model, because accuracy doesn't make sense for regression: predictions rarely equal the expected values exactly. Even if the predictions differ from the expected values by only 1%, the accuracy will be zero, although these predictions are great.
Here are some metrics for regression: http://scikit-learn.org/stable/modules/classes.html#regression-metrics
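To make the difference concrete, here is a small sketch with made-up ratings (the numbers are illustrative, not from the IMDB data). It shows accuracy_score rejecting continuous targets with a ValueError, while the regression metrics work on the same arrays.

```python
import numpy as np
from sklearn import metrics

# Made-up continuous ratings and predictions
test_y = np.array([7.1, 6.4, 8.0, 5.5])
linear_predicted_rating = np.array([7.0, 6.6, 7.7, 5.9])

# accuracy_score is a classification metric and refuses continuous targets
try:
    metrics.accuracy_score(test_y, linear_predicted_rating)
    accuracy_error = None
except ValueError as exc:
    accuracy_error = str(exc)
print("accuracy_score error:", accuracy_error)

# Regression metrics accept the same inputs
r2 = metrics.r2_score(test_y, linear_predicted_rating)
mae = metrics.mean_absolute_error(test_y, linear_predicted_rating)
rmse = np.sqrt(metrics.mean_squared_error(test_y, linear_predicted_rating))
print("R2:", r2, "MAE:", mae, "RMSE:", rmse)
```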
NOTE: Accuracy (e.g. classification accuracy) is a measure for classification, not regression, so we can't calculate accuracy for a regression model. For regression, one of the metrics we can use to get a score (ambiguously termed accuracy) is R-squared (R2).
You can get the R2 score of your predictions using the score(X, y, sample_weight=None) method of LinearRegression, as follows, adapting the logic to your case.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
r2_score = regressor.score(x_test,y_test)
print(r2_score*100,'%')
Output (according to my model):
86.23%
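The above is the R squared value and not the accuracy. A closely related metric is the explained variance:
# explained variance score (equals R squared only when the residuals have zero mean)
metrics.explained_variance_score(y_test, predictions)
What do your variables look like? The code below works well.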
from sklearn import metrics
test_y, linear_predicted_rating = [1,2,3,4], [1,2,3,5]
metrics.accuracy_score(test_y, linear_predicted_rating)
You cannot compute the accuracy of a regression model. However, you can analyze your model using mean absolute error, mean squared error, root mean squared error, max error, median absolute error, R-squared, etc.
for reference
you can go this to gain more knowledge

How to avoid float values in regression models

I am trying to predict wine quality (ranges from 1 to 10) using regression models such as linear regression, SGDRegressor, ridge and lasso.
Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Independent variables: volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, alcohol
Dependent variable: quality
Linear model
regr = linear_model.LinearRegression(n_jobs=3)
regr.fit(x_train, y_train)
predicted = regr.predict(x_test)
predicted values for LinearRegression
array([ 5.33560542, 5.47347404, 6.09337194, ..., 5.67566813,
5.43609198, 6.08189 ])
The predicted values are floats instead of integers (1, 2, 3, ..., 10).
I tried to round the predicted values using numpy:
predicted = np.round(regr.predict(x_test))
but my accuracy went down with this attempt.
SGDRegressor model.
import numpy as np
from sklearn import linear_model
np.random.seed(0)
clf = linear_model.SGDRegressor()
clf.fit(x_train, y_train)
predicted = np.floor(clf.predict(x_test))
predicted output values for SGDRegressor:
array([ -2.77685458e+12, 3.26826414e+12, 4.18655713e+11, ...,
4.72375220e+12, -7.08866307e+11, 3.95571514e+12])
Here I am unable to convert the output values into integers.
Could someone please let me know the best way to predict the wine quality using these regression models.
You are doing a regression and therefore the output is continuous in nature.
The thing you should note is that your mini-project on predicting wine quality is not a classification problem. The response variable y, the wine quality, has an intrinsic order, which means a score of 6 is strictly better than a score of 5. It is NOT a categorical variable, where different numbers just represent different groups that cannot be compared.
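If integer scores are still wanted at the end, one option is to keep the regression and round and clip its output afterwards. The sketch below also standardizes the features before SGDRegressor; the huge values in the question (around 1e12) are a pattern that often appears when SGD is run on unscaled features, though that is an assumption about the original data rather than something confirmed here. The data is synthetic and the column meanings in the comments are only illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic features on very different scales (stand-ins for the wine columns)
X = np.column_stack([
    rng.uniform(0.1, 1.0, n),     # e.g. volatile acidity
    rng.uniform(1.0, 60.0, n),    # e.g. residual sugar
    rng.uniform(10.0, 300.0, n),  # e.g. total sulfur dioxide
    rng.uniform(8.0, 14.0, n),    # e.g. alcohol
])

# Synthetic integer quality score between 1 and 10
signal = 6.0 - 2.0 * X[:, 0] + 0.3 * (X[:, 3] - 11.0)
quality = np.clip(np.rint(signal + rng.normal(0.0, 0.5, n)), 1, 10)

# Standardize the features, then fit the SGD regressor
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
model.fit(X, quality)

raw = model.predict(X)  # continuous predictions
predicted = np.clip(np.rint(raw), 1, 10).astype(int)  # integer scores in [1, 10]

print("raw range:", raw.min(), raw.max())
print("integer scores used:", np.unique(predicted))
```

Rounding throws away information, so it is usually better to evaluate the continuous output with regression metrics and round only for presentation.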

How to get the best coefficient vector using cross-validation

I am running ridge regression on a dataset. I have done 5 fold cross validation. So basically my dataset is divided into 5 train and 5 test folds.
This is how I did in scikit:
from sklearn import cross_validation
k_fold=cross_validation.KFold(n=len(tourism_train_X),n_folds=5)
I set the regularisation parameter like this:
#Generating alpha values for regularization parameters
n_alphas = 200
alphas = np.logspace(-10, -1, n_alphas)
Now, my doubt is this. For each train and test fold
I do something like this:
ridge_tourism = linear_model.Ridge()
coefs = []  # one coefficient vector per (alpha, fold) pair
for a in alphas:
    ridge_tourism.set_params(alpha=a)
    index=0
    for train_indices, test_indices in k_fold:
        ridge_tourism.fit(tourism_train_X[train_indices], tourism_train_Y[train_indices]) # Fitting the model
        coefs.append(ridge_tourism.coef_)
The problem is that this gives me a coefficient vector for each of the five training folds within each alpha. All I want is, for each alpha, the best coefficient vector. How do we get that? Out of the 5 training sets, how do we choose which coefficient vector is finally reported for that alpha?
For each alpha value, take the mean of the validation error over the 5 folds. You will then get a curve of mean validation error vs. alpha. Choose the alpha value that gives the lowest mean validation error.
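The procedure above can be sketched as follows. Two things in it are assumptions rather than part of the answer: it uses sklearn.model_selection (the older sklearn.cross_validation module from the question was removed in later releases), and it reports the coefficient vector from a final refit on the whole training set with the chosen alpha, which is a common practice (GridSearchCV does the same by default with refit=True) rather than picking one of the five per-fold vectors. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data with known coefficients
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(0.0, 0.5, 200)

alphas = np.logspace(-10, -1, 20)
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean validation MSE over the 5 folds, one value per alpha
mean_mse = []
for a in alphas:
    scores = cross_val_score(
        Ridge(alpha=a), X, y, cv=k_fold, scoring="neg_mean_squared_error"
    )
    mean_mse.append(-scores.mean())

# Alpha with the lowest mean validation error
best_alpha = alphas[int(np.argmin(mean_mse))]

# Refit on all training data with that alpha; this is the vector to report
final_model = Ridge(alpha=best_alpha).fit(X, y)
best_coef = final_model.coef_

print("best alpha:", best_alpha)
print("coefficients:", best_coef)
```

The per-fold coefficient vectors are only a by-product of estimating the validation error, so they do not need to be stored.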
