Building a linear regression model with cross-validation in Python

I have about 1.3k samples of leaf temperature and I'm trying to predict this temperature using atmospheric variables such as air temperature, solar radiation, wind, and humidity.
I started off simple with a multivariate linear regression model, but I wanted to kick it up a notch in terms of accuracy, so I decided to try leave-one-out cross-validation (LOOCV) to get the best model output. Ultimately I want to collect the coefficients and intercept so that I can reuse the model later.
Now, from what I understand, cross-validation can serve two purposes. The first is to compare the accuracy of your model with that of other models and decide which is best after training and testing them on numerous splits of the data.
The second purpose (and the one I'm trying to use it for) is to improve the accuracy of a single model. In other words, the final model I'm trying to build would be built after considering all the possible training sets. I have a feeling I could be completely wrong about that second purpose.
Anyway, inspired by what I've seen (most notably this and this), I've developed the following code:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn import metrics

# Leave-one-out cross-validation (LOOCV)
# Y_data and X_data are both pandas DataFrames
loo = LeaveOneOut()
loo.get_n_splits(X_data)

ytests = []
ypreds = []
All_coef = list()
All_intercept = list()

for train_index, test_index in loo.split(X_data):
    X_train, X_test = X_data.iloc[train_index], X_data.iloc[test_index]
    Y_train, Y_test = Y_data.iloc[train_index], Y_data.iloc[test_index]

    model = LinearRegression()
    model.fit(X=X_train, y=Y_train)
    Y_pred = model.predict(X_test)

    # keep the coefficients and intercept of each fold's model
    All_coef.append(model.coef_)
    All_intercept.append(model.intercept_)

    # accumulate the held-out observation and its prediction
    ytests += Y_test.values.tolist()[0]
    ypreds += list(Y_pred)

rr = metrics.r2_score(ytests, ypreds)
ms_error = metrics.mean_squared_error(ytests, ypreds)
But this is weird, because the linear regression is fitted inside the cross-validation loop rather than outside it, so I can't really get a final model out of this. Am I supposed to have the LinearRegression() and .fit() outside of the loop as well? If so, how do I validate the final model?
I was also wondering how I'm supposed to get the coefficients and the intercept from my model. If I keep the linear regression within the loop, that means I'll acquire coefficients for every training set. Would it be wise to take some sort of mean of them?
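To make the question concrete, here is the kind of final step I imagine (just a rough sketch, and I'm not sure it's the right approach): use the LOOCV loop above only to estimate the error, then fit a single model on all of the data and keep its coefficients.
final_model = LinearRegression()
final_model.fit(X_data, Y_data)          # fit once on the full dataset
final_coef = final_model.coef_
final_intercept = final_model.intercept_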
Thank you so much for your consideration!

Related

How do I predict the future closing price of stock after training and testing?

I am trying to do multivariate time series forecasting using a linear regression model.
In the code below, I first split the data in an 80:20 ratio for training and testing.
Then I train the model, use it to predict on the test set, and compute the relevant performance metrics.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Split data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(df[['EMA_10']], df[['close']], test_size=.2)

# Create regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Use model to make predictions
y_pred = model.predict(X_test)

# Print out relevant metrics
print("Model Coefficients:", model.coef_)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Coefficient of Determination:", r2_score(y_test, y_pred))
Now how do I predict the next, i.e. future, value?
To predict unseen y, you can simply use .predict(<new x here>).
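For instance, a minimal sketch using the fitted model from the question (the future EMA_10 value of 105.3 is purely hypothetical; in practice you need an estimate of the next period's feature value):
import pandas as pd

next_X = pd.DataFrame({'EMA_10': [105.3]})   # hypothetical next-period feature value
next_close = model.predict(next_X)
print("Predicted next close:", next_close)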
However, why are you using linear regression to tackle a time series problem? It discards the time dimension of the data. When performing time series forecasting, it's generally a good idea to use a model specifically designed for time series data, such as an autoregressive model (e.g., ARIMA) or advanced deep learning (e.g., an RNN). These kinds of models can account for the temporal dependencies present in time series data, which can help improve the accuracy of the forecasts.
There are many good resources for that, such as,
https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
https://towardsdatascience.com/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f
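For instance, a minimal ARIMA sketch with statsmodels (the order (1, 1, 1) and the 'close' column are assumptions; tune them for your data):
from statsmodels.tsa.arima.model import ARIMA

series = df['close']                         # assumes df has a date-like index, as in the question
fit = ARIMA(series, order=(1, 1, 1)).fit()   # illustrative order, not tuned
forecast = fit.forecast(steps=5)             # forecast the next 5 periods
print(forecast)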

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used sklearn.linear_model.Ridge as my baseline, and after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought maybe the linear model is too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took too much time to fit (>1 hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training, and then used a simple linear regression model. However, even that took a lot of time to transform and fit as I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting out on Stack Overflow, so please let me know what I can do to make my question better!
This is a great question, but as often happens, there is no simple answer to a complex problem. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least-squares models. It takes a couple of university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regression:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs, etc. Many of these features are also moving targets, as they tend to change over time. Even an R^2 of 0.12 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that the dependent variable (price) is linearly related to the independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still being addressable by other models, like non-parametric regressions and neural networks.
So, why do people still use (linear) regression?
It is the simplest and fastest model, which matters a lot for real-time systems and statistical analysis.
It is often used as a baseline model: before trying a fancy neural network architecture, it helps to know how much we improve compared to a naive method (see the sketch below).
Sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables.
To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.
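For instance, a quick baseline comparison sketch (synthetic data, purely illustrative):
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, 200)

baseline = DummyRegressor(strategy='mean')   # naive "predict the mean" model
linear = LinearRegression()
print("baseline mean R^2:", cross_val_score(baseline, X, y, cv=5).mean())
print("linear   mean R^2:", cross_val_score(linear, X, y, cv=5).mean())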
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. regression error predicted_value - true_value. Then, for each feature you make a scatter plot, where horizontal axis is feature value and vertical axis is the error value. Ideally, residuals have normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as the residual mean changing with the feature value. Other problems include, but are not limited to:
different variance depending on the feature value
non-normal distribution of residuals (errors are either +1 or -1, form clusters, etc.)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is a great read on regression diagnostics for beginners.
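For reference, a self-contained sketch of the residual plots described above (synthetic data with a quadratic relationship, so the residuals form the U-shape that suggests adding a squared feature):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x**2 + rng.normal(0, 0.5, 500)               # quadratic ground truth

reg = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = reg.predict(x.reshape(-1, 1)) - y    # predicted - true, as above

plt.scatter(x, residuals, s=5, alpha=0.6)
plt.axhline(0, color='red', linewidth=1)
plt.xlabel("feature value")
plt.ylabel("residual (predicted - true)")
plt.title("U-shaped residuals -> add a squared feature")
plt.show()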
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, these are just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.datasets import load_boston
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# keep only the 'neighbourhood' field and one-hot encode it (convert non-numerics to numerics)
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
from sklearn.ensemble import RandomForestClassifier
features = df_new.columns.values
clf = RandomForestClassifier()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to work with numeric data, so let's try the longitude and latitude coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
# Note: statsmodels' OLS does not add an intercept by default;
# wrap X with sm.add_constant(X) if you want one.
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One-hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using longitude/latitude performs slightly better, but this, too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of these features is helping you predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (around 70%, give or take, but still much better than what we're seeing with the Airbnb data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

Model Fitting and Cross-Validation

I'm training a dataset and then testing it on some other dataset.
To improve performance, I wanted to fine-tune my parameters with a 5-fold cross validation.
However, I think I'm not writing the correct code, because when I try to fit the model to my testing set, it says the model hasn't been fitted yet. I thought the cross-validation part fitted the model? Or maybe I have to extract it?
Here's my code:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

svm = SVC(kernel='rbf', probability=True, random_state=42)
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
pred_test = svm.predict(data_test)  # this is where it complains: svm itself was never fitted
accuracy = accuracy_score(lbs_test, pred_test)
That is correct: cross_val_score doesn't return a fitted model. In your example, you have cv=5, which means that the model was fit 5 times. So, which of those do you want? The last?
The function cross_val_score is a simpler version of sklearn.model_selection.cross_validate, which returns not only the scores but more information.
So you can do something like this:
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score

svm = SVC(kernel='rbf', probability=True, random_state=42)
cv_results = cross_validate(svm, data_train, lbs_train, cv=5, return_estimator=True)
# cv_results is a dict with the following keys:
# 'test_score'  - what cross_val_score returns
# 'train_score' - only if return_train_score=True
# 'fit_time'
# 'score_time'
# 'estimator'   - the fitted estimator from each fold, only if return_estimator=True
accuracies = cv_results['test_score']  # what you had before
svms = cv_results['estimator']
print(len(svms))  # 5
svm = svms[-1]  # the last fitted svm, or pick any that you want
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
Note that here you need to pick one of the 5 fitted SVMs. Ideally, you would use cross-validation only to estimate the performance of your model, so you don't need to test again at the end. Then you would fit your model one more time, but this time on ALL the data; that would be the model you actually use in production.
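A minimal sketch of that last step (same variable names as in the question):
final_svm = SVC(kernel='rbf', probability=True, random_state=42)
final_svm.fit(data_train, lbs_train)         # fit once on all available training data
pred_test = final_svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)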
Another note: you mentioned that you want to use this to fine-tune the parameters of your model. Perhaps you should look at hyper-parameter optimization. For example, https://datascience.stackexchange.com/a/36087/54395 shows how to use cross-validation and define a parameter search space.
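For example, a minimal grid-search sketch (the parameter grid is illustrative; variable names follow the question):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.001]}   # illustrative values
search = GridSearchCV(SVC(kernel='rbf', probability=True, random_state=42),
                      param_grid, cv=5)
search.fit(data_train, lbs_train)
print(search.best_params_, search.best_score_)
best_svm = search.best_estimator_            # refit on all of data_train by default (refit=True)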

Getting probabilities of best model for RandomizedSearchCV

I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV, or second, getting the best parameters from RandomizedSearchCV and then doing a 10-fold cross-validation again (with the same seed so that I get the same splits) with these best parameters.
Edit: Is the following code correct for getting the probabilities of the best-performing model? X is the training data, y are the labels, and model is my RandomizedSearchCV containing a Pipeline with missing-value imputation, standardization, and an SVM.
import numpy as np
from sklearn.model_selection import StratifiedKFold

cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_

for train, test in cv_outer.split(X, y):
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
Example:
import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

# svc is assumed to be your (unfitted) classifier
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)
Now that you know which split (max_score_split) gave the best score, you can regenerate that split yourself and fit your model with it.
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train) # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself, but it should work with minor modifications.
You need to look at cv_results_; this will give you the scores and mean scores for all of your folds, along with fit times, etc.
If you want to call predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of them, and then predict the probabilities, as the individual models are not cached anywhere, as far as I know (see the sketch below).
best_params_ will give you the best-performing parameters, if you want to train a model using just the best parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
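A rough sketch of that loop (assuming search is a fitted RandomizedSearchCV whose underlying model supports predict_proba, and that X_train, y_train, X_test exist; the names are illustrative):
from sklearn.base import clone

probas_per_candidate = []
for params in search.cv_results_['params']:
    est = clone(search.estimator).set_params(**params)   # fresh copy with this candidate's params
    est.fit(X_train, y_train)
    probas_per_candidate.append(est.predict_proba(X_test))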

Does fitting a sklearn Linear Regression classifier multiple times add data points or just replace them?

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.array(df.drop([label], axis=1))
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df[label])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
linReg = LinearRegression()
linReg.fit(X_train, y_train)
I've been fitting my linear regression classifier over and over again with data from different spreadsheets under the assumption that every time I fit the same model with a new spreadsheet, it is adding points and making the model more robust.
Was this assumption correct? Or am I just wiping the model every time I fit it?
If so, is there a way for me to fit my model multiple times for this 'cumulative' type effect?
Linear regression is a batch (a.k.a. offline) training method; you can't add knowledge incrementally with new patterns. So, sklearn is re-fitting the whole model. The only way to add data is to append the new patterns to your original training X, y matrices and re-fit.
You're almost certainly wiping your model and starting from scratch. To do what you want, you need to append the additional data to the bottom of your data frame and re-fit using that.
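A minimal sketch of that append-and-refit approach (the frame and column names are illustrative):
import pandas as pd
from sklearn.linear_model import LinearRegression

df_all = pd.concat([df_old, df_new], ignore_index=True)   # stack the old and new spreadsheets
X = df_all.drop(columns=['target'])
y = df_all['target']

linReg = LinearRegression()
linReg.fit(X, y)                                          # re-fit from scratch on the combined data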
