How do I calculate the MSE during walk-forward optimization? - python

I am trying to predict the variable "ec" with several models, using features lagged by one period. To see which model fits the data best (I am comparing OLS, Ridge, Lasso and ARIMAX), I use a walk-forward approach (expanding window) and want to calculate the mean squared error (MSE) for each model. I am providing the code for my OLS model as an example. Although the code seems to work, I am not sure whether my MSE calculation is correct: as the code below shows, I save the MSE of each iteration (each combination of training and test set) in a list (ols_mse_list) and then compute the "overall" MSE as the average of that list. Is that the correct way? I am slightly confused, as I couldn't find clear instructions on how to calculate the MSE during walk-forward optimization.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Separate the predictors and label
X_bss = data_bss[data_bss.columns[~data_bss.columns.isin(
    ["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"])]]
y = data_bss["ec"]
# expanding_window: splitter class defined elsewhere (not shown)
tscv = expanding_window(initial=350, horizon=12, period=1)
for train_index, test_index in tscv.split(X_bss):
    print("Train:", train_index)
    print("Test :", test_index)
ols = LinearRegression()
ols_mse_list = []
ols_mean_mse = []
# Loop through the splits. Run a Linear Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(data_bss)):
    X_train = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[train_index]
    y_train = data_bss[["ec"]].iloc[train_index]
    X_test = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[test_index]
    y_test = data_bss[["ec"]].iloc[test_index]
    ols.fit(X_train, y_train)
    # MSE on this split's test window
    ols_mse = mean_squared_error(y_test, ols.predict(X_test))
    ols_mse_list.append(ols_mse)
# Average the per-split MSEs into one overall value
ols_mean_mse.append(np.mean(ols_mse_list))
print("OLS MSE:", ols_mean_mse)
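Averaging the per-split MSEs, as described above, can be checked against the MSE pooled over all out-of-sample predictions. The following is only a minimal sketch: it assumes scikit-learn's TimeSeriesSplit as a stand-in for the expanding_window splitter (whose definition is not shown) and uses synthetic data in place of data_bss. When every test fold has the same length, the two numbers coincide; with folds of unequal length, the plain average would weight observations in short folds more heavily.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the real data (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Expanding training window with five test folds of 12 steps each
tscv = TimeSeriesSplit(n_splits=5, test_size=12)

fold_mse, all_true, all_pred = [], [], []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], pred))
    all_true.extend(y[test_idx])
    all_pred.extend(pred)

# Average of the per-fold MSEs (the approach used in the question)
mean_of_fold_mse = float(np.mean(fold_mse))
# MSE computed once over every out-of-sample prediction
pooled_mse = float(mean_squared_error(all_true, all_pred))
print(mean_of_fold_mse, pooled_mse)
```

Keeping both the per-fold list and the pooled value is cheap, and the per-fold list also shows how the error varies across the walk-forward steps.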

Related

The intercept is half of the real value in logistic regression

For a scientific study, I need to run a traditional logistic regression using Python and scikit-learn. After fitting my regression model with "penalty='none'", I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
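import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_excel("data.xlsx")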
train, test = train_test_split(df, train_size = 0.8, random_state = 42)
train = train.drop(["Unnamed: 0"], axis = 1)
test = test.drop(["Unnamed: 0"], axis = 1)
x_train = train.drop(["GRUP"], axis = 1)
x_train = sm.add_constant(x_train)
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis = 1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty = "none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) "28.7140", but with scikit-learn I get "14.35698738". The other coefficients are the same. I verified it in SPSS, and the first one is the correct value. I don't want to use statsmodels only for logistic regression. Could you please help?
PS: Without an intercept, the model works fine.
The issue is that the code you posted adds a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train), and then passes that same x_train to sklearn's LogisticRegression, whose fit_intercept parameter defaults to True. At that stage the model contains two constant terms: the column you added and the intercept that sklearn fits itself. The true intercept is shared between them, which is why intercept_ alone comes out too small.
So, you should either set fit_intercept=False in the sklearn code, or leave fit_intercept=True but pass the x_train array without the added constant term.

What is the difference between these two ways of specifying training/testing data for sklearn GPR

This is somewhat of a follow up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think that I may be making a methodological mistake in how I am using training vs testing data.
Essentially I'm wondering what the difference is between specifying training data by splitting the input between test and training data like this:
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
vs using the full data set to train like this.
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split off the training data from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, place some data in Excel and plot it with a smooth line. Technically, that spline passes through every point, so it is a "perfect" model of the data you have, but it is useless for predicting new values.
In your example, you predict over a uniform grid so that you can visualize what your model thinks the underlying function is. That does not tell you how well the model generalizes, though. Sometimes you can get a very high score (> 95%) on the training data and worse than chance on the test data, which means the model is overfitting.
In addition to predicting over a uniform grid to visualize the model, you should also predict values for the test set, and then compare the accuracy metrics for the test and training data.
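As a rough sketch of that comparison (synthetic data stands in for some_data and other_data, and the number of optimizer restarts is reduced to keep it quick), the same fitted regressor can be scored on both sets and still be used for the uniform-grid prediction:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

# Synthetic noisy sine curve (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                              random_state=0)
gp.fit(X_train, y_train)

# R^2 on data the model has seen vs. data it has not seen
train_r2 = gp.score(X_train, y_train)
test_r2 = gp.score(X_test, y_test)
print("train R^2:", train_r2, "test R^2:", test_r2)

# Uniform grid, used only to visualize the fitted function
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
```

A large gap between the two scores is the signal of overfitting described above; the grid prediction by itself cannot show it.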

Variation in output of Logistic Regression when using SMOTE

I am working on a logistic regression problem with an imbalanced target variable. To fix this I am using SMOTE (Synthetic Minority Oversampling Technique), but each time I run my regression model, I get different numbers in my confusion matrix. I have set the random_state parameter for both SMOTE and LogisticRegression, to no avail. The features are the same in every run. I once got a best recall of 0.81 and an AUC of 0.916, but I cannot reproduce those values. On some runs, the numbers of false positives and false negatives shoot up, indicating that the classifier is very bad.
Please tell me what I am doing wrong here. Below is the code snippet.
# Feature Selection
features = ['FEMALE', 'MALE', 'SINGLE', 'UNDER_WEIGHT', 'OBESE', 'PROFESSION_ANYS',
            'PROFESSION_PROF_UNKNOWN']
# Set X and Y Variables
X5 = dataframe[features]
# Target variable
Y5 = dataframe['PURCHASE']
# Split into train/test sets, then oversample the training set with SMOTE
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state = 4)
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5,Y5, test_size=0.20)
os_data_X5,os_data_Y5 = os.fit_sample(X5_train, Y5_train)
columns = X5_train.columns
os_data_X5 = pd.DataFrame(data = os_data_X5, columns = columns )
os_data_Y5 = pd.DataFrame(data = os_data_Y5, columns = ['PURCHASE'])
# Instantiate Logistic Regression model
logreg_5 = LogisticRegression(random_state = 4, penalty='l1', class_weight = 'balanced')
# Fit the model with train data
logreg_5.fit(os_data_X5,os_data_Y5)
# Make predictions on test data set
Y5_pred = logreg_5.predict(X5_test)
# Make Confusion Matrix to compare results against actual values
cnf_matrix = metrics.confusion_matrix(Y5_test, Y5_pred)
cnf_matrix
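One thing that stands out in the snippet is that train_test_split is called without a random_state, so the rows that end up in the training and test sets can change on every run, regardless of the seeds given to SMOTE and LogisticRegression. Whether this fully explains the variation is an assumption, but it is easy to check. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small imbalanced toy data set (assumption, stands in for X5 / Y5)
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)

# Seeded: both calls return exactly the same split
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.20, random_state=4)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.20, random_state=4)
same_split = np.array_equal(X_test_a, X_test_b)
print("seeded splits identical:", same_split)

# Unseeded: the split is drawn afresh on every call (and every run)
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.20)
_, X_test_d, _, _ = train_test_split(X, y, test_size=0.20)
print("unseeded splits identical:", np.array_equal(X_test_c, X_test_d))
```

With a fixed split, any remaining run-to-run differences would have to come from the resampling or the solver.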

How to Compare Random Forest (without scaling) and LSTM (with scaling) using RMSE and MAE Performance Metrics

I am new to machine learning and am trying my hand at Bitcoin price prediction using multiple models: Random Forest, simple linear regression, and a neural network (LSTM).
As far as I have read, Random Forest and linear regression don't require the input features to be scaled, whereas an LSTM does.
If I compare the MAE and RMSE of the algorithms (with and without scaling), the results will definitely be on different scales, so I can't tell which model performs better.
How should I compare the performance of these models now?
Update - Adding my code
Data
bitcoinData = pd.DataFrame([[('2013-04-01 00:07:00'),93.25,93.30,93.30,93.25,93.300000], [('2013-04-01 00:08:00'),100.00,100.00,100.00,100.00,93.300000], [('2013-04-01 00:09:00'),93.30,93.30,93.30,93.30,33.676862]], columns=['time','open', 'close', 'high','low','volume'])
bitcoinData.time = pd.to_datetime(bitcoinData.time)
bitcoinData = bitcoinData.set_index(['time'])
x_train = train_data[['high','low','open','volume']]
y_train = train_data[['close']]
x_test = test_data[['high','low','open','volume']]
y_test = test_data[['close']]
Min-Max Scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler1 = MinMaxScaler(feature_range=(0, 1))
x_train = scaler.fit_transform(x_train)
y_train = scaler1.fit_transform(y_train)
x_test = scaler.transform(x_test)
y_test = scaler1.transform(y_test)
MSE Calculation
from math import sqrt
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Root Mean Squared Error(RMSE) : ", sqrt(mean_squared_error(y_test,preds)))
print("Mean Absolute Error(MAE) : ", mean_absolute_error(y_test,preds))
r2 = r2_score(y_test, preds)
print("R Squared (R2) : ",r2)
You scale your input data, not the output.
The input data is irrelevant for your error calculation.
If you really want to scale your LSTM output data, just scale it the same way for the other models.
EDIT:
From your comment:
I only scaled my input data in LSTM
No, you don't. You do transform your output data. And from what I read, I assume you only transform it for the neural network.
So your y data for the LSTM is around 100 times smaller. The error is squared, so you get 100*100 = 10,000, which is roughly the factor by which your neural net performs "better" than the random forest.
Option 1:
Remove those three lines:
scaler1 = MinMaxScaler(feature_range=(0, 1))
y_train = scaler1.fit_transform(y_train)
y_test = scaler1.transform(y_test)
Don't forget to use a final layer whose output is unbounded above (for example, a linear activation).
Option 2:
Scale the target for your other models as well and compare the errors on the scaled values.
Option 3:
Use the inverse_transform(pred) method of your MinMaxScaler on your predictions, and calculate your errors with the inverse-transformed predictions and the untransformed y_test data.
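Option 3 might look roughly like this. The numbers are made up for illustration, and preds_scaled is a hypothetical stand-in for what the LSTM would output in scaled space:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Toy target values in original price units (assumption)
y_train = np.array([[90.0], [100.0], [110.0], [120.0]])
y_test = np.array([[95.0], [105.0]])

# Scaler fitted on the training target only, as in the question
scaler1 = MinMaxScaler(feature_range=(0, 1))
scaler1.fit(y_train)

# Hypothetical model output in scaled space: always 0.1 too high
preds_scaled = scaler1.transform(y_test) + 0.1

# Errors in scaled space (not comparable with an unscaled model)
rmse_scaled = np.sqrt(mean_squared_error(scaler1.transform(y_test), preds_scaled))

# Map predictions back to price units, then compare with the raw y_test
preds = scaler1.inverse_transform(preds_scaled)
rmse = np.sqrt(mean_squared_error(y_test, preds))
mae = mean_absolute_error(y_test, preds)
print(rmse_scaled, rmse, mae)
```

Here the training target spans 30 price units, so an error of 0.1 in scaled space corresponds to an error of 3 in original units, which is the kind of scale factor that makes the raw comparison misleading.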

Building a linear regression model with cross-validation in Python

I have about 1.3k samples of leaf temperature and I'm trying to predict this temperature using atmospheric variables such as air temperature, solar radiation, wind, and humidity.
I started off simple with a multivariate linear regression model, but I wanted to kick it up a notch in terms of accuracy so I decided to try out the leave-one-out cross-validation method in order to get the best model output. I ultimately seek to collect the coefficients and intercept so that I can use this model for later.
Now, from what I understand, cross-validation can have two purposes. The first seems to be to compare the accuracy of your model with that of other models and to decide which is best after going through numerous training data.
The second purpose (and the one I'm trying to use) is that you can use cross-validation in order to improve the accuracy of one single model. In other words, the final model that I'm trying to build has been built after considering all the possible training sets. I have a feeling like I could be all wrong for that 2nd purpose.
Anyways, inspired by what I've seen (most notably this and this), I've developed the following code:
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
# Leave-one-out cross-validation (LOOCV)
# Y_data and X_data are both pandas DataFrames
loo = LeaveOneOut()
loo.get_n_splits(X_data)
ytests = []
ypreds = []
All_coef = list()
All_intercept = list()
for train_index, test_index in loo.split(X_data):
    X_train, X_test = X_data.iloc[train_index], X_data.iloc[test_index]
    Y_train, Y_test = Y_data.iloc[train_index], Y_data.iloc[test_index]
    model = LinearRegression()
    model.fit(X=X_train, y=Y_train)
    Y_pred = model.predict(X_test)
    All_coef.append(model.coef_)
    All_intercept.append(model.intercept_)
    ytests += Y_test.values.tolist()[0]
    ypreds += list(Y_pred)
rr = metrics.r2_score(ytests, ypreds)
ms_error = metrics.mean_squared_error(ytests, ypreds)
But this is weird, because the linear regression sits inside the cross-validation loop rather than outside it, so I can't really get a final model out of this. Am I supposed to have the LinearRegression() and .fit() outside of the loop as well? If so, how do I validate the final model?
I was also wondering how I'm supposed to get the coefficients and the intercept from my model. If I keep the linear regression within the loop, I'll get a set of coefficients for every training set. Would it be wise to take some sort of mean of them?
Thank you so much for your consideration!
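One common way to read the situation described above is that the cross-validation loop only estimates how well this kind of model predicts unseen samples, while the model to keep is fitted once, outside the loop, on all of the data. The following is a minimal sketch of that reading, with synthetic data in place of X_data and Y_data and cross_val_predict replacing the manual loop; it is not the only reasonable setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the leaf-temperature data (assumption)
rng = np.random.default_rng(0)
X_data = rng.normal(size=(60, 4))
Y_data = (X_data @ np.array([0.8, 0.3, -0.2, 0.1]) + 5.0
          + rng.normal(scale=0.1, size=60))

# Step 1: LOOCV predictions, each made by a model that never saw that sample
y_loo = cross_val_predict(LinearRegression(), X_data, Y_data, cv=LeaveOneOut())
loo_r2 = r2_score(Y_data, y_loo)
loo_mse = mean_squared_error(Y_data, y_loo)

# Step 2: the final model, fitted once on every sample
final_model = LinearRegression().fit(X_data, Y_data)
coef, intercept = final_model.coef_, final_model.intercept_
print(loo_r2, loo_mse, coef, intercept)
```

Under this reading the LOOCV scores are reported as the validation of the final model, and the coefficients and intercept are taken from the single fit on all data instead of being averaged across folds.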
