Predict future values using linear regression - Python

So I've made a model for value prediction using linear regression, and now I need it to predict values for the years 2022-2024. How can I do that? Maybe add rows for 2023-2024 to the dataframe? But would that be correct?
Data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# parse the year column and use it as the index
data['Year'] = pd.to_datetime(data['Year'])
data.index = data['Year']
data.drop(['Year'], axis=1, inplace=True)
# fill missing values backward, then forward
data = data.bfill().ffill()

y = data['x4']
X = data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

If you want to predict, you just have to pass in the new data:
X_new = new_data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
y_new = model.predict(X_new)
model is your trained linear regression; you predict with the new data, in the same order and the same format as your X_train/X_test, and that's it.
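For the 2022-2024 forecast specifically, note that the model still needs values of the regressors (x1, x3, ..., x17) for those years; a regression on exogenous features cannot predict from empty rows. A minimal sketch, assuming those feature values are available (estimated or known) in a hypothetical DataFrame called future_data indexed by year:

import pandas as pd

feature_cols = ['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']

# future_data is an assumed DataFrame with one row per future year and the
# same feature columns, e.g. indexed by pd.to_datetime(['2022', '2023', '2024'])
X_future = future_data[feature_cols]

# predict x4 for the future years with the already-fitted model from above
y_future = pd.Series(model.predict(X_future), index=future_data.index, name='x4_pred')
print(y_future)

If the future feature values are unknown, simply appending empty 2023-2024 rows to the dataframe will not work; you would first need to forecast the features themselves, or switch to a model that only uses past values of x4 (a time-series approach).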

Related

prediction scores on test vs train dataset different with/without gridsearchcv

I am exploring the use of GridSearchCV from sklearn to predict data. After fitting the data with a RandomForestRegressor, I calculate the score (MSE) for the test and the train data. There is a huge difference between the train MSE and the test MSE (even though the scores should be similar).
Here is the code:
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# create regressor pipeline
pipeline_estimators = Pipeline([
    ('RandomForest', RandomForestRegressor()),
])
param_grid = [{'RandomForest__n_estimators': np.linspace(50, 100, 3).astype(int)}]

search = GridSearchCV(estimator=pipeline_estimators,
                      param_grid=param_grid,
                      scoring='neg_mean_squared_error',
                      cv=2)
search.fit(X_train, y_train)

y_test_predicted = search.best_estimator_.predict(X_test)
y_train_predicted = search.best_estimator_.predict(X_train)
print('MSE test predict', metrics.mean_squared_error(y_test, y_test_predicted))
print('MSE train predict', metrics.mean_squared_error(y_train, y_train_predicted))
The output is:
MSE test predict 0.0021045875412650343
MSE train predict 0.000332850878980335
If I don't use GridSearchCV but a FOR loop over the different 'n_estimators' values, the MSE scores obtained for the predicted test and train sets are very close.
To add more detail on the 'FOR loop' approach, here is the code:
n_estimators = np.linspace(50, 100, 3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
mse_train = []
mse_test = []
for val_n_estimators in n_estimators:
    regressor = RandomForestRegressor(n_estimators=val_n_estimators)
    regressor.fit(X, y)
    y_pred = regressor.predict(X_test)
    y_test_predicted = regressor.predict(X_test)
    y_train_predicted = regressor.predict(X_train)
    mse_train.append(metrics.mean_squared_error(y_train, y_train_predicted))
    mse_test.append(metrics.mean_squared_error(y_test, y_test_predicted))
For this code, mse_train and mse_test are very similar. But using GridSearchCV (see the code at the top of the post), they are not.
Any suggestions?
Why is there such a difference in scores when using GridSearchCV?
Thank you.
Marc
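Not part of the original post, but one detail worth checking: in the FOR loop above the regressor is fit on the full X, y, so the test rows are seen during training, which can make the test MSE optimistically close to the train MSE. A minimal sketch of the same loop fitting on the training split only, which would be the fair comparison to GridSearchCV's best_estimator_:

# Hypothetical comparison loop: fit on the training split only so that the
# test set stays unseen, mirroring how search.best_estimator_ is evaluated.
for val_n_estimators in n_estimators:
    regressor = RandomForestRegressor(n_estimators=val_n_estimators)
    regressor.fit(X_train, y_train)   # train on the training split, not the full X, y
    y_train_predicted = regressor.predict(X_train)
    y_test_predicted = regressor.predict(X_test)
    print(val_n_estimators,
          metrics.mean_squared_error(y_train, y_train_predicted),
          metrics.mean_squared_error(y_test, y_test_predicted))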

How do you get a prediction for a specific value in linear regression in python

I have trained my ML model with linear regression using this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
How do I get a prediction for a single row in the data set?
You can easily do something like this:
y_pred = regressor.predict(X_test)
So if you want to do inference on only a single row in the dataset, say the row at index 2, keep the input two-dimensional (predict expects a 2-D array):
y_pred = regressor.predict(X_test[2:3])  # a one-row slice stays 2-D
Scikit-learn models all have a predict method you can use.
Just pass it your row as a 2-D array and you'll be fine:
regressor.predict(x_val)
Since you only want the first row, you could do
first_row = X_test[0:1]  # assuming X_test holds your test data; slicing keeps it 2-D
y_pred = regressor.predict(first_row)
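A small self-contained sketch (with made-up data, not from the answers above) showing the two common ways to keep a single row two-dimensional, depending on whether your test data is a pandas DataFrame or a NumPy array:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up data purely for illustration
X = pd.DataFrame({'a': np.arange(10.0), 'b': np.arange(10.0) * 2})
y = 3 * X['a'] + X['b']
regressor = LinearRegression().fit(X, y)

# DataFrame: double brackets keep the selection 2-D (a one-row DataFrame)
print(regressor.predict(X.iloc[[2]]))

# NumPy array: reshape the 1-D row back to shape (1, n_features)
# (recent scikit-learn versions may warn about missing feature names here)
X_arr = X.to_numpy()
print(regressor.predict(X_arr[2].reshape(1, -1)))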

How to get a specific row for testing and other for training?

I want to test a specific row from my dataset and see the result, but I don't know how to do it. For example, I want to test row number 100 and then see the accuracy.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

feature_cols = [0, 1, 2, 3, 4, 5]
X = df[feature_cols]  # Features
y = df[6]             # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
                                                    random_state=1)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier(max_depth=5)

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
import numpy as np

test_row = 100
train_idx = np.arange(X.shape[0]) != test_row
test_idx = np.arange(X.shape[0]) == test_row

X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.
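A short continuation of this idea (not in the original answer), reusing the split above to train on everything except row 100 and then score that single row:

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# train on all rows except the held-out one, then predict that single row
clf = DecisionTreeClassifier(max_depth=5)
clf = clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                                  # prediction for row 100 only
print("Predicted:", y_pred[0], "Actual:", y_test.iloc[0])
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))    # will be either 0.0 or 1.0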

How to access scikit-learn's cross validation X and Y scores?

I'm using scikit-learn's linear regression in combination with (k-fold) cross validation. Now, to calculate a t-test I need access to my array of errors (y_test - y_pred).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Load dataset
smartphone = pd.read_csv('all-users_w4_filtered.csv')

# Define independent variables (X) and dependent variable (y)
x = smartphone[['mood_mean', 'valence_mean', 'app.social_mean', 'app.other_mean']].values
y = smartphone['target'].values
z = smartphone['benchmark'].values

# Reshape data
x = x.reshape(-1, 4)
y = y.reshape(-1, 1)

# First shuffle the data, then perform k-fold cross-validation
regr = LinearRegression()  # the linear regression model mentioned in the question
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(regr, x, y, cv=kf)
print('5-fold shuffled cross validation scores: ', scores)
print('Mean of cross-validation:', scores.mean())

for train_index, test_index in kf.split(x):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output of scores:
5-fold shuffled cross validation scores:  [0.21801002 0.3282497  0.27146692 0.36056872 0.29064657]
Mean of cross-validation: 0.293788384892
How do I access the arrays of y_pred that are used to calculate the MSEs in cross_val_score?
I tried replicating it using regr.fit / regr.predict on the indexed data in the loop, but this yielded different results.
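This is not from the original post, but one standard way to get the out-of-fold predictions behind those scores is cross_val_predict with the same KFold splitter; a minimal sketch:

from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each sample is predicted by the model that did not
# see it during training, using the same shuffled 5-fold splitter as above.
y_pred = cross_val_predict(regr, x, y, cv=kf)

# per-sample errors (y_test - y_pred, stacked across the five folds)
errors = y.ravel() - y_pred.ravel()
print(errors[:10])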

Scaling of validation data in supervised ML algorithm

I have written a classification algorithm in Python. It conforms to Scikit-Learn's API. Given labeled data X, y, I would like to train my algorithm on this data in the following way:
X, y are split into X_aux, y_aux and X_test, y_test.
X_aux, y_aux are split into X_train, y_train and X_val, y_val.
Then, using Scikit-Learn, I define a Pipeline which is the concatenation of a StandardScaler (for feature normalization) and my model. Finally, the pipeline is trained and evaluated as follows:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', Model())])
pipe.fit(X_train, y_train, validation_data = (X_val, y_val))
pred_proba = pipe.predict_proba(X_test)
score = roc_auc_score(y_test, pred_proba)
The fit method of Model accepts a validation_data parameter to monitor progress during training and possibly avoid overfitting. To this end, at each iteration, the fit method prints the model loss on the training data (X_train, y_train) (training loss) and on the validation data (X_val, y_val) (validation loss). In addition to the validation loss, I would also like the fit method to return the ROC AUC score on the validation data. My question is the following:
Should X_val be normalized with the scaler of the pipeline before it is used to compute the validation ROC AUC score during training? Also, in this code, only X_train is normalized by the scaler. Should I do X_aux = scaler.fit_transform(X_aux) instead and then split into train/validation?
I apologize in advance, as my question is very naive. I confess I got confused.
I think that X_val should be normalized. The way I see it is that the few lines of code above are equivalent to:
scaler = StandardScaler()
clf = Model()

X_train = scaler.fit_transform(X_train)
clf.fit(X_train, y_train, validation_data=(X_val, y_val))
# During `fit`, at each iteration, we would have:
#   train_loss      = loss(X_train, y_train)
#   validation_loss = loss(X_val, y_val)
#   pred_proba_val  = predict_proba(X_val, y_val)        # (*)
#   roc_auc_val     = roc_auc_score(y_val, pred_proba_val)

X_test = scaler.transform(X_test)
pred_proba = clf.predict_proba(X_test)                    # (**)
score = roc_auc_score(y_test, pred_proba)
On line (*), predict_proba is called on unnormalized data, whereas on line (**) it is called on normalized data. This is why I believe that X_val should be normalized. Still, I am not sure whether my thinking is correct.
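A minimal sketch of the usual convention (my own addition, not part of the original answer): fit the scaler on X_train only and reuse its statistics for X_val and X_test, so that the validation loss and validation ROC AUC are computed on data scaled exactly like the training data:

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics learned from the training split only
X_val_s = scaler.transform(X_val)           # validation data scaled with the training statistics
X_test_s = scaler.transform(X_test)         # test data scaled the same way

clf = Model()                               # the custom model from the question
clf.fit(X_train_s, y_train, validation_data=(X_val_s, y_val))

pred_proba = clf.predict_proba(X_test_s)
score = roc_auc_score(y_test, pred_proba)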
