Relationship between sklearn .fit() and .score() - python

While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (instead of comparing the training and original data). However, I learned that you must fit the model before you score it, so I'm not sure whether I'm scoring the original data (as passed in for R2) or the data that I used to fit the model (X_train and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result from the one I got when scoring X and y. So my question is: are the inputs to .score evaluated against the model that was fitted (making lm.fit(X, y); lm.score(X, y) the R^2 value for the original data, and lm.fit(X_train, y_train); lm.score(X, y) the R^2 value for the original data based on the model created in .fit), or is something else entirely happening?

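fit() only fits the model to the data it is given. "Fitting" is synonymous with "training", so fitting the data means training on it.
score() is like testing: it predicts on the data you pass in and compares those predictions with the true values, using the model that was already fitted.
So you should use different datasets for training the classifier and for testing its accuracy.
You can do it like this: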
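# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn import model_selection, neighbors
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)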
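To tie this back to the linear regression in the question, here is a minimal sketch, assuming synthetic data from make_regression (the sample sizes and noise level are made up for illustration). It shows that .score() always evaluates the model as fitted by the last .fit() call, on whatever data is passed to .score():

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the asker's data set)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lm = LinearRegression()
lm.fit(X_train, y_train)  # the coefficients are determined here, from the training data only

# Each call scores the SAME fitted model on a different data set
r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)
r2_all = lm.score(X, y)  # training-set model evaluated on all of the data

# score(X, y) is the R^2 between y and the model's predictions on X
r2_test_manual = r2_score(y_test, lm.predict(X_test))
print(r2_train, r2_test, r2_all)
```

So lm.fit(X_train, y_train); lm.score(X, y) is the R^2 of the training-set model on the full data, which generally differs from the R^2 on X_train alone.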

Related

How do I predict the future closing price of stock after training and testing?

I am trying to do multivariate time series forecasting using linear regression model.
In the below code I first split the data in 80-20 ratio for training and testing.
Then I train the model and use the model to predict using test and compute the relevant performance metrics of the model.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Split data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(df[['EMA_10']], df[['close']], test_size=.2)
# Create Regression Model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Use model to make predictions
y_pred = model.predict(X_test)
# Printout relevant metrics
print("Model Coefficients:", model.coef_)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Coefficient of Determination:", r2_score(y_test, y_pred))
Now how do I predict the next i.e. future value?
To predict unseen y, you can simply use .predict(<new x here>).
However, why are you using linear regression to tackle a time series problem? It makes the data lose its time dimension. For time series forecasting it is generally a good idea to use a model specifically designed for time series data, such as an autoregressive model (e.g., ARIMA) or a deep learning model (e.g., an RNN). These kinds of models can account for the temporal dependencies present in time series data, which can help improve the accuracy of the forecasts.
There are many good resources for that, such as:
https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
https://towardsdatascience.com/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f
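The .predict call mentioned at the start of this answer can be sketched as follows. This is only an illustration: the price series is randomly generated, and the column names EMA_10 and close are borrowed from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical price data using the question's column names
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(size=60).cumsum())
df = pd.DataFrame({
    "close": close,
    "EMA_10": close.ewm(span=10, adjust=False).mean(),
})

model = LinearRegression()
model.fit(df[["EMA_10"]], df[["close"]])

# New input: same column name and 2-D shape as the training features
new_X = pd.DataFrame({"EMA_10": [df["EMA_10"].iloc[-1]]})
next_close = model.predict(new_X)
print(next_close)
```

Note that a model fitted this way maps an EMA value to the close of the same row. To forecast the following period, the target would have to be shifted so that each row's EMA_10 is paired with the next close; that is a modelling choice the original code does not make.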

Get all prediction values for each CV in GridSearchCV

I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular Kfold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)
# GridSearchCV takes param_grid and has no max_iters or early_stopping arguments
model = GridSearchCV(
    estimator=pipeline,
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The cross-validation follows the TimeSeriesSplit scheme, where each split trains on earlier samples and tests on the samples that come after them.
However, I would actually like to have the predictions for all the test sets from all the CV splits, and I have no idea how to get them out of the model.
cv_results_ gives me the score (from the scoring parameter) for each split and each hyperparameter setting, but I can't find the predicted values for each sample in each test split. I need those for some backtesting. I don't think it would be "fair" to use the final model to predict the earlier values; I would expect some kind of overfitting in that case.
So, is there any way for me to extract the predicted values for each split?
You can pass a custom scoring function to GridSearchCV. Inside it, you can predict outputs with the estimator that GridSearchCV fitted for that particular fold.
From the documentation, the scoring parameter is the
"Strategy to evaluate the performance of the cross-validated model on the test set."
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV

def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(estimator=pipeline,
                     param_grid=param_grid,  # required: your own hyperparameter grid
                     scoring=custom_scorer)
The X and y passed to the scorer come from the test part of each split (with return_train_score=True the scorer is also called on the training part). clf is the pipeline given as the estimator parameter, fitted on that split's training data.
Obviously your estimator should implement the predict method (it should be a valid scikit-learn model). You can add other scorers alongside the custom one to catch nonsensical scores from the custom function.
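If storing predictions from inside a scorer is awkward (for example with n_jobs=-1, where scoring runs in worker processes), a different route is to loop over the splits yourself. The following is a minimal sketch of that alternative, not the custom-scorer approach above. The data is random, and the pipeline with a step named "estimator" is an assumption based on the estimator__alpha parameter in the question.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical time-ordered data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Assumed pipeline layout; only the step name "estimator" comes from the question
pipeline = Pipeline([("scaler", StandardScaler()), ("estimator", Lasso())])
tscv = TimeSeriesSplit(n_splits=5)

# (alpha, fold) -> (test indices, predictions on that fold's test set)
fold_predictions = {}
for alpha in (0.05, 0.5):
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        # Fresh, unfitted copy so that no state leaks between folds
        est = clone(pipeline).set_params(estimator__alpha=alpha)
        est.fit(X[train_idx], y[train_idx])
        fold_predictions[(alpha, fold)] = (test_idx, est.predict(X[test_idx]))
```

Each stored prediction comes from a model that only saw data earlier than the test fold, which is the property needed for backtesting.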

What is the difference between grid.score(X_valid, y_valid) and grid.best_score_

While doing GridSearchCV, what is the difference between the scores obtained through grid.score(...) and grid.best_score_
Kindly assume that a model, features, target, and param_grid are in place. Here is a part of the code I am very curious to know about.
grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                    return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
scores = grid.score(X_valid, y_valid)
best_score_1 = scores
best_score_2 = grid.best_score_
There are two different outputs for each of best_score_1 and best_score_2
I am trying to know the difference between the two as well as which of the following should be considered to be the best scores that came out from the given param_grid.
Following is the full function.
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides
    comparison between train's MSE and test's MSE to check for
    Generalization and optionally deploys the best-found parameters
    on the Test Set as well.
    Args:
        df: DataFrame
        model: a model to use
        features: features to consider
        target: labels
        params: Param_Grid for Optimization
        test: False by Default, if True, predicts on Test
    Returns:
        MSE scores on models and slice from the cv_results_
        to compare the models generalization performance
    '''
    my_model = model()
    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                        df[target], random_state=0)
    # Resplit the train dataset for GridSearchCV into train2 and valid to keep the test set separate
    # (the original referenced an undefined `train` DataFrame here)
    X_train2, X_valid, y_train2, y_valid = train_test_split(X_train,
                                                            y_train, random_state=0)
    # Use Grid Search to find the best parameters from the param_grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True, scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)
    # Evaluate on Valid set
    scores = grid.score(X_valid, y_valid)
    scores = scores # CONFUSION
    print('Best MSE through GridSearchCV: ', grid.best_score_) # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-'*120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)
    # If test is True, deploy the best_params_ on the test set
    if test:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)
        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-'*120)
    return report[['mean_train_score', 'mean_test_score']]
UPDATED
As explained in the sklearn documentation, GridSearchCV takes the lists of parameter values you pass and tries all possible combinations to find the best parameters.
To decide which parameters are best, it runs a k-fold cross-validation for each parameter combination. With k-fold cross-validation, the data passed to fit is divided into a training part and a validation part (which acts as a test set). If you choose, for example, cv=5, the data is divided into 5 non-overlapping folds; each fold is used once as the validation set while the other four are used for training. GridSearchCV therefore computes, for each parameter combination, the validation score (accuracy or whatever the scoring parameter specifies) averaged over the 5 folds. The combination with the highest average validation score wins, and that average is stored in the grid.best_score_ attribute.
On the other hand, grid.score(X_valid, y_valid) gives the score on the given data, provided the estimator has been refitted (refit=True). It is not an average over the folds: the model with the best parameters is retrained on the whole data set that was passed to grid.fit, predictions are computed on X_valid, and they are compared with y_valid to get the score. With scoring='neg_mean_squared_error' both numbers are negative MSE values, but they are measured on different data, which is why they differ.
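The distinction can be checked directly. Below is a minimal sketch, assuming synthetic data and a Ridge model with a small alpha grid (both chosen only for illustration); it reproduces each number by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train2, X_valid, y_train2, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=3,
                    scoring="neg_mean_squared_error")
grid.fit(X_train2, y_train2)

# best_score_: mean of the 3 fold scores for the winning parameter setting
cv_mean = grid.cv_results_["mean_test_score"][grid.best_index_]

# grid.score: the refitted best estimator evaluated once on held-out data
valid_score = grid.score(X_valid, y_valid)
manual = -mean_squared_error(y_valid, grid.best_estimator_.predict(X_valid))
print(grid.best_score_, valid_score)
```

best_score_ is what the search used to pick the parameters; the score on X_valid is an estimate on data the search never saw.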

Regarding increase in MSE of Cross-Validation model with increasing dataset for regression

I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups, validation, test, and training.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=42, shuffle=True)
So in essence, the training set has ~1100 points and the validation and test sets ~350 each, and each subset contains data points that do not appear in the other subsets.
With these subsets, I can perform a fit using any of the regression models available in scikit-learn, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this, I then calculate the RMSE of the predictions, which for the linear regressor is about 0.948.
Now, I could instead use cross-validation and not worry about splitting the data, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about 2.4! To compare, I tried a similar routine but switched X for X_train and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received an RMSE of about 0.956.
I really do not understand why the cross-validation RMSE is so much higher when using the entire data set, and why the predictions are so poor compared with those on the reduced data set.
Additional Notes
I have also tried running the above routine with the reduced subset X_val, y_val as inputs for the cross-validation, and still get a small RMSE. And when I simply fit a model on X_val, y_val and then make predictions on X_train, the RMSE is still better (lower) than the cross-validation RMSE!
This happens not only for LinearRegression but also for RandomForestRegressor and others. I have also tried changing the random state in the splitting, as well as shuffling the data completely before handing it to train_test_split, but the outcome is the same.
Edit 1.)
I tested this on a make_regression data set from scikit-learn and did not get the same results; instead, all the RMSE values are small and similar. My guess is that it has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
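The make_regression check described in Edit 1 could look roughly like this. It is a sketch under assumptions: the number of features and the noise level are made up, and only the 1800 samples, the 80/20 split, and the 10-fold setup mirror the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; n_features and noise are assumptions
X, y = make_regression(n_samples=1800, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

# Hold-out estimate: fit on the training part, predict the test part
clf = make_pipeline(StandardScaler(), LinearRegression())
clf.fit(X_train, y_train)
rmse_holdout = np.sqrt(mean_squared_error(y_test, clf.predict(X_test)))

# Cross-validated estimate on the full data set
cv = KFold(n_splits=10, shuffle=True, random_state=42)
cv_pred = cross_val_predict(
    make_pipeline(StandardScaler(), LinearRegression()), X, y, cv=cv
)
rmse_cv = np.sqrt(mean_squared_error(y, cv_pred))
print(rmse_holdout, rmse_cv)
```

On well-formed data the two estimates should land close together, so a large gap between them points to a problem in how X and y were assembled rather than in the cross-validation itself.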
Edit 2.)
Hi, thank you (@desertnaut) for the suggestions. The solution was actually quite easy: in my data-processing routine I was using (targets, inputs) = (X, y), which is really wrong. I swapped that with (targets, inputs) = (y, X), and now the RMSE is about the same as the other profiles. I found the problem by making a histogram profile of the data. Thanks! I'll save the question for about 1 hour, then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters, then RMSE would be zero because the model could perfectly fit the data, now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the data you are adding of course) because your model is not perfectly fitting the data anymore.
RMSE being low (or R-squared high) more often than not doesn't mean jack, you need to consider the standard errors of your parameter estimates . . . If you are just increasing the number of parameters (or conversely, in your case, decreasing the number of observations) you are just chewing away your degrees of freedom.
I'd wager that your standard error estimates for the X model's parameter estimates are smaller than your standard error estimates in the X_train model, even though RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.

Machine learning: how to make a forecast from learning and training data

I've tried to do some machine learning in Python with pandas. My goal was to estimate people's insurance costs based on their lifestyle. I got a nice dataset from Kaggle. Training and testing on my dataset went quite well, but now I want to make a forecast for a single person and I don't know how to start.
Below is what I have done so far for training and testing with a linear regression (I also tried a lot of other things, such as Monte Carlo, k-nearest neighbors, ...).
The result is
Accuracy on training set: 0.735
Accuracy on test set: 0.795
So how would you recommend I go about estimating the insurance cost of another person?
#Linear Regression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(linreg.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(linreg.score(X_test, y_test)))
Since you have already fitted the model on X_train and y_train, you can make predictions for X_test as follows:
predictions = linreg.predict(X_test)
Basically, linreg.fit(X_train, y_train) trains the model using X_train as inputs and y_train as target values. linreg.predict(X_test) uses X_test as inputs to produce predictions, and linreg.score(X_test, y_test) makes predictions from X_test and compares them with the true y_test to get a score. For LinearRegression that score is the coefficient of determination R^2, not a classification accuracy.
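For a single new person, the same predict call works on a one-row input that has the same feature columns, in the same order, as the training data. A minimal sketch follows; the column names and all the numbers are invented for illustration, and the real Kaggle dataset may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training table; values are made up, not taken from the Kaggle data
train = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "bmi": [27.9, 22.7, 30.1, 25.3, 33.0, 29.4],
    "children": [0, 1, 2, 3, 0, 1],
    "charges": [1800.0, 4500.0, 8200.0, 10400.0, 3900.0, 13100.0],
})
features = ["age", "bmi", "children"]

linreg = LinearRegression()
linreg.fit(train[features], train["charges"])

# One-row DataFrame for the new person, with columns in the training order
new_person = pd.DataFrame([{"age": 40, "bmi": 26.5, "children": 2}])[features]
estimated_cost = linreg.predict(new_person)
print(estimated_cost)
```

Any preprocessing applied to the training features (encoding of categorical columns, scaling) has to be applied to the new row in the same way before calling predict.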
