In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross validation. However, the usual cross validation is like this:
To cross validate time series data, the training and testing sets are often split like this:
That is to say, the testing data should always come after the training data in time.
My thought is:
Write my own k-fold class and pass it to GridSearchCV, so I can keep the convenience of the pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices for the training and testing data.
Write a new class GridSearchWalkForwardTest similar to GridSearchCV. I am studying the source code grid_search.py and find it a little complicated.
Any suggestion is welcome.
I think you could use TimeSeriesSplit either instead of your own implementation or as a basis for implementing a CV method that works exactly as you describe.
After digging around a bit, it seems like someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
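For illustration, a minimal sketch of wiring this up (the data, pipeline steps and parameter grid are placeholders, not from the question), passing a TimeSeriesSplit with max_train_size as the cv argument of GridSearchCV:

import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Toy time-ordered data (placeholder).
X = np.random.rand(200, 5)
y = np.random.rand(200)

pipe = Pipeline([('scale', StandardScaler()), ('reg', Ridge())])

# Each fold's test indices always come after its train indices;
# max_train_size caps the lookback window (rolling rather than expanding).
tscv = TimeSeriesSplit(n_splits=5, max_train_size=50)

grid = GridSearchCV(pipe, param_grid={'reg__alpha': [0.1, 1.0, 10.0]}, cv=tscv)
grid.fit(X, y)
print(grid.best_params_)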
I did some work regarding all this some months ago.
You could check it out in this question/answer:
Rolling window REVISITED - Adding window rolling quantity as a parameter- Walk Forward Analysis
My opinion is that you should try to implement your own GridSearchWalkForwardTest. I used GridSearchCV once to do the training and also implemented the same grid search myself, and I didn't get the same results, even though I should have.
What I did in the end was use my own function. You have more control over the training and test sets, and more control over the parameters you train.
I've written some code that I hope could be helpful to someone.
'sequence' is the period of the time series. I am training a model on sequences up to 40, predicting 41, then training up to 41 to predict 42, and so on, up to the max. 'quantity' is the target variable. The average of all the errors is then my evaluation metric.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    # Walk forward: train on everything before this period, test on this period.
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values
    mdl = LinearRegression()
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    error = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE for this fold
    RMSE.append(error)
print('Mean RMSE = %.5f' % np.mean(RMSE))
Leveraging scikit-learn's TimeSeriesSplit, defining fixed rolling windows by train and test size. Note the first training window may include additional excess data (prefer to keep rather than clip):
import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    folds = math.floor(len(X) / test_size)
    tscv = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in tscv.split(X):
        if len(train_index) < train_size:
            # Not enough history yet for a full training window: skip this fold.
            continue
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            # Keep the slightly oversized first training window rather than clip it.
            pass
        else:
            # Roll the window: keep only the most recent train_size observations.
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
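If it helps, the list of (train, test) index pairs this returns can be passed directly to GridSearchCV's cv argument. A minimal usage sketch (the data, estimator and parameter grid are placeholders, not from the answer):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

X = np.random.rand(310, 4)   # toy time-ordered features (placeholder)
y = np.random.rand(310)      # toy target (placeholder)

splits = tscv(X, train_size=100, test_size=20)
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=splits)
grid.fit(X, y)
print(grid.best_params_)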
I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:
|X||V|O|O|O|
|O|X||V|O|O|
|O|O|X||V|O|
|O|O|O|X||V|
X / V are the training / validation sets.
"||" indicates a gap (parameter n_gap: int>0) truncated at the beginning of the validation set, in order to prevent leakage effects.
You could easily extend it to get longer lookback windows for the training sets.
from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):

    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        self._cv = StratifiedKFold(n_splits=self.n_splits + 1, shuffle=False)

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])  # keep only the test indices of each stratified fold
        for i in range(1, len(_ixs)):
            # Train on fold i-1, validate on fold i minus its first n_gap indices.
            yield tuple((_ixs[i - 1], _ixs[i][_ixs[i] > _ixs[i - 1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
Note that the folds may not be perfectly stratified afterwards, because of the truncation with n_gap.
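A minimal usage sketch (the data, classifier and parameter grid are placeholders, not from the answer); the class can be passed directly as the cv argument because it exposes split and get_n_splits:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

X = np.random.rand(500, 4)              # toy features (placeholder)
y = np.random.randint(0, 2, size=500)   # toy binary labels (placeholder)

cv = StratifiedWalkForward(n_splits=4, n_gap=5)
grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.1, 1.0]}, cv=cv)
grid.fit(X, y)
print(grid.best_params_)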
I get why a model's score differs for each random_state, but I didn't expect the difference between the highest and the lowest score (over random_state 0-100) to be 0.37, which is a lot. I also tried ten-fold cross-validation and the difference is still quite big.
So does this actually matter, or is it something I should ignore?
The Data-set link
(Download -> Data Folder -> student.zip -> student-mat.csv)
Full Code :
import pandas as pd
acc_dic = {}
grade_df_main = pd.read_csv(r'F:\Python\Jupyter Notebook\ML Projects\data\student-math-grade.csv', sep = ";")
grade_df = grade_df_main[["G1", "G2", "G3", "studytime", "failures", "absences"]]
X = grade_df.drop("G3", axis = "columns")
Y = grade_df["G3"].copy()
def cross_val_scores(scores):
    print("Cross validation result :-")
    #print("Scores: {}".format(scores))
    print("Mean: {}".format(scores.mean()))
    print("Standard deviation: {}".format(scores.std()))

def start(rand_state):
    print("Index {}".format(rand_state))
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.1, random_state=rand_state)
    from sklearn.linear_model import LinearRegression
    lin_reg_obj = LinearRegression()
    lin_reg_obj.fit(x_train, y_train)
    accuracy = lin_reg_obj.score(x_test, y_test)
    print("Accuracy: {}".format(accuracy))
    acc_dic[rand_state] = accuracy
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(lin_reg_obj, x_test, y_test, scoring="neg_mean_squared_error", cv=10)
    cross_val_scores(scores)
    print()

for i in range(0, 101):
    start(i)
print("Overview : \n")
result_val = list(acc_dic.values())
min_index = result_val.index(min(result_val))
max_index = result_val.index(max(result_val))
print("Minimum Accuracy : ")
start(min_index)
print("Maximum Accuracy : ")
start(max_index)
Result :
Only included the highest and the lowest results
Minimum Accuracy :
Index 54
Accuracy: 0.5635271419142645
Cross validation result :-
Mean: -8.969894370977539
Standard deviation: 5.614516642510817
Maximum Accuracy :
Index 97
Accuracy: 0.9426035720345269
Cross validation result :-
Mean: -0.7063598117158191
Standard deviation: 0.3149445166291036
TL;DR
It is not the split on the dataset you used to train and evaluate your model that decides how well your final model will actually perform once it is deployed. The split and evaluation technique is more about getting a valid estimation of how well the model might perform in real life. And as you can see, the choice of the splitting and evaluation technique can have a great influence on this estimation. The results on your dataset highly suggest preferring k-fold cross-validation over a simple train/test split.
Longer version
I believe you have already figured out that the split you do on the dataset to separate it into train and test sets has nothing to do with the performance of your final model, which is likely to be trained on the whole dataset and then be deployed.
The purpose of testing is to get a feeling of the predictive performance on unseen data. In a best-case scenario, you would ideally have two completely different data sets from different cohorts/sources to train and test your model (external validation). This is the best approach to evaluate how your model would perform once it is deployed. However, since you often do not have such a second source of data, you do an internal validation where you get samples for training and testing from the same cohort/source.
Usually, given that the dataset is big enough, randomness will make sure that the train and test splits are a good representation of your original dataset, and the performance metrics you get are a fair estimation of the model's predictive performance in real life.
However, as you see on your own dataset, there are cases where the split does heavily influence the result. It is exactly in such cases that you are better off evaluating performance with a cross-validation technique such as k-fold cross-validation and computing the mean across the different splits.
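As an illustration, a minimal sketch of that comparison (X and Y as defined in the question's code; using r2 scoring and ten folds is my choice, not from the question):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# Single split: the score depends heavily on random_state for a small dataset.
x_tr, x_te, y_tr, y_te = train_test_split(X, Y, test_size=0.1, random_state=0)
single_split_score = LinearRegression().fit(x_tr, y_tr).score(x_te, y_te)

# Ten-fold CV on the full data: averaging over splits gives a more stable estimate.
cv_scores = cross_val_score(LinearRegression(), X, Y, scoring="r2", cv=10)
print(single_split_score, cv_scores.mean(), cv_scores.std())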
If I run a simple decision tree regression model, splitting the data via the train_test_split function, I get nice r2 scores and low mse values.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

training_data = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
y = training_data.iloc[:, 0]
x = training_data.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state=0)
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Yet if I split the data file manually into two files, 2/3 train and 1/3 test (there is a column called human which gives a value from 1 to 9 for which human it is; I use humans 1-6 for training and 7-9 for test), I get negative r2 scores and high mse:
training_data = pandas.read_csv("train"+".csv",usecols=['y','x1','x2','x3'])
testing_data = pandas.read_csv("test"+".csv", usecols=['y','x1','x2','x3'])
y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))]
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I was expecting more or less the same results, and all the data types seem the same for both calls.
What am I missing?
I'm assuming that both methods actually do what you intend and that the shapes of your X_train/X_test and y_train/y_test are the same for both. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).
Plot the distributions (i.e. make bar charts/density plots) of the labels (y) in the initial train/test sets vs. in the second ones (from the manual implementation). You can dive deeper and also plot the other columns of the data, to see whether anything about the distributions differs between the resulting sets of the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, it could be that your labels (or other columns) are actually sorted in your manual implementation, so you end up with very different distributions in the datasets you're comparing.
Also, if you want to make sure that your manual splitting results in a 'representative' set (one that would generalise well) based on model results rather than underlying data distributions, I would compare it against the results of a cross-validated model, not a single set of results.
Essentially, although the probability is small and train_test_split does some shuffling, you could get a train/test pair that performs well just out of luck. (To reduce the chance of that without doing cross-validation, I'd suggest using the stratify argument of the train_test_split function; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
If you decide to cross-validate (with train_test_split), you get an average model prediction for the folds and a confidence interval around it, and can check whether your second model's results fall within that interval. If they don't, again, it just means your split is actually 'corrupted' somehow (e.g. by having sorted values).
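A minimal sketch of that comparison (x and y hold the full data as in the first snippet; regressor, X_test and y_test come from the manual-split snippet; the fold count is arbitrary):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Spread of r2 scores over shuffled folds of the full data.
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=0), x, y, scoring='r2',
                            cv=KFold(n_splits=10, shuffle=True, random_state=0))
print('cross-validated r2: %.3f +/- %.3f' % (cv_scores.mean(), 2 * cv_scores.std()))

# Compare against the r2 of the manual human-1-6 / human-7-9 split.
print('manual split r2: %.3f' % regressor.score(X_test, y_test))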
P.S. I'd also add that Decision Trees are models that are known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results thanks to bootstrapping/bagging, which acts similarly to cross-validation in reducing the chance of overfitting.)
1 - http://cv.znu.ac.ir/afsharchim/AI/lectures/Decision%20Trees%203.pdf
The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit as per the documentation, which means the method randomizes your data when splitting.
When you split manually, you didn't randomize it, so if your labels are not spread evenly throughout your dataset, you will of course have performance issues, since your model won't generalize well when the training data doesn't contain enough samples of the other labels.
If my suspicion is correct, you should get a similar result by passing shuffle=False into train_test_split.
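For reference, a minimal sketch of that check, reusing x and y from the first snippet:

from sklearn.model_selection import train_test_split

# shuffle=False keeps the original row order, mimicking the manual file split.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, shuffle=False)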
suppose your dataset contains this data.
1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0
So suppose you want a 50% train split. train_test_split shuffles it like this, so it generalizes better:
1 + 1 = 2
2 - 2 = 0
So it knows what to do when it sees this data:
2 + 2
4 - 4  # since it learned both addition and subtraction
But when you split it manually like this:
1 + 1 = 2
2 + 2 = 4  # only learned addition
it doesn't know what to do when it sees this data:
2 - 2
4 - 4  # test data is subtraction
Hope this answers your question.
It may sound like a simple check but..
In the first example you are reading data from 'data.csv'; in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3 mark and saved the first part as 'train.csv' and the remainder as 'test.csv', then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why train_test_split randomizes the rows. If you haven't already done it, try to randomize the rows first and then write them to your train and test csv files, to ensure you have a homogeneous dataset; a sketch follows below.
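A minimal sketch of that suggestion (the file and column names follow the question; shuffling with sample(frac=1) and a fixed random_state is just one way to do it reproducibly):

import pandas as pd

df = pd.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])

# Shuffle the rows before writing, so both files sample the whole dataset.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

cut = int(len(df) * 2 / 3)
df.iloc[:cut].to_csv('train.csv', index=False)
df.iloc[cut:].to_csv('test.csv', index=False)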
The other line that might be out of place is line 6:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
Perhaps l_vars contains something other than what you expect. Maybe it should read the following, to be consistent with the training line:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))]
good luck, let us know if this helps.
I have Python code that works well for performing k-fold CV on a dataset. It looks like this:
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle
# Load the dataset.
dataset = pandas.read_csv('values.csv')
# Preprocessing the dataset.
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8] # The class value is the last column and is called Outcome.
# Scale all values to 0,1.
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
# 3-fold CV computation.
scores = []
svr_rbf = SVR(kernel='rbf', gamma='auto')
cv = KFold(n_splits=3, random_state=42, shuffle=False)
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    svr_rbf.fit(X_train, Y_train)
    scores.append(svr_rbf.score(X_test, Y_test))
Now, I want to rewrite the same thing in R, and I tried to do something like this:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("values.csv", col_names=TRUE)
results <- train(Outcome~.,
                 data=dataset,
                 method="svmLinear",
                 trControl=trainControl(
                   method="cv",
                   number=3,
                   savePredictions=TRUE,
                   verboseIter=TRUE
                 ))
print(results)
print(results$pred)
My data is similar to this one: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data
Except that this one has 12 attributes and the 13th column is the class; in my case there are 8 attributes and the 9th one is the class. But value-wise it is similar.
Now, I can see the results printing; however, there are a few things that are unclear to me.
1) In my Python code I scale the values to the 0-1 range; how can I do that in R?
2) I have used SVR with the rbf kernel; how can I use SVR with that kernel in R instead of the linear SVM?
3) Also, in the Python version I use random_state=42 (just a random number) when generating the splittings for the folds, so that the folds stay consistent across different executions. How do I do this in R?
4) Lastly, in Python I do the training inside a for loop, per fold. I need something like this in R too, as after every fold I want to perform some other statistics and computations. How do I do this in R?
5) Should I stick to caret or use the mlr package? Does mlr do k-fold CV too? If yes, how?
EDIT:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
print(dataset)
X = dataset[, 1:8]
print(X)
Y = dataset$X9
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)
fit.control <- trainControl(method="cv",
                            index=cvIndex,
                            number=nfolds,
                            classProbs=TRUE,
                            savePredictions=TRUE,
                            verboseIter=TRUE,
                            summaryFunction=twoClassSummary,
                            allowParallel=FALSE)
rfCaret <- caret::train(X, Y, method = "svmLinear", trControl = fit.control)
print(rfCaret)
Check out createFolds in the caret package for fixed folds.
Here's some code that you can amend to fit your particular modelling case; this example builds a random forest model, but you can switch the model for an SVM. If you follow the package guide there's a link (copied here for ease: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines); section 7.0.47 lists all the available SVM models and their parameters. Note that you may need to install some additional packages, like kernlab, to use specific models.
There is a package called rngtools that is supposed to allow you to create reproducible models across multiple cores (parallel processing), but if you want to be sure, then single core is probably the best way in my experience.
folds <- 3
set.seed(42)
cvIndex <- createFolds(your_data, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

search.grid <- expand.grid(.mtry = c(seq.int(1:sqrt(length(your_data))) + 1))

rfCaret <- train(your_data_x, your_data_y, method = "rf",
                 metric = 'ROC', ntree = 500,
                 trControl = fit.control, tuneGrid = search.grid)
In my experience, caret is pretty good at covering pretty much all bases. If you also want to preprocess your data (e.g. centre, scale), then you want the preProcess function; again, details are in the caret package if you type ?train, but for example you would want:
preProcess(yourData, method = c("center", "scale"))
Caret is clever in that it understands if it has taken a preprocessed input, and applies the same scaling to your test data sets.
edit - additional: unused parameters issue
To answer your follow-up question about unused parameters: it's probably because you're using mtry, which is a random forest parameter.
Here's a version for a simple SVM:
folds <- 3
set.seed(42)
cvIndex <- createFolds(dataset$Outcome, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

SVMCaret <- train(Outcome ~ ., data = dataset, method = "svmLinear",
                  metric = 'ROC',
                  trControl = fit.control)
You don't need a tuning grid; caret will generate a random one. Of course, if you want to test specific cost values, then create one yourself in much the same way as I did for the .mtry parameter for random forests.
1) The caret::train function has a preProcess argument that allows you to choose a preprocessing method. See ?caret::train for more details.
2) There is svmRadial available for caret. You can look at examples and all available algorithms at caret/train-models-by-tag.
3) Fix the random seed with set.seed(123) for consistency. You may access training folds in the train object (results$trainingData here).
4) Don't loop, access your folds directly via your train object and calculate your stats if needed (see results$resample)
5) mlr has cross-validation too, it depends which flavor you like.
I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).
I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Where I think I'm going wrong: I'm splitting my training data into development and validation sets, training on the development set and recording the cross-validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets and then testing on the validation set. I'm not sure if I'm using scikit-learn's PredefinedSplit module correctly. Details below:
Following this answer, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data[training_features], data[training_response], test_size=0.2, random_state=550)
# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
this is a 323430 dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0.'''
validation_idx = np.repeat(-1, y_train.shape[0])
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
int(round(.2*validation_idx.shape[0])), replace = False)] = 0
# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())
# sanity check
print(len(validation_split[0][0])) # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs .8
print(validation_idx.shape[0] == y_train.shape[0]) # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([])
Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.
# a vanilla XGBoost model
from xgboost import XGBClassifier
model1 = XGBClassifier()
# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
# A grid search.
# NOTE: I'm passing a PredefinedSplit object as an argument to the `cv` parameter.
grid_search = GridSearchCV(model1, param_grid,
scoring='neg_log_loss',
n_jobs=-1,
cv=validation_split,
verbose=1)
Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse is that it's almost identical to the accuracy that I get if I use the classifier on the development set (on which I just trained): 0.89295597192591902. BUT when I use the classifier on the true test set, I get a much lower accuracy, roughly .78:
from sklearn.metrics import accuracy_score

# accuracy score on the validation set. This yields .89207865
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train[validation_split[0][1]])

# accuracy score when applied to the development set. This yields .8929559
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train[validation_split[0][0]])
# finally, the score when applied to the test set. This yields .783
accuracy_score(y_pred = grid_result2.predict(X_test), y_true = y_test)
To me, the almost exact correspondence between the model's accuracy on the development and validation sets, together with the significant loss in accuracy on the test set, is a clear sign that I'm accidentally training on the validation data, and thus my cross-validation score is not representative of the true accuracy of the model.
I can't seem to find where I went wrong, mostly because I don't know what GridSearchCV is doing under the hood when it receives a PredefinedSplit object as the argument to the cv parameter.
Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.
Thanks!
You need to set refit=False (the default is refit=True); otherwise the grid search will refit the estimator on the whole dataset (ignoring cv) after the grid search completes.
Yes, there was a data leakage problem for the validation data. You need to set refit=False for GridSearchCV, and then it will not refit on the whole dataset, i.e. the training plus validation data.
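Putting the two answers together, a minimal sketch of a grid search that scores on a predefined validation fold without refitting on it (the data, estimator and grid are placeholders; the PredefinedSplit setup mirrors the question):

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 5)              # toy data (placeholder)
y = np.random.randint(0, 2, size=1000)

# -1 marks development rows (always in the training fold), 0 marks validation rows.
test_fold = np.repeat(-1, len(y))
test_fold[np.random.choice(len(y), 200, replace=False)] = 0
ps = PredefinedSplit(test_fold)

grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.1, 1.0]},
                    cv=ps, refit=False)  # refit=False: no final fit on dev + validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # the score comes from the held-out validation rows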
I am comparing the performance of two programs that use KerasRegressor with scikit-learn's StandardScaler: one program with a scikit-learn Pipeline and one without.
Program 1:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasRegressor

estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve an R2 score of 0.98 (average of 3 trials) while Program 2 only achieves an R2 score of 0.84 (average of 3 trials). Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. That's incorrect usage.
You should call fit_transform() on X_train and then call only transform() on X_test, because that's what the Pipeline does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
To elaborate on my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on the test data wrongly scales it to a different scale than the one used on the train data, and that is a source of the change in predictions.
Edit: to answer the question in the comments:
See, fit_transform() is just a shortcut for doing fit() and then transform(). For StandardScaler, fit() just learns the mean and standard deviation of the data, and transform() then applies what was learned to return newly scaled data.
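A tiny sketch of that equivalence, with made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

a = StandardScaler().fit_transform(X)   # shortcut
scaler = StandardScaler().fit(X)        # learns the mean and SD
b = scaler.transform(X)                 # applies them
print(np.allclose(a, b))                # True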
So what you are saying leads to the two scenarios below:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide X_scaled into X_scaled_train and X_scaled_test, and run your model.
No need to scale again.
Scenario 2: Wrong (basically equal to Scenario 1, with the scaling and splitting operations reversed)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try either of these scenarios and maybe it will increase the performance of your model.
But there is one very important thing that is missing from them. When you scale the whole dataset and then divide it into train and test, it is assumed that you know the test (unseen) data, which will not be true in real-world cases, and it will give you results that do not reflect real-world performance, because in the real world the whole of the available data is our training data. It may also lead to over-fitting, because the model already has some information about the test data.
So when evaluating the performance of machine learning models, it is recommended that you keep the test data aside before performing any operations on it. Because it is our unseen data, we know nothing about it. So the ideal path of operations would be the one I answered, i.e.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.
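For completeness, a minimal sketch of that ideal path using a Pipeline, so the scaler is fit on the training data only (a plain scikit-learn regressor stands in for the KerasRegressor from the question; the data is a placeholder):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = np.random.rand(300, 10)   # toy data (placeholder)
y = np.random.rand(300)

# 1) Split first, so the test data stays truly unseen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2-4) The pipeline fits the scaler on the training data only and
#      reuses its mean/SD when transforming the test data.
pipe = Pipeline([('scale', StandardScaler()), ('reg', Ridge())])
pipe.fit(X_train, y_train)
print('R2 on held-out test data:', pipe.score(X_test, y_test))

# The same pipeline can also be cross-validated without leakage.
print(cross_val_score(pipe, X_train, y_train, cv=3).mean())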