Can I standardize all data before doing cross validation?

I am doing a small machine learning project.
My question is: can I standardize all of the data, X, first (everything except the label, y), before doing cross validation?
I have only seen developers standardize the train set with fit_transform() and the test set with transform() after the train-test split.
Example of the code:
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
dataset = pd.read_csv('../../dataset/dataset_experiment_1.csv')
X_no_stdize = dataset.iloc[:,:-1].values
y = dataset.iloc[:,86].values
# standardize all of X before cross validation
X = StandardScaler().fit_transform(X_no_stdize)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
print('XGBoost')
model = XGBClassifier(booster='gbtree', objective='binary:logistic', learning_rate=0.2, max_depth=3)
# scoring3 is not defined in this snippet
f1_score = cross_val_score(model, X, y, cv=kfold, scoring=scoring3)
print('f1-score: ', f1_score.mean(), ' +- ', f1_score.std())
Thanks in advance for your assistance.
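One common way to sidestep this question is to put the scaler inside a Pipeline, so that within each fold it is fitted on the training part only and merely applied to the held-out part. A minimal sketch, assuming synthetic data from make_classification and LogisticRegression as a stand-in for the XGBoost model above (both are illustrative choices, not part of the original code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV data (assumption: binary labels)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Inside cross_val_score the whole pipeline is refit per fold,
# so the scaler only ever sees the training part of that fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring="f1")
print("f1-score:", f1_scores.mean(), "+-", f1_scores.std())
```

Tree-based models such as XGBoost are generally insensitive to feature scaling, so the difference is likely to be small there; the pipeline tends to matter more for scale-sensitive models.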

Related

MAE using Pipeline and GridSearchCV

I am facing a challenge finding the Mean Absolute Error (MAE) using Pipeline and GridSearchCV.
Background:
I have worked on a Data Science project (MWE below) where the MAE of a classifier is returned as its performance metric.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)
#RandomForest Prediction
y_predict = RF_model.predict(X_valid)
#MAE
print(mean_absolute_error(y_valid, y_predict))
#Output:
# 0.38727149627623564
Challenge:
Now I am trying to implement the same using Pipeline and GridSearchCV (MWE below). The expectation is that the same MAE value would be returned as above. Unfortunately, I could not get it right using the 3 approaches below.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)
#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813
#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
Discussion:
Approach 1:
As the scoring parameter in GridSearchCV() is set to neg_mean_squared_error, I tried to read grid.best_score_, but it did not give the same MAE result.
Approach 2:
I tried to get the y_predict values using grid.predict(X_valid), and then to get the MAE using grid.score(y_predict, y_valid), since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error. It raised a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3:
I tried to reshape y_predict, and it did not work either. This time it raised "ValueError: Number of features of the model must match the input."
It would be helpful if you could point out where I have made the error.
If you need, the data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Thank you very much

You are trying to compare mean_absolute_error with neg_mean_squared_error, which are two different metrics. You should have used neg_mean_absolute_error when creating the GridSearchCV object, as shown below:
grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
Also, the score method in sklearn takes (X, y) as inputs, where X is the input features of shape (n_samples, n_features) and y is the target labels, so you need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).
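The two points above can be checked together in a small sketch. It assumes synthetic three-class data from make_classification in place of data.csv, and a smaller n_estimators to keep it quick; neither comes from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic 3-class data standing in for data.csv (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

pipeline = Pipeline([("rf", RandomForestClassifier(random_state=0))])
grid = GridSearchCV(
    pipeline,
    param_grid={"rf__n_estimators": [50]},
    scoring="neg_mean_absolute_error",
    refit=True,
)
grid.fit(X_train, y_train)

# score() takes features and true labels, and returns the negated MAE
holdout_mae = -grid.score(X_valid, y_valid)
direct_mae = mean_absolute_error(y_valid, grid.predict(X_valid))
print(holdout_mae, direct_mae)

# best_score_ is the mean over the CV folds of the training set,
# so it will generally differ from the hold-out value
print(-grid.best_score_)
```

Note that grid.best_score_ is averaged over the cross-validation folds of X_train, so even with the right metric it is not expected to equal the MAE computed on X_valid.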

ValueError : x and y must be the same size

I have a dataset on which I'm trying to fit a linear regression using sklearn.
The dataset I'm using is ready-made, so there are not supposed to be problems with it.
I have used train_test_split to split my data into train and test groups.
When I try to use matplotlib to create a scatter plot of my test group against my predictions, I get the following error:
ValueError: x and y must be the same size
This is my code:
y=data['Yearly Amount Spent']
x=data[['Avg. Session Length','Time on App','Time on Website','Length of Membership','Yearly Amount Spent']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
#training the model
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
lm.fit(x_train,y_train)
lm.coef_
predictions=lm.predict(X_test)
#here the problem starts:
plt.scatter(y_test,predictions)
Why does this error occur?
I have seen previous posts here, and the suggestion was to use x.shape and y.shape, but I'm not sure what the purpose of that is.
Thanks

It seems that you are using the EcommerceCustomers.csv dataset.
In your original post, the column 'Yearly Amount Spent' is included in x as well as being the target y, which is wrong: the target must not be one of the features.
The following should work fine:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv("EcommerceCustomers.csv")
y = data['Yearly Amount Spent']
X = data[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# ## Training the Model
lm = LinearRegression()
lm.fit(X_train,y_train)
# The coefficients
print('Coefficients: \n', lm.coef_)
# ## Predicting Test Data
predictions = lm.predict(X_test)
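On the x.shape and y.shape suggestion mentioned in the question: a scatter plot needs the same number of x and y values, so printing both shapes is a quick way to see whether the predictions line up with y_test. A minimal sketch, assuming synthetic data in place of the CSV (the coefficients and sample count are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four feature columns (assumption)
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([25.0, 38.0, 0.5, 61.0]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

# Both must have the same length before they can be plotted against each other
print(y_test.shape, predictions.shape)
assert y_test.shape == predictions.shape
```

If the two shapes differ, the predictions were most likely made on a different array than the one y_test was split from.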

Why can't I match LGBM's cv score?

I'm unable to match LGBM's cv score by hand.
Here's a MCVE:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
folds = KFold(5, random_state=42)
params = {'random_state': 42}
results = lgb.cv(params, lgb.Dataset(X_train, y_train), folds=folds, num_boost_round=1000, early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])
clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))
val_scores = []
for train_idx, val_idx in folds.split(X_train):
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], clf.predict_proba(X_train.iloc[val_idx])[:,1]))
print('Manual score: ', np.mean(np.array(val_scores)))
I was expecting the two CV scores to be identical - I have set random seeds, and done exactly the same thing. Yet they differ.
Here's the output I get:
LGBM's cv score: 0.9851513530737058
Manual score: 0.9903622177441328
Why? Am I not using LGBM's cv module correctly?

You are splitting X into X_train and X_test.
For cv you split X_train into 5 folds, while manually you split X into 5 folds, i.e. you use more points manually than with cv.
Change results = lgb.cv(params, lgb.Dataset(X_train, y_train) to results = lgb.cv(params, lgb.Dataset(X, y)
Furthermore, the parameters can differ. For example, the number of threads used by lightgbm changes the result. During cv the models are fitted in parallel, so the number of threads used might differ from your manual sequential training.
EDIT after 1st correction:
You can achieve the same results using manual splitting / cv using this code:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
folds = KFold(5, random_state=42)
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective':'binary',
    'metric':'auc',
}
data_all = lgb.Dataset(X_train, y_train)
results = lgb.cv(params, data_all,
                 folds=folds.split(X_train),
                 num_boost_round=1000,
                 early_stopping_rounds=100)
print('LGBM\'s cv score: ', results['auc-mean'][-1])
val_scores = []
for train_idx, val_idx in folds.split(X_train):
    data_trd = lgb.Dataset(X_train.iloc[train_idx],
                           y_train.iloc[train_idx],
                           reference=data_all)
    gbm = lgb.train(params,
                    data_trd,
                    num_boost_round=len(results['auc-mean']),
                    verbose_eval=100)
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], gbm.predict(X_train.iloc[val_idx])))
print('Manual score: ', np.mean(np.array(val_scores)))
yields
LGBM's cv score: 0.9914524426410262
Manual score: 0.9914524426410262
What makes the difference is the line reference=data_all. During cv, the binning of the variables (see the LightGBM documentation) is constructed using the whole dataset (X_train), while in your manual for loop it was built on the training subset (X_train.iloc[train_idx]). By passing a reference to the dataset containing all the data, LightGBM will reuse the same binning, giving the same results.
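To see why the binning matters, here is a rough NumPy-only illustration. It uses plain quantile edges, which is an assumption made for the sketch and not LightGBM's actual binning rule; the point is only that edges computed on a subset of rows differ from edges computed on all rows:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(size=1000)   # one feature column, all rows
subset = values[:800]                 # rows of one training fold

# Quantile-based bin edges (illustrative, not LightGBM's exact rule)
quantiles = np.linspace(0, 1, 11)
edges_all = np.quantile(values, quantiles)
edges_subset = np.quantile(subset, quantiles)

# Edges computed on the subset differ from those computed on all rows
print(np.allclose(edges_all, edges_subset))

# Bin the same subset with the shared edges and with its own edges
bins_shared = np.digitize(subset, edges_all[1:-1])
bins_own = np.digitize(subset, edges_subset[1:-1])
print((bins_shared != bins_own).sum(), "values fall into a different bin")
```

When the same edges are reused, as reference=data_all arranges in the answer above, every fold sees identically discretized features, which is what lets the manual loop reproduce the cv result.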

Cross Validation Python Sklearn

I want to do cross validation on my SVM classifier before using it on the actual test set. What I want to ask is: do I do the cross validation on the original dataset, or on the training set that results from the train_test_split() function?
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df.iloc[:,0:10]
y = df.iloc[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X, y, cv=kfold) #Cross validation on original set
or
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df.iloc[:,0:10]
y = df.iloc[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X_train, y_train, cv=kfold) #Cross validation on training set

It is best to always reserve a test set that is only used once you are satisfied with your model, right before deploying it. So do your train/test split, then set the test set aside. We will not touch that.
Perform the cross-validation only on the training set. For each of the k folds you will use a part of the training set to train, and the rest as a validation set. Once you are satisfied with your model and your selection of hyper-parameters, use the test set to get your final benchmark.
Your second block of code is correct.
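The workflow described above could be sketched as follows. The iris data stands in for dataset.csv, and shuffle=True with random_state=40 on KFold is an assumption added for the example rather than something taken from the question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Iris stands in for dataset.csv (assumption)
X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=40
)

svm = SVC(kernel="poly")
kfold = KFold(n_splits=10, shuffle=True, random_state=40)

# Model selection uses the training set only
cv_scores = cross_val_score(svm, X_train, y_train, cv=kfold)
print("CV accuracy:", cv_scores.mean())

# The test set is touched once, after all choices are made
svm.fit(X_train, y_train)
test_accuracy = svm.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```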

Support vector machine overfitting my data

I am trying to make predictions for the iris dataset. I have decided to use SVMs for this purpose. But it gives me an accuracy of 1.0. Is it a case of overfitting, or is it because the model is very good? Here is my code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1,gamma='auto')
svm_model.fit(X_train,y_train)
predictions = svm_model.predict(X_test)
accuracy_score(predictions, y_test)
Here, accuracy_score returns a value of 1. Please help me. I am a beginner in machine learning.

You can try cross validation:
Example:
from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
#load iris data
iris = datasets.load_iris()
X = iris.data
Y = iris.target
#build the model
svm_model = SVC( kernel ='linear', C = 1, gamma = 'auto',random_state = 0 )
#create the Cross validation object
loo = LeaveOneOut()
#calculate cross validated (leave one out) accuracy score
scores = cross_val_score(svm_model, X,Y, cv = loo, scoring='accuracy')
print( scores.mean() )
Result (the mean accuracy of the 150 folds since we used leave-one-out):
0.97999999999999998
Bottom line:
Cross validation (especially LeaveOneOut) is a good way to detect overfitting and to get more robust performance estimates.

The iris dataset is not a particularly difficult one to get good results on. However, you are right not to trust a model with 100% classification accuracy. In your example, the 30 test points all happen to be classified correctly, but that doesn't mean your model will generalise well to all new data instances. Just try changing test_size to 0.3 and the result is no longer 100% (it goes down to 97.78%).
A more robust way to estimate performance and detect overfitting is cross validation. An example of how to do this easily, based on your code:
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
scores = cross_val_score(svm_model, iris.data, iris.target, cv=10) #10 fold cross validation
Here cross_val_score uses different parts of the dataset as testing data iteratively (cross validation) while keeping all your previous parameters. If you check scores you will see that the 10 accuracies calculated now range from 87.87% to 100%. To report the final model performance you can, for example, use the mean of the scores.
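As a small follow-up sketch, the per-fold scores can be summarised by their mean and standard deviation. This assumes the same iris data and linear SVC as above; gamma is left out because it has no effect with a linear kernel:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
svm_model = svm.SVC(kernel="linear", C=1)

# One accuracy per fold; an integer cv uses stratified folds for classifiers
scores = cross_val_score(svm_model, iris.data, iris.target, cv=10)
print("accuracy: %.3f +- %.3f" % (scores.mean(), scores.std()))
```

A large spread between folds suggests that a single train/test split, such as the one giving 100%, should not be trusted on its own.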
Hope this helps and good luck! :)
