Splitting of training and validation dataset - python

I need to split my training data (80-20) into validation data in a way that the split sub-datasets are not random but always the same.
Presently I use this code
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)
but the split sub-datasets are always random and never the same. I want it to be random but the same value should be present when I run the code again ( something like np.random.seed)
Is there a way to do that?

train_test_split() has a random_state argument. If you assign to it an integer value the result will be always the same:
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=1)

Related

Machine Learning Using Scikit-Learn & SVM

Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets names X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347,64)
(450,64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?
You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
Check the documentation.

How do I properly fit a sci-kit learn model using a pandas dataframe?

I am trying to create a machine learning program in sci-kit learn. I am using a CSV file to store data, and have decided to use Pandas data frame to import and format this data. I cannot figure out how to fit this data frame with the model.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
You're assigning the return values incorrectly. See below:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
You should correct the order of X_train, X_test, y_train and y_test like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.

How to stratify the training and testing data in Scikit-Learn?

I am trying to implement Classification algorithm for Iris Dataset (Downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor , Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?
If you want to shuffle and split your data with 0.3 test ratio, you can use
sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is corresponding labels, test_size is the percentage of the data that should be held over for testing, shuffle=True shuffles the data before splitting
In order to make sure that the data is equally splitted according to a column, you can give it to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
shuffle=True,
stratify = X['YOUR_COLUMN_LABEL'])
To make sure that the three classes are represented equally in your train and test, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X, y, stratify = y)
This will make sure that the ratio of all the classes is maintained equally.
use sklearn.model_selection.train_test_split and play around with Shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

Cross Validation Python Sklearn

I want to do Cross Validation on my SVM classifier before using it on the actual test set. What I want to ask is do I do the cross validation on the original dataset or on the training set, which is the result of train_test_split() function?
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df[:,0:10]
y = df[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X, y, cv=kfold) #Cross validation on original set
or
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df[:,0:10]
y = df[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X_train, y_train, cv=kfold) #Cross validation on training set
It is best to always reserve a test set that is only used once you are satisfied with your model, right before deploying it. So do your train/test split, then set the testing set aside. We will not touch that.
Perform the cross-validation only on the training set. For each of the k folds you will use a part of the training set to train, and the rest as a validations set. Once you are satisfied with your model and your selection of hyper-parameters. Then use the testing set to get your final benchmark.
Your second block of code is correct.

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but doesn't let me specifiy the 75%/25% split and only stratify the training dataset.
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.
TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two modules for Stratified Splitting:
StratifiedKFold : This module is useful as a direct k-fold cross-validation operator: as in it will set up n_folds training/testing sets such that classes are equally balanced in both.
Heres some code(directly from above documentation)
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... #fit and predict with X_train/test. Use accuracy metrics to check validation performance
StratifiedShuffleSplit : This module creates a single training/testing set having equally balanced(stratified) classes. Essentially this is what you want with the n_iter=1. You can mention the test-size here same as in train_test_split
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
You can simply do it with train_test_split() method available in Scikit learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
Here's an example for continuous/regression data (until this issue on GitHub is resolved).
min = np.amin(y)
max = np.amax(y)
# 5 bins may be too few for larger datasets.
bins = np.linspace(start=min, stop=max, num=5)
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
stratify=y_binned
)
Where start is min and stop is max of your continuous target.
If you don't set right=True then it will more or less make your max value a separate bin and your split will always fail because too few samples will be in that extra bin.
In addition to the accepted answer by #Andreas Mueller, just want to add that as #tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data
StratifiedShuffleSplit is done after we choose the column that should be evenly represented in all the small dataset we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we've got a dataset 'data' with a column 'season' and we want the get an even representation of 'season' then it looks like that:
from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
sss_train = data.iloc[train_index]
sss_test = data.iloc[test_index]
As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
#train_size is 1 - tst_size - vld_size
tst_size=0.15
vld_size=0.15
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop(y, axis=1), df.y, test_size = vld_size, random_state=13903)
X_train_test_V=pd.DataFrame(X_train_test)
X_valid=pd.DataFrame(X_valid)
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size, random_state=13903)
Updating #tangy answer from above to the current version of scikit-learn: 0.23.2 (StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit
n_splits = 1 # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

Categories

Resources