Equivalent of R's createDataPartition in Python

I am trying to reproduce the behavior of R's createDataPartition function in Python. I have a machine learning dataset with a boolean target variable, and I would like to split it into a training set (60%) and a testing set (40%).
If I split it completely at random, my target variable won't be properly distributed between the two sets.
I achieve this in R using:
inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]
How can I do the same in Python?
PS: I am using scikit-learn as my machine learning library, together with pandas.

In scikit-learn, you can use the train_test_split tool:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn import datasets
# Use Age and Weight to predict a value for the food someone chooses
# (table is assumed to be an existing pandas DataFrame)
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'])
# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)
This breaks the data into
inputs for training
inputs for the evaluation data
output for the training data
output for the evaluation data
respectively. You can also add a keyword argument such as test_size=0.25 to set the fraction of the data held out for testing.
To split a single dataset, you can use a call like this to get 40% test data:
>>> import numpy as np
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40

The correct answer is sklearn.model_selection.StratifiedShuffleSplit
Stratified ShuffleSplit cross-validator
Provides train/test indices to split data into train/test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
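A minimal sketch of how this splitter could be applied to the 60/40 split from the question. The DataFrame contents are made up for illustration; only the `repeater` column name comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 100 rows, 20% of them with repeater == True.
data = pd.DataFrame({
    "feature": np.arange(100),
    "repeater": np.arange(100) % 5 == 0,
})

# A single stratified split: 60% training, 40% testing.
splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.6, test_size=0.4,
                                  random_state=0)
train_idx, test_idx = next(splitter.split(data, data["repeater"]))

# split() yields positional indices, so use iloc to select the rows.
training = data.iloc[train_idx]
testing = data.iloc[test_idx]

# Both subsets keep the same share of repeaters as the full dataset.
print(training["repeater"].mean(), testing["repeater"].mean())
```

The splitter returns positions rather than rows, which is close in spirit to the index vector returned by createDataPartition in R.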

The answer provided is not correct: it does plain random sampling, not the stratified sampling that createDataPartition does in R.

As mentioned in the comments, the selected answer does not preserve the class distribution of the data. The scikit-learn docs point out that if this is required, StratifiedShuffleSplit should be used. The same can be done with train_test_split by passing your target array to the stratify option.
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))


Split the dataset in train and test based on the group value?

I have the following dataset (grouped by 'group_name', the data looks like this):
I want to split the dataset into train and test sets based on the **group_name** feature. For example, if I want an 80:20 ratio, then the train and test sets will look like this (i.e. as output of the group-by function):
Train Set:
Test Set:
Thus, the 80:20 ratio is considered in the above example. Also, the above examples shown are the results of the groupby function applied to the actual dataset.
Get the training set with
training = df.groupby('group_name').apply(lambda x: x.sample(frac=0.8))
Then get the testing set from the remaining index values (recent pandas versions no longer accept a set as an indexer)
testing = df.loc[df.index.difference(training.index.get_level_values(1))]
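A small self-contained variant of the same idea, as a sketch rather than the answer's exact method: it swaps in `GroupBy.sample` (available in pandas 1.1 and later) for `apply`, which keeps the original index and makes the complement easy to take. The DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical data: group "a" has 10 rows, group "b" has 5 rows.
df = pd.DataFrame({
    "group_name": ["a"] * 10 + ["b"] * 5,
    "value": range(15),
})

# Sample 80% of the rows inside each group; the original index is preserved.
training = df.groupby("group_name").sample(frac=0.8, random_state=0)

# Everything that was not sampled becomes the test set.
testing = df.drop(training.index)

print(training["group_name"].value_counts().to_dict())
print(testing["group_name"].value_counts().to_dict())
```

Note that `df.drop(training.index)` assumes the index values of `df` are unique.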
sklearn's train_test_split also accepts a stratify argument:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=df['column_to_groupby'], random_state=0)
It will split each group according to the test_size you defined.
You can also try the group-aware splitters in sklearn. Depending on your purpose, GroupKFold or GroupShuffleSplit may be suitable.
Here is an example for GroupKFold:
import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
For more detail, see the scikit-learn article "Visualizing cross-validation behavior in scikit-learn": https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py
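GroupShuffleSplit is mentioned above but not shown, so here is a minimal sketch with made-up data. Unlike stratifying on the group column, it keeps every group whole: a group lands entirely in train or entirely in test, and `test_size` refers to the share of groups, not of rows within each group.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
# Hypothetical group labels: five groups of two rows each.
groups = np.repeat(["a", "b", "c", "d", "e"], 2)

# Hold out 20% of the groups (here: one group out of five).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("train groups:", sorted(train_groups))
print("test groups:", sorted(test_groups))
```

If the goal is an 80:20 split inside every group, as in the question, stratifying or per-group sampling is the closer match; this splitter fits the case where a group must not leak across the two sets.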

train_test_split( ) method of scikit learn

I am trying to create a machine learning model using DecisionTreeClassifier. To train & test my data I imported train_test_split method from scikit learn. But I can not understand one of its arguments called random_state.
What is the significance of assigning numeric values to random_state of model_selection.train_test_split function and how may I know which numeric value to assign random_state for my decision tree?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
As the docs mention, random_state is for the initialization of the random number generator used in train_test_split (similarly for other methods, as well). As there are many different ways to actually split a dataset, this is to ensure that you can use the method several times with the same dataset (e.g. in a series of experiments) and always get the same result (i.e. the exact same train and test sets here), i.e for reproducibility reasons. Its exact value is not important and is not something you have to worry about.
Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below is actually run in my machine, and not copy-pasted from the docs):
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
X_train
# array([[4, 5],
#        [0, 1],
#        [6, 7]])
y_train
# [2, 0, 3]
X_test
# array([[2, 3],
#        [8, 9]])
y_test
# [1, 4]
You should experiment yourself with different values for random_state (or without specifying it at all) in the above snippet to get a feel for it.
Providing a value for random_state makes it possible to reproduce the same split when you re-run the program.
If you don't provide a value for random_state, you will get a different set of values for test and train on each run. In that case, if you encounter an error, it will be hard to reproduce and debug.
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("diabetes.csv")
Loop with random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Note the data is the same for both iterations
Loop without random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
Note the data is not the same for both iterations
If you run the code, and see the output, you will see when random_state is the same, it will provide the same train / test set, but when random_state is not provided, the set of values in test / train is different each time.
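The comparison can also be made explicit in code. The following is a small sketch with made-up data; the helper name `split_test_labels` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data, used only to compare the resulting splits.
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

def split_test_labels(seed=None):
    """Return the test labels produced by one split with the given seed."""
    _, _, _, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    return y_test

# Same seed: the two splits are identical.
same_seed = np.array_equal(split_test_labels(42), split_test_labels(42))
print("identical with random_state=42:", same_seed)

# No seed: the two splits will almost certainly differ.
no_seed = np.array_equal(split_test_labels(), split_test_labels())
print("identical without random_state:", no_seed)
```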
If you don't specify random_state, you will get a different (random) split every time you execute your code. If you give a random_state value instead, the split will always be the same. This is often used to make experiments reproducible.
For example:
from sklearn.model_selection import train_test_split
X = [[1,5],[2,6],[3,2],[4,7], [5,5], [6,2], [7,1],[8,6]]
y = [1,2,3,4,5,6,7,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X, y, test_size=0.33, random_state=324)
# Pass the values in the same order as the labels in the format string
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train_rs, y_train_rs, X_test_rs, y_test_rs))
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train, y_train, X_test, y_test))
If you run this code several times, you can see that the splits without random_state change on every run.
As explained in the sklearn documentation, random_state can be an integer if you want to specify the random number generator seed (the most frequent case), or directly an instance of the RandomState class.
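A short sketch of the two forms, using made-up data: an integer seed and a freshly created `numpy.random.RandomState` with the same seed should lead to the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# An integer seed...
_, _, _, y_test_int = train_test_split(X, y, test_size=0.25, random_state=324)

# ...and a new RandomState instance created from the same seed.
rng = np.random.RandomState(324)
_, _, _, y_test_rng = train_test_split(X, y, test_size=0.25, random_state=rng)

same_split = np.array_equal(y_test_int, y_test_rng)
print("same split:", same_split)
```

One difference worth keeping in mind: reusing the same `rng` object for a second call continues its random stream, so that second split would generally differ from the first, whereas an integer seed restarts the sequence on every call.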
The random_state argument just seeds the random ordering. If you give a different random_state, the dataset will be split differently; if you provide the same random_state every time, the split will be the same.
If you want your dataset to be split the same way every time, provide the same random_state.

Inverse of prediction is correct in Scikit Learn Logistic Regression

In the following minimal reproducible dataset, I split a dataset into train and test dataset, fit a logistic regression to the training dataset with scikit learn and predict y based on the x_test.
However, the y_pred values (the y predictions) are correct only if inverted (i.e. 0 becomes 1 and 1 becomes 0), calculated like so: 1 - y_pred.
Why is this the case? I can't figure out whether it is related to the scaling of x (I have tried with and without the StandardScaler), to the logistic regression itself, or to the accuracy score calculation.
In my larger dataset, this is also the case, even when using different seeds as random state. I have also tried this Logistic Regression with the same result.
EDIT: as pointed out by @Nester, it works without the standard scaler for this minimal dataset. A larger dataset is available here; StandardScaler does nothing on this larger dataset. I'll keep the smaller dataset in the original post, as it might help in explaining the problem.
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# small dataset
Y = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
X =[[0.38373581],[0.56824121],[0.39078066],[0.41532221],[0.3996311 ]
,[0.3455455 ],[0.55867358],[0.51977073],[0.51937625],[0.48718916]
,[0.37019272],[0.49478954],[0.37277804],[0.6108499 ],[0.39718093]
,[0.33776591],[0.36384773],[0.50663667],[0.3247984 ]]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=42, stratify=Y)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred = 1 - y_pred # <- why?
Larger dataset accuracy:
0.7 # if inversed
thanks for reading
X and Y do not have any relationship at all. Hence, the model is performing poorly. There is reason to say that 1 - pred is performing better. If you have more than two classes, the situation would be even worse.
%matplotlib inline
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, stratify=Y)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(x_train, y_train)
import matplotlib.pyplot as plt
The relationship is same for your bigger dataset as well.
Try to identify other features, which can help you in predicting Y.
Have you tried running the model without the StandardScaler()? Your data looks like it doesn't need to be re-scaled.
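To illustrate the point about a missing relationship, here is a sketch on synthetic data (not the poster's dataset) in which the feature is pure noise. For binary labels, the accuracy of `1 - y_pred` is always one minus the accuracy of `y_pred`, so whenever a model with no real signal scores below 50% on a small test split, the inverted predictions look better by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data in which the feature carries no information about the label.
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = rng.integers(0, 2, size=200)

results = []
for seed in range(3):
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=seed, stratify=y)
    y_pred = LogisticRegression().fit(x_train, y_train).predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    acc_inverted = accuracy_score(y_test, 1 - y_pred)
    # The two accuracies are complements of each other.
    results.append((acc, acc_inverted))
    print(seed, round(acc, 2), round(acc_inverted, 2))
```

Trying several seeds, as done here, is one way to check whether an apparent advantage of the inverted predictions is stable or just an artifact of one split.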

Unexpected cross-validation scores with scikit-learn LinearRegression

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y,
test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
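The effect described in this thread can be reproduced on synthetic data. The sketch below is an illustration under made-up assumptions (five ordered blocks with different offsets), not the original poster's data: with consecutive folds, each test fold is a block the model has never seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data sorted in five blocks of 40 rows; each block has its own offset.
rng = np.random.default_rng(0)
block = np.repeat(np.arange(5), 40)
x = rng.random(200)
# One numeric feature plus a one-hot indicator of the block.
X = np.column_stack([x, np.eye(5)[block]])
y = 2 * x + 10 * block + rng.normal(scale=0.1, size=200)

model = LinearRegression()

# cv=5 uses consecutive, unshuffled folds for a regressor, so every test
# fold is a whole block that is absent from the training data.
consecutive = cross_val_score(model, X, y, cv=5)

# Shuffled folds mix rows from every block into each training set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = cross_val_score(model, X, y, cv=kf)

print("consecutive:", consecutive.round(2))
print("shuffled:   ", shuffled.round(2))
```

Whether shuffling is the right fix depends on the data: if the blocks reflect a structure that new data will also have (for example, unseen groups or later time periods), the consecutive scores may be the more honest estimate.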

Memory efficient way to split large numpy array into train and test

I have a large numpy array and when I run scikit learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory efficient method of splitting into train and test, and why does the train_test_split cause this?
The following code results in a memory error and causes a crash:
import numpy as np
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
X = np.random.random((10000,70000))
Y = np.random.random((10000,))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)
One method that I've tried which works is to store X in a pandas dataframe and shuffle
X = X.reindex(np.random.permutation(X.index))
since I arrive at the same memory error when I try NumPy's own shuffle.
Then I convert the pandas dataframe back to a numpy array, and using this function I can obtain a train test split
# test_proportion of 3 means 1/3, so 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    ratio = int(matrix.shape[0] / test_proportion)  # number of test rows; must be an int
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    # target is one-dimensional here, so it takes a single index
    Y_train = target[ratio:]
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)
This works for now, and when I want to do k-fold cross-validation, I can iteratively loop k times and shuffle the pandas dataframe. While this suffices for now, why do NumPy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
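One factor that is likely relevant here, offered as an observation about NumPy indexing rather than a measured diagnosis of train_test_split: basic slicing returns views on the original buffer, while selecting rows through an array of shuffled indices creates a copy. A split based on shuffled indices therefore needs memory for the copies in addition to the original array. A small sketch with a scaled-down array:

```python
import numpy as np

# A small stand-in for the large array in the question.
X = np.random.random((1000, 70))
n_test = X.shape[0] // 3

# Basic slicing returns views: no row data is copied.
X_test_view = X[:n_test]
X_train_view = X[n_test:]
slices_share = np.shares_memory(X, X_train_view)

# Indexing with an array of shuffled row indices copies the selected rows.
order = np.random.permutation(X.shape[0])
X_train_copy = X[order[n_test:]]
fancy_shares = np.shares_memory(X, X_train_copy)

print("slice shares memory with X:", slices_share)
print("index-array selection shares memory with X:", fancy_shares)
```

This is consistent with the slicing function above working where the shuffled split did not.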
Another way to use the sklearn split method with reduced memory usage is to generate an index vector of X and split on this vector. Afterwards you can select your entries and e.g. write training and test splits to the disk.
import h5py
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.random((10000,70000))
Y = np.random.random((10000,))
x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)
# Write (the 'dataset' directory must already exist). The data are floats,
# so no integer dtype is forced; np.int was also removed in NumPy 1.24.
with h5py.File('dataset/train.h5py', 'w') as f:
    f.create_dataset("inputs", data=X[x_train_ids])
    f.create_dataset("labels", data=Y_train)
with h5py.File('dataset/test.h5py', 'w') as f:
    f.create_dataset("inputs", data=X[x_test_ids])
    f.create_dataset("labels", data=Y_test)
# Read
with h5py.File('dataset/train.h5py', 'r') as f:
    X_train = np.array(f.get('inputs'))
    Y_train = np.array(f.get('labels'))
with h5py.File('dataset/test.h5py', 'r') as f:
    X_test = np.array(f.get('inputs'))
    Y_test = np.array(f.get('labels'))
I came across a similar problem.
As mentioned by @user1879926, I think shuffling is a main cause of the memory exhaustion.
As noted in the question about 'shuffle' being reported as an invalid parameter for model_selection.train_test_split, train_test_split in sklearn 0.19 has an option for disabling shuffling.
So I think you can escape the memory error by just adding the shuffle=False option.
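A minimal sketch of what `shuffle=False` does, on a tiny made-up array. It only shows the resulting order; whether it avoids the memory error on a large array is not verified here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# With shuffle=False the rows keep their original order: the leading rows
# go to the training set and the trailing rows to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(y_train)
print(y_test)
```

Because nothing is shuffled, this is only appropriate when the rows are already in random order (or when an ordered split is actually wanted).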
I faced the same problem with my code. I was using a dense array like you and ran out of memory. I converted my training data to sparse (I am doing document classification) and solved my issue.
I suppose a more "memory efficient" way would be to iteratively select instances for training and testing (although, as is typical in computer science, you sacrifice the efficiency inherent in using matrices).
What you could do is iterate over the array and, for each instance, 'flip a coin' (use the random package) to determine whether you use the instance as training or testing and, depending upon which, storing the instance in the appropriate numpy array.
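A sketch of the coin-flip idea on a scaled-down array. It swaps the per-instance loop for a single boolean mask, which draws one random number per row and is a substitution made here for brevity, not the exact procedure described above. Note that the resulting test share is only approximately the requested fraction, and that boolean-mask indexing still copies the selected rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small stand-ins for the arrays in the question.
X = rng.random((1000, 7))
Y = rng.random(1000)

# "Flip a coin" once per row: True sends the row to the test set.
is_test = rng.random(X.shape[0]) < 0.33

X_train, X_test = X[~is_test], X[is_test]
Y_train, Y_test = Y[~is_test], Y[is_test]

print(X_train.shape, X_test.shape)
```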
This iterative method shouldn't be bad for only 10000 instances. What is curious though is that 10000 X 70000 isn't all that large; what type of machine are you running? Makes me wonder whether it is a Python/numpy/scikit issue or a machine issue...
Anyway, hope that helps!

