Get row indexes of H2O dataframe when split into train and test

I am new to H2O. So far for the train-test split I have used the StratifiedKFold() of sklearn.
skf = StratifiedKFold(n_splits=n, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
I need the indexes for some further processing later.
In H2O I can't figure out how to get the indexes while doing cross validation. From what I have gathered via videos and blogs, this is how we do CV in H2O:
gbm_model = H2OGradientBoostingEstimator(model_id = 'gbm_model',nfolds=5)
How do I get the train and test indexes of each fold?
Also, how do I get the indexes while doing a simple split?
data_split = data.split_frame(ratios=[0.8],seed = 1234)
train_df = data_split[0]
test_df = data_split[1]
How do I get the indexes that went into train and test?

You could use stratified_kfold_column(n_folds=3, seed=-1) or stratified_split(test_frac=0.2, seed=-1). Both are H2OFrame methods that create a column holding the split assignment, which you can then use to subset the frame and see which rows went where.
See more about these in the docs.
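To make the idea of a split column concrete, here is a minimal, library-agnostic sketch in plain Python. It does not call H2O, and stratified_fold_column and fold_indexes are hypothetical helpers, not H2O's implementation. It gives every row a fold number so that each fold sees roughly the same class mix, then recovers the train and test row indexes of each fold from that column.

```python
from collections import defaultdict

def stratified_fold_column(labels, n_folds):
    """Return one fold id per row, dealing each class out round-robin."""
    fold_of_row = [None] * len(labels)
    seen_per_class = defaultdict(int)
    for row, label in enumerate(labels):
        fold_of_row[row] = seen_per_class[label] % n_folds
        seen_per_class[label] += 1
    return fold_of_row

def fold_indexes(fold_column, fold):
    """Row indexes used for training and testing when `fold` is held out."""
    train = [i for i, f in enumerate(fold_column) if f != fold]
    test = [i for i, f in enumerate(fold_column) if f == fold]
    return train, test

labels = ["a", "a", "b", "a", "b", "b"]   # made-up class labels
fold_column = stratified_fold_column(labels, n_folds=3)
for fold in range(3):
    train_idx, test_idx = fold_indexes(fold_column, fold)
    print(fold, train_idx, test_idx)
```

The same subsetting idea should carry over to the column returned by the H2O methods above: filter the frame on the column's value and keep track of which rows match.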

Related

Python (sklearn) train_test_split: choosing which data to train and which data to test

I want to use sklearn's train_test_split to manually split data into train and test categories. Specifically, in my .csv file, I want to use all the rows of data until the last row to train, and the last row to test. The reason I'm doing this is because I need to launch a machine learning model but am incredibly short on time. I thought the best way would be to use predictions rather than deploying it using IBM Watson. I don't need it to be live. My code so far looks like this:
df=pd.read_csv('Book5.csv', names=['Amiability', 'Email'])
from sklearn.model_selection import train_test_split
df_x = df['Amiability']
df_y = df['Email']
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Then,
len(df)
Produces
331
I want to train with rows 0-330, and test with row 331. How can I do this?

If you don't absolutely need the test row to be the last row you should be able to do:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=1, random_state=4)
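When test_size is an integer, it specifies the absolute number of rows in the test set. Because train_test_split shuffles by default, that single row is picked at random; passing shuffle=False keeps the original order, so the test set is then the last row.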
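If the test row does have to be the last one, a positional slice avoids any shuffling. A minimal sketch with plain lists standing in for the two columns (the values are made up for illustration; with a pandas object the equivalent slices would be .iloc[:-1] and .iloc[-1:]):

```python
# Made-up stand-ins for df['Amiability'] and df['Email']
df_x = [0.9, 0.4, 0.7, 0.1]
df_y = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]

# Everything except the last row trains; the last row alone is the test set
x_train, x_test = df_x[:-1], df_x[-1:]
y_train, y_test = df_y[:-1], df_y[-1:]

print(len(x_train), x_test, y_test)
```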

I want to compare my prediction value with original train data

I am trying to learn the decision tree regressor and have written the code below.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe which includes X_test, y_test and y_pred.
Is there a method or function for that?

Append the below code at the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we create a new dataframe, final_df, and put all the values you need into it. I would not suggest adding the columns directly to X_test, since you may need it unchanged for further predictions.

key error not in index while cross validation

I have applied an SVM to my dataset. The dataset is multi-label, meaning each observation has more than one label.
During KFold cross-validation it raises a "not in index" error.
It says that indices 601 to 6007 are not in the index (I have data samples 1...6008).
This is my code:
df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {} '.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train['sentences'], y_train[category])
    prediction = SVC_pipeline.predict(X_test['sentences'])
    # score against the true labels, which live in y_test (X_test only holds 'sentences')
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(y_test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(y_test[category], prediction, average='weighted')))
    print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])
Actually, I do not know how to apply KFold cross-validation so that I get the F1 score and accuracy of each label separately.
I have looked at this and this, but they did not help me work out how to apply it to my case.
To make this reproducible, here is a small sample of the data frame.
The last seven columns are my labels, including ADR, WD, ...
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
Update
When I did what Vivek Kumar suggested, it raised this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]
in the classifier part. Do you have any idea how to resolve it?
There are a couple of links about this error on Stack Overflow that say I need to reshape the training data. I also tried that, but with no success (link).
Thanks :)

train_index and test_index are integer positions based on the number of rows, but pandas indexing does not work like that: plain [] indexing on a DataFrame looks values up by label, not by position. Newer versions of pandas are stricter about how you slice or select data.
You need to use .iloc to access the data by position. More information is available here.
This is what you need:
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
...
...
# TfidfVectorizer does not work with a DataFrame,
# because iterating a DataFrame gives the column names, not the actual data.
# So specify the column name explicitly to get the sentences
SVC_pipeline.fit(X_train['sentences'], y_train[category])
prediction = SVC_pipeline.predict(X_test['sentences'])

How can I do K fold cross-validation for splitting the train and test set?

I have a set of documents and a set of labels.
Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I wish to use Kfold cross-validation.
train=[]
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
labels=[]
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())
X_train, X_test, Y_train, Y_test= train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method provided in the scikit-learn documentation:
kf=KFold(len(train), n_folds=3)
for train_index, test_index in kf:
    X_train, X_test = train[train_index],train[test_index]
    y_train, y_test = labels[train_index],labels[test_index]
I receive this error:
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform a 10 fold cross-validation on my documents and labels?

There are two ways to solve this error:
First way:
Cast your data to a numpy array:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
then it should work with your current code.
Second way:
Use list comprehension to index the train & label list with the train_index & test_index list
for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index],[train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index],[labels[j] for j in test_index]
(For this solution also see related question index list with another list)
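For the 10-fold case asked about, the second approach can be sketched without scikit-learn at all. The k_fold_indexes helper below is hypothetical and only mimics contiguous, unshuffled folds; the sample documents and labels are made up.

```python
def k_fold_indexes(n_samples, n_folds):
    """Yield (train_index, test_index) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_index = list(range(start, stop))
        train_index = list(range(0, start)) + list(range(stop, n_samples))
        yield train_index, test_index
        start = stop

train = [["doc", str(i)] for i in range(25)]    # made-up tokenised documents
labels = [[str(i % 2)] for i in range(25)]      # made-up labels

for train_index, test_index in k_fold_indexes(len(train), n_folds=10):
    # Plain lists cannot be indexed by a list, so pick the elements one by one
    X_train = [train[i] for i in train_index]
    X_test = [train[i] for i in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[i] for i in test_index]
    print(len(X_train), len(X_test))
```

Every row appears in exactly one test fold, so the test indexes of all folds together cover the whole dataset.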

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.

[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.

TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two classes for stratified splitting:
StratifiedKFold: useful as a direct k-fold cross-validation operator. It sets up n_folds training/testing sets such that the classes are equally balanced in both.
Here's some code (directly from the documentation above):
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     # fit and predict with X_train/test. Use accuracy metrics to check validation performance
StratifiedShuffleSplit: creates training/testing sets with equally balanced (stratified) classes. With n_iter=1 it produces a single split, which is essentially what you want. You can specify the test size here just as in train_test_split.
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test

You can simply do it with the train_test_split() function available in scikit-learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9

Here's an example for continuous/regression data (until this issue on GitHub is resolved).
# y_min / y_max avoid shadowing the built-in min() and max()
y_min = np.amin(y)
y_max = np.amax(y)
# 5 bins may be too few for larger datasets.
bins = np.linspace(start=y_min, stop=y_max, num=5)
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)
Where start is min and stop is max of your continuous target.
If you don't set right=True then it will more or less make your max value a separate bin and your split will always fail because too few samples will be in that extra bin.
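The effect of right=True can be checked with a small sketch that uses only the standard library. It relies on the fact that, for increasing bins, np.digitize(y, bins, right=True) gives the same indexes as bisect_left, while the default right=False matches bisect_right. The target values are made up.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

y = [0.0, 1.0, 2.5, 4.0, 5.5, 7.0, 8.0]          # made-up continuous target
y_min, y_max = min(y), max(y)
# Five evenly spaced edges from y_min to y_max, like np.linspace(..., num=5)
bins = [y_min + k * (y_max - y_min) / 4 for k in range(5)]

closed_right = [bisect_left(bins, v) for v in y]   # like right=True
closed_left = [bisect_right(bins, v) for v in y]   # like right=False

# Count how many samples fall into each bin under both conventions
print(Counter(closed_right))
print(Counter(closed_left))
```

With the default, the maximum ends up alone in an extra bin past the last edge. With right=True the maximum joins the top bin, but the minimum lands in bin 0 on its own unless that value is repeated, so it may be worth inspecting the bin counts before passing them to stratify.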

In addition to the accepted answer by @Andreas Mueller, I just want to add, as @tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data

StratifiedShuffleSplit is applied after we choose the column that should be evenly represented in all the smaller datasets we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we have a dataset 'data' with a column 'season' and we want an even representation of 'season'. Then it looks like this:
from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]

As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
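What "preserving the proportions" means can be shown with a minimal plain-Python sketch. The stratified_split_indexes helper is hypothetical and is not how scikit-learn implements stratify; it simply splits each class separately in the same ratio. The labels are made up.

```python
import random
from collections import defaultdict

def stratified_split_indexes(y, test_size, seed=0):
    """Return (train_index, test_index) with each class split in the same ratio."""
    rows_by_class = defaultdict(list)
    for row, label in enumerate(y):
        rows_by_class[label].append(row)

    rng = random.Random(seed)
    train_index, test_index = [], []
    for rows in rows_by_class.values():
        rng.shuffle(rows)                       # shuffle within the class only
        n_test = round(len(rows) * test_size)
        test_index.extend(rows[:n_test])
        train_index.extend(rows[n_test:])
    return sorted(train_index), sorted(test_index)

y = ["pos"] * 20 + ["neg"] * 80                 # made-up imbalanced labels
train_index, test_index = stratified_split_indexes(y, test_size=0.25)
print(sum(y[i] == "pos" for i in test_index), len(test_index))
```

Both resulting sets keep the original 20% share of the "pos" class, which a purely random split of an imbalanced dataset does not guarantee.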

# Final proportions: train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15
# Assumes the target column of df is named "y"
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop(columns="y"), df["y"], test_size=vld_size, random_state=13903)
# The second split only sees the remaining rows, so rescale to keep the test set at tst_size of the full data
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size / (1 - vld_size), random_state=13903)
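The fractions in a two-step split are easy to get wrong, because the second call only sees what the first one left over. A quick arithmetic check in plain Python, with a hypothetical row count of 2000:

```python
n_rows = 2000                 # hypothetical dataset size
tst_size = 0.15
vld_size = 0.15

n_valid = round(n_rows * vld_size)          # first split takes 15% of everything
n_remaining = n_rows - n_valid

# Passing tst_size unchanged to the second split takes 15% of the remainder
n_test_naive = round(n_remaining * tst_size)
# Rescaling by the remaining fraction takes 15% of the original data
n_test_rescaled = round(n_remaining * tst_size / (1 - vld_size))

print(n_valid, n_test_naive, n_test_rescaled)
```

Only the rescaled fraction gives a test set the same size as the validation set, leaving 70% of the rows for training.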

Updating @tangy's answer from above to the current version of scikit-learn, 0.23.2 (see the StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit
n_splits = 1 # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
