Preparing CSV file data for Scikit-Learn Using Pandas? - python

I have a csv file without headers which I'm importing into python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I go ahead and split this dataset into a training set and a testing set using pandas (80/20)?
Also, once that is done how would I also split each of those sets so that I can define x (all columns except the last one), and y (the last column)?
I've imported my file using:
dataset = pd.read_csv('example.csv', header=None, sep=',')
Thanks

I'd recommend using sklearn's train_test_split
from sklearn.model_selection import train_test_split
# for older versions import from sklearn.cross_validation
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)

You can try this.
Sperating target class from the rest:
pixel_values = Dataset[df.columns[0:len(Dataset.axes[1])-1]]
target_class = Dataset[df.columns[len(Dataset.axes[1])-1:]]
Now to create test and training samples:
I would just use numpy's randn:
mask = np.random.rand(len(pixel_values )) < 0.8
train = pixel_values [mask]
test = pixel_values [~msk]
Now you have traning and test samples in train and test with 80:20 ratio.

You can simply do:
choices = np.in1d(dataset.index, np.random.choice(dataset.index,int(0.8*len(dataset)),replace=False))
training = dataset[choices]
testing = dataset[np.invert(choices)]
Then, to pass it as x and y to Scikit-Learn:
scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])
Let me know if this doesn't work.

Related

Collate probabilities, predictions, coefficients from multiple samples of data using sklearn

I would like to combine model probabilities for class 1 predictions for ALL rows from multiple (random) splits/samples of data into a single dataframe in python.
I realize that not all rows will be selected in each split, but if data sampling is replicated enough times, each row will have been selected a few times at least and model probabilities generated.
My current approach basically creates multiple test-train splits (5 in example below), and collates probabilities from each training instance into a single dataframe as shown in below code with a mock dataset:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
# start by creating the first column of probs table
probs_table = pd.DataFrame(X.index, columns=["members"])
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
train_x, test_x, train_y, test_y = train_test_split(
X, y, stratify=y, test_size=0.2, random_state=state)
pd.DataFrame(log.predict_proba(test_x)[:, 1]) #fit final model
probs_table[f"iter_{i+1}"] = pd.DataFrame(log.predict_proba(test_x)[:, 1])
probs_table
Unfortunately, I am not getting probabilities for all rows in the dataframe. Can somebody please guide me to the solution to this problem? And it would be ideal to include additional model outputs such as predictions, coefficientts for each iteration/data row.
Any other way to sample the data (i.e., other than test-train splitting) is fine as well as long as probabilities can be assembled for all dataframe rows.
There are a couple problems with the code as is:
.fit() is never called here. I'm assuming you'd like it fit right after the train/test split line and before the predict_proba() call?
When you place the values into the dataframe, you're creating a new column and I assume you want one column for all iterations while keeping track of which iteration it came from in each column?
Here is code that I believe accomplishes what you'd like. It 1) loops over each random state integer, 2) creates a new train/test split, 3) fits a new model each time, and 4) predicts on each test set row.
I also have it keep track of the original index so you can see how many times each original row ends up in the prediction data frame:
EDIT: Include the coefficients as a column
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
dfs = []
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
train_x, test_x, train_y, test_y = train_test_split(
X, y, stratify=y, test_size=0.2, random_state=state)
log.fit(train_x, train_y)
preds = log.predict_proba(test_x)[:, 1]
orig_indices = test_x.index
df = pd.DataFrame(data={
"orig_index": orig_indices,
"prediction": preds,
"iteration": f"iter_{i+1}",
"coefficients": [log.coef_[0]] * len(preds)})
dfs.append(df)
probs_table = pd.concat(dfs)
probs_table

How can we give explicit test data and train data to SVM instead of using train_test_split function?

I'm planning to provide the test and train datasets explicitly to the algorithm and not to use the train_test_split method for random splitting of the data into test and train respectively.
And I want to keep the reviews and labels data in the same file while testing as well as training the model.
Can anyone of you please suggest me regarding the same ...
My code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix
with open("/Users/xyz/Desktop/reviews.txt") as f:
reviews = f.read().split("\n")
with open("/Users/xyz/Desktop/labels.txt") as f:
labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)
lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)
print("Accuracy score of SVM:" , accuracy_score)
Test.txt
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
Train.txt:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
Just do what you want. The solution is pretty easy:
X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]
Do the same for y_train and y_test.
Of course you need to know which rows in your file are for training and which are for testing.
If you want to keep features and labels in the same file - no problem with that. You will need one additional step to separate labels from features. It would be a lot easier with pandas.
EDIT
Having the files you provided you can get what you want like this:
def load_data(filename):
X = list()
y = list()
with open(filename) as file:
file.readline()
for line in file:
line = line.strip().split(',')
y.append(line[1])
X.append(line[0].split())
return X, y
X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')

scikit learn logistic regression model tfidfvectorizer

I am trying to create a logistic regression model using scikit learn with the code below. I am using 9 columns for the features (X) and one for the label (Y). When trying to fit I get an error "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]" even though previously the lengths of X and Y are the same, if I use x.transpose() i get a different error "AttributeError: 'int' object has no attribute 'lower'". I am assuming this has to do with the tfidfvectorizer possibly, I am doing this because 3 of the columns contain single words and wasn't working. Is this the right way to be doing this or should I be converting the words in the columns separately and then using train_test_split? If not why am I getting the errors and how can I fic them. Heres an example of the csv.
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation =
model_selection.train_test_split(x_data, Y, test_size=0.2, random_state=7)
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
What you are trying to do is unusual because TfidfVectorizer is designed to extract numerical features from text. But if you don't really care and just want to make your code works, one way to do it is by converting your numerical data to string and configure TfidfVectorizer to accept tokenized data:
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
# convert all columns to string like we don't care
for col in my_df.columns:
my_df[col] = my_df[col].astype(str)
# replace nan with empty string like we don't care
for col in my_df.columns[my_df.isna().any()].tolist():
my_df.loc[:, col].fillna('', inplace=True)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
x_data.values, Y.values, test_size=0.2, random_state=7)
# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
analyzer='word',
tokenizer=lambda x: x,
preprocessor=lambda x: x,
token_pattern=None)
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
That being said, I'd recommend you to use another method to do feature engineering on your dataset. For example, you can try to encode your nominal data (eg. IP, port) to numerical value.

python pandas split to testing and training data [duplicate]

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
I would just use numpy's randn:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Pandas random sample will also work
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
I would use scikit-learn's own training_test_split, and generate it from the index
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
And if you want to split the whole df
X, y = df[list_of_x_cols], df[y_col]
There are many ways to create a train/test and even validation samples.
Case 1: classic way train_test_split without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
You can use below code to create test and train samples :
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
Test size can vary depending on the percentage of data you want to put in your test and train dataset.
There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
You may also consider stratified division into training and testing set. Startified division also generates training and testing set randomly but in such a way that original class proportions are preserved. This makes training and testing sets better reflect the properties of the original dataset.
import numpy as np
def get_train_test_inds(y,train_proportion=0.7):
'''Generates indices, making random stratified split into training set and testing sets
with proportions train_proportion and (1-train_proportion) of initial sample.
y is any iterable indicating classes of each observation in the sample.
Initial proportions of classes inside training and
testing sets are preserved (stratified sampling).
'''
y=np.array(y)
train_inds = np.zeros(len(y),dtype=bool)
test_inds = np.zeros(len(y),dtype=bool)
values = np.unique(y)
for value in values:
value_inds = np.nonzero(y==value)[0]
np.random.shuffle(value_inds)
n = int(train_proportion*len(value_inds))
train_inds[value_inds[:n]]=True
test_inds[value_inds[n:]]=True
return train_inds,test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
If you need to split your data with respect to the lables column in your data set you can use this:
def split_to_train_test(df, label_column, train_frac=0.8):
train_df, test_df = pd.DataFrame(), pd.DataFrame()
labels = df[label_column].unique()
for lbl in labels:
lbl_df = df[df[label_column] == lbl]
lbl_train_df = lbl_df.sample(frac=train_frac)
lbl_test_df = lbl_df.drop(lbl_train_df.index)
print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
train_df = train_df.append(lbl_train_df)
test_df = test_df.append(lbl_test_df)
return train_df, test_df
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
you can also pass random_state if you want to control the split randomness or use some global random seed.
To split into more than two classes such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validatoin_mask = probs >= 0.85
df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validatoin_mask]
This will put approximately 70% of data in training, 15% in test, and 15% in validation.
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Just select range row from df like this
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
import random as rnd
tot_ix = range(len(data_df))
test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
train_ix = list(set(tot_ix) ^ set(test_ix))
test_df = data_df.ix[test_ix]
train_df = data_df.ix[train_ix]
return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
if you want to split it to train, test and validation set you can use this function:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
temp, test = train_test_split(df, test_size=test_size)
total_items_count = len(df.index)
val_length = total_items_count * val_size
new_val_propotion = val_length / len(temp.index)
train, val = train_test_split(temp, test_size=new_val_propotion)
return train, test, val
If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data
I think you also need to a get a copy not a slice of dataframe if you wanna add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
You can make use of df.as_matrix() function and create Numpy-array and pass it.
Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)
A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.
def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r
you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution
First, assign a unique id to a dataframe (if already not exist)
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split numbers:
train = 120765
test = 4134
dev = 2816
The split function
def df_split(df, n):
first = df.sample(n)
second = df[~df.id.isin(list(first['id']))]
first.reset_index(drop=True, inplace = True)
second.reset_index(drop=True, inplace = True)
return first, second
Now splitting into train, test, dev
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
The sample method selects a part of data, you can shuffle the data first by passing a seed value.
train = df.sample(frac=0.8, random_state=42)
For test set you can drop the rows through indexes of train DF and then reset the index of new DF.
test = df.drop(train_data.index).reset_index(drop=True)
How about this?
df is my dataframe
total_size=len(df)
train_size=math.floor(0.66*total_size) (2/3 part of my dataset)
#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)
I would use K-fold cross validation.
It's been proven to give much better results than the train_test_split Here's an article on how to apply it with sklearn from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.
Method 1: Split rows into train, validate, test dataframes.
train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
Method 2: Split rows when validate must be subset of train (fastai)
train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=validate_ratio) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
That's what I do:
train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
I would just use numpy's randn:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Pandas random sample will also work
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
I would use scikit-learn's own training_test_split, and generate it from the index
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
And if you want to split the whole df
X, y = df[list_of_x_cols], df[y_col]
There are many ways to create a train/test and even validation samples.
Case 1: classic way train_test_split without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
You can use below code to create test and train samples :
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
Test size can vary depending on the percentage of data you want to put in your test and train dataset.
There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
You may also consider stratified division into training and testing set. Startified division also generates training and testing set randomly but in such a way that original class proportions are preserved. This makes training and testing sets better reflect the properties of the original dataset.
import numpy as np
def get_train_test_inds(y,train_proportion=0.7):
'''Generates indices, making random stratified split into training set and testing sets
with proportions train_proportion and (1-train_proportion) of initial sample.
y is any iterable indicating classes of each observation in the sample.
Initial proportions of classes inside training and
testing sets are preserved (stratified sampling).
'''
y=np.array(y)
train_inds = np.zeros(len(y),dtype=bool)
test_inds = np.zeros(len(y),dtype=bool)
values = np.unique(y)
for value in values:
value_inds = np.nonzero(y==value)[0]
np.random.shuffle(value_inds)
n = int(train_proportion*len(value_inds))
train_inds[value_inds[:n]]=True
test_inds[value_inds[n:]]=True
return train_inds,test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
If you need to split your data with respect to the lables column in your data set you can use this:
def split_to_train_test(df, label_column, train_frac=0.8):
train_df, test_df = pd.DataFrame(), pd.DataFrame()
labels = df[label_column].unique()
for lbl in labels:
lbl_df = df[df[label_column] == lbl]
lbl_train_df = lbl_df.sample(frac=train_frac)
lbl_test_df = lbl_df.drop(lbl_train_df.index)
print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
train_df = train_df.append(lbl_train_df)
test_df = test_df.append(lbl_test_df)
return train_df, test_df
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
you can also pass random_state if you want to control the split randomness or use some global random seed.
To split into more than two classes such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validatoin_mask = probs >= 0.85
df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validatoin_mask]
This will put approximately 70% of data in training, 15% in test, and 15% in validation.
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Just select range row from df like this
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
import random as rnd
tot_ix = range(len(data_df))
test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
train_ix = list(set(tot_ix) ^ set(test_ix))
test_df = data_df.ix[test_ix]
train_df = data_df.ix[train_ix]
return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
if you want to split it to train, test and validation set you can use this function:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
temp, test = train_test_split(df, test_size=test_size)
total_items_count = len(df.index)
val_length = total_items_count * val_size
new_val_propotion = val_length / len(temp.index)
train, val = train_test_split(temp, test_size=new_val_propotion)
return train, test, val
If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data
I think you also need to a get a copy not a slice of dataframe if you wanna add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
You can make use of df.as_matrix() function and create Numpy-array and pass it.
Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)
A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.
def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r
you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution
First, assign a unique id to a dataframe (if already not exist)
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split numbers:
train = 120765
test = 4134
dev = 2816
The split function
def df_split(df, n):
first = df.sample(n)
second = df[~df.id.isin(list(first['id']))]
first.reset_index(drop=True, inplace = True)
second.reset_index(drop=True, inplace = True)
return first, second
Now splitting into train, test, dev
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
The sample method selects a part of data, you can shuffle the data first by passing a seed value.
train = df.sample(frac=0.8, random_state=42)
For test set you can drop the rows through indexes of train DF and then reset the index of new DF.
test = df.drop(train_data.index).reset_index(drop=True)
How about this?
df is my dataframe
total_size=len(df)
train_size=math.floor(0.66*total_size) (2/3 part of my dataset)
#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)
I would use K-fold cross validation.
It's been proven to give much better results than the train_test_split Here's an article on how to apply it with sklearn from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.
Method 1: Split rows into train, validate, test dataframes.
train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
Method 2: Split rows when validate must be subset of train (fastai)
train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=validate_ratio) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
That's what I do:
train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

Categories

Resources